For fun, I vibe coded a parser that would take a PDF of a patent and parse out the abstract.

Or more accurately, I trained Claude to generate code that did an OK-enough job. I fit a parser. I don’t code one. It turns out, unsupervised Wiggum-loop coding needs a training task - basically how I’d think about training ML.

Getting clean text out of a PDF is a huge PITA. I didn’t want to labor over the minutiae of bounding boxes, text elements, and weirdly interleaved column text.


So, I thought, just let Claude Code figure it out:

  1. Here’s a bunch of patents
  2. Add them to the tests
  3. Iterate on some code until they all parse (sketched below)
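
In code, steps 2-3 boil down to a test suite inside the project that Claude can bang on. Something like this pytest sketch - parse_abstract and the fixtures/ layout are made-up names, just to show the shape:

from pathlib import Path

import pytest

from patent_parser import parse_abstract  # hypothetical module Claude iterates on

# Assumed layout: each patent PDF in fixtures/ sits next to a .txt file
# containing its expected abstract
FIXTURES = Path(__file__).parent / "fixtures"
PATENT_PDFS = sorted(FIXTURES.glob("*.pdf"))


@pytest.mark.parametrize("pdf_path", PATENT_PDFS, ids=lambda p: p.name)
def test_parses_abstract(pdf_path):
    expected = pdf_path.with_suffix(".txt").read_text().strip()
    assert parse_abstract(str(pdf_path)).strip() == expected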

Seems fine.

But it doesn’t really produce working code.

Claude happily creates parsers full of weird, specific conditions, like this:

import re


def _is_split_word(last_word: str, first_word: str) -> bool:
    """
    Determine if two words appear to be parts of a split word.

    PDFs sometimes break words across lines (e.g., "effi" + "ciency").

    Args:
        last_word: Last word of previous line
        first_word: First word of current line

    Returns:
        True if these appear to be a split word
    """
    # Case 1: Fragment + word (e.g., "effi"+"ciency", "elec"+"tric")
    if (3 <= len(last_word) <= 4 and
        4 <= len(first_word) <= 7 and
        last_word[-1].islower() and
        first_word[0].islower() and
        not re.search(r'[,;.!?]$', last_word) and
        8 <= len(last_word + first_word) <= 11):
        return True

    # ... more equally specific cases follow ...
    return False

If one line ends with a short word and the next starts with a longer word, then… concat them?

I’m sure this code fixes some of the tests. But it would fail out in the wild.

The code’s overfit.

It’s easy to imagine fugly code that parses 1000 tested patents correctly. But fails immediately on the 1001st.

Still, I don’t quite know how to define what’s acceptable / what’s not. Overfitting in the parser code seems like “I know it when I see it”. As a human, I don’t scale to inspect all the changes. I need a more systematic approach.

Measuring overfitting - adding a validation set

I need two types of examples:

  • Training (already exists) - the tests the agent introspects and debugs; its own unit tests, for example
  • Validation (NEW!) - held-out test cases. A guardrail. The agent only sees accuracy + avg edit distance, never the original patent / expected abstract

As in:

Your job is to parse abstracts out of patents…

Also, there is this script to run, and if its accuracy goes down, you’re overfit, don’t accept these changes

How do I prevent Claude Code from seeing the validation examples? If they live in the same project, Claude Code will just cheat: find the validation tests, fix the code, and declare victory. Now the code is just overfit to the tests + validation.

Alternating between fitting and generalizing

My solution was to create a separate Python project, sandboxed away from Claude:

  1. Takes as input a patent parser to test (a function that accepts a PDF path and returns an abstract)
  2. Runs the parser on the test patents
  3. Returns accuracy + edit distance on the test patents

I.e.:

from typing import Callable


def evaluate(parser_fn: Callable[[str], str]) -> tuple[float, float]:
    """
    Evaluate a patent parser function against ground truth test cases.

    Args:
        parser_fn: A callable that takes a PDF path (str) and returns the
                   extracted abstract (str). The parser_fn will receive
                   absolute paths to test PDF files.

    Returns:
        A (accuracy, avg edit distance) tuple over the test patents.
    """
    ...
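
Filling in the body, a minimal version could look something like this - assuming a holdout/ directory of PDFs plus a JSON map of expected abstracts, and the Levenshtein package for edit distance (all my choices for the sketch, nothing Claude ever sees):

import json
from pathlib import Path
from typing import Callable

import Levenshtein  # assumed dependency for edit distance

# Assumed layout: held-out PDFs + expected abstracts shipped inside this package,
# where Claude is never allowed to look
HOLDOUT_DIR = Path(__file__).parent / "holdout"


def evaluate(parser_fn: Callable[[str], str]) -> tuple[float, float]:
    """Return (accuracy, avg edit distance) over the held-out patents."""
    expected = json.loads((HOLDOUT_DIR / "expected_abstracts.json").read_text())
    hits, distances = 0, []
    for pdf_name, expected_abstract in expected.items():
        actual = parser_fn(str(HOLDOUT_DIR / pdf_name))
        distances.append(Levenshtein.distance(actual, expected_abstract))
        if actual.strip() == expected_abstract.strip():
            hits += 1
    return hits / len(expected), sum(distances) / len(distances)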

Claude just installs this as a Python dependency. Claude Code’s settings forbid it from looking inside this dependency, and I watch it like a hawk to make sure nothing is leaking.

Claude doesn’t see any intermediate nuts and bolts. Only the final statistics.

Now, to fit a better parser I tick-tock between two workflows:

  1. “Training” - the workflow above. “Here’s some new patents to add to your tests, iterate on your code until they work”. Don’t accept changes that reduce holdout accuracy.
  2. “Eval’ing / Simplifying” - I ask Claude to try to simplify the code while keeping holdout accuracy constant or improving it

And this seems to work, even if it does eat up my Claude budget.

Generated Parsing, Classifying, Ranking

There’s an angle here, beyond my silly patent parsing, to any kind of classification task:

  • Here is a spreadsheet of queries I want you to classify (i.e. “shoes” should be “category:footwear”)
  • Generate code that classifies them
  • Iterate until they all work

I know how this story ends. With:

if query == "shoe":
   return "footwear"
if query == "headband":
   return "headwear"
...
# times 1000

Instead, with some validation data, Claude must work to generalize. It could grab some embeddings from Hugging Face, build an in-memory text search. Maybe it could build its own PyTorch model.
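
Here’s a sketch of the kind of thing I mean, using sentence-transformers for the embeddings (labeled_examples is made up, standing in for the spreadsheet):

import numpy as np
from sentence_transformers import SentenceTransformer

# Made-up labeled data standing in for the spreadsheet of queries
labeled_examples = [
    ("shoes", "footwear"),
    ("running sneakers", "footwear"),
    ("headband", "headwear"),
    ("baseball cap", "headwear"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small off-the-shelf embedding model
example_vecs = model.encode([q for q, _ in labeled_examples], normalize_embeddings=True)


def classify(query: str) -> str:
    """Label a query with the category of its most similar labeled example."""
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    return labeled_examples[int(np.argmax(example_vecs @ query_vec))][1]

That kind of code has a shot at classifying the 1001st query correctly, not just the 1000 in the spreadsheet.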

I’m already doing this with search algorithms. It seems extendible to any kind of task we’d usually build a model for.

If we’ve got the task well defined, why not just let Claude build the model?


Enjoy softwaredoug in training course form!

Starting Feb 2!

I hope you join me at Cheat at Search with LLMs to learn how to apply LLMs to search applications. Check out this post for a sneak preview.

Doug Turnbull
