I spend most of my time using baby LLMs. I use them to take apart search queries like azure suede couch. Does this query indicate a blue or yellow item? Does the user's query indicate leather, suede, microfiber, or some other material? And of course, spelling corrections: I need to take a query like "bleu couch" and correct it to "blue couch".
Doing this at some scale, for millions of queries / products, I prefer to use small/dumb models, not big fancy ones. These dumb models get confused easily, so I avoid prompt complexity like the plague.
Let's take three strategies (zero-shot, few-shot, rules) for a test drive on a query spelling correction dataset. In this article I'll walk through one problem, and you'll see just how quickly you can lose with a small model (gpt-4.1-nano). All examples are taken from my Cheat at Search with LLMs course.
Zero-shot vs few-shot vs rules
The usual dichotomy is to think of prompts in terms of few shot vs zero shot:
Zero shot: give no examples, just ask:
Spellcheck this user's query:
bleu couch
This sometimes works. The nice thing is you don't use up your limited prompt budget. You save money. It's a bit faster.
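For concreteness, here's roughly what that zero-shot call might look like with the OpenAI Python SDK (a minimal sketch, not my actual harness; the zero_shot_spellcheck wrapper is just for illustration):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def zero_shot_spellcheck(query: str) -> str:
    # Bare-bones prompt: no examples, no rules
    prompt = f"Spellcheck this user's query:\n\n{query}"
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{"role": "user", "content": prompt}],
    )
    # Treat the model's reply as the corrected query
    return response.choices[0].message.content.strip()

print(zero_shot_spellcheck("bleu couch"))  # hopefully "blue couch"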
But your needs are likely a bit more complex. Some things you don't want corrected (ie branded terms / product lines like anthropologie or ritchie). You might want to leave alone queries that are defensible, alternate spellings (only correcting "obvious" mistakes). Like please don't correct bedside → bed side. But do correct 7 draw dresser to 7 drawer dresser.
So we add a few examples to the prompt. We “few shot” it:
Few shot: give some examples, to help guide the process:
Spellcheck the user's query
How to correct / not correct
itchington vase -> itchington vase
pruple couch -> purple couch
bedside table -> bedside table
User's query:
bleu couch
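Wrapped up as a prompt builder (mirroring the prompt functions later in this post; get_few_shot_prompt is just my label here), that might look like:

def get_few_shot_prompt(query: str) -> str:
    prompt = f"""
Spellcheck the user's query
How to correct / not correct
itchington vase -> itchington vase
pruple couch -> purple couch
bedside table -> bedside table
User's query:
{query}
"""
    return prompt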
These examples might not actually help the LLM generalize. Instead, they may be overfit to whatever problems we've come across in our test set. Though of course, tools like DSPy help optimize which examples to give the prompt.
So we could just give the actual reasoning / rules behind the behavior we want.
Rules prompt
Spellcheck the user's query
Some rules of how to spell check (and importantly what not to correct)
* Dont compound words. Just leave the original form alone
* Dont decompound words Just leave the original form alone
* Dont add hyphens
* DO NOT correct stylized product names, product lines, or brand names furniture / home improvement brands
User's query:
bleu couch from westling
Often we have a hybrid of rules / few-shot to consider:
- Rules with examples:
  * Dont compound words. Just leave the original form alone, ie `bedside -> bedside`
- Examples, but rules to explain why:
  * itchington vase -> itchington vase (don't change branded terms)
How well does each of these perform? I've often advocated, from my gut, for rules with examples (or examples w/ explanations). But maybe I'm wrong?
So as an experiment, I ran different variants of this on the search queries in the Wayfair Annotated Dataset for Search (480 queries). I labeled misspellings (and what should not be corrected) with GPT-5, double checking against Google/Wayfair search. Then, passing each query through the LLM, I checked the resulting accuracy. In the end, my ground truth looks like query → correction, like below:
{
    "gracie oaks 62 oller 14 ceiling fan": "gracie oaks 62 roller 14 ceiling fan",
    "bed side table": "bed side table",  # no correction expected
    ...
}
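The eval itself is just a loop over that mapping (a sketch, assuming a spellcheck(query, prompt_fn) helper that wraps the LLM call; the real harness has caching, retries, etc. that I'm leaving out):

def evaluate(ground_truth: dict[str, str], prompt_fn) -> float:
    """Fraction of queries where the LLM's output matches the labeled correction."""
    correct = 0
    for query, expected in ground_truth.items():
        predicted = spellcheck(query, prompt_fn)  # hypothetical LLM wrapper
        if predicted.strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(ground_truth)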
Running each strategy gives fascinating results
trial | zero_shot | only_rules | rules_with_examples | few_shot | few_shot_expl |
---|---|---|---|---|---|
0 | 0.845833 | 0.964583 | 0.943750 | 0.943750 | 0.941667 |
1 | 0.847917 | 0.964583 | 0.941667 | 0.935417 | 0.943750 |
2 | 0.854167 | 0.964583 | 0.945833 | 0.931250 | 0.943750 |
3 | 0.850000 | 0.964583 | 0.939583 | 0.935417 | 0.941667 |
4 | 0.843750 | 0.966667 | 0.945833 | 0.933333 | 0.945833 |
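(If you want to eyeball the averages, the table above drops straight into pandas; this snippet just re-enters those same numbers and sorts the column means.)

import pandas as pd

results = pd.DataFrame({
    "zero_shot":           [0.845833, 0.847917, 0.854167, 0.850000, 0.843750],
    "only_rules":          [0.964583, 0.964583, 0.964583, 0.964583, 0.966667],
    "rules_with_examples": [0.943750, 0.941667, 0.945833, 0.939583, 0.945833],
    "few_shot":            [0.943750, 0.935417, 0.931250, 0.935417, 0.933333],
    "few_shot_expl":       [0.941667, 0.943750, 0.943750, 0.941667, 0.945833],
})
print(results.mean().sort_values(ascending=False))  # only_rules comes out clearly on top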
I wouldn't call this conclusive, on my one 480 query dataset. But it does tell you that you should set up similar evals. Because my default assumption is wrong. My favorites - "rules with examples" / "few shot explained" - do not perform best. Using just rules wins.
Funnily enough, I started this article meaning to yell at people that they should use "rules with examples," but in the end my main conclusion is: don't trust people on the Internet writing blogs.
Examples cause loss of generality?
Looking through the noise, we can perhaps see a bit of what's happening. It's most illuminating to compare only_rules vs rules_with_examples, as these prompts only differ by appending examples onto the rules.
Specifically, I'm comparing this prompt (only_rules):
def get_rules_prompt(query: str) -> str:
    prompt = f"""
Spellcheck the user's query
Some rules of how to spell check (and importantly what not to correct)
* Dont compound words. Just leave the original form alone
* Dont decompound words Just leave the original form alone
* Dont add hyphens
* DO NOT correct stylized product names, product lines, or brand names furniture / home improvement brands
User's query:
{query}
"""
    return prompt
with the same prompt, but with a few examples added (rules_with_examples):
def get_rules_with_examples_prompt(query: str) -> str:
    prompt = f"""
Spellcheck the user's furniture search query
Some rules of how to spell check (and importantly what not to correct)
* Dont compound words. Just leave the original form alone. IE (doghouse -> dog house)
* Dont decompound words Just leave the original form alone. IE (bunkbed -> bunk bed)
* Dont add hyphens (ie don't turn "anti scratch" into "anti-scratch")
* DO NOT correct stylized product names, product lines, or brand names furniture / home improvement brands
(ie branded terms like itchington, kohen, etc should be left alone
User's query:
{query}
"""
    return prompt
When we look at failures unique to the only_rules case, we see a smattering of problems that violate our guidelines (bed side, bedside, etc), plus some failed spelling corrections (trinaic is not a product line, but trinsic is; see Delta Trinsic):
'bed side table -> bedside table',
'benjiamino faux leather power lift chair -> benjamin faux leather power lift chair',
'big basket for dirty cloths -> big basket for dirty cloths (exp: big basket for dirty clothes)',
'counter top one cup hot water dispenser -> countertop one cup hot water dispenser',
'gracie oaks 62 oller 14 ceiling fan -> gracie oaks 62 oller 14 ceiling fan (exp: gracie oaks 62 roller 14 ceiling fan)',
'small loving roomtables -> small living room tables',
'trinaic towel rod -> trinaic towel rod (exp: trinsic towel rod)'
If we break down the problems:
Type | Count |
---|---|
Missed Corrections | 3 |
Compound / decompound failures (ie don’t do bedside → bed side) | 3 |
Branded terms failures (ie benjiamino) | 1 |
OK, now let's look at unique failures in rules_with_examples:
'7qt slow cooker -> 7 qt slow cooker',
'alyse 8 light -> Alise 8 light',
'barstool patio sets -> bar stool patio sets',
'bathroom wastebasket -> bathroom waste basket',
'benjiamino faux leather power lift chair -> benjaminino faux leather power lift chair',
'brunk ship wheel -> bunk ship wheel',
'e12/candelabra -> e12 / candelabra',
'fortunat coffee table -> fortunate coffee table',
'gracie oaks 62 oller 14 ceiling fan -> Gracie Oaks 62 Oller 14 ceiling fan',
'hitchcock mid-century wall shelf -> hitchcock mid century wall shelf',
'kohler whitehaven farmhouse kitchen sink -> kohler white haven farmhouse kitchen sink',
'maryford queen tufted bed -> mary ford queen tufted bed',
'midcentury tv unit -> mid century tv unit',
'small loving roomtables -> small loving room tables',
'stoneford end tables white and wood -> stone ford end tables white and wood',
'trinaic towel rod -> Trinaic towel rod',
'white splashproof shiplap wallpaper -> white splash proof shiplap wallpaper'
Type | Count |
---|---|
Missed Corrections | 1 |
Compound / decompound failures | 9 |
Hyphen / dehyphenate | 1 |
Branded terms failures | 5 |
Other (e12/candalabra) | 1 |
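(For clarity, "unique failures" here just means queries one strategy got wrong while the other got them right; a minimal sketch of how you might compute that, assuming a failures dict mapping each strategy to its set of missed queries:)

def unique_failures(failures_a: set[str], failures_b: set[str]) -> set[str]:
    """Queries strategy A got wrong that strategy B handled correctly."""
    return failures_a - failures_b

# e.g. what rules_with_examples breaks that only_rules does not:
# unique_failures(failures["rules_with_examples"], failures["only_rules"])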
It's not definitive, but it would seem that cramming more examples after the rules might, counter-intuitively, cause the rules to be forgotten! We see the error rates for the cases we care about go up with our tiny model. GPT 4.1 nano truly is like working with a toddler, or perhaps more accurately a highly educated person with a limited attention span and selective memory.
I encourage you to check out all my data here and give me feedback.
What do you think? What's your experience productively prompting tiny models? Do you have different experiences? Maybe if I DO use DSPy, I'll optimize the examples and get a bit better performance? Or do clear, well-written instructions always work?
Enjoy softwaredoug in training course form!
