We’re often excited by zero-shot performance. Foundation models like LLMs can perform classification or entity extraction with a bit of prompt fiddling, often with surprisingly good results.
Despite the name of my course, we can’t actually prompt our way to a free lunch. In reality, relying on zero- or few-shot techniques comes with tradeoffs compared to boring old classic ML. Let’s discuss!
The good
Zero-shot works because the LLM has world knowledge. You can prime an LLM a bit about a domain or task with some context, and off it goes. We like this because there’s no fitting done to the task. We feel good about sidestepping traditional evaluation questions (is it overfit? is it general enough? does it match our expectations?).
It’s like asking the “man on the street” to do a task. They’ll probably perform well enough at a decent number of tasks. Like the “man on the street”, the LLM probably makes defensible assumptions about a topic. It doesn’t stretch beyond general knowledge. It may not understand the nuance of a task, but it might get you 80% of the way there.
I’m not sure “generality” is the right term here; it’s more that the LLM makes the sort of obvious decisions about text you’d expect given general knowledge of the world.
For a team getting started, and doing a lot of “LGTM” testing, this can be a great start.
The bad
Even with “LGTM” testing, you can quickly be annoyed that the LLM doesn’t ‘get’ your domain. For example, while preparing a furniture dataset for my course, even GPT-4 didn’t get that a bistro table is actually something you put outside, and that it has nothing to do with owning an actual “bistro”.
You can few-shot your way out of this a bit, but then you have to balance the loss in generality. And the more complex your prompt, the more brittle it becomes. You begin to play whack-a-mole to patch over each case. You begin to realize you need quite a few test cases to ensure your LLM works properly.
Eventually the test cases evolve into a fully robust eval set. And with robust evals, you essentially HAVE training data. You may begin to ask: why not train on the evals directly? Why not go through a statistically robust process of training and testing, with cross-validation and measurement, like a normal ML model? Whether you’re fine-tuning an LLM or just using some embeddings with a deep learning model, the important components of a modeling process are the same (a rough sketch follows this list):
- Having a statistically robust model evaluation process
- Selecting a model architecture with the expressiveness to capture the actual task without too many degrees of freedom
- Depending on the model, including robust features that help make the decision
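To make the first point concrete, here’s a minimal sketch of treating those evals as labeled data and measuring a small model on them with cross-validation. It assumes sentence-transformers and scikit-learn; the tiny dataset, labels, and model choice are purely illustrative.

```python
# Minimal sketch: once your evals look like labeled data, evaluate a small
# model on them the boring way, with cross-validation.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Tiny illustrative eval set: text plus the label you'd otherwise eyeball
texts = [
    "bistro table with umbrella hole", "teak patio dining set",
    "outdoor wicker loveseat", "folding camp chair",
    "velvet tufted sofa", "walnut bedside table",
    "memory foam mattress", "bookshelf with glass doors",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = outdoor furniture

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works
X = encoder.encode(texts)

clf = LogisticRegression(max_iter=1000)
# Cross-validation gives a statistically grounded estimate instead of
# "LGTM" spot checks on prompt output.
scores = cross_val_score(clf, X, labels, cv=4, scoring="accuracy")
print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```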
So while teams get started prompting their way to AI heaven, in my experience they soon run into roadblocks. And this is why there’s a whole course on AI evaluation.
Complex models with billions of parameters can change out from under you. It can feel like solving a problem with a million monkeys at typewriters. A small change to the prompt and one of the monkeys starts screaming, confused, and, well, you get fairly new and confusing typewriter output. Or the model changes, a team of bonobos comes in, and they act quite differently from the previous crew.
Better: LLMs for synthesizing features for ML?
What if, instead of creating a large prompt for a complex task, we took a different tack: ask an LLM to synthesize simpler features for a downstream ML model, then evaluate that model using traditional ML techniques?
Is this any different from one uber-prompt that solves the whole problem?
Well, one difference might be that we can create simpler prompts with less noise in the output. For example, a short prompt that asks a simple yes-or-no question:
“Is this passage about France? Yes or No” is a great question to ask. It’s easy to evaluate for correctness. A yes/no question creates little opportunity for noise. It can also be asked of a small model.
Or “Of these 10 topics, what is this passage about?” might be another question we could ask.
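As a rough sketch of what such a question looks like as a feature extractor, here’s a yes/no prompt turned into a binary feature. It assumes the OpenAI Python client; the model name and prompt are illustrative, and any chat-capable model would do.

```python
# Rough sketch: a simple yes/no prompt turned into a binary feature.
from openai import OpenAI

client = OpenAI()

def is_about_france(passage: str) -> int:
    """Return 1 if the model answers Yes, else 0."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; a small model is fine for yes/no
        messages=[{
            "role": "user",
            "content": f"Is this passage about France? Answer Yes or No.\n\n{passage}",
        }],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    return 1 if answer.startswith("yes") else 0
```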
LLM features need not be text. We can also do the following (see the sketch after this list):
- Take the LLM’s internal embedding at this point, and use that as an input to a model
- Extract the probabilities of each topic in the list of 10 we chose, and use those as features
- Embed the thing itself: the embedding of the entity (i.e. search query, passage, image, etc.) can of course be a great feature
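Here’s a sketch of those non-text features, again assuming the OpenAI client: the passage’s own embedding concatenated with the model’s probability of answering “Yes” to our simple question, read off the token logprobs rather than the text answer. Model names are illustrative.

```python
# Sketch: non-text features from an LLM, combined into one feature vector.
import numpy as np
from openai import OpenAI

client = OpenAI()

def passage_features(passage: str) -> np.ndarray:
    # 1. The embedding of the passage itself
    emb = client.embeddings.create(
        model="text-embedding-3-small",  # illustrative choice
        input=passage,
    ).data[0].embedding

    # 2. The probability the model answers "Yes" to our simple question,
    #    taken from the first generated token's logprobs
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Is this passage about France? Answer Yes or No.\n\n{passage}"}],
        temperature=0,
        logprobs=True,
        top_logprobs=5,
        max_tokens=1,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    p_yes = sum(np.exp(t.logprob) for t in top if t.token.strip().lower() == "yes")

    # Concatenate into one feature vector for a downstream model
    return np.array(emb + [p_yes])
```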
We can cache these features to the gills, mix in information from outside the LLM, try different LLMs, and essentially treat the LLM as a Swiss Army knife for our NLP task.
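Since these features depend only on the input text, caching them is trivial. Here’s a sketch using joblib (one easy option; a database or flat files work just as well), wrapping the hypothetical extractors from the earlier sketches:

```python
# Sketch of caching LLM-derived features so each passage only hits the
# LLM once, no matter how many times we train or evaluate.
from joblib import Memory

memory = Memory("feature_cache", verbose=0)

# Wrap the (hypothetical) feature extractors defined above
cached_features = memory.cache(passage_features)
cached_is_about_france = memory.cache(is_about_france)

# Repeated calls with the same passage read from disk, not the LLM
vec = cached_features("The Eiffel Tower towers over Paris.")
```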
I have used this to great effect in LLM-as-a-judge work, and really like it. But can it be applied to other tasks? Will NLP move on from prompting as the only tool, and begin to remember its roots?
Enjoy softwaredoug in training course form!
