With embedding similarity you train with an anchor, a positive, and a negative. You want to move the positive’s embedding closer to the anchor’s, while pushing the negative’s farther away.
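
To make that concrete, here is a minimal sketch of one common way to score a triple: a margin loss over cosine similarity. The names `triplet_loss` and `margin`, and the toy 2-d vectors, are my own assumptions for illustration, not something any particular library requires.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive beats the negative by at least `margin`."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

# Toy vectors, made up for illustration
anchor = [1.0, 0.0]
positive = [0.9, 0.1]  # points almost the same way as the anchor
negative = [0.0, 1.0]  # orthogonal to the anchor

good = triplet_loss(anchor, positive, negative)  # already well separated
bad = triplet_loss(anchor, negative, positive)   # roles swapped on purpose
print(good, bad)
```

When the loss is zero there is nothing left to learn from that triple. When it is positive, training nudges the vectors until the positive wins by the margin.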

Enter good ole word2vec

  • Every word in the vocabulary starts with its own random embedding
  • When a word co-occurs with another word, it’s a positive (training moves them together)
  • A random word, sampled out of context, is a negative (training pushes them apart)
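
The three bullets above can be sketched as a tiny training loop. This is a rough sketch in the spirit of word2vec's negative sampling, not the reference implementation: the helper names (`random_table`, `train_pair`), the 8 dimensions, the learning rate, and the three-word vocabulary are all assumptions. I also assume two tables, one for a word as the anchor and one for a word as the positive or negative.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def random_table(vocab, dim, rng):
    """Give every word its own small random starting vector."""
    return {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def train_pair(emb, ctx, anchor, other, label, lr=0.1):
    """One SGD step: label=1 pulls the pair together, label=0 pushes it apart."""
    u, v = emb[anchor], ctx[other]
    step = lr * (label - sigmoid(dot(u, v)))
    # Compute both updates from the old values before writing either back.
    new_u = [a + step * b for a, b in zip(u, v)]
    new_v = [b + step * a for a, b in zip(u, v)]
    emb[anchor], ctx[other] = new_u, new_v

rng = random.Random(0)
vocab = ["mary", "lamb", "banana"]  # hypothetical toy vocabulary
emb = random_table(vocab, 8, rng)   # vectors used when a word is the anchor
ctx = random_table(vocab, 8, rng)   # vectors used when a word is the positive/negative

for _ in range(500):
    train_pair(emb, ctx, "mary", "lamb", label=1)    # co-occurred: positive
    train_pair(emb, ctx, "mary", "banana", label=0)  # sampled at random: negative

lamb_score = sigmoid(dot(emb["mary"], ctx["lamb"]))
banana_score = sigmoid(dot(emb["mary"], ctx["banana"]))
print(lamb_score, banana_score)
```

After enough steps, mary should score lamb well above banana, purely because one pair was labeled as co-occurring and the other was not.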

From just the context, “mary had a little lamb”, we might have:

ANCHOR   POSITIVE   NEGATIVE
mary     little     toenail
mary     lamb       banana
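
Here is a rough sketch of how triples like these could be generated from a passage. The function name `training_triples`, the window of 4, and the extra vocabulary words are assumptions for illustration. Real word2vec samples negatives by (smoothed) word frequency, while this sketch picks uniformly to keep things short.

```python
import random

def training_triples(tokens, vocab, window=4, rng=random):
    """Build (anchor, positive, negative) triples from one tokenized passage."""
    triples = []
    for i, anchor in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        # Negatives come from the wider vocabulary, skipping words in this window.
        candidates = [w for w in vocab if w != anchor and w not in context]
        for positive in context:
            triples.append((anchor, positive, rng.choice(candidates)))
    return triples

tokens = "mary had a little lamb".split()
extras = ["toenail", "banana", "church", "poppins"]  # hypothetical out-of-context words
vocab = tokens + extras

triples = training_triples(tokens, vocab, window=4, rng=random.Random(0))
mary_triples = [t for t in triples if t[0] == "mary"]
for anchor, positive, negative in mary_triples:
    print(anchor, positive, negative)
```

Every word inside the window becomes a positive for mary, and each one gets paired with a random word from outside the passage as its negative.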

Over many passages, you can imagine each of these words becoming more similar to mary:

  • mary + lamb
  • mary + church
  • bloody + mary
  • mary + poppins

Importantly, these embeddings only know that two words shared context: they appeared within a few words of each other. They do not act as language models:

  • Language models use the entire document as context. Here context is binary, in or out: a word either co-occurs (it falls within a few tokens) or it doesn’t count
  • Language models use a transformer architecture that weighs long-range relationships between this token and other, distant tokens

Is the article’s topic Disney? A language model knows the next token after mary is more likely to be poppins. But word2vec just as easily chooses nursery rhyme, church, and other “mary” themes.

-Doug

PS - 7 days left to sign up for Cheat at Search with Agents!

This is part of Doug’s Daily Search tips - subscribe here



Doug Turnbull
