A visual language model learns embeddings for parts of an image.
In a normal language model, we look up the embedding for each vocab entry in a table. The rest of the input then flows through transformer layers. During training, the model adjusts these weights to minimize some loss function. So now we have word + positional embeddings.
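As a minimal sketch of that lookup (sizes here are illustrative, not from any real model):

```python
import numpy as np

vocab_size, d_model, max_len = 1000, 64, 32
rng = np.random.default_rng(0)

token_table = rng.normal(size=(vocab_size, d_model))  # learned word embeddings
pos_table = rng.normal(size=(max_len, d_model))       # learned positional embeddings

token_ids = np.array([5, 42, 7])                      # a toy tokenized input
# Word embedding lookup + positional embedding for each position
x = token_table[token_ids] + pos_table[:len(token_ids)]
print(x.shape)  # (3, 64) -- one embedding per input token
```

During training, gradients flow back into both tables, so the lookups themselves are learned.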
Visual language models work nearly the same way. BUT we don’t look up the embeddings, we build an encoder that maps pixels → embeddings. Then we learn that encoder.
Imagine chopping up an image into a grid of patches, each 16 pixels wide and 16 pixels tall. Each pixel has an RGB value:
16 x 16 x 3
We can unroll that into 768 floats. Call this input x. Then in turn we learn an encoder for that chunk’s embedding:
embedding = Wx + b
That’s one extra layer in the model. Instead of token lookup of normal language models, we now have an encoder.
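A quick sketch of that patch encoder, with a hypothetical embedding size of 64 and random weights standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

patch = rng.random((16, 16, 3))  # one 16 x 16 RGB image patch
x = patch.reshape(-1)            # unroll to a flat vector of 16*16*3 = 768 floats

W = rng.normal(size=(d_model, 768))  # learned projection (random here for the sketch)
b = np.zeros(d_model)
embedding = W @ x + b                # embedding = Wx + b
print(embedding.shape)  # (64,)
```

Run this over every patch in the grid and you get a sequence of patch embeddings, which plays the same role the token embeddings play in a text model.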
How does this apply to search?
Recall that ColBERT uses language models, over normal text, to optimize for query ←→ doc relevance. Each query token embedding learns to find its soulmate among the doc’s token embeddings. The maximum similarity (MaxSim) of a query token over all doc tokens becomes that query token’s score, and the query-token scores are summed into the doc’s relevance score.
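That scoring step can be sketched in a few lines. The embeddings here are random placeholders (real ColBERT embeddings come from the trained model), but the MaxSim math is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(m):
    # L2-normalize rows so dot products are cosine similarities
    return m / np.linalg.norm(m, axis=1, keepdims=True)

Q = normalize(rng.normal(size=(4, 64)))   # 4 query token embeddings (toy)
D = normalize(rng.normal(size=(20, 64)))  # 20 doc token embeddings (toy)

sims = Q @ D.T                  # similarity of every query token to every doc token
score = sims.max(axis=1).sum()  # MaxSim: best doc match per query token, summed
print(sims.shape)  # (4, 20)
```

Each query token picks its single best-matching doc token; nothing forces different query tokens to pick different doc tokens.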
Well we can do the exact same thing with a visual language model, and we call that ColPali!
-Doug
This is part of Doug’s Daily Search tips - subscribe here
Enjoy softwaredoug in training course form!
Starting June 22!
I hope you join me at Cheat at Search with LLMs to learn how to apply LLMs to search applications. Check out this post for a sneak preview.