A visual language model learns embeddings for parts of an image.
In a normal language model, we look up the embedding for each vocab entry in a table. The rest of the input then flows through transformer layers. During training, the model adjusts these weights to minimize some loss function. So now we have word + positional embeddings.
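As a minimal sketch of that lookup (sizes here are illustrative, not from any real model):

```python
import numpy as np

vocab_size, d_model, max_len = 1000, 64, 32
rng = np.random.default_rng(0)

token_table = rng.normal(size=(vocab_size, d_model))  # learned word embeddings
pos_table = rng.normal(size=(max_len, d_model))       # learned positional embeddings

token_ids = np.array([5, 42, 7])                      # a toy tokenized input
# Word embedding lookup + positional embedding for each position
x = token_table[token_ids] + pos_table[:len(token_ids)]
print(x.shape)  # (3, 64) -- one embedding per input token
```

During training, gradients flow back into both tables, so the lookups themselves are learned.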
Visual language models work nearly the same way. BUT we don’t look up the embeddings, we build an encoder that maps pixels → embeddings. Then we learn that encoder.
Imagine chopping up an image into a grid of patches, each 16 pixels wide and 16 pixels tall. Each pixel has an RGB value:
16 x 16 x 3
We can unroll that into 768 floats. Call this input x. Then in turn we learn an encoder for that chunk’s embedding:
embedding = Wx + b
That’s one extra layer in the model. Instead of token lookup of normal language models, we now have an encoder.
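A quick sketch of that patch encoder, with a hypothetical embedding size of 64 and random weights standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

patch = rng.random((16, 16, 3))  # one 16 x 16 RGB image patch
x = patch.reshape(-1)            # unroll to a flat vector of 16*16*3 = 768 floats

W = rng.normal(size=(d_model, 768))  # learned projection (random here for the sketch)
b = np.zeros(d_model)
embedding = W @ x + b                # embedding = Wx + b
print(embedding.shape)  # (64,)
```

Run this over every patch in the grid and you get a sequence of patch embeddings, which plays the same role the token embeddings play in a text model.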
How does this apply to search?
Recall that ColBERT uses language models, over normal text, to optimize for query ←→ doc relevance. Each query token embedding learns to find its soulmate among the doc’s token embeddings. The maximum similarity (MaxSim) of a query token over all doc tokens becomes that query token’s score, and the query-token scores are summed into the doc’s relevance score.
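That scoring step can be sketched in a few lines. The embeddings here are random placeholders (real ColBERT embeddings come from the trained model), but the MaxSim math is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(m):
    # L2-normalize rows so dot products are cosine similarities
    return m / np.linalg.norm(m, axis=1, keepdims=True)

Q = normalize(rng.normal(size=(4, 64)))   # 4 query token embeddings (toy)
D = normalize(rng.normal(size=(20, 64)))  # 20 doc token embeddings (toy)

sims = Q @ D.T                  # similarity of every query token to every doc token
score = sims.max(axis=1).sum()  # MaxSim: best doc match per query token, summed
print(sims.shape)  # (4, 20)
```

Each query token picks its single best-matching doc token; nothing forces different query tokens to pick different doc tokens.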
Well we can do the exact same thing with a visual language model, and we call that ColPali!
-Doug
This is part of Doug’s Daily Search tips - subscribe here
Enjoy softwaredoug in training course form!
Starting June 22!
I hope you join me at Cheat at Search with LLMs to learn how to apply LLMs to search applications. Check out this post for a sneak preview.