In previous tips, I talked about tail latency:

  • Your search cluster becomes as slow as the slowest node
  • A wide per-node latency distribution results in an even wider per-cluster distribution

The higher the scale, the more sharded your data becomes, and the more small per-node latency problems get amplified into cluster-wide latency problems.
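A quick simulation makes this concrete (this is my sketch, not from the original tip; the lognormal parameters are made up). Per-node latency is drawn from a long-tailed distribution, and a fan-out query's latency is the max across the shards it touches:

```python
import random
import statistics

random.seed(42)

def node_latency_ms():
    # Hypothetical per-node latency: lognormal gives a long tail,
    # like a node that occasionally does "more work"
    return random.lognormvariate(mu=3.0, sigma=0.5)

def cluster_latency_ms(num_shards):
    # A fan-out query is as slow as its slowest shard
    return max(node_latency_ms() for _ in range(num_shards))

def p99(samples):
    # 99th percentile: the last of 99 cut points
    return statistics.quantiles(samples, n=100)[98]

for shards in (1, 10, 100):
    samples = [cluster_latency_ms(shards) for _ in range(10_000)]
    print(f"{shards:>3} shards: median={statistics.median(samples):6.1f}ms  "
          f"p99={p99(samples):6.1f}ms")
```

Even the *median* cluster latency climbs toward the single-node tail as shard count grows, because every query gets a fresh chance to hit a slow shard.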

So when I think about graph-based vector retrieval at scale, like HNSW, I get nervous.

With HNSW you’re:

  • Navigating a non-deterministic graph, whose structure depends on the order in which it was constructed
  • Built from a non-deterministic set of points specific to that node
  • Jumping around to different areas of memory, destroying any memory locality

All the non-determinism here worries me. Some nodes will quickly converge on nearest neighbors. Other nodes will take more work.

The cluster’s latency becomes the latency of the nodes that take the most work.

Since benchmarks like ANN-Benchmarks run on a single system with nicely warmed memory, they’ll miss these issues.

So search carefully, and measure your cluster behavior, not just single-node behavior!

-Doug

This is part of Doug’s Daily Search tips - subscribe here


Enjoy softwaredoug in training course form!

Starting June 22!

I hope you join me at Cheat at Search with LLMs to learn how to apply LLMs to search applications. Check out this post for a sneak preview.

Doug Turnbull

More from Doug
Twitter | LinkedIn | Newsletter | Bsky
Take My New Course - Cheat at Search with LLMs