During the last two weeks I have been working on some silly vector similarity search algorithm work. My equivalent of being a Dad tinkering in the garage with an old lawnmower engine…

One lesson I’ve learned painfully over the years - always write the naive, reference implementation first. Don’t skip the dumb, slow, but obviously correct version of an algorithm. Even if you 100% know you will be optimizing later. Focus on a full, working version of your idea.

Duh premature optimization is the root of all evils. But this goes beyond avoiding premature optimization. To me this is a key tenet of algorithm development - whether writing a sorting algorithm or hash function. In fact, I advocate that you maintain the naive version over the long haul as a core component of your system.

With your reference implementation, you can:

  1. Have something easy to read you can confirm works without mental gymnastics, that helps document the intended behavior of your algorithm
  2. Test your optimizations against the naive version, ensuring identical behavior or acceptable regressions for the performance gains
  3. Ensure as you maintain your codebase, those optimizations stay in place with automated testing, by explicitly testing the speed of your optimized version over the naive version
  4. Ensure optimizations handle weird edge cases that are more obvious to see as correct in the naive version

In other words, you need to quickly build a sub-optimal, but working prototype that can be used to ensure your algorithm can grow past this early, nascent stage. It’s never put into production, yet it’s nevertheless is crucial to have on the factory floor as a reference point for correctness.

We know, for example, bubble sort is not ideal as a default sorting algorithm. But for most, it’s relatively more straightforward to read and understand than heap sort or quick sort. If I were creating a new, fancy sorting algorithm, I’d start by creating a bubble sort in my test as a reference implementation:

# sort_test.py

# Naive sort, reference implementation:
def bubble_sort(lst):
# lots of tests of bubble sort

I would ensure bubble_sort was correct. Slow, perhaps, but correct. I’d test it thoroughly, and use it as a reference implementation for future sorting algorithms.

Now I have a backbone, a way to tweak my production line, to get lists sorted faster, but always refer back to bubble_sort as the ‘ground truth’ of how sorting should work for detailed analysis of new improvements. I can write tests like:

def test_fancy_sort_same_behavior_as_bubble_sort():
    assert bubble_sort(lst) == fancy_sort(list)


def test_fancy_sort_is_faster():
    assert (10 * fancy_test_time) < bubble_sort_time

Going even further, you can make decisions on really finicky requirements like whether your sort should be stable, you can then decide, perhaps months after your system is in production, whether your reference implementation should reflect that.

Evolving the sort algorithm becomes just as much about the prototype sitting on the factory floor as it does the assembly line that produces ‘sort engines’ for real people. So always keep a reference implementation around.

Doug Turnbull

More from Doug
Twitter | LinkedIn | Mastodon
Doug's articles at OpenSource Connections | Shopify Eng Blog