In my previous post (git repo) I asked an LLM to compare the relevance of two products from the WANDS furniture e-commerce search dataset. I then compared its preferences with those of human search raters, hoping that if I got close to agreement, I might be onto creating a good pairwise LLM search relevance evaluator without needing so many humans.

I noted that when letting the LLM chicken out and say “I don’t know”, precision improves. But that naturally reduces recall - the percentage of product pairs that get labeled - quite a bit.

So, as an example, we might start with the following simple, forced decision, asking an LLM to tell us which chair is more relevant to the query leather chairs:

System: You are a helpful assistant evaluating search relevance of furniture products.

Prompt: 
    Which of these products is more relevant to the furniture e-commerce search query:

    Query: leather chairs

    Product LHS: fashion casual work beauty salon task chair
    Product RHS: angelyn cotton upholstered parsons chair in gray/white

    Respond with just 'LHS' or 'RHS'


Response: LHS

If humans also say “LHS” - the salon task chair - is more relevant, then this is a win!
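
For concreteness, here’s roughly what an eval_fn wrapping that prompt might look like. This is a minimal sketch, not the repo’s actual code: it assumes an OpenAI-compatible chat endpoint (say, a local model behind localhost), and the helper name eval_name is mine.

from openai import OpenAI

# Assumption: an OpenAI-compatible server hosting the local model; adjust base_url / model to taste
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

SYSTEM = "You are a helpful assistant evaluating search relevance of furniture products."

def eval_name(query, product_lhs, product_rhs):
    """Hypothetical eval_fn: forced LHS/RHS decision using only the product names."""
    prompt = (
        "Which of these products is more relevant to the furniture e-commerce search query:\n\n"
        f"Query: {query}\n\n"
        f"Product LHS: {product_lhs}\n"
        f"Product RHS: {product_rhs}\n\n"
        "Respond with just 'LHS' or 'RHS'"
    )
    resp = client.chat.completions.create(
        model="local-model",  # placeholder name for whatever model you're running
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content.strip()
    return 'LHS' if answer.startswith('LHS') else 'RHS'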

Of course, in the WANDS case, as with many datasets, we don’t have direct pairwise labels. Instead we have absolute labels, and we’re really just comparing which human label is higher. That is, on a 0-2 scale, humans might label LHS a 2 (completely relevant) and RHS a 0 (an absolutely horrid result for leather chairs), so we say humans see LHS as more relevant.
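
In code, deriving the human side of each comparison from those graded labels is just a comparison of the two grades. A minimal sketch (the names are mine, not the repo’s):

def human_preference(grade_lhs, grade_rhs):
    """Turn absolute 0-2 relevance grades into a pairwise human preference."""
    if grade_lhs > grade_rhs:
        return 'LHS'
    if grade_rhs > grade_lhs:
        return 'RHS'
    return 'Neither'  # tied grades give no usable preference for this pair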

The above prompt, over 1000 pairs, gives:

poetry run python -m local_llm_judge.main --verbose --eval-fn name

...

Precision: 75.08% | Recall: 100% (N=1000)

But we can also allow the LLM to NOT label a pair due to lack of information:

    Neither product is more relevant to the query, unless given compelling evidence.
    
    Which of these product names (if either) is more relevant to the furniture e-commerce search query:
    
    Query: leather chairs
    
    Product LHS name: fashion casual work beauty salon task chair
        (remaining product attributes omitted)
    Or
    Product RHS name: angelyn cotton upholstered parsons chair in gray/white
        (remaining product attributes omitted)
    Or
    Neither / Need more product attributes
    
    Only respond 'LHS' or 'RHS' if you are confident in your decision
    
    Respond with just 'LHS - I am confident', 'RHS - I am confident', or 'Neither - not confident'
    with no other text. Respond 'Neither' if not enough evidence.

Response:

   Neither - not confident
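
The verbose reply then gets collapsed back into a simple decision. Something like this parsing step (my sketch, not necessarily the repo’s exact logic), where anything that isn’t clearly ‘LHS’ or ‘RHS’ counts as declining to label:

def parse_decision(answer):
    """Map the LLM's reply onto 'LHS', 'RHS', or 'Neither'."""
    answer = answer.strip()
    if answer.startswith('LHS'):
        return 'LHS'
    if answer.startswith('RHS'):
        return 'RHS'
    return 'Neither'  # includes 'Neither - not confident' and any other hedging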

Doing this 1000 times gives:

poetry run python -m local_llm_judge.main --verbose --eval-fn name_allow_neither

...

Precision: 85.38% | Recall: 17.10% (N=1000)

Out of the 17.10% labeled, 85.38% agreed with humans.
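
To spell out the metrics: recall here is the fraction of the 1000 pairs the LLM was willing to label at all, and precision is agreement with the human preference among just those labeled pairs. Roughly:

def precision_recall(llm_decisions, human_decisions):
    """Precision over labeled pairs; recall = fraction of pairs labeled at all."""
    labeled = [(llm, human) for llm, human in zip(llm_decisions, human_decisions)
               if llm in ('LHS', 'RHS')]
    if not labeled:
        return 0.0, 0.0
    agree = sum(1 for llm, human in labeled if llm == human)
    return agree / len(labeled), len(labeled) / len(llm_decisions)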

Adding an important sanity check - checking both ways

Turns out, it’s important to check twice. Put the LHS product on the RHS, and double check that you get the same result. In other words, first ask: is the ‘salon chair’ more relevant than the ‘parsons chair’ for the query leather chairs? Then reset, swap, and ask: is the parsons chair more relevant than the salon chair? We can then account for any bias an LLM might have toward the first or second product listed.

We wrap the prompt (eval_fn below) in a Python function that double checks:

def check_both_ways(query, product_lhs, product_rhs, eval_fn):  # eval_fn is a function wrapping the prompt
    """Get pairwise preference from LLM, but double check by swapping LHS and RHS to confirm consistent result."""
    decision1 = eval_fn(query, product_lhs, product_rhs)  # This just calls the LLM with the prompt LHS to RHS
    decision2 = eval_fn(query, product_rhs, product_lhs)  # Now check RHS first...

    if decision1 == 'LHS' and decision2 == 'RHS':
        return 'LHS'
    elif decision1 == 'RHS' and decision2 == 'LHS':
        return 'RHS'
    return 'Neither'
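
Called with the name-only evaluator sketched earlier, usage looks something like:

decision = check_both_ways(
    query="leather chairs",
    product_lhs="fashion casual work beauty salon task chair",
    product_rhs="angelyn cotton upholstered parsons chair in gray/white",
    eval_fn=eval_name,  # the hypothetical name-only evaluator from above
)
print(decision)  # 'LHS', 'RHS', or 'Neither' if the two orderings disagree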

Adding a command line switch to check both ways:

$ poetry run python -m local_llm_judge.main --verbose --eval-fn name --check-both-ways

This dramatically improves performance. When the above is run, precision goes up appreciably while retaining a high degree of product coverage:

Precision: 87.99% | Recall: 65.80% (N=1000)

Even more interesting, combining this with allowing ‘I don’t know’ gives the highest precision, though with a significant reduction in recall:

$ poetry run python -m local_llm_judge.main --verbose --eval-fn name_allow_neither --check-both-ways

...

Precision: 90.76% | Recall: 11.90% (N=1000)

So to summarize where my efforts stand, just using the product name, we can build this summary table showing Precision / Recall for each approach:

                 Don't check         --check-both-ways
Force            75.08% / 100%       87.99% / 65.80%
Allow Neither    85.38% / 17.10%     90.76% / 11.90%

So depending on your use case, you should pick the appropriate solution. Want a high degree of coverage and can tolerate a lot of mistakes? Use the forced decision and don’t double check. Want to tolerate very few mistakes and only flag results with real issues? Double check and allow the LLM to say it doesn’t know (i.e. ‘allow neither’).

I also tried creating evaluators that look at only a few other fields. Here’s what happens when showing the LLM only the product “class” (as in its classification). Example product classes (with a sketch of the general prompt pattern after them):

Product LHS class: Beds
Product RHS class: Kids Beds
Product LHS class: Coffee & Cocktail Tables
Product RHS class: Outdoor Fireplaces
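
These field-specific evaluators all follow the same pattern as the name prompt, just fed a different attribute. A sketch of the general shape (the helper names eval_field and call_llm are mine, for illustration; call_llm stands in for the chat call shown earlier):

def eval_field(query, lhs_value, rhs_value, field_name):
    """Hypothetical field-only evaluator: same forced-choice prompt, different product attribute."""
    prompt = (
        "Which of these products is more relevant to the furniture e-commerce search query:\n\n"
        f"Query: {query}\n\n"
        f"Product LHS {field_name}: {lhs_value}\n"
        f"Product RHS {field_name}: {rhs_value}\n\n"
        "Respond with just 'LHS' or 'RHS'"
    )
    answer = call_llm(prompt)  # stand-in for however you invoke the LLM
    return 'LHS' if answer.strip().startswith('LHS') else 'RHS'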

Run with these 4 variants:

poetry run python -m local_llm_judge.main --verbose --eval-fn classs
poetry run python -m local_llm_judge.main --verbose --eval-fn class_allow_neither
poetry run python -m local_llm_judge.main --verbose --eval-fn classs --check-both-ways
poetry run python -m local_llm_judge.main --verbose --eval-fn class_allow_neither --check-both-ways

                 Don't check         --check-both-ways
Force            70.5% / 100%        87.76% / 58.0%
Allow Neither    87.01% / 17.70%     84.47% / 10.3%

And repeating for the product’s full categorization hierarchy (like Outdoor Furniture > Seating > Adirondack Chairs...):

                 Don't check         --check-both-ways
Force            74.6% / 100%        86.1% / 69.70%
Allow Neither    85.71% / 18.20%     89.91% / 10.8%

Finally, the noisiest of them all, the product description:

                 Don't check         --check-both-ways
Force            70.31% / 98.70%     76.58% / 72.60%
Allow Neither    79.21% / 10.10%     83.02% / 5.3%

(Note that with the product description, even when ‘forcing’ a decision, the LLM still sometimes said it couldn’t tell, completely unprompted!)


Doug Turnbull
