In my previous post I discussed using an LLM as a pairwise judge (see this git repo): given two products from the WANDS furniture e-commerce dataset, I ask the LLM which product is more relevant to the query. I then measure how often its preference agrees with the preference of human raters.
I noted that adding the option to chicken out and say “I don't know” improves the precision of the pairwise preference, but it reduces recall (the share of pairs that get a label) quite a bit.
Forcing a decision:
$ poetry run python -m local_llm_judge.main --verbose --eval-fn name
System: You are a helpful assistant evaluating search relevance of furniture products.
Prompt:
Which of these products is more relevant to the furniture e-commerce search query:
Query: leather chairs
Product LHS: fashion casual work beauty salon task chair
Product RHS: angelyn cotton upholstered parsons chair in gray/white
Respond with just 'LHS' or 'RHS'
Response: LHS
Gives, over 1000 pairs:
Precision: 75.08% | Recall: 100% (N=1000)
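For reference, here is a minimal sketch of how these two numbers could be computed. The definitions are my assumption about what the tool reports, not code from the repo: recall is the share of pairs where the LLM picked a side, and precision is the share of those decided pairs that agree with the human raters. The function name `precision_recall` and the toy labels are made up for illustration.

```python
def precision_recall(llm_decisions, human_preferences):
    """Score pairwise decisions against human preferences.

    Assumed definitions (not taken from the repo):
      recall    = pairs where the LLM picked 'LHS' or 'RHS' / all pairs
      precision = decided pairs matching the human preference / decided pairs
    """
    if len(llm_decisions) != len(human_preferences):
        raise ValueError("need one human preference per LLM decision")
    # Keep only the pairs where the LLM committed to a side
    decided = [(llm, human)
               for llm, human in zip(llm_decisions, human_preferences)
               if llm in ('LHS', 'RHS')]
    n_total = len(llm_decisions)
    recall = len(decided) / n_total if n_total else 0.0
    agree = sum(1 for llm, human in decided if llm == human)
    precision = agree / len(decided) if decided else 0.0
    return precision, recall


# Toy data, made up for illustration only
llm = ['LHS', 'Neither', 'RHS', 'RHS', 'Neither']
human = ['LHS', 'RHS', 'LHS', 'RHS', 'LHS']
precision, recall = precision_recall(llm, human)
print(f"Precision: {precision:.2%} | Recall: {recall:.2%} (N={len(llm)})")
```

With forced decisions every pair is decided, so under these definitions recall is 100% by construction and only precision can move.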
Using just the product name, there are a large number of ambiguous cases where more information is needed, so it seemed wise to let the LLM tell us it prefers neither. Allowing the LLM to say “Neither” when it does not have enough information improves precision quite a bit, but covers far fewer product pairs:
$ poetry run python -m local_llm_judge.main --verbose --eval-fn name_allow_neither
Neither product is more relevant to the query, unless given compelling evidence.
Which of these product names (if either) is more relevant to the furniture e-commerce search query:
Query: leather chairs
Product LHS name: fashion casual work beauty salon task chair
(remaining product attributes omited)
Or
Product RHS name: angelyn cotton upholstered parsons chair in gray/white
(remaining product attributes omited)
Or
Neither / Need more product attributes
Only respond 'LHS' or 'RHS' if you are confident in your decision
Respond with just 'LHS - I am confident', 'RHS - I am confident', or 'Neither - not confident'
with no other text. Respond 'Neither' if not enough evidence.
Precision: 85.38% | Recall: 17.10% (N=1000)
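Since this prompt asks for answers like 'LHS - I am confident', the raw response has to be reduced to a plain label before scoring. Below is a hedged sketch of one way to do that. `parse_decision` is a hypothetical helper, not the repo's actual parsing code, and it makes the conservative assumption that anything not clearly starting with 'LHS' or 'RHS' counts as 'Neither'.

```python
def parse_decision(response):
    """Map a raw LLM response to 'LHS', 'RHS', or 'Neither'.

    Hypothetical helper: anything that does not clearly start with
    'LHS' or 'RHS' is treated as 'Neither'.
    """
    # Drop surrounding whitespace and quotes, then ignore case
    text = response.strip().strip("'\"").upper()
    if text.startswith('LHS'):
        return 'LHS'
    if text.startswith('RHS'):
        return 'RHS'
    return 'Neither'


for raw in ['LHS - I am confident', "'RHS - I am confident'", 'Neither - not confident']:
    print(raw, '->', parse_decision(raw))
```

Treating unparseable output as 'Neither' trades a little recall for not counting a garbled answer as a preference.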
Important sanity check - checking both ways
An important sanity check is to ask twice: swap the products, putting the LHS product on the RHS, and confirm that the LLM still prefers the same product. So I added this check to my code:
def check_both_ways(query, product_lhs, product_rhs, eval_fn):
    """Get pairwise preference from LLM, but double check by swapping LHS and RHS to confirm consistent result."""
    decision1 = eval_fn(query, product_lhs, product_rhs)
    decision2 = eval_fn(query, product_rhs, product_lhs)
    if decision1 == 'LHS' and decision2 == 'RHS':
        return 'LHS'
    elif decision1 == 'RHS' and decision2 == 'LHS':
        return 'RHS'
    return 'Neither'
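To make the behavior concrete, here is a small self-contained sketch that runs the same function against two stand-in judges. Both judges (`always_lhs` and `keyword_judge`) and the product names are hypothetical, made up for this example. They are not LLM calls and say nothing about how the real model behaves.

```python
def check_both_ways(query, product_lhs, product_rhs, eval_fn):
    """Same logic as the function above, repeated so this example runs alone."""
    decision1 = eval_fn(query, product_lhs, product_rhs)
    decision2 = eval_fn(query, product_rhs, product_lhs)
    if decision1 == 'LHS' and decision2 == 'RHS':
        return 'LHS'
    elif decision1 == 'RHS' and decision2 == 'LHS':
        return 'RHS'
    return 'Neither'


def always_lhs(query, product_lhs, product_rhs):
    """Hypothetical judge that ignores content and always picks the first slot."""
    return 'LHS'


def keyword_judge(query, product_lhs, product_rhs):
    """Hypothetical judge that prefers the product sharing more words with the query."""
    query_words = set(query.lower().split())
    lhs_overlap = len(query_words & set(product_lhs.lower().split()))
    rhs_overlap = len(query_words & set(product_rhs.lower().split()))
    if lhs_overlap == rhs_overlap:
        return 'Neither'
    return 'LHS' if lhs_overlap > rhs_overlap else 'RHS'


query = 'leather chairs'
a = 'brown leather armchair'  # made-up product name
b = 'oak dining table'        # made-up product name

biased = check_both_ways(query, a, b, always_lhs)         # position-only answer
consistent = check_both_ways(query, a, b, keyword_judge)  # content-based answer
swapped = check_both_ways(query, b, a, keyword_judge)     # same pair, other order
print(biased, consistent, swapped)
```

A judge that answers based on position alone contradicts itself after the swap and is demoted to 'Neither', while a judge that answers based on content keeps pointing at the same product regardless of which slot it sits in.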
I also added a command-line switch that wraps any prompt in this check, so it can be compared against not checking both ways:
$ poetry run python -m local_llm_judge.main --verbose --eval-fn name --check-both-ways
I was a bit surprised how much this improved things. When the above is run, precision goes up appreciably while still covering a large share of pairs:
Precision: 87.99% | Recall: 65.80% (N=1000)
Even more interesting, when combining this with allowing neither, I get the highest precision:
$ poetry run python -m local_llm_judge.main --verbose --eval-fn name_allow_neither --check-both-ways
Precision: 90.76% | Recall: 11.90% (N=1000)
So to summarize where my efforts stand (just using product name), we can build this 2x2 table showing precision / recall for each approach:
| | Don't check | --check-both-ways |
|---|---|---|
| Force | 75.08% / 100% | 87.99% / 65.80% |
| Allow Neither | 85.38% / 17.10% | 90.76% / 11.90% |
So depending on your use case, you should pick the appropriate solution. Want a high degree of coverage and can tolerate a lot of mistakes? Use force / don't check. Can tolerate very few mistakes? Then the extra ~3 points of precision from allowing neither and checking both ways might matter, even at much lower recall.
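One way to make that choice mechanical, sketched under my own assumptions: set a minimum acceptable precision, then take whichever configuration has the highest recall among those that clear it. The numbers are the ones reported in this post. `pick_config` and the configuration labels are hypothetical names, not part of the repo.

```python
# (precision, recall) per configuration, copied from the results in this post
RESULTS = {
    ('force', 'no check'): (0.7508, 1.0000),
    ('force', 'check both ways'): (0.8799, 0.6580),
    ('allow neither', 'no check'): (0.8538, 0.1710),
    ('allow neither', 'check both ways'): (0.9076, 0.1190),
}


def pick_config(min_precision, results=RESULTS):
    """Return the highest-recall configuration meeting the precision floor, or None."""
    eligible = {name: pr for name, pr in results.items() if pr[0] >= min_precision}
    if not eligible:
        return None
    # Among configurations that are precise enough, prefer the most coverage
    return max(eligible, key=lambda name: eligible[name][1])


for floor in (0.70, 0.85, 0.90):
    print(floor, pick_config(floor))
```

Under this rule, allowing neither without the swap check never wins on these numbers, because forcing a decision and checking both ways has both higher precision and higher recall.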