A lot of my improvements to SearchArray performance have been deep in the weeds of retrieval thingies. Making term or phrase search faster.
After all, that’s what I’ve been benchmarking. Issue a set of expensive term or phrase queries, see how fast they run in a loop of thousands of results.
Yet when I actually went to do actual work improving search relevance, such as playing with MSMarco I didn’t really FEEL that improvement.
The code to issue many queries, and gather their results looks something like this:
all_results = []
N = 10
for query in queries:
# Score results
scores = df['searchable_col'].score( ... tokenized terms ... )
# Get top N (unsorted)
top_n = np.argpartition(scores, -N)[-N:]
# Copy the top N out of df
results = df.iloc[top_n]
# Set some metadata, sort by score
results['query'] = query
results['score'] = scores[top_n]
results = results.sort_values('score', ascending=False)
results['rank'] = np.arange(N)
# Append
all_results.append(results)
all_results = pd.concat(all_results)
Indeed all the scoring itself, whether combining phrases, terms, etc didn’t seem to be the problem. It’s the copying, concatting, sorting, etc of this larger resultset - all_results
- that increasingly becomes the bottleneck.
So, now in SearchArray, I’ve been working on improving this with a simple class for helping gather search results.
Something like this, which takes scores, and finally when you’re done looping gives you the final result set.
class SetOfResultsNaive:
"""Gather multiple sets of search results."""
def __init__(self, df: pd.DataFrame, searchable=False):
self.all_results: List[pd.DataFrame] = []
self.df = df
def ins_top_n(self, scores, N=10,
query: str = ''):
"""Sort a dataframe by a score column.
Args:
df (pd.DataFrame): The dataframe to sort.
score_col (str): The column to sort by.
N (int): The number of rows to return.
Returns:
pd.DataFrame: The sorted dataframe.
"""
top_n = np.argpartition(scores, -N)[-N:]
results = self.df.iloc[top_n, ~self.df.columns.isin(self.searchable_cols)]
results['score'] = scores[top_n]
results['query'] = query
results_sorted = results.sort_values('score', ascending=False)
results_sorted['rank'] = np.arange(N)
self.all_results.append(results_sorted)
def get_all(self) -> pd.DataFrame:
return pd.concat(self.all_results)
But we can obviously do better:
- Avoid copying every loop, and just save the ids to later to slice out
- Avoid copying the SearchArray columns (copying inverted index is slow). Just get what we want to display.
- Keep metadata to the side until finally building the complete data frame
Given that, we can just update the code above to something less naive:
def ins_top_n(self, scores, N=10, query: str = '',
metadata: Optional[Dict[str, List[Any]]] = None):
"""Insert the top N rows into the set of results."""
top_n = np.argpartition(scores, -N)[-N:]
self.indices.extend(top_n)
self.metadata['score'].extend(scores[top_n])
self.metadata['query'].extend([query] * len(top_n))
# (Other metadata as needed)
And finally, at the end get what we need:
def get_all(self) -> pd.DataFrame:
# Copy indices, skipping searchable columns (SearchArray Cols)
subset = self.df.iloc[self.indices, ~self.df.columns.isin(self.searchable_cols)]
# Add metadata as columns
for key, values in self.metadata.items():
subset[key] = values
# Sort by query, then by score
sorted_subset = subset.sort_values(['query', 'score'], ascending=[True, False])
# Assign rank within each query
sorted_subset['rank'] = sorted_subset.groupby('query').cumcount()
return sorted_subset.reset_index(drop=True)
The results are pretty nice, a pretty beefy performance improvement. With performance dominated by actual searching.
This can be much faster, but just goes to show you:
- Look at the full process when bottlenecking, including the parts you take for granted
- Avoiding copies makes a big difference in saving performance