First lookup taking longer #151
-
First of all, this is an awesome project! I am looking to deploy my own project using this at its core. Since the main benefit of using RapidFuzz is speed, I wanted to see whether I can understand what I might be doing wrong with regard to some of the performance I am seeing. My program looks at a test DB consisting of just a name and policy # and scores those two columns against full-page OCR data; basically, I want to associate each document with the best-matching record. What I've noticed is that the first time the process.extract function runs, it takes about 2 seconds to score all 80,000 records. Since this is in a loop, the second time process.extract runs it is much faster. I'm guessing that something is being loaded/stored in memory or a cache, but I'm not sure. Any idea why the first run takes longer? If I were going to set this up as a service, would I be able to pre-load this? Code:
Output: Start: 2021-10-25 17:46:56.363752
Replies: 2 comments 1 reply
-
You appear to use two different string metrics in the two cases: one is using fuzz.token_set_ratio, the other fuzz.partial_ratio. The performance difference should be caused by this. You might also want to use process.cdist([processed_query], list(data[y]), scorer=fuzz.partial_ratio) instead of process.extract. process.extract has to create a Python list of tuples, which is relatively slow, while cdist returns a numpy matrix with all the similarities, which is faster to create. Or, since you mention that this is called in a loop, you might be able to match multiple queries in parallel: process.cdist(processed_queries, list(data[y]), scorer=fuzz.partial_ratio, workers=-1)
-
Modified my code to check this. I did two things: used only one metric, and ran it twice, switching which element I search for first.
Run 1: Start: 2021-10-26 10:22:02.478695
Run 2: Start: 2021-10-26 10:24:57.634184
I got wildly different timings, but as always the first run takes seconds and the second run is always sub-second. I'm guessing, but is there some sort of transformation that takes place against list(data[y]) within the extract function which would only occur on the first run? I'm going to try to implement the cdist function now and will come back with the results!