First lookup taking longer #151
-
First of all, this is an awesome project! I am looking to deploy my own project using this at its core. Since the main benefit of using RapidFuzz is speed, I wanted to see whether I can understand what I might be doing wrong with regard to some of the performance I am seeing. My program looks at a test DB consisting of just a name and policy # and scores those two columns against full-page OCR data; basically, I want to associate each document with the best-matching record. What I've noticed is that the first time the process.extract function runs, it takes about 2 seconds to score all 80,000 records. Since this is in a loop, the second time process.extract runs it is much faster. I'm guessing that something is being loaded/stored in memory or a cache, but I'm not sure. Any idea why the first run takes longer? If I were going to set this up as a service, would I be able to pre-load this? Code:
Output: Start: 2021-10-25 17:46:56.363752
Replies: 2 comments 1 reply
-
You appear to use two different string metrics in the two cases: one is using fuzz.token_set_ratio, the other fuzz.partial_ratio. The performance difference should be caused by this. You might also want to use process.cdist([processed_query], list(data[y]), scorer=fuzz.partial_ratio) instead of process.extract. process.extract has to create a Python list of tuples, which is relatively slow, while cdist returns a numpy matrix with all the similarities, which is faster to create. Or, since you mention that this is called in a loop, you might be able to match multiple queries in parallel: process.cdist(processed_queries, list(data[y]), scorer=fuzz.partial_ratio, workers=-1)
-
Modified my code to check this. I did two things: used only one metric, and ran it twice, switching which element I search for first.
Run 1: Start: 2021-10-26 10:22:02.478695
Run 2: Start: 2021-10-26 10:24:57.634184
I got wildly different timings, but as always the first run takes seconds and the second run is always sub-second. I'm guessing, but is there some sort of transformation that takes place against list(data[y]) within the extract function which would only occur on the first run? I'm going to try to implement the cdist function now and will come back with the results!