A common requirement in data clean-up is the scenario where a data set (database, pandas DataFrame) has multiple database records for the same entity and duplicates need to be found. This example will not cover the task of merging or removing duplicate records — what it will do is use String Grouper to find duplicate records using the match_strings function and the optional IDs functionality.
For the example we will use this simple data set. The number of rows is not important, the 'name' column has a number of typical cases of types of variations in spelling.
id,name
AA012345X,mega enterprises corp.
BB016741P,mega enterprises corporation
CC052345T,mega corp.
AA098762D,hyper startup inc.
BB099931J,hyper-startup inc.
CC082744L,hyper startup incorporated
HH072982K,hyper hyper inc.
AA903844B,slow and steady inc.
BB904941H,slow and steady incorporated
CC903844B,slow steady inc.
AA777431C,abc enterprises inc.
BB760431Y,a.b.c. enterprises incorporated
BB750431M,a.b.c. enterprises inc.
ZZ123456H,one and only inc.
The steps below will process the above sample file using String Grouper to search for matches in the values in the 'name' column. The results shown in the tables at each step are based on the sample data above.
import pandas as pd
from string_grouper import match_strings
Tip: Assuming the data set will come from an external database, for optimum performance only do an export of the ID column, and the text column that matching will be done on, and convert the text data column (not the ID column) to lower case.
accounts = pd.read_csv('string_grouper/tutorials/accounts.csv')
# Show dataframe
accounts
id | name | |
---|---|---|
0 | AA012345X | mega enterprises corp. |
1 | BB016741P | mega enterprises corporation |
2 | CC052345T | mega corp. |
... | ... | ... |
Next, use the match_strings
function and pass the 'name' column as the argument to the master
parameter, and the 'id' column as the argument to the master_id
parameter.
N.B. In production with a real data set, depending on its size, the following command can/may take a number of minutes — no update/progress indicator is shown. This obviously also depends on the performance of the computer used. Memory and hard disk performance are a factor, as well as the CPU. String Grouper uses pandas which, in turn, uses NumPy, so matching is not done by computationally intensive looping, but by array mathematics — but it still may take some time to process large data sets.
matches = match_strings(accounts['name'], master_id = accounts['id'], ignore_index=True)
matches
This will return a pandas DataFrame as below. The values (company) we will focus on in this example will be those that have variations in the name of the fictitious company, 'Hyper Startup Inc.'.
left_id | left_name | similarity | right_name | right_id | |
---|---|---|---|---|---|
... | ... | ... | ... | ... | ... |
3 | AA098762D | hyper startup inc. | 1.00 | hyper-startup inc. | BB099931J |
4 | AA098762D | hyper startup inc. | 1.00 | hyper startup inc. | AA098762D |
5 | BB099931J | hyper-startup inc. | 1.00 | hyper-startup inc. | BB099931J |
6 | BB099931J | hyper-startup inc. | 1.00 | hyper startup inc. | AA098762D |
7 | CC082744L | hyper startup incorporated | 1.00 | hyper startup incorporated | CC082744L |
... | ... | ... | ... | ... | ... |
In a pattern-matching process, each value in a row of the column being matched is checked against every other value in the column.
Processing this using typical Python looping code would mean, in the case of a 100,000 row data set, that the total iterations would be 100,0002 = 10 Billion. Processing that number of iterations might require replacing the CPU of the computer after each investigation! Well maybe not ... but you would have time for a few cups of coffee. String Grouper works in a totally different way.
In the resultant DataFrame above, we see the IDs (AA098762D, BB099931J) having each a group of two values — once where a close match is found, and once where its own record (value) is found. The third ID, CC082744L, is only returned once, even though it is pretty clear that it would be a variation of our fictitious company 'Hyper Startup Inc.'
String Grouper has a number of configuration options (see the kwargs in README.md). The option of interest in the above case is min_similarity
.
The default minimum similarity is 0.8. It can be seen that more matches may be found by reducing the minimum similarity from 0.8 to, for example, 0.7.
matches = match_strings(accounts['name'], master_id = accounts['id'], ignore_index = True, min_similarity = 0.7)
Tip: If the data set being matched is large, and you wish to experiment with the minimum similarity option, it may be helpful to import only a limited data set during testing, and increase to the full data set when ready. The number of rows imported can be specified in this way:
# We only look at the first 50k as an example
accounts = pd.read_csv('/path/to/folder/huge_file.csv')[0:50000]
Back to our example ... changing the option to min_similarity = 0.7
returns this:
left_id | left_name | similarity | right_name | right_id | |
---|---|---|---|---|---|
... | ... | ... | ... | ... | ... |
5 | AA098762D | hyper startup inc. | 1.00 | hyper-startup inc. | BB099931J |
6 | AA098762D | hyper startup inc. | 1.00 | hyper startup inc. | AA098762D |
7 | AA098762D | hyper startup inc. | 0.78 | hyper startup incorporated | CC082744L |
8 | BB099931J | hyper-startup inc. | 1.00 | hyper-startup inc. | BB099931J |
9 | BB099931J | hyper-startup inc. | 1.00 | hyper startup inc. | AA098762D |
10 | BB099931J | hyper-startup inc. | 0.78 | hyper startup incorporated | CC082744L |
11 | CC082744L | hyper startup incorporated | 1.00 | hyper startup incorporated | CC082744L |
12 | CC082744L | hyper startup incorporated | 0.78 | hyper-startup inc. | BB099931J |
13 | CC082744L | hyper startup incorporated | 0.78 | hyper startup inc. | AA098762D |
14 | HH072982K | hyper hyper inc. | 1.00 | hyper hyper inc. | HH072982K |
... | ... | ... | ... | ... | ... |
Now we see the IDs — AA098762D, BB099931J, CC082744L — have further matches. Each 'name' value has two other matching rows (IDs). However, we see that setting minimum similarity to 0.7 has still not matched 'hyper hyper inc.' (ID HH072982K) even though a person would judge that the 'name' is a match. The minimum similarity setting can be adjusted up and down until it is considered that most duplicates are being matched. If so, we can progress.
Once we are happy with the level of matching, we can remove the rows where the IDs are the same. Having the original (database) IDs for the rows means that we can precisely remove identical rows — that is, we are not removing matches based on similar values, but on the exact (database) IDs:
dupes = matches[matches.left_id != matches.right_id]
dupes
And we see the following for the company name we have been following:
left_id | left_name | similarity | right_name | right_id | |
---|---|---|---|---|---|
... | ... | ... | ... | ... | ... |
5 | AA098762D | hyper startup inc. | 1.00 | hyper-startup inc. | BB099931J |
7 | AA098762D | hyper startup inc. | 0.78 | hyper startup incorporated | CC082744L |
9 | BB099931J | hyper-startup inc. | 1.00 | hyper startup inc. | AA098762D |
10 | BB099931J | hyper-startup inc. | 0.78 | hyper startup incorporated | CC082744L |
12 | CC082744L | hyper startup incorporated | 0.78 | hyper-startup inc. | BB099931J |
13 | CC082744L | hyper startup incorporated | 0.78 | hyper startup inc. | AA098762D |
... | ... | ... | ... | ... | ... |
N.B. the pandas index number 14 has gone because the left and right side IDs were identical.
Finally we reduce the data to a pandas Series ready for exporting with one row for each record that has any duplicates.
company_dupes = pd.DataFrame(dupes.left_id.unique()).squeeze().rename('company_id')
company_dupes
This gives the following result:
0 AA012345X
1 BB016741P
2 AA098762D
3 BB099931J
4 CC082744L
5 AA903844B
6 BB904941H
7 AA777431C
8 BB760431Y
9 BB750431M
Name: company_id, dtype: object
How this is processed, as with any database clean-up, is out of the scope of this tutorial. A first step however could be:
- Import the list of database IDs into the relevant database as a temporary table
- Do an inner-join with the original table the data was exported from and sort ascending by the 'name' column
This will return filtered rows with the 'name' field in adjacent rows showing similar matched strings.