Identifying master data duplicates – Simple, isn’t it?
Today we deal with a simple request as identifying master data duplicates. Many of us know the problem with duplicates: actually a customer or vendor should exist only once in a system, in fact there are multiple records, as duplicates were created from data migrations or merging systems. Therefore “Dupli, Cate” can exist more than once.
Vendor No. | Name | Address | TaxID |
1001 | Dupli, Cate | Abbey Road 17, 94469 Langley | CC00000001 |
5311 | Dupli, Cate | Abbey Rd.17, 94469 Langley | AB11223344 |
6442 | Johnny Walker | Kingstr. 13, 55117 Oxford | ED98-76-54-32 |
9876 | Johnnie Walker | Kingstr. 13, 55117 Oxford | ED98765432 |
9877 | John Doe | Forest Lane1, 77788 Hoboken | AB11223344 |
9992 | Charles Little | Abbey Rd.17, 94469 Langley | ZZ55667799 |
Different types of spelling can be a reason: „Johnny Walker“ already exists in the system, the user is searching for „Johnnie Walker“, but will not get a match and consequently enters a second master data record.
Complex problems can result from that: the wrong vendor gets paid (or a reminder is posted to the wrong vendor, which may even lead to a duplicate payment). More errors occur in the processes, or slow down actions.
How can such duplicates be identified? We want to show some aspects of the approach:
- Choice of suitable search criteria
- Selective vs. blurred search
- Data preparation / cleansing
- Problems comparing results
Suitable search criteria
Obviously it would have no effect to compare names, as sign comparing search would find „Dupli, Cate“, but not „Johnny Walker“ , as it is different from “Johnnie Walker“.
Vendor No. | Name | Address | TaxID |
1001 | Dupli, Cate | Abbey Road 17, 94469 Langley | CC00000001 |
5311 | Dupli, Cate | Abbey Rd.17, 94469 Langley | AB11223344 |
6442 | Johnny Walker | Kingstr. 13, 55117 Oxford | ED98-76-54-32 |
9876 | Johnnie Walker | Kingstr. 13, 55117 Oxford | ED98765432 |
9877 | John Doe | Forest Lane1, 77788 Hoboken | AB11223344 |
9992 | Charles Little | Abbey Rd.17, 94469 Langley | ZZ55667799 |
Here we need different approach, e.g. duplicate search on address, tax or phone numbers or banking details. It is important to use search combinations creatively. The more unique search criteria are used, the less „false positives“ (very low potential duplicates) you get.
Selective vs. blurred search
Sometimes it is necessary to do a blurred search in order to find master data, which exists multiple times, that cannot be identified by conventional analysis, e.g. on sound or by looking at the Levenshtein distance. Modern analysis software provides these search features and therefore is able to find special types of duplicates. More matches however can contain more „false positives“, which means more effort to be looked at in detail.
Vendor No. | Name | Address | TaxID |
---|---|---|---|
1001 | Dupli, Cate | Abbey Road 17, 94469 Langley | CC00000001 |
5311 | Dupli, Cate | Abbey Rd.17, 94469 Langley | AB11223344 |
6442 | Johnny Walker | Kingstr. 13, 55117 Oxford | ED98-76-54-32 |
9876 | Johnnie Walker | Kingstr. 13, 55117 Oxford | ED98765432 |
9877 | John Doe | Forest Lane1, 77788 Hoboken | AB11223344 |
9992 | Charles Little | Abbey Rd.17, 94469 Langley | ZZ55667799 |
Data preparation / cleansing
Vendor | Name | Address | TaxID | TaxID clean |
1001 | Dupli, Cate | Abbey Road 17, 94469 Langley | CC00000001 | CC00000001 |
5311 | Dupli, Cate | Abbey Rd.17, 94469 Langley | AB11223344 | AB11223344 |
6442 | Johnny Walker | Kingstr. 13, 55117 Oxford | ED98-76-54-32 | ED98765432 |
9876 | Johnnie Walker | Kingstr. 13, 55117 Oxfordccc | ED98765432 | ED98765432 |
9877 | John Doe | Forest Lane1, 77788 Hoboken | AB11223344 | AB11223344 |
9992 | Charles Little | Abbey Rd.17, 94469 Langley | ZZ55667799 | ZZ55667799 |
Problems comparing results
Running analysis with different criteria it will be difficult to join or compare the results.
A search based on “Name” will find one duplicate (1001 / 5311), based on „Address“ two duplicates (5311 / 9992 und 6442 / 9876). Checking “TaxID” one duplicate is the result. Indirectly 9877 is related to 1001, although they do not share any direct criterion such as Name, Address, TaxID.
Vendor No. | Name | Address | TaxID |
1001 | Dupli, Cate | Abbey Road 17, 94469 Langley | CC00000001 |
5311 | Dupli, Cate | Abbey Rd.17, 94469 Langley | AB11223344 |
6442 | Johnny Walker | Kingstr. 13, 55117 Oxford | ED98-76-54-32 |
9876 | Johnnie Walker | Kingstr. 13, 55117 Oxford | ED98765432 |
9877 | John Doe | Forest Lane1, 77788 Hoboken | AB11223344 |
9992 | Charles Little | Abbey Rd.17, 94469 Langley | ZZ55667799 |
In summary we can say that a trivial request to "identify duplicates" on closer inspection may represent a challenge. Important in the planning of such projects are therefore a stable concept with a clear roadmap in terms of methodology, as well as a plan for the subsequent correction of the duplicates in the system.