07.01.2014

Iden­ti­fy­ing mas­ter da­ta du­pli­ca­tes – Sim­ple, isn’t it?

Today we deal with a simple request as identifying master data duplicates. Many of us know the problem with duplicates: actually a customer or vendor should exist only once in a system, in fact there are multiple records, as duplicates were created from data migrations or merging systems. Therefore “Dupli, Cate” can exist more than once.

 

Vendor No.NameAddressTaxID
1001Dupli, CateAbbey Road 17, 94469 LangleyCC00000001
5311Dupli, CateAbbey Rd.17, 94469 LangleyAB11223344
6442 Johnny Walker Kingstr. 13, 55117 Oxford ED98-76-54-32
9876 Johnnie Walker Kingstr. 13, 55117 OxfordED98765432
9877John Doe Forest Lane1, 77788 HobokenAB11223344
9992 Charles LittleAbbey Rd.17, 94469 Langley ZZ55667799

 

Different types of spelling can be a reason: „Johnny Walker“ already exists in the system, the user is searching for „Johnnie Walker“, but will not get a match and consequently enters a second master data record.

Complex problems can result from that: the wrong vendor gets paid (or a reminder is posted to the wrong vendor, which may even lead to a duplicate payment). More errors occur in the processes, or slow down actions.

How can such duplicates be identified? We want to show some aspects of the approach:

  1. Choice of suitable search criteria
  2. Selective vs. blurred search
  3. Data preparation / cleansing
  4. Problems comparing results

Suit­able search cri­te­ria

Obviously it would have no effect to compare names, as sign comparing search would find „Dupli, Cate“, but not „Johnny Walker“ , as it is different from “Johnnie Walker“.

 

Vendor No.
Name
Address
TaxID
1001 Dupli, Cate Abbey Road 17, 94469 Langley CC00000001
5311 Dupli, Cate Abbey Rd.17, 94469 Langley AB11223344
6442 Johnny Walker Kingstr. 13, 55117 OxfordED98-76-54-32
9876Johnnie WalkerKingstr. 13, 55117 OxfordED98765432
9877John DoeForest Lane1, 77788 Hoboken AB11223344
9992Charles Little Abbey Rd.17, 94469 LangleyZZ55667799

 

Here we need different approach, e.g. duplicate search on address, tax or phone numbers or banking details. It is important to use search combinations creatively. The more unique search criteria are used, the less „false positives“ (very low potential duplicates) you get.

Se­lec­tive vs. blur­red search

Sometimes it is necessary to do a blurred search in order to find master data, which exists multiple times, that cannot be identified by conventional analysis, e.g. on sound or by looking at the Levenshtein distance. Modern analysis software provides these search features and therefore is able to find special types of duplicates. More matches however can contain more „false positives“, which means more effort to be looked at in detail.

 

 

Vendor No.NameAddressTaxID
1001 Dupli, CateAbbey Road 17, 94469 LangleyCC00000001
5311 Dupli, Cate Abbey Rd.17, 94469 LangleyAB11223344
6442Johnny WalkerKingstr. 13, 55117 OxfordED98-76-54-32
9876Johnnie WalkerKingstr. 13, 55117 OxfordED98765432
9877John Doe Forest Lane1, 77788 HobokenAB11223344
9992Charles LittleAbbey Rd.17, 94469 LangleyZZ55667799

Data pre­pa­ra­tion / clean­sing

Vendor
Name
Address
TaxID
TaxID clean
1001 Dupli, CateAbbey Road 17, 94469 LangleyCC00000001CC00000001
5311Dupli, Cate Abbey Rd.17, 94469 LangleyAB11223344AB11223344
6442Johnny WalkerKingstr. 13, 55117 OxfordED98-76-54-32ED98765432
9876 Johnnie WalkerKingstr. 13, 55117 Oxfordccc ED98765432 ED98765432
9877 John DoeForest Lane1, 77788 Hoboken AB11223344AB11223344
9992 Charles LittleAbbey Rd.17, 94469 Langley ZZ55667799 ZZ55667799

Pro­blems com­par­ing re­sults

Running analysis with different criteria it will be difficult to join or compare the results.

A search based on “Name” will find one duplicate (1001 / 5311), based on „Address“ two duplicates (5311 / 9992 und 6442 / 9876). Checking “TaxID” one duplicate is the result. Indirectly 9877 is related to 1001, although they do not share any direct criterion such as Name, Address, TaxID. 

 

Vendor No.
Name
Address
TaxID
1001Dupli, CateAbbey Road 17, 94469 LangleyCC00000001
5311Dupli, CateAbbey Rd.17, 94469 Langley AB11223344
6442Johnny Walker Kingstr. 13, 55117 OxfordED98-76-54-32
9876Johnnie WalkerKingstr. 13, 55117 OxfordED98765432
9877John DoeForest Lane1, 77788 Hoboken AB11223344
9992Charles LittleAbbey Rd.17, 94469 LangleyZZ55667799

 

In summary we can say that a trivial request to "identify duplicates" on closer inspection may represent a challenge. Important in the planning of such projects are therefore a stable concept with a clear roadmap in terms of methodology, as well as a plan for the subsequent correction of the duplicates in the system.


Comments (0)
Be the first who comments this blog entry.
Blog login

You are not logged in. Please log in to comment this blog entry.

go to Login