Portal

Identifying master data duplicates – Simple, isn’t it?

Posted

January 6, 2014

3 min read

Tags

Data Analytics

SAP

Identifying master data duplicates – Simple, isn’t it?

Today we deal with a simple request as identifying master data duplicates. Many of us know the problem with duplicates: actually a customer or vendor should exist only once in a system, in fact there are multiple records, as duplicates were created from data migrations or merging systems. Therefore “Dupli, Cate” can exist more than once.

Vendor No. Name Address TaxID
1001 Dupli, Cate Abbey Road 17, 94469 Langley CC00000001
5311 Dupli, Cate Abbey Rd.17, 94469 Langley AB11223344
6442 Johnny Walker Kingstr. 13, 55117 Oxford ED98-76-54-32
9876 Johnnie Walker Kingstr. 13, 55117 Oxford ED98765432
9877 John Doe Forest Lane1, 77788 Hoboken AB11223344
9992 Charles Little Abbey Rd.17, 94469 Langley ZZ55667799

 

Different types of spelling can be a reason: „Johnny Walker“ already exists in the system, the user is searching for „Johnnie Walker“, but will not get a match and consequently enters a second master data record.

Complex problems can result from that: the wrong vendor gets paid (or a reminder is posted to the wrong vendor, which may even lead to a duplicate payment). More errors occur in the processes, or slow down actions.

How can such duplicates be identified? We want to show some aspects of the approach:

  1. Choice of suitable search criteria
  2. Selective vs. blurred search
  3. Data preparation / cleansing
  4. Problems comparing results

Suitable search criteria

Obviously it would have no effect to compare names, as sign comparing search would find „Dupli, Cate“, but not „Johnny Walker“ , as it is different from “Johnnie Walker“.

Vendor No.
Name
Address
TaxID
1001 Dupli, Cate Abbey Road 17, 94469 Langley CC00000001
5311 Dupli, Cate Abbey Rd.17, 94469 Langley AB11223344
6442 Johnny Walker Kingstr. 13, 55117 Oxford ED98-76-54-32
9876 Johnnie Walker Kingstr. 13, 55117 Oxford ED98765432
9877 John Doe Forest Lane1, 77788 Hoboken AB11223344
9992 Charles Little Abbey Rd.17, 94469 Langley ZZ55667799

 

Here we need different approach, e.g. duplicate search on address, tax or phone numbers or banking details. It is important to use search combinations creatively. The more unique search criteria are used, the less „false positives“ (very low potential duplicates) you get.

Selective vs. blurred search

Sometimes it is necessary to do a blurred search in order to find master data, which exists multiple times, that cannot be identified by conventional analysis, e.g. on sound or by looking at the Levenshtein distance. Modern analysis software provides these search features and therefore is able to find special types of duplicates. More matches however can contain more „false positives“, which means more effort to be looked at in detail.

Vendor No. Name Address TaxID
1001 Dupli, Cate Abbey Road 17, 94469 Langley CC00000001
5311 Dupli, Cate Abbey Rd.17, 94469 Langley AB11223344
6442 Johnny Walker Kingstr. 13, 55117 Oxford ED98-76-54-32
9876 Johnnie Walker Kingstr. 13, 55117 Oxford ED98765432
9877 John Doe Forest Lane1, 77788 Hoboken AB11223344
9992 Charles Little Abbey Rd.17, 94469 Langley ZZ55667799

Data preparation / cleansing

Vendor
Name
Address
TaxID
TaxID clean
1001 Dupli, Cate Abbey Road 17, 94469 Langley CC00000001 CC00000001
5311 Dupli, Cate Abbey Rd.17, 94469 Langley AB11223344 AB11223344
6442 Johnny Walker Kingstr. 13, 55117 Oxford ED98-76-54-32 ED98765432
9876 Johnnie Walker Kingstr. 13, 55117 Oxfordccc ED98765432 ED98765432
9877 John Doe Forest Lane1, 77788 Hoboken AB11223344 AB11223344
9992 Charles Little Abbey Rd.17, 94469 Langley ZZ55667799 ZZ55667799

Problems comparing results

Running analysis with different criteria it will be difficult to join or compare the results.

A search based on “Name” will find one duplicate (1001 / 5311), based on „Address“ two duplicates (5311 / 9992 und 6442 / 9876). Checking “TaxID” one duplicate is the result. Indirectly 9877 is related to 1001, although they do not share any direct criterion such as Name, Address, TaxID. 

Vendor No.
Name
Address
TaxID
1001 Dupli, Cate Abbey Road 17, 94469 Langley CC00000001
5311 Dupli, Cate Abbey Rd.17, 94469 Langley AB11223344
6442 Johnny Walker Kingstr. 13, 55117 Oxford ED98-76-54-32
9876 Johnnie Walker Kingstr. 13, 55117 Oxford ED98765432
9877 John Doe Forest Lane1, 77788 Hoboken AB11223344
9992 Charles Little Abbey Rd.17, 94469 Langley ZZ55667799

 

In summary we can say that a trivial request to "identify duplicates" on closer inspection may represent a challenge. Important in the planning of such projects are therefore a stable concept with a clear roadmap in terms of methodology, as well as a plan for the subsequent correction of the duplicates in the system.

Content Listing

Share This Article

To stay up to date with all the news, subscribe to our newsletter!

Newsletter Subscription