Author: Moritz Lang

Data quality management with AI

Identify duplicated and incorrect master data using R and state-of-the-art AI-based methods

There are many reasons why customer and vendor master data are incorrect or contain duplicates: system migrations, dummy and test entries, typos, or failed batch input runs. The resulting poor data quality can have several negative effects:

  • Additional time spent comparing customers that have been created twice
  • Process errors due to invoices posted twice to the same vendor created in duplicate
  • Bypassed release limits due to duplicated vendors
  • Increased costs for storage space due to incorrectly migrated master records
  • Various fraud scenarios

To ensure high data quality, we have developed AI-based software in R that checks your vendor and customer master data and identifies duplicated or incorrect master records. In the following, our software "dab:DQM" is benchmarked against the conventional methodology:

Searching for fuzzy duplicates in master data

Conventional Method

Most often, the Levenshtein distance or a similar edit distance is used to search for fuzzy duplicates. It counts the minimum number of single-character edits (insertions, deletions, or substitutions) needed to convert one address into the other.


The two addresses 

"Hans-ObserStreet 12"
"Hans-Obser Street 12"

have a Levenshtein distance of 1, because the second address is obtained from the first by inserting a space between "Obser" and "Street". So exactly one character changes.
Usually, all pairs of addresses with a small Levenshtein distance are interpreted as fuzzy duplicates; a distance of at most 3 is a common default threshold.
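The distance itself takes only a few lines of standard dynamic programming. The sketch below is a generic textbook implementation for illustration — it is written in Python and is not the dab:DQM code, which is implemented in R:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # One-row dynamic programming over the edit-distance matrix.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Hans-ObserStreet 12", "Hans-Obser Street 12"))  # 1
```

A pair of addresses would then be reported as a fuzzy duplicate whenever this function returns a value below the chosen threshold.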

Drawback of this approach:

The two strings: 

"dab:GmbH, Hans-Obser-Street 12, 94469 Deggendorf"
"dab:GmbH, 94469 Deggendorf, Hans-Obser Street 12"

describe the same address, but have a Levenshtein distance of 36. This pair is therefore missed under the condition that the distance must be at most 3. If the threshold is raised, however, the majority of the reported pairs are no longer fuzzy duplicates.
For example, the two addresses: 

"Hans-Obser-Street 12, 94469 Deggendorf"
"Marienplatz 1, 80331 Munich"

have a Levenshtein distance of 30.
This leads to a dilemma: if the threshold is too small, too many genuine duplicates are missed; if it is too large, the result contains too many pairs that are not duplicates at all.


AI-based Method

Based on Natural Language Processing (NLP), i.e. techniques for processing natural language, we have developed software in R that analyzes addresses with a deeper linguistic understanding and is thereby able to find fuzzy duplicates. Our software does not rely on syntactic comparisons such as the Levenshtein distance, but on a semantic understanding of language. Therefore, the two addresses:

"Hans-Obser-Street 12, 94469 Deggendorf"
"94469 Deggendorf, Hans-Obser-Street 12"

are recognized as a fuzzy duplicate, despite their large Levenshtein distance.
The example based on the two strings

"Hans-Obser-Street 12, 94469 Deggendorf"
"Marienplatz 1, 80331 Munich"

is correctly not interpreted as a duplicate.
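dab:DQM's semantic model is not reproduced here, but a minimal sketch shows why moving beyond character-level comparison helps. The Python example below — purely illustrative, with assumed function names, not the product code — compares addresses as unordered sets of tokens, so reordering the address components no longer affects the result:

```python
import re

def address_tokens(address: str) -> set:
    """Lowercase an address and split it into word/number tokens,
    discarding punctuation and token order."""
    return set(re.findall(r"\w+", address.lower()))

def token_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets (1.0 = identical tokens)."""
    ta, tb = address_tokens(a), address_tokens(b)
    return len(ta & tb) / len(ta | tb)

print(token_similarity("dab:GmbH, Hans-Obser-Street 12, 94469 Deggendorf",
                       "dab:GmbH, 94469 Deggendorf, Hans-Obser Street 12"))  # 1.0
```

The reordered pair scores 1.0, while the Deggendorf/Munich pair shares no tokens and scores 0.0 — the dilemma of a single edit-distance threshold disappears in this toy setting.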


Search for abnormal master records

Conventional Method 

When dummy entries or general data errors are suspected in master data, the conventional approach is to search for master records containing strings such as "XXX", "???" or "dummy".


The query 

Show all records where "Name" contains an entry equal to "XXX".

could identify the following master record:

Analysis of master data from SAP
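Expressed as code, such a static query is nothing more than a substring filter. The Python sketch below uses made-up records and assumed field names, not an actual SAP schema:

```python
# Illustrative master records; the field names are assumptions.
records = [
    {"id": "100001", "name": "XXX",      "city": "XXX"},
    {"id": "100002", "name": "dab:GmbH", "city": "Deggendorf"},
]

# The static query: show all records where "name" contains "XXX".
hits = [r for r in records if "XXX" in r["name"]]
print([r["id"] for r in hits])  # ['100001']
```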


Drawback of this approach:

It is always possible that other master records are filled with entries such as "XXXX", "Dummy", "test", "asdf", etc. In theory, there is an endless number of strings resulting from dummy and test entries as well as data errors, so one would need an endless list of queries to find all of these master records. Another shortcoming: the queries are usually limited to a single language.

AI-based Method

The following table contains a dummy entry, a test entry, and a master record of the kind that typically occurs during system migrations due to formatting errors. "dab:DQM" recognizes these master records through its semantic understanding of natural language, not through static queries such as:

Does column "Name" contain an entry equal to "XXX"?

So there is no longer any need to maintain endless lists of queries.


Analysis of master data from SAP
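Even a simple heuristic already goes a step beyond fixed equality queries. The Python sketch below — an illustration with an assumed placeholder list and threshold, not dab:DQM's actual model — combines a (necessarily incomplete) placeholder list with a character-diversity check, though it still cannot match genuine semantic understanding:

```python
PLACEHOLDERS = {"xxx", "xxxx", "dummy", "test", "asdf"}  # necessarily incomplete

def looks_like_dummy(name: str) -> bool:
    """Flag names that are known placeholders or have very low
    character diversity (e.g. 'XXXXX' repeats a single character)."""
    cleaned = name.strip().lower()
    if not cleaned or cleaned in PLACEHOLDERS:
        return True
    # Ratio of distinct characters to length; repeated-character
    # strings score very low. The 0.3 cutoff is an assumption.
    return len(set(cleaned)) / len(cleaned) < 0.3

print([n for n in ["XXXX", "Dummy", "dab:GmbH", "Siemens AG"]
       if looks_like_dummy(n)])
# ['XXXX', 'Dummy']
```

The list of placeholders can never be complete, which is exactly the limitation a semantic model avoids.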

If you would like to learn more about the dab:DQM solution approach with artificial intelligence and R, please feel free to contact us at any time. We look forward to answering your questions.

Contact us
