Data quality management with AI

Identify duplicated and incorrect master data using R and state-of-the-art AI-based methods

There are a variety of reasons why customer and vendor master data are incorrect or contains duplicates: System migrations, dummy and test entries, typos, or failed batch input runs. The resulting poor data quality can have several negative effects:

Additional time spent comparing customers that have been created twice
Process errors due to double-posted invoices to the identical but duplicated vendor
Bypassed release limits due to duplicated vendors
Increased costs for storage space due to incorrectly migrated master records
Various fraud scenarios

To ensure high data quality, we have developed an AI-based software in R that checks your vendor and customer master data and identifies duplicated or incorrect master records. In the following, our software "dab:DQM" is compared with the conventional methodology as a benchmark:

Searching fuzzy duplicates in master data

Conventional Method

Most often Levenshtein distance or a similar distance is used for searching fuzzy duplicates. It counts the number of changes for two addresses to convert one address to the other.

Example:

The two addresses

"Hans-Obser–Street 12"
and
"Hans-Obser Street 12"

have a Levenshtein distance equal to 1, because the second address is obtained from the first address by replacing "-" with " ". So exactly one character is changed.
Usually all pairs of addresses are interpreted as fuzzy duplicates, which have a small Levenshtein distance. Often a distance less than or equal to 3 is used as a default.

Drawback of this approach:

The two strings:

"dab:GmbH, Hans-Obser-Street 12, 94469 Deggendorf"
and
"dab:GmbH, 94469 Deggendorf, Hans-Obser Street 12".

describe an identical address, but have a Levenshtein distance of 36. This pair is not found with the condition that the distance should be less than or equal to 3. If one increases the mentioned threshold, the majority of the determined pairs are no longer fuzzy duplicates.
For example, the two addresses:

"Hans-Obser-Street 12, 94469 Deggendorf"
and
"Marienplatz 1, 80331 Munich"

have a Levenshtein distance equal to 30.
So the following dilemma arises: If one chooses a too small threshold, too many duplicates will be excluded. If one chooses a too large limit, the result contains too many records which are not duplicates.

dab:DQM

Based on Natural Language Processing, i.e. techniques for processing natural language, we have programmed a software in R that analyzes addresses through deeper linguistic understanding and is in this way able to find fuzzy duplicates. Thus, our software is not based on syntactic comparisons, such as Levenshtein distance, but on semantic understanding of language. Therefore, the two addresses:

"Hans-Obser-Street 12, 94469 Deggendorf"
and
"94469 Deggendorf, Hans-Obser-Street 12".

are recognized as fuzzy duplicate.
The example based on the two strings

"Hans-Obser-Street 12, 94469 Deggendorf"
and
"Marienplatz 1, 80331 Munich"

is correctly omitted from being interpreted as a duplicate.

Search for abnormal master records

Conventional Method

When suspecting dummy entries or general data errors in master data, the conventional approach is to search for master records that contain "XXX", "???" or "dummy".

Example:

The query

Show all records where "Name" contains an entry equal to "XXX".

could identify the following master record:

dab:DQM

The following table contains a dummy and a test entry, as well as a master record, which occurs typically during system migrations due to formatting errors. These master records are recognized by "dab:DQM" based on its semantic understanding of natural language and not by static queries like:

Does column "Name" contain an entry equal to "XXX".

So there is no need to maintain endless lists of queries anymore.

Drawback of this approach:

It is always possible that other master records are filled with entries such as: "XXXX", "Dummy", "test", "asdf", etc. Theoretically, there is an endless number of strings resulting from dummy and test entries, as well as data errors. To determine all these master records, one would need an endless list of queries. Another shortcoming: The queries are usually based on only one language.

If you would like to learn more about the dab:DQM solution approach with artificial intelligence and R, please feel free to contact us at any time. We look forward to answering your questions.

Contact us

Provider:	TYPO3
Cookiename:	be_, fe_
Runtime:	Current Session

Provider:	Google Inc.
Cookiename:	_ga, _gat, _gid
Runtime:	365
Privacy source url:	https://support.google.com/tagmanager/answer/9323295?hl=De
Host:	google.com

Provider:	Google Inc.
Cookiename:	Google Maps
Runtime:	365
Privacy source url:	https://policies.google.com/privacy#infocollect
Host:	google.com

Provider:	Youtube.com
Cookiename:	CONSENT
Runtime:	30
Privacy source url:	https://policies.google.com/privacy
Host:	Google.com

Data quality management with AI

We are using cookies.

Privacy settings

Required Cookies

Cookies for external Content