According to a study by The Information Difference, one third of its respondents rate their data quality as poor, and only 4 percent as excellent. What’s jaw-dropping – 63 percent have no idea what poor data quality may be costing them.
One major root cause of poor data quality is duplicate records. Duplicate records – at any level or amount – create conflicting information which prevents organizations from gaining a single, accurate view of a customer. Without clear, duplicate-free information, organizations run the risk of making poor business decisions, wasting resources on additional manual processes, and initiating poorly performing customer-response driven campaigns.
The Problem of Finding Non-Exact Matches
Deduplication of data through the use of merge/purge solutions is one of the critical components in the data cleansing process. But the biggest roadblock to identifying duplicates lies in detecting non-exact matching duplicate records – data that appears to be multiple sets of unique information, but are actually duplicate records. Most data contains these “non-exact matching” duplicate records, which are very difficult and tricky to identify.
For example, a ‘Beth Smith’ at ‘United Data Machines’ can be recorded in the same or different database as ‘Smithe, Elizabeth’ at ‘UDM’. In reality, Beth Smith and Elizabeth Smithe are the same person, but your organization might identify the data as two different contacts, with two different order histories, two different buying patterns, etc.
To identify these hard-to-spot, non-exact matching duplicates, the most advanced merge/purge applications utilize fuzzy matching algorithms.
Fuzzy matching is a mathematical process that determines the similarities between data sets, information and facts – where the outcome is neither true nor false, nor 100 percent certain – hence the word, “fuzzy.” The process compares any data type of any length and from any place in a field to find non-exact matches.
Types of Fuzzy Matching Algorithms
There are many different algorithms that can be implemented as part of the deduplication process.
Exact Matching – Determines whether two strings are identical. Exact Matching is useful for cases when certain strings must match, like for example, matching social security numbers.
PhonetEx – Detects “alike-sounding” relationships between words. Also known as “phonetic.”
N-gram or Q-gram-based – The linear n-gram or q-gram-based algorithm models are primarily used in statistical, natural language processing. An n-gram is a subsequence of n items from a given sequence – which can be phonemes, syllables, letters, words or base pairs, as defined by Wikipedia.
Jaro – Gathers common characters (in order) between the two strings, then counts transpositions between the two common strings. The Jaro distance measure was developed for name comparison in the U.S. Census. It is designed to compare surnames to surnames and first names to first names, but not whole names to whole names. The Jaro distance measure is also good for short strings.
Dice’s Coefficient – A variation of the N-gram algorithm. Dice’s Coefficient counts matching n-grams but discards duplicate n-grams.
Jaccard Similarity – A variation of the N-gram algorithm. The Jaccard Similarity is identical to the N-gram algorithm but uses a different formula for similiarity computation: Intersection/union.
Overlap Coefficient – Another variation of the N-gram algorithm. The Overlap Coefficient is identical to the N-gram algorithm but uses a different formula for similarity computation: Intersection/minNumNGrams.
Levenshtein – The Levenshtein algorithm computes for the similarity of two strings by taking into account the amount of character mistakes – which are based off the number of incorrect characters, inserted characters and deleted characters. The Levenshtein algorithm specializes in keyboard typing errors. It takes into consideration mis-typed letters (pressing A instead of S), missing letters, and accidentally inserted letters. This algorithm is useful for industries such as call centers where possible typing errors are frequent. Levenshtein similarity is computed using the following formula: (maxLength – mistakes)/ maxLength.
Needleman-Wunsch – A variation of the Levenshtein algorithm. Levenshtein and Needleman-Wunsch are identical except that character mistakes are given different weights depending on how far two characters are on a standard keyboard layout. For example: A to S is given a mistake weight of 0.4, while A to D is a 0.6 and A to P is a 1.0.
Smith-Waterman-Gotoh – A variation of the Needleman-Wunsch algorithm. Needleman-Wunsch and Smith-Waterman-Gotoh are identical except that character deletions are given a different weight. This effectively adds the “understanding” that the keyboarder may have tried to abbreviate one of the words.
MDKeyboard – A variation of the Smith-Waterman-Gotoh algorithm. Smith-Waterman-Gotoh and MDKeyboard are identical except that character transpositions are given a different weight. This effectively adds the “understanding” that the keyboarder may have typed in one character before another.
Longest Common Substring – The Longest Common Substring (LCS) algorithm counts for the longest common set of adjacent characters between two strings. LCS similiarity is computed using the following formula: LCSLength/ maxLength.
Containment – The Containment algorithm will return 100 percent if one string is a subset of another. A 0 percent is returned otherwise.
Ex: JOHNSON JHNSN 0%
JOHNSON JOHN 100%
Frequency – The Frequency algorithm will match the characters of one string to the characters of another without any regard to the sequence. For example, “abcdef” would be considered a 100% match “badcfe”.
Soundex – SoundEx is a string transformation and comparison based algorithm. For example, JOHNSON would be transformed to “J525” while JHNSN would be transformed to “J525” which would then be considered a SoundExing match after evaluation. If the original strings are identical, SoundEx will return 100 percent. If the SoundEx’d strings are equal, returns 99 percent. Else SoundEx will return 0%.
Double Metaphone – A variation of the Soundex algorithm. Double Metaphone performs two different Soundex-style transformations. It creates two Soundex-like strings (primary and alternate) for both strings.