Approximate Matching

By David Loshin

David is not actually my first name. It is my middle name, but it is the given name my parents used when talking to me. This has led to a lot of confusion over the years, especially when I am confronted with a form asking for my “first name” and my “last name.” For official forms (like my driver’s license) I use my real first name as my “first name,” but for non-official forms I often just use David. The result is inconsistency in my own representation in records across different data systems.

If we were to rely solely on an exact data element-to-data element match of
values to determine record duplication, the variation in use of my first or
middle name would prevent two records from linking. In turn, you can extrapolate
and see that any variations across systems of what should be the same values
will prevent an exact match, leading to inadvertent duplication.

Fortunately, we can again rely on data quality techniques. We have our stand-bys
of parsing and standardization, which can be enhanced through the use of
transformation rules to map abbreviations, acronyms, and common misspellings to
their standard representations – an example might be mapping “INC” and “INC.”
and “Inc” and “inc” and “inc.” and “incorp” and “incorp.” and “incorporated” all
to a standard form of “Inc.”
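
To make the idea of a transformation rule concrete, here is a minimal sketch in
Python. The lookup table, the function names, and the choice to compare tokens
after lowercasing and stripping a trailing period are all my own assumptions
for illustration, not a description of any particular data quality tool.

```python
# Hypothetical lookup table: each known variant (lowercased, with any
# trailing period removed) maps to the standard form from the example above.
STANDARD_FORMS = {
    "inc": "Inc.",
    "incorp": "Inc.",
    "incorporated": "Inc.",
}


def standardize_token(token: str) -> str:
    """Return the standard form of a token, or the token unchanged."""
    key = token.strip().rstrip(".").lower()
    return STANDARD_FORMS.get(key, token)


def standardize(text: str) -> str:
    """Standardize each whitespace-separated token in a string."""
    return " ".join(standardize_token(t) for t in text.split())
```

With this table, `standardize("Acme incorporated")` and `standardize("Acme INC")`
both come out as "Acme Inc.", so the two values can then be compared exactly.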

We can add to this another tool: approximate matching. This matching technique
compares two values and produces a numeric score that indicates the degree to
which the values are similar. An example might compare my last name “Loshin”
with the word “lotion” and suggest that while the two values are not strict
alphabetic matches, they do match phonetically.
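
As a rough illustration of what a numeric similarity score looks like, the
sketch below uses `SequenceMatcher` from Python's standard `difflib` module.
Note the assumption: this scores shared runs of characters, not pronunciation,
so it is a stand-in for the idea of a score rather than a phonetic comparison.
The `similarity` function name is my own.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Score two strings from 0.0 (nothing in common) to 1.0 (identical).

    Case is ignored so that "Loshin" and "LOSHIN" count as the same value.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


score = similarity("Loshin", "lotion")   # similar, but not identical
exact = similarity("Loshin", "LOSHIN")   # identical once case is ignored
```

Here `exact` is 1.0, while `score` falls somewhere between 0 and 1, which is
the kind of graded result that an exact comparison cannot give us.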

There are a number of techniques used for approximate matching of values, such
as comparing the sets of characters, counting the number of transposed,
inserted, or omitted letters, and applying different kinds of forward and
backward phonetic scoring, as well as other more complex algorithms.
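
Two of these techniques are simple enough to sketch. The first function below
counts transposed, inserted, omitted, and substituted letters using the
optimal string alignment variant of Damerau-Levenshtein distance; the second
compares the sets of characters using Jaccard overlap. Both are my choice of
well-known algorithms to illustrate the categories above, and the function
names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Count the single-letter insertions, omissions, substitutions, and
    adjacent transpositions needed to turn a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # remove all i letters of a
    for j in range(cols):
        d[0][j] = j  # insert all j letters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # letter of a omitted
                d[i][j - 1] + 1,         # letter of b inserted
                d[i - 1][j - 1] + cost,  # letters match or are substituted
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # two adjacent letters swapped
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def char_set_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the two strings' character sets (0.0 to 1.0)."""
    sa, sb = set(a.lower()), set(b.lower())
    if not sa and not sb:
        return 1.0  # two empty strings are treated as identical
    return len(sa & sb) / len(sa | sb)
```

For instance, `edit_distance("loshin", "lohsin")` counts the swapped "sh" as a
single transposition, whereas "loshin" and "lotion" are three edits apart.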

In turn, we can apply this approximate matching to the entire set of
corresponding identifying attributes and weight each score based on the
differentiation factor associated with each attribute. For example, a
combination of first name and last name might provide greater differentiation
than a birth date, since there is a relatively limited number of dates on which
an individual can be born (maximum 366 per year).
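
A weighted comparison of two records might be sketched as follows. The
attribute names, the weights, and the sample records are all made up for
illustration; in practice the weights would have to be derived from how well
each attribute actually differentiates individuals in your data. The
per-attribute score again uses `difflib.SequenceMatcher` as a stand-in for
whichever approximate matching technique is chosen.

```python
from difflib import SequenceMatcher

# Hypothetical weights: the two name attributes together count for more
# than the birth date. These numbers are illustrative only.
WEIGHTS = {"first_name": 0.3, "last_name": 0.4, "birth_date": 0.3}


def similarity(a: str, b: str) -> float:
    """Score two strings from 0.0 to 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_score(rec_a: dict, rec_b: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of the per-attribute similarity scores."""
    total = sum(weights.values())
    weighted = sum(
        w * similarity(rec_a[field], rec_b[field])
        for field, w in weights.items()
    )
    return weighted / total


# Made-up records for demonstration purposes.
rec_1 = {"first_name": "Jon", "last_name": "Smith", "birth_date": "1980-02-29"}
rec_2 = {"first_name": "John", "last_name": "Smyth", "birth_date": "1980-02-29"}
rec_3 = {"first_name": "Mary", "last_name": "Jones", "birth_date": "1975-07-04"}

likely_match = record_score(rec_1, rec_2)
unlikely_match = record_score(rec_1, rec_3)
```

The first pair scores much higher than the second even though no name value
matches exactly, which is the behavior we want from a weighted approximate
match.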

By applying a weighted approximate match to pairs of records, we can work
around the variations in data element values that would otherwise prevent
direct matching from working. More on this topic in future posts.