By David Loshin
Let’s revisit our example from my last post by adding in a new record for evaluation:
John Hansen, 1824 Polk Ave., Memphis TN 38177
Emily S. Hansen, 1824 Polk Ave., Memphis, TN 38177
Emily Stoddard, 1824 Polk Avenue, Memphis, TN
We had already decided that John and Emily shared a household, but all of a
sudden we have a third record with a name that shares some similarity with one
of the existing names, and an almost exact street address match (note that the
third record is missing a ZIP code).
We could speculate that “Emily Stoddard” changed her name after she got married
to “John Hansen,” or that she changed an address somewhere as she moved from her
bachelorette pad to their newlywed home. But without exact knowledge of the
facts, it is only speculation, and one must exercise some care when relying on
speculation for business decisions.
If a few small differences pose a challenge to linkage, what would you think of
dozens, or even hundreds, of variations for names, locations, or other data?
Just as a case in point: in a hallway conversation at the recent Data Governance
Conference, a colleague mentioned that one of his customers’ databases had over
one hundred variations for a certain big-box retailer’s name! The conclusion you
can draw from this is that a key part of the record linkage process involves
some traditional data quality tactics, namely appending a standardized version
of the data to help your linkage algorithms score record similarity as a prelude
to establishing connectivity.
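
To make the idea of appending a standardized version concrete, here is a minimal sketch in Python using the three example records. It is only an illustration under stated assumptions: the abbreviation table, the field name address_std, and the choice of difflib's SequenceMatcher as the similarity score are all hypothetical stand-ins, not a description of any particular linkage product.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny abbreviation table (an assumption for this
# sketch); a production standardizer would rely on a much larger reference set.
STREET_SUFFIXES = {"ave": "avenue", "rd": "road", "blvd": "boulevard"}


def standardize_address(raw: str) -> str:
    """Lowercase, drop punctuation, and expand known street-suffix abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    return " ".join(STREET_SUFFIXES.get(token, token) for token in tokens)


def similarity(a: str, b: str) -> float:
    """Return a score between 0 and 1; SequenceMatcher is one of many possible measures."""
    return SequenceMatcher(None, a, b).ratio()


records = [
    {"name": "John Hansen", "address": "1824 Polk Ave., Memphis TN 38177"},
    {"name": "Emily S. Hansen", "address": "1824 Polk Ave., Memphis, TN 38177"},
    {"name": "Emily Stoddard", "address": "1824 Polk Avenue, Memphis, TN"},
]

# Append the standardized form alongside the original instead of overwriting it,
# so the source value is still available for review.
for record in records:
    record["address_std"] = standardize_address(record["address"])

# Compare the first and third addresses before and after standardization.
raw_score = similarity(records[0]["address"], records[2]["address"])
std_score = similarity(records[0]["address_std"], records[2]["address_std"])
print(f"raw: {raw_score:.2f}  standardized: {std_score:.2f}")
```

In this sketch the first two addresses become identical once standardized, and the third scores closer to them than it does in raw form, although the missing ZIP code still keeps it from matching exactly. The standardized value only helps the scoring step; it does not settle whether Emily Stoddard and Emily S. Hansen are the same person.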