The Challenge of Identifying Information

By David Loshin

In my last post, I introduced the question of determining which characteristics are used to uniquely differentiate between any pair of records within a data set. The same question applies when attempting to match a pair of records that may represent the same entity. I like to call these characteristics “identifying attributes,” and the values they contain I call “identifying information.”

Let’s look at an example for customer data integration: what data element values
do I compare when trying to link two records together? Let’s start with the
obvious ones, namely (ha ha) first and last names. Of course, we all know that
there are certain names that are relatively common – just ask my friend John
Smith, with whom I worked at one of my earlier jobs.

But even if you have an uncommon name, you might be surprised. For example, if
you type in my name (“David Loshin”) at Google, you will find entries for me,
but you will also find entries for a dentist in Seattle and a professor.

Apparently, first and last names alone do not provide enough identifying
information to distinguish one person from another. Perhaps there is another
attribute we can use? You probably know that I have written some books, so maybe
authorship is an additional attribute to use. But if you go to Amazon and search
for “David Loshin,” you will find me, and it turns out the professor has also
written a book.

Even an uncommon name such as mine still returns multiple hits. Adding more
identifying information can reduce the number of hits, but a poorly selected set
of attributes may still not provide enough distinction. It may take a number of
iterations to review a proposed set of identifying attributes and assess their
completeness, density, and accuracy before settling on a core set of identifying
characteristics.
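
To make one such review iteration concrete, here is a minimal sketch in Python.
Everything in it is an assumption for illustration: the sample records are made
up, and the helper names `completeness` and `distinct_fraction` are
hypothetical. It measures only how often an attribute is populated and how many
records a combination of attributes tells apart; it does not attempt to measure
density or accuracy.

```python
from collections import Counter

# Hypothetical sample records; every value here is made up for illustration.
RECORDS = [
    {"first": "John", "last": "Smith", "city": "Boston", "birth_year": 1970},
    {"first": "John", "last": "Smith", "city": "Denver", "birth_year": 1970},
    {"first": "John", "last": "Smith", "city": "Boston", "birth_year": 1982},
    {"first": "Mary", "last": "Jones", "city": None, "birth_year": 1975},
]


def completeness(records, attribute):
    """Fraction of records with a non-missing value for the attribute."""
    filled = sum(1 for r in records if r.get(attribute) not in (None, ""))
    return filled / len(records)


def distinct_fraction(records, attributes):
    """Fraction of records whose combined values on `attributes` are unique."""
    keys = [tuple(r.get(a) for a in attributes) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)


if __name__ == "__main__":
    # Try progressively larger candidate sets of identifying attributes.
    candidates = [
        ["first", "last"],
        ["first", "last", "city"],
        ["first", "last", "city", "birth_year"],
    ]
    for attrs in candidates:
        print(attrs, distinct_fraction(RECORDS, attrs))
    print("city completeness:", completeness(RECORDS, "city"))
```

In this toy data, names alone single out only one of the four records, and each
added attribute separates a few more. A real review would run against the
actual data set and would also have to weigh how trustworthy each value is.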

One more thing to think about, though. Once you get to the point where you are
pretty confident that those attributes are enough for differentiation, there is
one last monkey wrench in the works: even if you had the perfect set of
identifying attributes, there is no guarantee that the values themselves are
exact matches!
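
As a rough illustration of why exact comparison falls short, the sketch below
scores two name values by similarity instead of testing them for equality. It
uses `difflib.SequenceMatcher` from the Python standard library. The
normalization rules, the helper names, and the 0.85 threshold are assumptions
chosen for the example, not recommended settings.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase, drop periods and commas, and collapse runs of whitespace."""
    cleaned = value.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 for two normalized values."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_probable_match(a, b, threshold=0.85):
    """Treat two values as matching when their similarity meets the threshold.

    The default threshold is an arbitrary assumption for this sketch.
    """
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    pairs = [
        ("David Loshin", "david  loshin"),  # case and spacing differ
        ("David Loshin", "Davd Loshin"),    # one letter dropped
        ("David Loshin", "John Smith"),     # different names
    ]
    for a, b in pairs:
        print(repr(a), repr(b), round(similarity(a, b), 3), is_probable_match(a, b))
```

The first two pairs would fail an exact comparison yet score high enough to be
flagged as likely matches, while the third does not. Where to set the threshold
is a judgment call that depends on the data and on the cost of a false match.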