By David Loshin

How can you tell if two records refer to the same person (or company, or other type of organization)? In our recent posts, we have looked at how data quality techniques such as parsing and standardization help in normalizing the data values within different records so that the records can be compared. But what is being compared? That is the topic of this next set of entries.

A simplistic view might suggest that when looking at two records, comparing
the corresponding values is the best way to start. For example, we might compare
the corresponding names, telephone numbers, and street addresses, the kinds of
values that usually appear in records representing customers, residences,
patients, etc.
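
As a minimal sketch of that simplistic view, the Python snippet below compares
two records field by field. The field names and sample values are hypothetical,
chosen only for illustration, and the light `normalize` helper is merely a
stand-in for the parsing and standardization steps discussed in earlier posts.

```python
def normalize(value):
    """Lowercase and collapse whitespace; a stand-in for real standardization."""
    return " ".join(str(value).lower().split())


def compare_fields(rec_a, rec_b, fields):
    """Return {field: True/False} for exact matches after normalization."""
    return {
        f: normalize(rec_a.get(f, "")) == normalize(rec_b.get(f, ""))
        for f in fields
    }


# Hypothetical sample records (not taken from any real data set).
record_a = {"name": "John  Smith", "phone": "555-0100", "street": "12 Main St"}
record_b = {"name": "john smith", "phone": "555-0100", "street": "12 Main Street"}

matches = compare_fields(record_a, record_b, ["name", "phone", "street"])
```

In this example the name and phone agree while the street does not, even though
a person would read both addresses as the same place. Exact value-by-value
comparison is a starting point, not an answer.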

But the simple concept belies a much more complex question about the attributes
used to describe an individual as well as to differentiate pairs of individuals.
Much of this issue revolves around how we determine which characteristics are
managed within a representative record, the motivation for including those
characteristics, and, importantly, whether those data elements are used solely
for “attribution” (additional description of the entity involved) or for
“distinction” (to help in unique identification).
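
To make the attribution/distinction split concrete, here is a small Python
sketch that scores agreement separately over each group of data elements. Which
elements belong in which group is an assumption made purely for illustration; in
practice that classification is exactly the design question raised above.

```python
# Hypothetical classification of data elements; real choices depend on the domain.
DISTINCTION = {"name", "birth_date", "phone"}        # help identify the entity
ATTRIBUTION = {"preferred_channel", "loyalty_tier"}  # merely describe it


def agreement_on(rec_a, rec_b, fields):
    """Fraction of the given fields, present in both records, whose values agree."""
    shared = [f for f in sorted(fields) if f in rec_a and f in rec_b]
    if not shared:
        return None  # nothing to compare
    agree = sum(rec_a[f] == rec_b[f] for f in shared)
    return agree / len(shared)


# Hypothetical sample records.
rec_a = {"name": "Ann Lee", "birth_date": "1980-02-14", "phone": "555-0100",
         "preferred_channel": "email", "loyalty_tier": "gold"}
rec_b = {"name": "Ann Lee", "birth_date": "1980-02-14", "phone": "555-0100",
         "preferred_channel": "sms", "loyalty_tier": "silver"}

identity_score = agreement_on(rec_a, rec_b, DISTINCTION)
descriptive_score = agreement_on(rec_a, rec_b, ATTRIBUTION)
```

Here the two records agree on every distinguishing element and on none of the
descriptive ones, which suggests the same person seen in two different contexts
rather than two different people.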

More to the point: what are the core data elements necessary for determining the
uniqueness of a record? We often take for granted that our relational models
presume one and only one record per entity, and that there might be business
impacts should more than one entry exist for the same individual.

Yet individual “entities” may exist in multiple data sets, even in different
contexts. Some characteristics are part and parcel of each entity, while others
describe the entity playing a particular role. Our upcoming posts are intended
to consider some of these issues when assessing similarity for record linkage
and matching.

