What is Record Linkage?

By David Loshin

In my last entry, I talked about how many distributed pieces of data about a single individual can be combined to form a deep profile of that individual. But how are records from disparate data sets combined into insightful profiles?

The answer lies in the ability to collect the different pieces of data that
belong to a single individual and then glom them together. For example, let’s
presume the existence of a record in one data set that has a person’s address, a
record in another data set that has that person’s telephone number, a third
record that has that person’s registration number for a toaster, another with
the person’s car year, make, and model, etc.

As long as you can find all the records that are associated with each person and
connect them, you can collect all the interesting information and create a
single representative profile. That profile is suitable for list generation, and
it can also feed more comprehensive analytics such as segmentation, clustering
analysis, and classification.

These records are connected through a process called “record linkage.” This
process searches through one or more data sets looking for records that refer to
the same unique entity, based on identifying characteristics that distinguish
one entity from all others, such as names, addresses, or telephone numbers.
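
To make the idea concrete, here is a minimal sketch of the simplest form of record linkage: exact matching on identifying attributes. The sample data, the field names, and the link_records helper are all made up for illustration, and the sketch assumes every data set carries the same two identifying fields with identical values, which is rarely true in practice.

```python
from collections import defaultdict

# Hypothetical sample data: three data sets that describe the same people.
addresses = [
    {"name": "Jane Smith", "phone": "555-0100", "address": "12 Oak St"},
    {"name": "Raj Patel", "phone": "555-0199", "address": "9 Elm Ave"},
]
vehicles = [
    {"name": "Jane Smith", "phone": "555-0100", "car": "2009 Honda Civic"},
]
warranties = [
    {"name": "Raj Patel", "phone": "555-0199", "toaster_reg": "T-4471"},
]


def link_records(*data_sets, key_fields=("name", "phone")):
    """Merge records whose values match exactly on every key field."""
    profiles = defaultdict(dict)
    for data_set in data_sets:
        for record in data_set:
            # The tuple of identifying values acts as the linkage key.
            key = tuple(record[field] for field in key_fields)
            # Fold this record's attributes into the profile for that key.
            profiles[key].update(record)
    return dict(profiles)


profiles = link_records(addresses, vehicles, warranties)
jane = profiles[("Jane Smith", "555-0100")]
```

Here jane ends up holding both the address from the first data set and the car from the second, because the two records share the same name and phone number. Any difference in those values, even a stray space, would leave them unlinked.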

When two records are found to share the same pieces of identifying information,
you might assume that those records can be linked together. It sounds simple,
but unfortunately, there are a number of challenges with linking records across
more than one data set, such as:

· The records from the different data sets don’t share the same identifying
attributes (one might have phone number but the other one does not).

· The values in one data set use a different structure or format than the data
in another data set (such as using hyphens for social security numbers in one
data set but not in the other).

· The values in one data set are slightly different from the ones in the other
data set (such as using nicknames instead of given names).

· One data set has the values broken out into separate data elements while the
other does not (such as titles and name suffixes).

Luckily, there are numerous software products designed to address these
discrepancies, which can simplify the record linkage process. If you recall some
of my previous posts, you may begin to see how parsing and standardization start
to fit in. These tools parse and standardize the values before comparing them
for linkage, which alleviates some of the challenges I noted.
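
As a rough illustration of that parse-and-standardize step, the sketch below cleans up names and identifiers before comparing two records. It is not how any particular product works: the lookup tables are tiny hypothetical stand-ins, the function and field names are my own, and the identifier values are fake. It compares only the attributes both records actually have, which touches on the first challenge in the list above.

```python
import re

# Hypothetical lookup tables; real tools rely on far larger reference sets.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}
TITLES = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"jr", "sr", "ii", "iii"}


def standardize_name(raw):
    """Drop punctuation, titles, and suffixes; map nicknames to given names."""
    tokens = re.sub(r"[^\w\s]", "", raw).lower().split()
    tokens = [t for t in tokens if t not in TITLES and t not in SUFFIXES]
    return " ".join(NICKNAMES.get(t, t) for t in tokens)


def standardize_id(raw):
    """Keep only the digits, so hyphenated and plain forms compare equal."""
    return re.sub(r"\D", "", raw)


STANDARDIZERS = {"name": standardize_name, "ssn": standardize_id}


def same_entity(a, b):
    """Compare two records on the identifying fields that both contain."""
    shared = [f for f in STANDARDIZERS if a.get(f) and b.get(f)]
    if not shared:
        return False  # nothing to compare on, so no evidence of a link
    return all(STANDARDIZERS[f](a[f]) == STANDARDIZERS[f](b[f]) for f in shared)


rec_a = {"name": "Dr. Bob Jones Jr.", "ssn": "000-12-3456"}
rec_b = {"name": "Robert Jones", "ssn": "000123456"}
linked = same_entity(rec_a, rec_b)
```

The two records differ in every raw value, yet both standardize to the same name and the same nine digits, so same_entity treats them as a match. Requiring exact equality after standardization is a simplifying assumption of this sketch.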