Customer Centricity and Connections: Establishing the Link

Blog Administrator | Customer Centricity, Data Management, Data Quality, Record Linkage | , ,

By David Loshin

In my last post, we began to look at the value proposition for grouping individual customers into logical groupings. We began by looking at a grouping that generally appears naturally, namely the traditional residential household.

We talked about householding in a previous blog posting, but it is worth
reviewing the basic approaches used for determining that a group of individuals
share a household. The general approach is to analyze a collection of data
records and examine sets of identifying attributes for degrees of similarity in
naming and residence locations. Many situations are relatively straightforward,
such as this example:

John Hansen, 1824 Polk Ave., Memphis TN 38177
Emily S. Hansen, 1824 Polk Ave., Memphis, TN 38177

In this example, two individuals share both a last name and a location address,
and although the data evidence does not guarantee truth of the inference, it
might be reasonable to suggest that because there is a link between the family
name and the residence location, these two individuals are members of the same
household. The algorithm, then, is to link records into a collection of similar
records based on similarity of the surname and residence characteristics.

However, the concept of grouping is not limited to conventional groups, since
there are many artificial groups formed as a result of shared interests or
similarities in profile criteria. For example, people interested in certain
sports car models often organize “fan clubs,” new mothers often organize toddler
play groups, and sports team fans are often rabid about their franchise

In turn, your company might want to create marketing campaigns that target sets
of individuals grouped together by demographic or psychographic attributes. In
these cases, you would adjust your algorithms to link records based on
similarity of the values in other sets of data attributes.

Establishing the link goes beyond looking at the data that already exists in
your data set. Rather, you may need to append additional data acquired from
alternate sources.

And, interestingly enough, you will need to connect the acquired data to your
existing data, and that requires yet another record linkage effort. Apparently,
understanding customer collectives is pretty dependent on record linkage. And
while linking records is straightforward when all the data values line up
nicely, as you might suspect, there are some curious intricacies of linkage in
the presence of data with questionable quality.

Normalizing Structure Using Data Standardization for Improved Matching

Blog Administrator | Address Quality, Address Standardization, Analyzing Data, Data Matching, Data Quality, Record Linkage | , , , , , ,

By David Loshin

In my last few posts, I discussed how structural
differences impact the ability to search and match records across different
data sets. Fortunately, most data quality tool suites use integrated parsing
and standardization algorithms to map structures together.

As long as there is some standard representation, we should be able to come
up with a set of rules that can help to rearrange the words in a data value
to match that standard.

As an example, we can look at person names (for simplicity, let’s focus on
name formats common to the United States). The general convention is that
people have three names – a first name, a middle name, and a surname. Yet
even limiting our scope to just these components (that is, we are ignoring
titles, generationals, and other prefixes and suffixes), there is a wide
range of variance for representing the name. Here are some examples, using
my own name:

• Howard David Loshin
• Howard D Loshin
• Howard D. Loshin
• David Loshin
• Howard Loshin
• H David Loshin
• H. David Loshin
• H D Loshin
• H. D. Loshin
• Loshin, Howard D
• Loshin, Howard D.
• Loshin, H David
• Loshin, H. David
• Loshin, H D
• Loshin, H. D.

There are different versions depending on whether you use abbreviations or full
names, punctuation, and the order of the terms. A good parsing engine can be
configured with the different patterns and will be able to identify each piece
of a name string.

The next piece is standardization: taking the pieces and rearranging them into a
desired order. The example might be taking a string of the form “last_name,
first_name, initial” and transforming that into the form “first_name, initial,
last_name” as a standardized or normalized representation. Using a normalized
representation will simplify the comparison process for data matching and record

Entities and their Characteristics

Blog Administrator | Analyzing Data Quality, Data Quality, Fuzzy Matching, Record Linkage | , , , , ,

By David Loshin

How can you tell if two records refer to the same person (or company, or other type of organization)? In our recent posts, we have looked at how data quality techniques such as parsing and standardization help in normalizing the data values within different records so that the records can be compared. But what is being compared? That is the topic of this next set of entries.

A simplistic view might suggest that when looking at two records, comparing
the corresponding values is the best way to start. For example, we might compare
the corresponding names, telephone numbers, street addresses – stuff that
usually appears in records representing customers, residences, patients, etc.

But the simple concept belies a much more complex question about the attributes
used to describe the individual as well as differentiate pairs of individuals.
Much of this issue revolves around the approaches taken for determining what
characteristics are being managed within a representative record, the motivation
for including those characteristics, and importantly, are those data elements
used solely as “attribution” (or additional description of the entity involved)
or are they used for “distinction” (to help in unique identification).

More to the point: what are the core data elements necessary for determining the
uniqueness of a record? We often take for granted the fact that our relational
models presume one and only one record per entity, and that there might be
business impacts should more than one entry exist for each individual.

Yet individual “entities” may exist in multiple data sets, even in different
contexts. Some characteristics are part and parcel of each entity, while others
describe the entity playing a particular role. Our upcoming posts are intended
to consider some of these issues when assessing similarity for record linkage
and matching.

What is Record Linkage?

Blog Administrator | Analyzing Data Quality, Data Quality | , , , ,

By David Loshin

In my last entry, I talked about the fact that many distributed pieces of data about a single individual can be combined together to form a deep profile about that individual. But how are different data records from disparate data sets combined to formulate insightful profiles?

The answer lies in the ability to collect the different pieces of data that
belong to a single individual and then glom them together. For example, let’s
presume the existence of a record in one data set that has a person’s address, a
record in another data set that has that person’s telephone number, a third
record that has that person’s registration number for a toaster, another with
the person’s car year, make, and model, etc.

As long as you can find all the records that are associated with each person and
connect them together, you could collect all the interesting information
together and create a single representative profile. That profile is then
suitable for use in list generation, but is also used for more comprehensive
analytics such as segmentation, clustering analysis, and classification.

The way these records are connected together is through a process called “record
linkage.” This process searches through one or more data sets looking for
records that refer to the same unique entity based on identifying
characteristics that can be used to distinguish one entity from all others, such
as names, addresses, or telephone numbers.

When two records are found to share the same pieces of identifying information,
you might assume that those records can be linked together. It sounds simple,
but unfortunately, there are a number of challenges with linking records across
more than one data set, such as:

· The records from the different data sets don’t share the same identifying
attributes (one might have phone number but the other one does not).

· The values in one data set use a different structure or format than the data
in another data set (such as using hyphens for social security numbers in one
data set but not in the other).

· The values in one data set are slightly different than the ones in the other
data set (such as using nicknames instead of given names).

· One data set has the values broken out into separate data elements while the
other does not (such as titles and name suffixes).

Luckily, there are numerous software products that are designed to address these
discrepancies, which can simplify the record linkage process. If you recall some
of my previous posts, you may begin to see how parsing and standardization start
to fit in. These tools will parse and standardize the values prior to attempting
to compare for the purposes of linkage, and that alleviates some of the
challenges I noted.

On the Road with SQL Saturdays

Blog Administrator | Uncategorized | , , ,

By Ira Whiteside

Melissa Data’s BI Architect


SQL Saturday #62 is in the books, I would like to thank everyone who attended our presentation of SSIS Building a Metadata Mart on the Road to Data Governance.


I think this is a great format to meet folks and have more direct presentations; we had a lot of interaction and questions. If you would like a copy of the presentation or code, please send me an email See you at Houston #57 (Metadata Mart), Cleveland #60 (Advanced Fuzzy Matching and Record Linkage) or Chicago #67.