Structural Differences and Data Matching

Blog Administrator | Address Quality, Address Standardization, Data Cleansing, Data Enhancement, Data Enrichment, Data Governance, Data Integration, Data Management, Data Matching, Data Quality, Duplicate Elimination, Fuzzy Matching

By David Loshin

Data matching is easy when the values are exact, but there are different types of variation that complicate matters. Let’s start at the foundation: structural differences in the ways that two data sets represent the same concepts. For example, early application systems used data files that were relatively “wide,” capturing a lot of information in each record, but with a lot of duplication.

More modern systems use a relational structure that segregates unique
attributes associated with each data concept – attributes about an individual
are stored in one data table, and those records are linked to other tables
containing telephone numbers, street addresses, and other contact data.

Transaction records refer back to the individual records, which reduces the
duplication in the transaction log tables.

The differences are largely in the representation. The older system might have
a single field for a name, one for an address, and perhaps one for a telephone
number. The newer system might break up the name field into a first name,
middle name, and last name; the address into fields for street, city, state, and
ZIP code; and the telephone number into fields for area code and exchange/line
number.
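
To picture the two layouts side by side, here is a minimal sketch in Python. The class names, field names, and sample values are all hypothetical, made up for illustration, and the newer layout is flattened into one class for brevity rather than spread across linked tables.

```python
from dataclasses import dataclass, fields


@dataclass
class LegacyRecord:
    """Older 'wide' layout: one free-text field per concept (hypothetical)."""
    name: str
    address: str
    phone: str


@dataclass
class PersonRecord:
    """Newer layout: each concept split into components (hypothetical)."""
    first_name: str
    middle_name: str
    last_name: str
    street: str
    city: str
    state: str
    zip_code: str
    area_code: str
    line_number: str


# The same (invented) individual, stored both ways.
legacy = LegacyRecord(
    name="John Q Public",
    address="123 Main St, Springfield, IL 62701",
    phone="(217) 555-0142",
)
modern = PersonRecord(
    first_name="John", middle_name="Q", last_name="Public",
    street="123 Main St", city="Springfield", state="IL", zip_code="62701",
    area_code="217", line_number="5550142",
)

legacy_fields = [f.name for f in fields(LegacyRecord)]
modern_fields = [f.name for f in fields(PersonRecord)]
print(len(legacy_fields), "fields vs.", len(modern_fields), "fields")
```

Both records describe the same person, yet nothing in one lines up directly with anything in the other.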

These structural differences become a barrier when performing record searches
and matching. The record structures are incompatible: different numbers of
fields, different field names, and different precision in what is stored.
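
A small sketch shows how the barrier appears in practice. It assumes two hypothetical records for the same invented person, held as plain dictionaries, and a deliberately naive matcher that only compares fields with identical names.

```python
# Hypothetical records for the same individual in two layouts.
legacy = {
    "name": "John Q Public",
    "address": "123 Main St, Springfield, IL 62701",
    "phone": "(217) 555-0142",
}
modern = {
    "first_name": "John", "middle_name": "Q", "last_name": "Public",
    "street": "123 Main St", "city": "Springfield",
    "state": "IL", "zip_code": "62701",
    "area_code": "217", "line_number": "5550142",
}


def shared_fields(a, b):
    """Return the field names that appear in both records."""
    return sorted(set(a) & set(b))


def naive_match(a, b):
    """Match only if both records have the same fields with equal values."""
    return a.keys() == b.keys() and all(a[k] == b[k] for k in a)


print(shared_fields(legacy, modern))
print(naive_match(legacy, modern))
```

With no field names in common, a field-by-field comparison has nothing to compare, so the naive matcher reports no match even though the two records hold the same information.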

This is the first opportunity to consider standardization: if structural
differences affect the ability to compare a record in one data set to records in
another data set, then applying some standards to normalize the data across the
data sets will remove that barrier. More on structural standardization in my
next post.
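
To make the idea concrete, here is a minimal sketch in Python. It is only an illustration under strong assumptions: the function name, the field names, and the sample records are hypothetical, names are assumed to be "first [middle] last", addresses are assumed to follow "street, city, state ZIP", and telephone numbers are assumed to contain ten digits. Real data rarely behaves this well.

```python
import re


def standardize_legacy(rec):
    """Split a wide legacy record into finer-grained fields.

    Naive rules, for illustration only: whitespace-separated names,
    comma-separated addresses, ten-digit telephone numbers.
    """
    parts = rec["name"].split()
    first, last = parts[0], parts[-1]
    middle = " ".join(parts[1:-1])  # empty string when there is no middle name

    street, city, state_zip = [p.strip() for p in rec["address"].split(",")]
    state, zip_code = state_zip.split()

    digits = re.sub(r"\D", "", rec["phone"])  # keep digits only

    return {
        "first_name": first, "middle_name": middle, "last_name": last,
        "street": street, "city": city, "state": state, "zip_code": zip_code,
        "area_code": digits[:3], "line_number": digits[3:],
    }


# Hypothetical records for the same individual in two layouts.
legacy = {
    "name": "John Q Public",
    "address": "123 Main St, Springfield, IL 62701",
    "phone": "(217) 555-0142",
}
modern = {
    "first_name": "John", "middle_name": "Q", "last_name": "Public",
    "street": "123 Main St", "city": "Springfield",
    "state": "IL", "zip_code": "62701",
    "area_code": "217", "line_number": "5550142",
}

standardized = standardize_legacy(legacy)
print(standardized == modern)
```

Once both records share one structure, an ordinary field-by-field comparison becomes possible. The parsing rules here would fail on single-word names, multi-word last names, or differently punctuated addresses.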