This will be the first in a series of posts on identifying similar records between two different sources, or grouping records from a single source, based on the string values in existing columns. We will define an approach and review actual implementations with various tools and vendors' products.

There are many facets to review. I would like to start by drawing from several articles by experienced practitioners, from both academia and the commercial market; they are footnoted at the end of the post.

 

Although we are discussing matching, we need to address several steps before getting to the actual use of matching algorithms. For this post, we will focus on the pragmatic steps required to identify the relationships between records, and address how these steps fit into an overall data governance, data quality, and MDM architecture.

 

The steps required are shown here in a basic diagram: 

[Diagram: record linkage approach]

The steps are as follows (illustrative code sketches for several of the steps appear after the list):

1. Cleansing and standardization

a. Create common formats and patterns for data values

b. Preferably data-driven rules that can be shared and reused

2. Group records

a. Choose a single value or multiple values to match on

b. Create a concatenated value free of spaces or special characters

3. Split records

a. Create separate data streams to support parallel match processing

4. Compare records and determine scores

a. Based on the type of value (name, product, etc.), select the appropriate algorithm

b. We will discuss the various algorithms in a future post

5. Split into separate match categories

a. Match, no match, and possible match

6. Analyze results of matches

a. Matches need to be reviewed for accuracy; this can be done with tools or, in some cases, manually

7. Evaluate using match tools to determine whether the best algorithms have been combined

a. Possible matches need to be evaluated and analyzed iteratively to determine whether additional cleansing or different matching algorithms could be used more effectively
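To make step 1 concrete, here is a minimal Python sketch of cleansing and standardization. The abbreviation table and the punctuation handling are illustrative assumptions, not rules from any particular product; in practice these rules should be data driven so they can be shared and reused.

import re

# Illustrative, hypothetical abbreviation rules; real rules should be data driven.
ABBREVIATIONS = {"ST": "STREET", "AVE": "AVENUE", "RD": "ROAD", "CORP": "CORPORATION"}

def standardize(value):
    # Uppercase, replace punctuation with spaces, collapse whitespace,
    # and expand known abbreviations token by token.
    value = re.sub(r"[^A-Za-z0-9 ]+", " ", value).upper()
    tokens = [ABBREVIATIONS.get(token, token) for token in value.split()]
    return " ".join(tokens)

print(standardize("123 Main St., Apt #4"))  # 123 MAIN STREET APT 4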
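Step 2 is often called blocking: standardized values are concatenated into a single key, free of spaces and special characters, so that only records sharing a key need to be compared. A rough sketch follows; the field names (name, zip) are assumptions made for the example, not a prescribed key.

from collections import defaultdict

def match_key(record):
    # Concatenate the chosen values and strip anything that is not a letter or digit.
    raw = record.get("name", "") + record.get("zip", "")
    return "".join(ch for ch in raw.upper() if ch.isalnum())

def group_records(records):
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return groups

records = [
    {"name": "Acme Corp", "zip": "90210"},
    {"name": "ACME CORP.", "zip": "90210"},
    {"name": "Widgets Inc", "zip": "10001"},
]
print({key: len(group) for key, group in group_records(records).items()})
# {'ACMECORP90210': 2, 'WIDGETSINC10001': 1}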
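For step 3, one simple way to split the work is to deal the grouped keys out round robin into separate streams, which can then be matched independently, for example by separate worker processes or separate SSIS data flows. This is only a sketch of the idea; the group keys and the number of streams are arbitrary.

def split_into_streams(groups, stream_count):
    # Assign each group of records to one of several independent streams.
    streams = [{} for _ in range(stream_count)]
    for index, (key, group) in enumerate(groups.items()):
        streams[index % stream_count][key] = group
    return streams

groups = {"ACMECORP90210": ["r1", "r2"], "WIDGETSINC10001": ["r3"], "GLOBEX10001": ["r4"]}
streams = split_into_streams(groups, 2)
print([sorted(stream) for stream in streams])
# [['ACMECORP90210', 'GLOBEX10001'], ['WIDGETSINC10001']]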
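For step 4, the scoring function depends on the kind of value being compared; names, addresses, and product descriptions each favor different algorithms, which we will cover in a future post. As a placeholder, the sketch below uses the similarity ratio from Python's standard library difflib module; it is not the algorithm any specific vendor uses, just an easy way to get a score between 0 and 1.

from difflib import SequenceMatcher

def similarity(a, b):
    # Returns a score between 0.0 (no similarity) and 1.0 (identical strings).
    return SequenceMatcher(None, a, b).ratio()

print(round(similarity("ACME CORPORATION", "ACME CORP"), 2))  # 0.72
print(round(similarity("JOHNSON", "JOHNSTON"), 2))            # 0.93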
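Step 5 then splits the scored pairs into match, possible match, and no match using two thresholds. The threshold values below are illustrative only; in practice they are tuned from the review and analysis described in steps 6 and 7.

def categorize(score, lower=0.75, upper=0.90):
    # Thresholds are illustrative and should be tuned against reviewed results.
    if score >= upper:
        return "match"
    if score >= lower:
        return "possible match"
    return "no match"

print(categorize(0.95))  # match
print(categorize(0.80))  # possible match
print(categorize(0.40))  # no match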
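Finally, for step 7, once a sample of pairs has been reviewed (step 6), precision and recall give a simple way to compare combinations of cleansing rules, algorithms, and thresholds. The sketch below assumes the reviewed sample is keyed by a record-pair identifier; it is a generic evaluation helper, not tied to any particular match tool.

def precision_recall(predicted, reviewed):
    # predicted / reviewed: dicts mapping a record-pair id to True (match) or False.
    true_positives = sum(1 for pair, is_match in predicted.items() if is_match and reviewed.get(pair))
    false_positives = sum(1 for pair, is_match in predicted.items() if is_match and not reviewed.get(pair))
    false_negatives = sum(1 for pair, is_match in reviewed.items() if is_match and not predicted.get(pair))
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) else 0.0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) else 0.0
    return precision, recall

predicted = {"a-b": True, "a-c": True, "b-c": False}
reviewed  = {"a-b": True, "a-c": False, "b-c": True}
print(precision_recall(predicted, reviewed))  # (0.5, 0.5)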

For the Microsoft enthusiasts, I have posted several articles with code showing how to implement this type of process in SQL and SSIS at SQL Server Central and SQL Share.

Please feel free to comment on this post or email me at ira@melissadata.com.

Reference Articles:

Febrl – A Freely Available Record Linkage System with a Graphical User Interface, Peter Christen: http://academic.research.microsoft.com/Paper/4566092.aspx

http://www.cs.cmu.edu/~pradeepr/papers/ieee03.pdf

Using q-grams in a DBMS for Approximate String Processing: http://pages.stern.nyu.edu/~panos/publications/deb-dec2001.pdf

Data Quality: Automated Edit/Imputation and Record Linkage: https://www.census.gov/srd/papers/pdf/rrs2006-07.pdf

 
