Are You A Dupe Detective?

Blog Administrator | Data Management, Data Quality, Data Quality Components for SSIS, ETL, Fuzzy Matching | , , , , , , , ,

By Joseph Vertido

The process of finding approximate matching records in your data to get rid of
duplicates is precisely that – fuzzy. It raises as many questions as answers. Am
I using a good matching algorithm? Am I matching on the right fields? Is it a
true match or a false one?

The problem begins when inconsistent data enters from multiple sources. The
meticulous process of finding these similar records and comparing to see if they
are actually the same is a daunting challenge. But with the release of the
Melissa Data Fuzzy Matching Component for SQL Server Integration Services (SSIS),

now have a tool that will make this all elementary. With this component, you
become a Sherlock Holmes – easily cracking the case of the data doppelgangers.

Finding the Culprits

Matching duplicate records are identified through a percent score. Compared
records will be given a match score ranging from 0% (non-matching) to 100%
(exact match). So what about records that score in between?

By leveraging the ETL capabilities of SSIS, the Fuzzy Matching Component allows
you to send the results through three different output destinations: Match,
Non-Match; and Possible Match. Based on how strict or loose you set our
thresholds to be, records will be redirected to the output tables accordingly –
making the job of keeping the bad records out much easier.

More Brains are Better Than One

So how exactly does the Fuzzy Matching Component determine match percentages?
Similarity computation is done through built-in fuzzy matching algorithms. But
settle with just one algorithm for similarity computation when you can have 16!

Available algorithms include common algorithms like the Jaro and Levenstein, but
also includes other more advanced algorithms such as Smith-Waterman-Gotoh and

Why so many you might ask? Each algorithm has its own strengths and weaknesses.
Some algorithms are more accurate when it comes to company and peoples’ names,
while some are more effective when it comes to company names. But with the wide
array of algorithms to choose from, you have the flexibility to choose the logic
that works for your data.

The Seven Features of the Fuzzy Matching Component

The SSIS component includes these features:

  1. Match based on several columns in your data
  2. Get actual Match Score Percentages
  3. Multiple fuzzy matching algorithms
  4. Automatic filtering of matching, non-matching, and possible matching records
  5. Built-in pre-cleansing through search and replace patterns
  6. Data Driven Model for easy migration to production
  7. Matching Metadata for data driven decision making

It’s Elementary, My Dear Watson

In what seems to be an impossible task of finding and blocking dupe records,
Melissa Data aims to help you put the puzzle pieces together and help solve the
mysteries of

fuzzy matching. With the Melissa Data Fuzzy Matching Component for SSIS, you’re
now one step closer to making your data problem-free.