
Matchcode Caveats - How to Solve Them


By Tim Sidor, Data Quality Analyst

“The more advanced I make my matchcode, the more duplicates I’ll
identify.”

This is an assumption, true or false, that many new MatchUp users make, but
acting on it often leads to false dupes, no dupes, or a process that seems to
run forever.

“Why?”

Adding more columns of conditions can be looked at as ‘just adding more ways to
return more duplicates.’ These additional criteria may or may not result in
accurate groups, as you may have actually loosened your intended criteria. On
the flip side, adding matchcode components may result in fewer duplicates, as
you may have tightened your rules too much. Applying fuzzy algorithms without
thoroughly testing them will lead to a slower process but may not return a
significant number of additional matches (diminishing returns: a small gain in
accuracy for a large cost in speed and complexity).
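
To make the loosen/tighten effect concrete, here is a minimal sketch in Python. It is not MatchUp's API. It assumes a simplified model in which every component within a column must agree (AND) and a match in any one column is enough (OR). The field names, the sample records, and the exact-only comparison are all hypothetical.

```python
from typing import Dict, List

Record = Dict[str, str]
Column = List[str]  # hypothetical: field names that must ALL agree in this column


def column_matches(a: Record, b: Record, column: Column) -> bool:
    """True when every component in the column agrees (AND within a column)."""
    return all(a[f].strip().lower() == b[f].strip().lower() for f in column)


def records_match(a: Record, b: Record, matchcode: List[Column]) -> bool:
    """True when at least one column matches (OR across columns)."""
    return any(column_matches(a, b, col) for col in matchcode)


# Illustrative records only.
a = {"last": "Smith", "zip": "92688", "street": "10 Oak Ave", "phone": "555-0100"}
b = {"last": "Smith", "zip": "92688", "street": "1 Main St", "phone": "555-0100"}
c = {"last": "Smith", "zip": "92688", "street": "10 Oak Ave", "phone": "555-0199"}

base = [["last", "zip", "street"]]
looser = base + [["last", "zip"]]                # extra column: one more way to match
tighter = [["last", "zip", "street", "phone"]]   # extra component: stricter rule

print(records_match(a, b, base))     # False: streets differ
print(records_match(a, b, looser))   # True: the added column ignores street
print(records_match(a, c, base))     # True
print(records_match(a, c, tighter))  # False: the added phone component disagrees
```

In this model, the added column turns a non-match into a match (possibly a false dupe), while the added component turns a match into a non-match (possibly a missed dupe).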

“What can I do?”

When you are learning to use MatchUp, we always suggest starting with the
basics: a simple default matchcode that we distribute and a small data set.
This allows you to quickly run the process and analyze how the matchcode
performed against the data. Then make small changes, either tweaking the
matchcode and repeating the process, or running a slightly altered data set
with a few variations in format or data values. Eventually, you will work
toward your end goal of incorporating your business rules into the matching
strategy (the matchcode) and running it against your production data.

 

By following any of the above disciplined paths, you will more quickly arrive
at your goal and with a better understanding of how to create the best
matchcode for your environment. No diagonal shortcuts!

“OK, I already went straight to ‘Production Data and a Custom
Matchcode,’ what do I do?”

First, evaluate the Result Codes and Dupe Group output properties. In addition
to telling you the output disposition of a record (unique, group winner,
duplicate, etc.), the Result Codes will tell you which matchcode combination
(which column of checkmarks in the matchcode) caused the record to match in a
particular Dupe Group. If you find that a particular column never finds a
match, or never finds a match that another column hasn’t already found, you
should consider removing it. This may also prompt you to remove from the
matchcode any duplicated component types that were used with alternate
settings. After re-evaluating the remaining components and concluding that
they still represent a valid strategy, you may find that your process returns
more accurate results AND runs much faster.
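
The column review described above can be sketched as a small tally. This is an illustration in Python, not MatchUp code. It assumes you have already extracted, for each matched record, the set of column numbers that fired; the `pair_hits` input format and the function name `column_contributions` are hypothetical, and the actual Result Code values would need to be read according to the MatchUp documentation.

```python
from collections import Counter
from typing import Dict, List, Set


def column_contributions(pair_hits: List[Set[int]], n_columns: int) -> Dict[int, Dict[str, int]]:
    """Count, per column, all matches it found and the matches only it found."""
    total: Counter = Counter()
    sole: Counter = Counter()
    for hits in pair_hits:
        for col in hits:
            total[col] += 1
        if len(hits) == 1:  # exactly one column caught this match
            sole[next(iter(hits))] += 1
    return {c: {"total": total[c], "sole": sole[c]} for c in range(1, n_columns + 1)}


# Hypothetical data: each set lists the columns that matched for one duplicate.
pair_hits = [{1}, {1, 2}, {1, 2}, {3}, {1}]
stats = column_contributions(pair_hits, n_columns=4)

# A column with no sole matches never finds anything another column missed.
removal_candidates = [c for c, s in stats.items() if s["sole"] == 0]
print(removal_candidates)  # [2, 4]
```

Here column 4 never matches at all, and column 2 only matches records that column 1 already caught, so both are candidates for removal, subject to confirming that the remaining columns still reflect your business rules.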

“Can my process run faster?”

Yes. MatchUp uses an advanced clustering method to find duplicates, and
creating advanced matchcodes prevents efficient clustering, thus slowing
processes down. For example, we had a customer move a matchcode component with
a fuzzy setting from the second position to below another component that used
an exact setting (and was used in all columns). Their processing time dropped
from 47 hours to under 4 by making this simple change. Expanding on the
diminishing returns concept: if an exact matchcode returns 20,000 duplicates
from a 1,000,000 record set, is changing all components to a fuzzy algorithm
and returning 20,003 duplicates (three more matches, a 0.015% gain) worth a
process that takes four times as long to run?
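
Why an exact component helps so much can be illustrated with a generic blocking sketch in Python. This is not MatchUp's actual clustering method, which is not described here. It only assumes that an exact-match field can be used to group records first, so that the slower fuzzy comparison runs only inside each group. The records, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the fuzzy comparison are all assumptions for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Stand-in fuzzy comparison; the threshold is an arbitrary example value."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def dedupe_all_pairs(records):
    """Compare every record with every other record."""
    comparisons, matches = 0, set()
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        comparisons += 1
        if a["zip"] == b["zip"] and similar(a["name"], b["name"]):
            matches.add((i, j))
    return matches, comparisons


def dedupe_clustered(records):
    """Group on the exact field first, then fuzzy-compare within each group."""
    clusters = defaultdict(list)
    for i, r in enumerate(records):
        clusters[r["zip"]].append(i)
    comparisons, matches = 0, set()
    for ids in clusters.values():
        for i, j in combinations(ids, 2):
            comparisons += 1
            if similar(records[i]["name"], records[j]["name"]):
                matches.add((i, j))
    return matches, comparisons


records = [
    {"name": "Jon Smith", "zip": "92688"},
    {"name": "John Smith", "zip": "92688"},
    {"name": "Mary Jones", "zip": "10001"},
    {"name": "Mary Jonse", "zip": "10001"},
    {"name": "Ann Lee", "zip": "60601"},
    {"name": "Bob Ray", "zip": "73301"},
]

slow_matches, slow_count = dedupe_all_pairs(records)
fast_matches, fast_count = dedupe_clustered(records)
print(slow_matches == fast_matches)  # True: same duplicates found
print(slow_count, fast_count)        # 15 2
```

With six records the saving is 15 comparisons versus 2. Because the all-pairs count grows with the square of the record count, the gap widens quickly on large files, which is the kind of effect that component ordering can have on run time.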

“What about that Result Code that tells me a specific combination
returned a false dupe?” or “Why did these records not match under my rules?”

For details on how a matchcode relates to your data, click here for easy
guidance to understanding your matchcode rules, and remember, test thoroughly!

For more info, go to: https://www.melissa.com/data-deduplication
