By David Loshin

As I suggested in my last post, applying parsing and standardization to normalize data value structure will reduce complexity for exact matching. But what happens if there are errors in the values themselves?

Fortunately, the same methods of parsing and standardization can be applied to the content itself. This addresses the types of issues I noted in the first post of this series, in which someone entering data about me might use a nickname such as “Dave” instead of “David.”

By introducing a set of rules for pattern recognition, we can organize a number of transformations from an unacceptable value into one that is more acceptable or standardized. Mapping abbreviations and acronyms to fully spelled-out words, eliminating punctuation, even reordering letters within words to correct likely misspellings: all of these can be accomplished by parsing the values, looking for patterns that each value matches, and then applying a transformation or standardization rule, as in the sketch below.
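
To make this concrete, here is a minimal Python sketch of such pattern-based content rules. The lookup tables, the vocabulary set, and names like `standardize_value` are hypothetical stand-ins for this example, not any particular product's API; a real implementation would draw its reference data from curated sources.

```python
import re

# Illustrative rule tables (hypothetical; real systems load these
# from curated reference data rather than hard-coding them).
NICKNAMES = {"dave": "David", "bob": "Robert", "liz": "Elizabeth"}
ABBREVIATIONS = {"st": "Street", "rd": "Road", "apt": "Apartment"}
VOCABULARY = {"david", "street", "robert"}

def correct_transposition(token: str) -> str:
    """Try swapping adjacent letters to repair a likely typo ("Davdi" -> "David")."""
    for i in range(len(token) - 1):
        candidate = token[:i] + token[i + 1] + token[i] + token[i + 2:]
        if candidate.lower() in VOCABULARY:
            return candidate
    return token

def standardize_value(token: str) -> str:
    """Apply pattern-based rules to one parsed token."""
    cleaned = re.sub(r"[^\w]", "", token)    # eliminate punctuation ("St." -> "St")
    key = cleaned.lower()
    if key in NICKNAMES:                     # map nicknames to full given names
        return NICKNAMES[key]
    if key in ABBREVIATIONS:                 # expand abbreviations
        return ABBREVIATIONS[key]
    return correct_transposition(cleaned)    # last resort: fix a transposed pair

print(standardize_value("Dave"))   # David
print(standardize_value("St."))    # Street
print(standardize_value("Davdi"))  # David
```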

In essence, we can create a two-phased standardization process that first attempts to correct the content and then attempts to normalize the structure. Applying the same rules to every data set yields a standard representation of all the records, which reduces the effort required to perform the exact matching.
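
A rough sketch of how the two phases might compose, assuming simple name records of the form “Given Surname.” The `correct_content` and `normalize_structure` helpers and their rule tables are purely illustrative assumptions:

```python
def correct_content(tokens):
    """Phase 1: repair the values themselves (nicknames, punctuation)."""
    nicknames = {"dave": "David"}            # hypothetical lookup table
    fixed = []
    for t in tokens:
        t = t.strip(".,")                    # drop stray punctuation
        fixed.append(nicknames.get(t.lower(), t))
    return fixed

def normalize_structure(tokens):
    """Phase 2: impose one canonical layout, here 'Surname, Given'."""
    if len(tokens) == 2:
        given, surname = tokens
        return f"{surname}, {given}"
    return " ".join(tokens)

def standardize(record: str) -> str:
    return normalize_structure(correct_content(record.split()))

# Variant inputs collapse to the same canonical key, so a plain
# string comparison (exact match) now links the two records:
print(standardize("Dave Loshin"))    # Loshin, David
print(standardize("David. Loshin"))  # Loshin, David
print(standardize("Dave Loshin") == standardize("David. Loshin"))  # True
```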

Yet this process may still allow variance to remain, and for that we have some
other algorithms that I will touch upon in upcoming posts.

