By David Loshin
Fortunately, the same methods of parsing and standardization can be used for the content itself. This can address the types of issues I noted in the first post of this series, in which someone entering data about me would have used a nickname such as “Dave” instead of “David.”
By introducing a set of rules for pattern recognition, we can organize a number
of transformations from an unacceptable value into one that is more acceptable
or standardized. Mapping abbreviations and acronyms to fully spelled out words,
eliminating punctuation, even reordering letters in words to attempt to correct
misspellings – all of these can be accomplished by parsing the values, looking
for patterns that each value matches, and then applying a transformation or
correction.
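As a rough sketch of what such rule-based transformations might look like in Python (the nickname and abbreviation tables here are illustrative assumptions, not a complete reference):

```python
import re

# Illustrative rule tables -- assumed examples only, a real system
# would maintain far larger, domain-specific mappings.
NICKNAMES = {"dave": "David", "bob": "Robert", "bill": "William"}
ABBREVIATIONS = {"st": "Street", "ave": "Avenue", "dr": "Drive"}

def standardize_token(token):
    """Apply pattern-based corrections to a single token."""
    # Eliminate punctuation (e.g. "St." -> "St")
    cleaned = re.sub(r"[^\w]", "", token)
    key = cleaned.lower()
    # Map nicknames to their fully spelled-out names
    if key in NICKNAMES:
        return NICKNAMES[key]
    # Expand common abbreviations
    if key in ABBREVIATIONS:
        return ABBREVIATIONS[key]
    return cleaned

def standardize(value):
    """Parse a value into tokens and transform each one."""
    return " ".join(standardize_token(t) for t in value.split())

print(standardize("Dave"))          # -> David
print(standardize("123 Main St."))  # -> 123 Main Street
```

Each token is matched against the rule tables in turn, so adding a new transformation is just a matter of extending the mappings.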
In essence, we can create a two-phased standardization process that first
attempts to correct the content and then attempts to normalize the structure.
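A minimal sketch of that two-phased process, assuming a simple name record (the field names and the output format are hypothetical choices for illustration):

```python
# Illustrative nickname map -- an assumed example, not a reference list
NICKNAMES = {"dave": "David"}

def correct_content(record):
    """Phase 1: correct the content, e.g. map a nickname to a full name."""
    first = record["first"].strip()
    full = NICKNAMES.get(first.lower(), first)
    return {"first": full, "last": record["last"].strip()}

def normalize_structure(record):
    """Phase 2: normalize the structure into one standard representation."""
    return f"{record['last'].upper()}, {record['first'].title()}"

def standardize_record(record):
    """Run both phases: content correction first, then normalization."""
    return normalize_structure(correct_content(record))

print(standardize_record({"first": "Dave", "last": "loshin"}))
# -> LOSHIN, David
```

Because the two phases are separate functions, the content rules and the structural conventions can evolve independently.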
Applying these same rules to all data sets yields a standard representation
of all the records, which reduces the effort needed to perform exact
matching between them.
Yet this process may still allow variance to remain, and for that we have some
other algorithms that I will touch upon in upcoming posts.