Designing More Effective Data Cleansing Rules


By David Loshin

In my last post, we looked at a simple data transformation and cleansing rule intended to standardize the representation of a street type. We found that applying the rule without controls made changes where we didn't want any.

There are two reasons why applying the rule led to an undesired result. The
first issue is context: even though we want to map the string “St.” to “STREET”
when it appears at the end of an address, that same string “St.” appearing
earlier in the street name and before a proper noun is more likely to be an
abbreviation for “Saint,” not “Street.” The obvious way to fix that is to
introduce a new rule that maps “St.” to the word “SAINT.”
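To make the context problem concrete, here is a small illustration (my own sketch, not code from the post) of what a blind, context-free substitution does to an address where "St." serves double duty:

```python
# Illustration: a context-free substitution rewrites every "St.",
# whether it abbreviates "Saint" or "Street".
def naive_standardize(address: str) -> str:
    # Replace every occurrence of "St." with "STREET",
    # regardless of where it appears in the string.
    return address.replace("St.", "STREET")

print(naive_standardize("123 St. Charles St."))
# Both the "Saint" abbreviation and the street type get rewritten.
```

Here the "Saint Charles" part of the name is mangled right along with the street type, which is exactly the unwanted change described above.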

And that introduces the second issue: order of execution. Let’s say we did have
that second rule, so our rule set now looks like this:

• St is transformed into SAINT
• St. is transformed into SAINT
• STR is transformed into STREET
• ST is transformed into STREET
• ST. is transformed into STREET
• St. is transformed into STREET
• Str is transformed into STREET
• Str. is transformed into STREET

Applying that rule set to our string "St. Charles St" yields "SAINT CHARLES
SAINT," which is also clearly incorrect. The reason is that we have conflicting
rules, in which the same input maps to two different outputs.

When rules conflict, the first rule applied is the one that wins: by the time
we reach the later rules with the same input, the string has already been
transformed into the output version, so those rules no longer match.
Reordering the rules won't help either, since the same thing happens, just
with the other output values. To address this challenge, we need an approach
that accommodates both context and order of execution.
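As a purely illustrative sketch (my own, not the author's solution), one direction is to make each rule's context explicit: treat a token as a street type only when it ends the address, and as "SAINT" only when it precedes a proper noun. All names and conditions below are assumptions for the example:

```python
# Sketch of context-sensitive rules: the mapping chosen for "St."/"St"
# depends on where the token sits and what follows it.
STREET_TYPES = {"ST": "STREET", "ST.": "STREET", "STR": "STREET", "STR.": "STREET"}
SAINT_FORMS = {"ST", "ST."}

def standardize(address: str) -> str:
    tokens = address.split()
    out = []
    for i, token in enumerate(tokens):
        key = token.upper()
        is_last = (i == len(tokens) - 1)
        before_name = (not is_last) and tokens[i + 1][:1].isupper()
        if is_last and key in STREET_TYPES:
            out.append(STREET_TYPES[key])   # street type at the end of the address
        elif key in SAINT_FORMS and before_name:
            out.append("SAINT")             # abbreviation before a proper noun
        else:
            out.append(token.upper())
    return " ".join(out)

print(standardize("St. Charles St"))  # SAINT CHARLES STREET
```

Because each rule now carries its own applicability condition, the two mappings no longer collide, and the order in which they are checked stops mattering for this input.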