By David Loshin

One approach to resolving data cleansing rule conflicts is the introduction of contextual constraints for application of the rules. This could help in differentiating the application of rules, but could grow to be complex quickly.

I had noted that there is a second approach that could be used: adjusting the
rule set so that the different meanings of an abbreviation stay distinct, and
then phasing the application of the rules. The idea is that if we have two
rules that share the same input but have different outputs, of the form:

• R1: Transform string X into string Y1
• R2: Transform string X into string Y2

Then we can break that conflict by first converting all instances of one
context-dependent type of input into a modified form, and then applying modified
rules during a second pass. Here is another stab at splitting our sample rules
into two passes. Here is pass 1:

1. St is transformed into __STREET__ at the end of a street name
2. St. is transformed into __STREET__ at the end of a street name

Here is pass 2:

1. St is transformed into SAINT
2. St. is transformed into SAINT
3. __STREET__ is transformed into STREET

We used the more predictable contextual rules for the first pass and changed the
flagged items into a placeholder token, one that would probably never appear in
the data itself, for use in the next pass.

Hopefully we will have filtered out all of the strictly context-dependent
instances in the first pass, allowing us to loosen the constraint for the
instances in which the context is less predictable. This transforms
“Trevor St. Lawrence St.” into “TREVOR ST. LAWRENCE __STREET__” after pass 1 and
into “TREVOR SAINT LAWRENCE STREET” after pass 2.
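
The two passes can be sketched in a few lines of Python. This is only an
illustration under some assumptions of mine, not the way any particular data
cleansing tool defines its rules: the names PASS_1, PASS_2, apply_rules, and
cleanse are hypothetical, "at the end of a street name" is approximated as "at
the end of the string," and the input is uppercased first to match the sample
output.

```python
import re

# Placeholder token that should be very unlikely to occur in real data.
PLACEHOLDER = "__STREET__"

# Pass 1: the predictable, context-dependent rule.
# Assumption: "at the end of a street name" means "at the end of the string".
PASS_1 = [
    (re.compile(r"\bST\.?$"), PLACEHOLDER),
]

# Pass 2: the loosened rules, applied once the street suffixes are protected.
PASS_2 = [
    # Any remaining standalone ST or ST. is read as SAINT.
    (re.compile(r"\bST\.?(?=\s|$)"), "SAINT"),
    # The placeholder is resolved to its final value.
    (re.compile(re.escape(PLACEHOLDER)), "STREET"),
]


def apply_rules(text, rules):
    """Apply each (pattern, replacement) rule in order and return the result."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


def cleanse(name):
    """Return the value after pass 1 and after pass 2."""
    normalized = name.upper().strip()
    after_pass_1 = apply_rules(normalized, PASS_1)
    after_pass_2 = apply_rules(after_pass_1, PASS_2)
    return after_pass_1, after_pass_2


if __name__ == "__main__":
    first, second = cleanse("Trevor St. Lawrence St.")
    print(first)
    print(second)
```

Because the placeholder begins with underscores, the loosened SAINT rule in
pass 2 cannot match inside it, which is what keeps the two meanings of the
abbreviation apart between passes.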

This is just one way to approach the challenge, and there are other ideas that
can be applied. The first step is to look at how your data cleansing tools
define rules as a way to weigh the options. I look forward to exploring this
in greater detail in an upcoming series.
