By David Loshin

The challenge of conflicting data quality rules can be summarized formally. We have two rules, R1 and R2, that apply to the same input X but produce different outputs:


• R1: Transform string X into string Y1
• R2: Transform string X into string Y2

It is easy to see this conflict in the simple examples from my
previous posts, but as your data cleansing rule set grows, the
potential for introducing conflicting rules grows while the ability to find
them diminishes.
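
One way to keep such conflicts findable as the rule set grows is to check for them mechanically. The sketch below is only an illustration, assuming each rule can be written as a simple (input string, output string) pair; the function name `find_conflicts` and the sample rules are hypothetical and not taken from any particular data quality tool.

```python
from collections import defaultdict


def find_conflicts(rules):
    """Return every input string that is mapped to more than one output.

    `rules` is an iterable of (source, target) pairs. This representation
    is an assumption made for the sketch.
    """
    outputs_by_input = defaultdict(set)
    for source, target in rules:
        outputs_by_input[source].add(target)
    # A conflict exists when the same input X has two different outputs Y1, Y2.
    return {
        source: sorted(targets)
        for source, targets in outputs_by_input.items()
        if len(targets) > 1
    }


# Illustrative rule set: four street-name rules plus one unrelated rule.
rules = [
    ("St", "SAINT"),
    ("St.", "SAINT"),
    ("St", "STREET"),
    ("St.", "STREET"),
    ("Ave", "AVENUE"),
]

conflicts = find_conflicts(rules)
for source, targets in conflicts.items():
    print(f"{source!r} is transformed into conflicting values: {targets}")
```

A check like this only catches rules whose inputs match exactly. Rules expressed as patterns that overlap on some inputs would need a more involved comparison.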

There are a couple of approaches for addressing this challenge. The first is
greater differentiation in defining the cleansing rule through the use of
contextual cues. In our example, we might look at these conflicts:

1. St is transformed into SAINT
2. St. is transformed into SAINT
3. St is transformed into STREET
4. St. is transformed into STREET

and introduce contextual constraints:

1. St is transformed into SAINT at the beginning of a street name
2. St. is transformed into SAINT at the beginning of a street name
3. St is transformed into STREET at the end of a street name
4. St. is transformed into STREET at the end of a street name

This approach resolves the conflict in some cases, but the problem returns
when new contexts appear, such as a string like “Trevor St. Lawrence
St.”, where the first “St.” sits in the middle of the name and would
necessitate yet another contextual rule.
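
The positional constraints, and the gap they leave, can be sketched in a few lines. This is a minimal illustration, assuming the street name has already been isolated and can be split on whitespace; the function name `expand_st` is hypothetical.

```python
def expand_st(street_name):
    """Apply the positional rules for "St" and "St." to a street name.

    Assumption for this sketch: tokens are separated by whitespace, and
    only the first and last tokens carry a contextual rule.
    """
    tokens = street_name.split()
    last = len(tokens) - 1
    result = []
    for i, token in enumerate(tokens):
        if token in ("St", "St."):
            if i == 0:
                token = "SAINT"    # beginning of the street name
            elif i == last:
                token = "STREET"   # end of the street name
            # Any other position has no rule, so the token is left as-is.
        result.append(token)
    return " ".join(result)


print(expand_st("St. Charles St."))
print(expand_st("Trevor St. Lawrence St."))
```

In the second call, the trailing “St.” is expanded but the one in the middle of the name is left untouched, because neither positional rule covers it.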
