By Elliot King

The cliché is as old as computing itself: garbage in, garbage out. And that cliché is as true now as ever, if not more so. With information flowing into companies from so many sources, including the Web and third-party providers, mistakes should not just be expected; they are basically inevitable. Garbage data is going to get into your data systems.

It is tempting to close our eyes to bad data and pretend it doesn’t matter, but
that would be a major mistake. Virtually any operation driven by faulty data is
suspect. Trends you uncover could be wrong. Your customer contact efforts
could be inappropriate or misdirected.

Data cleansing, the systematic effort to remediate bad data, is no trivial task.
First, many different kinds of errors can exist. Mistakes can occur in
single-source systems as well as multiple-source systems. Errors and
inconsistencies can be introduced at the metadata level, where the data schema
(or, in the case of the Web, the information wrapper) may be flawed, or at the
granular level, where the information itself is just not right.

Even within the information itself, so much can go awry. There can be missing
values and misspellings. Information can be entered into the wrong field, such
as a street name in the city field. Attributes that should be linked aren’t: a
city without a ZIP code, say. Records could contradict each other or be
associated incorrectly. For example, one record may place “John Smith” in human
resources when he actually works in payroll.
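
To make those error types concrete, here is a minimal Python sketch that flags a few of them. The sample records, the field names and the street-address pattern are all assumptions invented for illustration; real checks would depend on your own schema and data.

```python
import re

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"name": "John Smith", "street": "12 Oak St", "city": "Springfield",
     "zip": "62704", "dept": "payroll"},
    {"name": "Jane Doe", "street": "", "city": "45 Elm Street",
     "zip": "", "dept": "sales"},
    {"name": "John Smith", "street": "12 Oak St", "city": "Springfield",
     "zip": "62704", "dept": "human resources"},
]

# Rough heuristic (an assumption): a number followed by words ending in a street suffix.
STREET_HINT = re.compile(
    r"\d+\s+\w+.*\b(st|street|ave|avenue|rd|road)\b", re.IGNORECASE
)


def find_problems(rows):
    """Return a list of (row_index, description) for each suspected error."""
    problems = []
    seen_dept = {}  # normalized name -> (first row index, department)
    for i, row in enumerate(rows):
        # Missing values: any field that is empty or only whitespace.
        for field, value in row.items():
            if not value.strip():
                problems.append((i, f"missing value in '{field}'"))
        # Wrong field: the city value looks like a street address.
        if STREET_HINT.search(row["city"]):
            problems.append((i, "city field looks like a street address"))
        # Unlinked attributes: a city with no ZIP code.
        if row["city"].strip() and not row["zip"].strip():
            problems.append((i, "city present without a ZIP code"))
        # Contradiction: the same name appears with a different department.
        key = row["name"].strip().lower()
        if key in seen_dept and seen_dept[key][1] != row["dept"]:
            problems.append((i, f"department conflicts with row {seen_dept[key][0]}"))
        else:
            seen_dept.setdefault(key, (i, row["dept"]))
    return problems


for index, description in find_problems(records):
    print(index, description)
```

A check like the last one only raises a flag. Two records with the same name may be two different people, so deciding which department is right still takes a rule or a human.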

Not surprisingly, data cleansing has more steps than doing the laundry. The
first step is to analyze the data and find the real mistakes: the omissions, the
contradictions and the errors. The next step is to re-engineer and validate the
new metadata and rules to address those errors. Then the data has to be
transformed, expunging the problems. At that point, the data can be reloaded
into the database. And in case that doesn’t sound daunting enough, each of the
steps generally has a slew of subprocesses too.
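
The steps above can be sketched roughly as follows. This is a toy illustration, not a production pipeline: the rows, the validation rules, the `customers` table and the function names are all hypothetical, and an in-memory SQLite database stands in for whatever database you actually use.

```python
import sqlite3

# Hypothetical raw rows; column names and values are illustrative assumptions.
raw_rows = [
    {"name": " john smith ", "city": "springfield", "zip": "62704"},
    {"name": "Jane Doe", "city": "Chicago", "zip": "6060"},  # ZIP too short
    {"name": "", "city": "Boston", "zip": "02108"},          # missing name
]

# Step 2: rules written after analyzing the data (assumed here, not universal).
RULES = {
    "name": lambda v: bool(v.strip()),
    "city": lambda v: bool(v.strip()),
    "zip": lambda v: len(v) == 5 and v.isdigit(),
}


def analyze(rows):
    """Step 1: list the (row_index, field) pairs that break a rule."""
    return [
        (i, field)
        for i, row in enumerate(rows)
        for field, is_valid in RULES.items()
        if not is_valid(row[field])
    ]


def transform(rows):
    """Step 3: normalize formatting, then drop rows that still fail a rule."""
    cleaned = []
    for row in rows:
        fixed = {
            "name": row["name"].strip().title(),
            "city": row["city"].strip().title(),
            "zip": row["zip"].strip(),
        }
        if all(is_valid(fixed[field]) for field, is_valid in RULES.items()):
            cleaned.append(fixed)
    return cleaned


def reload_into_db(rows):
    """Step 4: load the cleaned rows into a fresh in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, city TEXT, zip TEXT)")
    conn.executemany("INSERT INTO customers VALUES (:name, :city, :zip)", rows)
    conn.commit()
    return conn


print(analyze(raw_rows))
conn = reload_into_db(transform(raw_rows))
print(conn.execute("SELECT name, city, zip FROM customers").fetchall())
```

Dropping failed rows is the bluntest way to expunge problems. In practice you would more likely set them aside for review or repair rather than discard them.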

Of course, none of this has to be done manually. Many different tools have been
introduced into the market. Some are more generalized, while others specialize
in fixing a specific problem, such as names and addresses.

At the end of the day, “garbage in, garbage out” sounds really harsh. Maybe you
should look at it this way: clean data is good data. But just like your clothes,
if you use data, it will get dirty, so the cleansing never actually ends.

