Classifying Data Quality Problems
By Elliot King
And while that definition may be good enough in a practical sense for
specific issues, it really isn’t good enough to diagnose the sources of data
quality problems generally. Constructing a general framework for data quality
problems can be a useful guide in better identifying and resolving specific
issues.
One of the earliest efforts to better understand the nature of data quality
problems calls for classifying problems into three general categories:
operational, conceptual and organizational. Operational data quality issues
arise from problems with data capture and transmission. For example, inaccurate
data may be collected, data may be missing, or data may be corrupted by some
process along the way.
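As a rough illustration of what checking for operational problems can look like, here is a minimal sketch in Python. The record layout (customer_id, email, age), the field names and the validity rules are all hypothetical, chosen only to show the two kinds of operational failure described above: data that is missing and data that was captured or transmitted incorrectly.

```python
# Hypothetical record layout, assumed only for this illustration.
REQUIRED_FIELDS = ("customer_id", "email", "age")


def operational_issues(record: dict) -> list[str]:
    """Return labels for the operational problems found in one record."""
    issues = []

    # Missing data: a required field is absent or empty.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or value == "":
            issues.append(f"missing:{field}")

    # Inaccurate or corrupted data: a value is present but implausible.
    # The 0-130 range is an assumed rule, not a standard.
    age = record.get("age")
    if age not in (None, ""):
        is_int = isinstance(age, int) and not isinstance(age, bool)
        if not is_int or not 0 <= age <= 130:
            issues.append("invalid:age")

    # A deliberately crude email check, enough to catch obvious corruption.
    email = record.get("email")
    if email and "@" not in email:
        issues.append("invalid:email")

    return issues


clean = {"customer_id": 1, "email": "a@example.com", "age": 30}
dirty = {"customer_id": 2, "email": "", "age": 300}
print(operational_issues(clean))
print(operational_issues(dirty))
```

Checks like these can only catch operational problems. They say nothing about whether the fields being collected are the right ones for the job.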
Conceptual data quality problems occur when data is not well defined or is
inappropriate for its intended use. One of the most famous examples of a
conceptual data quality problem (though it is not often thought of in this way)
was brought to light in the movie Moneyball.
The basic thrust of the movie was not that the information old-time baseball
scouts used to evaluate players was wrong per se; it was that they were collecting
the wrong data to identify productive players. Batting average, for example, is
less useful in determining a player’s value than on-base percentage. A pressing
new conceptual data problem is the attempt to use electronic patient records to
judge medical treatment outcomes.
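To make the batting average example concrete, the sketch below computes both statistics using their standard definitions: batting average is hits divided by at-bats, and on-base percentage is hits plus walks plus hit-by-pitch, divided by at-bats plus walks plus hit-by-pitch plus sacrifice flies. The two players and their season totals are invented for illustration and do not describe anyone in the movie.

```python
def batting_average(hits: int, at_bats: int) -> float:
    """Hits divided by at-bats."""
    return hits / at_bats


def on_base_percentage(hits: int, walks: int, hit_by_pitch: int,
                       at_bats: int, sac_flies: int) -> float:
    """Times on base divided by plate appearances that count toward OBP."""
    reached = hits + walks + hit_by_pitch
    chances = at_bats + walks + hit_by_pitch + sac_flies
    return reached / chances


# Invented season totals for two hypothetical players.
player_a = {"hits": 150, "walks": 20, "hit_by_pitch": 0,
            "at_bats": 500, "sac_flies": 0}
player_b = {"hits": 135, "walks": 90, "hit_by_pitch": 5,
            "at_bats": 500, "sac_flies": 5}

for name, p in (("A", player_a), ("B", player_b)):
    avg = batting_average(p["hits"], p["at_bats"])
    obp = on_base_percentage(**p)
    print(f"Player {name}: AVG {avg:.3f}, OBP {obp:.3f}")
```

With these invented numbers, player A has the higher batting average while player B reaches base more often. Every figure is recorded accurately, so nothing is wrong operationally. The conceptual problem lies in which statistic is used to judge value.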
When operational and conceptual data problems persist over time despite repeated
attempts to fix them, organizational data quality problems are usually the
culprit. In these cases, wrong, missing and invalid data is not really the
problem, but the symptom. Something has to be fixed in the organizational
structure or culture.
The point is this: data can be wrong for many reasons, and it can’t fundamentally
be fixed without a general understanding of the error’s cause.