By Elliot King
Face it. Almost all data has problems. Most organizations draw on multiple data sources and depend too often on potentially unreliable data flows from customers, data entry clerks, third-party providers and different processing systems for it to be otherwise.
So what can you do to identify data quality problems before bad data interferes with critical business processes? Actually, that is something of a trick question. Depending on the speed at which your business processes are executed, in some cases you may not be able to do anything to prevent poor quality data from having an impact. But with a systematic effort, in virtually every case you can minimize the damage done by bad data and safeguard yourself in the future.
The key to beating data quality problems before they beat you is assessment. How good — that is, how accurate, timely, complete and appropriate — is the data you have? There are several approaches to answering this question. The first is what I call the journalism strategy. Just as good journalists check their facts against a second source, the data in your databases can be manually compared to the same data from a trusted source, perhaps a paper record or an expert. Of course, for a database with several million records, this might not be all that practical.
A second method is what I call the industrial approach because it is often used in product manufacturing. Take a sample of your data and see how many records do not meet specifications. Unfortunately, while sampling may give you an idea of how many records are imperfect, it may not tell you the frequency of specific types of errors and will rarely lead you to the source of the errors.
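To make the industrial approach concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the customer records, the field names, and the specification itself (a five-digit ZIP code and an age between 0 and 120) are hypothetical stand-ins for whatever your own data and specifications look like.

```python
import random

# Hypothetical customer records; field names and values are illustrative only.
# Every tenth record is given an empty ZIP code so there is something to find.
records = [
    {"id": i, "zip": "12345" if i % 10 else "", "age": 30 + (i % 40)}
    for i in range(1, 1001)
]

def meets_spec(record):
    """Assumed specification: ZIP is five digits and age is between 0 and 120."""
    zip_code = record["zip"]
    return len(zip_code) == 5 and zip_code.isdigit() and 0 <= record["age"] <= 120

def estimate_defect_rate(data, sample_size, seed=0):
    """Estimate the share of records failing the spec from a random sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(data, sample_size)
    defects = sum(1 for r in sample if not meets_spec(r))
    return defects / sample_size

rate = estimate_defect_rate(records, sample_size=100)
print(f"Estimated defect rate: {rate:.1%}")
```

Note what the result is: a single estimated percentage. It says nothing about which kinds of errors occur or where they came from, which is exactly the limitation of sampling described above.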
The third approach involves assessing your data against data rules. Remember, data records describe complex objects that have attributes and characteristics tied together in specific ways. Data rules define those relations and consist of elements such as constraints on valid values, constraints on relationships between data elements, order of events, conditions of events, timing events and so on.
Any data that violates those constraints is problematic and must be examined. Unlike sampling, a rules-based assessment can usually cover all your records, not just a subset, to see whether they conform to your data rules.
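A rough sketch of rule-based assessment might look like the following Python. The order records, field names and the four rules are all hypothetical, chosen only to illustrate the kinds of rules described above: a constraint on valid values, a constraint on the relationship between data elements, and an order-of-events check. Real data rules would come from your own business.

```python
from datetime import date

# Hypothetical order records; fields and values are illustrative only.
orders = [
    {"id": 1, "status": "shipped", "quantity": 2,
     "ordered": date(2024, 1, 5), "shipped": date(2024, 1, 7)},
    {"id": 2, "status": "unknown", "quantity": 1,
     "ordered": date(2024, 1, 6), "shipped": None},
    {"id": 3, "status": "shipped", "quantity": 3,
     "ordered": date(2024, 1, 9), "shipped": date(2024, 1, 8)},
    {"id": 4, "status": "pending", "quantity": 0,
     "ordered": date(2024, 1, 10), "shipped": None},
    {"id": 5, "status": "shipped", "quantity": 1,
     "ordered": date(2024, 1, 11), "shipped": None},
]

# Each assumed rule returns True when a record conforms.
RULES = {
    # Constraints on valid values.
    "valid_status": lambda r: r["status"] in {"pending", "shipped", "cancelled"},
    "positive_quantity": lambda r: r["quantity"] > 0,
    # Constraint on the relationship between two data elements.
    "shipped_has_date": lambda r: r["status"] != "shipped" or r["shipped"] is not None,
    # Order of events: an order cannot ship before it was placed.
    "ship_after_order": lambda r: r["shipped"] is None or r["shipped"] >= r["ordered"],
}

def find_violations(records, rules):
    """Check every record against every rule; return violating ids per rule."""
    violations = {name: [] for name in rules}
    for record in records:
        for name, rule in rules.items():
            if not rule(record):
                violations[name].append(record["id"])
    return violations

violations = find_violations(orders, RULES)
for name, ids in violations.items():
    print(name, ids)
```

Because every record is checked against every rule, the output lists the offending records by rule. That gives you the frequency of each specific type of error and a starting point for tracing it back to its source.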
Developing a comprehensive set of data rules is not the last step in the process of assessing data quality and identifying where poor data may have an impact on business processes, but in most cases it should be one of the first. Analyzing your data against those rules will help uncover poor quality data.