More About Data Quality Assessment

Melissa Team | Address Quality, Analyzing Data, Analyzing Data Quality, Data Management, Data Profiling, Data Quality, Data Quality Assessment | , , , , , ,

By David Loshin

In our last series of blog entries, I shared some thoughts about data quality assessment and the use of data profiling techniques for analyzing how column value distribution and population corresponded to expectations for data quality. Reviewing the frequency distribution allowed an analyst to draw conclusions about column value completeness, the validity of data values, and compliance with defined constraints on a column-by-column basis.

However, data quality measurement and assessment goes beyond validation of column values, and some of the more interesting data quality standards and policies apply across a set of data attributes within the same record, across sets of values mapped between columns, or relationships of values that cross data set or table boundaries.

Data profiling tools can be used to assess these types of data quality standards in two ways. One approach is more of an undirected discovery of potential dependencies that are inherent in the data, while the other seeks to apply defined validity rules and identify violations. The first approach relies on some algorithmic complexity that I would like to address in a future blog series, and instead in the upcoming set of posts we will focus on the second approach.

To frame the discussion, let’s agree on a simple concept regarding a data quality rule and its use for validation, and we will focus specifically on those rules applied to a data instance, such as a record in a database or a row in a table. A data instance quality rule defines an assertion about each data instance that must be true if the rule is observed. If the assertion evaluates to be not true, the record or table row is in violation of the rule.

For example, a data quality rule might specify that the END_DATE field must be later in time than the BEGIN_DATE field, and that means that for each record, verifying observance of the rule means comparing the two date fields and making sure that the END_DATE field is later in time than the BEGIN_DATE field.

This all seems pretty obvious, of course, and we can use data profiling tools to both capture and apply the validation of the rules to provide an assessment of observance. In the next set of posts we will focus on the definition and application of cross-column and cross-table data quality rules.