Data Quality Assessment: Value Domain Compliance

Melissa Team | Analyzing Data, Analyzing Data Quality, Data Profiling, Data Quality, Data Quality Assessment, USPS | , , , ,

By David Loshin

To continue the review of techniques for using column value analysis for assessing data quality, we can build on a concept I brought up in my last post about format and pattern analysis and the reasonableness of data values, namely whether the set of values that appear in the column complies with the set of allowable data values altogether.

Many business applications rely on well-defined reference data sets, especially for contact information and product data. These reference data sets are often managed as master data, with the values enumerated in a shared space. For example, a conceptual data domain for the states of the United States can be represented using an enumerated list of 2-character codes as provided by the United States Postal Service (USPS).

That list establishes a set of valid values, which can be used for verification for any dataset column that is supposed to use that format to represent states of the United States.

A good data profiling tool can be configured to perform yet another column analysis that verifies that each value that appears in the column coincides with one of those in the enumerated master reference set. After the values have been scanned and their number of occurrences tallied, the set of unique values can be traversed and each value compared against the reference set.

Any values that appear in the column that do not appear in the reference set can be culled out as potential issues to be reviewed with the business subject matter expert.

In this blog series, we have looked at a number of methods that column value scanning and frequency analysis can be used as part of an objective review of potential data issues.

In a future series, we will look more closely at why these types of issues occur as well as methods for logging the issues with enough context and explanation to share with the business users and solicit their input for determination of severity and prioritization for remediation.