Data Quality Assessment: Column Value Analysis


By David Loshin

In recent blog series, I have shared some thoughts about methods used for data quality and data correction/cleansing. This month, I’d like to share some thoughts about data quality assessment and the techniques that analysts use to review the potential anomalies they uncover.

The place to start, though, is not with the assessment task per se, but with the context in which data quality analysts find themselves when asked to identify potential data quality flaws. The challenge lies in interpreting the goal: an objective assessment is intended to identify data errors and flaws, but when the task is handed off to a technical data practitioner outside of the context of business needs, the review can be more of a fishing expedition than a true analysis.

What I mean here is that an undirected approach to data quality assessment is likely to expose numerous potential issues, and without some content scoping as to which potential issues are or are not relevant to specific business processes, a lot of time may be spent on wild goose chases to fix issues that are not really problems.

With that caveat, though, we can start to look at some data quality assessment methods, starting with one particular aspect of data profiling: column value analysis. The idea is that reviewing all of the values in a specific column, along with their corresponding frequencies, will expose situations in which values vary from what they should be. Most column analysis centers on value frequency. In essence, the technical approach is to scan all the values in a column and count how many times each distinct value occurs, then present those counts to the analyst, ordered either by frequency or lexicographically by value.
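To make the mechanics concrete, here is a minimal sketch in Python of the scan-and-count step and the two orderings. The function names and the sample column of state codes are hypothetical, invented only for illustration; a real profiling tool would read the column from a table and handle typed values, but the logic is the same.

```python
from collections import Counter

def column_value_frequencies(values):
    """Scan a column and count how often each distinct value occurs."""
    return Counter(values)

def by_frequency(freqs):
    """Order (value, count) pairs from most to least frequent.

    Ties are broken lexicographically so the output is deterministic.
    """
    return sorted(freqs.items(), key=lambda kv: (-kv[1], str(kv[0])))

def by_value(freqs):
    """Order (value, count) pairs lexicographically by value."""
    return sorted(freqs.items(), key=lambda kv: str(kv[0]))

# Hypothetical sample column: state codes with a few suspicious entries.
states = ["NY", "CA", "ny", "CA", "N.Y.", "CA", "NY", ""]
freqs = column_value_frequencies(states)

for value, count in by_frequency(freqs):
    print(f"{value!r:8} {count}")
```

Sorted by frequency, the rare values ("ny", "N.Y.", and the empty string) fall to the bottom of the list, where they stand out as candidates for review. Sorted by value, near-duplicates such as "N.Y." and "NY" land next to each other. Whether any of them is actually an error still depends on what the business expects of the column.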

These two orderings enable the lion’s share of the analysis, yet many people don’t realize that the analysis itself must be driven by the practitioner, in the context of what the business expects of the data. Over the next three postings in this series, we will look at three different ways to assess quality by reviewing the enumeration of column values and their relative frequencies.