Data Quality Assessment: Column Value Analysis
By David Loshin
The place to start, though, is not with the assessment task per se, but with the context in which the data quality analyst will find him/herself when asked to identify potential data quality flaws. The challenge is in the interpretation of the goal: an objective assessment is intended to identify data errors and flaws, but when the task is handed off to a technical data practitioner outside of the context of business needs, the review can be more of a fishing expedition than a true analysis.
What I mean here is that an undirected approach to data quality assessment is likely to expose numerous potential issues, and without some content scoping as to which potential issues are or are not relevant to specific business processes, a lot of time may be spent on wild goose chases to fix issues that are not really problems.
With that caveat, though, we can start to look at some data quality assessment methods, starting with one particular aspect of data profiling: column value analysis. The idea is that reviewing all of the values in a specific column along with their corresponding frequencies will expose situations in which values vary from what they should be. Most column analysis centers on value frequency. In essence, the technical approach for column analysis is to scan all the values in a column, count how many times each distinct value occurs, and then present the counts to the analyst, ordered either by frequency or lexicographically by value.
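To make the scan-and-count step concrete, here is a minimal sketch in Python. It is an illustration only, not the method of any particular profiling tool: the function names and the sample `state_column` data are hypothetical, and it assumes the column fits in memory as a simple list of values.

```python
from collections import Counter


def column_value_frequencies(values):
    """Scan a column and count how often each distinct value occurs."""
    return Counter(values)


def by_frequency(freqs):
    """Order (value, count) pairs from most to least frequent.

    Ties are broken lexicographically on the string form of the value.
    """
    return sorted(freqs.items(), key=lambda kv: (-kv[1], str(kv[0])))


def by_value(freqs):
    """Order (value, count) pairs lexicographically by value."""
    return sorted(freqs.items(), key=lambda kv: str(kv[0]))


# Hypothetical sample column, invented purely for illustration.
state_column = ["NY", "CA", "NY", "ny", "CA", "N.Y.", "NY", ""]

freqs = column_value_frequencies(state_column)

print("Ordered by frequency:")
for value, count in by_frequency(freqs):
    print(f"  {value!r}: {count}")

print("Ordered lexicographically:")
for value, count in by_value(freqs):
    print(f"  {value!r}: {count}")
```

In this made-up sample, the frequency ordering pushes rare values such as `'ny'` and `'N.Y.'` to the bottom of the list, while the lexicographic ordering places near-duplicate spellings close to one another. Whether any of them is actually an error still depends on the business expectation for the column.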
These two orderings enable the lion’s share of the analysis, yet many people don’t realize that the analysis itself must be driven by the practitioner within the context of the expectation. Over the next three postings in this series, we will look at three different ways to assess quality through reviewing the enumeration of column values and their relative frequency.