By David Loshin
Many business applications rely on well-defined reference data sets, especially for contact information and product data. These reference data sets are often managed as master data, with the values enumerated in a shared space. For example, a conceptual data domain for the states of the United States can be represented using an enumerated list of 2-character codes as provided by the United States Postal Service (USPS).
That list establishes a set of valid values, which can be used for verification for any dataset column that is supposed to use that format to represent states of the United States.
A good data profiling tool can be configured to perform yet another column analysis that verifies that each value that appears in the column coincides with one of those in the enumerated master reference set. After the values have been scanned and their number of occurrences tallied, the set of unique values can be traversed and each value compared against the reference set.
Any values that appear in the column that do not appear in the reference set can be culled out as potential issues to be reviewed with the business subject matter expert.
In this blog series, we have looked at a number of methods that column value scanning and frequency analysis can be used as part of an objective review of potential data issues.
In a future series, we will look more closely at why these types of issues occur as well as methods for logging the issues with enough context and explanation to share with the business users and solicit their input for determination of severity and prioritization for remediation.