By David Loshin

Once we have started our data quality assessment process by performing column value analysis, we can reach out beyond the scope of the types of null value analysis we discussed in the previous blog post. Since our column analysis effectively tallies the number of each value that appears in the column, we can use this frequency distribution of values to identify additional potential data flaws by considering a number of different aspects of value frequency (as well as lexicographic ordering), including:

  • Range Analysis, which looks at the values, and allows the analyst to consider whether they can be ordered so as to determine whether the values are constrained within a well-defined range.

  • Cardinality Analysis, which analyzes the number of distinct values that appear within the column to help determine if the values that actually appear are reasonable for what the users expect to see.

  • Uniqueness, which indicates if each of the values assigned to the attribute is used once and only once within the column, helping the analyst to determine if the field is (or can be used as) a key.

  • Value Distribution, which presents an ordering of the relative frequency (count and percentage) of the assignment of distinct values. Reviewing this enumeration and distribution alerts the analyst to any outlier values, either ones that appear more than expected, or invalid values that appear few times and are the result of finger flubs or other data entry errors.

In addition, good data profiling tools can abstract the value strings by mapping the different character types such as alphabetic, digits, or special characters to a reduced representation of different patterns. For example, a telephone number like “(301) 754-6350” can be represented as “(DDD) DDD-DDDD.” Once the abstract patterns are created, they can also be subjected to frequency analysis, allowing such assessment like:

  • Format and/or pattern analysis, which involves inspecting representational alphanumeric patterns and formats of values and reviewing the frequency of each to determine if the value patterns are reasonable and correct.

  • Expected frequency analysis, or reviewing those columns whose values are expected to reflect certain frequency distribution patterns, validate compliance with the expected patterns.

Recall again that the identification of potential issues can only be verified as a problem by review with a business process subject matter expert.


Leave a Reply

Your email address will not be published. Required fields are marked *