Data Quality Assessment: Value and Pattern Frequency
By David Loshin
- Range Analysis, which looks at the values, and allows the analyst to consider whether they can be ordered so as to determine whether the values are constrained within a well-defined range.
- Cardinality Analysis, which analyzes the number of distinct values that appear within the column to help determine if the values that actually appear are reasonable for what the users expect to see.
- Uniqueness, which indicates if each of the values assigned to the attribute is used once and only once within the column, helping the analyst to determine if the field is (or can be used as) a key.
- Value Distribution, which presents an ordering of the relative frequency (count and percentage) of the assignment of distinct values. Reviewing this enumeration and distribution alerts the analyst to any outlier values, either ones that appear more than expected, or invalid values that appear few times and are the result of finger flubs or other data entry errors.
In addition, good data profiling tools can abstract the value strings by mapping the different character types such as alphabetic, digits, or special characters to a reduced representation of different patterns. For example, a telephone number like “(301) 754-6350” can be represented as “(DDD) DDD-DDDD.” Once the abstract patterns are created, they can also be subjected to frequency analysis, allowing such assessment like:
- Format and/or pattern analysis, which involves inspecting representational alphanumeric patterns and formats of values and reviewing the frequency of each to determine if the value patterns are reasonable and correct.
- Expected frequency analysis, or reviewing those columns whose values are expected to reflect certain frequency distribution patterns, validate compliance with the expected patterns.
Recall again that the identification of potential issues can only be verified as a problem by review with a business process subject matter expert.