Validation of Data Rules
By David Loshin
As a preparatory step focusing the profiler for an assessment, the cross-column rules to be applied to each record are organized in a way such that as the table (or file) is scanned, the data attributes within each individual record’s that are the subject of a rule are extracted and submitted for assessment. If the record complies with all the rules, it is presumed to be valid. If the record fails any of the rules, it is reported as a violation and tagged with all of the rules that were not observed.
Likewise for the cross-table rules, the profiler will need to identify the dependent data attributes taken from the corresponding tables that need to be scanned for validation of referential integrity. Those column data sets can be subjected to a set intersection algorithm to determine if any values exist in the referring set that do not exist in the target (i.e., “referred-to”) data set.
Any items in the referring set that do not link to an existing master entity are called out as potential violations.
After the assessment step is completed, a formal report can be created and delivered to the data stewards delineating the records that failed any data quality rules. The data stewards can use this report to prioritize potential issues and then for root cause analysis and remediation.