By David Loshin

Over the past few blog posts, we have looked at the ability to define data quality rules asserting consistency constraints between two or more data attributes within a single data instance, as well as cross-table consistency constraints to ensure referential integrity. Data profiling tools provide the ability to capture these kinds of rules within a rule repository and then apply them against data sets as a method of validation.
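As a rough illustration (not any particular profiling tool's repository format), cross-column rules can be thought of as named predicates evaluated against a single record. The rule names and logic below are hypothetical examples in Python:

```python
# A minimal sketch of a rule repository: each cross-column rule is a named
# predicate over one record (a dict of attribute values). Rule names, fields,
# and logic are hypothetical, chosen only to illustrate the idea.

cross_column_rules = {
    # If a record is flagged as shipped, it must carry a ship date.
    "ship_date_required_when_shipped": lambda rec: (
        rec.get("status") != "SHIPPED" or rec.get("ship_date") is not None
    ),
    # The end date of a record may not precede its start date.
    "end_not_before_start": lambda rec: (
        rec.get("end_date") is None
        or rec.get("start_date") is None
        or rec["end_date"] >= rec["start_date"]
    ),
}
```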

As a preparatory step in focusing the profiler for an assessment, the cross-column rules to be applied to each record are organized so that, as the table (or file) is scanned, the data attributes within each individual record that are the subject of a rule are extracted and submitted for assessment. If the record complies with all of the rules, it is presumed to be valid. If the record fails any of the rules, it is reported as a violation and tagged with each rule that was not observed.
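Continuing the hypothetical sketch above, the scan-and-tag step might look like this: every record is checked against every rule, and failing records are collected along with the names of the rules they violated.

```python
def validate_records(records, rules):
    """Scan each record against every cross-column rule; a record is presumed
    valid only if it satisfies all of the rules. Failing records are reported
    along with the names of every rule they did not observe."""
    violations = []
    for row_number, record in enumerate(records, start=1):
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            violations.append({"row": row_number, "failed_rules": failed})
    return violations


# Hypothetical usage with the rule repository sketched earlier.
records = [
    {"status": "SHIPPED", "ship_date": None,
     "start_date": "2024-01-05", "end_date": "2024-01-02"},
    {"status": "OPEN", "ship_date": None,
     "start_date": "2024-01-05", "end_date": "2024-01-10"},
]
for violation in validate_records(records, cross_column_rules):
    print(violation["row"], violation["failed_rules"])
# Row 1 fails both rules; row 2 passes and is presumed valid.
```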

Likewise, for the cross-table rules, the profiler will need to identify the dependent data attributes from the corresponding tables that must be scanned to validate referential integrity. Those column data sets can be subjected to a set intersection algorithm to determine whether any values exist in the referring set that do not exist in the target (i.e., “referred-to”) data set.

Any items in the referring set that do not link to an existing master entity are called out as potential violations.
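In Python, the same check can be expressed with set operations (written here as a set difference, which is equivalent to removing the intersection from the referring set). The table and column names below are hypothetical:

```python
def referential_integrity_violations(referring_values, target_values):
    """Return the distinct values in the referring column that have no match
    in the referred-to (target) column -- i.e., potential referential
    integrity violations such as orphaned foreign keys."""
    return set(referring_values) - set(target_values)


# Hypothetical example: every customer ID on an order should exist in the
# customer master table.
order_customer_ids = [101, 102, 103, 999]
customer_master_ids = [101, 102, 103, 104]
print(referential_integrity_violations(order_customer_ids, customer_master_ids))
# {999} -- an order referring to a customer that does not exist
```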

After the assessment step is completed, a formal report can be created and delivered to the data stewards delineating the records that failed any data quality rules. The data stewards can use this report to prioritize potential issues and then perform root cause analysis and remediation.
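As one way to shape such a report, the violation records collected during the scan could be tallied per rule so that the most frequently broken rules surface first. A minimal sketch, assuming the violation format from the earlier example:

```python
from collections import Counter


def summarize_violations(violations):
    """Tally failures per rule so that data stewards can rank which rules
    (and the potential issues behind them) warrant root cause analysis first."""
    counts = Counter()
    for violation in violations:
        counts.update(violation["failed_rules"])
    return counts.most_common()


# Feeding in the records produced by validate_records() above would yield
# something like [('end_not_before_start', 42), ('ship_date_required_when_shipped', 7)]
# (the counts here are purely illustrative).
```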
