Data Quality Assessment: Sparsity and Nullness

Blog Administrator | Analyzing Data, Analyzing Data Quality, Data Quality, Data Quality Assessment | , , , , , ,

By David Loshin

The first set of data quality assessment techniques that use column value frequency analysis focuses on the relationship of the population of values to the business processes that consume the data. The intent is to understand how the relative population of the column is associated with defined (or implicit) business rules, and then isolate and validate those rules.

Accepted rules can be integrated into operational systems as controls for validation when the data is created or acquired, thereby reducing the potential downstream negative impacts of data flaws.

Sparseness analysis identifies columns that are infrequently populated. Null analysis is used to identify potential default null values that are not database system nulls. These values crop up all the time, and I am sure you have seen them.

Some examples include: “None,” “N/A,” “X,” “XX,” “9999999,” etc. These default null values are important to identify for a number of reasons. The existence of default nulls like those examples indicates a flaw in the system that forces the provision of a value at the time of creation even when no value can be provided.

We have seen systems that insist on the user entering data into required fields, particularly when there are no values that could be correct, such as a “home phone number” for a person that only uses a mobile phone, or a “first name” field for a corporate customer.

In these cases, the user is forced to provide some value, so determining the existence of this forced data entry helps the analysts isolate where in the information flow that dependency exists so that it can be resolved more effectively. Identifying default nulls is also beneficial when one can determine any consistency or patterns in their use.

It is one thing to see a multitude of default null values, but very different to find that “N/A” is used consistently, even when the data originates from different sources and different people. In addition, considering the different types of null patterns helps in the analysis of sparseness to see how frequently the column has a valid value altogether. Regarding sparseness, there is an implicit presumption that sparsely-populated columns are unused or are devolving into unused data elements.

The absence of data can indicate a data quality issue, especially as more organizations seek to increase their data collection and retention. However, recall that the objective analysis only indicates potential issues. These must be put in context by reviewing the potential issues with business subject matter experts.