Making Sense Out of Missing Data

By David Loshin

I have spent the past few blog posts considering different aspects of null values and missing data. As I mentioned last time, it is easy to test for incompleteness, especially when system nulls are allowed. And even in older systems, the number of ways that missing or null data is represented is finite, making it easy to describe rules for flagging incomplete records.
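To make that concrete, here is a minimal sketch of such a flagging rule in Python. The placeholder tokens, field names, and sample records are all made up for illustration; a real system would build its own finite list by profiling the data.

```python
# Hypothetical list of legacy placeholders that stand in for "no value".
# In practice this list would come from profiling the actual data.
NULL_TOKENS = {"", "n/a", "na", "null", "none", "unknown", "?", "-"}


def is_missing(value):
    """Return True for a system null or a known placeholder string."""
    if value is None:
        return True
    return str(value).strip().lower() in NULL_TOKENS


def incomplete_fields(record, required):
    """List the required fields whose values are missing in this record."""
    return [name for name in required if is_missing(record.get(name))]


# Made-up sample records
records = [
    {"id": 1, "name": "Ann Lee", "phone": "555-0100"},
    {"id": 2, "name": "N/A", "phone": None},
    {"id": 3, "name": "Raj Patel", "phone": " unknown "},
]

# Map each incomplete record's id to the fields it is missing
flagged = {}
for rec in records:
    missing = incomplete_fields(rec, ["name", "phone"])
    if missing:
        flagged[rec["id"]] = missing
```

Because the set of placeholders is finite, the whole rule reduces to a membership test, which is why flagging is the easy part.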

The challenge is determining how to address the missing values, and unfortunately there are no magic bullets to infer a value when there is no information provided. On the other hand, one might consider some different ideas for determining whether a data element’s value may be null, and if not, how to find a reasonable or valid value for it.

For example, linking data between different data sets can enable some degree of inference. If we can link a record that is missing a value in one data set with a similar record in a different data set whose data elements are complete, then as long as certain rules are observed (such as timeliness and consistency rules), we could presume that the missing value can be completed by copying it from the linked record.
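One way this might look in code is sketched below. Everything here is an assumption made for illustration: the one-year timeliness threshold, the exact-match consistency rule on the linking fields, the `updated` date on the source record, and the field names.

```python
from datetime import date, timedelta

# Assumed timeliness rule: the linked record must be at most a year old.
MAX_AGE = timedelta(days=365)


def fill_from_linked(target, source, key_fields, field, today):
    """Copy `field` from `source` into a copy of `target` if the rules hold.

    Returns `target` unchanged when there is nothing to fill or a rule fails.
    """
    if target.get(field) is not None:
        return target  # value already present, nothing to infer
    if source.get(field) is None:
        return target  # linked record is incomplete too
    # Assumed consistency rule: the linking fields must agree exactly.
    if any(target.get(k) != source.get(k) for k in key_fields):
        return target
    # Timeliness rule: reject a linked record that is too old.
    if today - source["updated"] > MAX_AGE:
        return target
    filled = dict(target)
    filled[field] = source[field]
    # Record that the value was inferred rather than collected.
    filled["inferred_fields"] = target.get("inferred_fields", []) + [field]
    return filled


# Made-up records from two hypothetical data sets
crm = {"name": "Ann Lee", "postal_code": "20852", "phone": None}
billing = {
    "name": "Ann Lee",
    "postal_code": "20852",
    "phone": "555-0100",
    "updated": date(2024, 3, 1),
}

fresh = fill_from_linked(crm, billing, ["name", "postal_code"], "phone",
                         today=date(2024, 6, 1))
stale = fill_from_linked(crm, billing, ["name", "postal_code"], "phone",
                         today=date(2026, 1, 1))
```

In this sketch `fresh` picks up the phone number from the billing record, while `stale` is left unfilled because the linked record fails the timeliness rule. Marking the copied value as inferred keeps the presumption visible to downstream consumers.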

Alternatively, we could adjust the business processes, either by identifying situations in which a value is treated as mandatory when it really doesn’t need to be, or by engineering aspects of a workflow to ensure that the missing data is collected before transactions are allowed to pass to their subsequent stages.
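Both ideas can be sketched together as a simple gate, shown below in Python. The stage name, fields, and conditions are hypothetical; the point is only that a field is required when its condition applies, and that a transaction cannot advance until the required values are present.

```python
# Hypothetical rules: each stage lists (field, condition) pairs, and a field
# is mandatory only when its condition holds for the given order.
REQUIRED_AT_STAGE = {
    "shipping": [
        ("street_address", lambda order: True),
        ("tax_id", lambda order: order.get("customer_type") == "business"),
    ],
}


def missing_for_stage(order, stage):
    """Return the fields that are required at this stage but not provided."""
    return [
        name
        for name, applies in REQUIRED_AT_STAGE.get(stage, [])
        if applies(order) and order.get(name) in (None, "")
    ]


def advance(order, stage):
    """Move the order to `stage`, or refuse if required data is missing."""
    missing = missing_for_stage(order, stage)
    if missing:
        raise ValueError(f"cannot enter {stage}: missing {', '.join(missing)}")
    return {**order, "stage": stage}


# Made-up orders: a consumer does not need a tax ID, a business does.
consumer = {"customer_type": "consumer", "street_address": "1 Main St"}
business = {"customer_type": "business", "street_address": "9 Oak Ave"}

shipped = advance(consumer, "shipping")  # passes the gate
try:
    advance(business, "shipping")
    blocked = None
except ValueError as err:
    blocked = str(err)  # the gate reports what still has to be collected
```

Making the requirement conditional avoids forcing users to invent placeholder values for fields that do not apply to them, which is one of the ways missing data gets disguised in the first place.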

These are just a few ideas, but the fact that data incompleteness remains a problem these days suggests that the issue is not given enough attention. With the growing reliance on greater volumes of data being streamed at higher velocities than ever before, the problems of missing and incomplete data sets are only going to become more acute, so perhaps now is a good time to start considering the negative impacts of missing data within your own environments!