By David Loshin
The challenge is determining how to address the missing values, and unfortunately there is no magic bullet for inferring a value when no information is provided. On the other hand, one might consider some different approaches for determining whether a data element's value may legitimately be null, and if not, for finding a reasonable or valid value for it.
For example, linking data between different data sets can enable some degree of inference. If a record in one data set that is missing a value can be linked to a similar record in a different data set whose data elements are complete, then as long as certain rules are observed (such as timeliness and consistency rules), we can presume that the missing value can be filled in by copying from the linked record.
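A minimal sketch of that idea, assuming two hypothetical data sets keyed on a shared identifier (the record names, fields, and thresholds here are illustrative, not from the article): copy a missing value from a linked record only when a simple timeliness rule and a consistency rule both hold.

```python
from datetime import date

# Hypothetical data sets keyed on a shared customer identifier.
crm = {
    "C-1001": {"name": "Acme Corp", "phone": None, "updated": date(2024, 5, 1)},
}
billing = {
    "C-1001": {"name": "Acme Corp", "phone": "555-0100", "updated": date(2024, 6, 1)},
}

def fill_missing(target, source, field, max_staleness_days=365):
    """Copy a missing field from a linked record in another data set,
    subject to simple timeliness and consistency rules."""
    for key, rec in target.items():
        if rec.get(field) is not None:
            continue  # nothing missing here
        linked = source.get(key)
        if linked is None or linked.get(field) is None:
            continue  # no linked record, or it is missing the value too
        # Timeliness rule: skip source records too far out of date.
        if abs((rec["updated"] - linked["updated"]).days) > max_staleness_days:
            continue
        # Consistency rule: identifying fields must agree before copying.
        if rec["name"] != linked["name"]:
            continue
        rec[field] = linked[field]

fill_missing(crm, billing, "phone")
```

In practice the linkage key, the staleness threshold, and which fields must agree would all be governed by the data quality rules negotiated for those data sets.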
Alternatively, we could adjust the business processes, either by identifying situations in which a value is treated as mandatory when it really doesn't need to be, or by engineering aspects of a workflow to ensure that the missing data is collected before a transaction is gated to its subsequent stages.
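That workflow gate might be sketched as follows, under the assumption that each stage declares its own mandatory fields (the stage and field names are hypothetical): a transaction simply cannot advance until the required data has been collected.

```python
# Hypothetical mapping from workflow stage to the fields it requires.
REQUIRED_BY_STAGE = {
    "fulfillment": ["customer_id", "ship_address"],
    "invoicing": ["customer_id", "ship_address", "tax_id"],
}

def missing_fields(transaction, next_stage):
    """Return the mandatory fields still missing before entering next_stage."""
    required = REQUIRED_BY_STAGE.get(next_stage, [])
    return [f for f in required if transaction.get(f) in (None, "")]

def advance(transaction, next_stage):
    """Gate the transaction: refuse to advance while data is missing."""
    gaps = missing_fields(transaction, next_stage)
    if gaps:
        raise ValueError(f"cannot enter {next_stage}: missing {gaps}")
    transaction["stage"] = next_stage
    return transaction

order = {"customer_id": "C-1001", "ship_address": "1 Main St", "stage": "entry"}
advance(order, "fulfillment")  # succeeds: required fields are present
```

Note that only fields genuinely needed by the next stage are enforced, which also addresses the first point: a field that no stage requires need not be mandatory at all.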
These are just a few ideas, but the sheer fact that data incompleteness remains a problem is a testament to how little attention the issue receives. With the growing reliance on greater volumes of data streamed at higher velocities than ever before, the problems of missing and incomplete data sets are only going to become more acute, so perhaps now is a good time to start considering the negative impacts of missing data within your own environment!