The Format of Nothing

Blog Administrator | Analyzing Data, Analyzing Data Quality, Data Governance, Data Management, Data Profiling, Data Quality | , , , , ,

By David Loshin

The first question I always wonder about missing data is about the format of the missing data, especially in systems that predate the concept of the “system null” value. For example, early systems maintained files storing tables with fixed-width columns. When one of a record’s field was missing a value, something had to be fitted into that field to ensure that the rest of the columns lined up correctly. That meant that the data field was not really empty, but that it had a value that represented nothing, such as blanks (or spaces) or zeroes.

Later systems allowed for different types of data values to be used as null
values. For example, in one data profiling exercise I worked on, a telephone
number field had the character string “No phone number provided” when the value
was not available. But the most interesting cases of representations of missing
values occurs when the application itself enforces a mandatory population of a
data field for which no value may be available.

In some cases, the data entry person can’t progress past a certain screen until
a set of mandatory fields are completed, and when no values are available, the
data entry person is forced to enter something, leading to some very creative
(and some not so creative) representations of missing data, such as “Not
Available,” “NA,” “N/A,” “None,” “none,” “no name,” as well as multiple versions
of question marks and other random character strings. In different data sets,
the variety of values is proportional to the number of people entering data,
especially when there is no data governance or standards.

So we have an interesting range of potential representations for null values,
ranging from well-formatted character strings to somewhat haphazardly created
data values. Yet, as we will see next week, the question of the formatting and
representation of null or missing values is less interesting than the meaning of
missing data…