Disparate, Dirty, Duplicated Data – Understanding the 3Ds of Bad Data
In 1999, NASA learned the hard way just how expensive bad data can be when it lost the Mars Climate Orbiter. Why did this happen? Because one engineering team made its calculations in imperial units while NASA’s navigation team worked in metric units.
A simple failure to ensure that the data used the same units of measurement cost NASA hundreds of millions of dollars. Such is the impact of bad data.
When talking of bad data quality, there are three ‘D’s that come into play – dirty data, disparate data and duplicate data.
- Dirty Data
Entering an address as ‘Main Str’ instead of ‘Main Street’, typographic errors, using numbers in fields intended only for letters – these are some of the most common examples of dirty data. Such data issues can be categorized as:
- Incorrect spellings
- Extra or missing spaces
- Incomplete information
- Incorrect use of upper/lower cases
- Use of abbreviations and nicknames
- Incorrect use of punctuation and symbols
These may seem like small, inconsequential errors, but data specialists and analysts spend a considerable amount of their time simply cleaning dirty data like this. Leaving it as is is simply not an option. How can you expect delivery agents to reach customers on time if they do not have a complete address or cannot understand the street name?
And imagine a customer’s frustration if they were to receive a promotional email that addresses them by a misspelled name…
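Many of these issues can be caught with simple, rule-based cleaning. The Python sketch below normalizes a single address field by collapsing stray whitespace, fixing casing and expanding a few common abbreviations; the abbreviation map and the field choice are illustrative assumptions, not a complete solution.

```python
# Illustrative abbreviation map – a real tool would use a much larger,
# locale-aware dictionary.
ABBREVIATIONS = {"str": "Street", "ave": "Avenue", "rd": "Road"}

def clean_address(raw: str) -> str:
    """Collapse whitespace, fix casing and expand known abbreviations."""
    cleaned = []
    for token in raw.split():
        key = token.lower().rstrip(".")
        cleaned.append(ABBREVIATIONS.get(key, token.capitalize()))
    return " ".join(cleaned)

print(clean_address("  main   str"))  # Main Street
```

Rules like these handle the predictable cases; the long tail of typos still needs dedicated cleansing tools or human review.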
- Disparate Data
Companies collect data from various sources. In theory, this helps create a cohesive record. But the issue with collecting data from multiple sources is that every source may use a different format to record and present data. A difference in date formats is the simplest example.
The sales team may record dates in the DD/MM/YYYY format while the accounts team records them in the MM/DD/YYYY format. Thus, the accounts team may read 06/12/2020 as the 12th of June while the sales team means the 6th of December. It’s a small misunderstanding that can have dramatic impacts on sales projections, marketing plans, etc.
Disparate data refers to data extracted from different sources and stored in varied data formats. This type of bad data keeps analysts from getting a deeper insight and makes it difficult for them to derive anything of value from the data.
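To make the date example concrete, here is a short Python sketch showing how the same string yields two different dates depending on which format an application assumes:

```python
from datetime import datetime

raw = "06/12/2020"

# Sales team convention: day first.
as_sales = datetime.strptime(raw, "%d/%m/%Y")
# Accounts team convention: month first.
as_accounts = datetime.strptime(raw, "%m/%d/%Y")

print(as_sales.date())     # 2020-12-06
print(as_accounts.date())  # 2020-06-12
```

Neither reading is wrong in isolation; the damage happens when both conventions end up in the same data bank without a record of which one was used.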
- Duplicate Data
Duplication is the third and, in many ways, the biggest data quality issue. There are many reasons why your databank may contain duplicate records.
- A new record may be created every time information is updated. For example, a new record may be created every time a customer makes a purchase instead of updating the original record.
- New records may be created every time a customer interacts with the brand through a different medium. For example, let’s say a customer places an order through a brand’s website. The next time, he places an order through the app. Instead of using a single account for both interactions, he may create two different accounts – one under his first name and one under his last name.
- New records may be created when customers re-register with new phone numbers or email IDs.
- System glitches may save the same record more than once.
Duplicate records make a data bank very unreliable. Think of it this way – the marketing team looks at a data bank of 500 records. Of these, 300 seem to be in a particular geographic area, and hence the team decides to open a new branch there for easier accessibility.
However, 120 of those records are duplicates. Thus, the new branch will in reality cater to only 180 customers. Had the team known this, it may not have decided to open a store in that particular location.
Eliminating all duplicate records manually is simply not possible. For example, a person may create accounts as ‘Aditya Chauhan’, ‘Adi Chauhan’, ‘A. Chauhan’, etc. While some records may share the same email address, others may have only the same phone number. Thus, to truly de-duplicate records, you need an algorithm that compares all the data rows and weeds out cases with even the lowest probability of duplication.
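A minimal version of such a comparison can be sketched in Python with the standard library’s difflib: treat an exact email or phone match as strong evidence on its own, and otherwise fall back to fuzzy name similarity. The field names and the 0.6 threshold are illustrative assumptions; production matching engines use far more sophisticated scoring.

```python
from difflib import SequenceMatcher

def likely_duplicates(a: dict, b: dict, threshold: float = 0.6) -> bool:
    # An exact match on email or phone is strong evidence on its own.
    if a.get("email") and a.get("email") == b.get("email"):
        return True
    if a.get("phone") and a.get("phone") == b.get("phone"):
        return True
    # Otherwise fall back to fuzzy similarity between the names.
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold

r1 = {"name": "Aditya Chauhan", "email": "adi@example.com", "phone": None}
r2 = {"name": "Adi Chauhan", "email": None, "phone": "555-0100"}
print(likely_duplicates(r1, r2))  # True – the names are highly similar
```

Pairwise comparison over every row is expensive at scale, which is one more reason dedicated deduplication tools exist.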
Dealing With The Three ‘D’s
Bad data becomes more expensive the longer it is kept, so it needs to be dealt with as early as possible. Logically, the first step is to put quality checks in place at the point of data collection. There are a number of tools that can help with this.
For example, address verification tools ensure that complete addresses are captured. Instead of relying on humans to type the complete information, an autocomplete feature can minimize errors and capture information in standardized formats. Similar tools can also compare new records against the existing database and keep duplicates from being created.
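As an illustration of a quality check at the collection source, the Python sketch below validates a record before it is saved. The required fields and the numeric-postcode rule are assumptions made for the example; real postcodes are alphanumeric in many countries.

```python
# Illustrative rules – which fields are required and how each is checked
# would come from your own data governance policy.
REQUIRED = {"name", "street", "city", "postcode"}

def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing {field}" for field in sorted(REQUIRED - record.keys())]
    # Assumed rule for this example only.
    if record.get("postcode") and not record["postcode"].isdigit():
        problems.append("postcode must be numeric")
    return problems

print(validate_record({"name": "Aditya Chauhan", "street": "Main Street",
                       "city": "Pune", "postcode": "411001"}))  # []
```

Rejecting or flagging a record at entry is far cheaper than cleaning it after it has spread into reports and mailing lists.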
Instead of blaming IT for bad data, organizations need to create data governance policies that standardize fields for records and minimize the issue of disparate data. Such a policy outlines how data is collected, processed and managed to ensure that it is accurate and consistent. It should ideally be flexible so that it can be adapted to changing needs.
Data quality checks at only the collection source are not sufficient to keep bad data out of your system. Data often goes bad simply with time. For example, a city may choose to rename a street, invalidating records that mention the old street name as part of the address. To counter this, data quality checks must be made a routine task.
All you need to do is find the right tools. For example, email verification tools can ping email addresses – without involving the company’s staff or the customer – to check whether an email ID is still in use. Addresses that have been discontinued can be flagged and removed from the system.
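A true verification tool probes the mail server itself, which is beyond a short example; the Python sketch below shows only the first, syntactic layer of such a check, using a deliberately simplified pattern:

```python
import re

# Deliberately simplified pattern – real verification tools also check MX
# records and probe the mail server, which no regex can do.
EMAIL_PATTERN = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

def looks_valid(email: str) -> bool:
    return EMAIL_PATTERN.match(email) is not None

print(looks_valid("user@example.com"))  # True
print(looks_valid("user@@example"))     # False
```

A syntactic check like this catches typos at entry; whether the mailbox still exists is a question only a live check can answer.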
Lastly, it is important to set realistic goals. Hoping to achieve 100% perfect data is setting yourself up for failure. Instead, your goal should be to make data credible and fit for its intended use by ensuring that it is accurate, complete, valid, standardized and accessible.