Data Cleaning – The What, Why and How

Melissa AU Team | Data Cleansing

Deciding what to include in a new product range, setting price points, estimating delivery timelines: all of this and more depends on analyzing vast amounts of data. In 2020, on average, every individual generated 1.7 megabytes of data per second. But quantity isn’t everything. Poor data quality can cost businesses between $9.7 million and $14.2 million annually. As our reliance on data increases, so must the effort we put into cleaning it.

What Is Data Cleaning?

An inventory stock report mentions a product ‘blue jar’ with the number 56 beside it. There is no indication as to whether this is the number of pieces in stock, the cost price or the retail price, and that ambiguity makes the data quite useless. Data cleaning is all about identifying poor quality data such as this and correcting, completing or deleting it as needed to make the overall dataset more reliable and useful.

It is important to note that data cleaning is not the same as data transformation. Data cleaning deals with improving data consistency while the latter converts data from one format to another to make data processing simpler.

What Are The Quality Components Addressed By Data Cleaning?

To be classified as good quality, data must meet certain criteria and it must be:

• Complete
• Accurate
• Valid And Follow Defined Patterns
• Timely And Reflect Up To Date Values
• Formatted Consistently
• Relevant

Data cleaning addresses all of these aspects and minimizes the risk of making decisions based on erroneous data. Imagine the losses a company would face if it introduced a product at the wrong price point, or how its reputation would fare if it could not make deliveries on time.
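As a rough illustration of how one of these criteria can be measured, the sketch below reports the share of records that have a value for each field. The function name `completeness`, the field names and the sample records are all hypothetical, and treating `None` and empty strings as "missing" is an assumption, not a rule from any particular tool.

```python
def completeness(records, fields):
    """Return the fraction of records with a non-empty value for each field."""
    total = len(records)
    report = {}
    for field in fields:
        # Assumption: None and "" both count as missing values.
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        report[field] = filled / total if total else 0.0
    return report


# Hypothetical sample data for illustration only.
sample = [
    {"name": "Blue Jar", "stock": 56, "retail_price": None},
    {"name": "Red Jar", "stock": 12, "retail_price": 9.5},
]

report = completeness(sample, ["name", "stock", "retail_price"])
print(report)
```

A field with a low score is a candidate for the gap-filling and validation work that data cleaning involves.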

Steps Involved In Data Cleaning

Data cleaning can begin once quality standards and formats have been defined. The main steps involved are:

1. Identify And Correct Errors

First, data must be analyzed to identify errors and their sources. Simple typographic errors and similar slips can be addressed at this stage. For example, an email address entered with a typo can be corrected to its valid form.

Similarly, records with gibberish can be identified and flagged for deletion. For example, a customer may create an account with the name ‘hgljjjf’ if they are just browsing with no intention to shop. Such records are irrelevant and can skew analytics if left in the database.
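One simple way to surface gibberish names like the one above is a heuristic check. The sketch below is only an illustration: the rule (no vowels, or a run of five or more consonants), the function name `looks_like_gibberish` and the sample records are assumptions, and a real dataset would need flagged records reviewed by a person before deletion.

```python
import re

# Assumption: a run of five or more consonants is unlikely in a real name.
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]{5,}", re.IGNORECASE)
VOWELS = set("aeiouy")


def looks_like_gibberish(name: str) -> bool:
    """Heuristic: flag names with no letters, no vowels, or long consonant runs."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return True
    has_vowel = any(ch in VOWELS for ch in letters)
    return not has_vowel or bool(CONSONANT_RUN.search(name))


# Hypothetical sample records for illustration only.
records = [
    {"id": 1, "name": "hgljjjf"},
    {"id": 2, "name": "Priya Sharma"},
    {"id": 3, "name": "Tom Nguyen"},
]

# Flag for review rather than deleting outright.
flagged = [r["id"] for r in records if looks_like_gibberish(r["name"])]
print(flagged)
```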

2. Validate Data

Validating data is an important step to ensure that it is accurate and up to date. Rather than doing this manually, customer identities, addresses and phone numbers can be verified with software that compares the data in your records to reliable third-party databases. This is an important step not only in keeping data clean but also with respect to regulatory compliance and minimizing fraud.
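To make the idea concrete, here is a minimal sketch of validating a record against a pattern and against reference data. The in-memory dictionary `REFERENCE_AREA_CODES` is a hypothetical stand-in for a third-party database, the email pattern is deliberately loose, and the field names are assumptions made for the example.

```python
import re

# Hypothetical stand-in for a third-party reference source: area code -> suburb.
REFERENCE_AREA_CODES = {"2000": "Sydney", "3000": "Melbourne"}

# Deliberately loose pattern: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate(record: dict) -> list:
    """Return a list of problems found; an empty list means the record passed."""
    problems = []
    if not EMAIL_PATTERN.match(record.get("email") or ""):
        problems.append("email does not match expected pattern")
    expected_suburb = REFERENCE_AREA_CODES.get(record.get("area_code") or "")
    if expected_suburb is None:
        problems.append("area code not found in reference data")
    elif expected_suburb.lower() != (record.get("suburb") or "").strip().lower():
        problems.append("suburb does not match area code")
    return problems


good = {"email": "jane@example.com", "area_code": "2000", "suburb": "Sydney"}
bad = {"email": "jane.example.com", "area_code": "2000", "suburb": "Melbourne"}

print(validate(good))
print(validate(bad))
```

Returning a list of problems rather than a single pass/fail flag makes it easier to report why a record was rejected.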

3. De-duplicate Records

When different departments maintain their own databases, there’s a high chance of the data being duplicated. For example, the sales and accounts teams may maintain independent records for the same customer. As a result, each record may be incomplete on its own and the organization may not get an accurate picture of its customer demographics.

Duplicate records may also be created by human error, for example when the same name is entered in different ways. De-duplicate such records and merge them into a single ‘Golden’ record that can be accessed by anyone who needs it.
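The merge into a single record can be sketched as follows. This example assumes that a normalized email address identifies a customer and that the first non-empty value seen for each field wins; both are simplifications chosen for illustration, and the function names and sample records are hypothetical.

```python
def normalize_key(record):
    # Assumption: the same email (ignoring case and spaces) means the same customer.
    return record["email"].strip().lower()


def merge_duplicates(records):
    """Merge records that share a key, keeping the first non-empty value per field."""
    golden = {}
    for record in records:
        key = normalize_key(record)
        merged = golden.setdefault(key, {})
        for field, value in record.items():
            if value and not merged.get(field):
                merged[field] = value
        merged["email"] = key  # store the normalized form
    return list(golden.values())


# Hypothetical records held separately by two departments.
sales = {"name": "Jon Smith", "email": "Jon.Smith@example.com",
         "phone": "", "billing_address": None}
accounts = {"name": "Jonathan Smith", "email": "jon.smith@example.com ",
            "phone": "555-0100", "billing_address": "1 Example St"}

golden_records = merge_duplicates([sales, accounts])
print(golden_records)
```

Real de-duplication usually also needs fuzzy matching on names and addresses, which this sketch leaves out.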

4. Fill In The Gaps To Complete The Data

Incomplete addresses are one of the most common reasons for delayed deliveries. Many addresses are entered without the area code. Even something as small as omitting the floor number in a commercial office building could confuse the delivery agent.

Data cleaning identifies such gaps and attempts to complete them. Area codes can be added based on information gathered when verifying the data. This is also known as data enhancement.
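A minimal sketch of this kind of gap filling is shown below. The lookup table `AREA_CODE_BY_SUBURB` is a hypothetical example of information gathered during verification, and the `needs_review` flag is an assumption about how unresolved gaps might be tracked.

```python
# Hypothetical lookup built during verification: suburb -> area code.
AREA_CODE_BY_SUBURB = {"sydney": "2000", "melbourne": "3000"}


def fill_area_code(record):
    """Return a copy of the record with a missing area code filled in if possible."""
    completed = dict(record)  # leave the original record untouched
    if not completed.get("area_code"):
        suburb = (completed.get("suburb") or "").strip().lower()
        code = AREA_CODE_BY_SUBURB.get(suburb)
        if code:
            completed["area_code"] = code
        else:
            # No reliable source for the value: flag it instead of guessing.
            completed["needs_review"] = True
    return completed


filled = fill_area_code({"street": "1 Example St", "suburb": "Sydney", "area_code": ""})
unknown = fill_area_code({"street": "2 Example St", "suburb": "Nowhere", "area_code": None})
print(filled)
print(unknown)
```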

5. Standardize The Processes

Data cleaning cannot be a one-time exercise if you want to maintain a clean database. Even data that is accurate, complete and valid at the time of entry can decay in storage. For example, a customer may change their phone number or move to a different city. Thus, this is a process that must be performed regularly. Share the data cleaning process with the teams that source, use and analyze data to encourage them to use the data correctly. For example, if departments were to create copies of data, you would once again have to deal with duplicate records.
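One way to make the process repeatable is to record when each entry was last verified and re-check anything older than a set threshold. The sketch below assumes a hypothetical `last_verified` field and a 180-day limit; both are illustrative choices, not recommendations.

```python
from datetime import date, timedelta

# Assumption: records not verified within 180 days are due for re-validation.
MAX_AGE = timedelta(days=180)


def stale_records(records, today):
    """Return the records whose last verification is older than MAX_AGE."""
    return [r for r in records if today - r["last_verified"] > MAX_AGE]


# Hypothetical sample records for illustration only.
customers = [
    {"id": 1, "last_verified": date(2024, 1, 1)},
    {"id": 2, "last_verified": date(2024, 10, 1)},
]

due = stale_records(customers, today=date(2024, 12, 1))
print([r["id"] for r in due])
```

A check like this could run on a schedule so that decayed records are picked up without anyone having to remember to look.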

In conclusion

Working with clean data maximizes operational efficiency while minimizing unnecessary expenses. In today’s world, where customer satisfaction is key, it also protects the company’s reputation and increases perceived value. The quantity of data an organization holds will grow continuously, so focusing on quality and keeping data clean is critical. The earlier you put processes in place, the easier it is.