The Value of Data Quality in Machine Learning

Melissa IN Team | Data Quality, India, Machine Learning

Artificial Intelligence and Machine Learning are gradually making their way from research applications to solving real-world problems. Applications range from healthcare and industrial logistics to marketing and product design. That said, the results are relevant and dependable only if the data fed into the Machine Learning models meets certain quality standards.

Poor-quality data skews results, making the model unreliable. For example, during the COVID pandemic, several labs in Florida reported only positive test results, thus inflating the overall positivity rate. In this case, the data set was incomplete.

Data for Machine Learning applications is captured from many sources. Customers share their details when placing orders, filling in surveys, and interacting with call centre agents and sales staff. They may knowingly or unknowingly make mistakes while sharing their details.

For example, e-commerce website visitors might enter a pin code other than their own when checking whether a gift can be delivered. Or they may make a typographic error while entering their phone number. The method of data capture also influences data quality. For example, if the address field on a form has a character limit, the address captured may be incomplete.

Let’s take a look at how data quality parameters impact Machine Learning applications and how to ensure that data is of good quality.

Incorrect Data

Incorrect data affects operations as well as decisions based on Machine Learning algorithms. For example, the difference between 571101 and 571110 seems minuscule, but when these are postal area codes, they refer to two distinct geographic areas: Bannur and Kupya. When the wrong code is fed into a Machine Learning model, it can skew results and lead to questionable inferences.

Let’s say a brand wants to choose between two locations for a new store. Its data shows 12,000 customers from 571101 and 10,000 customers from 571110. The Machine Learning model would indicate Bannur to be the better location. But if 3,000 pin codes were wrongly listed as 571101 instead of 571110, the true counts are 9,000 and 13,000, and opening a store in Bannur may limit the brand’s profit potential.
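The arithmetic behind the store-location example can be checked with a few lines of Python. This is only an illustrative sketch: the counts come from the example above, while the "pick the pin code with the most customers" rule is a simplifying assumption standing in for whatever model a brand would actually use.

```python
# Customer counts per pin code as recorded (figures from the example above).
recorded = {"571101": 12000, "571110": 10000}

# From the example: 3000 records listed as 571101 really belong to 571110.
mislabelled = 3000

corrected = {
    "571101": recorded["571101"] - mislabelled,
    "571110": recorded["571110"] + mislabelled,
}


def best_location(counts):
    """Return the pin code with the most customers (assumed decision rule)."""
    return max(counts, key=counts.get)


print(best_location(recorded))   # 571101 (Bannur)
print(best_location(corrected))  # 571110 (Kupya)
```

The recommendation flips once the mislabelled records are moved to the correct pin code, even though the total number of customers is unchanged.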

Missing Data

An incomplete data set is another quality issue that can affect machine learning model results. When it comes to data, every detail matters. Missing data can lead to incorrect results or the creation of duplicate records.

For example, food delivery apps rely on Machine Learning models to show customers the restaurants closest to them. Let’s say a restaurant in Bombay has two branches in the suburbs, one in Bandra and one in Andheri, but the addresses do not include this detail. A customer living in Bandra may not realize the difference and place an order with the Andheri branch. The order would take considerably longer to be delivered, spoiling the experience.
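A simple completeness check can catch records like these before they reach a model. The sketch below is an assumption-laden illustration: the field names in REQUIRED_FIELDS and the sample branch records are hypothetical, not taken from any real delivery app.

```python
# Hypothetical schema: the fields every branch address is expected to carry.
REQUIRED_FIELDS = ("name", "street", "locality", "city")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in a record."""
    return [f for f in required if not str(record.get(f) or "").strip()]


# Made-up sample data for illustration only.
branches = [
    {"name": "Example Restaurant", "street": "1 Example Street",
     "locality": "Bandra", "city": "Bombay"},
    {"name": "Example Restaurant", "street": "2 Example Street",
     "locality": "", "city": "Bombay"},
]

for branch in branches:
    gaps = missing_fields(branch)
    if gaps:
        print(f"{branch['street']}: missing {', '.join(gaps)}")
```

Only the second record is reported, because its locality is blank. Without that field, the two branches cannot be told apart by suburb.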

Invalid Or Outdated Data

Machine learning models are often used to understand customer demographics. To get an accurate result, the data entered must reflect current values. If a customer moves from one city to another but the records still show their old address, the demographic categorization will be wrong. This, in turn, affects all further decisions. For example, the promotional emails they receive may not be relevant to them.

Duplicate Records

Duplicates are the bane of Machine Learning data inputs. Duplication can occur at the record level, where the same entity appears more than once, or at the feature level, where two features hold identical values. Duplicates are usually created in a database when records are incomplete, incorrect or outdated.

Duplicates in the dataset make feature selection harder for the data scientist and add to the computational cost of model training. Duplicated features introduce multicollinearity, and duplicated records give some observations more weight than they deserve. In the long run, this can produce a biased model.
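Record-level duplicates often hide behind small differences in spelling or spacing, so exact matching misses them. The sketch below shows one possible approach, assuming a record is identified by its name and phone number. The normalise and deduplicate helpers and the sample customers are hypothetical, and real matching tools typically use more sophisticated fuzzy matching.

```python
def normalise(record):
    """Build a comparison key: lower-case name, single spaces, digits-only phone."""
    name = " ".join(record["name"].lower().split())
    phone = "".join(ch for ch in record["phone"] if ch.isdigit())
    return (name, phone)


def deduplicate(records):
    """Keep the first record seen for each normalised key."""
    seen = set()
    unique = []
    for record in records:
        key = normalise(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Made-up sample data for illustration only.
customers = [
    {"name": "John Smith", "phone": "98765 43210"},
    {"name": "john  smith", "phone": "9876543210"},
    {"name": "Jane Doe", "phone": "91234 56789"},
]

print(len(deduplicate(customers)))  # 2
```

The first two entries collapse into one because they share the same key after normalisation. Keeping the first occurrence is a design choice for this sketch; a production system might instead merge the records.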

Incorrectly Formatted Data

For records to be comparable with one another, they must all be formatted in the same way. Inconsistent formatting is especially common when data is extracted from multiple sources, when it is manually updated by people from different departments, or when it is stored in siloed databases. When data is inconsistently formatted, it becomes difficult to interpret.

A difference in the way dates are formatted is the simplest example of how data formatting creates an issue. The sales department may save the date of purchase, November 5th, 2022, in the DD/MM/YYYY format as 05/11/2022. However, if the accounting department uses the MM/DD/YYYY format, they may read the date as May 11th, 2022. This can lead to misunderstandings and affect the reliability of decisions based on the data.
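The date ambiguity described above is easy to reproduce with Python's standard datetime module. In this small sketch the variable names are illustrative. The same string parses without error under both conventions, which is why the mistake tends to go unnoticed.

```python
from datetime import datetime

raw = "05/11/2022"  # the purchase date as stored by the sales department

# Read as DD/MM/YYYY (the sales department's convention).
as_sales = datetime.strptime(raw, "%d/%m/%Y").date()

# Read as MM/DD/YYYY (the accounting department's convention).
as_accounting = datetime.strptime(raw, "%m/%d/%Y").date()

print(as_sales.isoformat())       # 2022-11-05
print(as_accounting.isoformat())  # 2022-05-11
```

Storing dates in an unambiguous format such as ISO 8601 (YYYY-MM-DD), as the isoformat() output does, is one common way to avoid this class of error.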

Noise In The Data Stream

Noise refers to data elements that are inconsistent with the rest of the data. For example, if all customer names are saved in first name, last name format, the initial ‘M’ in ‘John M. Smith’ could be considered noise. Noise elements may also be outliers. For example, a value of 80 in the age field could be a noisy element if the other values range between 10 and 20.

Noise is often unavoidable and reflective of real-world values, and a small amount of it can help a model become more fault tolerant. That said, noisy elements that carry no useful signal add no value to the model and can contribute to record duplication. They can also skew models and reduce the accuracy of their predictions.
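Outliers like the age of 80 in the example above can be flagged automatically. The sketch below uses the interquartile range (IQR) rule, which is one common heuristic chosen here for illustration and not a method prescribed by this article. The sample ages and the 1.5 multiplier are assumptions.

```python
import statistics

# Made-up sample: ages mostly between 10 and 20, plus one value of 80.
ages = [12, 15, 11, 18, 14, 80, 17, 13, 19, 16]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the middle 50% of the data."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]


outliers = iqr_outliers(ages)
print(outliers)  # [80]
```

Because noise can reflect genuine real-world values, flagged entries are better treated as candidates for review than removed automatically.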

Preparing Data For Machine Learning Models

Abraham Lincoln is often credited with saying, “Give me six hours to chop down a tree and I will spend the first five sharpening the axe.” Data scientists can apply the same principle to preparing data for machine learning models.

For machine learning models to produce useful results, all data must be cleaned and enriched before it enters the database. Data cleaning involves comparing the data entered against a reliable third-party database in order to correct common typographic inaccuracies, verify that details are valid, remove invalid data, standardize formats, fill in missing information and eliminate duplicates.
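A few of these cleaning steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a substitute for a dedicated data cleaning tool. VALID_PIN_CODES is a hypothetical stand-in for a reliable third-party reference database, and the field names and sample records are made up. The sketch standardizes formats, removes records that fail validation and eliminates duplicates. It does not attempt to correct typos or fill in missing information.

```python
import re

# Hypothetical stand-in for a reliable third-party reference database.
VALID_PIN_CODES = {"571101", "571110"}


def clean_record(record):
    """Standardise a single record; return None if it fails validation."""
    name = " ".join(record.get("name", "").split()).title()
    pin = re.sub(r"\D", "", record.get("pin_code", ""))  # keep digits only
    if not name or pin not in VALID_PIN_CODES:
        return None
    return {"name": name, "pin_code": pin}


def clean_batch(records):
    """Clean records in bulk, dropping invalid entries and duplicates."""
    cleaned, seen = [], set()
    for record in records:
        result = clean_record(record)
        if result is None:
            continue
        key = (result["name"], result["pin_code"])
        if key not in seen:
            seen.add(key)
            cleaned.append(result)
    return cleaned


# Made-up sample data for illustration only.
raw = [
    {"name": "  john   smith ", "pin_code": "571 101"},
    {"name": "John Smith", "pin_code": "571101"},
    {"name": "Jane Doe", "pin_code": "000000"},
]

print(clean_batch(raw))  # [{'name': 'John Smith', 'pin_code': '571101'}]
```

Here clean_record can be applied to individual records as they are captured, while clean_batch handles existing records in bulk.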

There are many data cleaning tools that can automate this process. Ideally, you need a data cleaning tool that can clean individual records as they are added to the database as well as clean records in bulk to screen for duplicates and outdated data.