How to Measure Data Quality – 7 Metrics to Assess the Quality of Your Data
Melissa AU Team
There’s no doubt about it – the better the quality of your data, the more useful it will be for decision making. Though many companies aspire to have ‘data-driven’ objectives, a study found that only 33% of firms trust the quality of the data enough to draw insights from it. Merely categorizing data as good or bad isn’t enough.
The only way to inspire confidence in data is to improve its quality against measurable indices. While every organization may measure data quality differently based on its needs, there are a few basic metrics that should always be considered. Here are 7 of them.
1. The Ratio Of Errors To Data
Errors in data could be in the form of inaccurate entries, missing information, outdated details, etc. The ratio of errors to data can be calculated by taking the number of such known data errors and dividing it by the size of the data set.
The larger the data set, the higher the risk of errors. However, if you have put a data quality improvement plan in place and it is effective, the ratio of errors to data size should fall or stay constant as the data set grows.
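As a minimal sketch of the calculation described above, the ratio can be tracked across successive snapshots of a data set. The function name and the snapshot figures are hypothetical and only illustrate the arithmetic.

```python
def error_ratio(known_errors: int, total_records: int) -> float:
    """Return the number of known errors divided by the data set size."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return known_errors / total_records


# Hypothetical snapshots taken as the data set grows: (known errors, records)
snapshots = [(50, 1_000), (90, 2_000), (120, 4_000)]
ratios = [error_ratio(errors, size) for errors, size in snapshots]

# The improvement plan looks effective if the ratio never rises
# from one snapshot to the next.
ratio_not_rising = all(later <= earlier for earlier, later in zip(ratios, ratios[1:]))

for (errors, size), ratio in zip(snapshots, ratios):
    print(f"{errors} errors in {size} records: ratio {ratio:.3f}")
print("Ratio falling or constant:", ratio_not_rising)
```

Note that the absolute number of errors rises in this example while the ratio falls, which is why the ratio is the more useful figure to watch.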
2. Number Of Empty Fields
Completeness is one of the most important data quality dimensions. For example, street addresses may be considered complete only if they have an apartment number, building name, street name, city name and pin code. When data is entered into the system, each detail is formatted into individual fields. Empty values refer to fields left blank. For example, if an address does not mention the pin code, the field would be blank and the address would be considered incomplete.
Quantifying the number of empty fields can thus give you a good idea of the overall completeness of the data set. As data goes through quality enhancement processes, the number of empty fields should fall.
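One possible way to quantify this, assuming each address is held as a dictionary with the five fields named above, is sketched below. The field names and sample records are illustrative assumptions, not a prescribed schema.

```python
# Assumed field names for the address example; a real schema may differ.
ADDRESS_FIELDS = ["apartment", "building", "street", "city", "pin_code"]


def is_empty(value) -> bool:
    """Treat missing values and whitespace-only strings as empty."""
    return value is None or str(value).strip() == ""


def count_empty_fields(records, fields=ADDRESS_FIELDS) -> int:
    """Count blank fields across all records."""
    return sum(1 for record in records for field in fields if is_empty(record.get(field)))


def completeness(records, fields=ADDRESS_FIELDS) -> float:
    """Share of fields that are filled in, between 0 and 1."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    return (total - count_empty_fields(records, fields)) / total


# Hypothetical sample data
addresses = [
    {"apartment": "4B", "building": "Rose Court", "street": "12 Hill Rd",
     "city": "Pune", "pin_code": "411001"},
    {"apartment": "7", "building": "Lake View", "street": "3 Park St",
     "city": "Kolkata", "pin_code": ""},
    {"apartment": "2A", "building": None, "street": "9 Main St",
     "city": "Chennai", "pin_code": "  "},
]

print("Empty fields:", count_empty_fields(addresses))
print(f"Completeness: {completeness(addresses):.0%}")
```

Running the same count before and after a quality enhancement pass gives a simple before-and-after comparison.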
3. Amount Of Dark Data
A study found that 80% of all data can be considered dark data. This refers to data that is stored by an organization but cannot be used. Common examples include email attachments, old versions of files, project notes by former employees, log files, transaction history, raw survey data, etc. Some of this data may be outdated, while in other cases it may simply be unstructured and hence inaccessible.
The more dark data an organization has, the higher the probability of data quality issues. Storing large amounts of dark data also increases the cost of data storage and can risk lowering the value of the entire data set.
4. Email Bounce Rates
Email marketing is one of the most popular ways to reach out to new and existing customers. Data quality plays an important role in determining the effectiveness of any email campaign. If the email addresses are inaccurate, incomplete or formatted incorrectly, the emails will bounce back. For example, emails will bounce back from addresses like firstname.lastname@example.org, john@@gmail.com, john@gmailcom, etc.
A high bounce rate reduces the effectiveness of the campaign and can damage a company’s reputation. In some cases, it could even lead to the sending server being blacklisted or to monetary fines. Hence, it is important to keep an eye on email bounce rates. Ideally, the bounce rate should be less than 2%.
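The sketch below shows how a bounce rate might be computed and compared with the 2% guideline, along with a rough syntax pre-check for malformed addresses. The regular expression is a deliberately simple assumption, not a full email address validator, and a syntactically valid address can still bounce.

```python
import re

# Simple illustrative pattern: exactly one "@", a non-empty local part,
# and a domain made of at least two dot-separated labels.
SIMPLE_EMAIL = re.compile(r"[^@\s]+@[^@\s.]+(\.[^@\s.]+)+")

BOUNCE_THRESHOLD = 0.02  # the 2% guideline mentioned above


def has_valid_syntax(address: str) -> bool:
    """Return True if the address passes the simple syntax check."""
    return SIMPLE_EMAIL.fullmatch(address) is not None


def bounce_rate(bounced: int, sent: int) -> float:
    """Return bounced emails divided by emails sent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return bounced / sent


for address in ["john@gmail.com", "john@@gmail.com", "john@gmailcom"]:
    print(address, "->", "ok" if has_valid_syntax(address) else "malformed")

# Hypothetical campaign figures
rate = bounce_rate(bounced=15, sent=1_000)
print(f"Bounce rate {rate:.1%}, within guideline: {rate < BOUNCE_THRESHOLD}")
```

Filtering out malformed addresses before a campaign is sent removes one avoidable source of bounces; the measured bounce rate then reflects the remaining problems, such as outdated or non-existent mailboxes.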
5. Cost Of Data Storage
Data storage costs can also indicate data quality issues. Whether data is used or not, as the amount of data stored increases, so does the cost of storage. Hence, if the data storage costs are increasing but you aren’t seeing a corresponding increase in data operations, there could be a problem. On the other hand, if the storage costs stay constant or decline as your data operations grow, the quality of data is likely to be higher.
For example, storing duplicate versions of customer data does not offer any additional value, but it does increase costs. When files are de-duplicated, the cost falls and the value of the data set increases.
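To illustrate the de-duplication point, here is a minimal sketch that treats two customer records as duplicates when their name and email match after trimming whitespace and ignoring case. That matching rule and the sample records are assumptions made for the example; real matching is usually more involved.

```python
def normalize(record: dict) -> tuple:
    """Build a comparison key from name and email, ignoring case and padding."""
    return (record["name"].strip().lower(), record["email"].strip().lower())


def deduplicate(records: list) -> list:
    """Keep the first occurrence of each customer, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = normalize(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical customer records
customers = [
    {"name": "Asha Rao", "email": "asha@example.com"},
    {"name": "asha rao ", "email": "ASHA@example.com"},
    {"name": "Ben Lee", "email": "ben@example.com"},
    {"name": "Ben Lee", "email": "ben@example.com"},
]

unique_customers = deduplicate(customers)
duplicate_share = 1 - len(unique_customers) / len(customers)
print(f"{len(customers)} stored, {len(unique_customers)} unique, "
      f"{duplicate_share:.0%} of stored records are duplicates")
```

The share of duplicate records gives a rough sense of how much storage is being spent on data that adds no value.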
6. Data Transformation Error Rates
Since data is collected from multiple sources, it may not always be available in a standardized format. Data transformation refers to converting data from one format to another. It can be in the form of adding data, deleting unwanted fields, standardizing salutations, combining data from different fields, etc.
For example, customer data may be combined from their purchase history, calls to the customer service team and information they share when signing up for an account. Data transformation that takes unacceptably long amounts of time or transformation operations that fail could be a sign of data quality issues.
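A transformation error rate can be estimated by counting how many inputs a transformation step rejects. The sketch below uses salutation standardization as the example; the lookup table, function names and sample values are all hypothetical.

```python
# Assumed mapping of raw salutations to a standard form.
SALUTATIONS = {
    "mr": "Mr.", "mr.": "Mr.",
    "mrs": "Mrs.", "mrs.": "Mrs.",
    "ms": "Ms.", "ms.": "Ms.",
    "dr": "Dr.", "dr.": "Dr.",
}


def standardize_salutation(raw: str) -> str:
    """Return the standard form, or raise ValueError for unknown input."""
    key = raw.strip().lower()
    if key not in SALUTATIONS:
        raise ValueError(f"unrecognized salutation: {raw!r}")
    return SALUTATIONS[key]


def transformation_error_rate(values, transform) -> float:
    """Share of values for which the transformation fails."""
    if not values:
        return 0.0
    failures = 0
    for value in values:
        try:
            transform(value)
        except (ValueError, TypeError, AttributeError):
            failures += 1
    return failures / len(values)


# Hypothetical raw input collected from different sources
raw_salutations = ["Mr", "mrs.", " DR ", "Mx?", ""]
error_rate = transformation_error_rate(raw_salutations, standardize_salutation)
print(f"Transformation error rate: {error_rate:.0%}")
```

Tracking this rate per source can show which input channel contributes most of the failed transformations.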
7. Time Taken To Derive Value
Another way to measure data quality is to assess how long it takes your team to make use of the data. If the data is correct, complete and structured correctly, it can be used as it is. If it is incorrect or values are missing, the data must first be verified and enhanced before it can be used. Thus, the length of time the team takes to derive value from a data set can be indicative of its quality.
That said, it must be noted that several other factors can influence the time taken to derive value from a data set. For example, using automated data transformation tools can shorten this time as compared to manually transforming data from one format to another.
Given that new data is constantly entering the database and data already present in the system gets outdated with time, improving data quality must be a consistent effort. There are several other metrics that may be used to assess data quality in addition to the ones listed above.
The idea behind tracking data quality metrics is not to achieve a 100% perfect data set, which would be next to impossible. Rather, by tracking quality along such metrics, an organization can assess whether it is moving in the right direction and work on improving its data quality verification and enhancement processes.