Cleanse Match for SQL Server Data Quality Service (DQS)
Melissa IN Team | Data Quality Services |
Data can help companies identify challenges and opportunities and make timely decisions for their growth. Despite being aware of this, 58% of a survey’s respondents claimed that at least half their regular decisions were based on experience and gut feel rather than data.
One of the key reasons for this is unreliable data quality. Data Quality Services introduced in the SQL Server help overcome some of the challenges of improving data quality and, in turn, minimizing risks of wrong analytics and inaccuracies.
What are Data Quality Services (DQS)?
DQS uses a batch mode, automated or interactive approach to improve data quality and integrity. It provides users with an extensive knowledge base and makes the SQL Server user friendly. Data stewards can apply the knowledge base to incoming data to assess the quality and make changes through its Cleansing and Matching policies.
What is the DQS Cleansing Data Feature?
The DQS client tool uses a computer-aided process to analyze how the data set conforms to the knowledge base and highlights entries that need to be reviewed/ modified. The key elements here are:
• Identification as Correct/Incorrect
All data in the set is assessed and compared to the knowledge base and categorized as categories such as New, Correct, Incorrect, Auto Suggested, etc. Any data found to be incorrect or invalid is highlighted and brought to the attention of the data steward.
• Suggestions for replacements/ corrections
DQS uses a computer-assisted and interactive process to cleanse the incorrect/ invalid data. The former uses the knowledge gained from the knowledge base to automatically process the data and make suggestions. For example, typos in addresses are quite common. The DQS could change the entries such as ‘Mian Street’ to ‘Main Street’.
Next, it takes an interactive approach that allows the data steward to approve or reject the suggestions made. In some cases, the suggestions may also be modified before being approved.
• Standardization
By using domain rules and reference values, the DQS then standardizes data values according to a set format. This could be as simple as capitalizing all customer names or standardizing the format of customer addresses. Standardization of entries makes the data more relatable and minimizes the risk of duplication.
• Enrichment
Based on the knowledge base, DQS can also enrich data. For example, a customer may have entered their address as House number 1, Main street, Pune 411060. This can be enriched to be read as House number 1, Main street, Pune, Maharashtra 411060. This helps create complete data sets and bridge the gaps in information.
What is the DQS Matching Data Feature?
Apart from inaccuracies in data, duplication is another factor that affects data quality scores. The DQS data matching process helps overcome this challenge. It analyzes the records to identify potential duplicates and returns weighted match probabilities between all the records analyzed. This comparison may be made by matching sources against a look-up table or its source. Data stewards can then take appropriate action. It is usually done in four steps:
• Matching Policy Training
Here the knowledge base is prepared for matching by defining DQS probability rules. It identifies the domains to be used for matching and specifies the weight carried by each value. Data stewards can decide between exact matches or matches with a certain degree of similarity. Rules may also be set to replace special characters with a space to make it easier to differentiate between strings.
• Matching
Once the rules have been defined, DQS de-duplicates records by comparing each row in the source data to every other row on the basis of the matching policy. Matching is most effective when conducted after data cleansing. This way, the data is free from errors and comparisons are easier. Like cleansing, data matching uses both a computer-assisted and an interactive process.
• Auto Approving
When two rows are found with comparable data, they are assigned a matching score. Depending on the threshold score defined by a data steward in the matching policy, only those records with a match probability higher than the threshold will be considered a match.
• Survivor-ship
The survivorship rule refers to a policy that decides which record will be deleted in event of a match. Some cases may preserve the longest and most complete record while in other cases, it may be decided to keep the default record. If multiple records meet the survivorship rule, a random record may be selected as the one to be kept.
Today, data quality underlines the reliability of all data-driven decisions. It is not enough to simply have a large data cache. Data Quality Services in SQL Server can play a pivotal role in cleansing, standardizing, enriching and de-duplicating data. This makes the data more reliable, useful and helpful for future business decisions.