Data Quality Analyst/MVP Channel Manager
Two truths about data: Data is always changing. Data will always have problems. The two truths become one reality–bad data. Elusive by nature, bad data manifests itself in ways we wouldn’t consider and conceals itself where we least expect it. Compromised data integrity can be saved with a comprehensive understanding of the structure and contents of data. Enter Data Profiling.
Throw off the mantle of complacency and take an aggressive approach to data quality, leaving no opening for data contamination. How? Profiling.
More truths: Profiling is knowledge. Knowledge is understanding. That understanding extends to discovering what the problems are and what needs to be done to fix it.
Armed with Metadata
Metadata is data about your data. The analysis of gathered metadata with Profiling exposes all the possible issues to its structure and contents, giving you the information–knowledge and understanding–needed to implement Data Quality Regimens.
Here are only a few of the main types of Generic Profiling Metadata and the purpose of each:
- Column Structure – Maximum/Minimum Lengths and Inferred Data Type – These types of metadata provides information on proper table formatting for a target database. It is considered problematic, for example, when an incoming table has values which exceed the maximum allowed length.
- Missing Information – NULLs and Blanks – Missing data can be synonymous to bad data. This applies for example where an Address Line is Blank or Null, which in most cases is considered a required element.
- Duplication – Unique and Distinct Counts – This allows for the indication of duplicate records. De-duplication is a standard practice in Data Quality and is commonly considered problematic. Ideally, there should only be a single golden record representation for each entity in the data.
Other equally important types of Generic Profiling Metadata include Statistics for trends data; Patterns (ReqEx) allow for identifying deviations from formatting rules; Ranges (Date, Time, String and Numbers); Spaces (Leading/Training Spaces and Max Spaces between Words); Casing and Character Sets (Upper/Lower Casing and Foreign, Alpha Numeric, Non UTF-8) Frequencies for an overview of the distribution of records for report generation on demographics and more.
Metadata Revolution & New Face of Profiling
Right now the most powerful profiling tool for gathering Metadata is the Melissa Data Profiler Component for SSIS, which is used at the Data Flow level, allowing you to profile any data type that SSIS can connect with, unlike the stock Microsoft Profiling Component, which is only for SQL Server databases.
More importantly the Melissa Data Profiler offers over 100 types of Metadata including all the Generic Profiling Metadata mentioned here.
The innovative Melissa Data’s Profiler Component gathers Data Driven Metadata, which goes beyond the standard set of profiling categories. By combining our extensive knowledge on Contact Data, this allows us to get information not simply based on rules, norms, and proper formatting. Rather, it provides metadata with the aid of a back-end knowledge base. We can gather unique types of metadata such as postal code, State and Postal Code Mismatch, Invalid Country, Email Metadata, Phone and Names.
The secret to possessing good data goes back to a simple truth: understanding and knowledge of your data through profiling. The release of Melissa Data’s Profiler for SSIS allows you to take control of your data through use of knowledge base driven metadata. The truth shall set you free!
For more information on our profiling solutions, please visit our website.