To make data usable, data profiling, data integration and data quality should ideally be practiced in tandem, though they cannot be consolidated into a single practice. All three address related issues in data acquisition, assessment and improvement, and they are often conducted by the same team.
What Are Data Profiling, Data Integration And Data Quality?
Data profiling refers to the process of examining source data to understand its structure, content and interrelationships. Many businesses find it hard to justify time and resources for data profiling since it does not produce actionable deliverables on its own. Data profiling applies both to physical data and to semantic data such as master data, metadata and data models.
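To make profiling concrete, here is a minimal sketch in Python. The sample data and column names are purely illustrative; a real profiling pass would run against actual source extracts and cover far more checks (type inference, value patterns, cross-column relationships).

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical source extract; columns and values are illustrative only.
SAMPLE = """customer_id,name,signup_date
101,Alice,2023-01-15
102,Bob,
103,,2023-02-01
101,Alice,2023-01-15
"""

def profile(rows):
    """Return per-column row count, completeness ratio and distinct-value count."""
    stats = {}
    for row in rows:
        for col, val in row.items():
            s = stats.setdefault(col, {"rows": 0, "filled": 0, "distinct": Counter()})
            s["rows"] += 1
            if val:  # empty strings count as missing
                s["filled"] += 1
                s["distinct"][val] += 1
    return {
        col: {
            "rows": s["rows"],
            "completeness": s["filled"] / s["rows"],
            "distinct_values": len(s["distinct"]),
        }
        for col, s in stats.items()
    }

rows = list(csv.DictReader(StringIO(SAMPLE)))
report = profile(rows)
# report["signup_date"]["completeness"] -> 0.75 (one of four rows is blank)
```

Even this small report surfaces the kinds of findings profiling is after: a repeated `customer_id` hints at duplicates, and sub-1.0 completeness flags missing values.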
Data integration refers to bringing data from different sources together to create a single, unified record. It can take many forms, such as extract, transform, and load (ETL) and enterprise application integration (EAI). A crucial rule of data integration is that suspect data must never be loaded into a target database.
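The "never load suspect data" rule can be sketched as a tiny ETL loop that quarantines failing rows instead of loading them. The source rows, field names and validation rule below are assumptions for illustration:

```python
# Hypothetical source rows; a missing customer_id makes a row "suspect".
sales = [
    {"customer_id": "101 ", "item": "widget"},
    {"customer_id": "",     "item": "gadget"},  # suspect: missing key
]

def is_valid(row):
    """Assumed validation rule: every row must carry a customer key."""
    return bool(row.get("customer_id"))

target, quarantine = [], []
for row in sales:                                 # extract
    row = {k: v.strip() for k, v in row.items()}  # transform: trim whitespace
    if is_valid(row):
        target.append(row)                        # load clean rows only
    else:
        quarantine.append(row)                    # hold suspect rows for review
```

Quarantining rather than silently dropping rows keeps the rejected records available for the data quality team to repair and reload.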
Data quality measures ensure that every record is accurate, unique, valid, complete and consistent. For active databases, this must be a recurring task.
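Three of these dimensions (completeness, validity, uniqueness) can be checked per record in isolation; accuracy and consistency usually require a reference source, so they are omitted here. A minimal sketch, with an assumed record shape and a deliberately simple email pattern:

```python
import re

def check_record(rec, seen_ids):
    """Return quality issues for one record; seen_ids tracks keys across records."""
    issues = []
    if not all(rec.get(f) for f in ("customer_id", "email")):      # completeness
        issues.append("incomplete")
    if rec.get("email") and not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+",
                                             rec["email"]):        # validity
        issues.append("invalid_email")
    if rec.get("customer_id") in seen_ids:                         # uniqueness
        issues.append("duplicate")
    seen_ids.add(rec.get("customer_id"))
    return issues

seen = set()
first  = check_record({"customer_id": "1", "email": "a@b.com"}, seen)  # clean
second = check_record({"customer_id": "1", "email": "a@b.com"}, seen)  # repeat key
```

Because quality degrades as live data changes, a check like this belongs in a scheduled job, not a one-off cleanup.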
The Case For Coordinating Data Profiling, Data Integration And Data Quality
The three reasons to coordinate these practices are:
1. They Complement Each Other
Data profiling is often considered a subset of data integration and data quality, while data integration and data quality are considered independent practices. In reality, they usually work in tandem because they are complementary in nature.
For example, when a company integrates data from its sales and accounts departments, it may uncover a data quality issue such as inconsistent date formatting. Conversely, data quality can only be fully improved once data is integrated from all departments: a customer’s record in the sales department may show only what they bought, while the accounts department holds the payment details.
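The date-formatting mismatch above can be resolved at merge time by normalizing both sources to one format. The source formats, field names and values below are assumptions for illustration; here the sales system is assumed to use US-style dates and accounts ISO dates:

```python
from datetime import datetime

FORMATS = ("%m/%d/%Y", "%Y-%m-%d")  # assumed source date formats

def to_iso(raw):
    """Normalize a date string from any known source format to ISO 8601."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")

sales_row    = {"customer_id": "101", "purchase_date": "03/14/2023"}
accounts_row = {"customer_id": "101", "payment_date": "2023-03-20"}

# The unified record combines what each department knows, with consistent dates.
unified = {
    "customer_id":   sales_row["customer_id"],
    "purchase_date": to_iso(sales_row["purchase_date"]),
    "payment_date":  to_iso(accounts_row["payment_date"]),
}
```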
2. They Are Usually Applied Together
The tools and techniques used for data profiling, data integration, and data quality are often applied to other initiatives such as master data management, database consolidations and migrations, data warehousing, etc. These initiatives can succeed only if they run on high-quality, integrated datasets. Thus, coordinating the practices of data profiling, data integration, and data quality helps create the best dataset and improves the results of other data initiatives.
For instance, practicing data profiling, data integration and data quality together improves the accuracy of datasets and supports better-informed business decisions that make operations more efficient.
3. The Same Team Is Usually Involved
In many cases, the same team handles data integration and data quality improvement. This reduces the chance of repeating steps, keeps project deliverables in sync and enhances data value. It also makes staffing projects more flexible and improves productivity for data stewards, business analysts and other business and technical personnel.
Unifying Data Profiling, Data Integration And Data Quality
Unifying data profiling, data integration and data quality is all about aligning their cycles. All three practices are iterative, so they can be unified in many different ways, and as the cycle repeats you may find that certain tasks become optional. One way of looking at this cycle is to break it into three stages.
Analyze And Report
This stage is all about identifying issues with the data. Initially, the data must be profiled, but as the cycle repeats, profiling evolves into data monitoring. Data profiling starts with familiarizing oneself with the records and structures of all the data sources, noting their quality and defining relationships between the data.
An inventory of all data assets must be maintained at this stage to document the discoveries. As part of a long-term cycle, data monitoring assures steady improvement. It involves developing metrics to measure data quality, maintaining a historical record of the same and analyzing trends in data quality based on these metrics.
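The monitoring described here amounts to computing a quality metric each cycle, keeping its history and watching the trend. A minimal sketch, with an assumed completeness metric and an illustrative three-cycle trend window:

```python
def completeness(rows, field):
    """Fraction of rows with a non-empty value in the given field."""
    return sum(1 for r in rows if r.get(field)) / len(rows)

history = []  # historical record of the metric, one entry per cycle

def record_cycle(rows, field="email"):
    history.append(completeness(rows, field))

def trend_declining(window=3):
    """True if the metric fell strictly across the last `window` cycles."""
    recent = history[-window:]
    return len(recent) == window and all(a > b for a, b in zip(recent, recent[1:]))

# Three assumed monitoring cycles over a shrinking-quality dataset:
record_cycle([{"email": "a@x.com"}, {"email": "b@x.com"}])
record_cycle([{"email": "a@x.com"}, {"email": ""}])
record_cycle([{"email": ""}, {"email": ""}])
```

A declining trend like this is exactly the signal that should route a dataset back into the quality-improvement stage of the cycle.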
Collaborate And Develop
A successful data profiling, data integration and data quality cycle depends on prioritizing business management needs over data management needs. At this stage, a collaborative dialog is needed between IT and business personnel.
Data integration and quality measures must react to the needs of both. By coordinating the cycles, requirements and priorities are unified so they do not have to compete for resources. New solutions may need to be devised while in other cases, old projects may have to be revisited for updates.
As you move forward, you may have to add more data sources and profile them. You will also probably add data-dependent user organizations and hence will have to factor in their needs as well.
Integrate And Improve
The third stage is all about creating high-quality integrated data records. Administrators need to conduct short-term testing and schedule monitoring, integration and quality tasks to optimize performance without threatening data integrity.
Third-party data may also be acquired and appended to add value to the dataset. Data integration and data quality practices play key roles in standardizing records and transforming data from source models into models better suited to business purposes.
From here, the cycle returns to stage 1, either to profile new data sources in preparation for new development or to monitor the quality of existing datasets.
At every stage of a unified data profiling, data integration and data quality cycle, it is important not just to manage the data but also to add value to it. Going back and forth between data integration and data quality tasks is a sign of a mature unified cycle as it leverages the synergies of both practices to add value.
Coordinating the practices instead of conducting them in isolation improves efficiency and synergy, producing higher-quality data. The cycles are iterative; for the best results, align them pragmatically to suit the situation.