By Ira Whiteside
Melissa Data’s BI Architect
This Saturday in Cleveland (SQLSAT#60), I will be presenting several code samples based on a series of articles and posts identifying similar records (Fuzzy Matching) between two different sources or grouping of records from a single source, based on existing column string of values.
We will have a booth at
(SSIS, TSQL and MDS) – Record Linkage (Fuzzy Matching) Advanced Implementations
We will implement several code samples based on a series of articles and posts identifying similar records between two different sources or grouping of records from a single source, based on existing column string of values.
We will define an approach, review actual implementations with various SQL tools (TSQL, VB, SSIS and MDS). There are many facets to review. I would like to start by drawing concepts from several articles from experienced practitioners, either from academia or the commercial market.
They will be footnoted at the end of the session. Although we are discussing matching, we need to address several steps prior to getting to the actual use of matching algorithms. For this post, we will focus on the pragmatic steps required to accomplish identifying the relationships between records, and address how these steps fit into an overall data governance and data quality of MDM architecture.
We will review real world implementations of several hundred million records. The steps required are shown here in a basic diagram.
The steps are as follows:
1. Cleansing and standardization
2. Group records
3. Split records
4. Compare records and determine scores
5. Split into separate match categories
6. Analyze results of matches
7. Evaluate using match tools to determine if best algorithms have been combined. We will then implement this approach in TSQL and SSIS as well as explore the use of Fuzzy Matching in MDS.
I will also be presenting at SQLSAT #47 Phoneix and the SQL Rally in
Creating a Metadata Mart with SSIS – Data Governance, Data Profiling & Code Generation
Recently I have had a client that required migrating and re-hosting a data warehouse from the Oracle environment to the Netezza Appliance. The application had hundreds of tables and almost 2900 Source to Target Mappings. This would have taken thousands of hours of manual coding. For this client, we implemented a reusable data-driven architecture, via SSIS, that relies on a metadata mart. The end result a greatly reduced TCO (development effort) for generating the code required vs. manual coding.
In addition, the process leaves behind a metadata mart to report on for auditing and other Data Governance efforts. We generated over 100,000 lines of code and over 3,000 load scripts. After implementation the application can be maintained via the tables and automatically regenerated, providing a pro-active or self-healing application architecture to respond to changes in incoming source files or target changes.