Skip to content

Global Intelligence Blog

Insights and Analysis for the Data-Driven Enterprise

  • Data Quality Solutions
  • Request Demo
  • Blogs
    • Product News & Updates
    • Data-Driven Marketing
  • United States
    • United States
    • Australia
    • Germany
    • India
    • United Kingdom
  • 1-800-MELISSA

Global Intelligence Blog

Insights and Analysis for the Data-Driven Enterprise

On the Road with SQL Saturdays

Blog Administrator | Uncategorized | Fuzzy Matching, metadata mart, SSIS

By Ira Whiteside

Melissa Data’s BI Architect

 

This Saturday in Cleveland (SQLSAT#60), I will be presenting several code samples based on a series of articles and posts identifying similar records (Fuzzy Matching) between two different sources or grouping of records from a single source, based on existing column string of values.

 

We will have a booth at Cleveland. Please stop by to chat with me or get a demo of our newest Fuzzy Matching SSIS Components and a free 45-day demo DVD.

 

(SSIS, TSQL and MDS) – Record Linkage (Fuzzy Matching) Advanced Implementations

 

We will implement several code samples based on a series of articles and posts identifying similar records between two different sources or grouping of records from a single source, based on existing column string of values.

 

We will define an approach, review actual implementations with various SQL tools (TSQL, VB, SSIS and MDS). There are many facets to review. I would like to start by drawing concepts from several articles from experienced practitioners, either from academia or the commercial market.

 

They will be footnoted at the end of the session. Although we are discussing matching, we need to address several steps prior to getting to the actual use of matching algorithms. For this post, we will focus on the pragmatic steps required to accomplish identifying the relationships between records, and address how these steps fit into an overall data governance and data quality of MDM architecture.

 

We will review real world implementations of several hundred million records. The steps required are shown here in a basic diagram.

 

The steps are as follows:

 

1. Cleansing and standardization

2. Group records

3. Split records

4. Compare records and determine scores

5. Split into separate match categories

6. Analyze results of matches

7. Evaluate using match tools to determine if best algorithms have been combined. We will then implement this approach in TSQL and SSIS as well as explore the use of Fuzzy Matching in MDS.

 

I will also be presenting at SQLSAT #47 Phoneix and the SQL Rally in Orlando in May.

 

Creating a Metadata Mart with SSIS – Data Governance, Data Profiling & Code Generation

 

Recently I have had a client that required migrating and re-hosting a data warehouse from the Oracle environment to the Netezza Appliance. The application had hundreds of tables and almost 2900 Source to Target Mappings. This would have taken thousands of hours of manual coding. For this client, we implemented a reusable data-driven architecture, via SSIS, that relies on a metadata mart. The end result a greatly reduced TCO (development effort) for generating the code required vs. manual coding.

 

In addition, the process leaves behind a metadata mart to report on for auditing and other Data Governance efforts. We generated over 100,000 lines of code and over 3,000 load scripts. After implementation the application can be maintained via the tables and automatically regenerated, providing a pro-active or self-healing application architecture to respond to changes in incoming source files or target changes.

 

February 1, 2011

Post navigation

What Happens When Bad Data Strikes a Good Company?
So What the Heck is Data Quality Anyway?

  • Comments (0)
  • Write a Comment

Cancel Reply

Subscribe

About This Site

All data goes bad (up to 25% per year), whether due to data entry errors or the simple fact that consumers change jobs, move, update email addresses, marry, etc. At Melissa, we help companies harness the value of their Big Data, legacy data, and people data (names, addresses, phone numbers, and emails) to drive insight, maintain data quality, and support global intelligence.

Search

Find Us

Address
22382 Avenida Empresa
Rancho Santa Margarita, CA 92688-2112

 

Hours
Monday—Friday: 9:00AM–5:00PM PST

Proudly powered by WordPress | Theme: UNISCO by SnapThemes.