SSIS Fuzzy Matching improving Performance
Melissa Team | |
I have a series of blog posts on Record Linkage and Fuzzy Matching Part 1 at Melissa Data’s Data Quality Authority Blog. Part 2 of the series covered the concepts of using a Grouping or Blocking Index to improve performance.
I have developed a simple SSIS example here to demonstrate how to implement this concept. Refer to my original blog for additional background.
In SSIS, the basic process involves using a Conditional Split to create multiple threads of records and applying the Fuzzy Grouping Transform.
For this example we assume a basic customer input with: First Name, Last Name, Address, State, Zip, and Phone.
I ran a sample of 10,000 records without using a “Blocking Index” and the execution time was: Finished: 10:30:07 PM; Elapsed time: 00:00:39.234.
I then ran a sample splitting the records into three threads by blocking on state. The execution time was: Finished: 10:28:21 PM; Elapsed time: 00:00:20.810.
In this case, the time was cut in half.
Here is the first package without splitting records via a Blocking Index.
Here is the setup for the Fuzzy Grouping: (You will note that except for First and Last Name, all columns are set for exact match.)
Now, the package with a Conditional Split for Stat and multiple Fuzzy Grouping Transforms:
Here is the setup for the Condition Split used to implement the Blocking Index:
There is much more ground to cover as we move along in our blog series at Melissa Data. I will provide more detailed examples, as well as how to use data profiling to help us determine the candidates for blocking.
Please join me for today’s webinar on In Depth Fuzzy Matching at 10am, PST.
Ira Warren Whiteside, BI Architect, Melissa Data
Please feel free to comment to this post or email me at firstname.lastname@example.org