Record Linkage and Fuzzy Matching Part 2
This blog series covers the steps needed for efficient data/record processing that includes a record linkage or fuzzy matching step. In Part 1, we covered the overall approach.
Today, we will cover the following steps:
2. Split records
In academia, this is known as creating a “Blocking Index.” (We will cover cleansing next; I am jumping ahead because I like to start with the end in mind, and the end in this case is the fastest possible matching process.) Cleansing and standardization are critical, but grouping, or blocking, is the central concept affecting speed.
A Blocking Index is simply a collection, or block, of records that share equal values in specific fields or columns; notice I said equal/exact, not similar. Same state, same zip code, same address, etc.
1. Categorize (State, Zip, etc.)
2. Split into separate streams
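For readers who want a preview of the weeds, here is a minimal sketch of those two steps in Python. The sample records, the field names (state, zip) and the helper build_blocking_index are assumptions made up for illustration, not part of any particular matching tool.

```python
from collections import defaultdict

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"id": 1, "name": "Jon Smith", "state": "TX", "zip": "75001"},
    {"id": 2, "name": "John Smith", "state": "TX", "zip": "75001"},
    {"id": 3, "name": "Jane Doe", "state": "TX", "zip": "75002"},
    {"id": 4, "name": "Jayne Doe", "state": "OK", "zip": "73001"},
]


def build_blocking_index(rows, key_fields):
    """Group rows whose values in key_fields are exactly equal."""
    blocks = defaultdict(list)
    for row in rows:
        # Step 1: categorize the record by its exact field values.
        key = tuple(row[field] for field in key_fields)
        # Step 2: send the record to the stream (block) for that category.
        blocks[key].append(row)
    return dict(blocks)


blocks = build_blocking_index(records, ["state", "zip"])
for key, rows in blocks.items():
    print(key, [row["id"] for row in rows])
```

Only records 1 and 2 land in the same block here, so they are the only pair that would go on to a fuzzy comparison.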
The big problem with fuzzy matching performance is that it is FUZZY (similar), not EXACT (equal). If you get a matching tool and start comparing data sets, I guarantee the tool is frantically trying to get you to define as many columns or fields as possible for an EXACT match (Blocking Index) before you do any FUZZY stuff.
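To put a rough number on why the exact match comes first, the sketch below counts how many pairwise comparisons are needed with and without blocking. The data set (10,000 records spread evenly over 100 zip codes) is a made-up assumption, and the helper names are hypothetical.

```python
from collections import Counter
from math import comb


def pair_count(n):
    """Number of distinct pairs among n records (n choose 2)."""
    return comb(n, 2)


def blocked_pair_count(block_keys):
    """Number of pairs when records are only compared within their own block."""
    sizes = Counter(block_keys)
    return sum(comb(size, 2) for size in sizes.values())


# Hypothetical data set: 10,000 records spread evenly over 100 zip codes.
keys = [f"zip{i % 100}" for i in range(10_000)]

all_pairs = pair_count(len(keys))
blocked_pairs = blocked_pair_count(keys)

print(f"Without blocking: {all_pairs:,} comparisons")
print(f"With blocking:    {blocked_pairs:,} comparisons")
```

Under these assumptions, blocking cuts the work from 49,995,000 fuzzy comparisons to 495,000. Real data is rarely spread this evenly, so the actual savings will vary.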
Consider this: You are watching a children’s school concert and several dozen children are up on stage. Now, pick out the twins. You would probably start by looking for groups based on hair color, hair length, etc., long before you start comparing faces. This is, in essence, grouping or blocking. So, you line up the blonds on the left and the brunettes on the right. You now have two blocks.
So, given that, we agree you need to leverage grouping or blocking. The next step in identifying the twins is to repeat the process within each group you created, using a new attribute each time, until you have found the twins. Compare all blonds, then brunettes, and so on. Then, move on to short hair, long hair, and so on. Finally, move on to similar face shapes (Ahhh, FUZZY).
Hair is blond or brunette, long or short, but a face is a collection of features that together form a pattern, an image. Our brains will instinctively look for faces that are similar, and then compare more closely. The obvious point here is to only begin comparing faces once we have narrowed down the group of children to a few.
Next time, we will back up and discuss record cleansing and standardization. In our children’s concert example, having all the children in uniform and with clean faces speeds up our evaluation, because we are not distracted by confusing patterns of clothing or dirty faces.
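In record terms, "comparing faces" only inside a small group might look like the sketch below. It uses Python's standard difflib.SequenceMatcher as a stand-in similarity measure; the sample blocks, the 0.85 threshold and the helper names are assumptions for illustration, and a real matching tool may score similarity quite differently.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a 0..1 similarity score between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_matches_in_blocks(blocks, field, threshold):
    """Fuzzy-compare records only against others in the same block."""
    matches = []
    for rows in blocks.values():
        for left, right in combinations(rows, 2):
            score = similarity(left[field], right[field])
            if score >= threshold:
                matches.append((left["id"], right["id"], round(score, 2)))
    return matches


# Hypothetical blocks, already grouped on exact (state, zip) values.
blocks = {
    ("TX", "75001"): [
        {"id": 1, "name": "Jon Smith"},
        {"id": 2, "name": "John Smith"},
        {"id": 3, "name": "Mary Jones"},
    ],
    ("OK", "73001"): [
        {"id": 4, "name": "Jayne Doe"},
    ],
}

matches = fuzzy_matches_in_blocks(blocks, "name", threshold=0.85)
print(matches)
```

The expensive similarity call runs only for pairs that already share a block, which is the whole point of narrowing the group down first.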
As I said in the first blog, I will eventually get into the weeds (technical examples), but I would like to cover the concepts behind each step in our approach first.
Please feel free to comment on this post or email me at firstname.lastname@example.org