Over at the LinkedIn Group for Data Matching run by Henrik Liliendahl Sorensen, Bill Winkler, principal researcher at the US Census Bureau, has shared several reference papers on “blocking.” They are excellent, and I wanted to share them with you.
According to Winkler, “The following three papers are primarily concerned with ‘blocking.’ The third gives a methodology for estimating false negatives (false non-matches) in a narrow range of situations.”
Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. (2003), “Robust and Efficient Fuzzy Match for On-Line Data Cleaning,” ACM SIGMOD ’03, 313-324.
Baxter, R., Christen, P., and Churches, T. (2003), “A Comparison of Fast Blocking Methods for Record Linkage,” Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, http://datamining.anu.edu.au/publications/2003/kdd03-6pages.pdf
Winkler, W. E. (2004c), “Approximate String Comparator Search Strategies for Very Large Administrative Lists,” Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM (also report 2005/06 at http://www.census.gov/srd/papers/pdf/rrs2005-02.pdf).
I have reviewed “Approximate String Comparator Search Strategies for Very Large Administrative Lists,” and found it very helpful.
It seems to me that reviewing and applying the 11 blocking criteria used in the capture-recapture methodology for estimating missed matches would be very helpful for those of us looking for conceptual best practices or guidelines when using various tools to match basic customer information such as name, address, and phone.
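To make the idea concrete, here is a minimal Python sketch of multi-pass blocking on name, address, and phone fields, followed by a two-list capture-recapture estimate of missed matches. Everything in it is my own assumption for illustration: the sample records, the field names, the two blocking keys, and the counts passed to the estimator are all hypothetical, and the keys are not the 11 criteria from the paper. The estimator is the simple Lincoln-Petersen formula, which I am using as a stand-in for the general idea and not as the paper's own procedure.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical customer records; field names are illustrative only.
records = [
    {"id": 1, "last": "Smith", "first": "John", "zip": "20233", "phone": "3015550101"},
    {"id": 2, "last": "Smyth", "first": "John", "zip": "20233", "phone": "3015550101"},
    {"id": 3, "last": "Smith", "first": "Jon", "zip": "20233", "phone": "3015550199"},
    {"id": 4, "last": "Jones", "first": "Mary", "zip": "10001", "phone": "2125550123"},
    {"id": 5, "last": "Jones", "first": "Marie", "zip": "10001", "phone": "2125550123"},
]


def key_zip_initial(record):
    """Hypothetical blocking key: ZIP code plus first letter of the surname."""
    return (record["zip"], record["last"][:1].upper())


def key_phone(record):
    """Hypothetical blocking key: full phone number."""
    return record["phone"]


def candidate_pairs(records, key_fn):
    """Group records by blocking key and return the pairs that share a block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key_fn(record)].append(record["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimate_missed(n1, n2, both):
    """Lincoln-Petersen estimate of matches missed by two blocking passes.

    n1, n2: matches found under pass 1 and pass 2; both: found under both.
    Assumes the two passes capture matches roughly independently.
    """
    if both == 0:
        raise ValueError("no overlap between passes; estimate is undefined")
    estimated_total = n1 * n2 / both
    found = n1 + n2 - both
    return estimated_total - found


pass1 = candidate_pairs(records, key_zip_initial)
pass2 = candidate_pairs(records, key_phone)
all_pairs = pass1 | pass2  # only these pairs go on to detailed comparison

total_possible = len(records) * (len(records) - 1) // 2
print(f"candidate pairs: {len(all_pairs)} of {total_possible} possible")

# Hypothetical match counts, purely to show the arithmetic.
print(f"estimated missed matches: {estimate_missed(900, 800, 750):.1f}")
```

The point of running more than one pass is that a pair missed by one key (say, a mistyped ZIP code) can still be caught by another (the phone number), and the overlap between passes is what lets us estimate how many true matches no pass caught.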
I am interested in continuing to break these down into manageable chunks of information and coding samples for the everyday matching practitioner, of which I am humbly one. Many thanks to Bill for his recommendations.
Please feel free to comment on this post or email me at email@example.com.