By David Loshin
- Determination of identifying attributes: specifying the data elements that, when composed together, provide enough information to differentiate between records representing different entities;
- Identity resolution in the presence of variation: having the right algorithms, tools, and techniques for using the identifying attribute values to search for and find matching records among a collection of source data sets; and
- Performance management: tuning the algorithms and tools properly so that processing time scales as close to linearly as possible as the volumes of data grow.
When I was first on a team that attacked these problems in the mid-1990s, the data sets were an order of magnitude larger than those organizations typically analyzed. And while those same data volumes would seem puny today, the lessons learned remain very pertinent, since organizations continue to struggle with the same challenges. In fact, one might say that the issues have only become more acute, as the increased volumes magnify the challenges.
For one thing, even if the number of records grows, the widths of the tables typically do not. That means that the variety of the values assigned to sets of data elements may seem to decrease, since more records end up sharing the same values, making it more difficult to find the right combination of <attribute, value> pairs to be used for unique identification and differentiation.
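To make the search for identifying attributes concrete, here is a minimal Python sketch that looks for the smallest combinations of attributes whose values are unique across a set of records. The function name `distinguishing_combinations`, the field names, and the sample records are all hypothetical, chosen only for illustration; a real data set would also need to deal with missing values and with variant spellings of the same value.

```python
from itertools import combinations

def distinguishing_combinations(records, attributes, max_size=3):
    """Return the smallest attribute combinations whose values differ for every record."""
    for size in range(1, max_size + 1):
        found = []
        for combo in combinations(attributes, size):
            # Project each record onto the candidate combination of attributes.
            keys = [tuple(rec[a] for a in combo) for rec in records]
            # The combination differentiates the records only if no key repeats.
            if len(set(keys)) == len(records):
                found.append(combo)
        if found:
            return found
    return []

# Hypothetical sample data: no single attribute and no pair of attributes is unique.
records = [
    {"first": "John", "last": "Smith", "zip": "20901", "birth_year": 1970},
    {"first": "John", "last": "Smith", "zip": "20902", "birth_year": 1970},
    {"first": "Jane", "last": "Smith", "zip": "20901", "birth_year": 1982},
    {"first": "John", "last": "Jones", "zip": "20901", "birth_year": 1970},
]

result = distinguishing_combinations(records, ["first", "last", "zip", "birth_year"])
print(result)
```

In this small sample, three attributes are needed before every record can be told apart. Note that the brute-force search examines every combination up to `max_size`, so it is only practical for a modest number of candidate attributes.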
On the other hand, the increased number of records does create more opportunities for errors to be introduced, especially during manual data entry, highlighting the importance of good algorithms and tools for matching and linkage.
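As a rough sketch of what matching in the presence of data entry errors can look like, the Python below compares identifying attribute values approximately rather than exactly, using the standard library's `difflib.SequenceMatcher`. The helper names (`normalize`, `similarity`, `records_match`), the sample records, and the 0.85 threshold are assumptions made for illustration; production matching tools generally rely on more specialized comparison functions and carefully tuned thresholds.

```python
from difflib import SequenceMatcher

def normalize(value):
    """Lowercase the value and collapse runs of whitespace."""
    return " ".join(value.lower().split())

def similarity(a, b):
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def records_match(rec_a, rec_b, attributes, threshold=0.85):
    """Treat two records as a match if their average attribute similarity meets the threshold.

    The threshold is an illustrative assumption, not a recommended setting.
    """
    scores = [similarity(str(rec_a[a]), str(rec_b[a])) for a in attributes]
    return sum(scores) / len(scores) >= threshold

# Hypothetical records: the second contains a typo and inconsistent capitalization.
original = {"name": "Jonathan Smith", "city": "Springfield"}
mistyped = {"name": "Jonathon  Smith", "city": "springfield "}
different = {"name": "Mary Jones", "city": "Portland"}

print(records_match(original, mistyped, ["name", "city"]))
print(records_match(original, different, ["name", "city"]))
```

An exact comparison would treat the first two records as different entities, while the approximate comparison tolerates the single-character typo and the formatting differences.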
And of course, the larger the data sets, the greater the need for scalability.
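One commonly used way to keep the number of comparisons from growing quadratically is blocking: records are grouped by an inexpensive key, and only records that share a key are compared with each other. The Python below is a minimal sketch of that idea; the blocking key (first letter of the last name plus ZIP code) and the sample records are hypothetical, and the actual savings depend entirely on how evenly the chosen key distributes the records.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    """Hypothetical blocking key: first letter of the last name plus the ZIP code."""
    return (record["last"][:1].upper(), record["zip"])

def candidate_pairs(records):
    """Yield index pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    # Only records within the same block are paired for detailed comparison.
    for members in blocks.values():
        yield from combinations(members, 2)

# Hypothetical sample data.
records = [
    {"last": "Smith", "zip": "20901"},
    {"last": "Smyth", "zip": "20901"},
    {"last": "Jones", "zip": "20901"},
    {"last": "Smith", "zip": "30301"},
]

pairs = list(candidate_pairs(records))
all_pairs = list(combinations(range(len(records)), 2))
print(len(pairs), "candidate pairs instead of", len(all_pairs))
```

The trade-off is that a matching pair whose members land in different blocks, for example because of a typo in the ZIP code, is never compared, so the blocking key has to be chosen with the expected kinds of errors in mind.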
In the next set of posts, we will look at each of these issues in much greater detail and consider how these specific challenges have changed over the past twenty years.