By David Loshin
While I have discussed the methods used for parsing, standardization, and matching in past blog series, one thing I alluded to a few notes back was the need for increased performance of these methods as data volumes grow.
Let’s think about this for a second. Assume we have 1,000 records, each with a set of data attributes that are selected to be compared for similarity and matching. In the worst case, if we were looking to determine duplicates in that data set, we would need to compare each record against all the remaining records. That means doing 999 comparisons 1,000 times, for a total of 999,000 comparisons.
Now assume that we have 1,000,000 records. Again, in the worst case we compare each record against all the others, and that means 999,999 comparisons performed 1,000,000 times, for a total of 999,999,000,000 potential comparisons. So if we scale up the number of records by a factor of 1,000, the number of total comparisons increases by a factor of 1,000,000!
Of course, our algorithms are going to be smart enough to figure out ways to reduce the computational complexity, but you get the idea – the number of comparisons grows quadratically with the number of records. And even with algorithmic optimizations, the need for computational performance remains, especially when you realize that 1,000,000 records is no longer considered to be a large number of records – more often we look at data sets with tens or hundreds of millions of records, if not billions.
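One common optimization of the kind alluded to here is blocking: group records by a cheap key so that the expensive pairwise comparisons happen only within each group. This is a minimal sketch, not any particular product's implementation, and the choice of blocking key (here, a hypothetical prefix of a name field) is purely illustrative:

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, block_key):
    """Generate candidate pairs only within blocks that share a key,
    instead of enumerating all n * (n - 1) / 2 pairs."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

# Illustrative records; block on the first three characters of the name.
records = [
    {"name": "Smith, John"},
    {"name": "Smith, Jon"},
    {"name": "Jones, Mary"},
]
pairs = list(blocked_pairs(records, lambda r: r["name"][:3].lower()))
# Only the two "Smith" records become a candidate pair;
# "Jones" is never compared against either of them.
```

The trade-off is that records which should match but land in different blocks are never compared, which is why blocking keys must be chosen carefully.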
In the best scenario, performance scales with the size of the input. New technologies enable the use of high performance platforms, through hardware appliances, software that exploits massive parallelism and data distribution, and innovative methods for data layouts and exchanges.
In my early projects on large-scale entity recognition and master data management, we designed algorithms that would operate in parallel on a network of workstations. Today, these methods have been absorbed into the operational fabric, in which software layers adapt in an elastic manner to existing computing resources.
Either way, the demand is real, and the need for performance will only grow more acute as more data with greater variety and diversity is subjected to analysis. You can’t always just throw more hardware at a problem – you need to understand its complexity and adapt the solutions accordingly. In future blog series, we will look at some of these issues and ways that new tools can be adopted to address the growing performance need.