By David Loshin
It should not be surprising that many entity names are subject to variation as a byproduct of human interaction with business applications.
Consider the name of mega big-box retailer Wal-Mart, whose corporate name I have seen spelled as “Walmart,” “Walmarts,” “Wall-Mart,” “Wall-Marts,” “Wall-Mart’s,” “Wallmart,” “Wallmarts,” “Wal-Mart,” and even “Wal*Mart.” And this is just the beginning: once you add in variations in upper- and lower-casing, individual store identification numbers, and named locations, you can end up with an enormous number of distinct representations.
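To make the problem concrete, here is a minimal sketch (my own illustration, not a production technique) of the kind of normalization that collapses superficial spelling variants into a single match key. Each step is a deliberately crude heuristic and will over-merge real-world names; commercial standardization tools are far subtler.

```python
import re

def normalize_name(name: str) -> str:
    """Reduce superficial spelling variation to a single match key.

    Every step here is a crude heuristic: collapsing doubled letters,
    for example, would wrongly merge genuinely distinct names.
    """
    key = name.upper()
    key = re.sub(r"[^A-Z0-9]", "", key)   # drop hyphens, asterisks, apostrophes, spaces
    key = re.sub(r"(.)\1+", r"\1", key)   # collapse doubled letters ("WALL" -> "WAL")
    key = re.sub(r"S$", "", key)          # trim trailing plural/possessive S
    return key

variants = ["Walmart", "Walmarts", "Wall-Mart", "Wall-Marts",
            "Wall-Mart's", "Wallmart", "Wallmarts", "Wal-Mart", "Wal*Mart"]
print({normalize_name(v) for v in variants})  # all nine variants map to one key
```

Running this over the nine spellings above yields a single key, which is the whole point: downstream matching operates on the normalized key rather than the raw string.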
Variations tend to multiply as data keeps arriving from a variety of sources, especially in social media contexts that have no controls. It is unlikely that Twitter users will subject their tweets to any kind of name validation, and while some people are marvelously adept at typing on a smartphone keypad, most of us are subject to fumbling fingers on such a small palette for communication.
In other words, as the breadth of inputs increases, so do the opportunities for misspelling names, and the number of variations grows in proportion to the massive data volumes.
This means two things. First, businesses will increasingly need to rely on existing tools and techniques for parsing text, standardizing terms in context, recognizing and extracting entities, and resolving identities as part of a general capability for absorbing data from a variety of text-based sources. Second, these techniques must perform well and scale as data volumes grow.
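As a toy illustration of the identity-resolution step (again my own sketch, using only Python's standard-library string matcher rather than any real resolution engine), a candidate name can be compared against a hypothetical master entity list and matched only when similarity clears a threshold:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1] via Python's stdlib matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve(candidate: str, master: list, threshold: float = 0.8):
    """Return the best master-record match above the threshold, else None."""
    best = max(master, key=lambda m: similarity(candidate, m))
    return best if similarity(candidate, best) >= threshold else None

master = ["Wal-Mart", "Target", "Costco"]   # hypothetical master entity list
print(resolve("Wallmart", master))          # close enough to match the master record
print(resolve("Amazon", master))            # no record clears the threshold
```

Real identity-resolution tools combine many such signals (phonetic encodings, token reordering, contextual attributes) and, critically, are engineered to scale to the data volumes described above; pairwise comparison like this sketch does not.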
These conclusions should also not come as a surprise. But an interesting corollary is that entity identification and identity resolution are becoming necessary parts of the organizational information architecture, even in small and medium-sized businesses.