By David Loshin
Yet, as more organizations look to merge data sets and feeds from different
sources, some challenges remain, particularly with the use of unstructured text
(such as that presented via Twitter or Facebook). People cannot be expected to
always conform to your organization’s data standards, and often use colloquial
terms or their own words to describe ideas that would map to your own standard values.
For example, if you wanted to identify the individuals who prefer to drive
“pickup trucks” (one of our standard values), it is not enough to scan for that
phrase. Many individuals will refer to their pickup truck using different terms,
such as a make and model (“Ford F-150,” “Chevy Silverado”), a different name
(“light truck”), or a nickname (“baby monster”). Each of these terms has to be
linked to the overall classification term.
This is an example of a simple hierarchy, in which one concept (“automobiles”)
is divided into a collection of smaller classes (the NHTSA classifications).
Each of those classes in turn contains other phrases and terms. Within each of
those included collections, there may be further nested categorization, such as
by make and then by model.
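
As a minimal sketch, the hierarchy described above could be held as a nested structure that runs from concept to class to make to model. The name HIERARCHY, the helper paths, and the few entries shown are assumptions for illustration (taken from the pickup truck example), not an official NHTSA list.

```python
# Hypothetical hierarchy: concept -> class -> make -> list of models.
# Entries are illustrative only and come from the pickup truck example.
HIERARCHY = {
    "automobiles": {
        "pickup trucks": {
            "Ford": ["F-150"],
            "Chevy": ["Silverado"],
        },
    },
}


def paths(node, prefix=()):
    """Yield every root-to-leaf path in the nested hierarchy as a tuple."""
    if isinstance(node, dict):
        # Inner level: descend into each named child.
        for name, child in node.items():
            yield from paths(child, prefix + (name,))
    else:
        # Leaf level: a list of model names.
        for leaf in node:
            yield prefix + (leaf,)


ALL_PATHS = list(paths(HIERARCHY))
```

Each path records the full chain of containment, so any model name can be traced back to the class and concept it belongs to.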
With a well-defined hierarchy for classification, unstructured text can be
scanned for matches with values that live within the hierarchy, and that enables
the standardized classification. To round out the example, a tweet exclaiming
the author’s love of “driving his Ford F-150” can be scanned, with the model
name extracted and located within the make and model hierarchy for pickup
trucks, thereby allowing us to register the author’s driving preference.
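
The scan described above might look like the following sketch. It assumes the hierarchy has already been flattened into a lookup from each known term to its standard value. The names TERM_TO_CLASS and classify are hypothetical, and simple case-insensitive phrase matching stands in for whatever text analysis a real system would use.

```python
import re

# Hypothetical lookup: every known term maps to its standard value.
TERM_TO_CLASS = {
    "pickup truck": "pickup trucks",
    "ford f-150": "pickup trucks",
    "chevy silverado": "pickup trucks",
    "light truck": "pickup trucks",
    "baby monster": "pickup trucks",
}


def classify(text):
    """Return the set of standard values whose terms appear in the text."""
    lowered = text.lower()
    found = set()
    for term, standard in TERM_TO_CLASS.items():
        # Match the whole term (with an optional plural "s"), and do not
        # match when the term is embedded inside a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"s?(?!\w)"
        if re.search(pattern, lowered):
            found.add(standard)
    return found


tweet = "I love driving my Ford F-150!"
preferences = classify(tweet)
```

Exact phrase matching of this kind misses misspellings and nicknames that are not yet in the lookup, so the lookup has to be extended as new terms are discovered.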