By David Loshin
The first involves a text analysis methodology for scanning text and determining which character strings and phrases are meaningful and which ones are largely noise.
The second capability maps the identified terms and phrases into existing known hierarchies and performs the classification.
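To make these two capabilities concrete, here is a minimal sketch of phrase extraction with noise filtering followed by classification against a known hierarchy. The noise list, the tiny taxonomy, and the function names are all illustrative assumptions, not part of any particular product:

```python
import re

# Assumed mini-hierarchy mapping known phrases to category paths;
# a real taxonomy would be far larger and externally maintained.
HIERARCHY = {
    "ford f-150": ("vehicles", "trucks"),
    "ford mustang": ("vehicles", "cars"),
    "honda civic": ("vehicles", "cars"),
}

# Assumed noise/stopword list for filtering out non-meaningful tokens.
NOISE = {"the", "a", "an", "my", "rt", "lol"}

def extract_phrases(text, max_len=3):
    """Yield candidate word n-grams from the text, skipping noise tokens."""
    tokens = [t for t in re.findall(r"[a-z0-9-]+", text.lower())
              if t not in NOISE]
    for n in range(max_len, 0, -1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def classify(text):
    """Return the hierarchy paths for phrases found in the known taxonomy."""
    return {p: HIERARCHY[p] for p in extract_phrases(text) if p in HIERARCHY}

print(classify("Just bought my Ford F-150, lol"))
# -> {'ford f-150': ('vehicles', 'trucks')}
```

The point of the sketch is the division of labor: the extraction step decides which strings are meaningful, and the lookup step does the classification, exactly the two capabilities described above.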
Both of these techniques would work perfectly as long as the input data is always correct and complete – quite an assumption. That is why we need to augment these approaches with data quality techniques, largely in the area of data validation and data standardization/correction. For example, I am particularly guilty of character transposition when I type, and am as likely to tweet about my “Frod F-150” as I would about my “Ford F-150.” In this example, the inexact spelling would lead to a failure to classify my automobile preference.
However, using data quality tools, we can create a knowledge base of standard transformations that map common error schemes to their most appropriate matches. Creating a transformation rule mapping “Frod F-150” to “Ford F-150” would suggest the likely intent, supplementing the classification process.
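A transformation knowledge base of this kind can be sketched as an exact-match lookup table backed by a fuzzy fallback that catches transpositions such as "Frod" for "Ford". The table contents, vocabulary, and function below are illustrative assumptions rather than any specific tool's API:

```python
import difflib

# Assumed knowledge base of standard transformations:
# common error patterns mapped to their most appropriate matches.
TRANSFORMATIONS = {
    "frod f-150": "Ford F-150",
    "fordf-150": "Ford F-150",
}

# Canonical vocabulary used for the fuzzy fallback match.
CANONICAL_TERMS = ["Ford F-150", "Ford Mustang", "Honda Civic"]

def standardize(term):
    """Map a raw term to its standardized form, where one can be found."""
    key = term.strip().lower()
    # 1. Exact hit in the transformation knowledge base.
    if key in TRANSFORMATIONS:
        return TRANSFORMATIONS[key]
    # 2. Fuzzy fallback: a close match against the canonical vocabulary
    #    catches character transpositions the table does not list.
    matches = difflib.get_close_matches(term, CANONICAL_TERMS, n=1, cutoff=0.8)
    if matches:
        return matches[0]
    # 3. No confident match: leave the input unchanged.
    return term

print(standardize("Frod F-150"))     # -> Ford F-150
print(standardize("Tesla Model 3"))  # -> Tesla Model 3 (unchanged)
```

Keeping the explicit rule table separate from the similarity fallback mirrors how data quality tools typically work: curated transformations capture known error schemes deterministically, while approximate matching suggests likely intent for errors not yet cataloged.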
In other words, integrating our text analytics tools with more traditional data quality methodology will not only (yet again) reduce inconsistency and confusion, it will also enhance the precision of analytical results and enable more robust customer profiling – a necessity for customer centricity.