By David Loshin
More concretely, the example we used was mapping individuals to their car
purchase preferences, but different applications used different car
classifications that did not share the same number of values and the value sets
did not directly map in a one-to-one manner. The potential result is confusion
in interpreting the results, especially if this classification is just one
variable used for creating a customer profile.
One way to address this is to put a standards policy for classification
dimensions in effect by selecting a single set of concepts, mapping those
conceptual values to a standard single set of values, and then insisting that
any application that uses that conceptual domain always use the standard.
This sounds simple, but it actually may entail some effort, since no one person
may be aware of all the places that any specific classification domain is used.
This task goes beyond a “data management” activity and essentially becomes a
“data governance” one involving a broad solicitation across the community of
data consumers to determine the classification dimensions used and the
enumerations of values employed within each dimension.
At the same time, the analyst spearheading this effort must have a plan for
capturing the classification data, harmonizing values across variant lists,
selecting a standard, communicating the standard, and then ensuring that the
standard is put into practice.
Establishing good practices and processes for domain harmonization and
standardization is an important topic to be considered in upcoming posts, but
next time we will look at a growing challenge for classification domains:
aligning data from unstructured text with the standard classification