By David Loshin

In our last set of posts, we looked at householding – inferring relationships for grouping individuals together based on shared characteristics.
In this series, we look at how we manage the quality of the data representing those shared characteristics. First, let’s look at an example: organizing individuals based on their preferences for types of cars. There are a number of different classifications of cars, mostly focusing on car size, and these can be used for grouping individuals by preference.

And that is the problem: there are a number of different classifications of
cars, and without a defined standard, there’s bound to be confusion. Here are
three examples (I got them from a page at Wikipedia):

• The Highway Loss Data Institute (HLDI) classifies cars into five groups: Sports, Luxury, Large, Midsize, and Small.
• The National Highway Traffic Safety Administration (NHTSA) has eight classifications (based on curb weight of the car): mini passenger cars, light passenger cars, compact passenger cars, medium passenger cars, heavy passenger cars, sport utility vehicles, pickup trucks, and vans.
• The Environmental Protection Agency (EPA) has a car classification based on interior and cargo space: two-seaters, minicompacts, subcompacts, compacts, mid-size, large, small station wagons, mid-size station wagons, and large station wagons.

One application might assign a demographic classification based on the HLDI
groupings, while another might use the NHTSA classification. The categories in
the two schemes do not line up one-to-one, though: the set of small HLDI cars
might include the NHTSA sets of mini passenger cars, light passenger cars, and
compact passenger cars.
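
To make the mismatch concrete, here is a minimal Python sketch of a crosswalk table that translates NHTSA classes into HLDI groups. The mapping holds only the three NHTSA classes mentioned above, and even that is the "might include" case rather than an official correspondence. The names NHTSA_TO_HLDI, to_hldi, and group_by_hldi are made up for illustration.

```python
# Hypothetical crosswalk from NHTSA classes to HLDI groups.
# Only the three classes named in the text are mapped; this is an
# illustrative assumption, not an official correspondence.
NHTSA_TO_HLDI = {
    "mini passenger car": "Small",
    "light passenger car": "Small",
    "compact passenger car": "Small",
}


def to_hldi(nhtsa_class):
    """Return the HLDI group for an NHTSA class, or None if no mapping is agreed."""
    return NHTSA_TO_HLDI.get(nhtsa_class.strip().lower())


def group_by_hldi(records):
    """Group (person, nhtsa_class) pairs by HLDI group.

    Records whose class has no agreed mapping are returned separately
    instead of being silently forced into a group.
    """
    grouped = {}
    unmapped = []
    for person, nhtsa_class in records:
        hldi = to_hldi(nhtsa_class)
        if hldi is None:
            unmapped.append((person, nhtsa_class))
        else:
            grouped.setdefault(hldi, []).append(person)
    return grouped, unmapped
```

Keeping the unmapped records visible is the point of the sketch: every class without an agreed translation is a place where two applications can end up grouping the same individual differently.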

The absence of a standard within the enterprise for choices of classification
may seem irrelevant within siloed functions, but as more business processes are
monitored across multiple functions, variant dimensions for classification and
analysis will create confusion somewhere down the line.
