Contact Data and Identifying Information

Blog Administrator | Address Quality, Analyzing Data, Analyzing Data Quality, Data Quality

By David Loshin

When inspecting two records for similarity (or for differentiation), the values in the identifying attributes from each corresponding record are compared to determine whether the two records can be presumed to represent the same entity or distinct entities.

For people, there are some obvious attributes used for comparison: those inherently associated with the individual, such as first name, last name, birth date, eye color, or birth location.

There are two issues with this limited set of attributes: in many cases, not all of that information has been captured, and as data sets grow, these attributes lose their power to distinguish individuals: many people may share the same first name or last name, and even more will share the same birth date. Therefore, the default is to consider additional attribute values that are directly associated with the individual.
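As a rough illustration, the record-comparison step described above can be sketched in Python. The field names and the simple exact-match scoring scheme here are illustrative assumptions, not a production matching algorithm:

```python
# Minimal sketch of record comparison on identifying attributes.
# The attribute names and scoring scheme are illustrative assumptions.

IDENTIFYING_ATTRIBUTES = ["first_name", "last_name", "birth_date"]

def similarity_score(rec_a, rec_b, attributes=IDENTIFYING_ATTRIBUTES):
    """Fraction of identifying attributes whose values agree exactly."""
    matches = sum(
        1 for attr in attributes
        if rec_a.get(attr) and rec_a.get(attr) == rec_b.get(attr)
    )
    return matches / len(attributes)

a = {"first_name": "Ann", "last_name": "Lee", "birth_date": "1980-03-14"}
b = {"first_name": "Ann", "last_name": "Lee", "birth_date": "1979-07-02"}
print(similarity_score(a, b))  # 2 of 3 attributes agree
```

In practice, a matching engine would use fuzzy comparisons (edit distance, phonetic encodings) and weighted attributes rather than exact equality, but the principle of scoring agreement across identifying attributes is the same.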

The most frequently used values are those associated with contact data, such as residential address or telephone number, although technological evolution has greatly broadened the spectrum of contact data attributes.

These include: email addresses (of which there may be both professional and private versions), handles used for social media interactions (like those used on Twitter or other online forums), IP address, varieties of mobile telephones, IP telephone numbers (including online-only numbers like those acquired via Google’s Voice service), as well as other assigned identifiers (such as account numbers or the numbers on your supermarket affinity card).

Contact information has become significantly more sophisticated, and some of the assumptions that previously supported its use in identification no longer necessarily hold true.

For example, the area code and exchange code of a landline telephone number could once be correlated to a specific location and matched with postal codes. Today, telephone numbers are not only dissociated from location (e.g., a person can retain his Boston-based mobile number after moving anywhere in the United States), they are even dissociated from telephones (such as virtual numbers connected directly to Internet-only systems).

Not only that, the advent of social communities allows for the creation of multiple personas that can be attached to more than one individual. I know of a person who has created multiple Twitter accounts, including one for each of her pets. Retail affinity cards can be shared among members of the same family.

Tracking web transactions by IP address groups multiple actions that could be performed by many people working on the same network and sharing the same Internet connection.

Using contact information for unique identification is a double-edged sword: on one hand, there is a wider variety of data attributes and values to use, and they can strengthen both the similarity analysis and the differentiation process.

On the other hand, you must be careful to ensure that the values do not resolve to aliases of the same individual, and that they do not in fact represent multiple individuals.

Validation of Data Rules

Blog Administrator | Address Quality, Address Validation, Analyzing Data, Analyzing Data Quality, Data Profiling, Data Quality

By David Loshin

Over the past few blog posts, we have looked at the ability to define data quality rules asserting consistency constraints between two or more data attributes within a single data instance, as well as cross-table consistency constraints to ensure referential integrity. Data profiling tools provide the ability to both capture these kinds of rules within a rule repository and then apply those rules against data sets as a method for validation.

As a preparatory step focusing the profiler for an assessment, the cross-column rules to be applied to each record are organized so that, as the table (or file) is scanned, the data attributes within each individual record that are the subject of a rule are extracted and submitted for assessment. If the record complies with all the rules, it is presumed to be valid. If the record fails any of the rules, it is reported as a violation and tagged with all of the rules that were not observed.
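The per-record validation loop described above might be sketched as follows; the rule definitions and the record layout are illustrative assumptions, not a particular profiling tool's API:

```python
# Sketch of applying cross-column validation rules to each record.
# The rules and the record layout are illustrative assumptions.

rules = {
    # End date must not precede start date (ISO dates compare as strings).
    "end_after_start": lambda r: r["end_date"] >= r["start_date"],
    # If a state is present, the country must be US.
    "state_requires_us": lambda r: not r["state"] or r["country"] == "US",
}

def validate_record(record, rules):
    """Return the names of all rules the record violates."""
    return [name for name, check in rules.items() if not check(record)]

record = {"start_date": "2024-05-01", "end_date": "2024-04-01",
          "state": "MA", "country": "US"}
print(validate_record(record, rules))  # ['end_after_start']
```

A real profiler would pull these rule definitions from its rule repository and apply them while scanning the table, but the core loop of tagging each failing record with every rule it violates looks much like this.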

Likewise for the cross-table rules, the profiler will need to identify the dependent data attributes taken from the corresponding tables that need to be scanned for validation of referential integrity. Those column data sets can be subjected to a set intersection algorithm to determine if any values exist in the referring set that do not exist in the target (i.e., “referred-to”) data set.

Any items in the referring set that do not link to an existing master entity are called out as potential violations.
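The set intersection test for referential integrity reduces to plain set operations; the identifiers below are illustrative assumptions:

```python
# Sketch of referential-integrity checking via set difference.
# The column values are illustrative assumptions.

referring = {"C001", "C002", "C003", "C999"}   # e.g. customer IDs in an orders table
referred_to = {"C001", "C002", "C003"}         # e.g. the customer master table

# Values in the referring set with no matching master entity.
orphans = referring - referred_to
print(sorted(orphans))  # ['C999']
```

Each orphan value is a potential referential-integrity violation to be reported for steward review.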

After the assessment step is completed, a formal report can be created and delivered to the data stewards delineating the records that failed any data quality rules. The data stewards can use this report to prioritize potential issues and then perform root cause analysis and remediation.

Get Used to It: Inconsistent Data is the New Normal

Blog Administrator | Address Quality, Analyzing Data, Analyzing Data Quality, Data Cleansing, Data Management, Data Quality

By Elliot King

Nobody is perfect and neither is corporate data. Indeed, data errors are intrinsic to IT’s DNA. Data inevitably decays. Errors can be caused when data from outside sources are merged into a system. And then, of course, the humans that interact with the system are, well, human.

Unfortunately, despite the best efforts of data quality professionals, the three major IT trends (analytics, big data, and unstructured data), while promising great payoffs generally, also threaten to exacerbate data quality issues.

Perhaps analytics presents the most interesting set of challenges. Intuitively, companies believe that the more information incorporated into the analytic process, the sounder the outcome. This leads companies to investigate or incorporate data sets that have been little used or overlooked in the past.

And when you look into new places, sometimes you find surprises. Patient records are perhaps the most well-publicized example. Few people ever closely scrutinized the paper records maintained by most doctors. But now that patient information is being imported into electronic patient records, huge numbers of mistakes are coming to the surface: both those that the examining doctor made initially, and those introduced by the import process itself.

The problems with electronic patient records are emblematic of the Achilles heel of big data in general. It seems pretty obvious that the more data you collect, the more mistakes will be embedded in the data. Quantity works against quality in most cases, particularly when the growth of data is being driven by a range of new input devices of uneven reliability, such as sensors and Web processes.

But the issue is not just one of the size of databases, but of the nature of the data captured. The main driver of big data is unstructured information, and almost by definition, unstructured information is inexact, as are the methods for managing unstructured data (although they are consistently improving over time).

Face it, data has always been messy and is getting messier. Consequently, data quality efforts have to be consistent and ongoing.


Understanding Hierarchies

Blog Administrator | Address Quality, Analyzing Data, Analyzing Data Quality, Data Management, Data Quality

By David Loshin

Defining standards for group classification helps in reducing confusion due to inconsistencies across generated reports and analyses. In the automobile classification example we have been using for the past few posts, we might pick the NHTSA values (mini passenger cars, light passenger cars, compact passenger cars, medium passenger cars, heavy passenger cars, sport utility vehicles, pickup trucks, and vans) as the standard.

Yet, as more organizations look to merge data sets and feeds from different
sources, some challenges remain, particularly with the use of unstructured text
(such as that presented via Twitter or Facebook). People cannot be expected to
always conform to your organization’s data standards, and often use colloquial
terms or their own words to describe ideas that would map to your own
dimensional values.

For example, if you wanted to filter out the individuals who prefer to drive
“pickup trucks” (one of our standard values), it is not enough to scan for that
phrase. Many individuals will refer to their pickup truck using different terms,
such as a make and model (“Ford F-150,” “Chevy Silverado”) or a different name
(“light truck”) or a nickname (“baby monster”), but these terms have to be
linked to the overall classification term.

This is an example of a simple hierarchy, in which one concept (“automobiles”)
is divided into a collection of smaller classes (the NHTSA classifications).
Each of those classes in turn contains other phrases and terms. Within each of
those included collections, there may be other inclusive categorization, such as
by make and then model.

With a well-defined hierarchy for classification, unstructured text can be
scanned for matches with values that live within the hierarchy, and that enables
the standardized classification. To round out the example, a tweet exclaiming
the author’s love of “driving his Ford F-150” can be scanned, the model name
extracted and located within the make and model hierarchy under pickup trucks,
thereby allowing us to register the author’s automobile driving preference.