By David Loshin
differences impact the ability to search and match records across different
data sets. Fortunately, most data quality tool suites use integrated parsing
and standardization algorithms to map structures together.
As long as there is some standard representation, we should be able to come
up with a set of rules that can help to rearrange the words in a data value
to match that standard.
As an example, we can look at person names (for simplicity, let’s focus on
name formats common to the United States). The general convention is that
people have three names – a first name, a middle name, and a surname. Yet
even limiting our scope to just these components (that is, we are ignoring
titles, generationals, and other prefixes and suffixes), there is a wide
range of variance for representing the name. Here are some examples, using
my own name:
• Howard David Loshin
• Howard D Loshin
• Howard D. Loshin
• David Loshin
• Howard Loshin
• H David Loshin
• H. David Loshin
• H D Loshin
• H. D. Loshin
• Loshin, Howard D
• Loshin, Howard D.
• Loshin, H David
• Loshin, H. David
• Loshin, H D
• Loshin, H. D.
The versions differ depending on whether abbreviations or full names are used,
whether punctuation is included, and how the terms are ordered. A good parsing
engine can be configured with the different patterns and will be able to
identify each piece of a name string.
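
To make the parsing step concrete, here is a minimal Python sketch that handles only the two orderings in the list above ("first middle last" and "last, first middle"). The function name `parse_name` and the dictionary keys are hypothetical and chosen for illustration; they are not taken from any particular data quality tool, and a real parsing engine would be driven by configurable patterns rather than hard-coded rules.

```python
def parse_name(raw):
    """Split a name string into first, middle, and last components.

    Illustrative assumptions: at most two given-name tokens, a single-token
    surname, and a comma meaning the surname comes first.
    """
    if "," in raw:
        # "Loshin, Howard D." -> surname before the comma, given names after
        last, _, rest = raw.partition(",")
        last = last.strip()
        given = rest.split()
    else:
        # "Howard D. Loshin" -> surname is the final token
        parts = raw.split()
        if len(parts) < 2:
            raise ValueError(f"cannot parse name: {raw!r}")
        last = parts[-1]
        given = parts[:-1]

    if not last or not given or len(given) > 2:
        raise ValueError(f"cannot parse name: {raw!r}")

    # Drop trailing periods so "H." and "H" yield the same token
    given = [token.rstrip(".") for token in given]
    first = given[0]
    middle = given[1] if len(given) == 2 else None
    return {"first": first, "middle": middle, "last": last}


print(parse_name("Loshin, H. D."))
print(parse_name("Howard David Loshin"))
```

Note that a two-token value such as "David Loshin" is treated here as first name plus surname. The sketch has no way of knowing that "David" is actually a middle name, which is exactly the kind of ambiguity that makes matching across these variants difficult.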
The next piece is standardization: taking the pieces and rearranging them into a
desired order. One example is taking a string of the form “last_name,
first_name, initial” and transforming that into the form “first_name, initial,
last_name” as a standardized or normalized representation. Using a normalized
representation will simplify the comparison process for data matching and record