By David Loshin

While we have been talking in the last few posts about checking whether a data value observes the standard (and is therefore a valid value), the real challenge in standardization lies in (1) determining that a value does not meet the standard and then (2) taking the right actions to modify it so that it does.

That process, strangely enough, is called "standardization," and it extends
tokenization and parsing to recognize both valid tokens and common patterns for
invalid ones; that is where the power of standardization lies. Here is the
basic idea: when you recognize a token value as a known error, you can define
a business rule to map it to a corrected version.

The example I have used over the recent blog posts is a simple address standard:

· The number must be a positive integer

· The name must have one and only one word

· The street type must be one of the following: RD, ST, AV, PL, or CT

And deriving these additional expectations:

· The address string must have three components to it (format)

· The first component has to only have characters that are digits 0-9 (syntax)

· The first character of the first component cannot be a ‘0’ (syntax)

· The third component must be of length 2 (format)

· The third component has to have one of the valid street types (content)
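These expectations can be sketched as a small validator. This is a minimal illustration, not a production parser; the function and message names are my own:

```python
# Valid street types from the simple address standard above.
STREET_TYPES = {"RD", "ST", "AV", "PL", "CT"}

def validate_address(address: str) -> list[str]:
    """Return the list of rule violations for a 'number name type' address."""
    errors = []
    tokens = address.split()
    if len(tokens) != 3:                    # format: exactly three components
        errors.append("address must have three components")
        return errors
    number, name, street_type = tokens
    if not number.isdigit():                # syntax: digits 0-9 only
        errors.append("number must contain only digits")
    elif number[0] == "0":                  # syntax: no leading zero
        errors.append("number cannot start with '0'")
    if len(street_type) != 2:               # format: street type is length 2
        errors.append("street type must be two characters")
    elif street_type not in STREET_TYPES:   # content: must be a valid type
        errors.append("street type must be one of RD, ST, AV, PL, CT")
    return errors
```

For a conforming value such as "123 Main ST" the function returns an empty list; any violation produces a description of the rule that failed.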

The next step would be to consider the variations from the expected values. A
good example might look at the third token, namely the street type, and
anticipate the types of errors that could occur and how they would be
corrected, such as:

Possible errors                        Standard
Rd, Road, Raod, rd                     RD
Street, STR                            ST
Avenue, AVE, avenue, abenue, avenoo    AV
Place, PLC                             PL
CRT, Court, court                      CT

In this example, we see some variant abbreviations, fully-spelled out words, a
finger flub (the typist hit the b key instead of the v in “abenue” – I do this
all the time), and a transposition (“Raod” instead of “Road”, I also do this all
the time).

Different types of formats and patterns can be subjected to different kinds of
rules. The first token has to be an integer, but perhaps some OCR reader mis-translated
what it scanned into a character instead of a number, so we might see O instead
of 0, A instead of 4, S instead of 8, ) instead of 9, etc. That means that part
of the standardization process looks for non-digits and then applies rules that
traverse the string and convert according to the defined mappings (A
becomes 4, for example).
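This kind of character-level repair is a straightforward translation pass. A sketch using the mappings mentioned above (the function name is my own):

```python
# Characters an OCR reader commonly confuses with digits, per the rules above.
OCR_DIGIT_MAP = str.maketrans({"O": "0", "o": "0", "A": "4", "S": "8", ")": "9"})

def repair_house_number(token: str) -> str:
    """Traverse the token and convert OCR-confused characters back to digits."""
    return token.translate(OCR_DIGIT_MAP)
```

After the translation pass, the token can be re-checked against the digits-only syntax rule; anything still non-numeric needs a different remedy.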

For the second token, the challenge arises when the address string contains
more than three words. One set of rules might take all tokens between the
first and the last and concatenate them together into a single word.

Another approach is to scan the tokens, pluck out the one that most closely
matches one of the street types, and move it to the end.
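Both strategies can be sketched as small token rewrites. This is an illustration under the simple three-part standard, with my own function names, and matching a street type exactly rather than "most closely":

```python
STREET_TYPES = {"RD", "ST", "AV", "PL", "CT"}

def collapse_middle_tokens(tokens: list[str]) -> list[str]:
    """First strategy: concatenate everything between the first and last
    tokens into a single name word."""
    if len(tokens) <= 3:
        return tokens
    return [tokens[0], "".join(tokens[1:-1]), tokens[-1]]

def move_street_type_last(tokens: list[str]) -> list[str]:
    """Second strategy: find the token matching a street type and move it
    to the end, preserving the order of the rest."""
    for i, tok in enumerate(tokens):
        if tok.upper() in STREET_TYPES:
            return tokens[:i] + tokens[i + 1:] + [tok]
    return tokens
```

In practice the two would be combined: reposition a misplaced street type first, then collapse whatever remains in the middle into the name component.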

So these are the basic ideas for standardization: defining the formats and
patterns, determining the tokenization rules, parsing the data to recognize
valid and invalid tokens, defining rules for mapping invalid tokens to valid
ones, and potentially rearranging tokens into the corrected version. In reality,
there are many more challenges, opportunities, and subtleties, but at least this
series of notes gives a high-level view of the general process.
