Formats, Syntax, and Content

Blog Administrator | Address Standardization, Data Quality

By David Loshin

One great thing about having a standard representation for data is that it
becomes easy to see whether any value does or does not meet the standard. Let’s
use a simple example: we can say that a street address has to have three parts –
a number, a name, and a “street type.” We can further specify our example
standard with these constraints:

· The number must be a positive integer
· The name must have one and only one word
· The street type must be one of the following: RD, ST, AV, PL, or CT

OK, I know that there are streets with names that span more than one word, and I
know there are a lot more types of streets, but this experiment is to
demonstrate how we can use the standard to determine if an address is valid or
not by comparing it against the defined format, syntax, and content
characteristics, such as:

· The address string must have three components to it (format)
· The first component may contain only the digits 0-9 (syntax)
· The first character of the first component cannot be a ‘0’ (syntax)
· The third component must be of length 2 (format)
· The third component has to have one of the valid street types (content)

In other words, we are refining the rules for validity into ones that we can
test.
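
As a rough illustration, the five checks listed above could be sketched in Python along the following lines. The function name `check_address`, the message strings, and the choice to split on whitespace and to match street types in upper case only are all assumptions made for this sketch; they are not part of the example standard itself.

```python
from typing import List

# Content rule: the street types allowed by the example standard.
VALID_STREET_TYPES = {"RD", "ST", "AV", "PL", "CT"}


def check_address(address: str) -> List[str]:
    """Return a list of rule violations; an empty list means the address is valid.

    Hypothetical helper: components are assumed to be separated by whitespace.
    """
    parts = address.split()

    # Format: the address string must have exactly three components.
    if len(parts) != 3:
        return ["format: expected 3 components, found %d" % len(parts)]

    number, _name, street_type = parts
    problems = []

    # Syntax: the first component may contain only the digits 0-9 ...
    if not (number.isascii() and number.isdigit()):
        problems.append("syntax: number must contain only digits 0-9")
    # ... and its first character cannot be '0'.
    elif number[0] == "0":
        problems.append("syntax: number must not start with '0'")

    # Format: the third component must be of length 2.
    if len(street_type) != 2:
        problems.append("format: street type must have length 2")

    # Content: the third component must be one of the valid street types.
    if street_type not in VALID_STREET_TYPES:
        problems.append("content: unknown street type")

    return problems
```

Under these assumptions, `check_address("123 Main ST")` returns an empty list, while `check_address("0123 Main ST")` reports the leading-zero violation. Returning the list of violations rather than a single true/false value keeps each failure tied to the format, syntax, or content rule that caught it.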