Tokenization and Parsing
By David Loshin
This means breaking out each of the chunks of a data value that carry the
meaning, and in the standardization biz, each of those chunks is called a token.
A token represents all of the character strings used for a particular purpose.
In our example, we have three tokens: the number, the street name, and the
street type.
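As a rough illustration, here is a minimal Python sketch that breaks a simple
value such as "123 Main Street" into those three tokens. The function and token
names are my own, and it assumes a well-behaved number/name/type layout rather
than the messier values a real parser has to handle.

    # Minimal sketch: split a simple street address into its three tokens.
    # Assumes the value follows a clean "number name type" layout.
    def tokenize_address(value: str) -> dict:
        chunks = value.split()
        return {
            "number": chunks[0],             # e.g. "123"
            "name": " ".join(chunks[1:-1]),  # e.g. "Main" (may be multi-word)
            "type": chunks[-1],              # e.g. "Street"
        }

    print(tokenize_address("123 Main Street"))
    # {'number': '123', 'name': 'Main', 'type': 'Street'}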
Token categories can be further refined based on the value domain, such as our
street type with its list of valid values. This distinction and recognition
process starts by parsing the tokens and then rearranging the strings that
mapped to those tokens through a process called standardization. Parsing is
intended to achieve one of two goals: to validate the correctness of the
string, or to identify which parts of the string need to be corrected and
standardized.
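To make the value-domain idea concrete, here is a small sketch of a street-type
domain and the kind of lookup a standardization step might perform. The variant
spellings and canonical forms listed are examples of my own, not an official
standard.

    # Illustrative value domain for the street-type token; the variants and
    # canonical forms are examples, not a complete or official list.
    STREET_TYPE_DOMAIN = {
        "ST": "Street", "ST.": "Street", "STREET": "Street",
        "AVE": "Avenue", "AVE.": "Avenue", "AVENUE": "Avenue",
        "RD": "Road", "RD.": "Road", "ROAD": "Road",
    }

    def standardize_street_type(token):
        """Map a street-type token to its standard form, or return None
        if the token is not in the value domain (flagging it for review)."""
        return STREET_TYPE_DOMAIN.get(token.upper())

    print(standardize_street_type("St."))   # Street
    print(standardize_street_type("Blvd"))  # None -> not in the domain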
We rely on metadata to guide parsing, and parsing tools use format and syntax
patterns as part of the analysis.
We define a set of data element types and patterns that correspond to each
token type; the parsing algorithm matches data values against those patterns
and maps them to the expected tokens in the string. The tokens are then
analyzed against the patterns to determine their element types.
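A simple way to picture this is a table of patterns, one per element type, that
each token is tested against. The regular expressions below are deliberately
crude placeholders for the richer format and syntax metadata a real parsing
tool would use.

    import re

    # Illustrative format patterns for each element type.
    ELEMENT_PATTERNS = {
        "street_number": re.compile(r"^\d+$"),
        "street_name":   re.compile(r"^[A-Za-z]+(?: [A-Za-z]+)*$"),
        "street_type":   re.compile(r"^(Street|Avenue|Road|Lane|Drive)$",
                                    re.IGNORECASE),
    }

    def classify_token(token):
        """Return the element types whose patterns the token matches."""
        return [name for name, pattern in ELEMENT_PATTERNS.items()
                if pattern.match(token)]

    print(classify_token("123"))     # ['street_number']
    print(classify_token("Main"))    # ['street_name']
    print(classify_token("Street"))  # ['street_name', 'street_type']

Note that a token can match more than one pattern, which is exactly why its
position in the string matters when deciding which element type it plays.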
Comparing data fields that are expected to have a pattern, such as our initial
numeric token or the third token, the street type, enables a measurement of
conformance to defined structure patterns. This can be applied in many other
scenarios as well, such as telephone numbers, person names, product codes, etc.
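For example, a conformance measurement over a telephone-number field might look
like the sketch below; the NNN-NNN-NNNN pattern is just one assumed structure,
and the sample values are made up for illustration.

    import re

    # Measure what fraction of a field's values conform to a defined
    # structure pattern (here, an assumed NNN-NNN-NNNN phone format).
    PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")

    def conformance_rate(values, pattern):
        matches = sum(1 for v in values if pattern.match(v))
        return matches / len(values) if values else 0.0

    phones = ["301-555-0101", "555-0102", "301-555-0103", "3015550104"]
    print(f"{conformance_rate(phones, PHONE_PATTERN):.0%}")  # 50%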
Once the tokens are segregated and reviewed, then as long as all of the tokens
are valid and in the right place, the string is valid (the sketch below shows
what that check might look like). In the next post, we will consider what to do
if the string is not valid.
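Here is that rough end-to-end check, built from the earlier sketches: the
string is treated as valid only when every token matches the element type
expected at its position. As before, the patterns and names are illustrative
assumptions rather than a production parser.

    import re

    # A string is valid when each token matches the pattern expected
    # at its position (number, then name, then type).
    EXPECTED = {
        "number": re.compile(r"^\d+$"),
        "name":   re.compile(r"^[A-Za-z]+(?: [A-Za-z]+)*$"),
        "type":   re.compile(r"^(Street|Avenue|Road|Lane|Drive)$",
                             re.IGNORECASE),
    }

    def is_valid_address(value):
        chunks = value.split()
        if len(chunks) < 3:
            return False
        tokens = {"number": chunks[0],
                  "name": " ".join(chunks[1:-1]),
                  "type": chunks[-1]}
        return all(EXPECTED[t].match(v) for t, v in tokens.items())

    print(is_valid_address("123 Main Street"))  # True
    print(is_valid_address("Main 123 Street"))  # False: tokens out of place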