Global Intelligence Blog

Insights and Analysis for the Data-Driven Enterprise


Tokenization and Parsing


By David Loshin

As we discussed in previous posts, the data values stored within data elements carry specific meaning in the context of the business uses of the modeled concepts. To standardize an address, then, the first step is identifying the chunks of information that are embedded in those values.

This means breaking out each of the chunks of a data value that carry meaning; in the standardization business, each of those chunks is called a token. A token represents all of the character strings used for a particular purpose. In our example, we have three tokens: the number, the name, and the type.
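As a concrete illustration, here is a minimal sketch of that breakout in Python, assuming a sample address like "123 Main St" and a simple whitespace split; the function name is illustrative, not taken from any particular tool.

```python
# Minimal sketch: break a data value into its candidate chunks (tokens).
# The sample address "123 Main St" is assumed for illustration; real
# addresses need far more careful handling (punctuation, units, etc.).

def tokenize(value: str) -> list[str]:
    """Split a data value into candidate tokens on whitespace."""
    return value.split()

tokens = tokenize("123 Main St")
print(tokens)  # ['123', 'Main', 'St'] -> number, name, type
```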

Token categories can be further refined based on the value domain, such as our street type with its list of valid values. This recognition process starts by parsing the tokens and then rearranging the strings mapped to those tokens through a process called standardization. Parsing is intended to achieve one of two goals: to validate the correctness of the string, or to identify which parts of the string need to be corrected and standardized.
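For the street-type token, a hedged sketch of such a value domain might look like the following; the variant spellings and standard forms shown here are assumptions for illustration, not an official list.

```python
# Illustrative value domain for the street-type token: each valid
# variant string maps to a single standard form.
STREET_TYPES = {
    "st": "ST", "st.": "ST", "street": "ST",
    "ave": "AVE", "ave.": "AVE", "avenue": "AVE",
    "blvd": "BLVD", "boulevard": "BLVD",
}

def standardize_type(token: str) -> str | None:
    """Return the standard form if the token is in the domain, else None."""
    return STREET_TYPES.get(token.lower())

print(standardize_type("Street"))  # 'ST'
print(standardize_type("Strete"))  # None -> this part needs correction
```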

We rely on metadata to guide parsing, and parsing tools use format and syntax
patterns as part of the analysis.

We define a set of data element types and the patterns that correspond to each token type; the parsing algorithm matches data against those patterns and maps the pieces to the expected tokens in the string. Each token is then analyzed against the patterns to determine its element type.
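Here is a minimal sketch of that matching step, assuming regular expressions as the format patterns; the pattern definitions are simplified stand-ins for what a real parsing tool would draw from its metadata repository.

```python
import re

# Assumed format patterns for each element type; real parsing tools
# would load these from metadata rather than hard-coding them.
PATTERNS = {
    "number": re.compile(r"^\d+$"),
    "name":   re.compile(r"^[A-Za-z]+$"),
    "type":   re.compile(r"^(st|ave|blvd|rd|ln)\.?$", re.IGNORECASE),
}

def classify(token: str) -> list[str]:
    """Return every element type whose pattern the token matches."""
    return [t for t, p in PATTERNS.items() if p.match(token)]

for tok in ["123", "Main", "St"]:
    print(tok, "->", classify(tok))
# 123 -> ['number'];  Main -> ['name'];  St -> ['name', 'type']
```

Note that "St" matches both the name and type patterns; it is the token's position in the string that resolves the ambiguity.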

Comparing data fields that are expected to follow a pattern, such as our initial numeric token or the third street-type token, enables a measurement of conformance to the defined structure patterns. The same approach applies in many other scenarios, such as telephone numbers, person names, and product code numbers.
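As a sketch, conformance can be measured as the fraction of values that match the expected pattern; the phone-number pattern and the sample values below are assumptions for illustration.

```python
import re

def conformance(values: list[str], pattern: str) -> float:
    """Fraction of values matching the expected format pattern."""
    p = re.compile(pattern)
    return sum(1 for v in values if p.match(v)) / len(values)

# Assumed sample: US-style phone numbers, one of them malformed.
phones = ["949-555-0100", "949-555-0101", "9495550102"]
print(conformance(phones, r"^\d{3}-\d{3}-\d{4}$"))  # 0.666...
```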

Once the tokens have been segregated and reviewed, the string is valid as long as all of the tokens are valid and appear in the right place, as the sketch below illustrates.
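To make that criterion concrete, here is a hedged end-to-end sketch combining the earlier pieces; the patterns, the expected token order, and the helper name are all illustrative, not drawn from any particular parsing product.

```python
import re

# End-to-end sketch: the string is valid only if every token matches
# the element type expected at its position. All names are illustrative.
PATTERNS = {
    "number": re.compile(r"^\d+$"),
    "name":   re.compile(r"^[A-Za-z]+$"),
    "type":   re.compile(r"^(st|ave|blvd|rd|ln)\.?$", re.IGNORECASE),
}
EXPECTED_ORDER = ["number", "name", "type"]

def is_valid(value: str) -> bool:
    """True only if every token matches the type expected at its slot."""
    tokens = value.split()
    return len(tokens) == len(EXPECTED_ORDER) and all(
        PATTERNS[expected].match(tok)
        for tok, expected in zip(tokens, EXPECTED_ORDER)
    )

print(is_valid("123 Main St"))  # True
print(is_valid("Main 123 St"))  # False: tokens are out of place
```

In the next post, we will consider what to do if the string is not valid.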

November 1, 2011 (updated April 5, 2021)

