C++ String Toolkit (StrTk) Tokenizer

Articles

C++ String Toolkit (StrTk) Tokenizer

Introduction

This article will present the tokenizing and splitting functionality of a simple C++ library called the String Toolkit. Tokenization in the context of string processing, is the method by which a sequence of elements are broken up or fragmented in sub-sequences called tokens. The indices in the original sequence that determine such breaks in the sequence are known as delimiters.

There are two types of delimiters, normal or thin delimiters which are of length one element and a thick delimiters which are of length two or more elements. Even though tokenization is primarily used in conjunction with strings, any sequence of types that can be iterated in a linear fashion can be tokenized, examples may be list of integers, a vector of person classes or a map of strings.

Another Tokenizer?

To date there have been many attempts to define a “standard” Tokenizer implementation in C++. Of them all the likely candidate might be the implementation in the Boost library. Regardless proposed implementations should to some extent consider one or more of the following areas: over-all usage patterns, constructions, generality (or is it genericty these days?) of design, performance-efficiency.

1. Over-all usage patterns

This requirement is concerned with how easy it is to instantiate the tokenizer and integrate it into existing processing patterns, which most often than not requires integration with C++ STL algorithms and containers. A tokenzier by definition would be classed as a producer, so the question becomes how easy is it for others to consume its output? Another issue is consistency of the definition of a token in the situation where one has consecutive delimiters but is not compressing them – can or should there be such a thing as an empty token? and what do preceding and trailing delimiters mean? and when should they be included as part of the token?

2. Constructions

In the context of string tokenizing, the majority of implementations return the token as a new instance of a string. This process requires a string to be created on heap, populated by the sub-string in question from the original sequence, then returned back (some of this may be alleviated by Return Value Optimization RVO). In the case of iterators this is essentially another copy to the caller. Furthermore two kinds of tokens can make this situation worse, they are primarily a large sequence made up of lots of very short tokens or a large sequence made up of lots of very large tokens. The solution is not to return the token as a string but rather as a range and allow the caller to decide how they wish to inspect and further manipulate the token.

This minor change in interface design provides a great deal of flexibility and performance gain.

3. Generality(Genericity) of design

Most tokenizer implementations concern themselves only with strings, which for the most part is ok, because most things that need tokenizing are strings. However there will be times when one has a sequence of types that may need to be tokenized that aren’t strings, hence a tokenizer should be designed in such a way to enable such features, moreover it becomes clear that the most basic of tokenizing algorithms are invariant to the type of the delimiter.

4. Performance and Efficiency

Tokenizing strings range from low frequency inputs such as user input or parsing of simple configuration files to more complex situations such as tokenizing of HTML pages that a web crawler/indexer might do, to parsing of large market-data streams in FIX format. Performance is generally important and can usually be helped along with good usage patterns that encourage reuse of instances, minimal preprocessing and allow for user supplied predicates for the more nasty areas of the process. It should be noted that everything in the proceeding article can be done by purely using the STL – that said, C++’s ability to allow one to skin the proverbial cat in numerous way gives rise to novel solutions that are for the most part not of any practical use other than to showcase ones abilities in using the STL.

Getting started

The String Toolkit Library (StrTk) provides two common tokenization concepts, a split function and a token iterator. Both these concepts require the user to provide a delimiter predicate and an iterator range over which the process will be carried out.

The tokenization process can be further parametrized by switching between “compressed delimiters” or “no compressed delimiters” mode. This essentially means that consecutive delimiters will be compressed down to one and treated as such. Below are two tables depicting the expected tokens from various types of input. The tables represent no compressed and compressed tokenization processes respectively. The delimiter in this instance is a pipe symbol | and <> denotes an empty token.

No Compressed Delimiters

InputToken List
aa
a|ba,b
a||ba,<>,b
|a<>,a
a|a,<>
|a||b<>,a,<>,b
||a||b||<>,<>,a,<>,b,<>,<>
|<>,<>
||<>,<>,<>
|||<>,<>,<>,<>

Compressed Delimiters

InputToken List
aa
a||ba,b
|a||b<>,a,b
||a||b||<>,a,b,<>
|<>,<>
||<>,<>
|||<>,<>

Delimiters

Two forms of delimiters are supported and they are single delimiter predicate and multiple delimiters predicate otherwise known as SDP and MDP respectively. Essentially an SDP is where there is only one type that can break or fragment the sequence, where as with MDPs there is more than one unique type that can break the sequence. It is possible to represent a SDP using the MDP, however from a performance POV having separate predicates is far more efficient. Additionally for strings based on char or unsigned char (8-bit versions) there is a MDP that has a look-up complexity of O(1) making it greatly more efficient than the basic MDP.

TechnoBits. All rights reserved
Back To Top