C# StreamTokenizer

StreamTokenizer is a very useful class in Java that I needed in C# some time ago, so this is my own implementation of it. Have a look at this small sample application, which parses a JSON file.

 

 

The StreamTokenizer

This class is a very helpful tool for text streams that need to be parsed into objects. Once configured for the format at hand, it handles all the whitespace skipping and token building for you. This means you don't have to use the Split method of the string class. Split is the most common recommendation for beginners, but we take things more seriously here. Using string.Split means you have to read the data into a string first and call Split afterwards, which wastes memory: the data is held at least twice (first as the string itself, second as the resulting array of substrings). The Split method isn't very fast either.

With the StreamTokenizer you don't have to be afraid of high memory cost. It uses a single buffer (2048 bytes) to store the characters read from the text stream and then iterates over them one by one to build tokens. A few extra variables are needed to drive the tokenizing, but I'm sure they won't add up to the doubled memory usage of the read-and-split approach.
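To make the memory argument concrete, here is a minimal sketch of the buffer idea. It is not the actual StreamTokenizer code: the names BufferedWordReader and ReadTokens are made up for illustration, and it only splits on whitespace. It reads through a fixed-size char buffer, so at any time only the buffer and the token currently being built are held in memory, no matter how large the stream is.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, only meant to illustrate buffered tokenizing.
public static class BufferedWordReader
{
    // Lazily yields whitespace-separated tokens from the reader.
    public static IEnumerable<string> ReadTokens(TextReader reader, int bufferSize = 2048)
    {
        var buffer = new char[bufferSize];   // the only block of input kept in memory
        var current = new StringBuilder();   // the token that is being built right now
        int read;

        // Read() returns 0 once the end of the stream is reached.
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                char c = buffer[i];
                if (char.IsWhiteSpace(c))
                {
                    // Whitespace is skipped, but it terminates an ongoing token.
                    if (current.Length > 0)
                    {
                        yield return current.ToString();
                        current.Clear();
                    }
                }
                else
                {
                    current.Append(c);
                }
            }
        }

        // Emit the last token if the stream did not end with whitespace.
        if (current.Length > 0)
        {
            yield return current.ToString();
        }
    }
}
```

A call like BufferedWordReader.ReadTokens(File.OpenText(path)) can be consumed with foreach, one token at a time, without ever loading the whole file into a string. Tokens that span two buffer fills are handled because the StringBuilder survives across reads.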

There is another great advantage: the StreamTokenizer handles quotations and line breaks automatically, so you don't have to handle them on your own (quotations can be a real pain). For example, try to think of an easy way to parse this XML node:

<Element name="Any name"  problem1 = "value containing < /> > chars :P" 
    problem2="One more attribute containing a = char after a line break"/>

Easy, you think? Well, I'm not so sure. You can't just split the string at the space character (char 32). The first attribute (name) has no spaces around its equals sign, so splitting at spaces doesn't separate the name from the value; you would have to split again at the equals sign (char 61). The second attribute (problem1) does have spaces around its equals sign, so splitting at spaces has already torn it apart: you are left with 'problem1', not 'problem1=value'. On top of that, all the attribute values are quotations containing spaces, so you have to join the words again to get the final value. I won't even ask how to determine where the element ends, as the value of problem1 contains all kinds of edgy chars. In short: I can't think of a very simple way to split this XML.
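The problem is easy to reproduce. The following sketch (NaiveSplitDemo and SplitOnSpaces are illustrative names, not part of the sample project) splits a shortened, single-line version of the node above at the space character:

```csharp
using System;

// Hypothetical demo class showing why a plain Split is not enough here.
public static class NaiveSplitDemo
{
    // Splits at the space char (char 32) and drops empty entries.
    public static string[] SplitOnSpaces(string text)
    {
        return text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string xml = "<Element name=\"Any name\" problem1 = \"value containing < /> > chars :P\"/>";

        // Print every fragment on its own line to see where the string was cut.
        foreach (string part in SplitOnSpaces(xml))
        {
            Console.WriteLine(part);
        }
    }
}
```

The quoted value "Any name" comes back in two pieces (name="Any and name"), problem1 is separated from its equals sign, and the < /> > chars inside the second value show up as fragments of their own. Every one of these cases would need extra repair code after the split.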

 

JSON sample parser

To show what I like about the StreamTokenizer, I've chosen a simple JSON sample. I found the sample data on the Wikipedia page for JSON. JSON is a good fit for the StreamTokenizer because it describes an object hierarchy with a minimum of structural overhead. In fact, you can describe JSON in just three lines:

Objects start with { and end with }. Each object may have multiple attributes, separated by commas, each consisting of a name and a value separated by a colon.
Attribute values can be strings, doubles, booleans, other objects or arrays.
Arrays start with [ and end with ]. They can contain multiple objects separated by commas.

{
    "_comment" : "This json example is based on Wikipedia: 
    http://en.wikipedia.org/wiki/JavaScript_Object_Notation",
      
    "firstName": "John",
    "lastName": "Smith",
    "isAlive": true,
    "age": 25,
    "height_cm": 167.64,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100"
    },
    "phoneNumbers": [
        { "type": "home", "number": "212 555-1234" },
        { "type": "fax",  "number": "646 555-4567" }
    ]
}

 

Setting up the StreamTokenizer

If you want to use the StreamTokenizer, you first need to define the syntax of the format. This is done per character code: you define the token type for each character code that occurs in the format. The character and token types are described in detail in StreamTokenizer.cs. For this JSON sample I used this setup:

var tokenizer = new StreamTokenizer(textReader);
 
// Make all chars (0x00 - 0xFF) ordinary chars (ordinary: single char tokens)
tokenizer.ResetSyntax(); 
 
// Define all alphabetic chars to be word chars (word: multi char tokens)
tokenizer.SetWordChars((int)'a', (int)'z');
tokenizer.SetWordChars((int)'A', (int)'Z');
 
// Set 0123456789.- to be number chars (number: multi char tokens, which are cast
// to double)
tokenizer.EnableParseNumbers();
 
// set '"' to be the quote char (quote: tokens aren't terminated by whitespace or
// ordinary chars)
tokenizer.SetQuoteChar((int)'"');
 
// set space to be a whitespace char (whitespace: those chars are skipped but they
// terminate ongoing tokens)
tokenizer.SetWhitespaceChar((int)' ');
 
// Set \r and \n to be whitespace chars (line breaks aren't important in this format,
// so they can be skipped)
tokenizer.EolIsSignificant = false;

First, the syntax is reset, so every possible character code (0x00 to 0xFF) is now an 'ordinary' char. This means a call to the NextToken() method returns each char as a single-char token. This is used for all the special characters of the format: {}[]:,

The next part defines all lowercase and uppercase alphabetic characters as 'word' chars. In this sample that is used for the 'true' value only, which is the only word not contained within a quotation.

EnableParseNumbers() - just as the method name implies - enables the 'number' token type. This means all tokens starting with one of these chars [0123456789.-] are considered to be numbers.

One important definition is the quotation character. Nearly all words in this JSON sample are quoted and therefore easily recognized as strings.

Finally, the whitespace chars are defined. As the JSON format doesn't care about line breaks, neither do we, and we skip them.
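If you wonder how such a per-character configuration can work internally, a simple lookup table is enough. The following is a minimal sketch under my own assumptions, not a copy of StreamTokenizer.cs: the names CharKind, SyntaxTable, SetRange and KindOf are made up for illustration. Each character code from 0x00 to 0xFF owns one slot in an array, and the setup calls shown above would simply overwrite slots.

```csharp
using System;

// Hypothetical character classes, modelled on the ones described in this article.
public enum CharKind { Ordinary, Word, Number, Quote, Whitespace }

// Hypothetical syntax table: one entry per character code (0x00 - 0xFF).
public sealed class SyntaxTable
{
    private readonly CharKind[] kinds = new CharKind[256];

    // Makes every character an ordinary (single char) token again.
    public void ResetSyntax()
    {
        for (int i = 0; i < kinds.Length; i++)
        {
            kinds[i] = CharKind.Ordinary;
        }
    }

    // Assigns one kind to all character codes in the inclusive range low..high.
    public void SetRange(int low, int high, CharKind kind)
    {
        low = Math.Max(low, 0);
        high = Math.Min(high, kinds.Length - 1);
        for (int i = low; i <= high; i++)
        {
            kinds[i] = kind;
        }
    }

    // Looks up the kind of a character code. Codes outside the table are
    // treated as ordinary chars here; that is a choice made for this sketch only.
    public CharKind KindOf(int c)
    {
        return (c >= 0 && c < kinds.Length) ? kinds[c] : CharKind.Ordinary;
    }
}
```

With this table, the JSON setup boils down to calls like table.SetRange((int)'a', (int)'z', CharKind.Word) and table.SetRange((int)'"', (int)'"', CharKind.Quote), while chars like { or : are never touched and stay ordinary. The tokenizer then only needs one array lookup per character to decide how to continue.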

 

Tokenizing and parsing objects

Once the tokenizer is configured, NextToken() is nearly the only method we have to call, again and again. As described in more detail within the code, NextToken() reads the next token from the initially provided stream and returns its token type. The TokenType enumeration contains the possible types, like Word, Number or EOF (end of file). In some cases, however, NextToken() returns the character code of the current token instead. This happens for all ordinary (single-char) tokens and for quotations. With the returned token type it is easy to use a switch statement to decide what kind of token was read. A switch on an integer is a better way to decide what code to run next than a chain of string comparisons.

It is good practice, if not best practice, to have one class per object type that needs to be distinguished. In the case of the JSON format there are objects, attributes and arrays, as mentioned above. An implementation of a JsonObject class could therefore look like this:

using System.Collections.Generic;
 
public class JsonObject
{
    public JsonObject()
    {
        this.Attributes = new List<JsonAttribute>();
    }
 
    public List<JsonAttribute> Attributes { get; set; }
 
    public static JsonObject ReadObject(ref StreamTokenizer tokenizer)
    {
        JsonObject obj = new JsonObject();
 
        // JsonObjects start with {
        tokenizer.AssertToken((int)'{');
 
        // Read object attributes
        ReadObjectAttributes(ref tokenizer, ref obj);
 
        // JsonObjects end with }
        tokenizer.AssertToken((int)'}');
 
        return obj;
    }
 
    private static void ReadObjectAttributes(ref StreamTokenizer tokenizer, 
                                                       ref JsonObject obj)
    {
        // Peek at next token
        var token = tokenizer.NextToken();
        tokenizer.PushBackToken();
             
        // Check if JsonObject has attributes and start reading them
        if (token == (int)'"')
        { 
            ReadAttributes(ref tokenizer, ref obj);
        }
    }
 
    private static void ReadAttributes(ref StreamTokenizer tokenizer, ref JsonObject obj)
    {
        // Use the JsonAttribute factory method to read the next attribute
        var attr = JsonAttribute.ReadAttribute(ref tokenizer);
        obj.Attributes.Add(attr);
 
        // If the next token is a comma, there is another JsonAttribute coming
        // -> recursion of this method
        var token = tokenizer.NextToken();
        if (token == (int)',')
        {
            ReadAttributes(ref tokenizer, ref obj);
        }
        else
        {
            // No more attributes
            tokenizer.PushBackToken();
        }
    }
}

You have surely noticed the methods AssertToken() and PushBackToken(), which haven't been mentioned so far. Both are very useful for making sure the expected characters are read. AssertToken() throws an AssertionException if the token type returned by NextToken() doesn't equal the asserted one. In this class, AssertToken() ensures that every JsonObject starts with { and ends with }. By combining NextToken() and PushBackToken() you can peek at the next token and decide what to do next. This is used within the ReadAttributes() method of this class: as long as a JsonAttribute is followed by a comma, the next tokens are considered to be another attribute. The last attribute is always followed by the closing } char of the JsonObject. This token is pushed back so that it can be read again by the AssertToken() call in ReadObject().
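The push-back mechanism itself needs very little code. Here is a minimal sketch of how it could work, assuming a single flag that makes the next read return the current token once more. PushBackSource is a hypothetical stand-in that runs over a prepared list of token types instead of a stream, and it throws an InvalidOperationException where the real class throws its own AssertionException.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the tokenizer, reduced to the push-back logic.
public sealed class PushBackSource
{
    public const int EndOfInput = -1;

    private readonly IEnumerator<int> tokens;
    private bool pushedBack;
    private int current = EndOfInput;

    public PushBackSource(IEnumerable<int> source)
    {
        tokens = source.GetEnumerator();
    }

    // Returns the next token type, or the current one again after a push back.
    public int NextToken()
    {
        if (pushedBack)
        {
            pushedBack = false;
            return current;
        }

        current = tokens.MoveNext() ? tokens.Current : EndOfInput;
        return current;
    }

    // Marks the current token as unread, so the next read returns it again.
    public void PushBackToken()
    {
        pushedBack = true;
    }

    // Reads the next token and fails if it is not the expected one.
    public void AssertToken(int expected)
    {
        int actual = NextToken();
        if (actual != expected)
        {
            throw new InvalidOperationException(
                "Expected token " + expected + " but got " + actual);
        }
    }
}
```

Peeking is then just NextToken() followed by PushBackToken(): the peeked value can drive an if or switch, and the following AssertToken() or NextToken() call sees the very same token again. Note that a single flag only allows one token to be pushed back at a time, which is all the JSON classes in this article need.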

The following JsonAttribute class shows how to handle the token type returned by NextToken() with a switch statement. This example is pretty straightforward and should be easy to understand:

public class JsonAttribute
{
    public string Name { get; set; }
 
    public object Value { get; set; }
 
    public static JsonAttribute ReadAttribute(ref StreamTokenizer tokenizer)
    {
        JsonAttribute attr = new JsonAttribute();
 
        // Assert quoted name
        tokenizer.AssertToken((int)'"'); 
        attr.Name = tokenizer.SVal;
 
        // Assert : separator char
        tokenizer.AssertToken((int)':'); 
 
        // Read attribute value
        ReadAttrValue(ref tokenizer, ref attr);
 
        return attr;
    }
 
    private static void ReadAttrValue(ref StreamTokenizer tokenizer, 
                                                    ref JsonAttribute attr)
    {
        // Read next token to determine its type
        int token = tokenizer.NextToken();
        switch (token)
        {
            case (int)StreamTokenizer.TokenType.Number:
                // Number token 
                attr.Value = tokenizer.NVal;
                break;
 
            case (int)StreamTokenizer.TokenType.Word:
                // Word token (this should be a boolean value: 'true' or 'false')
                attr.Value = Convert.ToBoolean(tokenizer.SVal);
                break;
 
            case (int)'"':
                // Quote token (these are string values)
                attr.Value = tokenizer.SVal;
                break;
 
            case (int)'{':
                // { char 
                // a new JsonObject is starting
                tokenizer.PushBackToken(); // ReadObject asserts to begin with {
                attr.Value = JsonObject.ReadObject(ref tokenizer);
                break;
 
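            case (int)'[':
                // [ char 
                // a new JsonArray is starting
                tokenizer.PushBackToken(); // ReadArray asserts to begin with [
                attr.Value = JsonArray.ReadArray(ref tokenizer);
                break;
 
            default:
                // Any other token is not a valid attribute value
                throw new System.FormatException("Unexpected token type: " + token);
        }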
    }
}

The last of the three JSON classes mentioned above (JsonArray) does nothing new, so you can look it up in the attached sample project.

 

The result

After adding ToString() methods to all classes, which reassemble the objects into a readable string, the result of this sample project may look like this:

 

The sample project

I hope this sample project helps you understand how to use the StreamTokenizer class. Please don't consider these JSON classes to be complete, fail-safe or stable in any way! They are only meant to show off the use of the StreamTokenizer. You can easily find stable JSON libraries using your favourite search engine.

 

The sources

Here you can find the sources for the StreamTokenizer and the sample project, if you're registered and logged in.
