Chapter 6
The order in which tokens are declared matters: the first token that matches is returned. Regular expressions that describe keywords receive special treatment: TPG also looks for a word boundary after the keyword. For example, if you declare the keywords if and ifxyz, TPG internally searches for if\b and ifxyz\b. This way, if won’t match the beginning of ifxyz and won’t interfere with general identifiers (\w+ for example).
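The keyword rule can be illustrated with Python's standard re module. This is only a sketch of the idea, not TPG's actual implementation: the helpers compile_token and first_token, the token names, and the test used to decide what counts as a keyword are all assumptions made for this example.

```python
import re

def compile_token(pattern):
    """Compile a token pattern, appending \\b when it looks like a keyword.

    Assumption for this sketch: a "keyword" is a pattern made only of
    word characters, such as 'if' or 'ifxyz'.
    """
    if re.fullmatch(r"\w+", pattern):
        pattern += r"\b"
    return re.compile(pattern)

# Declaration order matters: the first token that matches wins.
TOKENS = [
    ("IF", compile_token("if")),        # compiled as if\b
    ("IDENT", compile_token(r"\w+")),   # not a keyword, left unchanged
]

def first_token(text):
    """Return (name, matched text) for the first declared token that matches."""
    for name, regex in TOKENS:
        m = regex.match(text)
        if m:
            return name, m.group()
    return None
```

With these definitions, first_token("if x") yields the IF token, while first_token("ifxyz") falls through to IDENT because no word boundary follows the leading "if".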
There are two kinds of tokens. Tokens defined with the token keyword are parsed by the parser, while tokens defined with the separator keyword are treated as separators (whitespace or comments, for example) and are discarded by the lexer.
Tokens can also be defined on the fly. Their definitions are then inlined in the grammar rules. This feature may be useful for keywords or punctuation signs. Unlike predefined tokens, inline tokens cannot be transformed by an action: they always return the matched text as a string.
See figure 6.2 for examples.
Inline tokens have a higher precedence than predefined tokens to avoid conflicts (an inlined if won’t be matched as a predefined identifier).
TPG works in two stages. The lexer first splits the input string into a list of tokens and then the parser parses this list.
The lexer splits the input string according to the token definitions (see 6.2). When the input string cannot be matched, a tpg.LexerError exception is raised.
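The lexing stage can be sketched with the standard re module. The code below is an illustration under stated assumptions, not TPG's own lexer: the LexerError class is a local stand-in for tpg.LexerError, and the token set and the tokenize function are hypothetical. It shows only the first stage; the parser that consumes the resulting list is left out.

```python
import re

class LexerError(Exception):
    """Hypothetical stand-in for tpg.LexerError in this sketch."""

# (name, regex, is_separator): an assumed token set, in declaration order
TOKEN_DEFS = [
    ("spaces", re.compile(r"\s+"), True),
    ("number", re.compile(r"\d+"), False),
    ("ident", re.compile(r"\w+"), False),
    ("op", re.compile(r"[-+*/=]"), False),
]

def tokenize(text):
    """Split text into (name, value) pairs, dropping separators."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, regex, is_separator in TOKEN_DEFS:
            m = regex.match(text, pos)
            if m:
                if not is_separator:
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # No token definition matches at this position.
            raise LexerError(f"cannot match input at position {pos}")
    return tokens
```

For instance, tokenize("x = 42 + y") returns five tokens with the whitespace removed, and tokenize("x ? y") raises LexerError because no definition matches the question mark.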
The lexer may loop indefinitely if a token can match an empty string, since an empty string matches at every position of the input.
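The reason is easy to see with the standard re module: a successful match that consumes no characters leaves the lexer at the same position. The snippet below is a sketch, and can_match_empty is a hypothetical helper, not part of TPG. It only checks the simplest case and will not flag patterns that match emptily only in some contexts, such as lookaheads.

```python
import re

maybe_digits = re.compile(r"\d*")   # can match the empty string

m = maybe_digits.match("abc", 0)
# The match succeeds but consumes nothing, so a lexer that simply
# advances to m.end() would stay at position 0 forever.
consumed = m.end() - m.start()

def can_match_empty(pattern):
    """Return True if the pattern matches the empty string."""
    return re.compile(pattern).fullmatch("") is not None
```

Checking each token definition this way before lexing (or writing \d+ instead of \d*) avoids the problem.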
Tokens are matched as symbols are recognized. Predefined tokens have the same syntax as nonterminal symbols. The token text (or the result of the function associated with the token) can be saved with the infix / operator (see figure 6.3).
Inline tokens have a similar syntax: you just write the regular expression (in a string). Its text can also be saved (see figure 6.4).