From BorielWiki
Jump to: navigation, search

You can see a token as an integer identifying a set of chars taken from the input, but there are more formal token definitions in the internet. When a sequence of chars matches a pattern, a token is generated and passed to the scanner. The token (usually an integer) identifies which pattern has matched the input.

The Token Class

Instead of just storing an integer for the matched pattern, Blex uses a Token class, to store much more information. The token value will be called id.

Note: Blex uses a string to store the Token ID.

Traditionally, scanners return an integer as the Token ID. Blex uses strings. So will be a string like identifier or integer, not a numeric value.


The Token class has the following properties:

  • Contains the string name you defined for the matched pattern (yes, this is the token itself). E.g. 'INTEGER'
  • Token.text: Contains the lexeme (chars) matched for the pattern. It's equivalent to flex yy_text value.
  • Token.value: Contains the numerical value for the matched pattern. E.g. if Token.text is '32.5', Token.value will be 32.5. Defaults to None if Token.text does not represent a valid floating point number. You can directly read and change this attribute as desired.
  • Token.hook: Is a function that will be called when the pattern has been matched. It this function return False, then the Blex will discard this token (e.g. it won't be passed to the parser). This is useful, for example, to discard whitespaces, tabs or comments. If the hook returns None, then the scanner will discard the token and rollback the input buffer, trying another pattern. It's similar to Lex REJECT action.

The hook defaults to None which means no function will be called and the token will be passed to the parser as is. The hook must return the Token instance that will be sent to the parser. When you implement a hook, you can return a completely different Token instance, thus having a lot of control over which tokens are really returned.

There are 2 predefined hooks:

  • Token.skip which will cause the input pattern to be always skipped, and
  • Token.reject which will cause the matched input to be always rejected and try another match.

See Token hook for more information.

Creating a Token

A Token instance can be created invoking its constructor:

token = Token('SPACE', ' ')  # A Space

Or more interesting:

left_par = Token('LP', '(') # a Token instance for Left Parenthesis
right_par = Token('RP', ')') # a Token instance for Right Parenthesis

You mostly don't want to create Tokens yourself, unless returning a different Token instance from within a hook.