Blex Operation Mode

From BorielWiki
Jump to: navigation, search

Operation Mode

Blex has two ways of matching input: Greedy mode, and non Greedy mode (which is the default). This is useful when more than one pattern can match the input. When this happens, the operation mode flag helps the Scanner to decide which pattern will match, and which others should be discarded.

Greedy Mode

On greedy mode, the pattern which matches the longest input will be selected. Suppose you want a scanner which recognizes words starting with 'a' as the token 'A_WORD'. You can define something like:

scanner.add_token('ab[a-z]*a', 'AB_WORD')

If the input contains the secuence:

a b r a c a d a b r a

then, scanner.lex() will return a Token 'AB_WORD' containing abracadabra in its text property.

Non Greedy Mode

On non greedy mode, the first pattern that matches the input will be selected. This means the scanner will just return the first input sequence that matches any pattern. So if you add a token pattern like the above, then calling scanned.lex() a Token 'AB_WORD' containing abracadabra again, because this is the first pattern matched.

Note: Regular Expressions always behave as greedy as possible.

That is, they will try to match as much input as they can. So in this case, there's no difference between greedy and non-greedy mode.

Considerations

You have to take care of this, since in non greedy mode, patterns that matches subsets of others probably won't ever match. If any word matched by X is also matched by Y, then the pattern X matches a subset of pattern Y.

We can define two tokens like this:

scanner.add_token('abc[a-z]*', 'ABC_WORD') # Matches any word prefixed with 'abc'
scanner.add_token('a[a-z]*', 'A_WORD') # Matches any word prefixed with 'a'

Pattern defined in line #2 is a superset of the one defined in line #1. In non greedy mode, the pattern defined in line #2 will always match to 'a', and the pattern in line #1 will be ignored. Even for sequences beginning with 'abcxxxx', it will always return 'a'.

Pattern Matching Disambiguation

Regardless the operation mode, when 2 patterns matches the same input string (with the same length), the first defined takes precedence. So, in the example above, when in greedy mode the string 'abc123123' will be matched by both patterns, but Blex will return an 'ABC_WORD' token because it was defined first.

To avoid this problem, you can defined a more elaborated regular expression, like:

scanner.add_token('abc[a-z]*', 'ABC_WORD') # Matches any word prefixed with 'abc'
scanner.add_token('a[^b][^c][a-z]*', 'A_WORD') # Matches any word prefixed with 'a', except 'abcXXX'

The above will work regardless operation mode and precedence. If you find difficult or confusing to define regular expressions like the above, you can use token hooks and reject the wrong matches this way:

def check_not_abc(token, lexer):
    if len(token.text) >= 3 and token.text[:3] == 'abc': # it's an ABC_WORD
        return None # Reject this pattern and try another one
 
    return token # Accept the pattern
 
scanner.add_token('abc[a-z]*', 'ABC_WORD') # Matches any word prefixed with 'abc'
scanner.add_token('a[^b][^c][a-z]*', ('A_WORD', check_not_abc)) # Matches any word prefixed with 'a', except 'abcXXX' 

Whenever the pattern defined in line #8 matches, the function check_not_abc will be called. This funcion will check the matched word does not begins with 'abc'. If it does, rejects the pattern match by returning None. This way you will ensure that, regarding operation mode and precedence, the second rule won't ever match abc prefixed words.