
Grammar file (parser data generator input)


The parser data generator reads the grammar described in a text file and generates the parse data used by the parser.

It is possible to define an ambiguous grammar.

The input grammar text file must be UTF-8 encoded.

The grammar file contains the definitions for both lexical analysis and syntax analysis, including white space and token management.

The grammar used to generate parser data has its own syntax described here. It is a kind of BNF with the following add-ons:

The formal grammar description:

the generator input grammar described by itself

Semantic rules of the input grammar:

"Start" non terminal:
A non terminal belongs to lexical analysis if it is declared as a token or if it is referenced by a lexical rule. A lexical rule is the definition of a token non terminal, or the definition of a non terminal that is referenced by another lexical rule.
A non terminal cannot be shared between lexical and syntactic rules.
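
The definition above is recursive, so the set of lexical non terminals can be computed as a closure starting from the declared tokens. The following Java sketch illustrates the idea only; it is not taken from parser4j. The class LexicalClosure, its method names, and the representation of a grammar as a map from each non terminal to the non terminals its definition references are all assumptions made for this example. The sharing check also assumes one particular reading of the last rule: a syntactic rule may reference a token, but not a non terminal that is lexical only because a lexical rule references it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of parser4j.
final class LexicalClosure {

    /**
     * rules maps each non terminal to the non terminals referenced by its definition.
     * tokens holds the non terminals declared as tokens.
     * Returns every non terminal that belongs to lexical analysis.
     */
    static Set<String> lexicalNonTerminals(Map<String, List<String>> rules, Set<String> tokens) {
        Set<String> lexical = new TreeSet<>(tokens);
        Deque<String> pending = new ArrayDeque<>(tokens);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            // The definition of a lexical non terminal is a lexical rule,
            // so everything it references is lexical as well.
            for (String referenced : rules.getOrDefault(current, List.of())) {
                if (lexical.add(referenced)) {
                    pending.push(referenced);
                }
            }
        }
        return lexical;
    }

    /**
     * Returns the non-token lexical non terminals that are also referenced
     * by a syntactic rule (assumed here to be the forbidden sharing).
     */
    static Set<String> sharedNonTerminals(Map<String, List<String>> rules, Set<String> tokens) {
        Set<String> lexical = lexicalNonTerminals(rules, tokens);
        Set<String> shared = new TreeSet<>();
        for (Map.Entry<String, List<String>> rule : rules.entrySet()) {
            if (lexical.contains(rule.getKey())) {
                continue; // lexical rule, nothing to check
            }
            for (String referenced : rule.getValue()) {
                if (lexical.contains(referenced) && !tokens.contains(referenced)) {
                    shared.add(referenced);
                }
            }
        }
        return shared;
    }
}
```

With the rules A : '@' B ; and [B] : LETTER | B LETTER ; the closure contains B and LETTER, and nothing is shared. If A also referenced LETTER directly, the check would report LETTER.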

Section for white space management:

The grammar rules are enclosed in section:
include "subrule.txt"

When a non terminal is declared as the white space of a section, the declared white space appears between the terminals and tokens of the rules in that section.

For the declaration:
A : '@' B ;
[B] : LETTER | B LETTER ; /* B is a token: a list of LETTER */
In a %; section
it will match: "@abcd"
In a %WS; section, where WS : ' ' ;
it will match: " @ abcd"
and the rule actually used will be: A : WS '@' WS B ;

To generate the parser data, the parser data generator works on a transformed grammar made of basic rules.
Section information is used during this transformation phase, which inserts the white space non terminal into the grammar rules.
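
The effect of this insertion on the A : '@' B example above can be sketched in Java as follows. This is only an illustration inferred from that single example, not the generator's actual code. The class WhiteSpaceInsertion, the representation of a rule body as a list of symbol strings, and the convention that a quoted symbol is a terminal are assumptions. The sketch places the white space non terminal before every terminal and every token, which reproduces A : WS '@' WS B ; and may differ from the real transformation in other cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of parser4j.
final class WhiteSpaceInsertion {

    /**
     * Returns a copy of the rule body where whiteSpace precedes each terminal
     * (a symbol starting with a quote) and each token.
     * A null whiteSpace stands for a section without white space: the body is unchanged.
     */
    static List<String> insertWhiteSpace(List<String> body, Set<String> tokens, String whiteSpace) {
        List<String> result = new ArrayList<>();
        for (String symbol : body) {
            boolean terminal = symbol.startsWith("'") || symbol.startsWith("\"");
            if (whiteSpace != null && (terminal || tokens.contains(symbol))) {
                result.add(whiteSpace);
            }
            result.add(symbol);
        }
        return result;
    }
}
```

Calling insertWhiteSpace with the body ['@', B], the token set {B}, and the white space name WS yields the body of the transformed rule; passing null leaves the body as written.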

See white space management.


A : B C ;
A is the concatenation of B and C
A : B C D ;
A is the concatenation of B, C, and D
A : B 'a' ;
A is the concatenation of B and the terminal character 'a'
A : B | C ;
A is B or C
A : B | C | D ;
A is B, C, or D
A : B | C D ;
A is B or the concatenation of C and D
A : B ; { MatchX }
A is equivalent to B, and match management is done by class MatchX
A : ; { MatchX }
A is empty, and match management is done by class MatchX
A : B { M1 } | C D { M2 } ; { M3 }
A : { M1 } | B C { M2 } ; { M3 }
A : B ( C | D ; )
A is the concatenation of B and of either C or D
A : B ( C | D ; { MatchXX } ) { MatchYY }
the class MatchXX manages the match of C or D
the class MatchYY manages the match of the concatenation of B and of either C or D
it is not possible to write A : B { MatchXX } ;

TODO:transformation see match management



A : 'a'
non terminal A defined as the terminal character 'a'
A : '\''
non terminal A defined as the terminal apostrophe character
A : '\\'
non terminal A defined as the terminal backslash character
A : '\x41'
non terminal A defined as the terminal character with hexadecimal code 41 (ASCII range)
A : '\u2145'
non terminal A defined as the terminal character with hexadecimal code 2145 (UTF-16 range)
A : '\n'
new line character (0x0A)
A : '\r'
carriage return character (0x0D)
A : '\t'
horizontal tabulation character (0x09)


A : "hello"
non terminal A defined as terminal character string "hello"

A character value can be written as \n, \r, \t, \x<hex digit><hex digit>, or \u<hex digit><hex digit><hex digit><hex digit>, as for a character terminal.
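
The escape forms listed for character and string terminals can be decoded with a small routine such as the Java sketch below. It is an illustration only, not parser4j's implementation. The class EscapeDecoder and its methods are hypothetical, and treating any other escaped character (such as the apostrophe or the backslash) as that character itself is an assumption.

```java
// Hypothetical helper, not part of parser4j.
final class EscapeDecoder {

    /** Decodes the text found between the quotes of a character or string terminal. */
    static String decode(String body) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i++);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i >= body.length()) {
                throw new IllegalArgumentException("dangling backslash");
            }
            char escaped = body.charAt(i++);
            switch (escaped) {
                case 'n': out.append('\n'); break;
                case 'r': out.append('\r'); break;
                case 't': out.append('\t'); break;
                // backslash-x is followed by exactly two hexadecimal digits
                case 'x': out.append(hex(body, i, 2)); i += 2; break;
                // backslash-u is followed by exactly four hexadecimal digits
                case 'u': out.append(hex(body, i, 4)); i += 4; break;
                // assumption: any other escaped character stands for itself
                default: out.append(escaped);
            }
        }
        return out.toString();
    }

    /** Reads the given number of hexadecimal digits starting at start. */
    private static char hex(String s, int start, int digits) {
        if (start + digits > s.length()) {
            throw new IllegalArgumentException("truncated escape");
        }
        int value = 0;
        for (int k = start; k < start + digits; k++) {
            int digit = Character.digit(s.charAt(k), 16);
            if (digit < 0) {
                throw new IllegalArgumentException("invalid hexadecimal digit");
            }
            value = value * 16 + digit;
        }
        return (char) value;
    }
}
```

For instance, decoding the body of '\x41' gives the single character A, and decoding the body of '\u2145' gives the single character with code 0x2145.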

character class

A : [a-z]
non terminal A defined as one character in set 'a' through 'z'
A : [abcd]
non terminal A defined as one character in set 'a', 'b', 'c', and 'd'
A : [*\-+]
non terminal A defined as one character in set '*', '-', and '+'
the '\' before the '-' means that '-' is a character of the set (not a range separator)
A : [()\]{}]
non terminal A defined as one character in set '(', ')', ']', '{', and '}'
the '\' before the ']' means that ']' is a character of the set (not the closing bracket of the character class)
A : [_a-z]
non terminal A defined as one character in set '_' and 'a' through 'z'

A character value can be written as \n, \r, \t, \x<hex digit><hex digit>, or \u<hex digit><hex digit><hex digit><hex digit>, as for a character terminal.
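
To make the role of ranges and of the escaped '-' and ']' concrete, the following Java sketch expands the text between the brackets of a character class into the set of characters it denotes. It is only an illustration and not parser4j's code. The class CharacterClass and its methods are hypothetical. Treating an unescaped '-' at the start or end of the class as a plain character, and rejecting a range whose bounds are inverted, are assumptions that the text above does not settle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of parser4j.
final class CharacterClass {

    /** Expands the text between '[' and ']' into the set of characters it denotes. */
    static Set<Character> expand(String body) {
        // First pass: resolve escapes and remember which '-' were written unescaped.
        StringBuilder chars = new StringBuilder();
        List<Boolean> rangeSeparator = new ArrayList<>();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i++);
            if (c != '\\') {
                chars.append(c);
                rangeSeparator.add(c == '-');
                continue;
            }
            if (i >= body.length()) {
                throw new IllegalArgumentException("dangling backslash");
            }
            char escaped = body.charAt(i++);
            switch (escaped) {
                case 'n': chars.append('\n'); break;
                case 'r': chars.append('\r'); break;
                case 't': chars.append('\t'); break;
                case 'x': chars.append(hex(body, i, 2)); i += 2; break;
                case 'u': chars.append(hex(body, i, 4)); i += 4; break;
                default: chars.append(escaped); // escaped '-', ']' or backslash
            }
            rangeSeparator.add(false); // an escaped character never separates a range
        }
        // Second pass: expand "low - high" triples, keep everything else as is.
        Set<Character> set = new TreeSet<>();
        for (int k = 0; k < chars.length(); k++) {
            boolean range = k + 2 < chars.length() && rangeSeparator.get(k + 1);
            if (!range) {
                set.add(chars.charAt(k));
                continue;
            }
            char low = chars.charAt(k);
            char high = chars.charAt(k + 2);
            if (low > high) {
                throw new IllegalArgumentException("inverted range");
            }
            for (int v = low; v <= high; v++) {
                set.add((char) v);
            }
            k += 2; // skip the separator and the upper bound
        }
        return set;
    }

    /** Reads the given number of hexadecimal digits starting at start. */
    private static char hex(String s, int start, int digits) {
        if (start + digits > s.length()) {
            throw new IllegalArgumentException("truncated escape");
        }
        int value = 0;
        for (int k = start; k < start + digits; k++) {
            int digit = Character.digit(s.charAt(k), 16);
            if (digit < 0) {
                throw new IllegalArgumentException("invalid hexadecimal digit");
            }
            value = value * 16 + digit;
        }
        return (char) value;
    }
}
```

Under these assumptions, the body a-z expands to 26 letters and the body _a-z to 27 characters. A '-' preceded by a backslash is added to the set as an ordinary character, whereas an unescaped '-' between two characters is always read as a range.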

© 2008-2009, parser4j