
Grammar file (parser data generator input)

Presentation:

The parser data generator reads the grammar described in a text file and generates the parse data used by the parser.

It is possible to define an ambiguous grammar.

The input grammar text file must be UTF-8 encoded.

The grammar file contains the definitions for both lexical analysis and syntax analysis, including white space and token management.

The grammar used to generate parser data has its own syntax described here. It is a kind of BNF with the following add-ons:
where:

The formal grammar description :

the generator input grammar described by itself

Semantic rules of input Grammar:

"Start" non terminal:
A non terminal belongs to lexical analysis if it is declared as a token or if it is referenced by a lexical rule. A lexical rule is either the definition of a token non terminal or the definition of a non terminal that is referenced by another lexical rule.
A non terminal cannot be shared between lexical and syntactic rules.
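
The definition above is transitive, so it can be read as a closure computation starting from the declared tokens. The following Python sketch is only an illustration, not the generator's code: the function names and the dictionary representation of the rules are hypothetical, and the sharing check assumes that a syntactic rule may reference a token itself but not a non terminal that is lexical only because a lexical rule references it.

```python
def lexical_non_terminals(rules, tokens):
    """Return the set of non terminals that belong to lexical analysis.

    rules  : dict mapping a non terminal name to the set of non terminal
             names referenced by its definition (terminals are left out).
    tokens : set of non terminal names declared as tokens.
    """
    lexical = set(tokens)
    pending = list(tokens)
    while pending:
        name = pending.pop()
        # Everything referenced by a lexical rule is lexical as well.
        for ref in rules.get(name, ()):
            if ref not in lexical:
                lexical.add(ref)
                pending.append(ref)
    return lexical


def shared_non_terminals(rules, tokens):
    """Return lexical non terminals that a syntactic rule also references.

    Assumption: referencing a declared token from a syntactic rule is
    allowed; referencing any other lexical non terminal is not.
    """
    lexical = lexical_non_terminals(rules, tokens)
    shared = set()
    for name, refs in rules.items():
        if name in lexical:
            continue  # lexical rule: nothing to check
        shared |= {ref for ref in refs if ref in lexical and ref not in tokens}
    return shared
```

For example, with `rules = {"A": {"B"}, "B": {"B", "LETTER"}, "LETTER": set()}` and `tokens = {"B"}`, both `B` and `LETTER` are lexical and nothing is shared. Adding a syntactic rule `C` that references `LETTER` directly would report `LETTER` as shared.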

Section for white space management:

The grammar rules are enclosed in sections:
%;
....
rules
....
%WS;
....
rules
....
%TOTO;
....
rules
....
include "subrule.txt"

where:
Declaring a non terminal as white space for a section means that the declared white space is placed before the terminals and tokens of the rules in that section:

For the declaration:
A : '@' B ;
[B] : LETTER | B LETTER ; /* B is a token made of a list of LETTER */
In a %; section
it will match: "@abcd"
In a %WS; section where WS: ' ' ;
it will match: " @ abcd"
because the rule actually used will be: A: WS '@' WS B ;
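
The effect of the section on this rule can be sketched in Python. This is a minimal illustration that assumes the white space non terminal is simply inserted before every symbol of the rule body, as in the example above; the function name and the list representation are hypothetical, and the real transformation may handle more cases.

```python
def insert_white_space(rhs, ws_name):
    """Insert the white space non terminal before every symbol of a rule body.

    rhs     : list of symbols (terminals or tokens) of one rule alternative.
    ws_name : name of the white space non terminal of the section, or ""
              for the unnamed section "%;", which inserts nothing.
    """
    if not ws_name:
        return list(rhs)
    result = []
    for symbol in rhs:
        result.append(ws_name)  # white space comes before the symbol
        result.append(symbol)
    return result
```

With `rhs = ["'@'", "B"]`, the unnamed section leaves the rule unchanged, while `ws_name = "WS"` gives `["WS", "'@'", "WS", "B"]`, which corresponds to `A: WS '@' WS B ;`.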

The parser data generator does not work on the grammar as written: it first transforms it into basic rules, then generates the parser data from the transformed grammar.
Section information is used during this transformation phase, which inserts the white space non terminal into the grammar rules.

See white space management.

Rules:

A : B C ;
A is the concatenation of B and C
A : B C D;
A is the concatenation of B, C, and D
A : B 'a' ;
A is the concatenation of B and the terminal character 'a'
A : B | C ;
A is B or C
A : B | C | D;
A is B, C, or D
A : B | C D ;
A is B or the concatenation of C and D
A : B ; {MatchX}
A is equivalent to B, and match management is done by class MatchX
A : ; { MatchX }
A is empty, and match management is done by class MatchX
A : B { M1 } | C D { M2 } ; { M3 }
A : { M1 } | B C { M2 } ; { M3 }
A: B ( C | D ; )
A is the concatenation of B and of either C or D
A : B ( C | D ; { MatchXX } ) { MatchYY}
the class MatchXX manages the match of C or D
the class MatchYY manages the match of the concatenation of B and of either C or D
it is not possible to write A : B {MatchXX} ;
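
To make the meaning of concatenation, alternation, and the empty rule concrete, here is a naive Python recognizer. It is only a sketch of the semantics described above and says nothing about how the generated parser works: the dictionary representation and the function names are hypothetical, match classes are ignored, and left-recursive rules are not supported.

```python
def match_ends(grammar, symbol, text, pos=0):
    """Return the set of positions where `symbol` can stop matching from `pos`.

    grammar : dict mapping a non terminal name to a list of alternatives,
              each alternative being a list of symbols to concatenate.
    Any symbol that is not a key of `grammar` is treated as a terminal string.
    """
    if symbol not in grammar:  # terminal: compare literally
        return {pos + len(symbol)} if text.startswith(symbol, pos) else set()
    ends = set()
    for alternative in grammar[symbol]:  # "|": any alternative may match
        positions = {pos}
        for part in alternative:  # concatenation: parts follow each other
            positions = {end for start in positions
                         for end in match_ends(grammar, part, text, start)}
        ends |= positions  # an empty alternative keeps {pos}
    return ends


def accepts(grammar, start, text):
    """True if `start` matches the whole text."""
    return len(text) in match_ends(grammar, start, text)
```

The rule `A : B | C D ;` can be written as `{"A": [["B"], ["C", "D"]], ...}`; with `B`, `C`, and `D` defined as the terminals "b", "c", and "d", `accepts` is true for "b" and "cd" and false for "c". A group such as `( C | D ; )` could be modelled by an extra non terminal with two alternatives.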

TODO:transformation see match management

Terminal:

character

A : 'a'
non terminal A defined as terminal character 'a'
A : '\''
non terminal A defined as the terminal apostrophe character
A : '\\'
non terminal A defined as the terminal backslash character
A : '\x41'
non terminal A defined as the terminal character of code 41 in hexadecimal notation (ASCII range)
A : '\u2145'
non terminal A defined as the terminal character of code 2145 in hexadecimal notation (UTF-16 range)
A : '\n'
new line character (0x0A)
A : '\r'
carriage return character (0x0D)
A : '\t'
horizontal tabulation character (0x09)
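
The notations listed above can be decoded with a few lines of Python. This is a sketch written from the list alone, not the generator's lexer: the function name is hypothetical, and rejecting an unescaped apostrophe or backslash is an assumption.

```python
SIMPLE_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "'": "'", "\\": "\\"}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def decode_char_terminal(literal):
    """Return the character denoted by a quoted character terminal."""
    if len(literal) < 3 or literal[0] != "'" or literal[-1] != "'":
        raise ValueError("not a character terminal: " + literal)
    body = literal[1:-1]
    if len(body) == 1:
        # Assumption: a lone apostrophe or backslash must be escaped.
        if body in ("\\", "'"):
            raise ValueError("character must be escaped: " + literal)
        return body
    if body[0] != "\\":
        raise ValueError("more than one character: " + literal)
    kind, digits = body[1], body[2:]
    if kind in SIMPLE_ESCAPES and not digits:
        return SIMPLE_ESCAPES[kind]
    # \x takes two hexadecimal digits, \u takes four.
    expected = {"x": 2, "u": 4}.get(kind)
    if expected == len(digits) and all(d in HEX_DIGITS for d in digits):
        return chr(int(digits, 16))
    raise ValueError("unknown escape: " + literal)
```

For instance, `decode_char_terminal(r"'\x41'")` returns the character "A" (code 0x41), and `decode_char_terminal(r"'\n'")` returns the new line character.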

string

A : "hello"
non terminal A defined as terminal character string "hello"

A character value inside a string can also be written as \n, \r, \t, \x<hex digit><hex digit>, or \u<hex digit><hex digit><hex digit><hex digit>, as for a character terminal.

character class

A : [a-z]
non terminal A defined as one character in set 'a' through 'z'
A : [abcd]
non terminal A defined as one character in set 'a', 'b', 'c', and 'd'
A : [*/\-+]
non terminal A defined as one character in set '*', '/', '-', and '+'
the '\' before the '-' means that '-' is a character of the set (not a range separator)
A : [()\]{}]
non terminal A defined as one character in set '(', ')', ']', '{', and '}'
the '\' before the ']' means that ']' is a character of the set (not the closing bracket of the character class)
A : [_a-z]
non terminal A defined as one character in set '_' and 'a' through 'z'

A character value inside a character class can also be written as \n, \r, \t, \x<hex digit><hex digit>, or \u<hex digit><hex digit><hex digit><hex digit>, as for a character terminal.
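
The character class notation can be expanded into an explicit set of characters, as in the Python sketch below. It is an illustration under stated assumptions, not the generator's implementation: the function name is hypothetical, a '-' counts as a range separator only when it is unescaped and sits between two characters, and the \x and \u forms are deliberately left out to keep the example short.

```python
SIMPLE_ESCAPES = {"n": "\n", "r": "\r", "t": "\t"}


def parse_char_class(text):
    """Return the set of characters described by a class such as [a-z]."""
    if len(text) < 2 or text[0] != "[" or text[-1] != "]":
        raise ValueError("not a character class: " + text)
    body = text[1:-1]
    # First pass: resolve backslash escapes into (character, escaped) pairs.
    items = []
    i = 0
    while i < len(body):
        if body[i] == "\\":
            if i + 1 >= len(body):
                raise ValueError("dangling backslash: " + text)
            escaped = body[i + 1]
            if escaped in "xu":
                raise NotImplementedError("\\x and \\u are not handled here")
            items.append((SIMPLE_ESCAPES.get(escaped, escaped), True))
            i += 2
        else:
            items.append((body[i], False))
            i += 1
    # Second pass: expand "first-last" ranges, keep other characters as is.
    chars = set()
    i = 0
    while i < len(items):
        first = items[i][0]
        if i + 2 < len(items) and items[i + 1] == ("-", False):
            last = items[i + 2][0]
            if ord(last) < ord(first):
                raise ValueError("reversed range: " + text)
            chars.update(chr(code) for code in range(ord(first), ord(last) + 1))
            i += 3
        else:
            chars.add(first)
            i += 1
    return chars
```

With this sketch, `parse_char_class("[_a-z]")` yields '_' plus the 26 lowercase letters, and `parse_char_class(r"[*/\-+]")` yields the four characters '*', '/', '-', and '+'.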


© 2008-2009, parser4j