parser data generator grammar input

The parser data generator reads a grammar described in a text file and generates the parse data used by the parser.

It is possible to define an ambiguous grammar.

The input grammar text file must be UTF-8 encoded.

The grammar file contains the definitions for both lexical analysis and syntax analysis, including whitespace and token management.

The grammar used to generate parser data has its own syntax described here. It is a kind of BNF with the following add-ons:

The formal grammar description :

Semantic rules of input Grammar:

"Start" non terminal:

must be defined; matching Start means matching the whole grammar.
must not be a token.
must not be referenced in any other rule.
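
As an illustration, the three constraints on Start can be checked as in the following Python sketch. It is not part of the generator: the function name check_start and the representation of the grammar (a dict mapping each non terminal to the set of non terminals referenced by its definition, plus a set of token names) are assumptions made for the example.

```python
def check_start(rules, tokens):
    """Return the list of Start constraints that the grammar violates.

    rules:  dict mapping a non terminal name to the set of non terminal
            names referenced in its definition (hypothetical representation).
    tokens: set of non terminal names declared as tokens.
    """
    errors = []
    if "Start" not in rules:
        errors.append("Start is not defined")
    if "Start" in tokens:
        errors.append("Start must not be a token")
    # Only rules other than Start itself are inspected, as stated above.
    for name in sorted(rules):
        if name != "Start" and "Start" in rules[name]:
            errors.append(f"Start is referenced by {name}")
    return errors
```

An empty list means that the grammar satisfies the three constraints.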

A non terminal is used for lexical analysis if it is declared as a token or if it is referenced by a lexical rule. A lexical rule is the definition of a token, or the definition of a non terminal that is referenced by another lexical rule.
A non terminal cannot be shared between lexical and syntactic rules.
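
The definition above is a transitive closure starting from the tokens. The Python sketch below shows one way to compute it and to detect shared non terminals. The function names and the grammar representation (same dict and token set as a simple model of the grammar) are hypothetical, and the sketch assumes that a syntactic rule is allowed to reference a token itself but nothing below it.

```python
def lexical_nonterminals(rules, tokens):
    """Return the tokens plus every non terminal reachable from a token."""
    lexical = set(tokens)
    stack = list(tokens)
    while stack:
        name = stack.pop()
        for ref in rules.get(name, ()):
            if ref not in lexical:
                lexical.add(ref)
                stack.append(ref)
    return lexical


def shared_nonterminals(rules, tokens):
    """Return the lexical non terminals that a syntactic rule also references.

    Assumption: referencing a token from a syntactic rule is legal, so
    tokens themselves are never reported.
    """
    lexical = lexical_nonterminals(rules, tokens)
    shared = set()
    for name, refs in rules.items():
        if name not in lexical:  # this is a syntactic rule
            shared |= {r for r in refs if r in lexical and r not in tokens}
    return shared
```

With B declared as a token and defined from LETTER, both B and LETTER are lexical; a syntactic rule that references LETTER directly would be reported by shared_nonterminals.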

Section for white space management:

where:

the first %; begins a section in which no non terminal is used as white space.
%WS; ends the first section and starts a new one in which the non terminal WS is used as white space.
%TOTO; ends the second section and starts a new one in which the non terminal TOTO is used as white space.
include "subrule.txt" includes a file containing rule declarations in the current section. The file must be in the same directory as the including file.

Declaring a non terminal as white space means that this white space is matched between terminals and tokens:

for the declaration:

A : '@' B ;
[B] : LETTER | B LETTER ; /* B is a token: a list of LETTER */

In a %; section

will match: "@abcd"

In a %WS; section where WS : ' ' ;

will match: " @ abcd"
the rule actually used is: A : WS '@' WS B ;

The parser data generator transforms the grammar into basic rules before generating parser data.
Section information is used in this transformation phase, which inserts the white space non terminal into the grammar rules.
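
The insertion step can be pictured with the short Python sketch below. It only reproduces the example given above (A : '@' B becoming A : WS '@' WS B) and is not the generator's actual code. The function name insert_whitespace, the list-of-symbols representation, and the choice to treat anything starting with a quote or '[' as a terminal are assumptions.

```python
def insert_whitespace(body, ws, tokens):
    """Insert the white space non terminal before each terminal and token.

    body:   symbols of one syntactic rule body; terminals keep their
            quotes, e.g. "'@'" (hypothetical representation).
    ws:     name of the white space non terminal, or None in a %; section.
    tokens: set of non terminal names declared as tokens.
    """
    if ws is None:
        return list(body)
    result = []
    for symbol in body:
        # Assumption: a symbol starting with ', " or [ is a terminal.
        is_terminal = symbol[0] in "'\"["
        if is_terminal or symbol in tokens:
            result.append(ws)
        result.append(symbol)
    return result
```

For example, insert_whitespace(["'@'", "B"], "WS", {"B"}) gives the body WS '@' WS B, while passing None for ws leaves the body unchanged, as in a %; section.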

See white space management.

Rules:

A : B C ;

A is the concatenation of B and C

A : B C D;

A is the concatenation of B, C, and D

A : B 'a' ;

A is the concatenation of B and the terminal character 'a'

A : B | C ;

A is B or C

A : B | C | D;

A is B, C, or D

A : B | C D ;

A is B or concatenation of C and D
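
The rule forms above show that concatenation binds more tightly than '|'. A minimal Python sketch of that reading follows. The function name alternatives is hypothetical, and the sketch assumes a body made only of bare non terminal names (no quoted terminals, groups, or match classes).

```python
def alternatives(body):
    """Split a rule body such as "B | C D" into its list of alternatives.

    Each alternative is the list of symbols that are concatenated.
    Assumption: the body contains only bare names separated by spaces.
    """
    return [alt.split() for alt in body.split("|")]
```

Here alternatives("B | C D") yields two alternatives, B alone and the sequence C D, and an empty body yields a single empty alternative.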

A : B ; {MatchX}

A is equivalent to B and match management is done by class MatchX

A : ; { MatchX }

A is empty and match management is done by class MatchX

A : B { M1 } | C D { M2 } ; { M3 }

A : { M1 } | B C { M2 } ; { M3 }

A: B ( C | D ; )

A is the concatenation of B with either C or D

A : B ( C | D ; { MatchXX } ) { MatchYY}

the class MatchXX manages the match of C or D
the class MatchYY manages the match of the concatenation of B with C or D

It is not possible to write A : B {MatchXX} ;

Terminal:

character

A : 'a'

non terminal A defined as terminal character 'a'

A : '\''

non terminal A defined as terminal apostrophe character

A : '\\'

non terminal A defined as terminal backslash character

A : '\x41'

non terminal A defined as terminal character of code 41 in hexadecimal notation (ASCII range)

A : '\u2145'

non terminal A defined as terminal character of code 2145 in hexadecimal notation (UTF-16 range)

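A : '\n'

non terminal A defined as terminal new line character (0x0A)

A : '\r'

non terminal A defined as terminal carriage return character (0x0D)

A : '\t'

non terminal A defined as terminal horizontal tabulation character (0x09)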

string

A : "hello"

non terminal A defined as terminal character string "hello"

Characters in a string can be written as \n, \r, \t, \x<hex digit><hex digit>, or \u<hex digit><hex digit><hex digit><hex digit>, as for a character terminal.
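
The escape forms listed for character and string terminals can be decoded as in the Python sketch below. This is an illustration under assumptions, not the generator's implementation: the function name decode_escapes is hypothetical, and any escape not documented above is rejected.

```python
import string


def decode_escapes(text):
    """Decode the body of a character or string terminal (quotes removed)."""
    simple = {"n": "\n", "r": "\r", "t": "\t", "'": "'", "\\": "\\"}
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling backslash")
        kind = text[i + 1]
        if kind in simple:
            out.append(simple[kind])
            i += 2
        elif kind in ("x", "u"):
            # \x takes two hex digits, \u takes four.
            width = 2 if kind == "x" else 4
            digits = text[i + 2 : i + 2 + width]
            if len(digits) != width or any(d not in string.hexdigits for d in digits):
                raise ValueError("bad hexadecimal escape")
            out.append(chr(int(digits, 16)))
            i += 2 + width
        else:
            # Assumption: escapes not listed in this document are errors.
            raise ValueError("unknown escape \\" + kind)
    return "".join(out)
```

For instance, the body \x41 decodes to the single character 'A', and \u2145 decodes to the character of code 2145 in hexadecimal.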

character class

A : [a-z]

non terminal A defined as one character in set 'a' through 'z'

A : [abcd]

non terminal A defined as one character in set 'a', 'b', 'c', and 'd'

A : [*\-+]

non terminal A defined as one character in set '*', '-', and '+'
the '\' before the '-' means that '-' is a character of the set ( not a range separator )

A : [()\]{}]

non terminal A defined as one character in set '(', ')', ']', '{' and '}'
the '\' before the ']' means that ']' is a character of the set ( not the character class ending bracket)

A : [_a-z]

non terminal A defined as one character in set '_' and 'a' through 'z'
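
The character class examples above can be summarised by the Python sketch below, which expands the text between '[' and ']' into the set of characters it denotes. The function name class_members is hypothetical, and the sketch assumes that a '\' simply makes the next character literal and that an unescaped '-' between two characters denotes a range.

```python
def class_members(body):
    """Expand a character class body (without the brackets) into a set."""
    # First pass: resolve '\' escapes, remembering which characters were escaped.
    items = []  # list of (character, was_escaped)
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            items.append((body[i + 1], True))
            i += 2
        else:
            items.append((body[i], False))
            i += 1
    # Second pass: an unescaped '-' between two characters is a range.
    members = set()
    i = 0
    while i < len(items):
        first = items[i][0]
        if i + 2 < len(items) and items[i + 1] == ("-", False):
            last = items[i + 2][0]
            members.update(chr(c) for c in range(ord(first), ord(last) + 1))
            i += 3
        else:
            members.add(first)
            i += 1
    return members
```

With this reading, the body _a-z gives '_' plus the 26 letters 'a' through 'z', and the body *\-+ gives exactly '*', '-', and '+'.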