Here is a simple grammar of a language that can be fully implemented without using lex and yacc. For lexical analysis we need a lexical analyzer, which determines from a sequence of characters which lexeme has been encountered. The parser then checks the syntax based on the resulting stream of tokens.
Lexical analysis
In this particular case it suffices to look at most one character ahead in the input stream to understand which lexeme we are dealing with. This is called analysis with a lookahead of 1 character.
Usually the lexical analyzer skips all delimiters (spaces, tabs, and newline characters) that it encounters in the input stream and does not return them as tokens.
To begin with, let us determine which lexemes appear in the grammar. These are the keywords:
main, int, bool, void, for, if, return
the separators and operators:
{} () ; = < > == !=
as well as identifiers and numbers. To simplify the code, we will also add a token corresponding to the end of the file. In addition, if we cannot recognize a token, we must return a special code. In total:
#define MAIN 1
#define INT 2
#define BOOL 3
#define VOID 4
#define FOR 5
#define IF 6
#define RETURN 7

#define LBRACE 101     /* { */
#define RBRACE 102     /* } */
#define LPAREN 103     /* ( */
#define RPAREN 104     /* ) */
#define SEMICOLON 105  /* ; */
#define ASSIGN 106     /* = */
#define LT 107         /* < */
#define GT 108         /* > */
#define EQUAL 109      /* == */
#define NOT_EQUAL 110  /* != */

#define IDENTIFIER 201
#define NUMBER 202

#define END 301        /* end of file */
#define ERROR 302      /* error */
What is a lexical analyzer? In the simplest case, it is a single function that receives a stream of characters as input, reads one lexeme from it, and returns it. For the IDENTIFIER and NUMBER tokens we also need the characters that were read, so we pass in the address of a buffer into which characters from the input stream are saved.
To read several lexemes, you call the function several times. The function signature, then, should be:
int get_lexem(FILE* stream, char* buffer);
How do we implement it? Consider the input stream "a=534"; for simplicity, let it contain no spaces. After reading the character '=' from the stream we cannot stop, because the stream might contain the token "==". So we read the next character; it turns out to be '5', and now we know for certain that we have read the ASSIGN token and not EQUAL. However, the character '5' has already been consumed from the stream, and on the next call we would read '3' and '4' from there. We need to save the last character read between calls to get_lexem.
This can be done in different ways; we will push the character back into the stream by calling the standard C function ungetc. This situation does not arise on every call, only when the token contains more than one character.
#include <stdio.h>
#include <ctype.h>
#include <string.h>

int get_lexem(FILE* stream, char* buffer)
{
    int ch = getc(stream);
    size_t i;

    /* skip all spaces, tabs, and newline characters */
    while(isspace(ch))
        ch = getc(stream);

    /* end of file */
    if(ch == EOF) return END;

    /* simple single-character lexemes */
    if(ch == '{') return LBRACE;
    if(ch == '}') return RBRACE;
    if(ch == '(') return LPAREN;
    if(ch == ')') return RPAREN;
    if(ch == ';') return SEMICOLON;
    if(ch == '<') return LT;
    if(ch == '>') return GT;

    /* hard case 1: = or == */
    if(ch == '=') {
        ch = getc(stream);
        if(ch == '=') return EQUAL;
        ungetc(ch, stream);
        return ASSIGN;
    }

    /* hard case 2: != */
    if(ch == '!') {
        ch = getc(stream);
        if(ch == '=') return NOT_EQUAL;
        /* by the grammar, ! must be followed immediately by = */
        /* if it is not, the lexeme is erroneous */
        return ERROR;
    }

    /* hard case 3: identifier or keyword */
    if(isalpha(ch) || ch == '_') {
        i = 0;
        do {
            buffer[i++] = ch;
        } while((ch = getc(stream)) != EOF && (isalpha(ch) || isdigit(ch) || ch == '_'));
        if(ch != EOF) ungetc(ch, stream);
        buffer[i] = '\0';
        if(strcmp("main", buffer) == 0) return MAIN;
        if(strcmp("int", buffer) == 0) return INT;
        if(strcmp("bool", buffer) == 0) return BOOL;
        if(strcmp("void", buffer) == 0) return VOID;
        if(strcmp("for", buffer) == 0) return FOR;
        if(strcmp("if", buffer) == 0) return IF;
        if(strcmp("return", buffer) == 0) return RETURN;
        return IDENTIFIER;
    }

    /* hard case 4: number */
    if(isdigit(ch)) {
        i = 0;
        do {
            buffer[i++] = ch;
        } while((ch = getc(stream)) != EOF && isdigit(ch));
        if(ch != EOF) ungetc(ch, stream);
        buffer[i] = '\0';
        return NUMBER;
    }

    /* any other character is not part of the grammar */
    return ERROR;
}
Consider how the IDENTIFIER and NUMBER tokens are assembled. The logic here is intricate, but we have already discussed every detail. If the stream contains, for example, the characters "_abc123 =", the function will read all of them, including the '='; the characters "_abc123" go into the buffer, and the '=' is pushed back into the stream. The end of file does not need to be pushed back.
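To see the analyzer in action, you can call it in a loop until END or ERROR is returned. A minimal test driver might look like this (it is illustrative only and not part of the original code):

#include <stdio.h>

int main(void)
{
    char buffer[1024];
    int lexem;

    /* read tokens from standard input until end of file or an error */
    while((lexem = get_lexem(stdin, buffer)) != END && lexem != ERROR)
        printf("token code: %d\n", lexem);

    return 0;
}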
Since keywords cannot be distinguished from identifiers by the rules of the grammar, only by a dictionary, we recognize them in the same place as identifiers.
The code above has two problems. The first is that a buffer overflow is possible. Say we make the buffer 1000 characters long, and some mad programmer comes up with an identifier 1002 characters long. The code turned out quite large as it is, so I did not write the handling for this error.
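For completeness, here is a minimal sketch of what the missing check could look like, assuming get_lexem is given the buffer size as an extra parameter buffer_length (an assumption; the code above does not have this parameter):

/* a sketch: the identifier case with a length check, assuming the signature
   int get_lexem(FILE* stream, char* buffer, size_t buffer_length) */
if(isalpha(ch) || ch == '_') {
    i = 0;
    do {
        if(i >= buffer_length - 1)
            return ERROR;  /* the lexeme does not fit into the buffer */
        buffer[i++] = ch;
    } while((ch = getc(stream)) != EOF && (isalpha(ch) || isdigit(ch) || ch == '_'));
    if(ch != EOF) ungetc(ch, stream);
    buffer[i] = '\0';
    /* ... keyword lookup as before ... */
}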
The second problem is speed: for each identifier we run through the list of keywords in the slowest possible way. A real compiler uses a hash table here.
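A full hash table is beyond the scope of this article, but as an intermediate improvement, a sorted keyword table searched with the standard bsearch already replaces the chain of strcmp calls with a single binary search. A sketch (lookup_keyword and struct keyword are names invented for this example):

#include <stdlib.h>
#include <string.h>

struct keyword { const char* name; int code; };

/* the table must stay sorted by name for bsearch to work */
static const struct keyword keywords[] = {
    { "bool",   BOOL   },
    { "for",    FOR    },
    { "if",     IF     },
    { "int",    INT    },
    { "main",   MAIN   },
    { "return", RETURN },
    { "void",   VOID   }
};

static int compare_keyword(const void* key, const void* element)
{
    return strcmp((const char*)key, ((const struct keyword*)element)->name);
}

/* returns the keyword code, or IDENTIFIER if the buffer holds no keyword */
int lookup_keyword(const char* buffer)
{
    const struct keyword* found = bsearch(buffer, keywords,
        sizeof(keywords) / sizeof(keywords[0]), sizeof(keywords[0]),
        compare_keyword);
    return found != NULL ? found->code : IDENTIFIER;
}

The chain of strcmp calls in get_lexem then collapses into a single return lookup_keyword(buffer);.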
Parsing
The simplest manual parsing method is the recursive descent algorithm, which can parse LL(1) grammars. Our grammar is of exactly this kind, which simplifies the task.
The rules are encoded in a very simple way. For example, what is a program?
program ::= type "main" "(" ")" "{" statement "}"
Here is how it is encoded in C:
char buffer[1024];

int program(FILE* stream)
{
    int statement_result;
    int lexem = get_lexem(stream, buffer);
    if(lexem != INT && lexem != BOOL && lexem != VOID)
        return UNRECOGNIZED_MAIN_TYPE;
    lexem = get_lexem(stream, buffer);
    if(lexem != MAIN) return EXPECTED_MAIN;
    lexem = get_lexem(stream, buffer);
    if(lexem != LPAREN) return EXPECTED_LPAREN;
    lexem = get_lexem(stream, buffer);
    if(lexem != RPAREN) return EXPECTED_RPAREN;
    lexem = get_lexem(stream, buffer);
    if(lexem != LBRACE) return EXPECTED_LBRACE;
    statement_result = statement(stream);
    if(statement_result != OK) return statement_result;
    lexem = get_lexem(stream, buffer);
    if(lexem != RBRACE) return EXPECTED_RBRACE;
    return OK;
}
Verbose, but understandable in principle. OK, UNRECOGNIZED_MAIN_TYPE, EXPECTED_MAIN, and the other such constants must be defined separately; they describe the errors that the parser can detect. UNRECOGNIZED_MAIN_TYPE means that the analyzer expected one of the three types but received something else. OK means there are no errors.
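The article does not spell these definitions out; one possible set, covering every code used below (the exact numeric values are arbitrary):

#define OK 0
#define UNRECOGNIZED_MAIN_TYPE 1
#define EXPECTED_MAIN 2
#define EXPECTED_LPAREN 3
#define EXPECTED_RPAREN 4
#define EXPECTED_LBRACE 5
#define EXPECTED_RBRACE 6
#define UNRECOGNIZED_STATEMENT 7
#define EXPECTED_IDENTIFIER 8
#define EXPECTED_ASSIGN 9
#define EXPECTED_IDENTIFIER_OR_NUMBER 10
#define EXPECTED_SEMICOLON 11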
The construct
lexem = get_lexem(stream, buffer);
if(lexem != RBRACE) return EXPECTED_RBRACE;
occurs very often. It can be described as "require a token in the input stream." It is convenient to factor it into separate functions:
#include <stdbool.h>

bool require_lexem1(FILE* stream, int expected1)
{
    int lexem = get_lexem(stream, buffer);
    return lexem == expected1;
}

bool require_lexem2(FILE* stream, int expected1, int expected2)
{
    int lexem = get_lexem(stream, buffer);
    return lexem == expected1 || lexem == expected2;
}

bool require_lexem3(FILE* stream, int expected1, int expected2, int expected3)
{
    int lexem = get_lexem(stream, buffer);
    return lexem == expected1 || lexem == expected2 || lexem == expected3;
}
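As an aside, C's stdarg facility would let a single variadic function replace all three. A sketch (require_lexem here is a name invented for this example and is not used elsewhere in the article):

#include <stdarg.h>
#include <stdbool.h>

/* reads one lexeme and checks it against `count` expected codes */
bool require_lexem(FILE* stream, int count, ...)
{
    va_list args;
    int i;
    int lexem = get_lexem(stream, buffer);
    va_start(args, count);
    for(i = 0; i < count; i++) {
        if(lexem == va_arg(args, int)) {
            va_end(args);
            return true;
        }
    }
    va_end(args);
    return false;
}

A call like require_lexem(stream, 3, INT, BOOL, VOID) would then replace require_lexem3. We keep the fixed-arity versions below, since they match the original code.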
Now the program function can be made simpler and shorter:
int program(FILE* stream)
{
    int statement_result;
    if(!require_lexem3(stream, INT, BOOL, VOID)) return UNRECOGNIZED_MAIN_TYPE;
    if(!require_lexem1(stream, MAIN)) return EXPECTED_MAIN;
    if(!require_lexem1(stream, LPAREN)) return EXPECTED_LPAREN;
    if(!require_lexem1(stream, RPAREN)) return EXPECTED_RPAREN;
    if(!require_lexem1(stream, LBRACE)) return EXPECTED_LBRACE;
    statement_result = statement(stream);
    if(statement_result != OK) return statement_result;
    if(!require_lexem1(stream, RBRACE)) return EXPECTED_RBRACE;
    return OK;
}
The statement function is more complicated and more interesting. Based on the next lexeme, it decides which alternative of the rule to apply.
int statement(FILE* stream)
{
    int statement_result;
    int lexem = get_lexem(stream, buffer);
    if(lexem == INT || lexem == BOOL || lexem == VOID)
        return declaration(stream);
    if(lexem == LBRACE) {
        statement_result = statement(stream);
        if(statement_result != OK) return statement_result;
        if(!require_lexem1(stream, RBRACE)) return EXPECTED_RBRACE;
        return OK;
    }
    if(lexem == FOR) return for_rule(stream);
    if(lexem == IF) return if_rule(stream);
    if(lexem == RETURN) return return_rule(stream);
    return UNRECOGNIZED_STATEMENT;
}
This function is recursive, so it can recognize nested constructions such as { { { int i = 300; } } }. As another example, let us implement the declaration function:
int declaration(FILE* stream)
{
    if(!require_lexem1(stream, IDENTIFIER)) return EXPECTED_IDENTIFIER;
    if(!require_lexem1(stream, ASSIGN)) return EXPECTED_ASSIGN;
    if(!require_lexem2(stream, IDENTIFIER, NUMBER)) return EXPECTED_IDENTIFIER_OR_NUMBER;
    if(!require_lexem1(stream, SEMICOLON)) return EXPECTED_SEMICOLON;
    return OK;
}
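The remaining rule functions follow the same pattern. Purely as an illustration — the article does not give the grammar for these rules — return_rule might look like this, assuming a rule of the form return ::= "return" (identifier | number) ";" and that the "return" keyword was already consumed by statement:

/* a hypothetical sketch; the actual grammar rule may differ */
int return_rule(FILE* stream)
{
    if(!require_lexem2(stream, IDENTIFIER, NUMBER)) return EXPECTED_IDENTIFIER_OR_NUMBER;
    if(!require_lexem1(stream, SEMICOLON)) return EXPECTED_SEMICOLON;
    return OK;
}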
We try to adhere to the principle of writing a separate function for each rule in the grammar; it makes the correspondence between the grammar and the code easier to maintain. In the code above I departed from this principle to keep the code a bit shorter; in particular, the assign and assign_end rules ended up inside declaration.
As a result, the algorithm descends from the topmost, most general rule to the lowest ones (from program through statement to declaration), and some functions call each other recursively.
That is why the algorithm is called recursive descent .
Having implemented all the rules, we get a fully functional parser that can report errors to us.
Let us write the penultimate function, which turns error codes into text messages:
const char* get_message(int code)
{
    switch(code) {
    case OK: return "Ok";
    case UNRECOGNIZED_MAIN_TYPE: return "Unrecognized main type";
    /* ... */
    }
    return "Unrecognized code";
}
The main function of the program is very simple:
int main(void)
{
    int code = program(stdin);
    const char* message = get_message(code);
    printf("%s\n", message);
    return 0;
}
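For example, if we compile everything into an executable (call it parser; the name is arbitrary) and feed it the input int main() { int i = 300; }, the program prints Ok; breaking the input, say by removing the number after the =, produces the corresponding error message instead.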
Conclusion
We implemented a lexical analyzer (get_lexem), a set of helper functions that check for the presence of tokens (require_lexem1, require_lexem2, require_lexem3), and a set of functions, one per grammar rule (program, statement, declaration, and many others that you will have to write yourself). The get_message function lets us display clear text instead of cryptic codes. We ended up with two sets of constants: token codes (LPAREN, IDENTIFIER, etc.) and error codes (OK, EXPECTED_ASSIGN, etc.).
The resulting program can check the code of another program and print Ok if it conforms to the grammar. Otherwise, it prints an error message.