Here is a simple grammar of a language that can be fully implemented without using lex and yacc. For lexical analysis we need a lexical analyzer, which determines from a sequence of characters which lexeme has been encountered. The parser then checks the syntax based on the resulting stream of tokens.
Lexical analysis
In this particular case it suffices to look at most one character ahead in the input stream to understand which lexeme we are dealing with. This is called analysis with a lookahead of 1 character.
Usually the lexical analyzer skips all delimiters (spaces, tabs, and newline characters) that it encounters in the input stream and does not return them as tokens.
To begin with, let us determine which lexemes appear in the grammar. These are the keywords:
main, int, bool, void, for, if, return
the separators and operators:
{} () ; = < > == !=
as well as identifiers and numbers. To simplify the code, we will also add a token corresponding to the end of the file. In addition, if we cannot recognize a token, we must return a special code. In total:
#define MAIN 1
#define INT 2
#define BOOL 3
#define VOID 4
#define FOR 5
#define IF 6
#define RETURN 7

#define LBRACE 101     /* { */
#define RBRACE 102     /* } */
#define LPAREN 103     /* ( */
#define RPAREN 104     /* ) */
#define SEMICOLON 105  /* ; */
#define ASSIGN 106     /* = */
#define LT 107         /* < */
#define GT 108         /* > */
#define EQUAL 109      /* == */
#define NOT_EQUAL 110  /* != */

#define IDENTIFIER 201
#define NUMBER 202

#define END 301        /* end of file */
#define ERROR 302      /* error */
What is a lexical analyzer? In the simplest case, it is a single function that receives a stream of characters as input, reads one lexeme from it, and returns it. For the IDENTIFIER and NUMBER tokens we also need the characters that were read, so we pass in the address of a buffer into which characters from the input stream are saved.
To read several lexemes, you call the function several times. The function signature, then, should be:
int get_lexem(FILE* stream, char* buffer);
How do we implement it? Consider the input stream "a=534"; for simplicity, let it contain no spaces. After reading the character '=' from the stream we cannot stop, because the stream might contain the token "==". So we read the next character; it turns out to be '5', and now we know for certain that we have read the ASSIGN token and not EQUAL. However, the character '5' has already been consumed from the stream, and on the next call we would read '3' and '4' from there. We need to save the last character read between calls to get_lexem.
This can be done in different ways; we will push the character back into the stream by calling the standard C function ungetc. This situation does not arise on every call, only when the token contains more than one character.
#include <stdio.h>
#include <ctype.h>
#include <string.h>

int get_lexem(FILE* stream, char* buffer)
{
    int ch = getc(stream);
    size_t i;

    /* skip all spaces, tabs, and newline characters */
    while(isspace(ch))
        ch = getc(stream);

    /* end of file */
    if(ch == EOF) return END;

    /* simple single-character lexemes */
    if(ch == '{') return LBRACE;
    if(ch == '}') return RBRACE;
    if(ch == '(') return LPAREN;
    if(ch == ')') return RPAREN;
    if(ch == ';') return SEMICOLON;
    if(ch == '<') return LT;
    if(ch == '>') return GT;

    /* hard case 1: = or == */
    if(ch == '=') {
        ch = getc(stream);
        if(ch == '=') return EQUAL;
        ungetc(ch, stream);
        return ASSIGN;
    }

    /* hard case 2: != */
    if(ch == '!') {
        ch = getc(stream);
        if(ch == '=') return NOT_EQUAL;
        /* by the grammar, ! must be followed immediately by = */
        /* if it is not, the lexeme is erroneous */
        return ERROR;
    }

    /* hard case 3: identifier or keyword */
    if(isalpha(ch) || ch == '_') {
        i = 0;
        do {
            buffer[i++] = ch;
        } while((ch = getc(stream)) != EOF && (isalpha(ch) || isdigit(ch) || ch == '_'));
        if(ch != EOF) ungetc(ch, stream);
        buffer[i] = '\0';
        if(strcmp("main", buffer) == 0) return MAIN;
        if(strcmp("int", buffer) == 0) return INT;
        if(strcmp("bool", buffer) == 0) return BOOL;
        if(strcmp("void", buffer) == 0) return VOID;
        if(strcmp("for", buffer) == 0) return FOR;
        if(strcmp("if", buffer) == 0) return IF;
        if(strcmp("return", buffer) == 0) return RETURN;
        return IDENTIFIER;
    }

    /* hard case 4: number */
    if(isdigit(ch)) {
        i = 0;
        do {
            buffer[i++] = ch;
        } while((ch = getc(stream)) != EOF && isdigit(ch));
        if(ch != EOF) ungetc(ch, stream);
        buffer[i] = '\0';
        return NUMBER;
    }

    /* any other character is not part of the grammar */
    return ERROR;
}
Consider how the IDENTIFIER and NUMBER tokens are assembled. The logic here is intricate, but we have already discussed every detail. If the stream contains, for example, the characters "_abc123 =", the function will read all of them, including the '='; the characters "_abc123" go into the buffer, and the '=' is pushed back into the stream. The end of file does not need to be pushed back.
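To see the analyzer in action, you can call it in a loop until END or ERROR is returned. A minimal test driver might look like this (it is illustrative only and not part of the original code):

#include <stdio.h>

int main(void)
{
    char buffer[1024];
    int lexem;

    /* read tokens from standard input until end of file or an error */
    while((lexem = get_lexem(stdin, buffer)) != END && lexem != ERROR)
        printf("token code: %d\n", lexem);

    return 0;
}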
Since keywords cannot be distinguished from identifiers by the rules of the grammar, only by a dictionary, we recognize them in the same place as identifiers.
The code above has two problems. The first is that a buffer overflow is possible. Say we make the buffer 1000 characters long, and some mad programmer comes up with an identifier 1002 characters long. The code turned out quite large as it is, so I did not write the handling for this error.
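For completeness, here is a minimal sketch of what the missing check could look like, assuming get_lexem is given the buffer size as an extra parameter buffer_length (an assumption; the code above does not have this parameter):

/* a sketch: the identifier case with a length check, assuming the signature
   int get_lexem(FILE* stream, char* buffer, size_t buffer_length) */
if(isalpha(ch) || ch == '_') {
    i = 0;
    do {
        if(i >= buffer_length - 1)
            return ERROR;  /* the lexeme does not fit into the buffer */
        buffer[i++] = ch;
    } while((ch = getc(stream)) != EOF && (isalpha(ch) || isdigit(ch) || ch == '_'));
    if(ch != EOF) ungetc(ch, stream);
    buffer[i] = '\0';
    /* ... keyword lookup as before ... */
}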
The second problem is speed: for each identifier we run through the list of keywords in the slowest possible way. A real compiler uses a hash table here.
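A full hash table is beyond the scope of this article, but as an intermediate improvement, a sorted keyword table searched with the standard bsearch already replaces the chain of strcmp calls with a single binary search. A sketch (lookup_keyword and struct keyword are names invented for this example):

#include <stdlib.h>
#include <string.h>

struct keyword { const char* name; int code; };

/* the table must stay sorted by name for bsearch to work */
static const struct keyword keywords[] = {
    { "bool",   BOOL   },
    { "for",    FOR    },
    { "if",     IF     },
    { "int",    INT    },
    { "main",   MAIN   },
    { "return", RETURN },
    { "void",   VOID   }
};

static int compare_keyword(const void* key, const void* element)
{
    return strcmp((const char*)key, ((const struct keyword*)element)->name);
}

/* returns the keyword code, or IDENTIFIER if the buffer holds no keyword */
int lookup_keyword(const char* buffer)
{
    const struct keyword* found = bsearch(buffer, keywords,
        sizeof(keywords) / sizeof(keywords[0]), sizeof(keywords[0]),
        compare_keyword);
    return found != NULL ? found->code : IDENTIFIER;
}

The chain of strcmp calls in get_lexem then collapses into a single return lookup_keyword(buffer);.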
Parsing
The simplest manual parsing method is the recursive descent algorithm, which can parse LL(1) grammars. Our grammar is of exactly this kind, which simplifies the task.
The rules are encoded in a very simple way. For example, what is a program?
program ::= type "main" "(" ")" "{" statement "}"
Here is how it is encoded in C:
char buffer[1024];

int program(FILE* stream)
{
    int statement_result;
    int lexem = get_lexem(stream, buffer);
    if(lexem != INT && lexem != BOOL && lexem != VOID)
        return UNRECOGNIZED_MAIN_TYPE;
    lexem = get_lexem(stream, buffer);
    if(lexem != MAIN) return EXPECTED_MAIN;
    lexem = get_lexem(stream, buffer);
    if(lexem != LPAREN) return EXPECTED_LPAREN;
    lexem = get_lexem(stream, buffer);
    if(lexem != RPAREN) return EXPECTED_RPAREN;
    lexem = get_lexem(stream, buffer);
    if(lexem != LBRACE) return EXPECTED_LBRACE;
    statement_result = statement(stream);
    if(statement_result != OK) return statement_result;
    lexem = get_lexem(stream, buffer);
    if(lexem != RBRACE) return EXPECTED_RBRACE;
    return OK;
}
Verbose, but understandable in principle. OK, UNRECOGNIZED_MAIN_TYPE, EXPECTED_MAIN, and the other such constants must be defined separately; they describe the errors that the parser can detect. UNRECOGNIZED_MAIN_TYPE means that the analyzer expected one of the three types but received something else. OK means there are no errors.
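The article does not spell these definitions out; one possible set, covering every code used below (the exact numeric values are arbitrary):

#define OK 0
#define UNRECOGNIZED_MAIN_TYPE 1
#define EXPECTED_MAIN 2
#define EXPECTED_LPAREN 3
#define EXPECTED_RPAREN 4
#define EXPECTED_LBRACE 5
#define EXPECTED_RBRACE 6
#define UNRECOGNIZED_STATEMENT 7
#define EXPECTED_IDENTIFIER 8
#define EXPECTED_ASSIGN 9
#define EXPECTED_IDENTIFIER_OR_NUMBER 10
#define EXPECTED_SEMICOLON 11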
The construct
lexem = get_lexem(stream, buffer);
if(lexem != RBRACE) return EXPECTED_RBRACE;
occurs very often. It can be described as "require a token in the input stream." It is convenient to factor it into separate functions:
#include <stdbool.h>

bool require_lexem1(FILE* stream, int expected1)
{
    int lexem = get_lexem(stream, buffer);
    return lexem == expected1;
}

bool require_lexem2(FILE* stream, int expected1, int expected2)
{
    int lexem = get_lexem(stream, buffer);
    return lexem == expected1 || lexem == expected2;
}

bool require_lexem3(FILE* stream, int expected1, int expected2, int expected3)
{
    int lexem = get_lexem(stream, buffer);
    return lexem == expected1 || lexem == expected2 || lexem == expected3;
}
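As an aside, C's stdarg facility would let a single variadic function replace all three. A sketch (require_lexem here is a name invented for this example and is not used elsewhere in the article):

#include <stdarg.h>
#include <stdbool.h>

/* reads one lexeme and checks it against `count` expected codes */
bool require_lexem(FILE* stream, int count, ...)
{
    va_list args;
    int i;
    int lexem = get_lexem(stream, buffer);
    va_start(args, count);
    for(i = 0; i < count; i++) {
        if(lexem == va_arg(args, int)) {
            va_end(args);
            return true;
        }
    }
    va_end(args);
    return false;
}

A call like require_lexem(stream, 3, INT, BOOL, VOID) would then replace require_lexem3. We keep the fixed-arity versions below, since they match the original code.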
Now the program function can be made simpler and shorter:
int program(FILE* stream)
{
    int statement_result;
    if(!require_lexem3(stream, INT, BOOL, VOID)) return UNRECOGNIZED_MAIN_TYPE;
    if(!require_lexem1(stream, MAIN)) return EXPECTED_MAIN;
    if(!require_lexem1(stream, LPAREN)) return EXPECTED_LPAREN;
    if(!require_lexem1(stream, RPAREN)) return EXPECTED_RPAREN;
    if(!require_lexem1(stream, LBRACE)) return EXPECTED_LBRACE;
    statement_result = statement(stream);
    if(statement_result != OK) return statement_result;
    if(!require_lexem1(stream, RBRACE)) return EXPECTED_RBRACE;
    return OK;
}
The statement function is more complicated and more interesting. Based on the next lexeme, it decides which alternative of the rule to apply.
int statement(FILE* stream)
{
    int statement_result;
    int lexem = get_lexem(stream, buffer);
    if(lexem == INT || lexem == BOOL || lexem == VOID)
        return declaration(stream);
    if(lexem == LBRACE) {
        statement_result = statement(stream);
        if(statement_result != OK) return statement_result;
        if(!require_lexem1(stream, RBRACE)) return EXPECTED_RBRACE;
        return OK;
    }
    if(lexem == FOR) return for_rule(stream);
    if(lexem == IF) return if_rule(stream);
    if(lexem == RETURN) return return_rule(stream);
    return UNRECOGNIZED_STATEMENT;
}
This function is recursive, so it can recognize nested constructions such as { { { int i = 300; } } }. As another example, let us implement the declaration function:
int declaration(FILE* stream)
{
    if(!require_lexem1(stream, IDENTIFIER)) return EXPECTED_IDENTIFIER;
    if(!require_lexem1(stream, ASSIGN)) return EXPECTED_ASSIGN;
    if(!require_lexem2(stream, IDENTIFIER, NUMBER)) return EXPECTED_IDENTIFIER_OR_NUMBER;
    if(!require_lexem1(stream, SEMICOLON)) return EXPECTED_SEMICOLON;
    return OK;
}
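The remaining rule functions follow the same pattern. Purely as an illustration — the article does not give the grammar for these rules — return_rule might look like this, assuming a rule of the form return ::= "return" (identifier | number) ";" and that the "return" keyword was already consumed by statement:

/* a hypothetical sketch; the actual grammar rule may differ */
int return_rule(FILE* stream)
{
    if(!require_lexem2(stream, IDENTIFIER, NUMBER)) return EXPECTED_IDENTIFIER_OR_NUMBER;
    if(!require_lexem1(stream, SEMICOLON)) return EXPECTED_SEMICOLON;
    return OK;
}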
We try to adhere to the principle of writing a separate function for each rule in the grammar; it makes the correspondence between the grammar and the code easier to maintain. In the code above I departed from this principle to keep the code a bit shorter; in particular, the assign and assign_end rules ended up inside declaration.
As a result, the algorithm descends from the topmost, most general rule to the lowest ones (from program through statement to declaration), and some functions call each other recursively.
That is why the algorithm is called recursive descent .
Having implemented all the rules, we get a fully functional parser that can report errors to us.
Let us write the penultimate function, which turns error codes into text messages:
const char* get_message(int code)
{
    switch(code) {
    case OK: return "Ok";
    case UNRECOGNIZED_MAIN_TYPE: return "Unrecognized main type";
    /* ... */
    }
    return "Unrecognized code";
}
The main function of the program is very simple:
int main(void)
{
    int code = program(stdin);
    const char* message = get_message(code);
    printf("%s\n", message);
    return 0;
}
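For example, if we compile everything into an executable (call it parser; the name is arbitrary) and feed it the input int main() { int i = 300; }, the program prints Ok; breaking the input, say by removing the number after the =, produces the corresponding error message instead.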
Conclusion
We implemented a lexical analyzer (get_lexem), a set of helper functions that check for the presence of tokens (require_lexem1, require_lexem2, require_lexem3), and a set of functions, one per grammar rule (program, statement, declaration, and many others that you will have to write yourself). The get_message function lets us display clear text instead of cryptic codes. We ended up with two sets of constants: token codes (LPAREN, IDENTIFIER, etc.) and error codes (OK, EXPECTED_ASSIGN, etc.).
The resulting program can check the code of another program and print Ok if it conforms to the grammar. Otherwise, it prints an error message.