Suppose we write this in an ini file:

```
repeat (8) { click(10,10) pause(100s) click(10,10) repeat(2) { pause(10) } }
```

something like the script of a typical auto-clicker, such as UOpilot.

How do I process and analyze loops, including nested loops?
Okay, so you need an interpreter. Let's roll up our sleeves and write one.
First, you need to define a formal grammar for your language. Based on the program you gave as an example, we can assume the following grammar:
```
program            ::= statement* EOF
compound-statement ::= '{' statement* '}'
statement          ::= repeat-statement | function-call | compound-statement
repeat-statement   ::= "repeat" "(" number-constant ")" compound-statement
function-call      ::= ident "(" arglist ")"
arglist            ::= EMPTY | arg ["," arg]*
arg                ::= number-constant | duration-constant
```

We can see that the grammar is simple, which means it is easy to parse either by hand or with the standard lex/yacc combination (flex/bison). Parsing via lex/yacc is much easier and involves far less routine work, but let's write the parser by hand: it will be more instructive.
So, since our grammar is LL(1) (it has no left recursion, and peeking one token ahead is enough to determine which rule comes next), we will write the simplest recursive descent parser.
I will try to write a parser that is conceptually simple, so that it is easy to understand, but at the same time general enough to be easily extended.
The first part of any self-respecting parser is the tokenizer. We want to split the input string into meaningful tokens, so that we don't have to fiddle with individual characters.
For starters, what kinds of tokens do we have? These: the keyword repeat, identifiers such as click and pause, number constants such as 10, duration constants such as 100s, and the punctuation characters ( ) { } ,. This gives the following enum:
```cpp
enum token_type
{
    // keywords
    tt_repeat,

    tt_ident,
    tt_number,
    tt_duration,

    // punctuation
    tt_lparen,
    tt_rparen,
    tt_lbrace,
    tt_rbrace,
    tt_comma,

    // end of file
    tt_eof,

    // parse error
    tt_error
};
```

and a data type for the token:
```cpp
struct token
{
    token_type type;
    string string_value;
    long num_value;
    int lineno, colno; // position in the text, for error reporting
};
```

The tokenizer itself will provide the ability to "peek" one token ahead, and will remember the current position of the token in the text for easier debugging:
```cpp
class tokenizer
{
    const string text;      // the whole text
    int curridx;            // current position in the text
    int endidx;             // length of the text
    int currline, currcol;  // current line / column within the line
                            // (needed only for debugging)

    token lookahead;        // the next token

    void set_lookahead();   // advance to the next token
                            // and remember it in lookahead

public:
    tokenizer(string text)
        : text(text), curridx(0), endidx(text.length()),
          currline(1), currcol(1)
    {
        lookahead.num_value = 0;
        lookahead.string_value = "";
        set_lookahead();
    }

    token peek_next() { return lookahead; }
    void move_ahead() { set_lookahead(); }
};
```

The core logic of the tokenizer is, of course, in the set_lookahead function. Such a function is usually built on regular expressions, just like lex does, but we will write it by hand.
```cpp
void tokenizer::set_lookahead()
{
    // start by skipping insignificant whitespace
    while (curridx < endidx && isspace(text[curridx]))
    {
        // don't forget to track our position in the file
        if (text[curridx] == '\n')
        {
            currcol = 1;
            currline++;
        }
        else
        {
            currcol++;
        }
        // move on to the next character
        curridx++;
    }

    // at this point we know exactly where our next token starts
    lookahead.lineno = currline;
    lookahead.colno = currcol;

    if (curridx == endidx) // end of file
    {
        lookahead.type = tt_eof;
        return;
    }

    char c = text[curridx];

    // punctuation is easy
    if (c == '(' || c == ')' || c == '{' || c == '}' || c == ',')
    {
        lookahead.type = (c == '(') ? tt_lparen :
                         (c == ')') ? tt_rparen :
                         (c == '{') ? tt_lbrace :
                         (c == '}') ? tt_rbrace : tt_comma;
        curridx++;
        currcol++;
        return;
    }

    // if the token starts with a letter, it's a keyword or an identifier
    if (isalpha(c))
    {
        // first, extract it into a variable
        string result;
        while (curridx < endidx && isalpha(text[curridx]))
        {
            result += text[curridx];
            curridx++;
            currcol++;
        }

        // check the keywords
        if (result == "repeat")
        {
            lookahead.type = tt_repeat;
            return;
        }
        // we have only one, nothing else to check

        // not a keyword, so it's an identifier
        lookahead.type = tt_ident;
        lookahead.string_value = result;
        return;
    }

    // constants
    if (isdigit(c)) // numeric
    {
        // extract it from the text stream
        string result; // in our grammar it may contain letters too
        while (curridx < endidx && isalnum(text[curridx]))
        {
            result += text[curridx];
            curridx++;
            currcol++;
        }

        auto c_str = result.c_str();
        char* last_char;
        // is there a simpler way to check whether a string
        // is a number with no "tail"?
        long converted = strtol(c_str, &last_char, 10);

        // consumed all the digits? then it's a number
        if (*last_char == 0)
        {
            lookahead.type = tt_number;
            lookahead.num_value = converted;
            return;
        }

        // a trailing s? then it's a duration
        if (*last_char == 's' && *(last_char + 1) == 0)
        {
            lookahead.type = tt_duration;
            lookahead.num_value = converted;
            return;
        }
    }

    // found nothing? okay, remember that it's an error
    lookahead.type = tt_error;
}
```

We continue. Since our language is simple, we will not distinguish between the abstract syntax tree and the parse tree, and instead of honestly implementing a visitor over the syntax tree, we will simply put a virtual execute function into the nodes. (Yes, I'm lazy.)
Let's define the classes for the syntax tree. They correspond directly to the grammar. (Since these are definitions, strictly speaking everything should be qualified with the std namespace, but I won't clutter the code.)
```cpp
struct statement
{
    virtual void execute() = 0;
    virtual ~statement() { }
};

struct function_call : public statement
{
    shared_ptr<installed_function> p_function;
    unique_ptr<arglist> p_args;

    virtual void execute()
    {
        p_function->function(p_args->args);
    }
};

struct compound_statement : public statement
{
    vector<unique_ptr<statement>> statements;

    virtual void execute()
    {
        for (auto& p_statement : statements)
            p_statement->execute();
    }
};

struct repeat_statement : public statement
{
    unique_ptr<statement> p_statement;
    long num_repeat;

    virtual void execute()
    {
        for (long i = 0; i < num_repeat; i++)
            p_statement->execute();
    }
};

struct program : public compound_statement
{
};
```

Separately, we need data structures for function calls. Let the arguments be typed; the days of the raw union have passed.
```cpp
struct arg
{
    virtual ~arg() { }
};

struct number_constant : arg
{
    int value;
};

struct duration_constant : arg
{
    std::chrono::seconds value;
};

struct arglist
{
    vector<unique_ptr<arg>> args;
};
```

Of course, we could make life easier for ourselves and simply declare the function names as keywords, but then adding new functions would no longer be easy.
```cpp
struct installed_function
{
    std::function<void(const std::vector<unique_ptr<arg>>&)> function;
    int argnum;
};
```

Now on to the parser itself. It is quite straightforward. The only trick is that the try_-style functions try to find a syntactic construct at the current point of the stream, identifying it by its first token (remember, our grammar is LL(1)), and return nullptr if the first token doesn't fit:
```cpp
class parser
{
    arg* try_parse_arg();
    arglist* try_parse_arglist_until_rparen();
    function_call* try_parse_function_call();
    repeat_statement* try_parse_repeat_statement();
    statement* try_parse_statement();
    compound_statement* try_parse_compound_statement();

public:
    program* parse();

    tokenizer tokenizer;
    installed_functions& functions;

    parser(string input, installed_functions& functions)
        : tokenizer(input), functions(functions)
    {
    }
};
```

The implementation of these functions corresponds directly to the grammar. We report errors with exceptions, so we have to use smart pointers to avoid memory leaks.
```cpp
// program ::= statement* EOF
program* parser::parse()
{
    unique_ptr<program> p(new program());

    while (true)
    {
        // try to find a statement
        statement* s = try_parse_statement();
        if (!s) // didn't find one? we're done!
            break;
        p->statements.emplace_back(s); // add it to the list
    }

    // check that there is nothing else in the file
    token t = tokenizer.peek_next();
    if (t.type != tt_eof)
        throw parse_exception("extra characters after program end", t);

    return p.release();
}

// compound-statement ::= '{' statement* '}'
compound_statement* parser::try_parse_compound_statement()
{
    // look for the left curly brace
    token t = tokenizer.peek_next();
    if (t.type != tt_lbrace)
        return nullptr;
    tokenizer.move_ahead();

    unique_ptr<compound_statement> p(new compound_statement());

    // collect the nested statements in a loop, just like for program
    while (true)
    {
        statement* s = try_parse_statement();
        if (!s)
            break;
        p->statements.emplace_back(s);
    }

    // the closing curly brace must be here
    t = tokenizer.peek_next();
    if (t.type != tt_rbrace)
        throw parse_exception("expected closing brace after compound statement", t);
    tokenizer.move_ahead();

    return p.release();
}

// statement ::= repeat-statement | function-call | compound-statement
statement* parser::try_parse_statement()
{
    statement* result = nullptr;

    // try to find a repeat statement
    result = try_parse_repeat_statement();
    if (result) // found one? great
        return result;

    // no? try to find a function call
    result = try_parse_function_call();
    if (result) // found a function call? excellent, mission accomplished
        return result;

    // still no? maybe there's a compound statement here?
    result = try_parse_compound_statement();
    // found or not, doesn't matter; worst case we return nullptr
    return result;
}

// repeat-statement ::= "repeat" "(" number-constant ")" compound-statement
repeat_statement* parser::try_parse_repeat_statement()
{
    // check whether there is a repeat at the start
    token t = tokenizer.peek_next();
    if (t.type != tt_repeat)
        return nullptr;
    tokenizer.move_ahead();

    // now a parenthesis is mandatory
    t = tokenizer.peek_next();
    if (t.type != tt_lparen)
        throw parse_exception("opening parenthesis expected", t);
    tokenizer.move_ahead();

    // now the repetition count
    t = tokenizer.peek_next();
    if (t.type != tt_number)
        throw parse_exception("number expected", t);
    tokenizer.move_ahead();
    // remember it in a variable
    long num = t.num_value;

    // the closing parenthesis must be here
    t = tokenizer.peek_next();
    if (t.type != tt_rparen)
        throw parse_exception("closing parenthesis expected", t);
    tokenizer.move_ahead();

    // and after all that, a compound statement
    // well, that we know how to do
    unique_ptr<compound_statement> s(try_parse_compound_statement());
    if (!s)
        throw parse_exception("compound statement expected after repeat",
                              tokenizer.peek_next());

    // found everything? construct the answer
    repeat_statement* rs = new repeat_statement();
    rs->num_repeat = num;
    rs->p_statement = move(s);
    return rs;
}

// function-call ::= ident "(" arglist ")"
function_call* parser::try_parse_function_call()
{
    // look for an identifier
    token ft = tokenizer.peek_next();
    if (ft.type != tt_ident)
        return nullptr;
    tokenizer.move_ahead();
    // and remember the function name
    string name = ft.string_value;

    // check the general syntax first; whether such a function
    // actually exists, we'll find out afterwards
    token t = tokenizer.peek_next();
    if (t.type != tt_lparen)
        throw parse_exception("left parenthesis expected for function call", t);
    tokenizer.move_ahead();

    unique_ptr<arglist> args(try_parse_arglist_until_rparen());
    if (!args)
        throw parse_exception("argument list not found", tokenizer.peek_next());

    t = tokenizer.peek_next();
    if (t.type != tt_rparen)
        throw parse_exception("right parenthesis expected after function arg list", t);
    tokenizer.move_ahead();

    // normally, checking semantics (is this identifier a function?)
    // is the job of semantic analysis, i.e. of later compilation stages,
    // but since we're writing a toy parser, we'll do it right here
    auto pfunc = functions.get(name);
    if (!pfunc)
        throw parse_exception("unknown function", ft);

    // while we're at it, check the argument count too
    if (pfunc->argnum != args->args.size())
        throw parse_exception("argument count for function doesn't match", ft);
    // in principle, nothing but laziness keeps us from checking the
    // argument *types* here as well; as it is, we'll have to check them
    // at run time and throw runtime_exception

    function_call* fc = new function_call();
    fc->p_function = pfunc;
    fc->p_args = move(args);
    return fc;
}

// arg ::= number-constant | duration-constant
arg* parser::try_parse_arg()
{
    arg* result = nullptr;

    token t = tokenizer.peek_next();
    if (t.type == tt_number)
    {
        auto r = new number_constant();
        r->value = t.num_value;
        result = r;
    }
    else if (t.type == tt_duration)
    {
        auto r = new duration_constant();
        r->value = chrono::seconds(t.num_value);
        result = r;
    }

    if (result != nullptr)
        tokenizer.move_ahead();
    return result;
}

// arglist ::= EMPTY | arg ["," arg]*
arglist* parser::try_parse_arglist_until_rparen()
{
    unique_ptr<arglist> result(new arglist());

    token t = tokenizer.peek_next();
    if (t.type == tt_rparen)
        return result.release();

    while (true)
    {
        arg* p = try_parse_arg();
        if (!p)
            throw parse_exception("expected argument", tokenizer.peek_next());
        result->args.emplace_back(p);

        t = tokenizer.peek_next();
        if (t.type == tt_rparen)
            return result.release();
        if (t.type != tt_comma)
            throw parse_exception("comma expected between arguments", t);
        tokenizer.move_ahead();
    }
}
```

Great, we've come most of the way. All that remains is to implement the mechanism for registering functions.
```cpp
class installed_functions
{
    unordered_map<string, shared_ptr<installed_function>> content;

public:
    void install_function(
        function<void(const vector<unique_ptr<arg>>&)> f,
        int argnum,
        string name)
    {
        installed_function* pf = new installed_function;
        pf->function = f;
        pf->argnum = argnum;
        content.insert(make_pair(name, shared_ptr<installed_function>(pf)));
    }

    shared_ptr<installed_function> get(string name)
    {
        auto pfunc = content.find(name);
        if (pfunc == content.end())
            return nullptr;
        return pfunc->second;
    }
};
```

Now define the functions themselves:
```cpp
void f_pause(const vector<unique_ptr<arg>>& args)
{
    auto parg = dynamic_cast<duration_constant*>(args[0].get());
    if (!parg)
        throw runtime_exception("argument type mismatch in function pause");
    auto duration = parg->value;
    cout << "pause: " << duration.count() << " seconds" << endl;
}

void f_click(const vector<unique_ptr<arg>>& args)
{
    auto parg1 = dynamic_cast<number_constant*>(args[0].get());
    auto parg2 = dynamic_cast<number_constant*>(args[1].get());
    if (!parg1 || !parg2)
        throw runtime_exception("argument type mismatch in function click");
    auto x = parg1->value;
    auto y = parg2->value;
    cout << "click: (" << x << ", " << y << ")" << endl;
}
```

And now we can test:
```cpp
string text =
    "repeat (8)\n"
    "{\n"
    "click(10,10)\n"
    "pause(100s)\n"
    "click(10,10)\n"
    "repeat(2)\n"
    "    {\n"
    "    pause(10)\n"
    "    }\n"
    "}";

int main(int argc, char* argv[])
{
    installed_functions functions;
    functions.install_function(f_pause, 1, "pause");
    functions.install_function(f_click, 2, "click");

    parser p(text, functions);
    unique_ptr<program> tree;

    try
    {
        tree.reset(p.parse());
        cout << "executing:" << endl;
        tree->execute();
    }
    catch (const parse_exception& ex)
    {
        cerr << "parse exception at line " << ex.row
             << ", char " << ex.col << ": " << ex.text << endl;
    }
    catch (const runtime_exception& ex)
    {
        cerr << "runtime exception: " << ex.text << endl;
    }

    return 0;
}
```

We get the output:
```
executing:
click: (10, 10)
pause: 100 seconds
click: (10, 10)
runtime exception: argument type mismatch in function pause
```
Oh yes, we gave the wrong argument type for pause! Change pause(10) to pause(10s), and everything works like clockwork.
Update: a slightly more advanced version of the same parser is available on GitHub.
If you don't want to hand this off to some ready-made interpreter, then something like this:
1. break into separate tokens:
['repeat', '(', '8', ')', '{', 'click', '(', '10', ',', '10', ')', 'pause', '(', '100s', ')', 'click', '(', '10', ',', '10', ')', 'repeat', '(', '2', ')', '{', 'pause', '(', '10', ')', '}', '}']
2. build a tree from them (for example, something like this):

```
repeat 8
├── click(10,10)
├── pause(100s)
├── click(10,10)
└── repeat 2
    └── pause(10)
```
And then we start executing from the top node. Although, to do it properly, everything is somewhat more involved.
YACC to the rescue. Such a simple grammar can be written very quickly, even with zero prior experience.
Source: https://ru.stackoverflow.com/questions/451940/