Regular expression for C # lexical analyzer

Question

The goal is to implement a lexer for the C / 8 # if / else construct, decided to implement it by successively passing the input line through regexps, which I started to deal with recently. When writing regexps for constants / variables, I use non-capture groups before and after a capture group to adequately recognize lexemes, but the problem is that if the matches of non-capture groups match, the value on the right is not recognized. If you remove a group without capture in front of the main group, all values are recognized. For example, without the first test group:

With a test group:

This is only observed at the junction of matches:

That is, as can be seen from the example, in the else block, the intermediate value n is not highlighted, because it is located at the “junction”.

The expression itself (with all groups):

/(?:[-+*\/=<>{)(;]{1})([A-Za-z]{1}[A-Za-z0-9]*)(?:[-;+*\/=<>)]{1})/

Line:

 if(dsf==35){x=y/10;}else{y=n/25+n;}

What should be changed in order to pass the test and at the same time recognize all the desired lexemes (in this case, variables)?

UPD: problem with analysis solved, implementation of parsing remained

Code of patterns, if anyone is interested (on the left is a token, on the right is its indices in the input line):

 public static Dictionary<string, string> Patterns = new Dictionary<string, string>() { {"variable", @"(?:[-+*\/=<>{)(;]{1})([A-Za-z]{1}[A-Za-z0-9]*)"}, {"operator", @"(?:[A-Za-z0-9]+?)([<>]{1}[=]?|[=]{2}|!{1}={1}|[-+*\/=]+?)"}, {"punctuation", @"([)({};]+?)"}, {"keyword", @"(if|else)(?:[({]+?)"}, {"constante", @"(?:[-+*\/=<>(]+?)([-]?[\d]+[.]?[\d]*)"} };

Link to regex101 with the edited expression, maybe you will have more comments:

It is necessary to give not only the expression, but also the text that you are trying to process them - so that others do not have to type from the picture :)
you in all cases after any arithmetic action or equally not highlighted.
I would not advise using regulars for something more complicated than a simple partitioning into tokens (which actually should be the task of the lexer).
Imagine how you will parse the nested if 's to begin with.
Cool of course, but instead of screenshots, can you give links to regex101?
@ Sergey: If you break into tokens, then it is not quite clear, where does the if / else construct.
The tokenizer should normally accept the text else if if x > < > if = else , and issue to the outside the tokens ELSE IF IF IDENT(x) GT LT GT IF EQ ELSE .
That is, it should not “recognize the if / else construct, but simply recognize the tokens IF and ELSE .

Accepted Answer · 2016-05-02T14:14:51

You fit too hard. Write just regulars that recognize a piece of text from the beginning of the line. After that, in order to find the next token, apply all the regulars and find the longest match. (You only need to take care that the identifier keywords have an advantage over the identifiers.) After receiving the token, bite off the finished piece of the string, and repeat from the beginning. No capture groups are needed at all.

Here you have a skeleton:

 class Program { enum TokenType { Comparison, Punct, If, Else, Ident, // должно идти после ключевых слов! NumLiteral }; class Token { public TokenType Type; public string StringValue; public double NumericValue; // сюда стоит ещё добавить информацию о позиции в исходном тексте } // регулярки, описывающие лексему, начинаются с ^, чтобы матчить только в начале Dictionary<TokenType, Regex> regexes = new Dictionary<TokenType, Regex>() { [TokenType.Comparison] = new Regex(@"^(>=?|<=?|==)", RegexOptions.Compiled), [TokenType.Punct] = new Regex(@"^[,)(}{;]", RegexOptions.Compiled), [TokenType.If] = new Regex(@"^if", RegexOptions.Compiled), [TokenType.Else] = new Regex(@"^else", RegexOptions.Compiled), [TokenType.Ident] = new Regex(@"^[a-zA-Z_][a-zA-Z_0-9]*", RegexOptions.Compiled), [TokenType.NumLiteral] = new Regex(@"^[0-9]+(\.[0-9]+)?", RegexOptions.Compiled) }; // что ещё записать в токен? Dictionary<TokenType, Action<string, Token>> creators = new Dictionary<TokenType, Action<string, Token>>() { [TokenType.Comparison] = (s, token) => token.StringValue = s, [TokenType.Punct] = (s, token) => token.StringValue = s, [TokenType.If] = (s, token) => { }, [TokenType.Else] = (s, token) => { }, [TokenType.Ident] = (s, token) => token.StringValue = s, [TokenType.NumLiteral] = (s, token) => token.NumericValue = double.Parse(s) }; // а вот и вся логика: IEnumerable<Token> Tokenize(string text) { // откусываем пробелы var remainingText = text.TrimStart(); while (remainingText != "") { // находим наиболее подходящий паттерн: var bestMatch = (from pair in regexes let tokenType = pair.Key let regex = pair.Value let match = regex.Match(remainingText) let matchLen = match.Length // упорядочиваем по длине, а если длина одинаковая - по типу токена orderby matchLen descending, tokenType select new { tokenType, text = match.Value, matchLen }).First(); // если везде только 0, ошибка if (bestMatch.matchLen == 0) throw new Exception("Unknown lexeme"); var token = new Token() { Type = bestMatch.tokenType }; creators[bestMatch.tokenType](bestMatch.text, token); yield return token; // откусываем распознанный кусок и пробелы после него remainingText = remainingText.Substring(bestMatch.matchLen).TrimStart(); } } static void Main(string[] args) { new Program().Run(); } void Run() { var text = "else if if x > < > if == else; 25.0 7"; foreach (var token in Tokenize(text)) Console.WriteLine(token.Type); } }

Issues:

Else
If
If
Ident
Comparison
Comparison
Comparison
If
Comparison
Else
Punct
Numliteral
Numliteral

[TokenType.Comparison] = new Regex (@ "^ (> =? | <=? | ==)", RegexOptions.Compiled, is some new type of initialization?
Anyway, { TokenType.Comparison, new Regex(...) } , but pontoe :)
Well, it's paired with regexes[TokenType.Comparison] = new Regex(...) .
@ VladD, hello, I decided not to create a new question in order not to clutter up the tape, but to ask for help from you. Now I am writing a parser based on the lexical one, and now I cannot decide which structure to use for the analysis table.
I created a Dictionary <int, Delegate> Rules, now I need a structure that stores as a key a unique combination {enum TokenType, enum NonterminalType} and the corresponding rule number, or get rid of the dictionary and immediately use the delegate of the required method-rule as the value. Tried with DataTable, but nothing good comes out.

Regular expression for C # lexical analyzer

1 answer 1

More articles: