C#: searching for text between tags

Question

Good afternoon. I have already searched Google but could not find an answer to my question. I apologize if it is too basic, but I hope someone can help.

I have a document with the following content:

    //----- (00000001) --------------------------------------------------------
    some text
    //----- (00000002) --------------------------------------------------------
    some text2
    //----- (00000003) --------------------------------------------------------

I want to find a way to extract all the content between two markers such as

    //----- (00000001) --------------------------------------------------------
    //----- (00000002) --------------------------------------------------------

where the number in parentheses is the key. I pass an argument, for example 00000005, and need to get everything from the line where (00000005) is found up to the next //----- sequence.

For example, given the input

    //----- (00000001) --------------------------------------------------------
    some text2
    //----- (00000002) --------------------------------------------------------
    some text3
    some text4
    //----- (00000003) --------------------------------------------------------

and the argument 00000002, the output should be

    //----- (00000002) --------------------------------------------------------
    some text3
    some text4

I'm trying to do this in C#, but so far I can't figure out how to get all the lines "from and to".

I cannot work out which expression to pass to Regex.Match(**) given the conditions of the problem.

The search argument is defined as static string testArg = "00000001";

    public static void Main(string[] args)
    {
        using (var outputFile = new StreamWriter(String.Format("output.c")))
        {
            StreamReader file = new StreamReader("test.c");
            string line;
            while ((line = file.ReadLine()) != null)
            {
                string matchText = Regex.Match(line, "((.*)//-----)").Value;
                outputFile.WriteLine("{0}", matchText);
            }
        }
    }

but I cannot figure out how to work the search argument into it.

If the format is one of the standard ones, it is better to use a ready-made parser.
If you are parsing an email attachment, then again it is better not to reinvent the wheel.

Answer 1 · 2015-12-01T16:37:42

The author probably no longer needs the answer, but I spent half a day today testing various regular expressions for this problem (out of curiosity), so this may come in handy for someone working with regular expressions:

    // Suppose we read the data file.
    using (StreamReader str = new StreamReader(@"d:\testTxt.txt", Encoding.Default))
    {
        string txtFile = str.ReadToEnd();

        // The key to find, together with the text below it up to the next key or the end of the line.
        string findKey = "00000003";

        // The regular expression.
        string regular = @"^//-{5} \(" + findKey + @"\) -{56}(?<txt>[^/]+(/[^/]*)*?)(^//-{5}|$)";
        Regex rx = new Regex(regular, RegexOptions.Multiline);
        Match m = rx.Match(txtFile);

        // The text found after the key being searched for.
        string foundText = m.Groups["txt"].Value;
        Console.WriteLine(foundText);
    }

When tested on a small file in a loop of 100,000 iterations, this takes 1 s 67 ms, which is fast enough.

The standard regular expression, with a lazy quantifier applied to the dot, takes about 45 s:

    string regular = @"//-{5} \(" + searchNumber + @"\) -{56}(?<txt>.+?)(?://-{5})|$";
    Regex rx = new Regex(regular, RegexOptions.Singleline);
    Match m = rx.Match(txtFile);
    string foundText = m.Groups["txt"].Value;

And if you use positive lookahead and lookbehind instead of capturing groups, the program hangs for 2.5 minutes.
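The answer does not show how the timings were taken. One way to run a comparable measurement is sketched below, assuming a Stopwatch around a simple loop; the class RegexTiming, the method Measure and the in-memory sample text are hypothetical, and the numbers it prints will differ from those quoted above.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class RegexTiming
{
    // Hypothetical helper: runs rx.Match(text) the given number of times
    // and returns the total elapsed time.
    public static TimeSpan Measure(Regex rx, string text, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            rx.Match(text);
        }
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Main()
    {
        // Small in-memory sample instead of a file (an assumption for the sketch).
        string dashes = new string('-', 56);
        string text =
            "//----- (00000001) " + dashes + "\nsome text\n" +
            "//----- (00000002) " + dashes + "\nsome text2\n";

        // The first pattern from this answer, with the key 00000001 inlined.
        Regex rx = new Regex(
            @"^//-{5} \(00000001\) -{56}(?<txt>[^/]+(/[^/]*)*?)(^//-{5}|$)",
            RegexOptions.Multiline);

        TimeSpan elapsed = Measure(rx, text, 100000);
        Console.WriteLine("Elapsed: {0} ms", elapsed.TotalMilliseconds);
    }
}
```

Constructing the Regex once outside the loop keeps pattern parsing out of the measurement, so only the matching itself is timed.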

Answer 2 · 2015-11-30T11:29:20

One of the options:

    string testArg = "00000002";
    string pattern = @"//-+\s+\(" + testArg + @"\)\s+-+.*?(?=//-+)";
    string text = File.ReadAllText("file.txt");
    var options = RegexOptions.Singleline;
    string matchText = Regex.Match(text, pattern, options).Value;
    Console.WriteLine(matchText);

Explanation of the regular expression:
// - two / characters
-+ - one or more - characters
\s+ - one or more whitespace characters
\( - an opening parenthesis; it must be escaped
Then the desired value is appended to the pattern.
\) - a closing parenthesis
Then again one or more whitespace characters and one or more dashes.
.*? - any characters, in any quantity (zero or more), matched lazily so that the subsequent comments, which we do not need, are not captured.
(?= ) - positive lookahead: it checks for a match to the enclosed pattern but does not include it in the result.
//-+ - what we are looking ahead for.

RegexOptions.Singleline - this option is needed so that the metacharacter . (dot) also matches line breaks.
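A minimal sketch of the effect of RegexOptions.Singleline on this pattern follows. The class SinglelineDemo, the method FindSection and the shortened in-memory sample are hypothetical; Regex.Escape is an addition not present in the answer, used in case the key ever contains regex metacharacters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical helper: returns the section for the key, header line included,
    // or an empty string when nothing matches.
    public static string FindSection(string text, string key, RegexOptions options)
    {
        string pattern = @"//-+\s+\(" + Regex.Escape(key) + @"\)\s+-+.*?(?=//-+)";
        return Regex.Match(text, pattern, options).Value;
    }

    public static void Main()
    {
        // Shortened separators; the pattern accepts any number of dashes.
        string text =
            "//----- (00000001) -----\n" +
            "some text2\n" +
            "//----- (00000002) -----\n" +
            "some text3\n" +
            "some text4\n" +
            "//----- (00000003) -----\n";

        // With Singleline the dot crosses line breaks, so the whole section is found.
        Console.WriteLine(FindSection(text, "00000002", RegexOptions.Singleline));

        // Without it the dot stops at the first line break and the match fails.
        Console.WriteLine(FindSection(text, "00000002", RegexOptions.None).Length); // prints 0
    }
}
```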

In general, a regular expression can be refined almost indefinitely.
For example, to keep the line-break characters from appearing at the end of the result, you can change the lookahead pattern:
(?=\r\n//-+)
This assumes the file uses Windows-style line endings.
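If the line-ending style is not known in advance, one possible variation (not part of the original answer) is to make the \r optional and to accept the end of the text as a terminator, so that the last section is found as well. The class LineEndingDemo and the method FindSection are hypothetical names for this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineEndingDemo
{
    // Hypothetical helper: the lookahead accepts \r\n or \n before the next
    // separator, or the end of the text (\z) for the last section.
    public static string FindSection(string text, string key)
    {
        string pattern = @"//-+\s+\(" + Regex.Escape(key) + @"\)\s+-+.*?(?=\r?\n//-+|\z)";
        return Regex.Match(text, pattern, RegexOptions.Singleline).Value;
    }

    public static void Main()
    {
        string windows =
            "//----- (00000001) -----\r\nsome text\r\n" +
            "//----- (00000002) -----\r\nlast\r\n";
        string unix = windows.Replace("\r\n", "\n");

        // Both calls return the header line and "some text" without a trailing line break.
        Console.WriteLine(FindSection(windows, "00000001"));
        Console.WriteLine(FindSection(unix, "00000001"));
    }
}
```

Note that a section terminated by \z rather than by a separator still keeps whatever line break ends the file.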

If you do not need to capture the comment itself at the beginning, use a positive lookbehind:

 string pattern = @"(?<=//-+\s+\(" + testArg + @"\)\s+-+\r\n).*?(?=\r\n//-+)";

An option without regular expressions:

    var resultLines = File.ReadLines("file.txt")
        .SkipWhile(s => !s.StartsWith("//----- (" + testArg)) // skip lines until the one we need is found
        .Skip(1) // skip that comment line itself
        .TakeWhile(s => !s.StartsWith("//----- (")) // take lines until the next comment is found
        ;
    string resultText = string.Join(Environment.NewLine, resultLines);
    Console.WriteLine(resultText);

The regular expression solution may not cope with very long text because of the lazy quantifier applied to the dot.
The second solution does not include the initial //----- (identifier) -------------------------------------------------------- line.
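If the header line is wanted in the output of the line-based approach too, one possible variation is a small iterator instead of the LINQ chain. This is a sketch under assumptions: the class SectionReader and the method SectionWithHeader are hypothetical, and the check includes the closing parenthesis, which is stricter than the prefix test in the answer.

```csharp
using System;
using System.Collections.Generic;

public static class SectionReader
{
    // Hypothetical helper: yields the header line for the key and every line
    // after it, stopping before the next separator line.
    public static IEnumerable<string> SectionWithHeader(IEnumerable<string> lines, string key)
    {
        const string prefix = "//----- (";
        bool inside = false;

        foreach (string line in lines)
        {
            if (!inside)
            {
                // Including ")" avoids matching a longer key that merely starts with ours.
                if (line.StartsWith(prefix + key + ")", StringComparison.Ordinal))
                {
                    inside = true;
                    yield return line;
                }
            }
            else if (line.StartsWith(prefix, StringComparison.Ordinal))
            {
                yield break; // the next section begins here
            }
            else
            {
                yield return line;
            }
        }
    }

    public static void Main()
    {
        string[] lines =
        {
            "//----- (00000001) -----",
            "some text2",
            "//----- (00000002) -----",
            "some text3",
            "some text4",
            "//----- (00000003) -----"
        };

        Console.WriteLine(string.Join(Environment.NewLine, SectionWithHeader(lines, "00000002")));
    }
}
```

Because it accepts any IEnumerable<string>, the same method can be fed File.ReadLines("file.txt"), which reads the file lazily and so avoids loading a very long text into memory.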
