Is something wrong with my regular expression, or why does it match this fragment?

The task is to delete the table rows in an HTML document in which all the cells are empty. An HTML fragment from the document is below:

<tr> <td collspan="5"> <b>Какой-то текст</b> </td> </tr> 

The regular expression I wanted to use to remove such fragments:

 <tr.*?>(<td.*?>|</td>|\s)*</tr> 

However, the example above also matches this expression. It seems that \s eats all the text inside the tr tag, but I cannot understand why.

  • <tr[^>]*>(\s*<td[^>]*>\s*<\/td>)*\s*<\/tr> - splash58
  • @splash58, thank you, it seems to work. But it is not clear what the fundamental difference between my expression and this one is. Isn't [^>]* equivalent to .*?> in my case? One is any characters except >, zero or more times; the other is as few characters as possible before the > character. - Pincher1519
  • It depends on the greediness of the regex. I don't know the regex library in question; perhaps it simply runs up to the last >. - splash58
  • @Pincher1519 ".*?" means as few characters as possible for a match to succeed. This does not mean that the shortest section will be captured if a larger section of text fits the regex and the smaller section does not. Check in the regex debugger how the regex behaves: regex101.com/r/lG7eT9/1 - Visman
  • For some reason no one wants to write an answer :) But it is easier with HtmlAgilityPack. - Wiktor Stribiżew

1 answer

In .NET regular expressions, the dot works much the same as in other regex libraries (except POSIX): it matches any character other than a newline ( "\n" ). The quantifier *? does not find the shortest substring; it matches as many characters as are needed for the whole pattern to match. Since the string (like the regex) is processed from left to right by default, <td.*?> matches <td , then zero or more non-newline characters up to the first > after which the rest of the pattern can match through to </tr> . In the example that is the > of </b> , because only whitespace and </td> separate it from </tr> . If there were no </tr> after it, <td.*?> would match <td collspan="5"> , as expected.
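To see this expansion directly, here is a minimal sketch. It uses Python's re module as a stand-in for .NET (an assumption of convenience; lazy quantifiers and backtracking behave the same way for this pattern), and the named group td is added only to expose what the <td.*?> alternative consumed.

```python
import re

html = '<tr> <td collspan="5"> <b>Какой-то текст</b> </td> </tr>'

# On its own, the lazy pattern stops at the first '>'.
alone = re.search(r'<td.*?>', html)
print(alone.group(0))    # <td collspan="5">

# Inside the full pattern, the engine backtracks and lets '.*?' grow until
# everything after it can match, so the "tag" swallows the cell content.
# The named group 'td' is hypothetical, added only for inspection.
full = re.search(r'<tr.*?>(?:(?P<td><td.*?>)|</td>|\s)*</tr>', html)
print(full.group('td'))  # <td collspan="5"> <b>Какой-то текст</b>
print(full.group(0) == html)  # True: the non-empty row matches as a whole
```

The lazy quantifier only sets the order in which lengths are tried (shortest first); it does not forbid the longer ones.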

Solution: to avoid running past a single tag, use a negated character class: [^<>]* , [^<]*? or [^>]* . If a tag can contain an unescaped < or > , for example <tr><td name="<67"></td></tr> , you will need a tempered greedy token, (?:(?!</?[a-zA-Z]).)* , which refuses to match any character that starts a tag ( <a or </a ).
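The two options can be compared side by side. This is a sketch under assumptions: Python's re stands in for .NET, and the helper empty_row_pattern plus the sample row named empty are made up for illustration.

```python
import re

# Two ways to describe "the rest of a tag" without running past it.
NEGATED = r"[^<>]*"                  # any characters except < and >
TEMPERED = r"(?:(?!</?[a-zA-Z]).)*"  # any characters that do not start a tag


def empty_row_pattern(tag_rest):
    """Build the empty-row regex around the given tag-body sub-pattern."""
    return re.compile(
        "<tr" + tag_rest + ">(?:<td" + tag_rest + r">|</td>|\s)*</tr>"
    )


negated = empty_row_pattern(NEGATED)
tempered = empty_row_pattern(TEMPERED)

filled = '<tr> <td collspan="5"> <b>Какой-то текст</b> </td> </tr>'
empty = '<tr> <td collspan="5"> </td> </tr>'   # hypothetical empty row
tricky = '<tr><td name="<67"></td></tr>'       # unescaped < in an attribute

for name, pattern in (("negated", negated), ("tempered", tempered)):
    for row in (filled, empty, tricky):
        print(name, bool(pattern.search(row)), row)
```

Both variants leave the filled row alone and match the empty one. Only the tempered token also matches the row whose attribute contains a bare < , because <6 does not look like the start of a tag.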

It is best to use an HTML parser: "slow and steady wins the race."
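As a rough illustration of the parser route, here is a sketch that uses Python's standard html.parser instead of HtmlAgilityPack (a swapped-in tool, not the .NET API). The class EmptyRowStripper and the function strip_empty_rows are hypothetical names. Assumptions: a row counts as empty when it holds only tr/td tags and whitespace, end tags are rewritten in lowercase, and nested tables are not treated specially.

```python
from html.parser import HTMLParser


class EmptyRowStripper(HTMLParser):
    """Re-emit the markup it is fed, skipping <tr> rows with no cell content."""

    def __init__(self):
        # Keep entities such as &nbsp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.kept = []            # markup outside rows, plus rows that are kept
        self.row = None           # pieces of the row being read, else None
        self.has_content = False

    def _emit(self, raw, is_content):
        if self.row is None:
            self.kept.append(raw)
        else:
            self.row.append(raw)
            self.has_content = self.has_content or is_content

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and self.row is None:
            self.row, self.has_content = [], False
        # Inside a row, any element other than tr/td counts as content.
        self._emit(self.get_starttag_text(), tag not in ("tr", "td"))

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text(), True)

    def handle_endtag(self, tag):
        self._emit(f"</{tag}>", False)
        if tag == "tr" and self.row is not None:
            if self.has_content:
                self.kept.extend(self.row)
            self.row = None

    def handle_data(self, data):
        self._emit(data, bool(data.strip()))

    def handle_entityref(self, name):
        self._emit(f"&{name};", True)

    def handle_charref(self, name):
        self._emit(f"&#{name};", True)

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->", False)


def strip_empty_rows(html):
    parser = EmptyRowStripper()
    parser.feed(html)
    parser.close()
    if parser.row is not None:    # unclosed <tr> at the end: keep it as is
        parser.kept.extend(parser.row)
    return "".join(parser.kept)


table = (
    '<table>\n'
    '<tr> <td></td> <td> </td> </tr>\n'
    '<tr> <td collspan="5"> <b>Какой-то текст</b> </td> </tr>\n'
    '</table>'
)
print(strip_empty_rows(table))
```

The parser handles attribute quoting and tag boundaries itself, so the emptiness test reduces to "did anything other than tr, td and whitespace appear in this row".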

In most cases, this will do:

 <tr[^<]*?>(<td[^<]*?>|</td>|\s)*</tr> 

See the demo
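A quick way to try the expression above on a small table, again sketched with Python's re as a stand-in for .NET. The sample table and the name EMPTY_ROW are made up for illustration.

```python
import re

# The expression from the answer, unchanged.
EMPTY_ROW = re.compile(r'<tr[^<]*?>(<td[^<]*?>|</td>|\s)*</tr>')

table = (
    '<table>\n'
    '<tr> <td></td> <td> </td> </tr>\n'
    '<tr> <td collspan="5"> <b>Какой-то текст</b> </td> </tr>\n'
    '</table>'
)

# Remove every row that consists only of td tags and whitespace.
cleaned = EMPTY_ROW.sub('', table)
print(cleaned)
```

Only the first row is removed. The newline that followed it stays in place, so a later pass may be wanted to tidy blank lines.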