In .NET regular expressions the dot works much the same as in other regex libraries (POSIX aside): it matches any character except a newline ("\n"). The lazy quantifier *? does not find the shortest possible substring; it matches as many characters as it needs for the overall match to succeed. Since the string (like the pattern) is processed from left to right by default, <td.*?> matches <td, then zero or more characters other than a line feed, up to the first > that is followed by </tr>. If the </tr> were not required there, <td.*?> would match <td collspan="5">, as expected.
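A minimal sketch of that behaviour, written in Python rather than .NET on the assumption that the dot and the lazy quantifier behave the same way for these patterns. The sample row and the shortened pattern `<td.*?></tr>` are made up for illustration; they are not the pattern from the question.

```python
import re

# Hypothetical sample row; "collspan" is spelled as in the text above.
html = '<tr><td collspan="5"></td></tr>'

# On its own, the lazy pattern stops at the first ">".
alone = re.search(r'<td.*?>', html).group(0)

# With "</tr>" required right after ">", the lazy dot keeps expanding
# until it reaches a ">" that is followed by "</tr>".
followed = re.search(r'<td.*?></tr>', html).group(0)

print(alone)     # <td collspan="5">
print(followed)  # <td collspan="5"></td></tr>
```

The second match runs past the end of the opening tag: "lazy" only means the engine tries shorter expansions first, not that the result is confined to one tag.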
Solution: to stay within a single tag, use a negated character class: [^<>]*, [^<]*? or [^>]*. If the tag may contain an unescaped < or >, for example <tr><td name="<67"></td></tr>, you will need a tempered greedy token, (?:(?!</?[a-zA-Z]).)*, which consumes characters one at a time but stops at any position where a tag begins (<a or </a).
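To compare the two options, here is a small sketch, again in Python on the assumption that negated classes and negative lookahead work the same way as in .NET. The names `NEGATED` and `TEMPERED` and the sample strings are only illustrative.

```python
import re

# Negated class: never crosses "<" or ">".
NEGATED = re.compile(r'<td[^<>]*>')
# Tempered greedy token: consumes any character that does not start a tag.
TEMPERED = re.compile(r'<td(?:(?!</?[a-zA-Z]).)*>')

plain = '<tr><td collspan="5"></td></tr>'
tricky = '<tr><td name="<67"></td></tr>'  # unescaped "<" inside an attribute

negated_plain = NEGATED.search(plain).group(0)
negated_tricky = NEGATED.search(tricky)            # no match at all
tempered_plain = TEMPERED.search(plain).group(0)
tempered_tricky = TEMPERED.search(tricky).group(0)

print(negated_plain)    # <td collspan="5">
print(negated_tricky)   # None
print(tempered_plain)   # <td collspan="5">
print(tempered_tricky)  # <td name="<67">
```

The negated class is simpler and faster, but it gives up on the tag with the stray "<"; the tempered token handles it because "<6" does not look like the start of a tag.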
It is best to use an HTML parser: slow and steady wins the race.
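As a rough illustration of the parser route, the sketch below uses Python's standard `html.parser` module. In .NET you would pick an HTML parsing library instead; the class name `CellCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the attributes of every <td> start tag."""

    def __init__(self):
        super().__init__()
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == 'td':
            self.cells.append(dict(attrs))


collector = CellCollector()
collector.feed('<tr><td collspan="5"></td><td></td></tr>')
collector.close()

print(collector.cells)  # [{'collspan': '5'}, {}]
```

The parser decides where each tag ends, so there is no quantifier to tune.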
In most cases, this will do:
<tr[^<]*?>(<td[^<]*?>|</td>|\s)*</tr>
See the demo
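A quick check of the pattern above, sketched in Python on the assumption that it behaves the same as in .NET for this input. The two sample rows are invented for the test.

```python
import re

# The pattern from the answer: a row whose cells contain nothing but whitespace.
ROW = re.compile(r'<tr[^<]*?>(<td[^<]*?>|</td>|\s)*</tr>')

empty_cells = '<tr class="x"><td collspan="5"></td>\n  <td></td></tr>'
with_text = '<tr><td>text</td></tr>'

empty_match = ROW.search(empty_cells)
text_match = ROW.search(with_text)

print(empty_match.group(0) == empty_cells)  # True
print(text_match)                           # None
```

Because [^<]*? cannot cross a "<", each alternative stays inside its own tag, and a row with text inside a cell is rejected rather than swallowed.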
<tr[^>]*>(\s*<td[^>]*>\s*<\/td>)*\s*<\/tr>
- splash58
takes out - splash58
".*?" matches as few characters as possible. That does not mean the shortest stretch will be captured when a longer stretch of text matches the regex and the shorter one does not. Check in the Regex Debugger at regex101.com/r/lG7eT9/1 how the regex behaves. - Visman