According to the C# documentation:

\W - Matches any non-alphanumeric character.

It seemed logical to me that the underscore character "_" would fall under this definition.

But in practice, the regular expression @"\W+" does not match the underscore character in the string @"@$^&#№_\|/*-+=~%{}()[];:,.!? "" ""

Why doesn't the regular expression match the underscore character "_"?

2 answers

Every character that \W does not match is matched by \w.

In C#, \w matches letters (not just A-Za-z, but every character in the Unicode letter categories, including Cyrillic), nonspacing marks, decimal digits, and the characters in the Punctuation, Connector category.

The _ character belongs to the Punctuation, Connector category (which contains a handful of other characters besides it, such as ‿ and ⁀).
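To check the category claim directly, a minimal sketch along these lines may help. It assumes a console program using C# top-level statements; the variable names are illustrative only.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// '_' (U+005F) is reported as ConnectorPunctuation, i.e. the Pc category.
UnicodeCategory category = char.GetUnicodeCategory('_');
Console.WriteLine(category); // ConnectorPunctuation

// Because Pc is part of \w, the underscore matches \w and not \W.
bool matchesPc = Regex.IsMatch("_", @"^\p{Pc}$");
bool matchesWord = Regex.IsMatch("_", @"^\w$");
bool matchesNonWord = Regex.IsMatch("_", @"^\W$");

Console.WriteLine(matchesPc);      // True
Console.WriteLine(matchesWord);    // True
Console.WriteLine(matchesNonWord); // False
```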

If the problem is only with _, add it explicitly to the character class: [\W_]+ .

If you want to capture everything in Punctuation, Connector along with all non-letters and non-digits, add the whole category: [\W\p{Pc}]+ .
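The difference between the three patterns can be seen in a small sketch like the one below. The sample string and the CountMatches helper are made up for illustration; the string contains letters separated by '_', '‿' (U+203F, another Punctuation, Connector character) and '-'.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: a _ b ‿ c - d (without the spaces).
string input = "a_b\u203Fc-d";

// Illustrative helper: number of separate matches of a pattern in the sample.
int CountMatches(string pattern) => Regex.Matches(input, pattern).Count;

int plain = CountMatches(@"\W+");                // only "-"
int withUnderscore = CountMatches(@"[\W_]+");    // "_" and "-"
int withConnectors = CountMatches(@"[\W\p{Pc}]+"); // "_", "‿" and "-"

Console.WriteLine(plain);          // 1
Console.WriteLine(withUnderscore); // 2
Console.WriteLine(withConnectors); // 3
```

If the input can contain connector characters other than the underscore, the third pattern is likely the safer choice; otherwise [\W_]+ is enough.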

    So we need to remember that the inverse of the \W metacharacter is \w, which is commonly written as [a-zA-Z0-9_], that is, a character that can appear in a word.

    It usually includes all letters, all digits, and the underscore _. So the underscore belongs to the \w group, and \W matches everything except the characters that \w matches.

    Therefore, in your case, you can write an expression like @"[\W_]+".

    • But doesn't \w also include all Unicode letters: Cyrillic, various umlauts, etc.? - MaxU
    • @MaxU, you're right, I completely forgot about umlauts; they are included too. If you have something to add or correct, feel free to do so. - Let's say Pie
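The point raised in the comments can be checked with a short sketch. The sample words are arbitrary. The last two lines rely on RegexOptions.ECMAScript, which in .NET restricts \w to [a-zA-Z_0-9]; this shows when the ASCII-only reading of \w applies and is not something the answers above suggest using.

```csharp
using System;
using System.Text.RegularExpressions;

// By default, \w in .NET is Unicode-aware: Cyrillic and umlauts match.
bool cyrillic = Regex.IsMatch("Привет", @"^\w+$");
bool umlaut = Regex.IsMatch("\u00FCber", @"^\w+$"); // "über"

// With RegexOptions.ECMAScript, \w only covers ASCII letters, digits and '_'.
bool umlautEcma = Regex.IsMatch("\u00FCber", @"^\w+$", RegexOptions.ECMAScript);
bool asciiEcma = Regex.IsMatch("snake_case", @"^\w+$", RegexOptions.ECMAScript);

Console.WriteLine(cyrillic);   // True
Console.WriteLine(umlaut);     // True
Console.WriteLine(umlautEcma); // False
Console.WriteLine(asciiEcma);  // True
```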