How can I use regular expressions to remove every character from a text except letters of the Russian and English alphabets, punctuation, and whitespace? Is there a ready-made class for this (in regexp terms)? I would like to avoid listing the characters explicitly.
2 answers
Try
replaceAll("[[\\W[0-9_]]&&[\\S]&&[^А-Яа-я-.?!)(,:]]", "");
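To make the pattern above easier to check, here is a minimal runnable sketch. The class name `IntersectionDemo` and the method `clean` are made up for illustration; the pattern is the one from the answer, with backslashes doubled as a Java string literal requires. It relies on `&&` (character class intersection): the first operand is everything that is not a Latin letter, the second excludes whitespace, and the third excludes Russian letters and the listed punctuation.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself comes from the answer above.
class IntersectionDemo {
    // Matches a character that is (not a Latin letter) AND (not whitespace)
    // AND (not a Russian letter in А-Яа-я or one of the listed punctuation marks).
    private static final Pattern UNWANTED =
            Pattern.compile("[[\\W[0-9_]]&&[\\S]&&[^А-Яа-я-.?!)(,:]]");

    // Removes every character matched by UNWANTED.
    static String clean(String text) {
        return UNWANTED.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("abc123#где_45!")); // abcгде!
    }
}
```

Note that the range `А-Яа-я` does not cover `Ё`/`ё`, which sit outside that run of code points, so this pattern would strip them; adding `Ёё` to the last class should fix that.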
The regex pattern has predefined character groups:
Predefined character classes
. - Any character (may or may not match line terminators)
\d - A digit: [0-9]
\D - A non-digit: [^0-9]
\s - A whitespace character: [ \t\n\x0B\f\r]
\S - A non-whitespace character: [^\s]
\w - A word character: [a-zA-Z_0-9]
\W - A non-word character: [^\w]
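One detail in this list matters for the question: by default these classes are ASCII-only, so `\w` does not match Cyrillic letters. A small sketch that checks this (the class name `PredefinedClassesDemo` and the helper `allWordChars` are arbitrary, chosen for illustration). It also shows the embedded `(?U)` flag, which corresponds to `Pattern.UNICODE_CHARACTER_CLASS` and makes `\w` Unicode-aware:

```java
// Hypothetical demo class showing how \w behaves with and without (?U).
class PredefinedClassesDemo {
    // Returns true if the whole text consists of "word characters".
    // With unicode == true the (?U) flag switches \w to its Unicode definition.
    static boolean allWordChars(String text, boolean unicode) {
        String regex = unicode ? "(?U)\\w+" : "\\w+";
        return text.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(allWordChars("hello_42", false)); // true
        System.out.println(allWordChars("привет", false));   // false: \w is ASCII-only by default
        System.out.println(allWordChars("привет", true));    // true with (?U)
    }
}
```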
Unfortunately, none of these ready-made classes matches everything except letters, digits, and punctuation marks, so you have to combine them yourself.
For example:
String str = "1 my example str ~ !";
System.out.println("before:" + str);
// Keep word characters, spaces, tabs, and the listed punctuation; remove everything else.
str = str.replaceAll("[^\\w ,.:\"'!\\t]", "");
System.out.println(" after:" + str);

- Unfortunately, your version does not take Russian characters into account. - Victor Khovanskiy
- In case it helps, here is a convenient online regex tester: freeformatter.com/java-regex-tester.html - Mikhail Grebenev
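Following up on the comment about Russian characters: one possible way to avoid listing letters by hand is to use Unicode script and category properties. This is only a sketch, not something proposed in the answers; the class name `ScriptCleaner` and method `clean` are hypothetical. It assumes that "English" can be approximated by the Latin script (`\p{IsLatin}`, which also covers accented letters), "Russian" by the Cyrillic script (`\p{IsCyrillic}`, which also covers other Cyrillic alphabets), and punctuation by the Unicode category `\p{P}`.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the choice of properties is an assumption, see the text above.
class ScriptCleaner {
    // Matches any character that is NOT a Latin-script character, a Cyrillic-script
    // character, Unicode punctuation, or whitespace.
    private static final Pattern UNWANTED =
            Pattern.compile("[^\\p{IsLatin}\\p{IsCyrillic}\\p{P}\\s]");

    // Removes every character matched by UNWANTED.
    static String clean(String text) {
        return UNWANTED.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("Привет,world!123~$№ёЁ")); // Привет,world!ёЁ
    }
}
```

With these properties, digits and symbols such as `~`, `$`, and `№` are dropped, while `ё`/`Ё` and typographic punctuation such as `«»` are kept. Note that `\p{P}` also keeps `_`, `#`, and `@`; if only a handful of marks are wanted, an explicit list like the ones in the answers above is more precise.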