How can I use regular expressions to remove every character from a text except Russian and English letters, punctuation and whitespace? Is there a ready-made character class for this (in regexp terms)? I would like to avoid listing the characters explicitly.


    Try

    replaceAll("[[\\W[0-9_]]&&[\\S]&&[^А-Яа-я-.?!)(,:]]", ""); 
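To see what this pattern keeps and what it drops, here is a minimal sketch. The class name `IntersectionPatternDemo` and the helper `strip` are made up for illustration, and the Cyrillic ranges are written as Unicode escapes only so that the file compiles regardless of source encoding. It assumes Java's default regex flags, under which `\W` also matches Cyrillic letters.

```java
import java.util.regex.Pattern;

class IntersectionPatternDemo {
    // The pattern from the answer as a Java string literal (backslashes doubled).
    // The Cyrillic ranges U+0410..U+042F and U+0430..U+044F are written as
    // Unicode escapes so the source file stays pure ASCII.
    static final Pattern TO_REMOVE = Pattern.compile(
        "[[\\W[0-9_]]&&[\\S]&&[^\u0410-\u042F\u0430-\u044F-.?!)(,:]]");

    // Hypothetical helper: deletes every character matched by TO_REMOVE.
    static String strip(String text) {
        return TO_REMOVE.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // A Cyrillic word, then ASCII text, digits, an underscore and a tilde.
        String input = "\u041F\u0440\u0438\u0432\u0435\u0442, world! 123_ ~";
        System.out.println("before: " + input);
        System.out.println("after:  " + strip(input));
    }
}
```

Digits, the underscore and symbols such as `~`, `@` or `#` are removed, while letters, whitespace and the listed punctuation survive. One caveat: the letters Ё and ё (U+0401, U+0451) lie outside the ranges А-Я and а-я, so this pattern removes them unless they are added to the last class.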

      Java's regex syntax has a set of predefined character groups:

      Predefined character classes

      . - Any character (may or may not match line terminators)

      \d - A digit: [0-9]

      \D - A non-digit: [^0-9]

      \s - A whitespace character: [ \t\n\x0B\f\r]

      \S - A non-whitespace character: [^\s]

      \w - A word character: [a-zA-Z_0-9]

      \W - A non-word character: [^\w]
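A quick way to get a feel for these classes is to delete their matches from a few sample strings. The sketch below does that; `PredefinedClassesDemo` and `drop` are illustrative names, not part of any library. The last call shows why `\w` alone is not enough for Russian text: unless `Pattern.UNICODE_CHARACTER_CLASS` is set, `\w` covers only ASCII, so Cyrillic letters count as `\W`.

```java
class PredefinedClassesDemo {
    // Hypothetical helper: removes every match of the given regex from the text.
    static String drop(String text, String regex) {
        return text.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        System.out.println(drop("a1b22c", "\\d"));      // removes the digits
        System.out.println(drop("a1b22c", "\\D"));      // removes everything but digits
        System.out.println(drop("a b\tc\n", "\\s"));    // removes whitespace
        System.out.println(drop("foo_bar-9!", "\\W"));  // removes non-word characters

        // A Cyrillic word followed by " hi", written with Unicode escapes.
        // With default flags the Cyrillic letters match \W and are removed too.
        String mixed = "\u041F\u0440\u0438\u0432\u0435\u0442 hi";
        System.out.println(drop(mixed, "\\W"));
    }
}
```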

      Unfortunately, there is no single ready-made class that matches everything except letters, digits and punctuation marks, so you have to combine several classes.

      For example:

       String str = "1 my example str ~ !";
       System.out.println("before:" + str);
       // Remove everything except word characters, space, tab and the listed punctuation.
       str = str.replaceAll("[^\\w ,.:\"'!\\t]", "");
       System.out.println(" after:" + str);
      • Unfortunately, your version does not take into account Russian characters. - Victor Khovanskiy
      • In case it helps, here is a convenient online regex tester: freeformatter.com/java-regex-tester.html - Mikhail Grebenev
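Regarding the remark about Russian characters: one possible way to cover them without listing letters by hand is to use Unicode script and category classes, which `java.util.regex.Pattern` supports. The sketch below is only an illustration under stated assumptions: `KeepLettersAndPunctuation` and `clean` are made-up names, `\p{IsLatin}` and `\p{IsCyrillic}` are broader than the English and Russian alphabets (they also admit accented Latin letters, Ukrainian and Serbian letters, and so on), and `\p{P}` is the Unicode punctuation category, which does not include symbols such as `~`, `+` or `$`.

```java
import java.util.regex.Pattern;

class KeepLettersAndPunctuation {
    // Matches any character that is NOT Latin-script, Cyrillic-script,
    // Unicode punctuation or (ASCII) whitespace.
    private static final Pattern UNWANTED =
        Pattern.compile("[^\\p{IsLatin}\\p{IsCyrillic}\\p{P}\\s]");

    // Hypothetical helper: deletes every unwanted character from the text.
    static String clean(String text) {
        return UNWANTED.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // A Cyrillic word, ASCII text, digits, the letter U+0451 and a tilde,
        // written with Unicode escapes so the source stays pure ASCII.
        String input = "\u041F\u0440\u0438\u0432\u0435\u0442, world 42! \u0451~";
        System.out.println("before: " + input);
        System.out.println("after:  " + clean(input));
    }
}
```

Unlike the explicit ranges А-Я and а-я, the script class also keeps Ё and ё. If only ASCII punctuation should survive, `\p{Punct}` could be swapped in for `\p{P}`, at the cost of also keeping symbols like `~` and `$`.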