I ran into strange behavior of regular expressions in R: R treats some letters of the Cyrillic alphabet as punctuation, and the pattern stops working correctly. Question: how can I work around this oddity to remove single letters from a text? The code below demonstrates the problem:

 gsub(pattern = "\\b[:alpha:]{1}\\b",replacement = " ",x = " - 1 , очковый оцинкованный ёж з z ZZ 123",ignore.case = T)
 gsub(pattern = "\\b[a-zа-ячё]{1}\\b",replacement = " ",x = " - 1 , очковый оцинкованный ёж з z ZZ 123",ignore.case = T)
  • And what exactly is the problem? [:alpha:] should be written as [[:alpha:]]. The second regular expression can also be written as "\\b[a-zа-яё]\\b" (ч is already included in the а-я range). - Wiktor Stribiżew
  • Doesn't it work as it should here: ideone.com/mLLtRi? - Wiktor Stribiżew

3 answers

A more universal solution:

 library(stringi)
 stri_replace_all_regex(" - 1 , очковый оцинкованный ёж я z ZZ 123", "\\b\\p{L}\\b", " ")
 #R> [1] " - 1 , очковый оцинкованный ёж ZZ 123"

More information about regular expressions in ICU: http://userguide.icu-project.org/strings/regexp

See also:

 help("stringi-search-charclass", "stringi")
 help("stringi-search-regex", "stringi")
  • With my settings (see my comment above), your advice works: the letters “H” and “E” count as letters. Thanks. - Edvardoss
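
The same stringi call can be wrapped for reuse on whole character vectors. This is only a sketch: remove_single_letters is a hypothetical helper name, not part of stringi, and collapsing the leftover whitespace is an extra step that the answer above does not perform. The second sample string is a shortened version of the one in the question, written with \u escapes so that it does not depend on the encoding of the script file.

```r
library(stringi)

# Hypothetical helper (not part of stringi): replace single-letter words
# with a space, then collapse whitespace runs and trim both ends.
remove_single_letters <- function(x) {
  x <- stri_replace_all_regex(x, "\\b\\p{L}\\b", " ")  # any single Unicode letter
  x <- stri_replace_all_regex(x, "\\s+", " ")          # squeeze whitespace runs
  stri_trim_both(x)
}

# Second element is "ёж я z ZZ 123" spelled with \u escapes.
cleaned <- remove_single_letters(c("a b cd", "\u0451\u0436 \u044f z ZZ 123"))
print(cleaned)
```

Because stringi works through ICU rather than the session locale, this sketch should not depend on LC_CTYPE.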

There is one error and two shortcomings in the code:

  • [:alpha:] - POSIX character classes must be used inside square brackets ("bracket expressions"), so it should be written as [[:alpha:]]
  • {1} is redundant and can always be removed
  • ч is already included in the range а-я
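
The first point is easy to check on ASCII input, where the locale plays no role. A minimal sketch, assuming R's default (TRE) regex engine and a made-up sample string: without the outer brackets, [:alpha:] is read as an ordinary bracket expression listing the characters :, a, l, p and h.

```r
# Hypothetical sample string (not from the question), ASCII only.
s <- "alphabet zone 12"

# "[:alpha:]" on its own is a bracket expression over ':', 'a', 'l', 'p', 'h',
# so only those characters are removed.
wrong <- gsub("[:alpha:]", "", s)

# "[[:alpha:]]" is the POSIX letter class inside a bracket expression,
# so every letter is removed.
right <- gsub("[[:alpha:]]", "", s)

print(wrong)
print(right)
```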

Use

 gsub(pattern = "\\b[[:alpha:]]\\b",replacement = " ",x = " - 1 , очковый оцинкованный ёж я z ZZ 123",ignore.case = T)
 gsub(pattern = "\\b[a-zа-яё]\\b",replacement = " ",x = " - 1 , очковый оцинкованный ёж я z ZZ 123",ignore.case = T)

See the demo

If nothing helps, use PCRE:

 x = " - 1 , очковый оцинкованный ёж я z ZZ 123"
 gsub("(*UCP)\\b\\p{L}\\b"," ", x, perl=TRUE)
 ## => [1] " - 1 , очковый оцинкованный ёж ZZ 123"

Another demo.

  • (*UCP) - enables Unicode support in the regular expression, so that \b and the shorthand classes are Unicode-aware
  • \\b - word boundary (beginning of the word)
  • \\p{L} - any Unicode letter
  • \\b - word boundary (end of the word)
  • Very strange. I ran your example on my machine: gsub(pattern = "\\b[a-zа-яё]\\b",replacement = " ",x = " - 1 , очковый оцинкованный ёж я z ZZ 123",ignore.case = T) and I get " - 1 , ковый оцинкованный ё ZZ 123" - Edvardoss
  • Apparently this is the reason: sessionInfo() R version 3.2.1 (2015-06-18) Platform: i386-w64-mingw32/i386 (32-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 locale: LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251 LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C LC_TIME=Russian_Russia.1251 - Edvardoss
  • Yes, it's all about the settings. Try gsub("\\b[[:alpha:]]\\b"," ", `Encoding<-`(x, "UTF-8"),ignore.case = T) or change LC_COLLATE (or LC_ALL) to English_United States.1252. - Wiktor Stribiżew
  • Doesn't PCRE work the same way as ICU? gsub("(*UCP)\\b\\p{L}\\b"," ", x, perl=TRUE) ? - Wiktor Stribiżew
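
The encoding point raised in these comments can be sketched in base R. This is an illustration under assumptions, not a verified fix for the Windows-1251 session above: the sample string is a shortened version of the one in the question, written with \u escapes so that it is UTF-8 regardless of how the script file is saved, and the final whitespace cleanup is an extra step added here.

```r
# "ёж я z ZZ 123" spelled with \u escapes (shortened sample, an assumption).
x <- "\u0451\u0436 \u044f z ZZ 123"

# Show which character-type locale the session is running under.
print(Sys.getlocale("LC_CTYPE"))

# Convert to UTF-8. This is a no-op for the literal above, but it matters
# for text that arrives in a native encoding such as CP1251.
x_utf8 <- enc2utf8(x)

# PCRE with (*UCP): \b and \p{L} follow Unicode properties, not the locale.
res <- gsub("(*UCP)\\b\\p{L}\\b", " ", x_utf8, perl = TRUE)

# Optional extra step: squeeze the runs of spaces left behind.
squished <- gsub(" +", " ", res)
print(squished)
```

With UTF-8 input and perl = TRUE, the match should no longer depend on LC_CTYPE or LC_COLLATE.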

It worked for me as is:

 > gsub(pattern = "\\b[a-zа-яё]\\b",replacement = " ",x = " - 1 , очковый оцинкованный ёж я z ZZ 123",ignore.case = T)
 [1] " - 1 , очковый оцинкованный ёж ZZ 123"
 > sessionInfo()
 R version 3.3.2 (2016-10-31)
 Platform: x86_64-w64-mingw32/x64 (64-bit)
 Running under: Windows 7 x64 (build 7601) Service Pack 1
 locale:
 [1] LC_COLLATE=English_United States.1252
 [2] LC_CTYPE=English_United States.1252
 [3] LC_MONETARY=English_United States.1252
 [4] LC_NUMERIC=C
 [5] LC_TIME=English_United States.1252
 attached base packages:
 [1] stats graphics grDevices utils datasets
 [6] methods base
 other attached packages:
 [1] RDocumentation_0.8.0
 loaded via a namespace (and not attached):
 [1] httr_1.2.1 rjson_0.2.15
 [3] R6_2.2.0 tools_3.3.2
 [5] withr_1.0.2 curl_2.3
 [7] githubinstall_0.2.1 memoise_1.0.0
 [9] data.table_1.10.0 jsonlite_1.2
 [11] digest_0.6.11 proto_1.0.0