Regex in R to remove single letters from text

Question

I ran into strange regex behavior in R: R treats some Cyrillic letters as punctuation, and the pattern starts to misbehave. How can I work around this oddity to remove single letters from text? The code below demonstrates the problem:

    gsub(pattern = "\\b[:alpha:]{1}\\b", replacement = " ", x = " - 1 , очковый оцинкованный ёж з z ZZ 123", ignore.case = T)
    gsub(pattern = "\\b[a-zа-ячё]{1}\\b", replacement = " ", x = " - 1 , очковый оцинкованный ёж з z ZZ 123", ignore.case = T)

The second regex can also be written as "\\b[a-zа-яё]\\b" (ч is already included in the а-я range).
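The remark about the range can be checked by comparing code points. This is a minimal sketch, assuming the regex engine compares code points inside a range (R's own regex documentation warns that the interpretation of ranges can depend on locale and implementation). The variable and function names are illustrative only:

```r
# Code points of the letters involved; \u escapes keep the script
# independent of the encoding of the source file.
a  <- utf8ToInt("\u0430")  # а
ya <- utf8ToInt("\u044f")  # я
ch <- utf8ToInt("\u0447")  # ч
yo <- utf8ToInt("\u0451")  # ё

# Illustrative helper: is a code point inside the а-я range?
in_range <- function(cp) cp >= a && cp <= ya

in_range(ch)  # TRUE: ч lies inside а-я, so listing it separately is redundant
in_range(yo)  # FALSE: ё lies outside а-я, so it has to be listed explicitly
```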

Accepted Answer · 2017-01-19T14:23:46

See also:

    help("stringi-search-charclass", "stringi")
    help("stringi-search-regex", "stringi")

With my settings (see my comment above), your advice works: the letters "ч" and "ё" are treated as letters.

Answer 2 · 2017-01-19T08:57:51

There is one error and two shortcomings in the code:

[:alpha:] - POSIX character classes must be used inside a bracket expression, i.e. [[:alpha:]]
{1} is redundant and can always be removed
ч is already included in the range а-я
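The first point is easy to see on plain ASCII input, where locale does not get in the way. A small sketch (the sample string is made up for illustration): outside a bracket expression, [:alpha:] is read as an ordinary set containing the characters :, a, l, p and h.

```r
s <- "x a p zz"

# Wrong: behaves like the set [:alph], so only the single "a" and "p" are hit.
wrong <- gsub("\\b[:alpha:]\\b", "_", s)

# Right: the POSIX class sits inside a bracket expression and matches any letter.
right <- gsub("\\b[[:alpha:]]\\b", "_", s)

print(wrong)  # [1] "x _ _ zz"
print(right)  # [1] "_ _ _ zz"
```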

Use

    gsub(pattern = "\\b[[:alpha:]]\\b", replacement = " ", x = " - 1 , очковый оцинкованный ёж я z ZZ 123", ignore.case = T)
    gsub(pattern = "\\b[a-zа-яё]\\b", replacement = " ", x = " - 1 , очковый оцинкованный ёж я z ZZ 123", ignore.case = T)

See the demo

If nothing helps, use PCRE:

    x = " - 1 , очковый оцинкованный ёж я z ZZ 123"
    gsub("(*UCP)\\b\\p{L}\\b", " ", x, perl = TRUE)
    ## => [1] " - 1 , очковый оцинкованный ёж     ZZ 123"

Another demo.

(*UCP) - enables Unicode support in the regex, so that \\b, \\w and similar constructs use Unicode properties
\\b - word boundary (here, the start of the word)
\\p{L} - any Unicode letter
\\b - word boundary (here, the end of the word)
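The PCRE pattern above could be wrapped in a small helper. This is only a sketch: remove_single_letters is a hypothetical name, and collapsing the leftover runs of spaces is an extra step that is not part of the original answer. It assumes the input is, or can be converted to, UTF-8.

```r
# Hypothetical helper: drop stand-alone letters of any script, then tidy spaces.
remove_single_letters <- function(x) {
  x <- enc2utf8(x)  # make sure PCRE receives UTF-8 input
  # (*UCP) makes \b Unicode-aware; \p{L} matches any Unicode letter.
  out <- gsub("(*UCP)\\b\\p{L}\\b", " ", x, perl = TRUE)
  # Each removed letter leaves extra blanks behind; squeeze them (added step).
  gsub(" +", " ", out, perl = TRUE)
}

x <- "\u0451\u0436 \u044f z ZZ 123"  # "ёж я z ZZ 123" written with escapes
remove_single_letters(x)             # the single letters я and z are gone
```

Writing the Cyrillic letters as \u escapes keeps the example independent of the encoding in which the script file itself is saved.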

I ran your example on my machine:
    gsub(pattern = "\\b[a-zа-яё]\\b", replacement = " ", x = " - 1 , очковый оцинкованный ёж я z ZZ 123", ignore.case = T)
and got " - 1 , ковый оцинкованный ё ZZ 123".
Apparently this is the cause:
    > sessionInfo()
    R version 3.2.1 (2015-06-18)
    Platform: i386-w64-mingw32/i386 (32-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1
    locale:
    LC_COLLATE=Russian_Russia.1251
    LC_CTYPE=Russian_Russia.1251
    LC_MONETARY=Russian_Russia.1251
    LC_NUMERIC=C
    LC_TIME=Russian_Russia.1251
Try gsub("\\b[[:alpha:]]\\b", " ", `Encoding<-`(x, "UTF-8"), ignore.case = T), or change LC_COLLATE (or LC_ALL) to English_United States.1252.
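The encoding suggestion can be illustrated as follows. This is a hedged sketch: it builds the UTF-8 bytes of "я" by hand to imitate text whose encoding mark was lost, which is an assumption about what goes wrong on such a setup, not a confirmed diagnosis of the session above.

```r
Sys.getlocale("LC_CTYPE")  # the locale the session uses for character handling

# The two UTF-8 bytes of "я" (U+044F), turned into a string with no encoding mark.
y <- rawToChar(as.raw(c(0xd1, 0x8f)))
Encoding(y)                # "unknown"

# Declare the bytes to be UTF-8; the value has to be spelled "UTF-8".
Encoding(y) <- "UTF-8"
identical(y, "\u044f")     # TRUE: R now reads the bytes as the letter я

# With the mark in place, the stand-alone letter is matched as a Unicode letter.
res <- gsub("(*UCP)\\b\\p{L}\\b", "_", paste("zz", y, "zz"), perl = TRUE)
print(res)
```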

VladMinkov · Answer 3 · 2017-01-19T13:45:23

It worked for me as written:

    > gsub(pattern = "\\b[a-zа-яё]\\b", replacement = " ", x = " - 1 , очковый оцинкованный ёж я z ZZ 123", ignore.case = T)
    [1] " - 1 , очковый оцинкованный ёж     ZZ 123"
    > sessionInfo()
    R version 3.3.2 (2016-10-31)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1

    locale:
    [1] LC_COLLATE=English_United States.1252
    [2] LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252
    [4] LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets
    [6] methods   base

    other attached packages:
    [1] RDocumentation_0.8.0

    loaded via a namespace (and not attached):
     [1] httr_1.2.1          rjson_0.2.15
     [3] R6_2.2.0            tools_3.3.2
     [5] withr_1.0.2         curl_2.3
     [7] githubinstall_0.2.1 memoise_1.0.0
     [9] data.table_1.10.0   jsonlite_1.2
    [11] digest_0.6.11       proto_1.0.0
