I need to remove redundant duplicates, if there are any. The example below, which should remove the repeated "г. Воронеж" (Voronezh), does not work.

gsub(pattern = "(г\\.\\s\\b[[:alpha:]]+\\b,)\\1{2,}", replacement = "\\1", x = "г. Воронеж, г. Воронеж, г. Воронеж, ул. Ленинский пр-т,174", ignore.case = TRUE)
  • Do you only need to remove duplicates that directly follow one another? - Wiktor Stribiżew
  • Yes, and there are also addresses with no duplicates (a single city, not repeated) - Edvardoss
  • Where there are no duplicates, there is nothing to delete :) - Wiktor Stribiżew

1 answer

In the original regular expression, the backreference to the first capturing group holds "г. Воронеж," (including the comma), so another "г. Воронеж," would have to follow immediately after it. In the input, however, each comma is followed by a space, so the pattern never matches.
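The diagnosis above can be checked with a small sketch that runs the question's pattern against two inputs: the original address and a made-up variant with no space after the commas. The variable names and the second input are hypothetical, and the sketch assumes a UTF-8 locale so that [[:alpha:]] matches Cyrillic letters.

```r
# Pattern from the question: group 1 ends with the comma, with no trailing space.
p_orig <- "(г\\.\\s\\b[[:alpha:]]+\\b,)\\1{2,}"

# Address from the question: every comma is followed by a space.
x_spaced <- "г. Воронеж, г. Воронеж, г. Воронеж, ул. Ленинский пр-т,174"
# Hypothetical variant (not from the question): the copies follow each other with no space.
x_tight <- "г. Воронеж,г. Воронеж,г. Воронеж, ул. Ленинский пр-т,174"

res_spaced <- gsub(p_orig, "\\1", x_spaced, ignore.case = TRUE)
res_tight  <- gsub(p_orig, "\\1", x_tight,  ignore.case = TRUE)

# TRUE here means the pattern found nothing to replace in the original address.
print(identical(res_spaced, x_spaced))
# The variant without spaces is the only one the pattern can collapse.
print(res_tight)
```

If the first check prints TRUE while the second input is shortened, the missing whitespace between the copies is what blocks the match.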

Use

 p <- "(г\\.\\s*[[:alpha:]]+)(?:,\\s*\\1)+"
 x <- "г. Воронеж, г. Воронеж, г. Воронеж, ул. Ленинский пр-т,174"
 gsub(p, "\\1", x, ignore.case = TRUE)

See the R demo online.

Explanation:

  • (г\.\s*[[:alpha:]]+) - capturing group number 1:
    • г\. - the literal text г.
    • \s* - 0 or more whitespace characters
    • [[:alpha:]]+ - 1 or more letters
  • (?:,\s*\1)+ - a non-capturing group, repeated 1 or more times, that matches:
    • ,\s* - a comma followed by 0 or more whitespace characters
    • \1 - the same text that capturing group 1 matched (a backreference).
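As a rough check of the breakdown above, the sketch below applies the answer's pattern to a few addresses at once. Only the first address comes from the question; the other two are made up to cover the "two copies" and "no duplicates" cases raised in the comments. A UTF-8 locale is assumed.

```r
# Pattern from the answer: a city name, then one or more ", <same city>" repeats.
p <- "(г\\.\\s*[[:alpha:]]+)(?:,\\s*\\1)+"

# Sample addresses; only the first is from the question, the others are hypothetical.
addresses <- c(
  "г. Воронеж, г. Воронеж, г. Воронеж, ул. Ленинский пр-т,174",  # three copies
  "г. Воронеж, г. Воронеж, ул. Ленинский пр-т,174",              # two copies
  "г. Воронеж, ул. Ленинский пр-т,174"                           # no duplicates
)

# gsub() is vectorized over x, so all addresses are cleaned in one call.
cleaned <- gsub(p, "\\1", addresses, ignore.case = TRUE)
print(cleaned)
```

Under these assumptions all three inputs should come out as the same address with a single city. The + quantifier also covers exactly two copies, whereas \1{2,} in the question's pattern asks for at least three copies in total.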
  • Can you explain the "?:" in a bit more detail? What is it? - Edvardoss
  • It marks a non-capturing group, which does not create a backreference (i.e. it does not store its part of the match in a memory buffer, as a capturing group does). It serves only to group subpatterns. - Wiktor Stribiżew
  • Your answer helped, thanks. I decided to extend my own version with a space at the end, and it worked as well. pattern = "(*г\\.\\s\\b[[:alpha:]]+\\b,\\s)\\1{2,}" - Edvardoss
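To illustrate the comment about "?:", here is a minimal sketch that compares the answer's pattern with a hypothetical variant in which the second group is capturing. It relies on base R's regexec() and regmatches(), which return the full match followed by one entry per capturing group; the variable names are made up, and a UTF-8 locale is assumed.

```r
x <- "г. Воронеж, г. Воронеж, г. Воронеж, ул. Ленинский пр-т,174"

# Pattern from the answer: the repeated part is wrapped in a non-capturing group.
p_noncapturing <- "(г\\.\\s*[[:alpha:]]+)(?:,\\s*\\1)+"
# Hypothetical variant: the same pattern, but the repeated part is a capturing group.
p_capturing <- "(г\\.\\s*[[:alpha:]]+)(,\\s*\\1)+"

# Each result holds the full match first, then one element per capturing group.
m_noncapturing <- regmatches(x, regexec(p_noncapturing, x))[[1]]
m_capturing    <- regmatches(x, regexec(p_capturing, x))[[1]]

print(length(m_noncapturing))  # full match + group 1
print(length(m_capturing))     # full match + group 1 + group 2
```

Both patterns match the same text; the only difference is that the capturing variant stores an extra group that the replacement "\\1" never uses.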