regex: remove duplicate phrases

Question

I need to remove redundant duplicates, if there are any. The example below, which tries to remove the repeated "г. Воронеж" (Voronezh), does not work.

gsub(pattern = "(г\\.\\s\\b[[:alpha:]]+\\b,)\\1{2,}",replacement = "\\1",x="г. Воронеж, г. Воронеж, г. Воронеж, ул. Ленинский пр-т,174",ignore.case = T)

Also, some addresses contain no duplicates (a single city with no repetitions).
Where there are no duplicates, nothing needs to be removed :)

Accepted Answer · 2017-02-09T09:38:45

In the original regular expression, the backreference to capturing group 1 holds "г. Воронеж," (including the comma), so another "г. Воронеж," would have to follow immediately after the first one. In the string, however, the comma is followed by a space, so the pattern never matches.
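The mismatch can be checked directly with grepl. This is only a minimal sketch: it assumes a UTF-8 locale (so that [[:alpha:]] and \b treat Cyrillic letters as letters), and the names `original` and `with_space` are illustrative. The `with_space` variant is just one possible way to let the captured text include the space; it is not the pattern recommended in this answer.

```r
x <- "г. Воронеж, г. Воронеж, г. Воронеж, ул. Ленинский пр-т,174"

# Pattern from the question: group 1 ends at the comma, so \\1 has to start
# right after it, but in the string a space comes next.
original <- "(г\\.\\s\\b[[:alpha:]]+\\b,)\\1{2,}"
matches_original <- grepl(original, x, ignore.case = TRUE)

# Same pattern with the space after the comma moved inside group 1, so the
# backreference repeats "г. Воронеж, " including the trailing space.
with_space <- "(г\\.\\s\\b[[:alpha:]]+\\b,\\s)\\1{2,}"
matches_with_space <- grepl(with_space, x, ignore.case = TRUE)

print(c(original = matches_original, with_space = matches_with_space))
```

Note that \\1{2,} requires at least three copies in total, so a city that appears only twice would be left alone by either variant.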

Use

    p <- "(г\\.\\s*[[:alpha:]]+)(?:,\\s*\\1)+"
    x <- "г. Воронеж, г. Воронеж, г. Воронеж, ул. Ленинский пр-т,174"
    gsub(p, "\\1", x, ignore.case = TRUE)


Description (backslashes are shown as in the regex itself; in the R string literal each one is doubled):

(г\.\s*[[:alpha:]]+) - Capturing group 1:
- г\. - the literal text "г."
- \s* - zero or more whitespace characters
- [[:alpha:]]+ - one or more letters
(?:,\s*\1)+ - one or more repetitions of the following sequence:
- ,\s* - a comma followed by zero or more whitespace characters
- \1 - the text captured by group 1.

(?:...) is a non-capturing group: it does not create a backreference (i.e. it does not store the part of the match it covers in a memory buffer, as a capturing group does).
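To see how the pattern treats addresses with and without repetitions, it can be applied to a small vector. This is a sketch under assumptions: a UTF-8 locale is assumed for [[:alpha:]] to cover Cyrillic, and the second address is a made-up example of the "single city" case mentioned in the question, not data from the original post.

```r
# Pattern from the answer: one captured "г. <city>", followed by one or
# more ", г. <city>" repeats of exactly the same text.
p <- "(г\\.\\s*[[:alpha:]]+)(?:,\\s*\\1)+"

addresses <- c(
  "г. Воронеж, г. Воронеж, г. Воронеж, ул. Ленинский пр-т,174",
  "г. Воронеж, ул. Ленинский пр-т,174"  # hypothetical address, no duplicates
)

# gsub is vectorised over x; strings without a repeated city do not match
# the pattern and are returned unchanged.
cleaned <- gsub(p, "\\1", addresses, ignore.case = TRUE)
print(cleaned)
```

Because the repeated part (?:,\s*\1)+ must occur at least once, the pattern never matches a city that appears only once, so no separate check for "no duplicates" is needed.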
I added a space at the end of my own version, and it worked just as well.
