Python regex: excluding compound words from matches

Question

Good day, dear experts! I have started studying regular expressions (RE) in my spare time, and I ran into a problem: I need to exclude compound words from the match.

(?<=(Молоко)|(Хлеб)|(Тольятти))(.*)$

Input data:

 Хлебзавод производит Хлеб это очень полезный и питательный продукт
 Молокозавод производит Молоко. продукт богатый кальцием
 Тольяттихлеб находиться в г. Тольятти на ул. Компартии 11
 Тольяттихлеб находиться в г. Тольятти. на ул. Компартии 11
 Тольяттихлеб находиться в г. Тольятти., на ул. Компартии 11

What I want to get on the output:

 group(1)="Хлеб" group(2)=" это очень полезный и питательный продукт"
 group(1)="Молоко" group(2)=". продукт богатый кальцием"
 group(1)="Тольятти" group(2)=" на ул. Компартии 11"
 group(1)="Тольятти" group(2)=". на ул. Компартии 11"
 group(1)="Тольятти" group(2)="., на ул. Компартии 11"

What I actually get:

 group(1)="Хлеб" group(2)="завод производит Хлеб это очень полезный и питательный продукт"
 group(1)="Молоко" group(2)="завод производит Молоко. продукт богатый кальцием"
 group(1)="Тольятти" group(2)="хлеб находиться в г. Тольятти на ул. Компартии 11"
 group(1)="Тольятти" group(2)="хлеб находиться в г. Тольятти. на ул. Компартии 11"
 group(1)="Тольятти" group(2)="хлеб находиться в г. Тольятти., на ул. Компартии 11"

I would also be extremely grateful if someone could suggest how to get rid of the stray characters at the start of the results (spaces, ".", ".," and so on), though this is not critical. I tried \b, but it does not work; if I understood the documentation correctly, inside () it is treated as a backspace. I also tried the construct (, |. |) after the required word. That certainly works, but it does not seem like an elegant solution to me, and it adds an extra group (). Besides, I only have this problem with a couple of words out of several dozen, and I don't really want to rewrite the rest because of them.

Accepted Answer · 2016-07-17T21:30:35

 \b(Молоко|Хлеб|Тольятти)\b(.*)$

The metacharacter \b (a word boundary) prevents a compound word from matching.
Placing it to the left and right of the word gives the desired result.
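
As a minimal sketch of the pattern above: the two-line sample string and the variable names are my own illustration, and Python 3 is assumed, where str patterns are Unicode-aware by default. The two \b anchors skip the compound word and match only the standalone one. Note that group 2 still keeps the leading separator characters:

```python
import re

# Pattern from the answer: \b on both sides of the alternation.
pattern = re.compile(r'\b(Молоко|Хлеб|Тольятти)\b(.*)$', re.M)

# Hypothetical two-line sample, modelled on the question's input.
sample = (
    "Хлебзавод производит Хлеб это полезный продукт\n"
    "Молокозавод производит Молоко. продукт"
)

matches = pattern.findall(sample)
print(matches)
# [('Хлеб', ' это полезный продукт'), ('Молоко', '. продукт')]
```

The compound words "Хлебзавод" and "Молокозавод" are rejected because there is no word boundary between the keyword and the letter that follows it.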

 import re
 # \W+ after the word consumes the separators (spaces, ".", ".,"),
 # so they do not end up in group 2; re.M lets $ match at the end of every line.
 regex = re.compile(r'\b(Молоко|Хлеб|Тольятти)\W+(.*)$', re.M)
 text = """Наш завод производит Хлеб это очень полезный и питательный продукт
 Молокозавод производит Молоко. продукт богатый кальцием
 Тольяттихлеб находиться в г. Тольятти на ул. Компартии 11
 Тольяттихлеб находиться в г. Тольятти. на ул. Компартии 11
 Тольяттихлеб находиться в г. Тольятти., на ул. Компартии 11"""
 print(regex.findall(text))

Result:

 [('Хлеб', 'это очень полезный и питательный продукт'), ('Молоко', 'продукт богатый кальцием'), ('Тольятти', 'на ул. Компартии 11'), ('Тольятти', 'на ул. Компартии 11'), ('Тольятти', 'на ул. Компартии 11')]

http://ideone.com/gjchVv

The lookbehind is not needed in your regular expression. In Python it also fails to compile, because the alternatives inside it have different lengths and the re module requires a fixed-width lookbehind.
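
The compilation error can be reproduced with a short check. This is only a sketch: the helper name `compiles` is hypothetical, and the fixed-width pattern is my own contrast case, not something from the question.

```python
import re

def compiles(pattern: str) -> bool:
    """Hypothetical helper: True if `pattern` compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The question's pattern: branches of 6, 4 and 8 characters inside (?<=...).
variable_width = r'(?<=(Молоко)|(Хлеб)|(Тольятти))(.*)$'

# Contrast case (an assumption for illustration): a single fixed-width branch.
fixed_width = r'(?<=Хлеб)(.*)$'

print(compiles(variable_width))  # False
print(compiles(fixed_width))     # True
```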

P.S. Don't forget to compile the regular expression with the UNICODE flag in Python 2.x.
P.P.S. This version of the regular expression drops the stray characters. The previous revision of this answer keeps them, if you need that.
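
To illustrate why the P.S. about the UNICODE flag matters, here is a rough sketch in Python 3. It does not run Python 2; instead it uses `re.ASCII` to imitate a pattern whose \w and \b only recognise ASCII letters, which is an assumption about how closely the two behaviours correspond. The sample string is my own.

```python
import re

sample = "Хлебзавод производит Хлеб"

# Default str pattern in Python 3: Unicode-aware, so Cyrillic letters count as \w.
unicode_hits = re.findall(r'\bХлеб\b', sample)

# re.ASCII restricts \w (and therefore \b) to ASCII characters, so no word
# boundary is ever found next to a Cyrillic letter and nothing matches.
ascii_hits = re.findall(r'\bХлеб\b', sample, re.ASCII)

print(unicode_hits)  # ['Хлеб']
print(ascii_hits)    # []
```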

Thanks again! I had been struggling with this myself for 2 days and was doing it through search.
