Remove digits and numbers from text. Implementation in Python 3.x

Question

I need to optionally remove digits and numbers from text, and I need to implement this in Python. For example, given the text:

"Мой дядя проживет по адресу улица Липатова дом 6, квартира 15. Он родился 15 мая 1970 г. Его зарплата составляет 300 долларов. Каждый месяц он откладывает по десять тысяч мне на учебу в институте ".

The output should be the following text:

 "Мой дядя проживет по адресу . Он родился Его зарплата составляет . Каждый месяц он откладывает по тысяч мне на учебу в институте "

I decided to solve this with the Natasha library, but it does not have enough functionality (or I'm doing something wrong). My code is:

    from natasha import (
        DatesExtractor,
        AddressExtractor,
        MoneyExtractor,
    )
    from natasha.markup import show_markup

    extractors = [
        DatesExtractor(),
        AddressExtractor(),
        MoneyExtractor(),
    ]

    text = '''
    Мой дядя проживет по адресу улица Липатова дом 6, квартира 15. Он родился 15 мая 1970 г. Его зарплата составляет 300 долларов. Каждый месяц он откладывает по десять тысяч мне на учебу в институте
    '''

    spans = []
    for extractor in extractors:
        matches = extractor(text)
        spans.extend(_.span for _ in matches)

    text = show_markup(text, spans)

Displays:

 Мой дядя проживет по адресу [[улица Липатова дом 6, квартира 15]]. Он родился [[15 мая 1970 г.]] Его зарплата составляет [[300 долларов]]. Каждый месяц он откладывает по десять тысяч мне на учебу в институте

This raises the following questions:

How do I remove what is in the square brackets? (Example: [[15 мая 1970 г.]])
How do I remove numbers written out in words? (Example: one, twenty six, etc.)

Why, in your example, should "ten thousand" leave "thousand" behind? That is also a number.
As for "thousand", I am not sure that it should remain.
Natasha removes the street, the house and the rest by itself.
@Galina Perevalova, why is "thousand" left here, if "ten thousand" is the number?

gil9red · Answer 1 · 2018-12-04T09:09:58

Perhaps Natasha can do this itself, but I have not worked with it, so here is a straightforward solution (just delete the [[...]] sequences):

    import re

    text = """\
    Мой дядя проживет по адресу [[улица Липатова дом 6, квартира 15]]. Он родился [[15 мая 1970 г.]] Его зарплата составляет [[300 долларов]]. Каждый месяц он откладывает по десять тысяч мне на учебу в институте
    """

    new_text = re.sub(r'\[\[.+?\]\]', '', text)
    print(new_text)

Console:

 Мой дядя проживет по адресу . Он родился Его зарплата составляет . Каждый месяц он откладывает по десять тысяч мне на учебу в институте
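Deleting the [[...]] spans leaves stray spaces behind, for example before a full stop. The following is a minimal sketch of one way to tidy that up. The function name `strip_markup_spans` and the punctuation set are assumptions made for illustration, not part of the answer above.

```python
import re

def strip_markup_spans(text: str) -> str:
    """Delete every [[...]] span, then tidy the whitespace left behind."""
    # Remove the bracketed spans (non-greedy, so each span is matched separately).
    cleaned = re.sub(r'\[\[.+?\]\]', '', text)
    # Drop spaces that now sit directly before punctuation ("составляет ." -> "составляет.").
    cleaned = re.sub(r'[ \t]+([.,;:!?])', r'\1', cleaned)
    # Collapse runs of spaces or tabs into a single space.
    cleaned = re.sub(r'[ \t]{2,}', ' ', cleaned)
    return cleaned.strip()

marked = 'Он родился [[15 мая 1970 г.]] Его зарплата составляет [[300 долларов]].'
print(strip_markup_spans(marked))
```

Whether the space before punctuation should be removed depends on what the cleaned text is used for, so treat the second substitution as optional.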

But that can be solved with re.sub(r'десять', ... repeated for every number written in words. Probably.
@1stSentinel31YearPerlHist, let's leave the processing of десять and other such strings to Natasha :)
We just worked out in the comments that what should be left to Natasha is not "ten" but "ten thousand" )))
Is it possible to do everything in one place, both selecting and deleting, without creating a new project?
@Galina Perevalova, what project are you talking about?
This regular-expression example is meant as post-processing after Natasha.
I think you need to dig into Natasha to figure out why "ten thousand" was not processed :) The article at habr.com/post/349864 has an example where восемь тысяч was processed.
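On the question of selecting and deleting in one place: if the extractor results can be reduced to character offsets, the text can be cut directly, without going through the [[...]] markup. The sketch below assumes each span is available as a `(start, stop)` pair. That is an assumption about the data, not a confirmed description of Natasha's API, so check it against the library version in use. The helper `remove_spans` is hypothetical, and the offsets are computed by hand here instead of by an extractor.

```python
def remove_spans(text, spans):
    """Return text with every [start, stop) character range cut out."""
    result = []
    position = 0
    # Process spans left to right; sorting also makes overlaps easy to handle.
    for start, stop in sorted(spans):
        if start > position:
            result.append(text[position:start])
        # max() keeps the cursor from moving backwards on overlapping spans.
        position = max(position, stop)
    result.append(text[position:])
    return ''.join(result)

text = 'Он родился 15 мая 1970 г. Его зарплата составляет 300 долларов.'
# Hypothetical offsets, standing in for what an extractor would report.
date = '15 мая 1970 г.'
money = '300 долларов'
spans = [
    (text.index(money), text.index(money) + len(money)),
    (text.index(date), text.index(date) + len(date)),
]
print(remove_spans(text, spans))
```

Cutting by offsets avoids a second pass over the text and cannot accidentally remove double brackets that were already present in the input.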

Answer 2 · 2018-12-04T09:12:39

In interactive mode, it is done like this:

    >>> s = 'Hi, my father live in Moscow, Novoslobodskaya st. 65. lit. 4'
    >>> result = ''.join([i for i in s if not i.isdigit()])
    >>> result
    'Hi, my father live in Moscow, Novoslobodskaya st. . lit. '

I used English text because my console does not print Russian characters. I am not well versed in Python.
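The same character-by-character filtering can also be written with `str.translate`. This is only a sketch of an alternative, and the name `remove_ascii_digits` is made up for illustration. Note one difference: `string.digits` covers only the ASCII digits 0-9, while `str.isdigit()` also accepts some non-ASCII digit characters.

```python
import string

# Translation table that deletes every ASCII digit.
DIGIT_TABLE = str.maketrans('', '', string.digits)

def remove_ascii_digits(text: str) -> str:
    """Return text with the characters 0-9 removed."""
    return text.translate(DIGIT_TABLE)

s = 'Hi, my father live in Moscow, Novoslobodskaya st. 65. lit. 4'
print(repr(remove_ascii_digits(s)))
```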

Or you can use any of these options:

    result = re.sub(r'[0-9]+', '', s)

    >>> import re
    >>> output = re.sub(r'\d+', '', 'Мой дядя проживет по адресу улица Липатова дом 6, квартира 15.Он родился 15 мая 1970 г. Его зарплата составляет 300 долларов. Каждый месяц он откладывает по десять тысяч мне на учебу в институте')
    >>> print(output)
    Мой дядя проживет по адресу улица Липатова дом , квартира .Он родился мая г. Его зарплата составляет долларов. Каждый месяц он откладывает по десять тысяч мне на учебу в институте

Deleting numbers written out in words is another matter. You would probably need some list of number words to compare against the text, but I am not sure.
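The idea of comparing the text against a list of number words could look roughly like this. The word list is hypothetical and deliberately incomplete: Russian number words are inflected, so a real list would need many more forms. Whether "тысяч" belongs in the list at all is exactly what the comments on the question debate.

```python
import re

# Hypothetical, incomplete list of Russian number-word forms.
NUMBER_WORDS = [
    'один', 'два', 'три', 'четыре', 'пять', 'шесть', 'семь', 'восемь',
    'девять', 'десять', 'двадцать', 'тридцать', 'сорок', 'сто',
    'тысяча', 'тысячи', 'тысяч',
]

# \b keeps "сто" from matching inside a longer word such as "стол".
NUMBER_WORD_RE = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, NUMBER_WORDS)) + r')\b',
    flags=re.IGNORECASE,
)

def remove_number_words(text: str) -> str:
    """Delete listed number words, then collapse the gaps they leave."""
    without_words = NUMBER_WORD_RE.sub('', text)
    return re.sub(r' {2,}', ' ', without_words)

print(remove_number_words('Он откладывает по десять тысяч мне на учебу'))
```

In Python 3, `\b` is Unicode-aware for `str` patterns, so the word boundaries work for Cyrillic text as well.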

And with isalpha(), what about whitespace, special characters, Cyrillic, etc.?
The address or date needs to be removed completely.
@Galina Perevalova: "I need to optionally remove digits and numbers from text."

German Borisov · Answer 3 · 2018-12-04T09:15:45

Alternatively, you can use the replace method, replacing every occurrence of a digit or a number word with an empty string.

    text = text.replace("0", "").replace("1", "").replace("2", "").replace("3", "")  # and so on

Or put all the characters and words to delete into a list and call replace in a loop.
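The loop variant might be sketched as follows. The helper name `remove_tokens` and the two number words in the list are only illustrative.

```python
import string

def remove_tokens(text, tokens):
    """Remove every occurrence of each token with plain str.replace."""
    for token in tokens:
        text = text.replace(token, '')
    return text

# All ASCII digits plus a couple of number words (illustrative only).
tokens = list(string.digits) + ['десять', 'тысяч']
print(repr(remove_tokens('дом 6, квартира 15, по десять тысяч', tokens)))

# str.replace works on substrings, so short tokens also hit longer words.
print(remove_tokens('стол', ['сто']))
```

Because `str.replace` matches substrings rather than whole words, a token such as "сто" is also removed from inside "стол". This approach is therefore safe for digits but unreliable for number words.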

Could someone at least explain why this method is so bad that it deserves downvotes?
