Copying from html code of certain words in the document

Question

Good day! Tell me, there is a piece of html code, how can pythone implement a search for certain characters or words and copy words or numbers after or before them into the document. For example: :

> <tr style="border-top: solid 1px;"> > <td><strong>Дата 01.05.2014</strong></td> > <td><strong>Турпакет, Турция, Алания, 4 дня</strong></td> > <td style="width: 120px;"><strong>15 393 руб. <br /> (~ > 343 € , ~ 467 $)</strong></td>

It is necessary to vytechit "05/01/2014", these numbers always go after the word "date" , which never changes, unlike them, you also need to continue to find the words after the unchangeable word "Turpaket" (in this case, "Turkey, Alania" ) but the problem is that unlike the date “05/01/2014” in which the number of characters is always the same, there may be more or less, but it is known that there are always numbers after them (or the word “day / day” or tag  - any that can be used as an anchor to determine the end)

I would have at least an approximate algorithm, and I’ll finish it up somehow, I’m just not a programmer at all, but if I have something to push off, I’ll figure out how to make a thread.

ps The fact that it is copied from the html code is an option, perhaps it will be easier from xml, I'm just not familiar with the subtleties.

Truabchev Truabchev 73 7 bronze marks · Answer 1 · 2014-02-05T10:12:58

For parsing you can use for example lxml . First, search for the desired node, and then parse it using the appropriate algorithm.

 text = page.cssselect("#node_id").text_content() data = parse(text)

And the node to parse using regular expressions

Answer 2 · 2014-02-26T16:46:08

Try not to figure out the details and subtleties (if you don’t intend to become a programmer), "pogulite" on the topic of regular expressions regexp , learn the syntax and basics of the basics, there must surely be an article on Habré.

After that, you will easily write your small parser, and knowledge of regular expressions will not be superfluous, they can be useful for you to edit in large texts (for example, when using sublime) and ...

... if you want to become a programmer;)

Option: import re text = "" "> <tr style =" border-top: solid 1px; ">> <td> Date 05/01/2014 </ td>> <td> Tour package, Turkey, Alanya, 4 days </ td>> <td style = "width: 120px;"> 15 393 rub. (~> € 343, ~ $ 467) </ td> "" "pack = re.compile (r" Tour Package, (. *?) <") date = re.compile (r" Date (. *?) <") print pack.findall ( text) [0] # list print date.findall (text) [0] # list
Maybe you can do something better like this: import nltk # text = "" # your piece of html print nltk.clean_html (text)>> Date 05/01/2014> Turpaket, Turkey, Alanya, 4 days> 15 393 rub.

Deleted 351 1 golden mark 5 silver marks 13 bronze marks · Answer 3 · 2014-02-05T07:39:25

Perhaps it will be easier to work directly with the DOM?

Grab - python library for parsing sites

Copying from html code of certain words in the document

3 answers 3

More articles: