Good day. I am parsing a page in Python using grab. The page contains a table with several dozen rows, and I need to parse its contents into a list of lists.

I know XPath at a very basic level. I have a good idea of how to use it to get a collection of rows, and how to get the cells I need from a given row. I can even more or less imagine how to get a collection of all the cells of the table at once.

But I need to put all of this into a list of lists, so that each row ends up in its own list.

The only thing that comes to mind is to take the collection of rows, iterate over them in a Python loop, and run a separate XPath query for each row (a rough sketch below).
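
A rough sketch of that approach, assuming lxml (which grab uses under the hood); the URL and the XPath expressions here are placeholders for the real ones:

 from lxml import html

 url = 'http://example.com/page-with-table'  # placeholder URL
 tree = html.parse(url).getroot()
 result = []
 for row in tree.xpath('//table//tr'):  # collection of row elements
     # the leading '.' makes the query relative to the current row
     result.append([td.text_content().strip() for td in row.xpath('./td')])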

How good is this option? Is there some more elegant way?

  • Can you provide a URL so we can try it? - MaxU 2:41 pm
  • @MaxU, for example, from here kinozal.tv/browse.php I want to parse the table with the search results (in practice a specific query string is appended to the URL, but that is not important here: the format of the results is the same for any query) - Xander

1 answer

Here is a variant using the Pandas module:

 In [37]: import pandas as pd

 In [38]: url = 'http://kinozal.tv/browse.php?s=%F7%E5%EB%EE%E2%E5%EA&g=0&c=0&v=0&d=0&w=0&t=0&f=0'

In the next line we:

  1. read (parse) the third table at this URL (by default read_html() parses all the tables on the page and returns a list of DataFrames; we are interested in the third table, the one with index 2)
  2. skip the first column (Unnamed: 0)
  3. rename the column Unnamed: 1 -> Name
  4. save the resulting DataFrame as df

Code:

 In [39]: df = pd.read_html(url, header=0)[2].iloc[:, 1:].rename(columns={'Unnamed: 1':'Name'}) 

Show the first 10 rows of our DataFrame:

 In [40]: df.head(10)
 Out[40]:
                             Name  Комм.   Размер  Сидов  Пиров               Залит      Раздает
 0  Джордж С. Клейсон - Самый ...      2   342 МБ      2      3     сегодня в 19:17        fx365
 1  Последний человек на Земле...      1   1.9 ГБ     15     61     сегодня в 18:19     BLACKTIR
 2  Последний человек на Земле...      2   707 МБ      8     35     сегодня в 18:18     BLACKTIR
 3  Земфира - Маленький челове...      2  3.25 ГБ     25      8     сегодня в 17:15    Olyanchik
 4  Фрунзик Мкртчян. Человек с...      2   500 МБ     23      0       вчера в 22:33    Человек91
 5  Человек-невидимка (9 сезон...      4  4.85 ГБ      6     14       вчера в 22:29   Gorgona007
 6  Борис Литвак - Тренинг лич...      3   142 МБ      7      1       вчера в 16:57        sekes
 7  Человек из Ларами / The Ma...      1   744 МБ     20      0       вчера в 14:39   dushevnaya
 8  Человек - швейцарский нож ...      0  1.46 ГБ     19      1  15.10.2016 в 21:34      Amancio
 9  Земфира - Маленький челове...      1   243 МБ     53      1  15.10.2016 в 21:15      Amancio
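
By the way, if the result is still needed as a list of lists (as in the original question), the DataFrame converts to one directly; a minimal sketch:

 data = df.values.tolist()     # one inner list per table row
 header = df.columns.tolist()  # the column names, if you need them too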

Print all the rows whose name (the Name column) contains the substring 'емфира':

 In [41]: df.ix[df.Name.str.contains('емфира')]
 Out[41]:
                              Name  Комм.   Размер  Сидов  Пиров               Залит      Раздает
 3   Земфира - Маленький челове...      2  3.25 ГБ     25      8     сегодня в 17:15    Olyanchik
 9   Земфира - Маленький челове...      1   243 МБ     53      1  15.10.2016 в 21:15      Amancio
 14  Земфира - Маленький челове...     11  7.25 ГБ    228     13  15.10.2016 в 02:28       daboen
 15  Земфира - Маленький челове...      4  2.38 ГБ     53      2  14.10.2016 в 20:29     DaDalida
 16  Земфира - Маленький челове...      7  1.58 ГБ    172      4  14.10.2016 в 19:50  jaaadina123

The list of names that satisfy the condition, as a plain Python list:

 In [43]: df.ix[df.Name.str.contains('емфира'), 'Name'].tolist()
 Out[43]:
 ['Земфира - Маленький человек / 2016 / РУ / HDTVRip (720p)',
  'Земфира - Маленький человек. Концерт в Олимпийском / Рок / 2016 / MP3',
  'Земфира - Маленький человек / 2016 / РУ / HDTV (1080i)',
  'Земфира - Маленький человек / 2016 / РУ / DVB',
  'Земфира - Маленький человек / 2016 / РУ / SATRip']
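
A side note: .ix was current at the time of writing, but it has since been deprecated and removed from pandas, so on newer versions the same selection is written with .loc:

 # equivalent selection with .loc (for pandas versions where .ix no longer exists)
 df.loc[df.Name.str.contains('емфира'), 'Name'].tolist()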

P.S. In general, Pandas lets you do a lot of interesting things (especially with data processing) at minimal cost (minimal code) and with almost maximal (for Python) performance.