Good day! I've only recently made friends with Python, and I've gotten involved in parsing a site. The site has a table with several columns and several rows. I need to walk over all the rows and columns and print them the way I need. I couldn't come up with anything better than a nested loop, but I feel with my heart and soul that in a wonderful language like Python this can be done much more beautifully. Please tell me, or at least hint, which direction to think in? I get what I need from each row of the source data and then process it as needed.

    from bs4 import BeautifulSoup
    import requests

    def startGrab():
        url = 'http://site.net'
        try:
            page = requests.get(url)
        except:
            print(url)
        soup = BeautifulSoup(page.content, "html5lib")
        for row in soup.find_all("tr", {"class" : "belowHeader"}):
            i = 0
            x = 0
            for row2 in row.find_all("td", {"class" : "tdteamname2"}):
                if i == 0:
                    team1 = row2.get_text()
                else:
                    team2 = row2.get_text()
                i += 1
            for row3 in row.find_all("td", {"class" : "tdpercentmw1"}):
                if x == 0:
                    coef1 = row3.get_text()
                elif x == 1:
                    coef2 = row3.get_text()
                else:
                    coef3 = row3.get_text()
                x += 1
            print(team1+" "+team2+" "+coef1+" "+coef2+" "+coef3)

    if __name__ == '__main__':
        startGrab()
  • Can you give an example of the page you want to parse? - MaxU
  • The site is not accessible from the outside. I can say that the table rows have the class "belowHeader". Each contains 2 cells with the class "tdteamname2" and then three cells with the class "tdpercentmw1". The text is pulled out of the cells just fine in the example above, but I want to make the code more beautiful. - LeReve
  • Well, you could, for example, upload a sample of your HTML to some file-sharing service - MaxU
  • Yes of course, here's a piece of the table: codeshare.io/G87W4a - LeReve
  • For getting a good answer to a code-review question like this, the answers of @200_success ♦ on codereview.SE can be useful - jfs

1 answer

Line-by-line comments on the code:

Searching for an element by tag name and class

Instead of:

 soup.find_all("tr", {"class" : "belowHeader"}) 

You can simply:

 soup.find_all("tr", "belowHeader") 

Use enumerate() to get the loop index

Instead of:

    i = 0
    for td in tr.find_all('td', 'tdteamname2'):
        ...
        i += 1

You should write:

    for i, td in enumerate(row.find_all('td', 'tdteamname2')):
        ...

You can use the element names tr, td instead of row, row2, row3

Use explicit collections instead of numbered names.

Instead of:

    x = 0
    for row3 in row.find_all("td", {"class" : "tdpercentmw1"}):
        if x == 0:
            coef1 = row3.get_text()
        elif x == 1:
            coef2 = row3.get_text()
        else:
            coef3 = row3.get_text()
        x += 1

Use:

 coef = [td.get_text() for td in tr.find_all('td', 'tdpercentmw1')] 

Similarly for the teams:

 team = [td.get_text() for td in tr.find_all('td', 'tdteamname2')] 

[optional] You can use print(*collection)

Instead of:

 print(team1+" "+team2+" "+coef1+" "+coef2+" "+coef3) 

You can write:

 print(*team, *coef) 

Pass the encoding if it is known from the HTTP headers

Instead of:

 soup = BeautifulSoup(page.content, "html5lib") 

You can write:

 soup = BeautifulSoup(page.content, "html5lib", from_encoding=page.encoding) 

[optional] Use the stdlib if there is no reason to do otherwise

For example, if there are no particular problems with the markup, you can use the built-in 'html.parser' instead of the 'html5lib' parser.

Or, if urlopen() gives you enough functionality in your case, you can do without requests. This may be less secure (requests is updated more often), but the bugs are more stable (the stdlib changes less often).

Do not use a bare except:

In your case you can let the script simply die: if the page could not be loaded, there is nothing left for it to do. You can catch the expected exception types and exit with an informative error message (experience with the script shows which types to expect; for example, you can start with OSError). Do not catch too much, so as not to hide bugs in the code.
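A minimal sketch of this, assuming urlopen() is used (with requests you would catch requests.RequestException instead; the URL is the placeholder from the question):

    import sys
    from urllib.request import urlopen

    url = 'http://site.net'  # placeholder URL from the question
    try:
        page = urlopen(url)
    except OSError as e:  # urllib.error.URLError is a subclass of OSError
        sys.exit(f'Failed to load {url}: {e}')  # message goes to stderr, exit status is 1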

To avoid littering the console with full tracebacks, you can override sys.excepthook.
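For example, a sketch of a replacement hook that prints only the exception type and message (the function name here is arbitrary):

    import sys

    def short_excepthook(exc_type, exc_value, tb):
        # one-line error message instead of the full traceback
        print(f'{exc_type.__name__}: {exc_value}', file=sys.stderr)

    sys.excepthook = short_excepthook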

Use a shebang #! for executable scripts

If you put all the code together in one place:

    #!/usr/bin/env python3
    from urllib.request import urlopen
    from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

    soup = BeautifulSoup(urlopen('http://example.com'), 'html.parser')
    for tr in soup.find_all('tr', 'belowHeader'):
        team = (td.get_text() for td in tr.find_all('td', 'tdteamname2'))
        coef = (td.get_text() for td in tr.find_all('td', 'tdpercentmw1'))
        print(*team, *coef)

If encoding problems arise, then response.headers.get_content_charset(default) can be passed as the from_encoding parameter.
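A sketch of this, assuming the page is fetched with urlopen() (response.headers is an email.message.Message, which provides get_content_charset()):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    response = urlopen('http://example.com')
    encoding = response.headers.get_content_charset()  # None if the headers specify no charset
    soup = BeautifulSoup(response, 'html.parser', from_encoding=encoding)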

If html parsing speed becomes a problem (not download speed), you can try the 'lxml' parser instead of 'html.parser'.
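For example (assuming the lxml package is installed):

    soup = BeautifulSoup(page, 'lxml')  # $ pip install lxml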

The nested loops look justified here. If there is no particular reason to get rid of them, you can keep them.