I am learning Python, and rather than studying dry theory I decided to learn by practice, so I started writing a parser. I cannot manage to parse only the one element that holds the phone number. All I can do is parse every span tag and everything inside them.

    <span class="ls-detail_price">8 000 $</span>
    <span>373-76-766250, 373-77-592228</span>
    <span class="ls-detail_price">6 000 $</span>
    <span>373-76-966250, 373-77-592233</span>

This is what gets parsed, i.e. all the span tags, but I need only the one with the phone number. Here is the markup that holds the phone number:

    <div class="ls-detail_anData">
      <span class="ls-detail_price">1 600 $</span>
      <div class="mapath list">
        <span id="pointer_icon">Тирасполь</span>
      </div>
      <div>
        <span class="phone_icon">373-77-534801</span>
      </div>
    </div>

I tried selecting by the class that the phone number element has, but it does not work and returns an empty list. The phone number is inserted via ::before; how do I get at it using Python? The goal is to get rid of the span tags and the extra line with the price. Please do not downvote: I really want to understand this, I just cannot figure out what to do next.

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    import urllib.request

    from bs4 import BeautifulSoup

    fname = 'test'


    def get_html(url):
        response = urllib.request.urlopen(url)
        return response.read()


    def parse(html):
        # Parse the whole page
        soup = BeautifulSoup(html, 'html.parser')
        # Select by tag and class
        div = soup.find('div', class_='ls-detail')
        # Find everything that is a span tag
        for row in div.find_all('span'):
            print(row)


    def main():
        parse(get_html('https://makler.md/ru/transport/cars'))


    if __name__ == '__main__':
        main()

I also tried it this way, but it mostly saves only the text and a few small tagged fragments.

    #!/usr/bin/env python3
    from urllib.request import urlopen

    from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

    url = 'https://makler.md/ru/transport/cars'
    fname = 'test'


    def get_html(url):
        with urlopen(url) as html_page:
            charset = html_page.headers.get_content_charset(None)  # may be None
            soup = BeautifulSoup(html_page, 'html.parser', from_encoding=charset)
        # Note: this saves only the text of the page, without the tags
        with open('%s.html' % fname, 'w', encoding='utf-8') as f:
            f.write(soup.text)


    def main():
        get_html(url)


    if __name__ == '__main__':
        main()

  • Dear @jfs, why the downvote? Try to figure it out before you do that. For those who understand the code, I don't think I need to describe what I get: just copy and run it and you will see it right away. Secondly, I think I described clearly what I want: I want to learn how to parse. The problem is that I cannot parse only the one element with the phone number; selecting by class does not work, even though the element has one. The cause is that the phone number is inserted via ::before, and I don't know how to get at it. In general, I cannot figure out what to do next or where to dig. - Stasinskii
  • @jfs <span class="ls-detail_price">8 000 $</span> <span>373-76-766250, 373-77-592228</span> This is what gets parsed, i.e. all the span tags, and I need only the one with the phone number. <span class="phone_icon">373-77-534801</span> This is the markup that holds the phone number! - Stasinskii
  • @jfs I agree that I may not have described the question precisely, but still, why the downvote? All right, you don't have to answer; better tell me, is the description OK now? - Stasinskii
  • Find the tags with the ls-detail_price class and use next_sibling on them to get the item that follows the current tag (following does not mean nested); that will be the span with the phones. If next_sibling does not help, there is a counterpart, I don't remember exactly, something like find_next_sibling, where you specify the tag that comes after the current one. I would also advise using CSS selectors instead of the find methods. They seem easier to read to me, and they are used in many places. Example: div = soup.select('div.ls-detail') or simply div = soup.select('.ls-detail') . - gil9red
  • @gil9red Here is an example I found; is it something like this? print soup.find(text="Address:").findNext('td').contents[0] - Stasinskii
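
The suggestion from the comments can be sketched roughly as follows. This is only a minimal sketch, assuming the page really looks like the markup quoted in the question, where the phones sit in a bare <span> right after the price span; the SAMPLE string and the variable names are illustrative, not taken from the site.

```python
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

# Sample markup copied from the question (assumption: the real page matches it).
SAMPLE = """
<span class="ls-detail_price">8 000 $</span> <span>373-76-766250, 373-77-592228</span>
<span class="ls-detail_price">6 000 $</span> <span>373-76-966250, 373-77-592233</span>
"""

soup = BeautifulSoup(SAMPLE, 'html.parser')

phones = []
# CSS selector: every <span> that has the class ls-detail_price.
for price in soup.select('span.ls-detail_price'):
    # find_next_sibling('span') skips the whitespace text node between the tags;
    # plain .next_sibling would return that whitespace string instead.
    phone_span = price.find_next_sibling('span')
    if phone_span is not None:
        phones.append(phone_span.get_text(strip=True))

print(phones)
```

Going through find_next_sibling rather than next_sibling matters here because the whitespace between two tags is itself a node in the tree.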

1 answer

To extract text from a <span> element with the phone_icon class:

    #!/usr/bin/env python3
    from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

    soup = BeautifulSoup("""<div class="ls-detail_anData">
      <span class="ls-detail_price">1 600 $</span>
      <div class="mapath list">
        <span id="pointer_icon">Тирасполь</span>
      </div>
      <div>
        <span class="phone_icon">373-77-534801</span>
      </div>
    </div>""", 'html.parser')
    print(soup.find('span', 'phone_icon').get_text())  # -> 373-77-534801

To download the HTML from a URL:

    #!/usr/bin/env python3
    from urllib.request import urlopen

    from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

    url = 'https://makler.md/ru/transport/cars'  # the URL from the question

    with urlopen(url) as html_page:
        charset = html_page.headers.get_content_charset(None)  # may be None
        soup = BeautifulSoup(html_page, 'html.parser', from_encoding=charset)
    print(soup.find('span', 'phone_icon').get_text())

The code passes the encoding from the Content-Type HTTP header to BeautifulSoup, if it is available.
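
To see what get_content_charset() returns without touching the network, here is a small sketch. It relies only on the fact that the response headers object (http.client.HTTPMessage) is a subclass of email.message.Message; the header values below are made up for illustration.

```python
from email.message import Message

# Hypothetical header values, for illustration only.
with_charset = Message()
with_charset['Content-Type'] = 'text/html; charset=UTF-8'

without_charset = Message()
without_charset['Content-Type'] = 'text/html'

# The charset parameter is returned in lower case when it is present...
print(with_charset.get_content_charset(None))     # -> utf-8
# ...and the fallback value (None here) is returned when it is missing.
print(without_charset.get_content_charset(None))  # -> None
```

When the result is None, passing from_encoding=None leaves the encoding detection to BeautifulSoup itself.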

  • But in that case I would need to save the page and then work with it like this, because I put that construct into my code and it gives an error: print = (soup.find('span', 'phone_icon').get_text()) AttributeError: 'NoneType' object has no attribute 'get_text' - Stasinskii
  • @Stasinskii: the code in the answer works (and worked) as is. I have added code that explicitly downloads the HTML from the URL. The next obvious step: save the result of get_html() to a file and look at it carefully: does it contain <span class="phone_icon"... ? Does the code work with that markup? (To check, pass the name of the file you saved instead of html: BeautifulSoup(open('input.html', encoding='utf-8')) ) Constructions like print = ( are an error. Pay attention to details; otherwise debugging becomes difficult (it is not clear whether this is a real problem or you just copied it carelessly). - jfs
  • Look, I wrote it to a file, but frankly it is a mess in there, with almost no tags. - Stasinskii
  • @Stasinskii: that is not what I meant. I wrote "save the result of get_html()". Since I only see the question (at that time your get_html() returned urlopen().read()), you should use exactly the version that was published in the question at that time. Instead, you are trying to save soup.text. Save the page to a file: with open('input.html', 'wb') as f: f.write(urlopen(url).read()) . Then pass it to BeautifulSoup(open('input.html', encoding='utf-8')) . If you still don't get the number, look at input.html with your own eyes (see the questions in the previous comment). - jfs
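
The debugging steps from the comments might look like this. It is a minimal sketch: the raw bytes are written to a file in binary mode, parsed back from that file, and the result of find() is checked for None before get_text() is called. To keep the example offline, a hard-coded byte string stands in for urlopen(url).read(); the file name input.html comes from the comment, while extract_phone and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

# Stand-in for urlopen(url).read(): raw bytes of the page, not soup.text.
raw = ('<div><span class="ls-detail_price">1 600 $</span>'
       '<div><span class="phone_icon">373-77-534801</span></div></div>'
       ).encode('utf-8')

path = os.path.join(tempfile.mkdtemp(), 'input.html')
with open(path, 'wb') as f:  # 'wb' keeps the tags and the original bytes
    f.write(raw)


def extract_phone(filename):
    """Return the phone text, or None if the span is absent from the file."""
    with open(filename, 'rb') as f:
        soup = BeautifulSoup(f, 'html.parser')
    span = soup.find('span', 'phone_icon')
    # find() returns None when nothing matches; calling get_text() on None
    # raises the AttributeError quoted in the comments.
    return span.get_text(strip=True) if span is not None else None


print(extract_phone(path))
```

If this prints None for the real page, the saved input.html does not contain the span in that form, and the file itself is the place to look.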