HTML parser in Python: I cannot extract the span with a phone number

Question

I am learning Python, and rather than study dry code I decided to learn by practice, so I started writing a parser. The one thing I cannot do is extract the single element with the phone number. All I manage to get is every span tag and everything inside them:

    <span class="ls-detail_price">8 000 $</span>
    <span>373-76-766250, 373-77-592228</span>
    <span class="ls-detail_price">6 000 $</span>
    <span>373-76-966250, 373-77-592233</span>

That is what gets parsed, that is, all the span tags, while I need only the one with the phone number. Here is the markup around the phone number:

    <div class="ls-detail_anData">
      <span class="ls-detail_price">1 600 $</span>
      <div class="mapath list">
        <span id="pointer_icon">Тирасполь</span>
      </div>
      <div>
        <span class="phone_icon">373-77-534801</span>
      </div>
    </div>

I tried selecting by the class on the phone-number span, but that does not work and returns an empty list. The phone is inserted via ::before; how do I get to it from Python? The goal is to get rid of the span tags and the extra line with the price. Please do not downvote; I really want to understand this, I just cannot see what to do next.

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    import urllib.request

    from bs4 import BeautifulSoup
    from lxml import html

    fname = 'test'


    def get_html(url):
        response = urllib.request.urlopen(url)
        return response.read()


    def parse(html):
        projects = []
        # Parse the whole page
        soup = BeautifulSoup(html)
        # Select by tag and class
        div = soup.find('div', class_='ls-detail')
        # Find only what belongs to span tags
        for row in div.find_all('span'):
            print(row)


    def main():
        parse(get_html('https://makler.md/ru/transport/cars'))


    if __name__ == '__main__':
        main()

I also tried it this way, but it mostly saves only text and a few small tagged fragments.

    #!/usr/bin/env python3
    from urllib.request import urlopen

    from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

    url = 'https://makler.md/ru/transport/cars'
    fname = 'test'


    def get_html(url):
        with urlopen(url) as html_page:
            charset = html_page.headers.get_content_charset(None)  # may be None
            soup = BeautifulSoup(html_page, 'html.parser', from_encoding=charset)
        f = open("%s %s" % (fname, ".html"), "w")
        f.write(soup.text)
        f.close()


    def main():
        get_html('https://makler.md/ru/transport/cars')


    if __name__ == '__main__':
        main()

Dear @jfs, why the downvote? Try to figure out the question before you do that.
For those who understand the code, I don't think I need to write out what the output is; just copy and run it and you will see it right away.
Secondly, I think I described clearly what I want: I want to learn to parse.
The problem is that I cannot extract the one element with the phone number; selecting by class does not work, although that element has one.
The problem is that the phone number is inserted via ::before, and I don't know how to get to it.
And in general, I don't understand what to do next or where to dig.
@jfs 8 000 $ 373-76-766250, 373-77-592228: this is what gets parsed, that is, all the span tags, and I need only the one with the phone number.
373-77-534801: this is the phone number inside its markup.
@jfs I agree that I may not have described the question precisely, but why the downvote?
All right, you don't have to answer that; better tell me, is the description fine now?
Find the tags with the ls-detail_price class and call next_sibling on them to get the element that follows the current tag (following, not nested); that will be the span with the phones.
If next_sibling does not help, there is a counterpart; I don't remember exactly, something like find_next_sibling, where you specify the tag that comes after the current one.
I would also advise using CSS selectors instead of the find methods. They seem easier to read to me, and they are used in many places.
Example: div = soup.select('div.ls-detail') or simply div = soup.select('.ls-detail') .
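A minimal sketch of the sibling approach suggested in the comments above, run against the two price/phone pairs quoted in the question. The helper name phones_after_prices is hypothetical, and the sketch assumes every price span is followed by a sibling span holding the phones, which may not hold for every listing on the real page.

```python
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

# Sample markup copied from the output shown in the question.
HTML = """
<span class="ls-detail_price">8 000 $</span> <span>373-76-766250, 373-77-592228</span>
<span class="ls-detail_price">6 000 $</span> <span>373-76-966250, 373-77-592233</span>
"""


def phones_after_prices(markup):
    """Return the text of the <span> that follows each price span."""
    soup = BeautifulSoup(markup, 'html.parser')
    phones = []
    # CSS selector: every <span> carrying the ls-detail_price class.
    for price in soup.select('span.ls-detail_price'):
        # find_next_sibling('span') skips the whitespace text node
        # that plain next_sibling would return for this markup.
        sibling = price.find_next_sibling('span')
        if sibling is not None:
            phones.append(sibling.get_text(strip=True))
    return phones


print(phones_after_prices(HTML))
```

Note that select() returns a list, so the loop works even when there is only one price on the page.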
@gil9red Here is an example I found; is it something like this?
print soup.find(text="Address:").findNext('td').contents[0]

Answer 1 · 2016-11-23T20:17:47

To extract the text from a <span> element with the phone_icon class:

    #!/usr/bin/env python3
    from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

    soup = BeautifulSoup("""<div class="ls-detail_anData">
      <span class="ls-detail_price">1 600 $</span>
      <div class="mapath list">
        <span id="pointer_icon">Тирасполь</span>
      </div>
      <div>
        <span class="phone_icon">373-77-534801</span>
      </div>
    </div>""", 'html.parser')
    print(soup.find('span', 'phone_icon').get_text())  # -> 373-77-534801

To download the HTML from a URL:

    #!/usr/bin/env python3
    from urllib.request import urlopen

    from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

    url = 'https://makler.md/ru/transport/cars'  # the page from the question

    with urlopen(url) as html_page:
        charset = html_page.headers.get_content_charset(None)  # may be None
        soup = BeautifulSoup(html_page, 'html.parser', from_encoding=charset)
    print(soup.find('span', 'phone_icon').get_text())

The code passes the encoding from the Content-Type HTTP header to BeautifulSoup, if the header provides one.
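To see what get_content_charset() returns without touching the network, here is a small sketch that builds the header by hand. It relies on the fact that the headers object of an HTTP response is a subclass of email.message.Message; the helper name charset_from_content_type is hypothetical.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset declared in a Content-Type value, or None."""
    headers = Message()
    headers['Content-Type'] = content_type
    # The argument is the value returned when no charset is declared.
    return headers.get_content_charset(None)


print(charset_from_content_type('text/html; charset=UTF-8'))  # utf-8
print(charset_from_content_type('text/html'))  # None
```

When the result is None, from_encoding=None leaves the encoding detection to BeautifulSoup itself.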

But in that case, do I need to save the page first and then work with it? I inserted this construct into my code, but it gives an error: print = (soup.find('span', 'phone_icon').get_text()) AttributeError: 'NoneType' object has no attribute 'get_text'
@Stasinskii: the code in the answer works (and worked) as is.
The next obvious step: save the result of get_html() to a file and check it carefully: is there a <span class="phone_icon"... in it?
(To check, pass the name of the file you saved instead of html: BeautifulSoup(open('input.html', encoding='utf-8')) .) Constructs such as print = ( are an error. Pay attention to details; otherwise debugging becomes difficult (it is not clear whether this is a real problem or you just copied it badly).
Look, I wrote it to a file, but frankly it is a mess in there, almost without tags.
I can only see the question (at that time your get_html() returned urlopen().read() ), so you should use exactly the version that was published in the question at that time.
Save the page at your url to a file: with open('input.html', 'wb') as f: f.write(urlopen(url).read()) .
Then pass it to BeautifulSoup(open('input.html', encoding='utf-8')) . If you still do not get the number, look at input.html with your own eyes (see the questions in the previous comment).
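The check described in these comments can be sketched as follows. The helper names find_phone_in_markup and find_phone_in_file are hypothetical, and the demo writes a tiny sample page to a temporary file instead of downloading the real one, so the result says nothing about what makler.md actually serves.

```python
import os
import tempfile

from bs4 import BeautifulSoup  # $ pip install beautifulsoup4


def find_phone_in_markup(markup):
    """Return the text of span.phone_icon, or None when it is absent."""
    soup = BeautifulSoup(markup, 'html.parser')
    span = soup.find('span', 'phone_icon')
    # find() returns None when nothing matches; calling get_text() on
    # that None is what raises the AttributeError quoted above.
    return span.get_text() if span is not None else None


def find_phone_in_file(path):
    """Run the same check on a page saved to disk as UTF-8."""
    with open(path, encoding='utf-8') as f:
        return find_phone_in_markup(f.read())


# Demo on a sample page saved to a temporary file.
sample = '<div><span class="phone_icon">373-77-534801</span></div>'
with tempfile.TemporaryDirectory() as tmp:
    saved = os.path.join(tmp, 'input.html')
    with open(saved, 'w', encoding='utf-8') as f:
        f.write(sample)
    print(find_phone_in_file(saved))
```

If the function returns None for the saved page, the span is simply not in the downloaded HTML, and the file itself is the place to look next.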
