Empty string instead of a result when parsing HTML with lxml

Question

I want to write a simple program that walks through every div on a page, copies the text inside each one, and saves it to a file. I found the lxml library for parsing the site, skimmed the documentation, and decided to try it. The code runs without errors, but the result is an empty string. At first I assumed that I had, as usual, botched something ("skrivozhopil"), but even the Habr example does not work: it runs, yet produces the same empty string.

Here is the code:

    import urllib
    import lxml.html

    page = urllib.urlopen("http://habrahabr.ru/")  # open the Habrahabr site
    doc = lxml.html.document_fromstring(page.read())  # read the page
    for topic in doc.cssselect('a.topic'):  # find all <a> elements with class "topic"
        print topic.text
    outFile = open('output.txt', 'w')  # create the file
    doc.write(outFile, encoding='utf-16')  # write out the result

And voila! An empty file is created. Please explain what the problem is. Thanks.

For me, doc.write(outFile, encoding='utf-16') raises AttributeError: 'HtmlElement' object has no attribute 'write' (Python 2.7). Maybe my packages are not the same as yours ... In any case, I could not reproduce the problem.
And by the way, how does the curvature of one's backside affect how programs run?
I use Python 2.7 win32. The OS is Windows 7 x64, but because pygtk does not work on x64, I installed the win32 build of Python.
Installed packages: lxml 2.2.8-win32 and pygtk (which I want to use later), and that's it.
I ran the program in several ways: by double-clicking the file, from cmd, and via import.

gil9red · Answer 1 · 2016-05-09T23:14:59

The page has no a.topic elements, but it does have a.post_title. The doc.write call also failed for me with an error saying there is no such write method, so I wrote the code like this:

    import codecs
    import urllib
    import lxml.html

    page = urllib.urlopen("http://habrahabr.ru/")
    doc = lxml.html.document_fromstring(page.read())
    # Python 2's built-in open() takes no encoding argument, so use codecs.open
    out = codecs.open('output.txt', 'w', encoding='utf-16')
    for topic in doc.cssselect('a.post_title'):  # each post title link
        out.write(topic.text)
        out.write(u'\n')
    out.close()
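The two failure modes described above can be reproduced offline. The following is only a minimal sketch in Python 3: SAMPLE_HTML is hypothetical markup standing in for the downloaded page (the real page's markup may differ), XPath is used instead of cssselect so that only lxml itself is needed, and lxml.html.tostring is one possible way to serialize the document, not necessarily what the original code intended.

```python
import lxml.html

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<html><body>
  <a class="post_title" href="/post/1/">First post</a>
  <a class="post_title" href="/post/2/">Second post</a>
</body></html>
"""

doc = lxml.html.document_fromstring(SAMPLE_HTML)

# A query that matches nothing returns an empty list, so a loop over it
# never runs and nothing is printed.
missing = doc.xpath('//a[@class="topic"]')

# The class that is actually present in the sample does match.
found = doc.xpath('//a[@class="post_title"]')
titles = [a.text_content() for a in found]

# document_fromstring returns the root element (an HtmlElement),
# which has no write() method.
has_write = hasattr(doc, "write")

# One way to serialize the element: tostring() returns encoded bytes,
# which can then be written to a file opened in binary mode.
serialized = lxml.html.tostring(doc, encoding="utf-16")
```

To save the result, the bytes in serialized could be written to a file opened with mode 'wb'.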

Alexander Fridman · Answer 2 · 2016-01-25T16:18:10

The page http://habrahabr.ru/ has no elements with the topic class.
This is how you can iterate over the post title links on the main page:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen('http://habrahabr.ru/').read()
    soup = BeautifulSoup(html, 'lxml')
    for topic in soup.find_all(name='a', attrs={'class': 'post_title'}):
        print(topic['href'])

A couple of notes: the author uses Python 2, not Python 3, and the question asks about lxml, not BeautifulSoup. - gil9red
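With both notes in mind, the same loop might be written with lxml alone so that it runs on either Python version. This is a sketch under assumptions: post_links and print_post_links are hypothetical helper names, and the exact-match XPath only finds anchors whose class attribute is exactly post_title.

```python
import lxml.html

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2


def post_links(html):
    """Return the href of every <a class="post_title"> in the given HTML."""
    doc = lxml.html.document_fromstring(html)
    # Exact match on the class attribute; anchors carrying several
    # classes would need a more permissive expression.
    return [a.get("href") for a in doc.xpath('//a[@class="post_title"]')]


def print_post_links(url):
    """Download the page at url and print each post link (hypothetical helper)."""
    for href in post_links(urlopen(url).read()):
        print(href)
```

Calling print_post_links('http://habrahabr.ru/') would then print one link per line, provided the page still uses the post_title class.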
