I want to write an ordinary program that will search for all branches of the div and copy it inside the text and save it to a file. Found for this library lmxl for parsing the site. I read it slightly and decided to try it. The code works with a bang, but there is a problem, it gives me an empty string in the result. I probably thought it was I, as usual, skrivozhopil, and not even the example with the habr does not work, ie. works, but produces the same empty string.

Here is the code:

import urllib import lxml.html page = urllib.urlopen("http://habrahabr.ru/") #открываем сайт хабрахабр doc = lxml.html.document_fromstring(page.read()) #читаем страницу for topic in doc.cssselect('a.topic'): #ищем все <a> по классу topic print topic.text outFile = open('output.txt', 'w') #создаем файл doc.write(outFile, encoding='utf-16') #записываем, что получилось 

And voila! An empty file is created. Explain, please, problems. thank

  • one
    What kind of python? doc.write (outFile, encoding = 'utf-16') AttributeError: 'HtmlElement' object has no attribute 'write' (python 2.7) Maybe my packages are not the same ... In general, I could not repeat. Yes, by the way, how does the curvature of the buttocks affect the work of the programs? Usually, problems are created by curvature of the hands ... - alexlz
  • I use python 2.7 win32, although OS win7-x64, but due to the fact that pygtk does not work on x64, I installed a python for win32. Pre-installed packages: lxml 2.2.8-win32, pygtk (I want to use in the future) and that's it. I drove this program in different ways from clicking on the file, calling via cmd and through import. No hint of error or success. - blabla
  • and the file output.txt is created? - actionless

2 answers 2

There is no a.topic on the page, and a.post_title is there. I doc.write cursed my doc.write , saying that there is no such write method, so I implemented the code like this:

 import urllib import lxml.html page = urllib.urlopen("http://habrahabr.ru/") doc = lxml.html.document_fromstring(page.read()) out = open('output.txt', 'w', encoding='utf-16') for topic in doc.cssselect('a.post_title'): out.write(topic.text) out.write('\n') out.close() 
    1. On the page http://habrahabr.ru/ there are no elements with the topic class.
    2. This is how you can sort through the names of posts on the main page:
      from urllib.request import urlopen from bs4 import BeautifulSoup html = urlopen('http://habrahabr.ru/').read() soup = BeautifulSoup(html, 'lxml') for topic in soup.find_all(name='a', attrs={'class': 'post_title'}): print(topic['href']) 
    • A couple of notes: the author uses python2, not the third. And the question specifies lxml , not BeautifulSoup - gil9red