I'm slowly learning to parse web pages and have run into an encoding problem. The code works, but on some pages it raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 12: invalid start byte, and the traceback points at the line print(company.text_content()).
As I understand it, the text on that line cannot be decoded as UTF-8. I found a solution to the same problem at https://stackoverflow.com/questions/10226342/how-to-handle-unicodedecodeerror-without-losing-any-data , but unfortunately it did not help. Can you suggest anything else?
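To illustrate what the message means, here is a minimal sketch that reproduces the same error outside of lxml. The sample bytes are hypothetical (they are not taken from the site), and the guess that the page is really Windows-1252 (cp1252) is an assumption: byte 0xae is the registered-trademark sign in that encoding, but it is not a valid first byte of a UTF-8 sequence.

```python
# Hypothetical sample: 12 ASCII bytes followed by 0xAE, as a cp1252 page might send them.
raw = b"Licensed Toy\xae"

error = None
try:
    raw.decode("utf-8")  # 0xAE cannot start a UTF-8 sequence, so this raises
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Assumption: the bytes are actually Windows-1252, where 0xAE is the (R) sign.
as_cp1252 = raw.decode("cp1252")
print(as_cp1252)
```

If the real page decodes cleanly this way, that would suggest the server is sending cp1252 (or Latin-1) bytes that are being treated as UTF-8.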
import urllib.parse
import urllib.request

import lxml.html as html

BASE_URL = "http://www.wholesalecentral.com/catoverview.htm"


def pars_company(url):
    """Parse the companies listed on a page.

    First parse the first page, then find the links to all the other pages,
    parse them in a loop, and save everything in a dictionary.
    """
    dict_company = {}
    page = urllib.request.urlopen(url)
    doc = html.document_fromstring(page.read())
    for company in doc.xpath('.//div[@class="row listings"]/p/a[@onclick]'):
        print(company.text_content())  # the UnicodeDecodeError is raised here
        dict_company[company.text_content()] = company.get('href')

    # Links to the remaining pages of the same category
    pages = doc.xpath('.//div[@class="wide-content column"]/div[@class="row"]/p/a')
    for link in pages:
        next_link = urllib.parse.urljoin(BASE_URL, link.get('href'))
        page = urllib.request.urlopen(next_link)
        doc = html.document_fromstring(page.read())
        for company in doc.xpath('.//div[@class="row listings"]/p/a[@onclick]'):
            dict_company[company.text_content()] = company.get('href')
    return dict_company


def main():
    # dict_link = pars_mainpage(BASE_URL)
    pars_company('http://www.wholesalecentral.com/Licensed-Items-Collectibles.html?visitorid=654393974&dbid=1')


if __name__ == "__main__":
    main()
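One thing that might be worth trying is decoding the response bytes explicitly before handing them to the parser, instead of passing raw bytes. The sketch below is only an illustration under assumptions: decode_html is a hypothetical helper (not part of lxml or urllib), the list of candidate encodings is a guess, and it has not been checked against the site in question.

```python
def decode_html(raw, encodings=("utf-8", "cp1252")):
    """Return raw bytes as str, trying each candidate encoding in order.

    Hypothetical helper; the default encodings are an assumption.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


# Possible use inside pars_company (untested against the real pages):
#     doc = html.document_fromstring(decode_html(page.read()))

print(decode_html(b"Toy\xae"))
```

Trying UTF-8 first keeps correctly encoded pages intact, and the cp1252 fallback only applies when strict UTF-8 decoding fails.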