I'm slowly learning to parse web pages and have run into an encoding problem. The code works, but on some pages it raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 12: invalid start byte, and the traceback points at the line print(company.text_content()).
As I understand it, this string cannot be interpreted as UTF-8 text. I found a solution to a similar problem at https://stackoverflow.com/questions/10226342/how-to-handle-unicodedecodeerror-without-losing-any-data , but unfortunately it did not help. Can you suggest something else?

    import urllib.request
    import lxml.html as html
    import time

    BASE_URL = "http://www.wholesalecentral.com/catoverview.htm"


    def pars_company(url):
        """Parse the companies on a page.

        First parse the first page, then find the links to all the other
        pages and parse them in a loop, then save everything to a dictionary.
        """
        dict_company = {}
        page = urllib.request.urlopen(url)
        doc = html.document_fromstring(page.read())
        for company in doc.xpath('.//div[@class="row listings"]/p/a[@onclick]'):
            print(company.text_content())
            dict_company[company.text_content()] = company.get('href')
        page = urllib.request.urlopen(url)
        doc = html.document_fromstring(page.read())
        pages = doc.xpath('.//div[@class="wide-content column"]/div[@class="row"]/p/a')
        for link in pages:
            next_link = urllib.request.urljoin(BASE_URL, link.get('href'))
            page = urllib.request.urlopen(next_link)
            doc = html.document_fromstring(page.read())
            for company in doc.xpath('.//div[@class="row listings"]/p/a[@onclick]'):
                dict_company[company.text_content()] = company.get('href')
        return dict_company


    def main():
        # dict_link = pars_mainpage(BASE_URL)
        pars_company('http://www.wholesalecentral.com/Licensed-Items-Collectibles.html?visitorid=654393974&dbid=1')


    if __name__ == "__main__":
        main()

    1 answer

    In the end, the problem was solved by changing one line:

     doc = html.document_fromstring(page.read().decode(encoding='unicode-escape')) 
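To illustrate what the changed line does, here is a minimal sketch of how decode('unicode-escape') treats raw bytes, compared with a plain ISO-8859-1 decode. The sample bytes and the helper try_utf8 are made up for illustration and are not taken from the site in the question.

```python
# Hypothetical sample bytes: Latin-1 text with a registered-trademark sign
# (byte 0xAE) followed by a literal backslash and the letter "n".
raw = b"Disney\xae C:\\new"


def try_utf8(data: bytes) -> str:
    """Return the decoded text, or the error reason if UTF-8 decoding fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"


# 0xAE cannot start a UTF-8 sequence, so strict UTF-8 decoding fails.
utf8_result = try_utf8(raw)

# ISO-8859-1 maps every byte to one character and leaves the backslash alone.
latin1_text = raw.decode("iso-8859-1")

# unicode-escape also maps bytes to Latin-1 characters, but in addition it
# interprets backslash sequences, so backslash + "n" becomes a newline.
escaped_text = raw.decode("unicode-escape")

# UTF-8 input is damaged as well: each byte of a multi-byte character
# (here U+00E9, e with acute accent) turns into a separate Latin-1 character.
mojibake = "\u00e9".encode("utf-8").decode("unicode-escape")
```

In this sketch unicode-escape avoids the exception only because it accepts any byte value. It does not detect the page's real encoding, and it silently changes any text that contains a backslash.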
    • .decode('unicode-escape') is the wrong solution: at best (when it does not corrupt your data) it points to a problem with the input format. To determine the encoding correctly, see the linked answer for Python. If you cannot follow the recommendations in that answer, remove from your question all the code that is unrelated to determining the page encoding. Provide the HTTP headers you receive and a minimal HTML example that causes the decoding error, that is, a minimal reproducible example. - jfs
    • Thanks for the link. Using .get_content_charset() I determined the encoding (iso-8859-1). So would doc = html.document_fromstring(page.read().decode('iso-8859-1')) be correct? - Roman Sablin
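The approach discussed in these comments can be sketched roughly as follows: read the charset from the Content-Type header with get_content_charset() and decode the body with it before handing the text to the parser. The helper names decode_body and fetch_text, the iso-8859-1 fallback, and the errors="replace" policy are assumptions made for this sketch, not something stated in the thread. A charset declared only in an HTML meta tag is not handled here.

```python
import codecs
import urllib.request


def decode_body(body: bytes, declared_charset, fallback: str = "iso-8859-1") -> str:
    """Decode body with the declared charset, or with fallback if it is unusable."""
    charset = declared_charset or fallback
    try:
        codecs.lookup(charset)  # raises LookupError for an unknown codec name
    except LookupError:
        charset = fallback
    # errors="replace" is a choice made for this sketch: undecodable bytes
    # become U+FFFD instead of raising UnicodeDecodeError.
    return body.decode(charset, errors="replace")


def fetch_text(url: str) -> str:
    """Download url and decode it using the charset from the Content-Type header."""
    with urllib.request.urlopen(url) as response:
        # get_content_charset() returns None when the header names no charset.
        declared = response.headers.get_content_charset()
        return decode_body(response.read(), declared)
```

With helpers like these, the question's code could pass fetch_text(url) to html.document_fromstring instead of page.read(), so that lxml receives already-decoded text.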