Hello. I decided to write a parser in Python for a page with a single table. The page's "meta" tag declares the utf-8 encoding. I can read all the data I need from this table, but the Russian characters come out as unreadable gibberish. Here is the program code:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # vim:fileencoding=utf-8
    import lxml.html as html
    import requests

    page = requests.get('https://org.mephi.ru/pupil-rating/get-rating/entity/4575/original/no')
    tree = html.fromstring(page.content)
    range_list = tree.xpath('//tr[@class="trPosBen"]/td[1]/text()')
    unique_list = tree.xpath('//tr[@class="trPosBen"]/td[3]/text()')
    fio_list = tree.xpath('//tr[@class="trPosBen"]/td[4]/text()')
    hostel_list = tree.xpath('//tr[@class="trPosBen"]/td[5]/text()')
    score_list = tree.xpath('//tr[@class="trPosBen"]/td[6]/span[1]/text()')
    sum_score_list = tree.xpath('//tr[@class="trPosBen"]/td[7]/text()')
    docs_list = tree.xpath('//tr[@class="trPosBen"]/td[8]/text()')

Then I combine all these lists into 'result_list' to make a table. Printing to the console works without errors, but the Russian characters are displayed like this: RED BLACK. When I try to write this "table" to a text file, an encoding error is raised:

    Traceback (most recent call last):
      File "C:/Users/Vasiiil/PycharmProjects/untitled/HelloWorld.py", line 55, in <module>
        f.write(str(i[j]) + " ")
      File "C:\Program Files (x86)\Python35-32\lib\encodings\cp1251.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-7: character maps to <undefined>

After adding the encoding='utf-8' parameter when opening the file, the error no longer appears, but the same gibberish is written to the file: GEM. Please help. I have been searching the Internet for a solution to this problem for three days now and have not found anything.
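Both symptoms above are consistent with text that was already decoded with the wrong codec before it was written. A minimal sketch of that situation, assuming the page bytes were misread as Latin-1 (an assumption; the question does not show which codec was actually guessed) and using a made-up sample string in place of a scraped table cell:

```python
# Hypothetical sample text; stands in for a cell scraped from the table.
original = "Иванов"

# Assumption: UTF-8 bytes were misread as Latin-1 somewhere upstream.
garbled = original.encode("utf-8").decode("latin-1")

# cp1251 has no mapping for these characters, so encoding fails,
# just as writing to a file opened with the cp1251 default does.
try:
    garbled.encode("cp1251")
    cp1251_failed = False
except UnicodeEncodeError:
    cp1251_failed = True

# UTF-8 can encode any character, so the write succeeds,
# but it faithfully stores the already-garbled text.
stored = garbled.encode("utf-8")

print(cp1251_failed)
print(stored.decode("utf-8") == original)
```

In other words, encoding='utf-8' on the file only removes the error; the text has to be decoded correctly earlier, when the page is read.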

2 answers

Try

    # Decode the response bytes explicitly as UTF-8 before parsing.
    html.fromstring(page.content.decode('utf-8'))

or

    page.encoding = 'utf-8'
    html.fromstring(page.text)
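Both variants make the decoding step explicit instead of leaving the codec to be guessed. A small stdlib-only sketch of the difference, where the byte string stands in for page.content and the sample name is made up; Latin-1 as the wrongly guessed codec is an assumption for illustration:

```python
# Stand-in for page.content: UTF-8 bytes as sent by the server.
content = "Петров".encode("utf-8")

# A wrong guess does not raise; it silently yields mojibake.
guessed = content.decode("latin-1")

# Naming the codec explicitly gives back the real text.
explicit = content.decode("utf-8")

print(repr(guessed))
print(explicit)
```

Setting page.encoding before reading page.text tells requests which codec to use for that decoding, which is why the second variant has the same effect.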
  • Thank you very much. Now everything is displayed correctly - vasiiil
    >>> a = 'texqweфывыфвt'.encode(encoding='utf-8', errors='ignore').decode('utf-8', 'ignore')
    >>> print(a)
    texqweфывыфвt
  • Nothing changed; the output shows the same characters. When a value is set in the code or read from a text file, Russian characters are handled correctly without any extra encoding or decoding. I run into this problem only when reading the html page - vasiiil
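That result is expected: encoding a string to UTF-8 and decoding it straight back with the same codec returns the same string, so the round trip cannot repair text that was mis-decoded earlier. A sketch of the difference, assuming (hypothetically) that the damage came from reading UTF-8 bytes as Latin-1:

```python
text = "texqweфывыфвt"

# Round trip with the same codec: a no-op for any valid string.
same = text.encode("utf-8").decode("utf-8")

# Assumed damage: UTF-8 bytes decoded with the wrong codec.
garbled = text.encode("utf-8").decode("latin-1")

# The same-codec round trip leaves the garbled text garbled.
still_garbled = garbled.encode("utf-8").decode("utf-8")

# Undoing the wrong decode, then decoding correctly, repairs it.
repaired = garbled.encode("latin-1").decode("utf-8")

print(same == text, still_garbled == garbled, repaired == text)
```

The repair only works when the wrongly applied codec is known and lossless, so decoding the page correctly in the first place is the more reliable route.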