Hello. I decided to write a parser in Python for a page that contains a single table. The page's "meta" tag declares the utf-8 encoding. I can read all the data I need from the table, but the Russian characters come out as incomprehensible gibberish. Here is the program code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vim:fileencoding=utf-8
import lxml.html as html
import requests

page = requests.get('https://org.mephi.ru/pupil-rating/get-rating/entity/4575/original/no')
tree = html.fromstring(page.content)
range_list = tree.xpath('//tr[@class="trPosBen"]/td[1]/text()')
unique_list = tree.xpath('//tr[@class="trPosBen"]/td[3]/text()')
fio_list = tree.xpath('//tr[@class="trPosBen"]/td[4]/text()')
hostel_list = tree.xpath('//tr[@class="trPosBen"]/td[5]/text()')
score_list = tree.xpath('//tr[@class="trPosBen"]/td[6]/span[1]/text()')
sum_score_list = tree.xpath('//tr[@class="trPosBen"]/td[7]/text()')
docs_list = tree.xpath('//tr[@class="trPosBen"]/td[8]/text()')
Then I combine all these lists into 'result_list' to make a table. Printing to the console works without errors, but the Russian characters are displayed as follows: RED BLACK. When I try to write this "table" to a text file, I get an encoding error:
Traceback (most recent call last):
  File "C:/Users/Vasiiil/PycharmProjects/untitled/HelloWorld.py", line 55, in <module>
    f.write(str(i[j]) + " ")
  File "C:\Program Files (x86)\Python35-32\lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-7: character maps to <undefined>
After adding the encoding='utf-8' parameter when opening the file, the error no longer appears, but the same gibberish is written to the file: GEM. Please help. I have been searching the Internet for a solution to this problem for three days now and haven't found anything.
lxml can sometimes have problems with Unicode. You can use the beautifulsoup4 package to automatically pick the correct encoding. - jfs
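The symptoms in the question (gibberish on the console, then a cp1251 encode error when writing to a file) are consistent with UTF-8 bytes being decoded with a single-byte encoding somewhere along the way. The sketch below only illustrates that mechanism with the standard library. The sample string is made up, and latin-1 is an assumption about the wrong codec, not something confirmed for this page.

```python
# Hypothetical sample: stands in for the raw bytes in page.content.
raw = "<td>Иванов</td>".encode("utf-8")

# Wrong: treating UTF-8 bytes as a single-byte encoding yields mojibake.
# (latin-1 is an assumed stand-in for whatever codec was actually guessed.)
garbled = raw.decode("latin-1")

# Right: decode explicitly with the encoding the page declares.
text = raw.decode("utf-8")

# The mojibake contains characters that cp1251 cannot represent,
# so encoding it for a cp1251 file raises UnicodeEncodeError.
try:
    garbled.encode("cp1251")
    cp1251_ok = True
except UnicodeEncodeError:
    cp1251_ok = False

# The correctly decoded text encodes to cp1251 without problems.
encoded = text.encode("cp1251")

print(ascii(garbled))            # escaped view of the mojibake
print(cp1251_ok, len(encoded))
```

If this is indeed the cause, one possible change in the script from the question is to decode the response yourself and pass a string to the parser, for example html.fromstring(page.content.decode('utf-8')), assuming the server really does send UTF-8.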