Python problems with cp1251 encoding in the lxml module

Question

When I try to parse an XML document, I get an error:

lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding!

Code:

    #-*- coding cp1251 -*-
    import sys
    from lxml import etree

    reload(sys)
    sys.setdefaultencoding("cp1251")

    inputFile = a.ED
    tree = etree.parse(inputFile)
    nodes = tree.xpath('/')
    print nodes.decode('cp1251')

Windows 7, Python 2.7, lxml 2.3

In the document:

    <ED101 sysCode ="04">
      <dsig:SigValue xmlns:dsig="urn">AAAA</dsig:SigValue>
      <Name>Сергей Николаевич</Name>
    </ED101>

Specify the encoding of the XML document, or transcode its contents to UTF-8.
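Both options can be sketched roughly as follows. This is only an illustration, not a definitive fix: the helper names `parse_with_declaration` and `parse_transcoded` are made up for this example, the sample bytes are rebuilt from the document in the question, and the code is written in Python 3 syntax. The fallback import of `xml.etree.ElementTree` is there only so the sketch can run where lxml is not installed.

```python
# -*- coding: utf-8 -*-
try:
    from lxml import etree
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

EXPECTED_NAME = u'Сергей Николаевич'

# The document from the question as cp1251 bytes, with no XML declaration.
SAMPLE = (
    u'<ED101 sysCode="04">'
    u'<dsig:SigValue xmlns:dsig="urn">AAAA</dsig:SigValue>'
    u'<Name>' + EXPECTED_NAME + u'</Name>'
    u'</ED101>'
).encode('cp1251')


def parse_with_declaration(raw, encoding='windows-1251'):
    """Option 1: name the encoding in an XML declaration, then parse the bytes."""
    declaration = u'<?xml version="1.0" encoding="%s"?>\n' % encoding
    return etree.fromstring(declaration.encode('ascii') + raw)


def parse_transcoded(raw, encoding='cp1251'):
    """Option 2: transcode the bytes to UTF-8 (the XML default), then parse."""
    return etree.fromstring(raw.decode(encoding).encode('utf-8'))


for parse in (parse_with_declaration, parse_transcoded):
    root = parse(SAMPLE)
    # Compare instead of printing the name, to stay independent of the console encoding.
    print(parse.__name__, root.find('Name').text == EXPECTED_NAME)
```

When the file itself should not be edited, lxml's `etree.XMLParser` also accepts an `encoding` argument that overrides the encoding of the input, and such a parser can be passed as the second argument to `etree.parse`.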
This does not relate directly to the issue, but it is worth mentioning:
1- The #-*- coding cp1251 -*- line has no effect in your Python code, since it is not recognized as an encoding declaration.
2- Do not use reload(sys); sys.setdefaultencoding("cp1251").
reload(sys); sys.setdefaultencoding("cp1251") is simply a way to corrupt data silently (without explicit errors that would point to the problem) or to get garbled output (mojibake).
lxml should return the unicode type for non-ASCII content.
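Instead of changing the interpreter-wide default encoding, the text can be kept as Unicode and encoded explicitly at the point of output. A minimal sketch, assuming the target console or file expects cp1251; the helper name `encode_for_output` is hypothetical:

```python
# -*- coding: utf-8 -*-

def encode_for_output(text, encoding='cp1251'):
    """Encode Unicode text explicitly; unencodable characters become '?'."""
    return text.encode(encoding, 'replace')


name = u'Сергей Николаевич'      # what the parser hands back: Unicode text
data = encode_for_output(name)   # bytes, produced only at the output boundary

# cp1251 is a single-byte encoding, so both lengths are the same here.
print(len(name), len(data))
```

The explicit call makes the encoding visible at the one place where bytes are needed, so a mismatch shows up there rather than silently elsewhere in the program.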
Either a colon or an equals sign must follow coding for the line to be recognized as an encoding declaration.
Example: # -*- coding: utf-8 -*- (without an encoding declaration, non-ASCII characters in the source code, whether in string constants or in comments, cause a SyntaxError).
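The rule can be checked with the regular expression given in PEP 263 for recognizing encoding declarations. The sketch below applies that pattern to single lines only; the helper name `declared_encoding` is made up for this example, and the real interpreter additionally looks only at the first two lines of a file.

```python
import re

# Pattern from PEP 263: "coding" must be followed directly by ":" or "=".
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def declared_encoding(line):
    """Return the encoding named by a source line, or None if it declares none."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None


print(declared_encoding('#-*- coding cp1251 -*-'))    # None: no colon or equals sign
print(declared_encoding('# -*- coding: cp1251 -*-'))  # cp1251
print(declared_encoding('# coding=utf-8'))            # utf-8
```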
@NicolasChabanovsky my comments are not related to the XML problem.
To fix the XMLSyntaxError, follow Sergey Gornostaev's recommendation.
