Encoding problem with BeautifulSoup

Question

I wrote the code as advised here to parse an XML file, but as far as I can tell there is a problem with the encoding, or I have misunderstood something.

    from bs4 import BeautifulSoup

    infile = open('C:\\Users\\inikitatech\\Python Example\\xml_data.xml', 'r')
    contents = infile.read()
    soup = BeautifulSoup(contents, 'xml')

    print(soup.select_one('id').text)
    print(soup.select_one('href').text.strip())
    print(soup.select_one('url').text.strip())

It produces this error:

    Traceback (most recent call last):
      File "C:/Users/inikitatech/PycharmProjects/PythonExample/ZakupParser.py", line 4, in <module>
        contents = infile.read()
      File "C:\Users\inikitatech\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1251.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 739: character maps to <undefined>

I looked into how this could be solved. People write that it can be fixed with the built-in codecs library, but then a different error appears.

    from bs4 import BeautifulSoup
    import codecs

    infile = codecs.open('C:\\Users\\inikitatech\\Python Example\\xml_data.xml', 'r', 'utf-8')
    contents = infile.read()
    soup = BeautifulSoup(contents, 'xml')

    print(soup.select_one('id').text)
    print(soup.select_one('href').text.strip())
    print(soup.select_one('url').text.strip())

Here is the error:

    Traceback (most recent call last):
      File "C:/Users/inikitatech/PycharmProjects/PythonExample/ZakupParser.py", line 6, in <module>
        soup = BeautifulSoup(contents, 'xml')
      File "C:\Users\inikitatech\PycharmProjects\PythonExample\venv\lib\site-packages\bs4\__init__.py", line 165, in __init__
        % ",".join(features))
    bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: xml. Do you need to install a parser library?

What does this mean, and what is the right way to parse XML?

Try soup = BeautifulSoup(contents, 'html.parser') or soup = BeautifulSoup(contents, 'lxml')
So, as I understand it, I do not need to import the lxml library myself for this?
No, BeautifulSoup will try to import the specified parser itself.
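
To illustrate the comment above: the 'xml' feature is backed by lxml, so without lxml installed BeautifulSoup raises FeatureNotFound, while 'html.parser' ships with Python itself. Below is a rough sketch of picking the first parser that is available. The helper name make_soup and the sample markup are made up for illustration and are not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("xml", "lxml", "html.parser")):
    """Return (soup, parser_name) for the first parser BeautifulSoup can load.

    make_soup is a hypothetical helper, not a bs4 function.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed (e.g. lxml is missing); try the next one.
            continue
    raise RuntimeError("no usable parser found")


# Hypothetical sample markup standing in for the asker's file.
sample = "<entry><id>42</id><url> http://example.com/doc </url></entry>"

soup, used = make_soup(sample)
print("parser used:", used)
print(soup.find("id").text)
```

If the output shows 'html.parser' as the parser used, lxml is probably not installed in that environment. Installing it (pip install lxml) is what makes the 'xml' and 'lxml' features available.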
@gil9red can you explain why .text is used in some places and .text.strip() in others?
.text returns the text of the element, and strip() is the string method that removes whitespace from the left and right ends, such as spaces, tabs, and newlines.
If .text returns the string " text\n \n", then strip() shortens it to "text".
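
The strip() behaviour described above can be checked with plain strings, no parser needed. A small sketch using the same sample value as the comment:

```python
# The same value as in the comment: text surrounded by spaces and newlines.
raw = " text\n \n"

stripped = raw.strip()      # removes whitespace on both ends
left_only = raw.lstrip()    # removes whitespace on the left only
right_only = raw.rstrip()   # removes whitespace on the right only

print(repr(stripped))
print(repr(left_only))
print(repr(right_only))
```

Whitespace inside the string is not touched; only the leading and trailing runs are removed.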

gil9red · Accepted Answer · 2018-07-11T08:20:02

What helped the asker was specifying the correct parser:

 soup = BeautifulSoup(contents, 'html.parser')

or

 soup = BeautifulSoup(contents, 'lxml')
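
The first traceback in the question is separate from the parser choice: open() without an encoding argument used the Windows locale codec (cp1251 here), and cp1251 has no character for byte 0x98. A minimal sketch of the fix, assuming the file is actually UTF-8 (the codecs.open(..., 'utf-8') attempt getting past read() suggests it is). The sample text and the temporary file name are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample: the Cyrillic letter 'И' is encoded in UTF-8 as bytes D0 98,
# so it contains the same 0x98 byte that the traceback complains about.
xml_text = '<?xml version="1.0" encoding="UTF-8"?><name>Иванов</name>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write(xml_text)

    # Reading with the wrong codec reproduces the UnicodeDecodeError.
    try:
        with open(path, "r", encoding="cp1251") as f:
            f.read()
        cp1251_failed = False
    except UnicodeDecodeError as exc:
        cp1251_failed = True
        print("cp1251:", exc)

    # Naming the encoding explicitly reads the file correctly.
    with open(path, "r", encoding="utf-8") as f:
        contents = f.read()

print("utf-8 read matches original:", contents == xml_text)
```

For the real file this would mean open(path, 'r', encoding='utf-8') instead of a bare open(path, 'r'). Another option is to open the file in 'rb' mode and pass the bytes to BeautifulSoup, which then works out the encoding on its own.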
