I have some code that parses web pages:

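    import requests
    from lxml.html import fromstring


    def prepare_content(url):
        # Send a browser-like User-Agent so the site serves the regular page
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        response = requests.get(url, headers=headers)
        tree = fromstring(response.text)
        # Resolve relative links against the final URL of the response
        tree.make_links_absolute(response.url)
        return tree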

Here is an example of the code where the problem arises:

    date = day.xpath('.//div[@class="h3"]/text()')[1].strip().split(',')[0]
    print('date:', date)  # date: Ночь c 27 ноября на 28 ноября
    print("'Ночь с' in date:", 'Ночь с' in date)  # 'Ночь с' in date: False
    if 'Ночь с' in date:
        date = ' '.join(a.split()[-2:])
    print('date:', date)  # date: Ночь c 27 ноября на 28 ноября

The call print("'Ночь с' in date:", 'Ночь с' in date) should print True, not False.

I know very little about encodings, but could the cause be a mismatch between the encoding of the parsed data and the encoding used by the IDE? If so, how do I convert the parsed data to the IDE's encoding?
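
One way to check whether this is an encoding problem at all is to list the Unicode code point and name of every character in the string. The sketch below is only an illustration: `describe_chars` is a hypothetical helper, and the two sample strings are written with explicit escapes so the difference stays visible, on the assumption that the parsed text contains a Latin "c" while the source literal contains a Cyrillic "с".

```python
import unicodedata


def describe_chars(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


# Assumption: the scraped text ends with a Latin "c" (U+0063),
# while the literal typed in the source ends with a Cyrillic "с" (U+0441).
parsed = "Ночь \u0063"
typed = "Ночь \u0441"

for ch, code, name in describe_chars(parsed):
    print(repr(ch), code, name)

# The two strings look the same on screen but are not equal.
print(parsed == typed)
```

If the names printed for the parsed string mix LATIN and CYRILLIC letters, the strings differ by look-alike characters, and changing the encoding will not make them compare equal.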

  • In date, the letter c is English (Latin). :) - andreymal
  • Python 3 should not have problems like this. Try at least printing type(date). Or the problem really is the letter c. - andy.37
  • @andreymal, right, the 'c' is not Cyrillic. Thanks. - TitanFighter
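
Following the diagnosis in the comments, one possible workaround when the page itself contains the Latin letter is to map Latin look-alikes to their Cyrillic counterparts before comparing. This is a minimal sketch rather than a complete solution: the table `LATIN_TO_CYRILLIC` and the helpers `normalize_lookalikes` and `contains_ignoring_lookalikes` are hypothetical names, and the table is deliberately incomplete.

```python
# Hypothetical, incomplete table of Latin letters that look like Cyrillic ones.
LATIN_TO_CYRILLIC = str.maketrans({
    "a": "\u0430",  # Cyrillic а
    "c": "\u0441",  # Cyrillic с
    "e": "\u0435",  # Cyrillic е
    "o": "\u043e",  # Cyrillic о
    "p": "\u0440",  # Cyrillic р
    "x": "\u0445",  # Cyrillic х
    "C": "\u0421",  # Cyrillic С
    "O": "\u041e",  # Cyrillic О
})


def normalize_lookalikes(text):
    """Replace Latin look-alike letters with Cyrillic ones (for comparison only)."""
    return text.translate(LATIN_TO_CYRILLIC)


def contains_ignoring_lookalikes(haystack, needle):
    """Substring test that treats Latin look-alikes and Cyrillic letters as equal."""
    return normalize_lookalikes(needle) in normalize_lookalikes(haystack)


# Assumption: the scraped date contains a Latin "c" (U+0063).
date = "Ночь \u0063 27 ноября на 28 ноября"
needle = "Ночь \u0441"  # Cyrillic "с" (U+0441)

print(needle in date)
print(contains_ignoring_lookalikes(date, needle))
```

The normalized string is meant only for the comparison. Applying the mapping to text that is genuinely Latin (URLs, English words) would corrupt it, so the original value of date should be kept for everything else.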

1 answer

Try adding the following as the first line of the source file. Read more here.

    # -*- coding: utf-8 -*-
  • Try to post detailed answers that contain a specific minimal working example, and supplement them with a link to the source. Link-only answers (like comments) do not add knowledge to the Runet. - Nicolas Chabanovsky