I am trying to use CsQuery to parse HTML. The HTML document contains a set of divs with the class some_class, and each div contains text in Russian. I parse the divs as follows:

    //...
    CQ cq = CQ.Create(html, Encoding.UTF8);
    List<IDomObject> items = cq.Find("div.some_class").ToList();
    // Demo code to inspect the contents of the items
    items.ForEach(x => { var test = x.InnerText; });

As a result, the text in test comes out as hexadecimal code. I looked through IDomObject but did not find a way to set the encoding. It is also unclear why the encoding gets lost when I set it while creating the cq object.

Has anyone encountered a similar situation?

  • And what do you mean by hexadecimal code? - VladD
  • For example, instead of Cyrillic I get "\n&#1088;-&#1085;&#1051;&#1077;&#1085;&#1080;&#1085;&#1089;&#1082;&#1080;& &#1081;, &#1091;&#1083;&#1057;&#1090;&#1077;&#1087;&#1072;&#1085;&#1072;&#1072;&#1079;&#1080;&#1085;&#1072;, 40". If you use a decoder such as 2cyr.com/decode/?lang=ru, it decodes correctly when WINDOWS-1251 is selected as the original encoding. - Dmitriy
  • Hmm, how do you display the text? How do you see the problem? What happens if you write the string to a file? - VladD
  • If I write it to a file, the text looks the same. - Dmitriy
  • Okay, which version of CsQuery are you using? The same problem is reported for an older version: github.com/jamietre/CsQuery/issues/105 - VladD

1 answer

This is a bug in version 1.3.4 of CsQuery. The bug report mentions a workaround: use .Cq().Text() instead of .InnerText.
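
A minimal sketch of that workaround, assuming the CsQuery package is referenced. The sample HTML string and the Program class are made up for illustration; only the switch from .InnerText to .Cq().Text() comes from the bug report:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using CsQuery;

class Program
{
    static void Main()
    {
        // Hypothetical sample input; in the question the HTML comes from a real page.
        string html = "<div class=\"some_class\">Привет</div>"
                    + "<div class=\"some_class\">Мир</div>";

        CQ cq = CQ.Create(html);
        List<IDomObject> items = cq.Find("div.some_class").ToList();

        foreach (IDomObject item in items)
        {
            // Workaround: wrap the element in a CQ object and read Text()
            // instead of reading InnerText directly.
            string text = item.Cq().Text();
            Console.WriteLine(text);
        }
    }
}
```

Whether the console shows the Cyrillic correctly also depends on the console's own output encoding, so it may be easier to inspect the value in a debugger or write it to a UTF-8 file.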

The problem is fixed in version 1.3.5 beta, so if you do not mind using a beta version, upgrade to it.