When parsing with CsQuery, element text is returned as numeric character codes

Question

I am trying to use CsQuery to parse HTML. The document contains a set of divs with the class some_class, and each div contains text in Russian. I try to parse the divs as follows:

    //...
    CQ cq = CQ.Create(html, Encoding.UTF8);
    List<IDomObject> items = cq.Find("div.some_class").ToList();
    // Demo code for inspecting the contents of the items
    items.ForEach(x => { var test = x.InnerText; });

As a result, the text in test comes back as numeric character codes instead of Cyrillic. I looked through IDomObject but found no way to set the encoding. It is also unclear why the encoding is lost when I set it explicitly while creating the cq object.

Has anyone encountered a similar situation?

For example, instead of Cyrillic I get: "\n&#1088;-&#1085; &#1051;&#1077;&#1085;&#1080;&#1085;&#1089;&#1082;&#1080;&#1081;, &#1091;&#1083; &#1057;&#1090;&#1077;&#1087;&#1072;&#1085;&#1072; &#1072;&#1079;&#1080;&#1085;&#1072;, 40"
If you run this through a decoder (2cyr.com/decode/?lang=ru), it decodes correctly when WINDOWS-1251 is selected as the source encoding.
The same problem was reported for an older version: github.com/jamietre/CsQuery/issues/105

VladD · Accepted Answer · 2017-01-03T11:17:08

This is a bug in version 1.3.4 of CsQuery. The bug report mentions a workaround: use .Cq().Text() instead of .InnerText.
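Applied to the code from the question, the workaround might look like the sketch below. The sample HTML string is a made-up stand-in for the real page, and CQ.Create is called here with just a string rather than with the encoding argument used in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using CsQuery;

class Program
{
    static void Main()
    {
        // Hypothetical input: any HTML with Cyrillic text inside div.some_class.
        string html = "<div class=\"some_class\">Привет</div>";

        CQ cq = CQ.Create(html);
        List<IDomObject> items = cq.Find("div.some_class").ToList();

        foreach (IDomObject item in items)
        {
            // Workaround: wrap the node in a CQ object and call Text()
            // instead of reading InnerText.
            string text = item.Cq().Text();
            Console.WriteLine(text);
        }
    }
}
```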

The problem is fixed in version 1.3.5 beta, so if you do not mind using a beta release, upgrade to it.
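If a string has already been read through .InnerText, the codes in it are ordinary HTML numeric character references, so the standard WebUtility.HtmlDecode method should be able to turn them back into characters. This is not mentioned in the bug report; it is only a sketch of a possible fallback, using the first few characters of the output quoted in the question.

```csharp
using System;
using System.Net;

class Program
{
    static void Main()
    {
        // First few characters of the encoded output quoted in the question.
        string encoded = "&#1088;-&#1085;";

        // HtmlDecode converts numeric character references back to characters.
        string decoded = WebUtility.HtmlDecode(encoded);

        Console.WriteLine(decoded == "р-н"); // prints True
    }
}
```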
