I want to parse HTML that contains embedded JSON, and then write that JSON to a database.

This is what the HTML looks like:

 <div class="bContentColumn">
   <script type="text/javascript">
     Core.Namespace.exp('Pages.Detail.modelData', {"employees":[
       {"firstName":"John", "lastName":"Doe"},
       {"firstName":"Anna", "lastName":"Smith"},
       {"firstName":"Peter", "lastName":"Jones"}
     ]});
     Core.Namespace.exp('Pages.Detail.useFakeSaleBlock', false);
   </script>
 </div>

The amount of data in the JSON can vary. Getting to the div is simple:

 string div_inner_text = htmlDocument.DocumentNode.SelectSingleNode("//div[@class='bContentColumn']").InnerText; 

Then I cut the string at the first comma that occurs in it:

 string result = div_inner_text.Substring(div_inner_text.IndexOf(',') + 1); 

That gives me the start, but how do I find the end of the JSON? I tried this:

 string res_two = result.Substring(0, result.IndexOf(';')); 

But I don't like this option: relying on the semicolon is unreliable, because a semicolon can easily occur inside the data itself. So the question is: what is the cleanest way to get the JSON out of the div?

  • Why don't you first find the script with the HTML parser, and then parse its contents? - VladD
  • (Although, if you are trying to get information from a site that does not provide an API, it seems to me this is wasted effort anyway: the site may change the format at any moment.) - VladD
  • There are a lot of scripts in this file - shatoidil
  • Yes, it's OZON; I doubt they will change it any time soon. They have XML, but it doesn't suit me because the product information in it is incomplete - shatoidil

1 answer

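I would do it like this. First, get the contents of the script tag (selecting the script element directly, because the first child node of the div may be a whitespace text node):

 string div_inner_text = htmlDocument.DocumentNode.SelectSingleNode("//div[@class='bContentColumn']/script").InnerText; 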

Now remove all line breaks and tabs from it:

 div_inner_text = div_inner_text.Replace("\n", "").Replace("\t", "").Replace("\r", ""); 

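Now we have a string that always starts with 'Core.Namespace.exp('. Neither that call nor its first argument is part of the JSON, so cut off everything before the first '{':

 div_inner_text = div_inner_text.Substring(div_inner_text.IndexOf('{')); 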

The start is found. What remains is to find the end and extract the data (First() requires using System.Linq):

 var json_data = div_inner_text.Replace("});", "|").Split('|').First() + "}"; 

The likelihood that the data itself contains '});' or '|' is small. This approach is far from perfect, but it has worked for me many times.
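If the data might contain '});', ';' or '|' after all, one alternative is to match braces instead of searching for a terminator. The following is only a minimal sketch, not the approach described above: the class name EmbeddedJson and the method ExtractObjectAfter are made up for illustration, and it assumes the JSON object is the first '{' that follows a known marker such as 'Pages.Detail.modelData'. Braces inside JSON string literals are skipped, so they cannot end the match early.

```csharp
using System;

public static class EmbeddedJson
{
    // Hypothetical helper: returns the first balanced {...} block that
    // follows `marker` in `text`, or null if the marker or a balanced
    // block is not found.
    public static string ExtractObjectAfter(string text, string marker)
    {
        int markerPos = text.IndexOf(marker, StringComparison.Ordinal);
        if (markerPos < 0) return null;

        int start = text.IndexOf('{', markerPos + marker.Length);
        if (start < 0) return null;

        int depth = 0;          // current nesting level of braces
        bool inString = false;  // inside a double-quoted JSON string?
        bool escaped = false;   // previous character was a backslash?

        for (int i = start; i < text.Length; i++)
        {
            char c = text[i];
            if (inString)
            {
                if (escaped) escaped = false;         // character after a backslash
                else if (c == '\\') escaped = true;   // start of an escape sequence
                else if (c == '"') inString = false;  // closing quote
                continue;
            }
            if (c == '"') inString = true;
            else if (c == '{') depth++;
            else if (c == '}' && --depth == 0)
                return text.Substring(start, i - start + 1);
        }
        return null; // braces never balanced
    }
}
```

Calling EmbeddedJson.ExtractObjectAfter(div_inner_text, "Pages.Detail.modelData") on the sample above should return the object from {"employees":[ up to the matching closing brace. Since the search starts at the marker, the same call could also be run against a larger piece of the page when there are many scripts. Before writing the result to the database it is probably worth validating it with a JSON parser, for example JsonDocument.Parse from System.Text.Json.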