How to download an HTML page with all of its data?

Question

I'm trying to download an HTML page ( https://site.com/../page.html?lang=en ) and read data from it. Its structure is roughly as follows:

 <div class="class1">
 <div class="1.2">
 <div class="class2">
 <div class="2.2">
 <div> **1 JUNE**</div>
 <div class="class3">
 <div class="3.2">

I wrote this function:

 // Requires: using System; using System.IO; using System.Net;
 private static string GetHtml(string url)
 {
     try
     {
         var req = (HttpWebRequest)WebRequest.Create(url);
         req.AllowAutoRedirect = false;
         req.Method = "GET";
         req.UseDefaultCredentials = true;
         req.Proxy.Credentials = System.Net.CredentialCache.DefaultCredentials;
         req.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36";
         req.Referer = "https://www.google.com/";
         // Returns the raw response body; no JavaScript is executed
         using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
         {
             return reader.ReadToEnd();
         }
     }
     catch (Exception ex)
     {
         return ex.Message;
     }
 }

The problem is that only the page template is returned, without any data. For example, instead of

 <div> **1 JUNE**</div>

I get something like:

 <div> {{product.Date | date:'dd'}} {{AD.Resource(product.Date | date:'MMM')}} {{product.Date | date:'HH:mm'}}</div>

However, if I save the page from Chrome ("Save as"), the saved file contains the data, not the template.

The problem is that the data on the page is generated by JavaScript.
You need to either fetch the data from the same place the site gets it, or resort to the drastic measure of using a browser component (for example, CEF) that can execute the scripts and give you the page source once loading has finished.
In your case, all the data is available at this path, in ordinary JSON format.
It is even simpler, and sufficient, to request the data directly (no request body or cookies required).
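As a rough illustration of requesting the data directly, here is a minimal sketch. It assumes the endpoint returns a JSON array of objects with a `Date` property in ISO 8601 format (the property name is guessed from `product.Date` in the template). The `ProductData` class, its method names, and the `dataUrl` parameter are hypothetical; the real URL is whatever request the browser's network tab shows when the page loads. `System.Text.Json` ships with .NET Core 3.0 and later.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class ProductData
{
    // Parsing step only: reads the "Date" property of every element in a JSON array.
    // The array-of-objects shape and the "Date" name are assumptions.
    public static List<DateTime> ParseDates(string json)
    {
        var dates = new List<DateTime>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            foreach (JsonElement product in doc.RootElement.EnumerateArray())
            {
                dates.Add(product.GetProperty("Date").GetDateTime());
            }
        }
        return dates;
    }

    // Downloads the JSON with a plain GET (no request body, no cookies) and parses it.
    // dataUrl is a placeholder for the address the page itself requests.
    public static async Task<List<DateTime>> DownloadDatesAsync(HttpClient client, string dataUrl)
    {
        string json = await client.GetStringAsync(dataUrl);
        return ParseDates(json);
    }
}
```

A call such as `await ProductData.DownloadDatesAsync(new HttpClient(), dataUrl)` would then replace `GetHtml` entirely, since the values come from the JSON rather than from the rendered markup.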

Vadim Bondaruk · Answer 1 · 2018-05-02T14:25:55

When the browser loads the page, it executes the JavaScript code, which replaces the templates with the actual data. You need a component that can reproduce these steps (something like a browser engine). I know that .NET has the WebBrowser component, but that is the most basic option.

You can read my comments on the question and turn them into an answer.
For about 90% of sites, the best option is to find where the site itself gets its data and send a request there.
Browser components are overkill and are needed only in rare cases. Unfortunately, these days many people only know how to work with such components and end up doing a lot of unnecessary work.

B. Vandyshev · Answer 2 · 2018-05-03T06:54:50

The easiest way is to use Selenium WebDriver:

Install-Package Selenium.WebDriver

Install-Package Selenium.WebDriver.ChromeDriver

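 // Requires: using System.IO; using System.Reflection; using OpenQA.Selenium; using OpenQA.Selenium.Chrome;
 // Folder that contains the ChromeDriver executable
 var path = Path.GetDirectoryName(Assembly.GetEntryAssembly().Location);
 var driver = new ChromeDriver(path);
 driver.Navigate().GoToUrl("https://site.com/../page.html?lang=en");
 // A class name starting with a digit is not a valid ".2\.2" selector, so match it by attribute
 var myData = driver.FindElement(By.CssSelector(".class2 > [class='2.2'] > div")).Text;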
