Java - how to get information from an HTML page

Question

I am writing a Java application that parses the source code of web pages. I fetch and parse the page like this:

Document doc = Jsoup.connect("http://example.com/").get();

The parsing itself works. But when I look at the source I get back, not all of the blocks are there. For example, in the browser's "Inspect Element" view I can see a particular block and everything inside it, but that block is missing from the source returned by Jsoup. So the actual question is: how do I get and parse the complete source of the page?

My guess is that the block is missing because it is generated by JavaScript. Is it possible to get it in that case?

Accepted Answer · 2016-12-22T18:55:20

Most likely, the element is created by JavaScript running in the browser and/or filled in by additional requests to an API. Jsoup is not a browser and has no JavaScript engine, so doc contains only the original "static" HTML.

There are several options to solve this problem:

Automate a real browser with Selenium: open the page in a Selenium-driven browser and let it load the page and run its scripts. Then either get the resulting source with getPageSource() and pass it to Jsoup for parsing, or keep working with the Selenium WebDriver API to locate the data.
Inspect how the page is loaded and built (using the browser's developer tools). If the browser makes additional requests, repeat them in your Java code.
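
As a rough illustration of the second option, here is a minimal sketch that repeats a background request using the JDK's built-in java.net.http.HttpClient (Java 11+). The endpoint http://example.com/api/items, the class name ExtraRequestSketch, and the two headers are assumptions made for the example; the real URL and headers are whatever the Network tab of the developer tools shows for the page in question.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class ExtraRequestSketch {

    /** Builds a GET request that imitates the browser's background call. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")
                // Assumed header: some sites mark background requests this way.
                // Copy the headers that the developer tools actually show.
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    /** Sends the request and returns the response body as text. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; replace it with the URL seen in the Network tab.
        System.out.println(fetch("http://example.com/api/items"));
    }
}
```

If the response is an HTML fragment rather than JSON, the returned string can be handed to Jsoup for parsing in the same way as a full page.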

It also sometimes happens that the data you need is already present in the HTML, just in a different place, for example inside script tags. Check for that as well.
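
To check for data hidden in script tags, something along these lines may help. This sketch deliberately uses a plain regular expression from the JDK so that it runs without dependencies. It is not a robust HTML parser, and with Jsoup it would be more natural to select the script elements from doc and read their contents. The class name ScriptDataFinder, the sample HTML, and the marker "items" are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScriptDataFinder {

    // Matches <script ...>body</script>, case-insensitively and across line breaks.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>(.*?)</script>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the bodies of all script elements that contain the marker text. */
    static List<String> scriptsContaining(String html, String marker) {
        List<String> hits = new ArrayList<>();
        Matcher m = SCRIPT.matcher(html);
        while (m.find()) {
            String body = m.group(1).trim();
            if (body.contains(marker)) {
                hits.add(body);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical page: the data sits in a script, not in the visible markup.
        String html = "<html><body><div id=\"items\"></div>"
                + "<script>var items = [\"a\", \"b\"];</script>"
                + "<script src=\"app.js\"></script></body></html>";
        System.out.println(scriptsContaining(html, "items"));
    }
}
```

The marker would typically be a value you can see in the rendered page (a product name, a price) so that you find the script that carries it.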
