How to parse the inner contents of an HTML element with Jsoup?

Question

The site presents each article with HTML like this:

 <article class="day_news_item">
   <div class="day_news_item_img">
     <a href="/world/20151117/1322854695.html">
       <img src="http://cdn12.img22.ria.ru/images/132265/34/1322653488.jpg" alt="Президент России Владимир Путин. Архивное фото" title="Президент России Владимир Путин. Архивное фото" width="230" height="130" class="media"></a>
   </div>
   <div class="day_news_item_text">
     <div class="day_news_item_title">
       <h3>
         <a href="/world/20151117/1322854695.html">Путин: совместная работа Китая и России стабилизирует обстановку в мире</a>
       </h3>
     </div>
     <div class="day_news_item_announce">
       <a href="/world/20151117/1322854695.html">Сотрудничество России и Китайской Народной Республики двигается вперед в области военно-технического сотрудничества, что является серьезным фактором, стабилизирующим обстановку в мире, заявил президент РФ. </a>
     </div>
   </div>
 </article>

A single page can contain many of them. Here is the constructor of the class I need to instantiate for each news item:

 public News(String imgRef, String title, String text, String date, String announce) {
     this.imgRef = imgRef;
     this.title = title;
     this.text = text;
     this.date = date;
     this.announce = announce;
 }

Question: what is the best way to do this with Jsoup? I do not really understand how to extract information from nested tags and their attributes.

I hope the question is clear. Thanks!

I saved the page with wget, opened its HTML, and found the content I need there (with all its nested tags and attributes).

That content was preceded by the following script:

 <script>$(document).ready(function() { checkBannerHeight(17); });</script></div></div><div xmlns:str="http://exslt.org/strings" class="day_news"><div class="day_news_wrapper">

Maybe it will suggest how to get the data from the main page without following the link to each news item separately?

Accepted Answer · 2015-11-17T14:49:58

Load the document as described, for example, here or here.
Find selectors for the elements that contain the values you want.
Read the values you need from the attributes or from the element text.

For example:

 Document doc = Jsoup.connect("http://example.com/").get();
 String imgURL = doc.select(".day_news_item .day_news_item_img img").attr("src");
 String title = doc.select(".day_news_item .day_news_item_title h3 a").text();

If the page contains several news items, first select the list of them:

 Elements newsElements = doc.select(".day_news_item");

Then loop over newsElements and, for each element, get the data you need using the select() method, as shown above.
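The loop described above might look like the following sketch. It assumes the markup from the question; the class name NewsParser and the nested News stand-in are made up for illustration, and text and date are left empty because the list markup does not contain them. Calling select() on each item rather than on the whole document keeps the values of different news items from mixing.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class NewsParser {

    /** Minimal stand-in for the News class from the question. */
    public static class News {
        public final String imgRef, title, text, date, announce;

        public News(String imgRef, String title, String text, String date, String announce) {
            this.imgRef = imgRef;
            this.title = title;
            this.text = text;
            this.date = date;
            this.announce = announce;
        }
    }

    /** Builds one News object per article.day_news_item element. */
    public static List<News> parse(Document doc) {
        List<News> result = new ArrayList<>();
        for (Element item : doc.select("article.day_news_item")) {
            // Selectors are evaluated relative to the current item only.
            String imgRef = item.select(".day_news_item_img img").attr("src");
            String title = item.select(".day_news_item_title h3 a").text();
            String announce = item.select(".day_news_item_announce a").text();
            // Assumption: text and date are not present in the list markup, so they stay empty.
            result.add(new News(imgRef, title, "", "", announce));
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened sample markup with the same class names as in the question.
        String html =
              "<article class=\"day_news_item\">"
            + "<div class=\"day_news_item_img\"><a href=\"/a.html\">"
            + "<img src=\"http://example.com/a.jpg\"></a></div>"
            + "<div class=\"day_news_item_text\">"
            + "<div class=\"day_news_item_title\"><h3><a href=\"/a.html\">Sample title</a></h3></div>"
            + "<div class=\"day_news_item_announce\"><a href=\"/a.html\">Sample announce</a></div>"
            + "</div></article>";

        for (News n : parse(Jsoup.parse(html))) {
            System.out.println(n.title + " | " + n.imgRef + " | " + n.announce);
        }
    }
}
```

For a live page, the same parse() method can be given the Document returned by Jsoup.connect(...).get().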

Updated

The site may serve the page in a different form (for example, it may treat your application as a mobile client or block the download altogether). To at least look like a browser, you need to set the User-Agent header. In Jsoup you can do it like this:

 Document doc = Jsoup.connect("http://site.url").userAgent("Mozilla").get();

If the site defends itself more actively, you will have to come up with something more elaborate, but that is already an arms race between scraping and protection. In your case, a User-Agent header is probably enough.
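A slightly fuller sketch of the same idea: send a complete browser-like User-Agent string and then check whether the expected news blocks are actually present in what the site returned. The class name PageFetcher, the helper names, the concrete User-Agent string, and the 10-second timeout are assumptions chosen for illustration, not values the site is known to require.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class PageFetcher {

    // Assumption: an example desktop browser string; any common one could be used instead.
    public static final String USER_AGENT =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    /** Prepares a request that identifies itself as a desktop browser; nothing is sent yet. */
    public static Connection browserLike(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .timeout(10_000); // milliseconds
    }

    /** Returns true if the document contains at least one news block. */
    public static boolean hasNews(Document doc) {
        return !doc.select(".day_news_item").isEmpty();
    }

    public static void main(String[] args) throws IOException {
        // The request is sent only when get() is called.
        Document doc = browserLike("http://example.com/").get();
        if (!hasNews(doc)) {
            System.err.println("No .day_news_item elements found; "
                + "the site may have returned a reduced page.");
        }
    }
}
```

If hasNews() returns false for a page that shows news in a browser, comparing doc.html() with the browser's page source is a reasonable next step.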

But then I ran into a difficulty: for some reason the doc object lacks some parts of the original web page (when I look at the page source in Chrome, those elements are displayed normally), so the result comes back empty.
@Alexander Dvortsov For example, because part of the content is generated by scripts.
You need to look at the HTML source of the page and at where the scripts get the content from.
First, download the page with a tool that does not run scripts (wget, curl, ...) and look at its structure (or add it to the question).
There is a script there, but it does not look like it fetches content from anywhere.
Maybe it is the string immediately after it: xmlns:str="http://exslt.org/strings"?
Actually, in Java doc.toString() contains the links to each news item, but not their pictures or other displayed properties.
I think following each link and parsing the individual news pages is not the best option.
I visited ria.ru with an empty User-Agent and found that it returns a shortened page to unrecognized browsers (apparently it treats them as mobile browsers).
