Good day. Could someone who knows regular expressions help me write the right one? I can't figure them out at all.

I work with jsoup, and through it I get the HTML code of one of the blocks I need:

 <!--Ad Injection:top--> <div style="margin-bottom:10px;"> <center> <script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script> <!-- MySite.ru Adaptive 1 --> <ins class="adsbygoogle" style="display:block" data-ad-client="ca-pub-6747406633235216" data-ad-slot="7606784485" data-ad-format="auto"></ins> <script> (adsbygoogle = window.adsbygoogle || []).push({}); </script> </center> </div> <div> Моя подруга пообещала подарить своей дочке новый iPad за то, что она перейдет из шестого класса в седьмой. </div> <div> Мне в детстве родители обещали дать по шее, если не перейду! </div> <div class="addtoany_share_save_container addtoany_content_bottom"> <div class="a2a_kit a2a_kit_size_32 addtoany_list a2a_target" id="wpa2a_1"> <a class="a2a_button_facebook" href="http://www.addtoany.com/add_to/facebook?linkurl=http%3A%2F%2Fwww.mysite.ru%2Farchives%2F45450&amp;linkname=%D0%90%D0%BD%D0%B5%D0%BA%D0%B4%D0%BE%D1%82%20%D0%BE%D1%82" title="Facebook" rel="nofollow" target="_blank"></a> <a class="a2a_button_twitter" href="http://www.addtoany.com/add_to/twitter?linkurl=http%3A%2F%2Fwww.mysite.ru%2Farchives%2F45450&amp;linkname=%D0%90%D0%BD%D0%B5%D0%BA%D0%B4%D0%BE%D1%82%20%D0%BE%D1%82" title="Twitter" rel="nofollow" target="_blank"></a> <a class="a2a_button_vk" href="http://www.addtoany.com/add_to/vk?linkurl=http%3A%2F%2Fwww.mysite.ru%2Farchives%2F45450&amp;linkname=%D0%90%D0%BD%D0%B5%D0%BA%D0%B4%D0%BE%D1%82%20%D0%BE%D1%82" title="VK" rel="nofollow" target="_blank"></a> <a class="a2a_button_odnoklassniki" href="http://www.addtoany.com/add_to/odnoklassniki?linkurl=http%3A%2F%2Fwww.mysite.ru%2Farchives%2F45450&amp;linkname=%D0%90%D0%BD%D0%B5%D0%BA%D0%B4%D0%BE%D1%82%20%D0%BE%D1%82" title="Odnoklassniki" rel="nofollow" target="_blank"></a> <a class="a2a_button_google_plus" href="http://www.addtoany.com/add_to/google_plus?linkurl=http%3A%2F%2Fwww.mysite.ru%2Farchives%2F45450&amp;linkname=%D0%90%D0%BD%D0%B5%D0%BA%D0%B4%D0%BE%D1%82%20%D0%BE%D1%82" title="Google+" 
rel="nofollow" target="_blank"></a> <a class="a2a_dd addtoany_share_save" href="https://www.addtoany.com/share"></a> <script type="text/javascript"><!-- if(wpa2a)wpa2a.script_load(); //--></script> </div> </div> 

I use the replaceAll method to replace strings, together with a regular expression I found on the Internet (there is no <br> in this example, but there are some on other pages of the site being parsed):

.replaceAll("<br>", "\n").replaceAll("\\<[^>]*>", "")

I end up with this:

 // boundary (adsbygoogle = window.adsbygoogle || []).push({}); Моя подруга пообещала подарить своей дочке новый iPad за то, что она перейдет из шестого класса в седьмой. Мне в детстве родители обещали дать по шее, если не перейду! // boundary 

Please tell me how to fix the regular expression so that the text comes out in a sensible form: remove the empty lines and extra spaces (it is not visible in the example, but each empty line contains several spaces, apparently depending on how many tags were on it), and also clean out (adsbygoogle = window.adsbygoogle || []).push({});

The text I need to extract:

 Моя подруга пообещала подарить своей дочке новый iPad за то, что она перейдет из шестого класса в седьмой. Мне в детстве родители обещали дать по шее, если не перейду! 
  • And which text from the block above do you want to get? - post_zeew
  • @post_zeew, updated the post. - Pollux
  • You can delete empty lines with, for example, String.replaceAll("\\s*?\\r?\\n\\s*?(?=\\r\\n|\\n)", "") . - post_zeew
  • @post_zeew, please tell me how to remove (adsbygoogle = window.adsbygoogle || []).push({}); ? Is it possible to write a regular expression that deletes everything between two given delimiters, for example between ( and ); ? - Pollux
  • And why delete it, if you can avoid adding it in the first place? Take advantage of jsoup . - post_zeew
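
For anyone who still wants the regex-only route from the comments above, here is a minimal sketch using only the Java standard library. It removes each script block together with its contents (a lazy match between an opening and a closing delimiter, which is the "everything between two delimiters" idea), turns <br> into line breaks, strips the remaining tags, and drops blank lines. The class name HtmlTextCleaner and the method clean are made up for illustration, and the sample input is a shortened stand-in for the real block. Regular expressions are fragile on real-world HTML, and entities such as &amp; are not decoded here, so treat this as a sketch rather than a replacement for a parser.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of jsoup or of the original post.
class HtmlTextCleaner {
    // (?is): case-insensitive, and '.' also matches line breaks.
    // The lazy .*? stops at the first closing </script>.
    private static final Pattern SCRIPT_BLOCK =
            Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern LINE_BREAK_TAG = Pattern.compile("(?i)<br\\s*/?>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String clean(String html) {
        // 1. Drop scripts with their contents, then HTML comments.
        String text = SCRIPT_BLOCK.matcher(html).replaceAll("");
        text = COMMENT.matcher(text).replaceAll("");
        // 2. Keep <br> as a real line break, then strip all other tags.
        text = LINE_BREAK_TAG.matcher(text).replaceAll("\n");
        text = ANY_TAG.matcher(text).replaceAll("");
        // 3. Trim every line and discard the ones that are empty or only whitespace.
        return Arrays.stream(text.split("\\R"))
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String html = "<div style=\"margin-bottom:10px;\">\n"
                + "  <script> (adsbygoogle = window.adsbygoogle || []).push({}); </script>\n"
                + "</div>\n"
                + "<div> First line.<br>Second line. </div>\n"
                + "   \n"
                + "<div> Third line. </div>";
        System.out.println(clean(html));
    }
}
```

For the sample in main, only the three text lines remain, one per line. The order of the steps matters: the script block has to go before the generic tag stripping, otherwise its body is left behind as plain text, which is exactly what happened in the question.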

1 answer

Let String page be your block. Then:

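 import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;
 import org.jsoup.nodes.Element;

 Document doc = Jsoup.parse(page);
 // In the block above, div 0 is the ad; divs 1 and 2 hold the text
 Element firstTextElement = doc.select("div").get(1);
 Element secondTextElement = doc.select("div").get(2);
 // html() keeps <br> tags, so turn them into line breaks
 String text = (firstTextElement.html() + "\n" + secondTextElement.html())
         .replace("<br>", "\n");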

As a result, text will hold the text you are looking for, with the <br> tags replaced by line breaks.

Depending on how the tag sits in the text, you can search for " <br>" or "<br> " instead of "<br>" in the replace call.

  • I used that before, but the code has changed a bit since then, so I have to adjust. Then let me ask about jsoup: a single post has such a structure. I pull it out with doc.select("div[class=inside-article] div[class=entry-content] div"); . As output I get such a not quite consistent list . If the text I need were always on every third line, there would be no problem, but the beginning of the list is irregular, so that approach doesn't work. - Pollux
  • So the question is: is it possible to give .select a strict condition so that it matches only a bare div , as in this situation (the second screenshot)? That is, if the div has something like style="margin-bottom:10px;" , exclude it? - Pollux
  • Yes. For example, like this: div:not([style]):not([class]) . - post_zeew
  • By the way, in case you don't know about it, there is such a useful thing. - post_zeew
  • Thanks for the "thing", I didn't know about it :) And now I know about div:not as well, a very useful thing that solves a lot of problems. Thanks a lot! - Pollux