I have some HTML text:

<p>text1</p> <p>text2</p> <img ...><br> <p>text3</p> <img ...><br> 

and so on.

I need to build a list in which each element is the text found between image tags: the first element is text1 + text2, the second is text3. I would rather not do this by hand, but I have not figured out how to do it with a parser or with regular expressions.

    2 answers

    Use the Jsoup library:

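     // Needs java.io.File plus org.jsoup.Jsoup, org.jsoup.nodes.Document and org.jsoup.nodes.Element
     Document doc = Jsoup.parse(new File("files/file.html"), "UTF-8");
     for (Element element : doc.getAllElements()) {
         if (element.tagName().equals("p")) {
             // Paragraphs between two images stay on the same output line
             System.out.print(element.text() + " ");
         } else if (element.tagName().equals("img")) {
             // An image closes the current line
             System.out.println();
         }
     }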

    Output

    text1 text2
    text3

      You can use the Jsoup library, for example:

       import org.jsoup.Jsoup;
       import org.jsoup.nodes.Document;
       import org.jsoup.nodes.Element;
       import org.jsoup.select.Elements;

       class HtmlParsingDemo {
           public static void main(String[] args) {
               String html = "<html><body><p>text1</p><p>text2</p><img src=\"some.jpg\"><br><p>text3</p><img src=\"another.jpg\"><br></body></html>";
               Document doc = Jsoup.parse(html);
               // Select every paragraph and print its text, one per line
               Elements paragraphs = doc.select("p");
               for (Element paragraph : paragraphs) {
                   System.out.println(paragraph.text());
               }
           }
       }

      You can also use a regular expression:

       import java.util.ArrayList;
       import java.util.regex.Matcher;
       import java.util.regex.Pattern;

       public class RegexDemo {
           public static void main(String[] args) {
               String html = "<html><body><p>text1</p><p>text2</p><img src=\"some.jpg\"><br><p>text3</p><img src=\"another.jpg\"><br></body></html>";
               // Capture whatever sits between the end of one tag and the start of the next
               Pattern p = Pattern.compile(">([^<]*)<");
               Matcher m = p.matcher(html);
               ArrayList<String> matches = new ArrayList<>();
               while (m.find()) {
                   String text = m.group(1);
                   // Adjacent tags produce empty captures; skip them
                   if (!text.isEmpty()) {
                       matches.add(text);
                   }
               }
               for (String match : matches) {
                   System.out.println(match);
               }
           }
       }
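The pattern above produces a flat list of text fragments. As a rough sketch of how the grouping asked about in the question could be done with regular expressions alone, the markup can first be split at every image tag and the paragraphs then collected inside each chunk. The class name `RegexGroupingDemo` and the helper `groupByImages` are made up for this illustration, and the sketch assumes that paragraphs are explicitly closed and contain no nested tags; real-world HTML is usually safer to handle with a parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexGroupingDemo {

    // Text of one <p>...</p> element; assumes no tags nested inside the paragraph.
    private static final Pattern PARAGRAPH =
            Pattern.compile("<p(?:\\s[^>]*)?>([^<]*)</p>");

    // Hypothetical helper: returns one entry per run of paragraphs between <img> tags.
    public static List<String> groupByImages(String html) {
        List<String> lines = new ArrayList<>();
        // Every <img ...> tag acts as a separator between groups.
        for (String chunk : html.split("<img[^>]*>")) {
            List<String> parts = new ArrayList<>();
            Matcher m = PARAGRAPH.matcher(chunk);
            while (m.find()) {
                String text = m.group(1).trim();
                if (!text.isEmpty()) {
                    parts.add(text);
                }
            }
            // Chunks without any paragraph text (for example a lone <br>) are skipped.
            if (!parts.isEmpty()) {
                lines.add(String.join(" ", parts));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<p>text1</p> <p>text2</p> <img src=\"some.jpg\"><br> "
                + "<p>text3</p> <img src=\"another.jpg\"><br>";
        for (String line : groupByImages(html)) {
            System.out.println(line);
        }
    }
}
```

Because the split happens on image tags, paragraphs that follow the last image still end up in a final group of their own.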

      UPDATE:

       import org.jsoup.Jsoup;
       import org.jsoup.nodes.Document;
       import org.jsoup.nodes.Element;
       import org.jsoup.select.Elements;
       import java.util.ArrayList;

       class HtmlParsingDemo {
           public static void main(String[] args) {
               String html = "<html><body><p>text1</p><p>text2</p><img src=\"some.jpg\"><br><p>text3</p><img src=\"another.jpg\"><br></body></html>";
               ArrayList<String> texts = new ArrayList<>();
               Document doc = Jsoup.parse(html);
               // Paragraphs and images come back in document order
               Elements elements = doc.select("p, img");
               ArrayList<String> text = new ArrayList<>();
               for (Element element : elements) {
                   if (element.tagName().equals("p")) {
                       text.add(element.text());
                   } else {
                       // An image closes the current group of paragraphs
                       texts.add(String.join(" ", text));
                       text.clear();
                   }
               }
               // Keep paragraphs that follow the last image
               if (!text.isEmpty()) {
                   texts.add(String.join(" ", text));
               }
               texts.forEach(System.out::println);
           }
       }
      • Thanks for the answer, but I need to combine the paragraphs that sit between image tags. How can I tell which ones to merge if all I end up with is a flat list of paragraphs? - Alex
      • The answer showed the general principle of "how to do it with a parser or regular expressions", but you are apparently waiting for a ready-made solution. I have updated the answer. - Sergey Gornostaev