Hello! I am trying to read a .docx file using the Apache POI Java API. I use:

public static String view(String nameDoc) {
    String text = null;
    try {
        XWPFDocument docx = new XWPFDocument(new FileInputStream(nameDoc));
        XWPFWordExtractor we = new XWPFWordExtractor(docx);
        text = we.getText();
        we.close();
        docx.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return text;
}

This way I only get the text of the file, but my files vary: some contain not only text but also tables, images, etc. How do I get the full content of the file?

On Max's advice, I tried WordToHtmlConverter:

public static String getDocHtml(String nameDoc) {
    String html = null;
    try {
        Document newDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument);
        HWPFDocument doc = new HWPFDocument(new FileInputStream(nameDoc));
        wordToHtmlConverter.processDocument(doc);
        StringWriter stringWriter = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(
                new DOMSource(wordToHtmlConverter.getDocument()),
                new StreamResult(stringWriter));
        html = stringWriter.toString();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return html;
}

I pass the result to a JSP, but nothing is shown on the page. Error: org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
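The exception text distinguishes two container formats: Office 2007+ XML files such as .docx, and the older OLE2 binary files such as .doc, which is what HWPFDocument and WordToHtmlConverter expect. As a minimal sketch that does not use POI at all, the two can be told apart by their leading bytes. The class name `WordFormatSniffer` and the returned labels are made up for illustration; POI ships its own detection utilities, which are not shown here.

```java
import java.util.Arrays;

final class WordFormatSniffer {
    // First 8 bytes of an OLE2 compound file (binary .doc).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // First 4 bytes of a ZIP archive (.docx is a ZIP package).
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a file by its leading bytes; the labels are hypothetical. */
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return "OOXML";
        }
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        // Hand-written header bytes, not a real document.
        byte[] zipHeader = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00};
        System.out.println(sniff(zipHeader));
    }
}
```

To use it on a real file, read its first eight bytes and pass them to `sniff`. A result of "OOXML" only says the file is a ZIP archive, so it suggests rather than proves that the XWPF classes are the right ones to use.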

    1 answer

    I advise you to look at the documentation for XWPFDocument: https://poi.apache.org/apidocs/org/apache/poi/xwpf/usermodel/XWPFDocument.html

    In particular, there is a getTablesIterator() method that iterates over all the tables in the document.

    • In addition to tables, I also have pictures. I know there are separate methods for extracting pictures and tables, but how do I bring all of this together in the right order? I.e., how do I get the whole document in its original form and render it, for example, in a JSP page? - Oleg1n
    • @Oleg1n You can use the write method to get the document as a byte array like this and give it to the user like this - Max
    • Based on the articles you linked, I did the following: public static byte[] getText(String nameDoc){ ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(); try { XWPFDocument docx = new XWPFDocument(new FileInputStream(nameDoc)); docx.write(byteArrayOutputStream); docx.close(); byteArrayOutputStream.close(); } catch (Exception e) { e.printStackTrace(); } return byteArrayOutputStream.toByteArray(); } - Oleg1n
    • Now I need to pass this array to the JSP page so that the contents of the document are displayed there right away, without an open or save dialog. - Oleg1n
    • I.e., you want to generate HTML from a Word document? In that case, you can try WordToHtmlConverter. Here is an example of how to use it - Max
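For orientation on where the text, tables, and pictures discussed in this thread live: a .docx file is a ZIP package, and Word typically stores the body (paragraphs and tables, in document order) in `word/document.xml` and embedded images under `word/media/`. The sketch below lists the parts of such a package using only the JDK, not POI. The class `DocxParts` and the tiny stand-in package built by `samplePackage()` are hypothetical; the sample is not a valid Word document and only mimics the usual part names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class DocxParts {
    /** Returns the entry names of a ZIP package in the order they are stored. */
    static List<String> listParts(byte[] packageBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(packageBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }

    /** Builds a stand-in package with typical part names (illustration only). */
    static byte[] samplePackage() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<placeholder/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/media/image1.png"));
            zip.write(new byte[] {1, 2, 3}); // dummy bytes, not a real PNG
            zip.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (String name : listParts(samplePackage())) {
            System.out.println(name);
        }
    }
}
```

Passing the bytes of a real .docx file to `listParts` shows which parts that particular file contains. This only inspects the package layout; turning those parts into HTML for a JSP page still requires a converter.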