I use Apache POI 4.0.0 to convert .doc files to HTML. Most files are converted perfectly, and I like the result, but some of them are not converted completely.
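My conversion code:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.StringWriter;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerConfigurationException;
    import javax.xml.transform.TransformerException;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.TransformerFactoryConfigurationError;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.converter.PicturesManager;
    import org.apache.poi.hwpf.converter.WordToHtmlConverter;
    import org.apache.poi.hwpf.usermodel.PictureType;
    import org.w3c.dom.Document;

    private static String ProcessingDoc(File doc, String imagedir)
            throws IOException, ParserConfigurationException,
                   TransformerConfigurationException, TransformerFactoryConfigurationError {
        // Load the binary .doc file; the input stream is closed once the document is read.
        HWPFDocument doc_file;
        try (FileInputStream in = new FileInputStream(doc)) {
            doc_file = new HWPFDocument(in);
        }

        Document html_file = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        WordToHtmlConverter converter = new WordToHtmlConverter(html_file);

        // Write embedded pictures into imagedir and reference them by file name.
        converter.setPicturesManager(new PicturesManager() {
            @Override
            public String savePicture(byte[] content, PictureType pictureType,
                                      String suggestedName, float widthInches, float heightInches) {
                // getParentDirectory is a helper from my own code (not shown here).
                File imgFile = new File(getParentDirectory(doc));
                if (!imgFile.exists()) {
                    imgFile.mkdirs();
                }
                try (FileOutputStream out = new FileOutputStream(imagedir + "/" + suggestedName)) {
                    out.write(content);
                } catch (Exception e) {
                    e.printStackTrace();
                }
                return suggestedName;
            }
        });
        converter.processDocument(doc_file);

        // Serialize the resulting DOM into an HTML string.
        StringWriter stringWriter = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        try {
            transformer.transform(new DOMSource(converter.getDocument()), new StreamResult(stringWriter));
        } catch (TransformerException e) {
            e.printStackTrace();
        }
        return stringWriter.toString();
    }

That is, the resulting HTML file simply ends at an arbitrary place. For example, like this: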
    <html>
    <head>
    <META http-equiv="Content-Type" content="text/html; charset=utf-8">
    <style type="text/css">.b1{white-space-collapsing:preserve;}
    .... Here goes the text ... And these are the last lines in the document.
    </p>
    </td>
    </tr>
    <tr class="r1">
    <td class="td5">
    <p class="p14">
    <span>ИДФилиала</span>
    </p>
    </td><td class="td6">

No errors occur during the conversion, but the file is truncated. What am I doing wrong? Does the library need some additional settings? Thanks in advance!
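In case it helps to narrow this down, a check along the following lines could show whether the string is already cut off when it leaves the transformer. This is only a sketch and not part of my converter: the class `TruncationCheck` and its methods `serialize` and `endsWithClosingHtml` are hypothetical names, and it uses only the JDK's XML classes, not POI. It serializes a DOM with the same output properties as my method and tests whether the result ends with `</html>`.

```java
import java.io.StringWriter;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper for illustration; not part of Apache POI.
class TruncationCheck {

    // Serializes a DOM with the same output properties used in ProcessingDoc.
    static String serialize(Document dom) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(writer));
        return writer.toString();
    }

    // Returns true if the markup ends with the closing root tag,
    // ignoring trailing whitespace and letter case.
    static boolean endsWithClosingHtml(String html) {
        return html.trim().toLowerCase(Locale.ROOT).endsWith("</html>");
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny stand-in document instead of a converted .doc file.
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = dom.createElement("html");
        Element body = dom.createElement("body");
        body.setTextContent("ИДФилиала");
        html.appendChild(body);
        dom.appendChild(html);

        String out = serialize(dom);
        // Check the complete output, then a deliberately cut-off copy.
        System.out.println(endsWithClosingHtml(out));
        System.out.println(endsWithClosingHtml(out.substring(0, out.length() / 2)));
    }
}
```

Calling `endsWithClosingHtml(stringWriter.toString())` just before the `return` in `ProcessingDoc` would presumably tell the two cases apart: if it returns true, the string is complete at that point and the cut would have to happen later, for example where the string is written to disk; if it returns false, the DOM produced by the converter or the transformation itself is the place to look.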