Hello. I am writing an Android application that parses a work schedule from a corporate page. The original page has a single table containing the schedules for all 12 months for each employee of the department. The table has about 600-650 rows. When I execute this code:

Document doc = Jsoup.connect("http://url.htm").get();

the document is saved in doc, as expected. However, the following expression:

 doc.select("tr").size(); 

returns 451. The first 451 rows parse (almost) without problems, but where are the rest?

Here is a fragment of the original page:

 <tr height=17 style='height:12.75pt'> <td height=17 class=xl9817500 style='height:12.75pt;border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6817500 style='border-top:none'>&nbsp;</td> <td class=xl12217500 style='border-top:none'>0</td> <td class=xl12217500 style='border-top:none'>0</td> <td class=xl13817500 
style='border-top:none'>0</td> <td class=xl13117500 style='border-top:none'>0</td> <td class=xl13117500 style='border-top:none'>0</td> <td class=xl10517500 style='border-top:none'>0</td> <td class=xl6553517500></td> <td class=xl6553517500></td> <td class=xl6553517500></td> </tr> <tr height=17 style='height:12.75pt'> <td height=17 class=xl9817500 style='height:12.75pt;border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 style='border-top:none'>&nbsp;</td> <td class=xl6717500 
style='border-top:none'>&nbsp;</td> <td class=xl6917500 style='border-top:none'>&nbsp;</td> <td class=xl6817500 style='border-top:none'>&nbsp;</td> <td class=xl12217500 style='border-top:none'>0</td> <td class=xl12217500 style='border-top:none'>0</td> <td class=xl13817500 style='border-top:none'>0</td> <td class=xl13117500 style='border-top:none'>0</td> <td class=xl13117500 style='border-top:none'>0</td> <td class=xl10517500 style='border-top:none'>0</td> <td class=xl6553517500></td> <td class=xl6553517500></td> <td class=xl6553517500></td> </tr> 

The entire page consists of tr and td elements like these. I have quoted the fragment where the document breaks off: the document downloaded by jsoup ends inside the second tr (in the code above), at the 23rd td. As far as I can tell, the table is generated automatically:

 <!--[if !excel]>&nbsp;&nbsp;<![endif]--> <!--The following information was generated by the Microsoft Excel Publish as Web Page wizard.--> <!--If this document is republished from Excel, all information between the DIV tags will be replaced.--> <!-----------------------------> <!--START OF OUTPUT FROM EXCEL PUBLISH AS WEB PAGE WIZARD --> <!-----------------------------> 

What could be the problem?

  • Try inspecting the contents of doc. - post_zeew

1 answer

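Thank you, I found the answer to my question here.

In my case, the problem was solved by calling .maxBodySize(0). By default jsoup caps the size of the response body it reads (the default has been 1 MB or 2 MB depending on the version) and truncates anything beyond it, which is why the document broke off in the middle of a row. A value of 0 removes the cap.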

 Document doc = Jsoup.connect("http://url.htm").maxBodySize(0).get();
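
To illustrate why a size cap produces exactly this symptom (fewer rows than the page really has, with the last one cut off), here is a minimal sketch that uses only the Java standard library. It does not call jsoup and is not how jsoup is implemented; the class TruncationDemo and the helpers truncate and countRows are hypothetical names introduced for this example, and the 600-row table and 5000-byte limit are made-up values.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TruncationDemo {

    // Hypothetical helper: keeps at most maxBytes bytes of the body.
    // A limit of 0 is treated as "unlimited", mirroring the meaning of maxBodySize(0).
    static String truncate(String body, int maxBytes) {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        if (maxBytes == 0 || bytes.length <= maxBytes) {
            return body;
        }
        return new String(Arrays.copyOf(bytes, maxBytes), StandardCharsets.UTF_8);
    }

    // Hypothetical helper: counts "<tr" occurrences in raw HTML.
    // This is a rough text count for the demo, not a real HTML parse.
    static int countRows(String html) {
        int count = 0;
        int from = 0;
        while ((from = html.indexOf("<tr", from)) != -1) {
            count++;
            from += 3;
        }
        return count;
    }

    public static void main(String[] args) {
        // Build a synthetic table with 600 rows (assumed data, not the real page).
        StringBuilder sb = new StringBuilder("<table>");
        for (int i = 0; i < 600; i++) {
            sb.append("<tr><td>&nbsp;</td></tr>");
        }
        sb.append("</table>");
        String page = sb.toString();

        // With no limit, every row is present.
        System.out.println("unlimited: " + countRows(truncate(page, 0)));
        // With a byte limit smaller than the page, the tail of the table is lost.
        System.out.println("limited:   " + countRows(truncate(page, 5000)));
    }
}
```

If the row count reported by doc.select("tr").size() changes after adding .maxBodySize(0), the body size limit was the cause.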