Good day to all, the problem is this:

I have about 40-50 thousand 1 page pdf files, they have tables of the same structure. It is necessary from these tables to parse the data in JSON or xml, well, or in sheets and dictionaries that would be easier to work, to do mathematical operations.

I tried the tabula and pdf-table-extract packages. Did not give the expected result.

Example table: enter image description here

Python version 3.6.1 Anaconda 64 bit.

Operating system Windows 7 64 bit.

Tell me how you can parse such a structure. I thought to pull out all the characters from the file and cut this long string in pieces, but it is very long and in some files there are a lot of empty cells.

  • Your hell table is big. Its size makes you think that you are on the freelance market. Try to isolate the problem using a simple example. Give an example easier to enter, as well as the expected result. - m9_psy
  • Не дали ожидаемого результата - you can describe in more detail - what is wrong? - MaxU

0