Parser tables in pdf format

Question

Good day to all, the problem is this:

I have about 40-50 thousand 1 page pdf files, they have tables of the same structure. It is necessary from these tables to parse the data in JSON or xml, well, or in sheets and dictionaries that would be easier to work, to do mathematical operations.

I tried the tabula and pdf-table-extract packages. Did not give the expected result.

Example table:

Python version 3.6.1 Anaconda 64 bit.

Operating system Windows 7 64 bit.

Tell me how you can parse such a structure. I thought to pull out all the characters from the file and cut this long string in pieces, but it is very long and in some files there are a lot of empty cells.

Its size makes you think that you are on the freelance market.
Give an example easier to enter, as well as the expected result.
Не дали ожидаемого результата - you can describe in more detail - what is wrong?

Parser tables in pdf format

0

More articles: