Need advice how to extract data from PDF

Question

There is a PDF file in which there is a table with data. A table of the form "Name - Data" and so in the column. Photo attached I'm trying to find a way to implement how to pull out its data on a cell with the name. I considered the whole PDF as trying to pull it out through the regulars. But the cant is that it is not clear to understand where his data is and what data to pull out because the pdf is read and produces plain text (data can change in Data). Tried to do something like pdf decompress but something is not successful. Software is not enough on this topic. And I can not find out the internal tags. Someone who knows, tell me :) I will be very thankful For example, there is a cell Supplier that you need to pull out its data from Aker Solution ...

KAGG Design KAGG Design 20.5k 3 12 44 · Answer 1 · 2017-03-04T18:28:12

With pdf everything is bad. I tried everything I could - converting to Word, to Excel, a text file. The most promising way is to convert to html and parse using the selenium web driver.

KAGG Design

20.5k 3 12 44

hm, interesting thought. Thank. I'll try :) - JMNext
in this case, the markup (fonts, tables, color, headers) is reproduced quite well, which makes it quite easy to find the necessary elements. If there were no jambs ... - KAGG Design
Yes, only now it is necessary to figure out how to first convert the file through the code into the folder with the program, for example, and save the path to the saved file. And then just go to him and find the right one. So far, I haven't got everything in my head in a bunch — JMNext
What code? All you need to convert is Adobe Acrobat. - KAGG Design
I need to implement through code. What would the list of files threw. It counted the fields from the file and pulled them out into a separate file. But from a PDF file, I don’t understand how to make table cells. - JMNext

|

Aleksei aleksei 556 2 7 · Answer 2 · 2017-03-04T18:49:18

To work with PDF, I used iTextSharp open source library. Very powerful, but in order to get something out of the PDF, you have to go deep into how its structure is arranged. If you have a one-time task - perhaps it will be too much, but if you need to set up data import on a regular basis, that’s what you need.

Aleksei

556 2 7

Yes, but I have a problem like this. There are for example 10 such different tables with different data. I need to pull them out and write to another document. How to find a regular field can be. But how to understand where the data lie that relate to this field. Jamb all in it. - JMNext
one
PDF is inherently very different from DOC or, say, XLS. In it, most often, there are neither tables, nor cells, but there is text, vector lines and raster inserts. Therefore, in most cases it is possible to correlate a specific text with a specific field only by coordinates. But with the coordinates all very difficult, because sometimes for different elements its own coordinate system is used, given by the transformation matrix. And a lot depends on the source, because different programs form PDF in very different ways. - Aleksei
I still have such an idea. Need to check. And if you try to convert to xml? I wonder how it will look inside - JMNext
And I tried xml. Something better than html, something worse. But in both cases there are shoals. Sometimes the data is mixed. In general, there is enough fun. - KAGG Design
Yes, PDF is generally a fun format - JMNext

|

Need advice how to extract data from PDF

2 answers 2

More articles: