Convert Russian text from PDF to TXT in Python

Question

I am trying to extract Cyrillic text from a PDF into a TXT file using PyPDF2 in Python:

  import PyPDF2
  pdf_file = open('mail_cir.pdf', 'rb')
  read_pdf = PyPDF2.PdfFileReader(pdf_file)
  number_of_pages = read_pdf.getNumPages()
  page = read_pdf.getPage(0)
  page_content = page.extractText()
  print(page_content.encode('utf-8'))

No errors are raised, but the Cyrillic text is not extracted.

I tried changing how the file is opened:

  pdf_file = codecs.open('mail_cir.pdf', 'rb', encoding='utf-8')

Then an error is raised:

TypeError: Can't convert 'bytes' object to str implicitly

And another question:

When I convert a poem, an extra \n \n appears after the end of each line. How can I get rid of these characters?

As a check, I inserted some Latin-alphabet text into the Cyrillic text: the Latin text is displayed, the Russian is not.
Try print(page_content.decode('cp1251').encode('utf-8')). I haven't worked with Python for a long time and have no way to check right now whether it helps.
That gives an error: AttributeError: 'str' object has no attribute 'decode'
@A1essandro, I tested with the same library: extractText returned the text, but only the ASCII part; the Cyrillic was missing.
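
Both error messages above are what Python 3 reports when str and bytes are mixed: codecs.open with an encoding yields str where PyPDF2 expects raw bytes, and the str returned by extractText has nothing left to decode. A minimal sketch of the distinction, assuming Python 3; the sample string is only a stand-in for real page content:

```python
page_content = "На русском"  # stand-in for the str returned by extractText()

# Encoding produces bytes; print() on bytes shows escapes such as
# b'\xd0\x9d...' instead of readable Cyrillic.
encoded = page_content.encode("utf-8")
print(type(encoded).__name__)  # bytes

# A str is already decoded, so it has no .decode() method in Python 3.
print(hasattr(page_content, "decode"))  # False

# .decode('cp1251') only applies to bytes that really are cp1251-encoded.
raw = page_content.encode("cp1251")
print(raw.decode("cp1251") == page_content)  # True
```

So re-encoding cannot bring back characters that the extractor never produced; when the str does contain Cyrillic, print(page_content) without .encode() is enough on a Unicode-capable console.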

jfs · Accepted Answer · 2016-11-15T20:47:36

You can use PDFMiner to get Russian text from pdf:

  #!/usr/bin/env python
  import sys
  import pdfminer.high_level  # $ pip install pdfminer.six

  with open('mail_cir.pdf', 'rb') as file:
      pdfminer.high_level.extract_text_to_fp(file, sys.stdout)

pdf2txt.py shows how this function can be used — many options can be passed.
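
Since the question asks for a txt file rather than console output, here is a minimal sketch of saving the result to disk. It assumes a pdfminer.six release that provides pdfminer.high_level.extract_text (older releases may only have extract_text_to_fp); the helper names save_text and pdf_to_txt and the output file name are hypothetical.

```python
from pathlib import Path


def save_text(text, txt_path):
    """Write text as UTF-8 so Cyrillic survives regardless of the OS default encoding."""
    Path(txt_path).write_text(text, encoding="utf-8")


def pdf_to_txt(pdf_path, txt_path):
    """Extract text from a PDF with pdfminer.six and save it as a UTF-8 file."""
    # Imported lazily so that save_text() works even without pdfminer.six.
    from pdfminer.high_level import extract_text  # $ pip install pdfminer.six

    text = extract_text(pdf_path)  # returns a str
    save_text(text, txt_path)
    return text
```

Usage would look like pdf_to_txt('mail_cir.pdf', 'mail_cir.txt'). Passing the encoding explicitly matters mostly on Windows, where the default text encoding is often not UTF-8.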

Input (in pdf)

 English 🇬🇧 На русском 🇷🇺 Smiley: ☺ non-BMP smiley: 😂

Output (text in console)

 English На русском Smiley: ☺non-BMP smiley:

The Russian text is extracted correctly, but the non-BMP smiley 😂 (U+1F602) and the flags 🇬🇧 (U+1F1EC U+1F1E7) and 🇷🇺 (U+1F1F7 U+1F1FA) were lost during the conversion.

Code using PyPDF2, similar to the code in the question, was able to extract only characters in the ASCII range.
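
The second part of the question (the extra \n \n after each line of a poem) is not covered above. A minimal sketch, assuming the extra breaks show up in the extracted string as runs of empty or whitespace-only lines; collapse_blank_lines is a hypothetical helper, not part of PyPDF2 or pdfminer.six:

```python
import re


def collapse_blank_lines(text):
    """Collapse each run of blank (or whitespace-only) lines into one line break."""
    # A newline followed by one or more "optional spaces + newline" groups.
    return re.sub(r"[ \t]*\n(?:[ \t]*\n)+", "\n", text)


verse = "Line one\n \nLine two\n\nLine three\n"
print(collapse_blank_lines(verse))
```

Note that this also removes intentional blank lines between stanzas, so it only fits the case where every blank line is unwanted.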

But I have a problem: from pdftypes import PDFObjectNotFound raises ImportError: cannot import name 'PDFObjectNotFound'
@Olga: if you have difficulty installing the pdfminer.six package, ask a separate question specifically about installing it.
Add a minimal code example that leads to the ImportError (part of the code given in the answer above) and the full traceback.
I ran the code in the answer; you should not get an ImportError.
