I am trying to extract Cyrillic text from a PDF into plain text using PyPDF2 in Python:

    import PyPDF2

    pdf_file = open('mail_cir.pdf', 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content.encode('utf-8'))

No error is raised, but the Cyrillic text does not appear in the output.

I tried changing the line that opens the file:

  pdf_file = codecs.open('mail_cir.pdf', 'rb', encoding='utf-8') 

Then this error is raised:

TypeError: Can't convert 'bytes' object to str implicitly

A second question:

When I convert a poem, an extra \n \n is output after the end of each line. How can I get rid of these characters?

  • Are you sure the PDF contains text? It may just be showing an image of the page. - A1essandro
  • Of course the PDF contains text. I checked: I inserted some Latin text into the Cyrillic text, and the Latin text is displayed but the Russian is not. Thanks for the answer. - Olga
  • Try print(page_content.decode('cp1251').encode('utf-8')). I haven't worked with Python for a long time and can't check right now whether it helps. - A1essandro
  • It gives an error: AttributeError: 'str' object has no attribute 'decode' - Olga
  • @A1essandro, I tested with the same library and extractText returned the text, but only the ASCII part; the Cyrillic was missing. - gil9red

1 answer

You can use PDFMiner to extract Russian text from a PDF:

    #!/usr/bin/env python
    import sys

    import pdfminer.high_level  # $ pip install pdfminer.six

    with open('mail_cir.pdf', 'rb') as file:
        pdfminer.high_level.extract_text_to_fp(file, sys.stdout)

pdf2txt.py shows how this function can be used; many options can be passed.

Input (in pdf)

 English 🇬🇧 На русском 🇷🇺 Smiley: ☺ non-BMP smiley: 😂 

Output (text in console)

 English На русском Smiley: ☺non-BMP smiley: 

The Russian text is extracted correctly, but the non-BMP smiley 😂 (U+1F602) and the flags 🇬🇧 (U+1F1EC U+1F1E7) and 🇷🇺 (U+1F1F7 U+1F1FA) were lost during the conversion.
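
To see in advance which characters fall outside the Basic Multilingual Plane, and so might be dropped the way the smiley and flags were here, a small check like the one below can help. This is only a sketch: `split_by_plane` is a hypothetical helper name, and the sample string is typed in by hand to mirror the input above rather than read from a PDF.

```python
def split_by_plane(text):
    """Split text into BMP characters and non-BMP characters (hypothetical helper)."""
    # Code points above U+FFFF lie outside the Basic Multilingual Plane.
    bmp = [ch for ch in text if ord(ch) <= 0xFFFF]
    non_bmp = [ch for ch in text if ord(ch) > 0xFFFF]
    return bmp, non_bmp


# Hand-written sample mirroring the input shown above (an assumption, not PDF output).
sample = "На русском \U0001F1F7\U0001F1FA Smiley: \u263A non-BMP smiley: \U0001F602"

kept, at_risk = split_by_plane(sample)
at_risk_codes = ["U+%04X" % ord(ch) for ch in at_risk]
print(at_risk_codes)  # ['U+1F1F7', 'U+1F1FA', 'U+1F602']
```

Cyrillic letters and ☺ (U+263A) are inside the BMP, which is consistent with them surviving the conversion.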


Code using PyPDF2, similar to the code in the question, was able to extract only characters in the ASCII range.
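
As for the extra \n \n after each line: whichever library produces the text, one possible way to handle it is to post-process the extracted string. The sketch below assumes the unwanted characters are runs of newlines, possibly with spaces or tabs between them; `collapse_blank_lines` is a hypothetical helper name, not part of PyPDF2 or pdfminer.

```python
import re


def collapse_blank_lines(text):
    """Replace each run of newlines (with optional spaces or tabs between) by one newline."""
    # Matches a newline followed by one or more "optional whitespace + newline" groups.
    return re.sub(r"\n(?:[ \t]*\n)+", "\n", text)


# Hand-written sample imitating the "\n \n" pattern described in the question.
raw = "first line\n \nsecond line\n\nthird line\n"
cleaned = collapse_blank_lines(raw)
print(cleaned)
```

Note that this also removes intentional blank lines, such as stanza breaks in a poem, so it may need adjusting if those should be kept.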

  • Thank you so much for the answer! But I have a problem: from pdftypes import PDFObjectNotFound gives ImportError: cannot import name 'PDFObjectNotFound - Olga
  • @Olga: if you are having difficulty installing the pdfminer.six package, ask a separate question specifically about installing it. State clearly which command you used to install it ( pip install pdfminer.six ), which OS you are on, your Python version, and pdfminer.__version__ . Add a minimal code example that leads to the ImportError (part of the code given in the answer above) and the full traceback. I ran the code in the answer; you should not get an ImportError. - jfs
  • The issue is resolved! Thank you very much! - Olga