I am trying to convert a PDF containing Cyrillic text to txt using PyPDF2 in Python:
    import PyPDF2

    pdf_file = open('mail_cir.pdf', 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content.encode('utf-8'))

No errors are raised, but the text does not appear in the output.
I tried changing the line that opens the file to:

    pdf_file = codecs.open('mail_cir.pdf', 'rb', encoding='utf-8')

This raises an error:
TypeError: Can't convert 'bytes' object to str implicitly
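A plausible reading of this error is that a PDF is a binary format: `codecs.open` with an `encoding` hands back decoded `str` data, while a PDF parser expects raw `bytes`. The sketch below only illustrates that difference in types. It does not use PyPDF2, and the temporary file and the helper name `read_types` are hypothetical stand-ins for illustration.

```python
import codecs
import os
import tempfile


def read_types(path):
    """Return the types read via plain binary open() and via codecs.open() with an encoding."""
    with open(path, "rb") as f:
        raw = f.read(4)          # plain binary mode yields bytes
    with codecs.open(path, "rb", encoding="utf-8") as f:
        decoded = f.read(4)      # with an encoding, the data is decoded to str
    return type(raw), type(decoded)


# Hypothetical stand-in for a real PDF: a temp file with a PDF-like header.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")
    demo_path = tmp.name

print(read_types(demo_path))
os.remove(demo_path)
```

If this reading is right, the file should stay opened with the plain `open(..., 'rb')` from the first snippet, and the missing Cyrillic has to be addressed elsewhere.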
And another question:
If I convert a poem, an extra "\n \n" is output after the end of each line. How can I get rid of these characters?
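For the stray "\n \n" sequences, one possible approach is to post-process the extracted string, assuming `page_content` is an ordinary Python 3 `str`. The sketch below is not part of PyPDF2. The helper name `collapse_blank_lines` and the sample text are made up for illustration.

```python
def collapse_blank_lines(text):
    """Strip surrounding whitespace from each line and drop whitespace-only lines."""
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Made-up sample imitating the "\n \n" pattern described above.
sample = "Первая строка\n \nВторая строка\n \n"
print(collapse_blank_lines(sample))
```

Note that this also removes intentional blank lines, such as stanza breaks in a poem, so it may need adjusting if those have to be kept.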
print(page_content.decode('cp1251').encode('utf-8')) Try it. I haven't worked with Python for a long time, and right now I have no way to check whether it helps. - A1essandro

AttributeError: 'str' object has no attribute 'decode' - Olga

extractText returned the text, but only the ASCII part; the Cyrillic was missing from it. - gil9red