Reading Cyrillic text from an RTF file in Python 3

Question

I ran into a problem in Python 3: I cannot read Cyrillic text from a file (rtf or txt). The terminal shows unreadable values such as \u2424 instead of the text. I have tried everything and cannot figure out how to re-encode it so that it prints normally. Also, how do I read a single word from a file? Thanks for any answers! I have just found the following way to open the file, but the output does not change:

    open('...', 'r', encoding='utf-8')

    file = open('some_text.rtf', 'r')
    print(file.readlines())

Here is the output for the Cyrillic text:

 ['\xd0\x92\xd0\xb0\xd1\x88 \xd1\x88\xd0\xb5\xd0\xb4\xd0\xb5\xd0\xb2\xd1\x80 \xd0\xb3\xd0\xbe\xd1\x82\xd0\xbe\xd0\xb2!\n', '\xd0\xa1 \xd0\xb4\xd1\x80\xd1\x83\xd0\xb3\xd0\xbe\xd0\xb9

See the help page: How to create a minimal, self-contained, and reproducible example
open has an encoding parameter; if it is not specified, the system default encoding is used.
That did not help; the output does not change. I also wrote #encoding utf-8 at the beginning of the file, and that did not help either.
@VadimVova, add to the question which encoding you open the file with and what is displayed.
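The escapes in the output shown in the question look like UTF-8 bytes displayed one byte per character. Assuming that is what happened (the file is UTF-8 but was decoded with a one-byte encoding such as Latin-1), a minimal sketch of recovering the text looks like this. The variable names are only illustrative.

```python
# The first list item from the question's output, copied verbatim.
garbled = ('\xd0\x92\xd0\xb0\xd1\x88 \xd1\x88\xd0\xb5\xd0\xb4\xd0\xb5\xd0\xb2\xd1\x80 '
           '\xd0\xb3\xd0\xbe\xd1\x82\xd0\xbe\xd0\xb2!\n')

# Assumption: each character is really one raw byte of UTF-8 data.
# Latin-1 maps code points 0-255 back to the same byte values,
# so encoding with it restores the original bytes.
raw_bytes = garbled.encode('latin-1')
repaired = raw_bytes.decode('utf-8')

print(repaired)  # Ваш шедевр готов!
```

If this assumption holds, passing encoding='utf-8' to the same open() call that reads the file should avoid the problem in the first place.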

Accepted Answer · 2016-06-02T07:10:34

RTF is not a plain-text format, so simply reading the file will not give you the text. It also stores tables of fonts, colors, styles, and who knows what else. Moreover, the text is apparently stored there not as \u1234 bytes but as literal '\u1234' strings (!) (or in another suitable encoding, for example win-1251, in which case a Cyrillic character is represented as \'b2, \'a4). Fortunately, there are a couple of old libraries. Here is an example using pyth (it worked for a simple file with one line):

    from pyth.plugins.rtf15.reader import Rtf15Reader

    doc = Rtf15Reader.read(open("doc.rtf", "r"))
    for paragraph in doc.content:
        for word in paragraph.content:
            print(word.__dict__["content"])  # Output as a unicode string

The documentation is not great, and it is not clear exactly how the library handles tables and images.
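To illustrate the \'hh form mentioned in this answer, here is a minimal sketch that decodes such escapes in a short fragment. It assumes a single-byte code page (cp1251 here) and ignores everything else in RTF (groups, control words, font tables). The function name decode_hex_escapes is hypothetical, not part of any library.

```python
import re


def decode_hex_escapes(fragment, codepage="cp1251"):
    """Replace each RTF \\'hh escape with the character it encodes.

    Assumption: the code page is single-byte, so every escape can be
    decoded on its own.
    """
    def replace(match):
        byte_value = int(match.group(1), 16)
        return bytes([byte_value]).decode(codepage)

    return re.sub(r"\\'([0-9a-fA-F]{2})", replace, fragment)


print(decode_hex_escapes(r"\'cf\'f0\'e8\'e2\'e5\'f2"))
```

This is only meant to show what the escapes mean; for real documents a proper RTF parser or converter is still needed.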

Also, when opening a text file (txt), is there any difference between working on a Mac and on Windows (in terms of encoding)?
@VadimVova, if you pass open the correct encoding of the txt file, it will open correctly on Mac, Windows, and Linux.
@VadimVova, if you have questions about txt files, create a separate question.
It seems that pyth works only on Python 2 and does not understand the \ucN command or surrogate pairs (needed for non-BMP characters).
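For the plain-text (txt) case raised in these comments, a minimal sketch of opening a file with an explicit encoding and taking a single word from it might look like this. It assumes the file is UTF-8 and that words are separated by whitespace. The helper read_words and the sample file are made up for the demonstration.

```python
import os
import tempfile


def read_words(path, encoding="utf-8"):
    """Return the whitespace-separated words of a text file."""
    # An explicit encoding makes the result the same on every OS.
    with open(path, "r", encoding=encoding) as f:
        return f.read().split()


# Demonstration with a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "some_text.txt")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("Ваш шедевр готов!\n")
    words = read_words(sample)

print(words[0])  # the first word of the file
```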

Answer 2 · 2016-06-02T23:39:32

If LibreOffice is already installed on the machine, you can rely on it to support even astral characters such as emoji, flags, etc. pyth and most other RTF libraries can lose characters from an RTF document that require UTF-16 surrogate pairs, for example test.rtf:

 {\rtf1\ansi\ansicpg1251\uc0 test [\'ff] [\u9786] [\u-10187\u-9138] [\u-10180\u-8710\u-10180\u-8712].}
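The negative numbers in this sample are UTF-16 code units written as signed 16-bit integers, and two of them in a row can form a surrogate pair. As a rough sketch of what a reader has to do with them (the function name decode_u_escapes is hypothetical, and the sketch assumes \uc0, that is, no fallback characters after each \uN):

```python
import re
import struct


def decode_u_escapes(fragment):
    """Decode a run of RTF \\uN escapes into a str.

    Assumption: \\uc0 is in effect, so no fallback characters follow
    the escapes and nothing has to be skipped.
    """
    # Map signed 16-bit values (e.g. -10187) to unsigned code units.
    units = [int(n) & 0xFFFF for n in re.findall(r"\\u(-?\d+)", fragment)]
    # Pack the units as little-endian UTF-16 and let the codec
    # combine surrogate pairs into single characters.
    raw = struct.pack("<%dH" % len(units), *units)
    return raw.decode("utf-16-le")


for sample in (r"\u9786", r"\u-10187\u-9138",
               r"\u-10180\u-8710\u-10180\u-8712"):
    print(decode_u_escapes(sample))
```

A library that treats each \uN value as a complete character on its own cannot produce the last two results, which is where characters get lost.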

Running $ rtf2txt test.rtf saves the text to test.txt and prints it:

 test [я] [☺] [𝑎] [🇺🇸].

where rtf2txt is:

    #!/usr/bin/env python3
    """Convert rtf-file(s) to plain text using LibreOffice.

    Usage: rtf2txt <rtf-file>...
    """
    from getpass import getuser
    from pathlib import Path
    from subprocess import DEVNULL, check_call
    from sys import argv
    from tempfile import TemporaryDirectory

    filenames = argv[1:]
    with TemporaryDirectory('LibreOffice_Conversion_' + getuser()) as td:
        check_call([
            'soffice',
            '--headless',  # implied by convert-to
            # https://wiki.openoffice.org/wiki/Documentation/DevGuide/Spreadsheets/Filter_Options
            '--infilter="Rich Text Format (StarCalc)"',  # limit input formats
            # specify the encoding explicitly for the output
            '--convert-to', 'txt:Text (encoded):UTF8',
            # https://bugs.documentfoundation.org/show_bug.cgi?id=37531
            '-env:UserInstallation=' + Path(td).as_uri()
        ] + filenames, stdout=DEVNULL)
    for path in map(Path, filenames):
        print(path.with_suffix('.txt').read_text('utf-8'))
