I ran into a problem in Python 3: reading Cyrillic text from a file (an rtf or txt file). The terminal shows unreadable values such as \u2424 instead of the text. I have tried everything and cannot figure out how to re-encode it so that it prints normally. Also, how do I read a single word from a file? Thanks for any answers! I have just tried this method. The output does not change!

    open('...', 'r', encoding='utf-8')

    file = open('some_text.rtf', 'r')
    print(file.readlines())

Here is the output for the Cyrillic text:

 ['\xd0\x92\xd0\xb0\xd1\x88 \xd1\x88\xd0\xb5\xd0\xb4\xd0\xb5\xd0\xb2\xd1\x80 \xd0\xb3\xd0\xbe\xd1\x82\xd0\xbe\xd0\xb2!\n', '\xd0\xa1 \xd0\xb4\xd1\x80\xd1\x83\xd0\xb3\xd0\xbe\xd0\xb9 
  • Show your code and write what it actually displays. - insolor
  • open has an encoding parameter; if it is not specified, the system default encoding is used. Pass the file's encoding there. For example: open('...', 'r', encoding='utf-8') - gil9red
  • It did not help, the output does not change. I also wrote #encoding utf-8 at the beginning of the file, and that did not help either. - Vadim Vova
  • @VadimVova, add to the question which encoding you open the file with and what is displayed. Otherwise we are just guessing. - insolor
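For a plain txt file, the encoding advice from the comments can be sketched roughly as follows. This is only a minimal illustration, assuming the file really is UTF-8: the bytes are the first line of the output shown in the question, while the file name some_text.txt and the helper read_words are made up for the example. Splitting on whitespace is one simple way to get at a single word.

```python
import tempfile
from pathlib import Path

# The first line from the question's output, as raw UTF-8 bytes.
raw = (b'\xd0\x92\xd0\xb0\xd1\x88 \xd1\x88\xd0\xb5\xd0\xb4\xd0\xb5\xd0\xb2\xd1\x80 '
       b'\xd0\xb3\xd0\xbe\xd1\x82\xd0\xbe\xd0\xb2!\n')


def read_words(path, encoding='utf-8'):
    """Read a text file with an explicit encoding and split it into words."""
    with open(path, 'r', encoding=encoding) as f:
        return f.read().split()  # split on any whitespace


with tempfile.TemporaryDirectory() as td:
    # Hypothetical sample file, created here so the sketch is self-contained.
    sample = Path(td) / 'some_text.txt'
    sample.write_bytes(raw)
    words = read_words(sample)

print(words)     # the whole line as a list of words
print(words[0])  # just the first word
```

If the file is in another encoding (for example cp1251), pass that name instead. This does not apply to rtf files, which are not plain text.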

2 answers

RTF is not a plain-text format, so reading the file directly will not give you the text. Besides the text, it stores tables of fonts, colors, styles, and who knows what else. Moreover, the text apparently is not stored there as the encoded bytes of a character but as literal '\u1234' strings (!) (or in another suitable encoding, for example win-1251, in which case a Cyrillic character is represented as \'b2, \'a4). Fortunately, there are a couple of old libraries. Here is an example for pyth (it worked for a simple file with one line):

    from pyth.plugins.rtf15.reader import Rtf15Reader

    doc = Rtf15Reader.read(open("doc.rtf", "r"))
    for paragraph in doc.content:
        for word in paragraph.content:
            print(word.__dict__["content"])  # Output as a unicode string

The documentation is not great, and it is unclear exactly how the library handles tables and images.
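To make the \'xx form mentioned above more concrete, here is a minimal sketch that decodes such escapes by hand, assuming the document's code page is win-1251 (cp1251 in Python). The helper name decode_hex_escapes and the sample string are invented for the illustration. This is not a real RTF parser: it ignores control words, groups, and everything else in the format.

```python
import re


def decode_hex_escapes(rtf_fragment, codepage='cp1251'):
    r"""Replace each \'xx escape with the character it encodes in `codepage`."""
    def repl(match):
        # Two hex digits -> one byte -> one character in the given code page.
        return bytes([int(match.group(1), 16)]).decode(codepage)
    return re.sub(r"\\'([0-9a-fA-F]{2})", repl, rtf_fragment)


# Hypothetical fragment: six cp1251-encoded Cyrillic letters.
sample = r"\'cf\'f0\'e8\'e2\'e5\'f2"
decoded = decode_hex_escapes(sample)
print(decoded)
```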

  • I am also opening a text file (txt). Is there a difference between working on a Mac and on Windows (in terms of encoding, I mean)? - Vadim Vova
  • Does this library work on Python 3? - jfs
  • @VadimVova, if you pass the correct encoding of the txt file to open, the file will be read correctly on Mac, on Windows, and on Linux. - insolor
  • @VadimVova, if you have questions about txt files, create a separate question. - m9_psy
  • It seems that pyth works only on Python 2 and does not understand the \ucN command and surrogate pairs (needed for non-BMP characters) - jfs

If the machine already has LibreOffice, you can rely on it to handle even astral (non-BMP) characters such as emoji, flags, etc. pyth and most other rtf libraries can lose characters that require utf-16 surrogate pairs, as in this test.rtf:

 {\rtf1\ansi\ansicpg1251\uc0 test [\'ff] [\u9786] [\u-10187\u-9138] [\u-10180\u-8710\u-10180\u-8712].} 

The command $ rtf2txt test.rtf saves the text in test.txt and prints it:

 test [я] [☺] [𝑎] [🇺🇸]. 

where rtf2txt is:

    #!/usr/bin/env python3
    """Convert rtf-file(s) to plain text using LibreOffice.

    Usage: rtf2txt <rtf-file>...
    """
    from getpass import getuser
    from pathlib import Path
    from subprocess import DEVNULL, check_call
    from sys import argv
    from tempfile import TemporaryDirectory

    filenames = argv[1:]
    with TemporaryDirectory('LibreOffice_Conversion_' + getuser()) as td:
        check_call([
            'soffice', '--headless',  # implied by convert-to
            # https://wiki.openoffice.org/wiki/Documentation/DevGuide/Spreadsheets/Filter_Options
            '--infilter="Rich Text Format (StarCalc)"',  # limit input formats
            # specify the encoding explicitly for the output
            '--convert-to', 'txt:Text (encoded):UTF8',
            # https://bugs.documentfoundation.org/show_bug.cgi?id=37531
            '-env:UserInstallation=' + Path(td).as_uri()
        ] + filenames, stdout=DEVNULL)

    for path in map(Path, filenames):
        print(path.with_suffix('.txt').read_text('utf-8'))
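To see why surrogate pairs trip up simpler libraries, here is a rough sketch of how the \uN values from test.rtf map to characters. RTF writes each UTF-16 code unit as a signed 16-bit number, so units above 32767 show up as negative values. The helper decode_unicode_escapes is a made-up name, and the sketch assumes \uc0 (no fallback characters after each \uN), as in the sample. It is an illustration of the arithmetic, not a replacement for a real converter.

```python
import re


def decode_unicode_escapes(rtf_fragment):
    r"""Decode runs of \uN control words, joining UTF-16 surrogate pairs."""
    def repl(match):
        # Signed 16-bit value -> unsigned UTF-16 code unit (-10187 -> 0xD835).
        units = [int(n) % 0x10000
                 for n in re.findall(r"\\u(-?\d+)", match.group(0))]
        data = b"".join(unit.to_bytes(2, "little") for unit in units)
        # Decoding the whole run at once lets high and low surrogates pair up.
        return data.decode("utf-16-le")
    return re.sub(r"(?:\\u-?\d+)+", repl, rtf_fragment)


# The \uN groups from test.rtf above.
sample = r"[\u9786] [\u-10187\u-9138] [\u-10180\u-8710\u-10180\u-8712]"
decoded = decode_unicode_escapes(sample)
print(decoded)
```

A library that converts each \uN on its own, without joining adjacent code units, ends up with lone surrogates, which is one way such characters can get lost.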