If LibreOffice is already installed on the machine, you can rely on it to handle even astral symbols (characters outside the Basic Multilingual Plane) such as emoticons and flags. pyth and most other RTF libraries can lose characters that require UTF-16 surrogate pairs. For example, take test.rtf:
{\rtf1\ansi\ansicpg1251\uc0 test [\'ff] [\u9786] [\u-10187\u-9138] [\u-10180\u-8710\u-10180\u-8712].}
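To see where the surrogate pairs come from: RTF writes each UTF-16 code unit as a signed 16-bit decimal after \u, so code units above 32767 show up as negative numbers. Below is a minimal sketch of decoding such values by hand. The helper name decode_rtf_u_values is hypothetical and is not part of pyth or LibreOffice; it only illustrates the arithmetic.

```python
import struct


def decode_rtf_u_values(values):
    r"""Decode a sequence of RTF \uN values into a str.

    Hypothetical helper for illustration: each value is treated as one
    UTF-16 code unit, written as a signed 16-bit decimal.
    """
    # Map negative values back to unsigned code units, e.g. -10187 -> 0xD835.
    units = [v & 0xFFFF for v in values]
    # Pack the code units as little-endian 16-bit integers and decode them
    # as UTF-16, which joins surrogate pairs into single characters.
    data = struct.pack('<%dH' % len(units), *units)
    return data.decode('utf-16-le')


smiley = decode_rtf_u_values([9786])                         # one BMP code unit
italic_a = decode_rtf_u_values([-10187, -9138])              # one surrogate pair
flag = decode_rtf_u_values([-10180, -8710, -10180, -8712])   # two surrogate pairs
print(smiley, italic_a, flag)
```

A library that decodes each \uN value on its own, instead of joining the pair first, ends up with lone surrogates, which is one way such characters get dropped.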
The command $ rtf2txt test.rtf saves the text to test.txt and prints it:
test [я] [☺] [𝑎] [🇺🇸].
where rtf2txt is:
#!/usr/bin/env python3
"""Convert rtf-file(s) to plain text using LibreOffice.

Usage: rtf2txt <rtf-file>...
"""
from getpass import getuser
from pathlib import Path
from subprocess import DEVNULL, check_call
from sys import argv
from tempfile import TemporaryDirectory

filenames = argv[1:]
with TemporaryDirectory('LibreOffice_Conversion_' + getuser()) as td:
    check_call([
        'soffice', '--headless',  # implied by convert-to
        # https://wiki.openoffice.org/wiki/Documentation/DevGuide/Spreadsheets/Filter_Options
        '--infilter="Rich Text Format (StarCalc)"',  # limit input formats
        # specify the encoding explicitly for the output
        '--convert-to', 'txt:Text (encoded):UTF8',
        # https://bugs.documentfoundation.org/show_bug.cgi?id=37531
        '-env:UserInstallation=' + Path(td).as_uri()
    ] + filenames, stdout=DEVNULL)
for path in map(Path, filenames):
    print(path.with_suffix('.txt').read_text('utf-8'))
open has an encoding parameter; if it is not specified, the system-wide default encoding is used. Specify the file's encoding there, for example: open('...', 'r', encoding='utf-8') - gil9red
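As a small sketch of the point in this comment: passing encoding explicitly makes the result independent of the system locale. The function name read_text_utf8 and the temporary file are made up for the demonstration.

```python
import os
import tempfile


def read_text_utf8(path):
    """Read a text file as UTF-8 regardless of the system default encoding."""
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()


# Round-trip the sample text through a temporary file (illustration only).
sample = 'test [я] [☺] [𝑎] [🇺🇸].'
fd, tmp_path = tempfile.mkstemp(suffix='.txt')
try:
    # Write with an explicit encoding too, so both sides agree.
    with open(fd, 'w', encoding='utf-8') as f:
        f.write(sample)
    round_trip = read_text_utf8(tmp_path)
finally:
    os.remove(tmp_path)
```

Without the encoding argument, the same read could fail or produce mojibake on a system whose default encoding is not UTF-8, for example cp1251.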