Replace Russian lines from one file with lines from another: find Russian lines in a file

Question

There are two text files (test.txt, test1.txt). The first contains one version of the Russian translation, and the second contains translations into several languages, including Russian.

    (test.txt)
    Крушите всё!
    Нет! Вот, значит, до чего всё дошло…

    (test1.txt)
    Destroy everything!!
    ¡¡Destruyámoslo todo!!
    Руби и круши!
    Zerstört alles.
    Cassez tout !
    Distruggi tutto!!
    Destrua tudo!!
    No!! And so it has come to this…
    ¡¡No!! Ya hemos llegado a esto...
    Нет! Вот, значит, до чего все дошло…
    Nein!! Euer Ende ist nah ...
    Non !! Tout ça pour en arriver là...
    No!! E quindi siamo arrivati a questo...
    Não!! E, assim, chegou a isso…

There are also lists of characters: digits, punctuation symbols, and the Russian and English alphabets.

The question is: how do I replace the Russian lines in the second file with the lines from the first?

Here is what I have so far:

    eng = [
        'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S',
        'T','U','V','W','X','Y','Z',
        'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s',
        't','u','v','w','x','y','z',
        '0','1','2','3','4','5','6','7','8','9',
        '!','"','#','$','%','&','\'','(',')','*','+',',','-','.','/',':',';','<','=',
        '>','?','@','[','\\',']','^','_','`','{','|','}','~']
    rus = [
        'А','Б','В','Г','Д','Е','Ж','З','И','Й','К','Л','М','Н','О','П','Р','С','Т',
        'У','Ф','Х','Ц','Ч','Ш','Щ','Ы','Э','Ю','Я',
        'а','б','в','г','д','е','ж','з','и','й','к','л','м','н','о','п','р','с','т',
        'у','ф','х','ц','ч','ш','щ','ы','э','ю','я']
    symbols = [
        '0','1','2','3','4','5','6','7','8','9',
        '!','"','#','$','%','&','\'','(',')','*','+',',','-','.','/',':',';','<','=',
        '>','?','@','[','\\',']','^','_','`','{','|','}','~']

    srus = set(rus)
    seng = set(eng)
    ssym = set(symbols)
    sl = lambda a,b: a.intersection(b)

    a1 = []
    a2 = []
    with open(u'test1.txt','r',encoding='utf-8') as fdata, open('test.txt','r',encoding='utf-8') as sdata:
        for i,v in enumerate(fdata):
            value = [j for j in v]
            value = [j.replace('\n','') for j in value]
            svalue = set(value)
            a1.append(''.join(value))
        for i,v in enumerate(sdata):
            value = [j for j in v]
            value = [j.replace('\n','') for j in value]
            svalue = set(value)
            # if there are Russian, English and other characters
            if sl(srus,svalue) | (sl(srus,svalue) & sl(seng,svalue)) | (sl(srus,svalue) & sl(ssym,svalue)):
                a2.append(''.join(value))
            else:
                pass

@jfs No, the second file (test1.txt) contains both Russian and other languages.

Accepted Answer · 2016-11-19T17:06:26

There are two separate problems here:

1. Replacing lines in a file. In general, you cannot replace one line with another in a text file without rewriting everything from that point to the end of the file. The usual solution is either to load the entire file into memory, change the lines, and write the file back, or to write the result to a temporary file and rename that temporary file over the source file at the end.
2. Determining whether a line contains a Russian translation. Different criteria are possible here. For simplicity, we assume that lines with a Russian translation contain the specified characters (rus) and that only those lines contain such characters (there is no language mixing).
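The load-everything-into-memory option described above could be sketched roughly as follows. This is only an illustration, not the code from the answer: the names `replace_russian_lines` and `rewrite_in_place` are hypothetical, a line is treated as Russian if it contains any character from the Cyrillic block (an assumption), and the Russian lines in both files are assumed to appear in the same order and number. The sketch raises an error when the counts differ, which may help to spot input that breaks those assumptions.

```python
import re
from pathlib import Path

# Assumption: any character from the Cyrillic block marks a Russian line
# (no language mixing within a line).
CYRILLIC = re.compile('[\u0400-\u04FF]')


def replace_russian_lines(multilingual_lines, russian_lines):
    """Return multilingual_lines with each Cyrillic line replaced, in order."""
    replacements = [line.rstrip('\n') for line in russian_lines]
    result = []
    used = 0  # how many replacement lines have been consumed so far
    for line in multilingual_lines:
        line = line.rstrip('\n')
        if CYRILLIC.search(line):
            if used >= len(replacements):
                raise ValueError('more Russian lines than replacements')
            line = replacements[used]
            used += 1
        result.append(line)
    if used != len(replacements):
        raise ValueError('%d replacement lines were not used'
                         % (len(replacements) - used))
    return result


def rewrite_in_place(multilingual_path, russian_path):
    """Read both files fully, then overwrite the multilingual file."""
    path = Path(multilingual_path)
    with path.open(encoding='utf-8') as f:
        multilingual = f.readlines()
    with open(russian_path, encoding='utf-8') as f:
        russian = f.readlines()
    new_lines = replace_russian_lines(multilingual, russian)
    path.write_text('\n'.join(new_lines) + '\n', encoding='utf-8')
```

Unlike the temporary-file approach, this holds both files in memory at once, and an interruption during the final write can leave the file half-written, so it is best suited to small files.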

Assuming that the order and number of the Russian translations are the same in both files:

    #!/usr/bin/env python3
    from pathlib import Path
    from tempfile import NamedTemporaryFile

    path = Path('multilingual.txt')
    with path.open(encoding='utf-8') as multilingual_file, \
         open('russian.txt', encoding='utf-8') as russian_file, \
         NamedTemporaryFile('w', encoding='utf-8', dir=str(path.parent),
                            delete=False) as output_file:
        for line in multilingual_file:
            if russian(line):
                line = next(russian_file)  # replace it
            print(line.rstrip('\n'), file=output_file)
    Path(output_file.name).replace(path)

where russian(line) determines whether line is a Russian translation:

    def russian(line, alphabet=set(rus)):
        return not alphabet.isdisjoint(line)  # does line contain characters from rus?

Regular expressions can also be used:

    import re

    russian = re.compile(r'|'.join(map(re.escape, rus))).search

or explicit ranges such as [\u0400-\u04FF] (whether this is appropriate depends on the task).
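A minimal sketch of the explicit-range variant, assuming that the Unicode Cyrillic block (U+0400 to U+04FF) is an acceptable test for this data. Note that this block also covers letters used by other Cyrillic-script languages, as well as letters such as ё, ъ and ь that are missing from the rus list in the question.

```python
import re

# Assumption: a line is Russian if it contains at least one character
# from the Unicode Cyrillic block U+0400-U+04FF.
russian = re.compile('[\u0400-\u04FF]').search

samples = [
    'Destroy everything!!',
    'Руби и круши!',
    'Não!! E, assim, chegou a isso…',
]
# Keep only the lines where the pattern finds a Cyrillic character.
russian_samples = [line for line in samples if russian(line)]
```

Here `russian` returns a match object (truthy) or None, so it can be used directly in the `if russian(line):` test.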

Strangely, starting from line 2214 the text began to shift by one line, and sometimes a line was simply deleted, leaving only punctuation marks.
@CockLobster That is not strange. It is to be expected that somewhere in a couple of thousand lines there is an error in the file that violates the assumptions the code relies on (for example, an unexpected newline \n).
To avoid this, you need to use a more error-resistant file format.
For example, number all the translations (add a number before each translation).
Then a formatting error in one translation will not affect the recognition of the translations that follow it.
Maybe I could split the text into parts, and then the code would work?
The code works (give or take bugs) if the input follows the assumptions explicitly stated in the answer.
If you think it does not, then give a minimal example of the input files, the expected output, and what happens instead when you use the code from the answer, and add all of that to the question.
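The numbering idea from the comments could look roughly like this. The file format here is purely hypothetical (it is not specified anywhere in the thread): every translation line is assumed to have the form `<number><TAB><text>`, with the same number used for the same entry in both files. The function names are also made up for illustration, and Russian lines are again detected by the Cyrillic block.

```python
import re

# Assumption: Cyrillic characters identify the Russian line of an entry.
CYRILLIC = re.compile('[\u0400-\u04FF]')


def parse_numbered(lines):
    """Map entry number -> text for lines of the form '<number>\\t<text>'."""
    table = {}
    for line in lines:
        number, sep, text = line.rstrip('\n').partition('\t')
        if sep and number.isdecimal():
            table[int(number)] = text
    return table


def replace_by_number(multilingual_lines, russian_lines):
    """Replace numbered Russian lines using the entry number as the key."""
    replacements = parse_numbered(russian_lines)
    result = []
    for line in multilingual_lines:
        line = line.rstrip('\n')
        number, sep, text = line.partition('\t')
        if (sep and number.isdecimal() and CYRILLIC.search(text)
                and int(number) in replacements):
            line = number + '\t' + replacements[int(number)]
        # Malformed or unnumbered lines are passed through unchanged.
        result.append(line)
    return result
```

Because each replacement is looked up by its number rather than by position, a stray newline or a missing line affects only that one entry instead of shifting everything after it.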
