There is a text file:

автозапчасти лексус новосибирск
автозапчасти лексус в туле
запчасти для lexus ls 460
разборка lexus rx
запчасти на лексус rx 330 бу
разборка lexus rx

How can I remove the duplicate lines with Python 3?


Found a working solution:

 file = 'C:\\words.txt'
 uniqlines = set(open(file, 'r', encoding='utf-8').readlines())
 gotovo = open(file, 'w', encoding='utf-8').writelines(set(uniqlines))

It removes the duplicates, but unfortunately it also changes the order of the lines, so the question still stands.

You can use the fileinput module to modify the file in place:

 #!/usr/bin/env python3
 import sys
 import fileinput

 with fileinput.FileInput(inplace=True, backup='.bak', mode='rb') as file:
     seen = set()
     for line in file:
         if line not in seen:  # first time
             seen.add(line)
             sys.stdout.buffer.write(line)  # redirected to the file

Example:

 T:\> python remove-duplicates-inplace.py C:\words.txt 

Lines are compared literally: even if two lines differ only in whitespace, they are considered different. You can normalize whitespace if necessary:

 for line in file:
     words = tuple(line.split())
     if words not in seen:
         seen.add(words)
         sys.stdout.buffer.write(line)
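To see what this normalization does, here is a small self-contained sketch that works on in-memory byte strings instead of a file. The function name and the sample lines are made up for illustration. Lines that differ only in spacing produce the same tuple of words, and the first occurrence is passed through unchanged:

```python
def unique_ignoring_spaces(lines):
    """Yield each line whose whitespace-normalized form was not seen before."""
    seen = set()
    for line in lines:
        # split() without arguments ignores leading/trailing whitespace
        # and treats any run of whitespace as a single separator
        key = tuple(line.split())
        if key not in seen:
            seen.add(key)
            yield line  # the original line is kept as-is


# hypothetical sample data: the second line differs only in spacing
sample = [b"lexus rx\n", b"lexus   rx \n", b"lexus ls 460\n"]
result = list(unique_ignoring_spaces(sample))
print(result)
```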

Alternatively, you can open the file manually:

 #!/usr/bin/env python3
 from collections import OrderedDict

 filename = r'C:\words.txt'
 with open(filename, encoding='utf-8') as file:
     uniq = OrderedDict.fromkeys(file)
 with open(filename, 'w', encoding='utf-8') as file:
     file.writelines(uniq)
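On Python 3.7 and later, a plain `dict` also keeps insertion order, so the same idea should work without `OrderedDict`. A minimal sketch on an in-memory list (the sample lines are hypothetical):

```python
# hypothetical sample data standing in for the lines of a file
lines = ["b\n", "a\n", "b\n", "c\n", "a\n"]

# dict keys are unique and keep the order of first insertion (Python 3.7+)
uniq = list(dict.fromkeys(lines))
print(uniq)
```

Only the first occurrence of each line is kept, in its original position relative to the other lines.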

Both solutions require that the unique lines fit in memory. If they don't, you can use an external sort so that duplicate lines end up next to each other in the file, and then remove them with an algorithm that doesn't load the unique lines into memory.
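The second step of that approach can be sketched roughly as follows, assuming the input has already been sorted (for example, by an external `sort` utility) so that equal lines are adjacent. The function name is made up for illustration, and `io.StringIO` objects stand in for real files. Only the previous line is kept in memory. Note that the result follows the sorted order, not the original order of the file:

```python
import io


def drop_adjacent_duplicates(src, dst):
    """Copy lines from src to dst, skipping any line equal to the one before it."""
    previous = None
    for line in src:
        if line != previous:
            dst.write(line)
        previous = line


# in-memory stand-ins for an already sorted input file and an output file
src = io.StringIO("a\na\nb\nc\nc\nc\n")
dst = io.StringIO()
drop_adjacent_duplicates(src, dst)
print(dst.getvalue())
```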

    Yes, you can. For example:

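     def delete_string():
         filename = 'test.txt'
         # collect the lines, keeping only the first occurrence of each
         str_list = []
         with open(filename, 'r') as file:
             for line in file:
                 if line not in str_list:
                     str_list.append(line)
         # write the unique lines back to the same file
         with open(filename, 'w') as file:
             for line in str_list:
                 file.write(line)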

    The code isn't high quality, but it will do :)

    • Doesn't work. Error: SyntaxError: def delete_string (): <string>, line 2, pos 20 - leonidtime
    • Is this the error you mean? I forgot the ":" after the function name. Fixed. - Pavel Durmanov

    mkstemp | os | shutil

     from tempfile import mkstemp
     from os import close
     from shutil import move

     def write_lines(file='words.txt'):
         ft, temp = mkstemp()  # create a temp file
         lines = []  # "unique" lines from file
         with open(temp, 'w') as t, open(file) as f:
             for line in f:  # read file line by line
                 if line not in lines:  # for lines not yet in lines
                     lines.append(line)  # save line in lines
                     t.write(line)  # write line to the temp file
         close(ft)  # close the temp file
         move(temp, file)  # move/rename the temp file to file

    • Try to write more detailed answers. Explain how your code works. - mymedia
    • There are comments in the code, too. Everything is obvious) - Pavel Durmanov
    • This is a quadratic algorithm, so using a solution like this makes doubly little sense. If the input is small (and time complexity doesn't matter), there is no point in using a temporary file: you can just read the lines into memory. If the input is moderately large, the quadratic algorithm will be too slow. - jfs