Remove duplicate lines from file while maintaining line order

Question

There is a text file:

    автозапчасти лексус новосибирск
    автозапчасти лексус в туле
    запчасти для lexus ls 460
    разборка lexus rx
    запчасти на лексус rx 330 бу
    разборка lexus rx

Can I remove the duplicate lines with Python 3?

I thought that there was a ready-made solution for such a common action.
Related question: How do you remove duplicates from a list whilst preserving order?

leonidtime · Answer 1 · 2017-02-23T00:55:28

Found a working solution:

    file = 'C:\\words.txt'
    uniqlines = set(open(file, 'r', encoding='utf-8').readlines())
    gotovo = open(file, 'w', encoding='utf-8').writelines(set(uniqlines))

It removes duplicates, but unfortunately it also changes the order of the lines, so the question remains open.

Comment: the conversion to a set is done twice for no reason, and the files are never closed.

Community spirit ♦ one · Answer 2 · 2017-02-23T07:03:29

You can use the fileinput module to modify the file in place:

    #!/usr/bin/env python3
    import sys
    import fileinput

    with fileinput.FileInput(inplace=True, backup='.bak', mode='rb') as file:
        seen = set()
        for line in file:
            if line not in seen:  # first time
                seen.add(line)
                sys.stdout.buffer.write(line)  # redirected to the file

Example:

 T:\> python remove-duplicates-inplace.py C:\words.txt

Lines are compared literally, so two lines that differ only in whitespace are considered different. You can normalize whitespace if necessary:

    for line in file:
        words = tuple(line.split())
        if words not in seen:
            seen.add(words)
            sys.stdout.buffer.write(line)
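As a self-contained illustration of the whitespace normalization above, here is a minimal sketch that works on any iterable of text lines rather than on fileinput. The function name `unique_by_words` is an assumption made for this example and is not part of the answer.

```python
def unique_by_words(lines):
    """Yield lines whose whitespace-separated words have not been seen before."""
    seen = set()
    for line in lines:
        # split() ignores leading, trailing and repeated whitespace
        key = tuple(line.split())
        if key not in seen:
            seen.add(key)
            yield line  # the first occurrence is kept exactly as written


lines = ["lexus rx\n", "lexus   rx \n", "lexus ls 460\n", "lexus rx\n"]
print(list(unique_by_words(lines)))  # ['lexus rx\n', 'lexus ls 460\n']
```

Only the comparison key is normalized; the line that gets written out is the first variant encountered, unchanged.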

You can also open the file manually:

    #!/usr/bin/env python3
    from collections import OrderedDict

    filename = r'C:\words.txt'
    with open(filename, encoding='utf-8') as file:
        uniq = OrderedDict.fromkeys(file)
    with open(filename, 'w', encoding='utf-8') as file:
        file.writelines(uniq)
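On Python 3.7 and later a plain `dict` is guaranteed to keep insertion order, so the same idea can be sketched without `OrderedDict`. The helper name `dedupe_lines` is hypothetical and used here only for illustration.

```python
def dedupe_lines(lines):
    """Return the lines with duplicates removed, keeping each first occurrence."""
    # dict keys are unique and (since Python 3.7) keep insertion order
    return list(dict.fromkeys(lines))


sample = ["b\n", "a\n", "b\n", "c\n", "a\n"]
print(dedupe_lines(sample))  # ['b\n', 'a\n', 'c\n']
```

Because an open text file is an iterable of lines, `dedupe_lines(file)` could be used in place of `OrderedDict.fromkeys(file)` above.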

Both solutions require that all unique lines fit in memory. If they do not, you can use an external sort so that duplicate lines end up adjacent in the file, and then remove them with an algorithm that does not need to hold the unique lines in memory.
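A sort-based approach can still preserve the original order if each line is tagged with its position before sorting. The sketch below shows the idea on a small in-memory list: the built-in sort stands in for a real external (on-disk) sort, and the function name `dedupe_via_sort` is an assumption made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def dedupe_via_sort(lines):
    """Remove duplicates by sorting, then restore the original order."""
    # 1. Tag each line with its position: (line, index).
    tagged = [(line, index) for index, line in enumerate(lines)]
    # 2. Sort by content so duplicates become adjacent
    #    (an external sort would do this step on disk).
    tagged.sort()
    # 3. Keep the first (lowest-index) entry of each run of equal lines.
    firsts = [next(group) for _, group in groupby(tagged, key=itemgetter(0))]
    # 4. Sort the survivors back into their original positions.
    firsts.sort(key=itemgetter(1))
    return [line for line, _ in firsts]


print(dedupe_via_sort(["b", "a", "b", "c", "a"]))  # ['b', 'a', 'c']
```

Step 3 only ever compares a line with its neighbour, which is what allows it to run as a streaming pass over sorted data.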

Answer 3 · 2017-02-22T05:55:11

Yes, you can. For example:

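    def delete_string():
        File = open('test.txt', 'r')
        str_list = []  # unique lines, in order of first appearance
        for i in File.readlines():
            if i not in str_list:  # linear search through the list
                str_list.append(i)
        File.close()
        File = open('test.txt', 'w')  # the original passed an undefined name `a` here
        for j in str_list:
            File.write(j)
        File.close()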

The code is not high quality, but it will do :)

Comment: this gives an error. SyntaxError: def delete_string (): <string>, line 2, pos 20

Answer 4 · 2017-02-23T13:55:42

mkstemp | os | shutil

    from tempfile import mkstemp
    from os import close
    from shutil import move

    def write_lines(file='words.txt'):
        ft, temp = mkstemp()  # create a temp file
        lines = []  # "unique" lines from file
        with open(temp, 'w') as t, open(file) as f:
            for line in f:  # read file line by line
                if line not in lines:  # for each line not yet in lines
                    lines.append(line)  # save line in lines
                    t.write(line)  # write line to the temp file
        close(ft)  # close the temp file
        move(temp, file)  # move/rename the temp file to file

Comment: this is a quadratic algorithm, so a solution like this makes doubly little sense. If the input is small (and time complexity does not matter), there is no point in using a temporary file: you can simply read the lines into memory. If the input is moderately large, the quadratic algorithm will be too slow.
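For comparison, a linear-time variant of the temporary-file approach might look like the sketch below. It swaps the list for a set, so each membership test takes constant time on average, but the unique lines are still held in memory. The function name `remove_duplicate_lines` and its parameters are assumptions made for illustration, not part of the answer.

```python
import os
import tempfile


def remove_duplicate_lines(path, encoding="utf-8"):
    """Rewrite the file at `path` without duplicate lines, keeping first occurrences."""
    seen = set()  # set membership is O(1) on average, unlike a list
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the final rename
    # stays on the same filesystem.
    with tempfile.NamedTemporaryFile(
        "w", encoding=encoding, newline="", dir=directory, delete=False
    ) as tmp, open(path, encoding=encoding, newline="") as src:
        for line in src:
            if line not in seen:
                seen.add(line)
                tmp.write(line)
    # Both files are closed here; replace the original with the filtered copy.
    os.replace(tmp.name, path)
```

This sketch does not clean up the temporary file if an error occurs midway, and a final line without a trailing newline is treated as different from the same text followed by a newline.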
