There is a text file with 1000 emails, one email per line. Some of them are repeated. The output after processing should be a file containing only the unique emails. How can this be done with Python 3?


    The quickest and easiest way to remove duplicates from a list is to convert it to a set. The set() constructor accepts any iterable, including a file object. After that, it only remains to join the set back into a single string and write it to another file:

        with open('emails.txt') as in_fh, open('deduplicated.txt', 'w') as out_fh:
            out_fh.write(''.join(set(in_fh)))
    • @Ivan Ivshykov keep in mind that the emails from the set will be written in arbitrary order - vadim vaduxa
    • It is worth mentioning that the code in the answer can break if the last line has no trailing newline (or, more generally, for lines that differ only by trailing whitespace). - jfs
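
    Both caveats from the comments can be addressed together. A minimal sketch (not part of the answer above): stripping each line makes the comparison ignore a missing final newline, and a dict keeps the first-seen order, since dicts preserve insertion order in Python 3.7+:

        # Sketch: deduplicate while preserving first-seen order.
        # dict.fromkeys() keeps insertion order (Python 3.7+);
        # .strip() normalizes lines with/without a trailing newline.
        with open('emails.txt') as in_fh, open('deduplicated.txt', 'w') as out_fh:
            unique = dict.fromkeys(line.strip() for line in in_fh)
            out_fh.write('\n'.join(unique) + '\n')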

    To print the unique e-mails from files given on the command line, or from standard input:

        #!/usr/bin/env python
        import fileinput

        print("\n".join(set(map(str.strip, fileinput.input()))))

    Example:

     $ dedup emails.txt >uniq-emails.txt 

    or:

     $ dedup < emails.txt >uniq-emails.txt 

    The code works even if the lines contain invisible trailing whitespace. For example, the last line in the file may or may not end with a newline; the result is still correct.
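
    For illustration (with a made-up address), str.strip is what makes this robust: the same address with and without a trailing newline collapses to a single set element:

        >>> set(map(str.strip, ['a@example.com\n', 'a@example.com']))
        {'a@example.com'}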


    Because set() is used, the result is printed in arbitrary order, which may vary from run to run. To emulate sort -u emails.txt , you can use groupby(sorted()) :

        #!/usr/bin/env python
        import fileinput
        from itertools import groupby

        for line, _ in groupby(sorted(map(str.strip, fileinput.input()))):
            print(line)

    Usage is the same: input is read from files or stdin, output is printed to stdout.
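
    As a quick illustration of the technique: sorting makes duplicates adjacent, and groupby() then yields one key per run of equal items:

        >>> from itertools import groupby
        >>> [key for key, _ in groupby(sorted(['b', 'a', 'b', 'a']))]
        ['a', 'b']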


    For the e-mail case this is not necessary, but in the general case, to print only the unique lines of large files that do not fit in memory, here is an analogue of LC_ALL=C sort -u < input in Python:

        #!/usr/bin/env python3
        import contextlib
        import heapq
        import sys
        from itertools import groupby
        from tempfile import TemporaryFile
        from operator import itemgetter


        def uniq(sorted_items):
            return map(itemgetter(0), groupby(sorted_items))


        sorted_files = []
        with contextlib.ExitStack() as stack:
            # sort lines in batches, write intermediate results to temporary files
            nbytes = 1 << 15  # read ~nbytes at a time
            for lines in iter(lambda f=sys.stdin.detach(): f.readlines(nbytes), []):
                lines.sort()
                file = stack.enter_context(TemporaryFile('w+b'))  # NOTE: file is deleted on exit
                file.writelines(uniq(lines))  # write sorted unique lines
                file.seek(0)  # rewind, to read later while merging partial results
                sorted_files.append(file)  # NOTE: do not close the temporary file yet

            # merge and write results
            sys.stdout = sys.stdout.detach()  # suppress ValueError: underlying buffer has been detached
            sys.stdout.writelines(uniq(heapq.merge(*sorted_files)))

    Example:

     $ sort-u < emails.txt >uniq-emails.txt 

    In this case input is accepted only from standard input, and the lines are compared as sequences of bytes (it is assumed that every line ends with a newline character).
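
    One detail worth seeing in isolation: heapq.merge() only interleaves already-sorted inputs and does not drop duplicates that occur in different temporary files, which is why uniq() is applied once more to the merged stream:

        >>> import heapq
        >>> list(heapq.merge([b'a\n', b'c\n'], [b'b\n', b'c\n']))
        [b'a\n', b'b\n', b'c\n', b'c\n']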

    Related question: Sorting text file by using Python.