Given: a very large file (many times larger than the amount of RAM in the computer) with many lines, each containing, for example, 1000 characters (Cyrillic, Latin, digits, punctuation, spaces) in random order. About 15% of the lines are duplicates deliberately scattered through the file. The task is to process this file and produce a new file that contains only the unique lines.

I work with Python 3.6. I tried processing the file in "bundles" of N lines (until RAM was 80% full), collecting each bundle into a set and appending the data from these sets to a new file. Unsuccessful.

I tried repeating the same procedure several times (that is, extracting only the unique values from the resulting "new" file and writing them to yet another new file, in a loop). Unsuccessful.

I tried taking bundles from different parts of the file and selecting unique lines in several passes. Also unsuccessful. I tried building the sets from every second (third, tenth, etc.) line. The result was equally disappointing.

Perhaps there is an efficient method?

  • Show the code you used to try to process the file, and if possible a link to download the file with the lines. - Alexander
  • I set the task for myself, and I also generated the file myself. The generation algorithm is, in principle, described above. I don't see the point in uploading a 60 GB file anywhere. - Konstantin
  • Do you want to implement sort -u file in Python? - jfs
  • Probably, yes. What interests me is not so much the implementation as the algorithm for the given conditions. - Konstantin

1 answer

Variants of external sorting will work, such as a merge sort that spills sorted chunks to files on disk. After that, simply walk through the sorted output and drop consecutive duplicates.
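A minimal sketch of such an external merge sort with deduplication, assuming the chunk size (an arbitrary 1,000,000 lines here) is tuned so that one chunk fits comfortably in RAM, and that every line ends with a newline:

    import heapq
    import itertools
    import os
    import tempfile

    def external_sort_unique(src_path, dst_path, max_lines_per_chunk=1_000_000):
        """Sort the file in chunks on disk, merge the chunks, keep only unique lines."""
        chunk_paths = []
        try:
            # Phase 1: read chunks that fit in RAM, sort each, spill to a temp file.
            with open(src_path, encoding='utf-8') as src:
                while True:
                    chunk = list(itertools.islice(src, max_lines_per_chunk))
                    if not chunk:
                        break
                    chunk.sort()
                    fd, path = tempfile.mkstemp(text=True)
                    with os.fdopen(fd, 'w', encoding='utf-8') as tmp:
                        tmp.writelines(chunk)
                    chunk_paths.append(path)

            # Phase 2: k-way merge of the sorted chunks. Equal lines are now
            # adjacent, so a duplicate is simply a line equal to the previous one.
            files = [open(p, encoding='utf-8') for p in chunk_paths]
            try:
                with open(dst_path, 'w', encoding='utf-8') as dst:
                    prev = None
                    for line in heapq.merge(*files):
                        if line != prev:
                            dst.write(line)
                            prev = line
            finally:
                for f in files:
                    f.close()
        finally:
            for p in chunk_paths:
                os.remove(p)

This needs roughly one pass to write the chunks and one pass to merge them, at the cost of temporary disk space about the size of the input file.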

But since the lines are long, I would instead use hashing, and in something like a HashMap (in Python that is a dict) keep pointers (byte offsets) to the positions of the lines in the file.

Furthermore, if the hashes differ, the lines are 100% different, so to identify duplicates you only need to compare lines that share the same hash value, and with a well-chosen hash function there are not many of those. For each such group, you can load the actual lines into memory if necessary. As a result, after removing duplicates, only pointers to the unique lines remain, and a final pass over the file can copy them to the output.
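A rough sketch of that idea in Python; hashlib.blake2b with an 8-byte digest is an assumed choice of hash, and note that the dict of digests and offsets itself must fit in RAM (for tens of millions of lines that may already be a few gigabytes):

    import hashlib
    from collections import defaultdict

    def unique_lines_by_hash(src_path, dst_path):
        """Group line offsets by a line hash, compare real lines only within groups."""
        # Phase 1: map digest -> byte offsets of lines with that digest.
        # Only small digests and offsets are kept in RAM, not the long lines.
        offsets_by_hash = defaultdict(list)
        with open(src_path, 'rb') as src:
            offset = 0
            for line in src:
                digest = hashlib.blake2b(line, digest_size=8).digest()
                offsets_by_hash[digest].append(offset)
                offset += len(line)

        # Phase 2: resolve each bucket to one offset per distinct line.
        unique_offsets = []
        with open(src_path, 'rb') as src:
            for offsets in offsets_by_hash.values():
                if len(offsets) == 1:
                    # No other line shares this hash, so the line is unique.
                    unique_offsets.append(offsets[0])
                    continue
                # Possible duplicates (or hash collisions): compare actual lines.
                seen = {}
                for off in offsets:
                    src.seek(off)
                    line = src.readline()
                    seen.setdefault(line, off)
                unique_offsets.extend(seen.values())

        # Phase 3: copy the unique lines to the output in their original order.
        unique_offsets.sort()
        with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
            for off in unique_offsets:
                src.seek(off)
                dst.write(src.readline())

Compared with the external sort, this avoids sorting the data itself and preserves the original line order, at the price of random seeks back into the source file for buckets that contain more than one line.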