Given: a very large file (many times larger than the amount of RAM in the computer) with many lines, each containing, for example, 1000 characters (Cyrillic, Latin, digits, punctuation, spaces) in random order. About 15% of the lines are duplicates deliberately scattered through the file. The task is to process this file and produce a new file that contains only the unique lines.

I work with Python 3.6. I tried processing the file in "bundles" of N lines (until RAM was 80% full), collecting each bundle into a set and appending the data from these sets to a new file. Unsuccessful.

I tried repeating the same procedure several times (that is, extracting only the unique values from the resulting "new" file and writing them to yet another new file, in a loop). Unsuccessful.

I tried taking bundles from different parts of the file and selecting unique lines in several passes. Also unsuccessful. I tried building the sets from every second (third, tenth, etc.) line. The result was equally disappointing.

Perhaps there is an efficient method?

  • Show the code you used to try to process the file, and if possible a link to download the file with the lines. - Alexander
  • I set the task for myself, and I also generated the file myself. The generation algorithm is, in principle, described above. I don't see the point in uploading a 60 GB file anywhere. - Konstantin
  • Do you want to implement sort -u file in Python? - jfs
  • Probably, yes. What interests me is not so much the implementation as the algorithm for the given conditions. - Konstantin

1 answer

Variants of external sorting will work, such as a merge sort that spills sorted chunks to files on disk. After that, simply walk through the sorted output and drop consecutive duplicates.
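A minimal sketch of such an external merge sort with deduplication, assuming the chunk size (an arbitrary 1,000,000 lines here) is tuned so that one chunk fits comfortably in RAM, and that every line ends with a newline:

    import heapq
    import itertools
    import os
    import tempfile

    def external_sort_unique(src_path, dst_path, max_lines_per_chunk=1_000_000):
        """Sort the file in chunks on disk, merge the chunks, keep only unique lines."""
        chunk_paths = []
        try:
            # Phase 1: read chunks that fit in RAM, sort each, spill to a temp file.
            with open(src_path, encoding='utf-8') as src:
                while True:
                    chunk = list(itertools.islice(src, max_lines_per_chunk))
                    if not chunk:
                        break
                    chunk.sort()
                    fd, path = tempfile.mkstemp(text=True)
                    with os.fdopen(fd, 'w', encoding='utf-8') as tmp:
                        tmp.writelines(chunk)
                    chunk_paths.append(path)

            # Phase 2: k-way merge of the sorted chunks. Equal lines are now
            # adjacent, so a duplicate is simply a line equal to the previous one.
            files = [open(p, encoding='utf-8') for p in chunk_paths]
            try:
                with open(dst_path, 'w', encoding='utf-8') as dst:
                    prev = None
                    for line in heapq.merge(*files):
                        if line != prev:
                            dst.write(line)
                            prev = line
            finally:
                for f in files:
                    f.close()
        finally:
            for p in chunk_paths:
                os.remove(p)

This needs roughly one pass to write the chunks and one pass to merge them, at the cost of temporary disk space about the size of the input file.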

But since the lines are long, I would instead use hashing, and in something like a HashMap (in Python that is a dict) keep pointers (byte offsets) to the positions of the lines in the file.

Furthermore, if the hashes differ, the lines are 100% different, so to identify duplicates you only need to compare lines that share the same hash value, and with a well-chosen hash function there are not many of those. For each such group, you can load the actual lines into memory if necessary. As a result, after removing duplicates, only pointers to the unique lines remain, and a final pass over the file can copy them to the output.
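A rough sketch of that idea in Python; hashlib.blake2b with an 8-byte digest is an assumed choice of hash, and note that the dict of digests and offsets itself must fit in RAM (for tens of millions of lines that may already be a few gigabytes):

    import hashlib
    from collections import defaultdict

    def unique_lines_by_hash(src_path, dst_path):
        """Group line offsets by a line hash, compare real lines only within groups."""
        # Phase 1: map digest -> byte offsets of lines with that digest.
        # Only small digests and offsets are kept in RAM, not the long lines.
        offsets_by_hash = defaultdict(list)
        with open(src_path, 'rb') as src:
            offset = 0
            for line in src:
                digest = hashlib.blake2b(line, digest_size=8).digest()
                offsets_by_hash[digest].append(offset)
                offset += len(line)

        # Phase 2: resolve each bucket to one offset per distinct line.
        unique_offsets = []
        with open(src_path, 'rb') as src:
            for offsets in offsets_by_hash.values():
                if len(offsets) == 1:
                    # No other line shares this hash, so the line is unique.
                    unique_offsets.append(offsets[0])
                    continue
                # Possible duplicates (or hash collisions): compare actual lines.
                seen = {}
                for off in offsets:
                    src.seek(off)
                    line = src.readline()
                    seen.setdefault(line, off)
                unique_offsets.extend(seen.values())

        # Phase 3: copy the unique lines to the output in their original order.
        unique_offsets.sort()
        with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
            for off in unique_offsets:
                src.seek(off)
                dst.write(src.readline())

Compared with the external sort, this avoids sorting the data itself and preserves the original line order, at the price of random seeks back into the source file for buckets that contain more than one line.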