Given: a very large file (many times larger than the machine's RAM) in which each line is about 1000 characters (Cyrillic and Latin letters, digits, punctuation, spaces) in random order. About 15% of the lines are duplicates, scattered throughout the file. The task is to process this file and produce a new file containing only the unique lines.
I am working in Python 3.6. I tried processing the file in "batches" of N lines (reading until RAM was about 80% full), collecting each batch into a set and appending the unique lines to a new file. That did not work.
I also tried repeating the process several times in a loop, i.e. extracting only the unique values from the resulting "new" file and writing them to yet another file. That did not work either.
I tried taking batches from different places in the file and running the unique-line selection several times, and I also tried building the sets from every second (third, tenth, etc.) line. The result was equally unsuccessful.
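For reference, here is a minimal sketch of the batched-set approach described above (my reconstruction, not the original code; `chunk_size` is a stand-in for the RAM-based limit). It shows why the attempts fail: each batch is deduplicated against itself, but nothing catches a duplicate whose first copy appeared in an earlier batch:

```python
def dedupe_in_chunks(src, dst, chunk_size=100_000):
    """Dedupe within each chunk of `chunk_size` lines, but NOT across chunks."""
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        while True:
            # Read the next batch of lines; readline() returns "" at EOF.
            chunk = [fin.readline() for _ in range(chunk_size)]
            chunk = [line for line in chunk if line]
            if not chunk:
                break
            seen = set()  # resets every batch -- this is the flaw
            for line in chunk:
                if line not in seen:
                    seen.add(line)
                    fout.write(line)
```

Because `seen` is rebuilt for every batch, two copies of a line that land in different batches are both written out, no matter how many passes are made.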
Is there an effective method for this?
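One standard pattern for this (a sketch of a general technique, not code from the question) is hash partitioning: stream the file once, routing each line into one of N temporary bucket files by a hash of its content. All copies of a given line necessarily land in the same bucket, so each bucket can then be deduplicated in memory on its own, provided N is chosen so that a single bucket (roughly file size / N) fits in RAM:

```python
import hashlib
import os

def dedupe_external(src, dst, n_buckets=256, workdir="."):
    """Deduplicate `src` into `dst` using hash-partitioned bucket files."""
    buckets = [
        open(os.path.join(workdir, "bucket_%d.tmp" % i), "w+", encoding="utf-8")
        for i in range(n_buckets)
    ]
    try:
        # Pass 1: route every line to its bucket. md5 is used only for
        # partitioning (deterministic, well-spread), not for security.
        with open(src, encoding="utf-8") as fin:
            for line in fin:
                digest = hashlib.md5(line.encode("utf-8")).hexdigest()
                buckets[int(digest, 16) % n_buckets].write(line)
        # Pass 2: each bucket is small enough to dedupe with an in-RAM set.
        with open(dst, "w", encoding="utf-8") as fout:
            for b in buckets:
                b.seek(0)
                for line in set(b):
                    fout.write(line)
    finally:
        for b in buckets:
            b.close()
            os.remove(b.name)
```

The output order is arbitrary (it follows the hash partitioning), which is acceptable here since the input lines are in random order anyway.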
Comment: `sort -u file` — does it have to be in Python? - jfs
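The comment points at GNU `sort`, which performs an external (disk-based) merge sort and therefore handles files larger than RAM out of the box; `-u` drops duplicate lines. If it must be driven from Python, a minimal sketch (assuming GNU coreutils `sort` is on the PATH; the filenames are placeholders):

```python
import os
import subprocess

def dedupe_with_sort(src, dst):
    """Sort and deduplicate `src` into `dst` using external GNU sort."""
    # LC_ALL=C forces fast, deterministic byte-wise comparison.
    env = dict(os.environ, LC_ALL="C")
    subprocess.run(["sort", "-u", "-o", dst, src], env=env, check=True)
```

Note that, unlike the original requirement, this also sorts the output; since the lines are in random order to begin with, that is usually not a drawback.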