There is a txt file 107 GB in size, and there are 109 GB free on the hard drive. What is the best tool to quickly get rid of duplicate lines in a text file?

I tried the command awk '!seen[$0]++' text.txt. It started off nicely and very quickly, but after 15-17 hours I saw that it was still crawling through the file line by line and the computer had really started to lag.

I am looking towards uniq text.txt > text_new.txt, but I don't know how much better it will be than the previous command.

Can anyone advise something?

  • uniq requires the lines to be sorted; if they are not, they have to be sorted with sort anyway, which means you can just pass the -u flag to sort and keep only the unique lines, without using uniq at all - BOPOH
  • @BOPOH The file is not sorted. As I understand it, I need to run the following command: cat text.txt | sort | uniq > new_text.txt? - Andrei
  • Using cat will noticeably hurt performance; sort can read the file itself just fine - KoVadim
  • @KoVadim if it is not too much trouble, please write out the command I should use. Thank you in advance. - Andrei
  • sort file.txt -u > result.txt - KoVadim

2 Answers

uniq will not work if the lines are not sorted; all that time would simply be wasted. There are different options, but it all depends on how many duplicates there are. If there are not many duplicates and they fit in a couple of gigabytes, you can do the following. With grep, select only the lines that begin with 'aa' (or 'aaa'); there will not be many of them, and if there are duplicates among them, they will all end up in the same group. Passing through the file prefix by prefix, you can process every group this way. The code looks roughly like this (I am writing it in Perl: since there is awk and Linux, there is Perl too).

 for (my $l = 'aaa'; $l ne 'qqq'; $l++) { `egrep '^$l' file.txt | sort -u >> res.txt` } 

If there is not enough space, you can move the result file to another computer. Clearly this method works well for English text; for Cyrillic you will need to tweak it.

You can also set the prefix ranges by hand and work through them gradually.

Option two. For this you will need MySQL or something similar. Create a table there with a single column and make the values in that column unique. Then insert the lines one by one (better in batches). Now it is the database's problem to track the duplicates. And since the database can live on another machine, the problem of free space is solved :)
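
A minimal sketch of that idea in Perl with DBI, assuming a MySQL server on another host; the database name, table name, credentials and the VARCHAR size are all illustrative (a UNIQUE index cannot cover an arbitrarily long TEXT column, so very long lines would need a hashed key instead):

 #!/usr/bin/perl
 use strict;
 use warnings;
 use DBI;

 # Hypothetical connection settings; adjust host, database, user, password.
 my $dbh = DBI->connect('DBI:mysql:database=dedup;host=otherhost',
                        'user', 'password', { RaiseError => 1 });

 # One column, declared UNIQUE, so the database itself rejects duplicates.
 # VARCHAR(500) is an assumption about the maximum line length.
 $dbh->do('CREATE TABLE IF NOT EXISTS lines
           (line VARCHAR(500) NOT NULL, UNIQUE KEY (line))');

 my $ins = $dbh->prepare('INSERT IGNORE INTO lines (line) VALUES (?)');
 open my $in, '<', 'text.txt' or die $!;
 while (my $line = <$in>) {
     chomp $line;
     $ins->execute($line);   # duplicates are silently skipped by INSERT IGNORE
 }
 close $in;

 # The unique lines can then be dumped back out, for example with:
 #   mysql -N -e 'SELECT line FROM dedup.lines' > text_uniq.txt

Inserting many values per statement would be noticeably faster than one execute per line, which is what "better in batches" means here.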

Option three. Since this is Linux, you can mount a file system from another machine that has enough free space.

Option four. If there is enough RAM, you can take the risk and just run sort -u right away. I literally recently sorted a file of about 800 megabytes (yes, that is 120 times smaller, but still) and it finished within a minute (on a machine with an i3 and 8 GB of RAM), and I think most of that time was spent reading and writing the file (no SSD there, just a regular disk).
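
For completeness, a sketch of launching that from Perl in the same backtick style as the loop above; the buffer size, temp directory and file names are assumptions, and -S / -T are the GNU sort options for the in-memory buffer and for where the temporary merge files go:

 # LC_ALL=C makes the byte-wise comparison explicit and is usually faster.
 # /mnt/big/tmp stands for whatever directory has enough free space.
 `LC_ALL=C sort -u -S 4G -T /mnt/big/tmp text.txt > text_uniq.txt`;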

Option five. Here you already have to look at what the data actually is. If the lines are, for example, IP addresses, you can simply keep a bitmap in memory (which is 2^32 / 8 bytes = 512 MB!) and set the bits in linear time. Then, in the same linear fashion, walk the resulting array once more and restore the addresses.
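
A sketch of that bitmap pass in Perl, assuming one IPv4 address per line; the file names are illustrative, and the result comes out in numeric address order:

 #!/usr/bin/perl
 use strict;
 use warnings;

 my $bits = '';
 vec($bits, 2**32 - 1, 1) = 0;     # pre-allocate the full 512 MB bit vector

 # Pass 1: set one bit per address seen.
 open my $in, '<', 'ips.txt' or die $!;
 while (my $line = <$in>) {
     chomp $line;
     my ($o1, $o2, $o3, $o4) = split /\./, $line;
     vec($bits, ($o1 << 24) | ($o2 << 16) | ($o3 << 8) | $o4, 1) = 1;
 }
 close $in;

 # Pass 2: walk the bitmap and print each address that was seen, exactly once.
 open my $out, '>', 'ips_uniq.txt' or die $!;
 for my $n (0 .. 2**32 - 1) {
     next unless vec($bits, $n, 1);
     printf {$out} "%d.%d.%d.%d\n",
         ($n >> 24) & 255, ($n >> 16) & 255, ($n >> 8) & 255, $n & 255;
 }
 close $out;

The second pass visits all 2^32 positions, which is linear but slow in pure Perl; if the original order matters, you could instead check the bit while reading and print the line only when it was not yet set.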

Option six. If the text on each line differs, you can pick a hash function that maps each line to a single byte (even a plain sum of the bytes will do, though you may need to experiment). Identical lines get the same hash code, but not the other way around: there may be different lines with the same code, which is called a collision. Now, running through the file, compute the hash of each line; if it equals the chosen value, write the line to a separate file. With the right approach the file is split into 200-256 pieces, about half a gigabyte each, and each piece can then be run through sort -u to drop its duplicates. In essence this is very similar to the first method; there the hash function was simply the first two or three characters of the line.
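
A sketch of a single-pass variant of that idea: instead of rescanning the file for every hash value, each line goes straight into one of 256 part files chosen by a one-byte checksum of its bytes. The file names are illustrative.

 #!/usr/bin/perl
 use strict;
 use warnings;

 # Open 256 bucket files, one per possible value of the one-byte hash.
 my @out;
 for my $i (0 .. 255) {
     open $out[$i], '>', sprintf('part_%03d.txt', $i) or die $!;
 }

 open my $in, '<', 'text.txt' or die $!;
 while (my $line = <$in>) {
     my $h = unpack '%8C*', $line;   # sum of the bytes modulo 256
     print { $out[$h] } $line;       # identical lines always land in the same part
 }
 close $_ for $in, @out;

 # Each part is now a few hundred megabytes and can be deduplicated on its own:
 #   for f in part_*.txt; do sort -u "$f" >> result.txt; done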

UPD

At first I did not notice that 109 GB were free there. That changes a lot. At first I thought the disk was 109 GB in total and therefore only 2 GB were free :)

  • Does uniq handle Cyrillic properly? - Andrei
  • @4BAL0V, neither sort nor uniq cares how the content is interpreted: the bytes are compared. Example: echo -e '\x00\n\x00\n\x01' | sort -u | wc -l returns, as expected, 2 (two unique lines). - aleksandr barakin

If the order of the lines is not important, you can implement a merge sort that works with the data on disk.
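
A minimal sketch of such an on-disk merge sort with deduplication, in Perl; the file names and chunk size are illustrative, and the chunk size should be tuned to the available RAM:

 #!/usr/bin/perl
 use strict;
 use warnings;

 my $chunk_lines = 5_000_000;   # how many lines fit comfortably in memory
 my @chunk_names;

 # Phase 1: cut the big file into sorted chunks that fit in memory.
 open my $in, '<', 'text.txt' or die $!;
 until (eof $in) {
     my @buf;
     while (defined(my $line = <$in>)) {
         push @buf, $line;
         last if @buf >= $chunk_lines;
     }
     my $name = 'chunk' . scalar(@chunk_names) . '.txt';
     open my $out, '>', $name or die $!;
     print {$out} sort @buf;          # in-memory sort of one chunk
     close $out;
     push @chunk_names, $name;
 }
 close $in;

 # Phase 2: k-way merge of the sorted chunks, skipping repeated lines.
 my @fh   = map { open my $f, '<', $_ or die $!; $f } @chunk_names;
 my @head = map { scalar readline $_ } @fh;
 open my $res, '>', 'text_uniq.txt' or die $!;
 my $prev;
 while (grep { defined } @head) {
     # pick the chunk whose current line is smallest
     my ($i) = sort { $head[$a] cmp $head[$b] }
               grep { defined $head[$_] } 0 .. $#head;
     my $line = $head[$i];
     print {$res} $line if !defined $prev or $line ne $prev;
     $prev = $line;
     $head[$i] = scalar readline($fh[$i]);   # advance that chunk
 }
 close $res;

In practice GNU sort -u already performs this kind of external merge internally (its temporary chunks go to the directory given by -T), so a hand-rolled version is mostly useful if you want to control where the intermediate files live or how the lines are compared.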