There are two files: file1 with about 100 million lines and file2 with about 20 million lines. I need to check file2 against file1 and write the lines to two new files, depending on whether each line appears in file1 or not. For line counts this large, are there resource-efficient solutions in C++ or C#?
- In theory, you could use StreamReader and read line by line, and so on. - Lolidze
- I'd suggest looking at HashSet. However, both of your files would have to be loaded into memory in full, so it's not suitable for very large files. It would also remove duplicates from each file separately. - John
- I tried it with a HashSet: I read the dictionary in chunks and compare the second file against each chunk, but it still takes a very long time :( - Monolith
1 answer
Look, a naive algorithm that compares every line against every other line will be far too slow for you. Instead, let's build an algorithm that separates the data in a single pass.
Start by sorting both files. This is actually the most resource-intensive step; if the files do not fit into memory, you may need an external sort for it.
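An external sort can be sketched as two phases: split the file into chunks that fit into memory, sort each chunk and spill it to a temporary "run" file, then merge all runs with a min-heap. This is a minimal illustration, not tuned code; the chunk size and the `run*.txt` file names are my own placeholders.

```cpp
#include <algorithm>
#include <fstream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Phase 1: read the file in chunks of at most linesPerChunk lines,
// sort each chunk in memory, and write it out as a sorted run file.
std::vector<std::string> makeSortedRuns(const std::string& path, size_t linesPerChunk) {
    std::ifstream in(path);
    std::vector<std::string> runNames;
    std::vector<std::string> chunk;
    std::string line;
    auto flush = [&]() {
        if (chunk.empty()) return;
        std::sort(chunk.begin(), chunk.end());
        std::string name = "run" + std::to_string(runNames.size()) + ".txt";
        std::ofstream out(name);
        for (const auto& s : chunk) out << s << '\n';
        runNames.push_back(name);
        chunk.clear();
    };
    while (std::getline(in, line)) {
        chunk.push_back(line);
        if (chunk.size() >= linesPerChunk) flush();
    }
    flush();
    return runNames;
}

// Phase 2: k-way merge of the sorted runs using a min-heap keyed by
// each run's current line.
void mergeRuns(const std::vector<std::string>& runNames, const std::string& outPath) {
    std::vector<std::ifstream> runs;
    for (const auto& n : runNames) runs.emplace_back(n);
    using Entry = std::pair<std::string, size_t>;  // (line, run index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::string line;
    for (size_t i = 0; i < runs.size(); ++i)
        if (std::getline(runs[i], line)) heap.emplace(line, i);
    std::ofstream out(outPath);
    while (!heap.empty()) {
        Entry top = heap.top();
        heap.pop();
        out << top.first << '\n';
        if (std::getline(runs[top.second], line)) heap.emplace(line, top.second);
    }
}
```

In practice you would pick the chunk size from available RAM (tens of millions of lines per chunk is realistic) and delete the run files afterwards.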
Now that the files are sorted, proceed as follows. Keep a current line from the first file (let's call it the dictionary) and one from the second (let's call it the input). Open both files and read the first line of each.
Then, while neither the input nor the dictionary has ended, do the following in a loop:
- compare the current input line with the current dictionary line
- if the dictionary line is greater than the input line, the input line is not in the dictionary: append it to the second output file, read the next input line, and go to the next iteration
- if the dictionary line is equal to the input line, the input line is in the dictionary: append it to the first output file, read the next input line, and go to the next iteration
- if the dictionary line is less than the input line, read the next dictionary line
If the input has not ended when the loop finishes, all of its remaining lines also go to the second output file.
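The loop above can be written down directly. Here is a minimal sketch; the function and file names are illustrative, and both files are assumed to be already sorted:

```cpp
#include <fstream>
#include <string>

// One-pass separation of a sorted "input" file against a sorted
// "dictionary" file: lines present in the dictionary go to foundPath,
// the rest to missingPath.
void splitByDictionary(const std::string& dictPath, const std::string& inputPath,
                       const std::string& foundPath, const std::string& missingPath) {
    std::ifstream dict(dictPath), input(inputPath);
    std::ofstream found(foundPath), missing(missingPath);

    std::string d, in;
    bool haveDict = static_cast<bool>(std::getline(dict, d));
    bool haveIn   = static_cast<bool>(std::getline(input, in));

    while (haveDict && haveIn) {
        if (d > in) {               // dictionary moved past: line is absent
            missing << in << '\n';
            haveIn = static_cast<bool>(std::getline(input, in));
        } else if (d == in) {       // exact match: line is present
            found << in << '\n';
            haveIn = static_cast<bool>(std::getline(input, in));
        } else {                    // d < in: advance the dictionary
            haveDict = static_cast<bool>(std::getline(dict, d));
        }
    }
    // Dictionary exhausted: everything left in the input is absent.
    while (haveIn) {
        missing << in << '\n';
        haveIn = static_cast<bool>(std::getline(input, in));
    }
}
```

Note that on an equal match the dictionary line is deliberately not advanced, so repeated equal input lines still match it.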
Note that this algorithm closely resembles merge sort, which, apparently, you will need anyway for the external sorting.