My program needs to compare certain files (they will not always be text), and they can be large, so a naive byte-by-byte comparison is most likely out of the question: the comparison has to be reasonably fast.

I only need to check whether the files are equal or not; there is no need to locate the actual discrepancies.

Please suggest good solutions, or at least point me in the right direction.

  • How do you want to compare: just for equality, or to find all the discrepancies? A dumb byte-by-byte comparison, I suspect, will actually be the fastest. - Vladimir Martyanov
  • Compare the hashes of these files? - Vadim Prokopchuk
  • @VadimProkopchuk that will definitely be slower than a byte-by-byte comparison - Vladimir Martyanov
  • Without clarifying what "compare files" means, the question cannot be answered. Do you need to find out how the files differ, or just check whether they are byte-for-byte identical? - therainycat
  • Most importantly, when implementing a byte-by-byte comparison, do not read the files from disk one byte at a time. - vp_arth

2 answers

As noted in the comments, a byte-by-byte comparison will be the fastest (provided all the bytes are already in memory, of course - though in C# you would have to go out of your way to do otherwise).

But to start with, the files can simply be compared by size. If the sizes differ, the files are obviously different.
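For illustration, a minimal sketch of that size check (the file names are placeholders, matching the example below):

using System;
using System.IO;

// Compare the lengths first; if they differ, no content needs to be read.
var info1 = new FileInfo("file1.bin");
var info2 = new FileInfo("file2.bin");
if (info1.Length != info2.Length)
    Console.WriteLine("The files are different (sizes do not match).");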

If the sizes are identical and you still need to compare byte by byte, the files can be split into chunks and the comparison run in parallel.

Here is example code for such a comparison. The code is neither perfect nor universal, but you can use it as a starting point.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

namespace FileCompare
{
    class Program
    {
        static void Main(string[] args)
        {
            // Read the files into memory
            var file1 = File.ReadAllBytes("file1.bin"); // 10,000,000 bytes
            var file2 = File.ReadAllBytes("file2.bin"); // 10,000,000 bytes

            var partsCount = 10;    // number of parts
            var partSize = 1000000; // size of one part

            // Split the files into lists of parts
            var parts1 = new List<IEnumerable<byte>>();
            var parts2 = new List<IEnumerable<byte>>();
            for (int i = 0; i < partsCount; i++)
            {
                var skip = i * partSize;
                parts1.Add(file1.Skip(skip).Take(partSize));
                parts2.Add(file2.Skip(skip).Take(partSize));
            }

            // Create the array of tasks. Task.Run starts each task immediately after creation.
            // Also note the current = i capture on each iteration.
            var comparisonTasks = new Task<bool>[partsCount];
            for (int i = 0; i < partsCount; i++)
            {
                var current = i;
                comparisonTasks[current] = Task.Run(() =>
                {
                    Console.WriteLine($"Running {current}...");
                    return parts1[current].SequenceEqual(parts2[current]);
                });
            }

            // Wait for all tasks to finish
            Task.WaitAll(comparisonTasks);

            // Print the results to the console
            foreach (var task in comparisonTasks)
                Console.WriteLine(task.Result);
            Console.ReadKey();
        }
    }
}

By the way, keep in mind that for small files this solution will only slow the comparison down. The only way to know for sure is to measure.

And one more thing: you can use a CancellationToken to stop the comparison as soon as differing bytes are detected.
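A minimal sketch of that idea, assuming the usings and the parts1/parts2/partsCount variables from the example above (plus System.Threading for CancellationTokenSource); the per-byte token check is deliberately simple, and a real implementation would poll less often:

var cts = new CancellationTokenSource();
var tasks = new Task<bool>[partsCount];
for (int i = 0; i < partsCount; i++)
{
    var current = i;
    tasks[current] = Task.Run(() =>
    {
        var a = parts1[current].ToArray();
        var b = parts2[current].ToArray();
        for (int j = 0; j < a.Length; j++)
        {
            // Another chunk already found a difference; give up early
            if (cts.Token.IsCancellationRequested)
                return false;
            if (a[j] != b[j])
            {
                cts.Cancel(); // tell the other tasks to stop
                return false;
            }
        }
        return true;
    });
}
Task.WaitAll(tasks);
Console.WriteLine(tasks.All(t => t.Result) ? "identical" : "different");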

  • It’s not a good idea to read the entire contents of a file from disk at once, especially if the chance that the files differ is quite high. - PashaPash
  • Parallelization is unnecessary. And frankly, reading everything into memory and dragging the chunks through LINQ is a mess; most likely it will actually be very slow. - Qwertiy
  • Is it really impossible in C# to read block by block (say, 64K)? And is read-ahead really not implemented in the Windows kernel (where C# code mostly runs)? If the answer to both questions is "no", then the optimal algorithm is obvious. - avp
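For what it's worth, reading block by block in C# is straightforward; here is a minimal sketch of the approach the last comment suggests (64K blocks, paths as parameters; the partial-read handling is simplified):

using System;
using System.IO;

static class StreamCompare
{
    // Read both files in 64K blocks and stop at the first difference.
    public static bool FilesAreEqual(string path1, string path2)
    {
        const int BlockSize = 64 * 1024;
        using var s1 = File.OpenRead(path1);
        using var s2 = File.OpenRead(path2);
        if (s1.Length != s2.Length)
            return false; // cheap size check first

        var buf1 = new byte[BlockSize];
        var buf2 = new byte[BlockSize];
        while (true)
        {
            int n1 = s1.Read(buf1, 0, BlockSize);
            int n2 = s2.Read(buf2, 0, BlockSize);
            if (n1 != n2)
                return false; // partial reads of different sizes
            if (n1 == 0)
                return true;  // both files exhausted with no mismatch
            if (!buf1.AsSpan(0, n1).SequenceEqual(buf2.AsSpan(0, n2)))
                return false;
        }
    }
}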

Of course, the head-on approach proposed above is correct, but inefficient. It is easier to compute and compare hashes of the files (for example, MD5). Since I have not written C# in a long time, here is some pseudocode:

  boolean identical = true;
  byte[] buffer1, buffer2;
  while (!file1.eof() && identical) {
      file1.read(buffer1);
      file2.read(buffer2);
      hash1 = hashFunction(buffer1);
      hash2 = hashFunction(buffer2);
      if (hash1 != hash2)
          identical = false;
  }

Colleagues more versed in C# will easily translate this into real code.
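One possible translation, hedged: this simply renders the pseudocode above as C# (the block size and MD5 are arbitrary choices; as the comments note, comparing the blocks directly would give the same result without the hashing step):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class PairwiseHashCompare
{
    // Read both files block by block and compare the hash of each block pair,
    // exactly as the pseudocode does.
    public static bool FilesAreEqual(string path1, string path2)
    {
        const int BlockSize = 64 * 1024;
        using var md5 = MD5.Create();
        using var f1 = File.OpenRead(path1);
        using var f2 = File.OpenRead(path2);
        if (f1.Length != f2.Length)
            return false;

        var buf1 = new byte[BlockSize];
        var buf2 = new byte[BlockSize];
        int n1;
        while ((n1 = f1.Read(buf1, 0, BlockSize)) > 0)
        {
            int n2 = f2.Read(buf2, 0, BlockSize);
            if (n1 != n2)
                return false;
            var hash1 = md5.ComputeHash(buf1, 0, n1);
            var hash2 = md5.ComputeHash(buf2, 0, n2);
            if (!hash1.SequenceEqual(hash2))
                return false;
        }
        return true;
    }
}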

Update

Following the discussion, the pseudocode is revised: for each file, we compute a hash of its entire content:

  byte[] buffer;
  hash = null;
  while (!file.eof()) {
      file.read(buffer);
      hash = hashFunction(buffer + hash);
  }

Then, when comparing files, if their lengths are the same, you can simply compare the previously saved hash values.

It is preferable to use a hash function such as SHA-512, which keeps the probability of collisions as low as possible.
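A minimal C# sketch of this updated scheme, assuming hashes are cached per file path (the class and method names are made up for illustration; "ProbablyEqual" because hash equality does not strictly guarantee file equality, as the comments below point out):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class HashCache
{
    // Hypothetical cache: hash each file once, then compare stored hashes.
    private static readonly Dictionary<string, byte[]> Cache = new();

    public static byte[] GetHash(string path)
    {
        if (!Cache.TryGetValue(path, out var hash))
        {
            using var sha = SHA512.Create();
            using var stream = File.OpenRead(path);
            hash = sha.ComputeHash(stream); // reads the file in blocks internally
            Cache[path] = hash;
        }
        return hash;
    }

    public static bool ProbablyEqual(string path1, string path2) =>
        new FileInfo(path1).Length == new FileInfo(path2).Length &&
        GetHash(path1).SequenceEqual(GetHash(path2));
}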

  • Are you trying to say that comparing two bytes is slower than computing an MD5 hash of one byte? - Vladimir Martyanov
  • Yes, what is written in that "discussion" is nonsense. Hashing is unnecessary: the file has to be read all the same. And even this answer admits, "but that still does not mean the files are equal". - PashaPash
  • Computing hashes makes sense when there are more than two files, because then it really does save re-reading the same file when comparing it against the others. - PashaPash
  • Reread it. Yeah, cool, but if your answer is applied to the asker's question it will not help, because your pseudocode computes a hash for every pairwise comparison. Write that the hash should be stored independently of any pairwise comparison, and that if the hashes match you still need to compare byte by byte - then the answer will be good. Right now it is something like "bad advice" :) - PashaPash
  • @Barmaley, the question never says that more than 2 files will be compared at once. And when comparing hashes, an incorrect result is possible due to collisions. Strictly speaking, if the hash comparison says the files are the same, you need to go over them again and compare byte by byte. - Qwertiy