There is a file, divided into blocks (sequences of bytes).
How can I compute the SHA-256 hash of each block of the file in parallel, using Threads?

  • I'm not sure this is possible in principle: hash functions generally compute the value of each block from the previous one, so to get the final value you have to run the whole data stream through the hash function sequentially. - etki
  • Do you need to hash only the individual blocks, or the whole file (block by block)? - andreycha
  • Well, judging by the task: divide the file into blocks and get a hash for each block. Computing the hash of a block is exactly what is done in a Thread. - Grundy

2 answers

First, let's write a function for one block.

 public static class Sha256Service
 {
     private const long blockSize = 16384;

     public static byte[] CalculateHash(string filename, int blockIndex)
     {
         using (var stream = new FileStream(filename, FileMode.Open, FileAccess.Read))
         using (var hasher = new SHA256Managed())
         {
             stream.Seek(blockIndex * blockSize, SeekOrigin.Begin);
             var buffer = new byte[blockSize];
             var actualLength = stream.Read(buffer, 0, buffer.Length);
             return hasher.ComputeHash(buffer, 0, actualLength);
         }
     }
 }

Now this function must be called in parallel for each block of the file. Creating Thread objects by hand is laborious, so I would rather use Parallel.For or PLINQ.

The number of blocks is calculated with a cunning formula (integer ceiling division) that every programmer just has to memorize once:

 var blockCount = (fileSize + blockSize - 1)/blockSize; 
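To see why the formula rounds up, here is a small self-contained check (the file sizes are illustrative values of my own):

```csharp
using System;

class CeilingDemo
{
    static void Main()
    {
        const long blockSize = 16384;
        // (fileSize + blockSize - 1) / blockSize rounds up to the next whole block
        foreach (var fileSize in new long[] { 1, 16384, 16385, 32768 })
        {
            var blockCount = (fileSize + blockSize - 1) / blockSize;
            Console.WriteLine($"{fileSize} -> {blockCount}");
        }
    }
}
```

A file of exactly 16384 bytes needs one block, while 16385 bytes already need two.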

Now we add a second method that calls the first one in parallel:

 public static byte[][] Calculate(string filename)
 {
     var blockCount = (int)((new FileInfo(filename).Length + blockSize - 1) / blockSize);
     var result = new byte[blockCount][];
     Parallel.For(0, blockCount, i => result[i] = CalculateHash(filename, i));
     return result;
 }

The file will be open for reading in several threads at once. That is not a problem for RAID arrays and solid-state drives, but as far as I know, on a single hard drive the seeking will hurt performance.
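Since PLINQ was mentioned, the same parallel call can also be expressed as a query. This is a sketch to add to the Sha256Service class above (CalculatePlinq is a name of my own; blockSize and CalculateHash are the members defined earlier):

```csharp
using System.IO;
using System.Linq;

// Sketch: a PLINQ equivalent of the Parallel.For version.
public static byte[][] CalculatePlinq(string filename)
{
    var blockCount = (int)((new FileInfo(filename).Length + blockSize - 1) / blockSize);
    return Enumerable.Range(0, blockCount)
        .AsParallel()
        .AsOrdered()                          // keep hashes in block order
        .Select(i => CalculateHash(filename, i))
        .ToArray();
}
```

AsOrdered matters here: without it, PLINQ is free to deliver results out of block order.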

If this is done with raw Thread objects, the code starts to look much worse. Something like this:

 static byte[][] Calculate2(string filename)
 {
     var blockCount = (int)((new FileInfo(filename).Length + blockSize - 1) / blockSize);
     var result = new byte[blockCount][];
     var threads = new Thread[blockCount];
     for (int i = 0; i < blockCount; i++)
     {
         var i1 = i; // copy the loop variable for the closure
         threads[i] = new Thread(() => result[i1] = CalculateHash(filename, i1));
         threads[i].Start();
     }
     for (int i = 0; i < blockCount; i++)
         threads[i].Join();
     return result;
 }

What should you pay attention to? First, the cunning copy of the variable i inside the first for loop. This is a well-known pattern for using loop variables inside closures. Scary.

Second, Thread has no WaitAll method, so you have to write almost the same thing by hand.
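If that pattern repeats, the missing WaitAll can be approximated with a tiny helper (JoinAll is a name of my own, not a framework method):

```csharp
using System.Threading;

static class ThreadHelpers
{
    // Joins every thread in turn, mimicking Task.WaitAll for plain threads.
    public static void JoinAll(params Thread[] threads)
    {
        foreach (var thread in threads)
            thread.Join();
    }
}
```

The second loop in Calculate2 then collapses to a single `ThreadHelpers.JoinAll(threads);` call.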

ANSWER TO AN ADDITIONAL QUESTION IN THE COMMENTS

The code above could also handle files larger than RAM, were it not for one "but": it creates a thread for each block, and a thread has a default stack size of 1 megabyte.

Besides, in theory that many threads will not actually run in parallel anyway, because there are far more threads than cores.
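If a thread per block were really unavoidable, the 1 MB default could at least be reduced: the Thread constructor has an overload that takes a maximum stack size. A sketch, not a recommendation (256 KB is an arbitrary illustrative value):

```csharp
using System;
using System.Threading;

class StackSizeDemo
{
    static void Main()
    {
        var done = false;
        // Thread(ThreadStart, int maxStackSize): reserve less than the 1 MB default.
        var worker = new Thread(() => done = true, 256 * 1024);
        worker.Start();
        worker.Join();
        Console.WriteLine(done); // the worker has run to completion
    }
}
```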

The task is a test assignment, so I would not overthink it and would simply fix the number of threads at 4 or 8. Then it is more convenient to process the blocks in large chunks, each consisting of many consecutive blocks, with one chunk per thread. The CalculateHash method becomes a bit more complicated:

 public static void CalculateHash2(string filename, int inclusiveStartBlock, int exclusiveEndBlock, byte[][] hashes)
 {
     var buffer = new byte[blockSize];
     using (var stream = new FileStream(filename, FileMode.Open, FileAccess.Read))
     using (var hasher = new SHA256Managed())
     {
         stream.Seek(inclusiveStartBlock * blockSize, SeekOrigin.Begin);
         for (int i = inclusiveStartBlock; i < exclusiveEndBlock; i++)
         {
             var actualLength = stream.Read(buffer, 0, buffer.Length);
             hashes[i] = hasher.ComputeHash(buffer, 0, actualLength);
         }
     }
 }

The Calculate method for 4 threads would look like this:

 public static byte[][] Calculate3(string filename)
 {
     var blockCount = (int)((new FileInfo(filename).Length + blockSize - 1) / blockSize);
     var result = new byte[blockCount][];
     var thread1 = new Thread(() => CalculateHash2(filename, 0, blockCount / 4, result));
     var thread2 = new Thread(() => CalculateHash2(filename, blockCount / 4, blockCount / 2, result));
     var thread3 = new Thread(() => CalculateHash2(filename, blockCount / 2, 3 * blockCount / 4, result));
     var thread4 = new Thread(() => CalculateHash2(filename, 3 * blockCount / 4, blockCount, result));
     thread1.Start(); thread2.Start(); thread3.Start(); thread4.Start();
     thread1.Join(); thread2.Join(); thread3.Join(); thread4.Join();
     return result;
 }

We count the number of blocks, allocate memory for the hashes, split the blocks into 4 nearly equal parts, and compute the hashes of each part in 4 different threads.
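A usage sketch (the file name is a placeholder of my own):

```csharp
using System;

// Hypothetical call of Calculate3 from above; "data.bin" is a placeholder path.
var hashes = Sha256Service.Calculate3("data.bin");
Console.WriteLine($"{hashes.Length} block hashes");
Console.WriteLine(BitConverter.ToString(hashes[0])); // first block's SHA-256 in hex
```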

That seems to be all.

  • That's exactly it, I need it done through Threads)) Why do you use byte[][] and not, for example, string[]? - Timon
  • Because the built-in .NET classes like SHA256Managed return exactly that: a byte array. - Mark Shevchenko
  • @Timon Added a Thread version. As I said, it is all weeping and gnashing of teeth. Tell your tormentors that they will face an interview someday too, and then the law of universal justice may finally kick in. :) - Mark Shevchenko
  • Thank you, but how can one also satisfy the condition of processing files larger than RAM? Is that possible without complicated tricks?) - Timon
  • @Timon I have extended the answer. - Mark Shevchenko

If you just need to compute the block hashes, there is nothing difficult: read N bytes from the file, feed those N bytes to the algorithm and get the answer. Each thread receives a block number and reads the bytes from N * i to min(N * (i + 1), file_length), where i is the block number starting from zero.
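A sketch of that per-block read (ReadBlock and its parameter names are mine; note the min, which clamps the last block to the end of the file):

```csharp
using System;
using System.IO;

static class BlockReader
{
    // Reads block i of a file split into blocks of n bytes.
    // The last block is clamped to the file length, so it may be shorter than n.
    public static byte[] ReadBlock(string path, long n, long i)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            var start = n * i;
            var end = Math.Min(n * (i + 1), stream.Length);
            var buffer = new byte[end - start];
            stream.Seek(start, SeekOrigin.Begin);
            var offset = 0;
            while (offset < buffer.Length)
            {
                // Stream.Read may return fewer bytes than requested, so loop.
                var read = stream.Read(buffer, offset, buffer.Length - offset);
                if (read == 0) break; // unexpected end of file
                offset += read;
            }
            return buffer;
        }
    }
}
```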

If you need to compute a hash for the entire file block by block, use a hash tree (the Russian article describes the Tiger tree hash, built on the Tiger hash, but you can use a different algorithm). The idea is: compute the hash of each block, then compute hashes of each pair (or larger group), and so on up the tree until a single hash remains. This scheme is used in P2P networks to verify file integrity.
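The pairwise reduction can be sketched like this (MerkleRoot is a name of my own; an odd node is paired with itself here, although real protocols differ on that detail):

```csharp
using System.Linq;
using System.Security.Cryptography;

static class HashTree
{
    // Repeatedly hashes concatenated pairs of child hashes until one root remains.
    public static byte[] MerkleRoot(byte[][] leaves)
    {
        using (var hasher = SHA256.Create())
        {
            var level = leaves;
            while (level.Length > 1)
            {
                var next = new byte[(level.Length + 1) / 2][];
                for (var i = 0; i < next.Length; i++)
                {
                    var left = level[2 * i];
                    var right = 2 * i + 1 < level.Length ? level[2 * i + 1] : left;
                    next[i] = hasher.ComputeHash(left.Concat(right).ToArray());
                }
                level = next;
            }
            return level[0];
        }
    }
}
```

The leaves here would be the per-block hashes computed by the code in the other answer.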

Describe in more detail what exactly you need to do and what difficulties you ran into, and I will extend the answer.

  • I need to implement the first option; the problem is how to do it through Threads, i.e. without TPL, ThreadPool or BackgroundWorker. - Timon
  • @Timon Mark has already written the code for you, so I will leave my answer purely theoretical :). - andreycha