There is a large file. The file is divided into blocks of a given length, and the SHA-256 hash is computed for each block. The calculation should be multi-threaded and load the system as heavily as possible. A single block may be larger than the computer's RAM.

Here is how I see the solution. One thread reads data from the file. Since a whole block cannot be read at once due to memory constraints, I split it into several parts and write them into a bounded queue. Several worker threads take these parts and compute the intermediate hash value using the TransformBlock method of the SHA256Managed class. After all parts of a block have been processed, TransformFinalBlock is called and we get the hash of the block.

The problem is that, as I understand it, this method requires the parts of a block to be processed in strict order; otherwise the resulting value will be different. I don't know how to guarantee that. The threads process parts of the block in parallel and return results in a different order.

How do I synchronize the threads so that they produce results in the right order?

The code that runs inside the worker threads:

    public class CalculatingSha256
    {
        private static object lockObj = new object();
        private static SortedList sha256DataList;

        public CalculatingSha256()
        {
            sha256DataList = new SortedList();
            sha256DataList = SortedList.Synchronized(sha256DataList);
        }

        public void CalculateSha256(PartBlock block)
        {
            int index = sha256DataList.IndexOfKey(block.IdFullBlock);
            SHA256Data sha256Data = null;
            if (index >= 0)
            {
                sha256Data = (SHA256Data)sha256DataList.GetByIndex(index);
            }
            if (sha256Data == null)
            {
                sha256Data = new SHA256Data();
                if (block.SizeFullBlock != 0)
                {
                    sha256Data.BlockSize = block.SizeFullBlock;
                }
                sha256Data.Sha256 = new SHA256Managed();
                lock (lockObj)
                {
                    sha256DataList.Add(block.IdFullBlock, sha256Data);
                }
            }
            Console.WriteLine(block.Id + " " + block.SizeFullBlock + " " + block.IdFullBlock);
            sha256Data.Sha256.TransformBlock(block.Data, 0, block.Size, block.Data, 0);
            sha256Data.CurrentCalculated += block.Size;
            bool bBlockCalculateIsOver = sha256Data.CurrentCalculated == sha256Data.BlockSize;
            if ((sha256Data.BlockSize != 0) && (bBlockCalculateIsOver))
            {
                byte[] input = new byte[0];
                sha256Data.Sha256.TransformFinalBlock(input, 0, 0);
                string hashString = ToString(sha256Data.Sha256.Hash);
                Console.WriteLine("Block {0}: {1}", block.IdFullBlock + 1, hashString.ToUpper());
                sha256DataList.Remove(block.IdFullBlock);
            }
        }

        // Hex-conversion helper implied by the original code but not shown in it.
        private static string ToString(byte[] hash) =>
            BitConverter.ToString(hash).Replace("-", "");
    }

    public abstract class Block
    {
        public int Id { get; set; }
    }

    public class PartBlock : Block
    {
        public int Size { get; set; }
        public byte[] Data { get; set; }
        // number of the full block this part belongs to
        public int IdFullBlock { get; set; }
        public int SizeFullBlock { get; set; }
    }

    public class SHA256Data
    {
        public int BlockSize { get; set; }
        public int CurrentCalculated { get; set; }
        public SHA256Managed Sha256 { get; set; }
    }


As far as I understand, the SHA-256 algorithm itself does not parallelize: each call to TransformBlock must strictly follow the completion of the previous one, so the hash of a single block cannot be spread across several threads.

With this in mind, you can avoid dropping down to the low level of manual TransformBlock calls and simply do:

    byte[] hash;
    using (var fileStream = File.OpenRead(path))
    using (var sha = SHA256.Create())
        hash = sha.ComputeHash(fileStream);

In this case the problem with the file size goes away: the file stream is read in chunks internally. (Admittedly, you lose progress reporting this way.)


If you really need to cut the file into pieces of a given length, you can write an auxiliary partial file stream (untested):

    class PartialFileStream : Stream
    {
        FileStream fs;
        long PartStart, PartLength;

        public PartialFileStream(FileStream fs, long start, long length)
        {
            if (!fs.CanSeek || !fs.CanRead)
                throw new ArgumentException("Seekable and readable stream required", nameof(fs));
            this.fs = fs;
            fs.Position = start;
            this.PartStart = start;
            this.PartLength = length;
        }

        public override bool CanRead => true;
        public override bool CanSeek => true;
        public override bool CanWrite => false;
        public override long Length => PartLength;

        public override long Position
        {
            get { return fs.Position - PartStart; }
            set { fs.Position = value + PartStart; }
        }

        public override void Flush() => fs.Flush();

        public override int Read(byte[] buffer, int offset, int count)
        {
            var maxAllowedLength = PartLength - Position;
            return fs.Read(buffer, offset, (int)Math.Min(count, maxAllowedLength));
        }

        public override long Seek(long offset, SeekOrigin origin)
        {
            switch (origin)
            {
                case SeekOrigin.Begin:
                    return fs.Seek(offset + PartStart, SeekOrigin.Begin) - PartStart;
                case SeekOrigin.Current:
                    return fs.Seek(offset, SeekOrigin.Current) - PartStart;
                case SeekOrigin.End:
                    return fs.Seek(offset + PartStart + PartLength, SeekOrigin.Begin) - PartStart;
            }
            throw new ArgumentException("Unknown origin", nameof(origin));
        }

        public override void SetLength(long value) { throw new NotSupportedException(); }
        public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    }

With this auxiliary class in hand, the rest is simple:

    class Program
    {
        static void Main(string[] args)
        {
            new Program().Run().Wait();
        }

        async Task Run()
        {
            var path = "<path-to-the-file>";
            long size = new FileInfo(path).Length;
            long chunkSize = 0; // size of one chunk goes here
            long nchunks = (size + chunkSize - 1) / chunkSize;
            var encodeTasks = new List<Task<byte[]>>();
            for (int i = 0; i < nchunks; i++)
            {
                var length = chunkSize;
                if (i == nchunks - 1) // last chunk
                    length = size - i * chunkSize;
                // copy per iteration: don't capture the loop variable i in the lambda
                long start = i * chunkSize;
                encodeTasks.Add(Task.Run(() => ComputeHash(path, start, length)));
            }
            var results = await Task.WhenAll(encodeTasks);
            // here you have the array of hashes, do whatever you need with it
        }

        byte[] ComputeHash(string path, long startPos, long length)
        {
            using (var fileStream = File.OpenRead(path))
            using (var partialStream = new PartialFileStream(fileStream, startPos, length))
                return SHA256.Create().ComputeHash(partialStream);
        }
    }

If you don't want to bother with a custom stream class, you can do it manually:

    byte[] ComputeHash(string path, long startPos, long length)
    {
        using (var fs = File.OpenRead(path))
        {
            fs.Position = startPos;
            using (var sha = SHA256Managed.Create())
            {
                var buffer = new byte[4096];
                long bytesToGo = length;
                while (bytesToGo > 0)
                {
                    var bytesRead = fs.Read(buffer, 0, (int)Math.Min(buffer.Length, bytesToGo));
                    if (bytesRead == 0)
                        throw new EndOfStreamException();
                    sha.TransformBlock(buffer, 0, bytesRead, null, 0);
                    bytesToGo -= bytesRead;
                }
                sha.TransformFinalBlock(buffer, 0, 0);
                return sha.Hash;
            }
        }
    }

Some relevant parts of the discussion (see the full discussion in chat):

@PavelMayorov - When I wrote my PartialStream, I removed Seek support from it, and it turned out quite simple. The constructor took only the length; the offset was taken from wherever the underlying stream was positioned at the moment of creation.
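The simplified variant Pavel describes is not shown in the answer; a sketch of what it might look like (the class name and all details here are my assumption, not code from the discussion):

```csharp
using System;
using System.IO;

// A forward-only "window" over an existing stream: reads at most `length`
// bytes starting from the underlying stream's current position.
class ForwardOnlyPartialStream : Stream
{
    private readonly Stream inner;
    private long remaining;

    public ForwardOnlyPartialStream(Stream inner, long length)
    {
        if (!inner.CanRead)
            throw new ArgumentException("Readable stream required", nameof(inner));
        this.inner = inner;
        this.remaining = length;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (remaining == 0) return 0; // window exhausted: report end of stream
        int read = inner.Read(buffer, offset, (int)Math.Min(count, remaining));
        remaining -= read;
        return read;
    }

    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```

Because there is no Seek, each window must be created on a stream already positioned at the chunk start, which is exactly the trade-off described above.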

@VladD - I think the creation of the inner stream should not be moved inside the class; otherwise somebody will try to attach several partial streams to a single file stream, with predictable consequences.

@PavelMayorov - I don't think so. After all, hanging two StreamReaders on one stream at the same time is also a bad idea, but that doesn't prevent StreamReader from being used.

@PavelMayorov - Generally speaking, according to the interface, TransformBlock returns how many bytes it managed to consume. But for HashAlgorithm it always returns inputCount.
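A quick sketch illustrating that contract, assuming any SHA256 instance (which derives from HashAlgorithm):

```csharp
using System;
using System.Security.Cryptography;

class TransformBlockReturnDemo
{
    static void Main()
    {
        var data = new byte[100];
        using (var sha = SHA256.Create())
        {
            // ICryptoTransform allows a transform to consume fewer bytes than
            // offered, but HashAlgorithm.TransformBlock always returns inputCount.
            int consumed = sha.TransformBlock(data, 0, data.Length, null, 0);
            Console.WriteLine(consumed); // prints 100

            // The hash must still be finalized before reading sha.Hash.
            sha.TransformFinalBlock(Array.Empty<byte>(), 0, 0);
        }
    }
}
```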

@PavelMayorov - By the way, here is another possible approach: you can create a CryptoStream in Write mode over Stream.Null and write the data into it.
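A minimal sketch of that idea (my own illustration of the suggestion, assuming the parts of one block are still fed in order):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

class CryptoStreamHashDemo
{
    static void Main()
    {
        using (var sha = SHA256.Create())
        // Stream.Null discards the transformed output; only the hash state matters.
        using (var cs = new CryptoStream(Stream.Null, sha, CryptoStreamMode.Write))
        {
            // Feed the block's parts, in order, as they become available.
            cs.Write(new byte[] { 1, 2, 3 }, 0, 3);
            cs.Write(new byte[] { 4, 5 }, 0, 2);

            cs.FlushFinalBlock(); // finalizes the hash computation
            Console.WriteLine(BitConverter.ToString(sha.Hash));
        }
    }
}
```

The convenience here is that the CryptoStream drives TransformBlock/TransformFinalBlock for you, so the consuming code just writes bytes.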

@redfenix - The code in a thread will first load the disk while reading from the file, and then the CPU while computing the hash. I need the computer's resources to be used with maximum efficiency.

@redfenix - Will the proposed option use the computer's resources evenly?

@VladD - If there are many chunks, then yes, of course. But the hash of a single chunk still cannot be parallelized. Computing the hash of the next chunk on a new thread is no better than computing it on the old one: either way, one thread performs that computation.
