C#: Decompressing a file with GZip in a multi-threaded environment

Question

I am trying to work out how to implement GZip compression and decompression in a multi-threaded environment.

When I decompress a file using multiple threads, it consistently fails with one of these errors: StackOverflowException, System.OutOfMemoryException.

In debug mode, on the lines

 blockLength = BitConverter.ToInt32(lengthBuffer, 4);

and

 _dataSize = BitConverter.ToInt32(compressedData[i], blockLength - 4);

the variables are assigned astronomically large values in the second thread, which leads to the fatal errors. I tried adjusting the 4-byte offset and experimented with several different implementations, but the outcome is always the same. Please help! The code is below.

    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Threading;

    public abstract class GZip
    {
        protected static bool _cancelled = false;
        protected static bool _success = false;
        protected string sourceFile, destinationFile;
        protected static int _threads = Environment.ProcessorCount;
        protected const int buffer_size = 1024 * 1024;
        protected static int blockSize = 10000000;
        protected static byte[][] lastBuffer = new byte[_threads][];
        protected static byte[][] compressedData = new byte[_threads][];

        public GZip(string input, string output)
        {
            this.sourceFile = input;
            this.destinationFile = output;
        }

        public int CallBackResult()
        {
            if (!_cancelled && _success)
                return 0;
            return 1;
        }

        public void Cancel()
        {
            _cancelled = true;
        }

        public abstract void Launch();
    }

    class Decompressor : GZip
    {
        public Decompressor(string input, string output) : base(input, output) { }

        public override void Launch()
        {
            Console.Write("Decompressing");
            try
            {
                using (FileStream _compressedFile = new FileStream(sourceFile, FileMode.Open))
                {
                    using (FileStream _decompressedFile = new FileStream(sourceFile.Remove(sourceFile.Length - 3), FileMode.Append))
                    {
                        int blockLength;
                        int _dataSize;
                        byte[] lengthBuffer = new byte[8];
                        Thread[] tPool = new Thread[_threads];

                        while (_compressedFile.Position < _compressedFile.Length)
                        {
                            // Read one block per thread and start decompressing it.
                            for (int i = 0; (i < _threads) && (_compressedFile.Position < _compressedFile.Length); i++)
                            {
                                Console.Write(".");
                                _compressedFile.Read(lengthBuffer, 0, lengthBuffer.Length);
                                blockLength = BitConverter.ToInt32(lengthBuffer, 4);
                                compressedData[i] = new byte[blockLength];
                                lengthBuffer.CopyTo(compressedData[i], 0);
                                _compressedFile.Read(compressedData[i], 8, blockLength - 8);
                                _dataSize = BitConverter.ToInt32(compressedData[i], blockLength - 4);
                                lastBuffer[i] = new byte[_dataSize];
                                tPool[i] = new Thread(Decompress);
                                tPool[i].Start(i);
                            }

                            // Wait for each thread in order and write its result.
                            for (int i = 0; (i < _threads) && (tPool[i] != null);)
                            {
                                if (tPool[i].ThreadState == ThreadState.Stopped)
                                {
                                    _decompressedFile.Write(lastBuffer[i], 0, lastBuffer[i].Length);
                                    i++;
                                }
                            }
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine("Error is occured!\n Method: {0}\n Error description {1}", ex.TargetSite, ex.Message);
                _cancelled = true;
            }
        }

        public static void Decompress(object i)
        {
            using (MemoryStream _memoryStream = new MemoryStream(compressedData[(int)i]))
            {
                using (GZipStream cs = new GZipStream(_memoryStream, CompressionMode.Decompress))
                {
                    cs.Read(lastBuffer[(int)i], 0, lastBuffer[(int)i].Length);
                }
            }
        }
    }

The number of threads depends on protected static int _threads = Environment.ProcessorCount;
In my case it is 2. However, the errors described above are thrown on the second thread.
On small files the program "completes successfully", but for some reason everything happens in a single thread: about 16,000 bytes are written and then it finishes.
Try marking the variables shared between threads, such as compressedData, as volatile.
The compressed file is 87952607 bytes long. In the first thread, the line blockLength = BitConverter.ToInt32(lengthBuffer, 4); assigns 9284445,
and the line _dataSize = BitConverter.ToInt32(compressedData[i], blockLength - 4); assigns 520132758.
In the second thread, blockLength is already 67145198 and _dataSize is 1236379320, which leads to the error.

Answer 1 · 2017-01-26T13:48:21

Your code implicitly assumes that you can feed GZipStream an arbitrary fragment of a compressed file and it will decompress it.

This is not true. A compressed file contains control information that GZipStream cannot find if it is given only a piece of the file.

Apparently, the code ends up interpreting arbitrary data as control information, which leads to the problems.
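
As a rough illustration of this point, the sketch below compresses a buffer and then tries to decompress both the complete result and a fragment that starts 10 bytes in. The class name FragmentDemo and the helpers Compress and TryDecompress are made up for this example. The expectation that the fragment is rejected with InvalidDataException is an assumption based on how GZipStream normally reacts to a missing gzip header.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class FragmentDemo
{
    // Compress a buffer into a single gzip stream held in memory.
    public static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            using (var gz = new GZipStream(output, CompressionMode.Compress))
            {
                gz.Write(data, 0, data.Length);
            }
            // ToArray is still valid after the MemoryStream has been closed.
            return output.ToArray();
        }
    }

    // Returns false when GZipStream rejects the input as invalid data.
    public static bool TryDecompress(byte[] compressed, out byte[] result)
    {
        try
        {
            using (var input = new MemoryStream(compressed))
            using (var gz = new GZipStream(input, CompressionMode.Decompress))
            using (var output = new MemoryStream())
            {
                gz.CopyTo(output);
                result = output.ToArray();
                return true;
            }
        }
        catch (InvalidDataException)
        {
            result = null;
            return false;
        }
    }

    public static void Main()
    {
        byte[] original = Encoding.UTF8.GetBytes(new string('a', 1000) + new string('b', 1000));
        byte[] compressed = Compress(original);

        byte[] whole;
        Console.WriteLine("complete stream accepted: " + TryDecompress(compressed, out whole));

        // Same bytes, but starting past the header: an "arbitrary fragment".
        byte[] fragment = new byte[compressed.Length - 10];
        Array.Copy(compressed, 10, fragment, 0, fragment.Length);

        byte[] partial;
        Console.WriteLine("fragment accepted: " + TryDecompress(fragment, out partial));
    }
}
```

If the fragment is rejected as expected, that is the mild version of the failure. When the surrounding code goes further and reads sizes out of such a fragment, the values it gets are meaningless.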

The gzip format consists of several pieces placed end to end. Whatever the size of each piece, the decompressor unpacks the pieces one by one. Each piece begins with a ten-byte header containing information about the piece, but this header says nothing about the size of the compressed data. The header is followed by the data compressed with the Deflate algorithm, followed by a checksum and the size of the uncompressed data.
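
To make the layout concrete, here is a small sketch that compresses a buffer with GZipStream and looks at the first and last bytes of the result. The names GzipLayoutDemo and ReadUInt32LE are invented for the example. According to the gzip specification (RFC 1952), the header starts with the bytes 0x1f 0x8b followed by the compression method 8, bytes 4 to 7 hold a modification time, and the last four bytes of the piece hold the uncompressed size modulo 2^32.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class GzipLayoutDemo
{
    // Compress a buffer into one gzip piece held in memory.
    public static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            using (var gz = new GZipStream(output, CompressionMode.Compress))
            {
                gz.Write(data, 0, data.Length);
            }
            return output.ToArray();
        }
    }

    // Read a little-endian 32-bit value independently of the host byte order.
    public static uint ReadUInt32LE(byte[] buffer, int offset)
    {
        return (uint)buffer[offset]
             | ((uint)buffer[offset + 1] << 8)
             | ((uint)buffer[offset + 2] << 16)
             | ((uint)buffer[offset + 3] << 24);
    }

    public static void Main()
    {
        byte[] original = new byte[5000];
        for (int i = 0; i < original.Length; i++)
        {
            original[i] = (byte)(i % 7);
        }

        byte[] piece = Compress(original);

        // Header: two magic bytes, then the compression method (8 = deflate).
        Console.WriteLine("magic: {0:x2} {1:x2}, method: {2}", piece[0], piece[1], piece[2]);

        // Trailer: the last four bytes are the uncompressed size.
        uint uncompressedSize = ReadUInt32LE(piece, piece.Length - 4);
        Console.WriteLine("size in trailer: {0}, original length: {1}", uncompressedSize, original.Length);
    }
}
```

Since bytes 4 to 7 of a standard header are a timestamp, reading a block length from offset 4 can only work if the compressor deliberately stored one there.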

Therefore, it is easy to assemble a file from pieces: you can simply concatenate them. But in order to split a file into valid pieces, you need to know where each piece ends. For the time being, I don't see how to find that out without actually decompressing the deflate data.

Your code essentially picks arbitrary boundaries for the pieces, so you cannot interpret the data at a boundary as a length.
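
One way around this, sketched below under assumptions, is to record the boundaries when the file is compressed: each block is compressed independently and written with an explicit length prefix, so the decompressor never has to guess where a piece ends. The container layout (a 4-byte length followed by one gzip piece) and all names here are hypothetical. They are not part of the gzip format and not necessarily the format used in the question. Each worker gets its own buffers, and CopyTo is used because a single Read call on a stream may return fewer bytes than requested.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

public static class BlockContainerDemo
{
    // Hypothetical container: [int32 compressed length][gzip piece], repeated.
    public static byte[] CompressBlocks(byte[] data, int blockSize)
    {
        using (var output = new MemoryStream())
        using (var writer = new BinaryWriter(output))
        {
            for (int offset = 0; offset < data.Length; offset += blockSize)
            {
                int count = Math.Min(blockSize, data.Length - offset);
                byte[] piece = CompressBlock(data, offset, count);
                writer.Write(piece.Length); // explicit boundary information
                writer.Write(piece);
            }
            writer.Flush();
            return output.ToArray();
        }
    }

    public static byte[] CompressBlock(byte[] data, int offset, int count)
    {
        using (var output = new MemoryStream())
        {
            using (var gz = new GZipStream(output, CompressionMode.Compress))
            {
                gz.Write(data, offset, count);
            }
            return output.ToArray();
        }
    }

    public static byte[] DecompressBlock(byte[] piece)
    {
        using (var input = new MemoryStream(piece))
        using (var gz = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gz.CopyTo(output); // keeps reading until the end of the stream
            return output.ToArray();
        }
    }

    public static byte[] DecompressBlocks(byte[] container)
    {
        // Step 1: split sequentially, using the recorded lengths.
        var pieces = new List<byte[]>();
        using (var reader = new BinaryReader(new MemoryStream(container)))
        {
            while (reader.BaseStream.Position < reader.BaseStream.Length)
            {
                int length = reader.ReadInt32();
                pieces.Add(reader.ReadBytes(length));
            }
        }

        // Step 2: decompress in parallel; each index writes only its own slot.
        var results = new byte[pieces.Count][];
        Parallel.For(0, pieces.Count, i => results[i] = DecompressBlock(pieces[i]));

        // Step 3: write the results back in the original order.
        using (var output = new MemoryStream())
        {
            foreach (byte[] block in results)
            {
                output.Write(block, 0, block.Length);
            }
            return output.ToArray();
        }
    }

    public static void Main()
    {
        var data = new byte[100000];
        new Random(42).NextBytes(data);

        byte[] container = CompressBlocks(data, 16384);
        byte[] restored = DecompressBlocks(container);
        Console.WriteLine("lengths match: " + (restored.Length == data.Length));
    }
}
```

Note that a file written this way is no longer a plain gzip file, because of the extra length prefixes, so a general-purpose archiver would not be expected to open it.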

Before compression, the file is divided into equal portions, which are compressed in turn and then written to a file.
The resulting archive can easily be unpacked with a standard Windows archiver such as WinRAR or Zip.
(Preferably a small one.) Another question: when you unpack it with WinRAR, do you get all of the data back, or only the first piece?
Example archive: my-files.ru/ivu573. If I understood the question correctly ... then that's it.

Mikezar · Answer 2 · 2017-02-12T12:02:08

After much suffering, I finally figured out how to implement this properly. You can see the code in my GitHub profile: https://github.com/Mikezar/GZip
