I was given a test assignment: write an archiver that works efficiently in a multiprocessor environment and can process files whose size exceeds the amount of available RAM (without the thread pool and BackgroundWorker).

I have a question: are the ordinary thread functions enough for this task, or is it better to use, for example, the TPL?

I was told that it is best to divide the incoming data stream into equal parts, start a thread to compress each part, and then merge the results into a single stream. But I don't quite understand how to implement this. If it's not too much trouble, please share links to an application or article where this is covered.

I also don't quite understand how to remove the dependence on the amount of RAM when archiving large files.
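To make the RAM question concrete, here is a minimal single-threaded sketch of one possible approach: read a bounded chunk, compress it, write it out, release it. The 4-byte length prefix and the 1 MB chunk size are illustrative assumptions, not part of the task:

```csharp
using System;
using System.IO;
using System.IO.Compression;

class ChunkedGzip
{
    // Compresses the input in fixed-size chunks, so memory use stays
    // bounded by chunkSize no matter how large the file is.
    static void Compress(string inputPath, string outputPath, int chunkSize = 1 << 20)
    {
        using (var input = File.OpenRead(inputPath))
        using (var output = File.Create(outputPath))
        {
            var buffer = new byte[chunkSize];
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                using (var packed = new MemoryStream())
                {
                    using (var gz = new GZipStream(packed, CompressionMode.Compress, true))
                        gz.Write(buffer, 0, read);

                    // The length prefix lets the unpacker find chunk boundaries later.
                    output.Write(BitConverter.GetBytes((int)packed.Length), 0, 4);
                    packed.Position = 0;
                    packed.CopyTo(output);
                } // the MemoryStream is released each iteration; buffer is reused
            }
        }
    }
}
```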

  • “I have a question: are the ordinary thread functions enough for this task, or is it better to use, for example, the TPL?” The question should rather be the opposite :) The TPL is a wrapper for convenience. For the second part, you can try Producer/Consumer (look at the «Исследования» (Research) section on the site). The size of your “equal” part depends on the amount of memory. That is, divide the file into parts small enough that RAM suffices (and, accordingly, there should be a limit on the number of simultaneously processed parts); a sketch of this arrangement follows this comment thread. - Veikedo
  • 1. You don't start with that. Start with the algorithm and try to make it parallelizable; that is the hardest part. Distributing subtasks across threads is the easy part once the packing algorithm is in place. 2. The “plain” threading functions are much less convenient than the TPL. I don't understand where the requirement “without a thread pool and BackgroundWorker” came from; how exactly multithreading is implemented should not be part of the specification. 3. The dependence on RAM is solved simply: you read a piece of the source file, compress it, write the compressed piece to another file, and free the memory. And so on. - VladD
  • @Veikedo, thanks for the reply. @VladD, if you remember, I recently asked about XML here, so this is practically the same thing: a task from the employer. I found a pretty good example. And one more. Now I just have to understand them; fortunately, I have a week. - Sier
  • @Sier: wow, some employer you have. Demand a senior position right away. - VladD
  • @VladD, this is a task for a junior position) - Sier
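To make Veikedo's Producer/Consumer suggestion and VladD's point 3 concrete, here is one possible arrangement, sketched under my own assumptions (the length-prefix format and all names are illustrative): a reader thread fills a bounded queue, plain worker threads (no thread pool) compress chunks, and the calling thread writes results back out in their original order.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Threading;

class ParallelGzip
{
    static void Compress(string inputPath, string outputPath,
                         int chunkSize = 1 << 20, int workerCount = 4)
    {
        // Bounded queue: the reader blocks when it gets too far ahead,
        // which is what keeps memory use independent of the file size.
        var jobs = new BlockingCollection<KeyValuePair<int, byte[]>>(workerCount * 2);
        var results = new Dictionary<int, byte[]>();
        var gate = new object();
        bool readerDone = false;
        int totalChunks = 0;

        // Consumers: plain threads compressing chunks as they arrive.
        var workers = new Thread[workerCount];
        for (int w = 0; w < workerCount; w++)
        {
            workers[w] = new Thread(() =>
            {
                foreach (var job in jobs.GetConsumingEnumerable())
                {
                    using (var packed = new MemoryStream())
                    {
                        using (var gz = new GZipStream(packed, CompressionMode.Compress, true))
                            gz.Write(job.Value, 0, job.Value.Length);
                        lock (gate)
                        {
                            // NB: results itself is not bounded here; a fuller
                            // version would also cap it against a slow disk.
                            results[job.Key] = packed.ToArray();
                            Monitor.PulseAll(gate);
                        }
                    }
                }
            });
            workers[w].Start();
        }

        // Producer: reads bounded chunks and queues them with an index.
        var reader = new Thread(() =>
        {
            using (var input = File.OpenRead(inputPath))
            {
                var buffer = new byte[chunkSize];
                int index = 0, read;
                while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
                {
                    var chunk = new byte[read];
                    Array.Copy(buffer, chunk, read);
                    jobs.Add(new KeyValuePair<int, byte[]>(index++, chunk)); // blocks when full
                }
                lock (gate) { readerDone = true; totalChunks = index; Monitor.PulseAll(gate); }
                jobs.CompleteAdding();
            }
        });
        reader.Start();

        // Writer (this thread): emits blocks strictly in their original order.
        using (var output = File.Create(outputPath))
        {
            int next = 0;
            while (true)
            {
                byte[] block;
                lock (gate)
                {
                    while (!results.TryGetValue(next, out block))
                    {
                        if (readerDone && next >= totalChunks) break;
                        Monitor.Wait(gate);
                    }
                    if (block != null) results.Remove(next); // free it once taken
                }
                if (block == null) break; // every chunk has been written
                output.Write(BitConverter.GetBytes(block.Length), 0, 4);
                output.Write(block, 0, block.Length);
                next++;
            }
        }
        foreach (var w in workers) w.Join();
        reader.Join();
    }
}
```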

1 answer

@VladD, there isn't enough space in the comments, so I'm leaving this as an answer. I may have more questions on the same topic, so I wouldn't want to close this thread yet. I'm just getting settled, and the company looks quite serious at first glance. Perhaps you judged the complexity of the task from my first post; apparently I overcomplicated it a little myself. In fact, all they want is multithreading and GZip compression, nothing more. Splitting the data stream into blocks is something I brought up myself, because without it the implementation of this task is quite trivial. I just wanted to stand out by taking on these difficulties; I think it will be a good plus. And I really need the job at the moment.
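For reference, the "trivial" variant mentioned above is just a streaming copy through a single GZipStream; memory use is constant, but it runs on one core. A minimal sketch:

```csharp
using System.IO;
using System.IO.Compression;

class TrivialGzip
{
    // Single-threaded baseline: one GZipStream, streaming copy.
    static void Compress(string inputPath, string outputPath)
    {
        using (var input = File.OpenRead(inputPath))
        using (var output = File.Create(outputPath))
        using (var gzip = new GZipStream(output, CompressionMode.Compress))
            input.CopyTo(gzip);
    }
}
```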

UPD 1: So what should I do then, write these blocks to a file right away? I have no other options yet. The extension is known in advance, and I think the location can be determined from the location of the file being compressed.
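Yes, each compressed block can go to the file as soon as its turn comes; nothing needs to be held back, as long as the format lets the unpacker find the block boundaries again. Here is a sketch of the reading side, assuming the illustrative [4-byte length][gzip data] layout from the sketches above:

```csharp
using System;
using System.IO;
using System.IO.Compression;

class BlockReader
{
    // Reads one [4-byte length][gzip data] block and unpacks it.
    // Returns null at the end of the archive.
    static byte[] ReadNextBlock(Stream archive)
    {
        var lengthBytes = new byte[4];
        if (archive.Read(lengthBytes, 0, 4) < 4) return null;

        var packed = new byte[BitConverter.ToInt32(lengthBytes, 0)];
        int offset = 0;
        while (offset < packed.Length) // Read may return less than requested
        {
            int n = archive.Read(packed, offset, packed.Length - offset);
            if (n == 0) throw new EndOfStreamException("truncated block");
            offset += n;
        }

        using (var gz = new GZipStream(new MemoryStream(packed), CompressionMode.Decompress))
        using (var unpacked = new MemoryStream())
        {
            gz.CopyTo(unpacked);
            return unpacked.ToArray();
        }
    }
}
```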

UPD 2: Okay, today I will try packing the blocks one at a time. I just can't make sense of that project's source code; it's hard going. If only there were at least comments on the methods (

UPD 3: It seems to be written, but I just don't understand how to start a task in a loop to compress each block. Here is the code. I'm completely confused by now. And yes, do I need to free the memory after a block has been compressed? The CompressChunk method needs to be rewritten; I've botched something in there...
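On starting work in a loop: with plain threads the classic pitfall is that every lambda captures the same loop variable, so all threads can end up compressing the same block. A sketch with the usual fix (this CompressChunk is a stand-in for your own method, not the code from your link):

```csharp
using System;
using System.Threading;

class LoopStart
{
    // Stand-in for the real compression method.
    static void CompressChunk(byte[] block, int index)
    {
        Console.WriteLine("compressing block {0} ({1} bytes)", index, block.Length);
    }

    static void CompressAll(byte[][] blocks)
    {
        var threads = new Thread[blocks.Length];
        for (int i = 0; i < blocks.Length; i++)
        {
            int index = i; // copy the loop variable: each closure must see its own value
            threads[i] = new Thread(() => CompressChunk(blocks[index], index));
            threads[i].Start();
        }
        foreach (var t in threads) t.Join(); // wait until every block is done
    }
}
```

As for freeing memory: once a block's compressed output has been written, dropping the last reference to the raw data (for example, blocks[index] = null) lets the GC reclaim it. And for a really large file you would start only a bounded number of threads at a time, as in the producer/consumer sketch earlier.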

UPD 4: In general, my strength is exhausted, and so is my hope of finishing everything in time. I sent the letter and everything I managed to write. @VladD, thank you so much for your help and explanations; without them I would still be stuck in one place. Thanks also to @Veikedo. I'll finish this thing sometime later.

  • I didn't quite understand your question. Are you talking about System.IO.Compression.GZipStream or about hooking up a third-party library? In my case I need to use the first option. But whether it works block by block or not, I couldn't find out; I didn't see it in the specification either, or I looked in the wrong place. - Sier
  • @VladD, can you look at the code? I still need to add exception handling and regular expressions for validating the incoming input. What bothers me is that I couldn't arrange it so that, when unpacking, the name is entered only once at the input: when copying the memory stream to the file stream, you have to specify the name by hand again, which is wrong. Maybe just not return the memory stream, but do everything at once in the same method? hmm.. - Sier
  • @VladD, 1) The assignment says the methods must be invoked from the command line, i.e. compress [file name] [archive extension]. 2) gzip is based on the deflate algorithm (a combination of LZ77 and Huffman coding), and as far as I understood, it already splits the data stream into blocks itself. Therefore I did without it, but I think by evening I will try to do it with splitting into blocks. 3) As for everything else, I understood; I'll fix it. Could you share a link to an article with an example of how to deal with the dependence on RAM? If it's not too much trouble) - Sier
  • @Sier: 1) Ah, command-line parameters! Well, that's the smallest of the problems. Regular expressions, by the way, are not needed. 3) Well, for example, write your own implementation of Stream, almost like FileStream, which does not hold data in memory but writes it to a file starting at a given offset (important!). Or, as a temporary, inefficient solution, unpack each part into a MemoryStream and, at the end, flush it to the file and free the memory (this maps very well onto async/await); see the sketch after these comments. I don't know any links, sorry. - VladD
  • @VladD, So I've roughly figured out how to do it. One question remains: after I split the data stream into blocks and compress them, do I write them to the file immediately? As I understand it, after compression they still have to be joined together and only then written to the file. But if they have to be joined before writing, I don't understand how to implement that: they have to be stored somewhere in the meantime, and the best option that came to mind is to keep the compressed blocks in an intermediate file, because keeping them in a List would be too memory-hungry. I'm a little confused. - Sier
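On that last question: the blocks do not have to be joined in memory before writing. For compression, a single writer can append them in index order as they become ready (as in the producer/consumer sketch above). For unpacking, VladD's offset idea means each thread writes its block straight to the right place in the output file. A sketch of the latter, assuming a fixed uncompressed block size so that block i starts at (long)i * blockSize (all names here are illustrative):

```csharp
using System.IO;
using System.IO.Compression;

class OffsetWriter
{
    // Unpacks one compressed block directly into its place in the output
    // file, so the unpacked data never sits in memory as a whole.
    static void UnpackBlockAt(string outputPath, long offset, byte[] packedBlock)
    {
        // FileShare.Write lets several threads open the same file and each
        // write to its own non-overlapping region.
        using (var output = new FileStream(outputPath, FileMode.OpenOrCreate,
                                           FileAccess.Write, FileShare.Write))
        using (var gz = new GZipStream(new MemoryStream(packedBlock),
                                       CompressionMode.Decompress))
        {
            output.Seek(offset, SeekOrigin.Begin);
            gz.CopyTo(output);
        }
    }
}
```

This works because every block except possibly the last has the same uncompressed size, so offsets are computable up front; the last block simply writes fewer bytes.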