Tell me on the implementation of compression static Huffman method. How best to store a Huffman tree in a file?

  • Then you need to count all the saved characters and build a Huffman tree on them? - DarkGenius
  • What do you mean when you write "optimally"? Do you need the smallest file size or the processing speed of this file when loading and then building a tree? - uzumaxy
  • The smallest file size. - DarkGenius
  • one
    So far it seems to me the best solution. Store in the file information about the structure of the tree. For the node, write the left subtree, the right subtree and the bit "0", for the sheet, write the symbol itself and the bit "1". We start saving, obviously, with the recording of the root. Such a tree serialization will allow to restore it unequivocally. - DarkGenius
  • To make a walk through the tree is necessary. We go to the left - put zero, to the right - one. So each character code will get its own. By linking the characters and their codes to build the Hoffman table is no problem. It must be stored together with the string encoded. When decoding needs to be done, the tree will be built back only turning to the side appropriate for the bit. - uzumaxy

3 answers 3

@DarkGenius , you have already written a good option in the comment -

Store in the file information about the structure of the tree. For the node, write the left subtree, the right subtree and the bit "0", for the sheet, write the symbol itself and the bit "1". We start saving, obviously, with the recording of the root.

Indeed, it turns out quite effectively, especially since the top of the tree can only have no descendants, or there can be two. 2 bits * (number of characters - 1) + the characters themselves (most likely, 8 bits each).

But I asked a colleague who did more cunning. He got about 1 bit per character +, of course, the characters themselves. The idea is as follows: rebuilding the tree so that it is skewed strictly in one direction. Now we go through all the sheets from left to right and record the depth at which they are located. For

/ \ a /\ d /\ cb 

it turns out

 adcb 1 2 3 3 

It now remains to write the characters in the resulting order, the initial depth and "grows, does not grow." True, it can grow by two and three ... In this case, you need to do some adaptive arithmetic coding. But basically there should be 0 and 1, which are encoded by one bitik. In general, according to a colleague, bit is saved, but it is very time consuming to implement.

Upd : or not arithmetic, but to encode groups of numbers with a fixed tree Huffman. Or store the positions of numbers other than 0, 1.

By skewing, I meant something like sorting. For example,

  / \ /\ / \ /\ cd /\ abef 

need to transform by swapping subtrees in

  / \ /\ / \ с d /\ /\ abef 
  • @Michael M, I didn’t really understand how to "twist" a tree if it has a more complex structure than in your example. And it is not very clear what the "initial depth and" grows-does not grow "mean, and how to restore the tree with this information. I thought a little more here, and figured out how to save all the information in 478 bytes, regardless of the size of the input data, later I will write in my answer. - dzhioev
  • > 478 bytes, regardless of the size of the input data, I will write later> in your answer Be sure to write. We will redo all archivers. - uzumaxy
  • The initial depth is the depth of the leftmost sheet. The first change is the difference between the depth of the second from the left and the leftmost one. The tree is restored uniquely. We look at the initial depth - we leave so much to the left. Add a sheet in accordance with the first recorded symbol. If further "grows" - on his "brother" we go so deep into it, if not - add a sheet at the same level. Etc. 478. You have probably come up with some kind of trick with a bit mask of all 256 characters. Probably, it is possible to play like that on spreading trees. Nevertheless, 478> 2 bits * 255 + 256 bytes - Mikhail M
  • @uzumaxy, wrote. - dzhioev

It is not necessary to save the tree. You can simply store counters for each character at the beginning of the file, and restore the tree before decoding. If you store counters of 8 bytes, you will need 8B * 256 = 2KiB, which is not much.

UPD : if it's a pity 2KiB, then you can store the counters more compactly. For example, enter the following format for the counter: prefix + counter. The prefix is ​​a three-bit integer equal to the size of the counter in bytes. Examples:

  • The counter is 0. Saved in 000 (3 bits).
  • The count is 222. Saved to 00111011110 (11 bits).
  • The counter is 1024. It is stored in 0100000010000000000 (19 bits).

Those. the size of the saved data for the counter n will be equal to ceil (ceil (log_2 (n)) / 8) * 8 + 3 bits.

UPD2 : came up with another way that, I think, is better than the previous one. First, I will describe what implementation of the tree construction algorithm I am going to use:

 nodes = [Node(WEIGHT[c], c) for c in ALPHABET].sort(reverse=True) while nodes.size() != 1: l, r = nodes[0:2] nodes = nodes[2:] new_node = create_node_with_children(l, r) insert_index = 0 if nodes.size() > 1: # на 2-x последних шагах уже не важен порядок while insert_index < nodes.size() and nodes[insert_index].weight > new_node.weight: insert_index += 1 nodes[insert_index:insert_index] = [new_node] 

If you look closely at the code, you can understand that to restore the tree topology we have enough knowledge of the initial arrangement of characters in the nodes and insert_index at each of the steps. Calculate how much memory it takes.

To store the original permutation, you will need 255 bytes (the first / last character can be not stored, but obtained using the exception method).

Further, we note that the number of iterations in the algorithm is fixed (255) and at the i-th iteration (counting from zero) insert_index lies in the range [0, 255 - i), so we can allocate less bits for storage insert_index at later iterations. And more specifically:

  i | число бит на индекс [0, 127) | 8 [127, 191) | 7 [191, 223) | 6 [223, 239) | 5 [239, 247) | 4 [247, 251) | 3 [251, 253) | 2 [254, 256) | 0 // тут не надо ничего сохранять 

Those. for storing insert indexes, you need 127 * 8 + 64 * 7 + 32 * 6 + 16 * 5 + 8 * 4 + 4 * 3 + 2 * 2 = 1784 bits = 223 bytes.

Total we get 255 + 223 = 478 bytes, and this number does not depend on the size of the input data, in contrast to the method described above.

  • What is a minus? - dzhioev
  • @dzhioev, for an extremely uneconomical solution (especially 8 bytes) - Mikhail M
  • Is this really extreme? 4 bytes may not be enough. Doesn't it bother you that in x86-64 addresses are 8 bytes each? - dzhioev
  • Very wasteful storage) And if the source file is only 1KB, then the compressed will be more than 2KB) - DarkGenius
  • @ Michael M, @DarkGenius, updated the answer. - dzhioev

I do not claim anything, but I did it myself, then I just kept the characters in order of decreasing frequency and there is nothing more.

Because Hafmen optimal prefix code, then the code can be uniquely determined by the order of the character in the sequence.

  • Come on! ... :) In the same place, the two lowest-frequency symbols are combined and their frequency is summed up. If it exceeds the frequency of the 4th from the end of the symbol, then the 3rd and 4th will be combined, and if it does not exceed, the new vertex and the 3rd will be combined. Those. You will write dcba, and whether ba / \ dc or dc-ba in the tree is not clear. Is not it so? - Michael M
  • Indeed, such ambiguity arises. How then to bypass it? - DarkGenius 2:41 pm
  • Damn, I generally wanted to write a comment, but it turned out ... In short, as I understand this thing: There are symbols, frequencies, even so: a-15, b-10, c-3, d-1, e-1 So I always I can change the frequencies without changing the order: a-5, b-4, c-3, d-2, e-1 Since the code is optimal, this overweigh should not affect compression, since the code must always be unique, we have not changed the number of characters and should not be reflected on the frequencies. Therefore, keeping: a, b, c, d, e I actually store a table with weights. Or am I wrong? - Vladislav Pyatkov
  • Just can not be changed. If the letters a, b, c, d are 100, 99, 2, and 1, respectively, in the source file, then this is not the same as 100, 3, 2, and 1. In the second case, it is clearly more profitable to do a single-bit code for Mikhail. M
  • Yes, I understand - wrong. Thanks for showing. - Vladislav Pyatkov