Two examples and a question.

PHP example:

    <?php
    $table = array();
    $info = array();

    for ($i = 0; $i < 65535; $i++)
        $table[$i] = '';
    for ($i = 65535; $i < 65538; $i++) {
        $t = microtime(true);
        $table[$i] = '';
        $t = microtime(true) - $t;
        $info[] = array($i, $t);
    }
    for ($i = 65538; $i < 131071; $i++)
        $table[$i] = '';
    for ($i = 131071; $i < 131074; $i++) {
        $t = microtime(true);
        $table[$i] = '';
        $t = microtime(true) - $t;
        $info[] = array($i, $t);
    }
    ?><pre><?php print_r($info) ?></pre>

Result:

    Array
    (
        [0] => Array ( [0] => 65535  [1] => 5.0067901611328E-6 )
        [1] => Array ( [0] => 65536  [1] => 0.0043141841888428 )
        [2] => Array ( [0] => 65537  [1] => 9.5367431640625E-7 )
        [3] => Array ( [0] => 131071 [1] => 3.0994415283203E-6 )
        [4] => Array ( [0] => 131072 [1] => 0.0083508491516113 )
        [5] => Array ( [0] => 131073 [1] => 9.5367431640625E-7 )
    )

Python example:

    from time import time
    from pprint import pprint

    table = {}
    info = []

    for i in range(0, 21844):
        table[i] = ''
    for i in range(21844, 21847):
        t = time()
        table[i] = ''
        t = time() - t
        info.append((i, t))
    for i in range(21847, 87380):
        table[i] = ''
    for i in range(87380, 87383):
        t = time()
        table[i] = ''
        t = time() - t
        info.append((i, t))
    pprint(info)

Result:

    [(21844, 9.5367431640625e-07),
     (21845, 0.0022368431091308594),
     (21846, 9.5367431640625e-07),
     (87380, 2.1457672119140625e-06),
     (87381, 0.0048618316650390625),
     (87382, 1.9073486328125e-06)]

As I understand it, the amortized complexity of inserting an element is still O(1) even counting table rebuilds; but if there were a way to tell the interpreter in advance the maximum number of keys I expect, the constant hidden in that O(1) could be cut significantly, and many intermediate memory allocations avoided.

As far as I can tell, PHP and Python cannot do this out of the box.

The question: which scripting languages can do this? As exceptions, answers about C++ and Java are also acceptable, as is an explanation of why this is not needed at all.
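For reference, the measurement idea from the examples above can be condensed into one sketch (my own code; the function name `fill` is mine): total insertion time grows roughly linearly with the number of keys even though individual inserts occasionally spike on a resize, which is what "amortized O(1)" means here.

```python
from time import perf_counter

def fill(n):
    """Insert n keys into a fresh dict; return (total_time, slowest_single_insert)."""
    table = {}
    worst = 0.0
    start = perf_counter()
    for i in range(n):
        t0 = perf_counter()
        table[i] = ''
        # Track the single slowest insert -- on CPython this is
        # typically the one that triggered a table resize.
        worst = max(worst, perf_counter() - t0)
    return perf_counter() - start, worst

total_small, worst_small = fill(100_000)
total_big, worst_big = fill(200_000)
# total_big is roughly 2x total_small, while worst_* is dominated
# by a single resize, orders of magnitude above the median insert.
```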

    1 answer

    Perl language:

     #!/usr/bin/perl
     use strict;
     use warnings;
     use Time::HiRes qw(time);

     my $start = time();   ### Save the start time
     my @a;                ### Declare an empty array

     ### Uncomment the next 3 lines for the second part of the test:
     # my $t = time();
     # $#a = 200000;       ### Here we reserve 200k elements ($#a is the last index of @a)
     # print "Reservation: ", time() - $t, "\n";

     for (my $i = 0; $i < 200000; $i++) {
         my $t = time();
         $a[$i] = $i;
         my $r = time() - $t;
         print "$i: $r\n" if ($r > 9e-5);   ### Adjust the time limit for your machine
     }
     print "Run time is ", time() - $start, "\n";

    The output of the program, with the reservation lines still commented out, is:

     8190: 0.000108003616333008
     16382: 0.000221967697143555
     32766: 0.000423908233642578
     65534: 0.00084996223449707
     131070: 0.00179505348205566
     Run time is 0.421634912490845

    On repeated runs, the latency spikes when adding the next element stay stable at the same array indices.

    Now let's uncomment the lines before the loop, whose only purpose is to change the number of elements of the already-declared array ( $#a=200000; ). The result:

     Reservation: 0.00147509574890137
     Run time is 0.407889842987061

    It is noticeable that, without reserving memory in advance, the duration of the "spike" when adding an element that forces the array to be reorganized grows with the array's size. Importantly, the reorganization time at element 131070 is comparable to the time needed to reserve memory for all 200k elements, and the total run time with reservation is lower than without it. So the benefit is plain.
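The same reservation trick can be tried in Python, though only for a plain list (a rough analogue of the Perl array; Python offers no way to presize a dict). This is my own sketch: allocating all slots up front with [None] * n avoids the incremental buffer reallocations that repeated append triggers.

```python
from time import perf_counter

N = 200_000

# Grow by appending: the backing buffer is reallocated repeatedly as it grows.
t = perf_counter()
grown = []
for i in range(N):
    grown.append(i)
append_time = perf_counter() - t

# "Reserve" first: allocate all N slots up front, then fill them in place.
t = perf_counter()
reserved = [None] * N
for i in range(N):
    reserved[i] = i
prealloc_time = perf_counter() - t

assert grown == reserved  # both approaches build the same list
```

On CPython the difference is modest, since append already over-allocates geometrically, which matches the small gains PinkTux reports below for Perl arrays.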

    UPD:

    Granted, that was a plain array; hashes in Perl are a different structure, but something can be done to optimize them too. Change the array in the program to a hash: replace the declaration my @a with my %a , and the element access $a[$i] with $a{$i} . Reserving the number of buckets for a hash is done with keys(%a)=200000; (note that 262144 buckets are actually reserved, i.e. a power of 2; checked with $a{1}=1; keys(%a)=200000; print scalar(%a); which prints 1/262144 ). The results are curious...

    Without reservation:

     3401: 0.000152826309204102
     4095: 0.000200033187866211
     7497: 0.000319957733154297
     8192: 0.00032496452331543
     15689: 0.000678062438964844
     16384: 0.000677824020385742
     32074: 0.00168180465698242
     32769: 0.00153708457946777
     64841: 0.00349617004394531
     65535: 0.00367498397827148
     130379: 0.00778102874755859
     130453: 0.000139951705932617
     131074: 0.00826096534729004
     174142: 0.000139951705932617
     Run time is 0.533354043960571

    With reservation:

     Reservation: 0.0016169548034668
     3401: 0.000159978866577148
     7497: 0.000318050384521484
     15689: 0.000644922256469727
     32074: 0.00147390365600586
     64841: 0.00345706939697266
     130379: 0.00780797004699707
     130433: 0.000134944915771484
     Run time is 0.531057119369507

    I.e. the latency spikes are still noticeable, but at a smaller number of elements. The reason for this behavior is unknown to me.
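The power-of-two bucket growth observed above for Perl hashes can also be made visible in CPython. A small sketch of my own, using sys.getsizeof, which on CPython reflects the dict's currently allocated table:

```python
import sys

d = {}
resize_points = []  # (key count at which the dict grew, new size in bytes)
for i in range(100_000):
    before = sys.getsizeof(d)
    d[i] = ''
    after = sys.getsizeof(d)
    if after > before:
        # The allocated table just grew: record where and to what size.
        resize_points.append((i, after))

# The size jumps at a handful of insertion points, and each jump is
# roughly geometric -- the same growth pattern as the Perl output above.
```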

    • It was worth looking at the source after all :) With arrays everything seems simple: the resize is cheap, and the conditions that trigger it are clear. There is a speed gain, though not a big one (I get 0.14478588104248 vs 0.137899875640869 with 200_000 elements, and 1.10408186912537 vs 1.04192900657654 with 2_000_000). Hashes are trickier: in theory a resize should happen when the size of a bucket (or maybe more than one) is exceeded, but is it really worth digging into? And clearly this depends heavily on the input data. - PinkTux
    • @PinkTux, the average number of elements per bucket, as I understand it, is computed the same way in all implementations: the number of elements (incremented/decremented when adding/removing one) divided by the number of buckets. The question is exactly about being able to tell the hash in advance a number of keys that will not be exceeded, and why this cannot be done in almost all popular scripting languages. If it can be done in C++, then from my point of view that would also make a good answer. - user239133
    • @AlexanderZonov, I'm not sure about all implementations. An article about hashes in PHP 7 appeared on Habr; they are definitely different from Perl's. As for the number of keys, we still can neither guarantee a one-slot-one-value correspondence nor predict how values will be distributed across the buckets. In any case, many implementations (including my own) let you set the initial number of buckets, not the number of hash elements. - PinkTux
    • @AlexanderZonov Why it can't be done... as you can see, the gain is small, and it surely depends not only on the number of keys but, most importantly, on the distribution produced by the hash function; for any function there is data that breaks the uniformity of that distribution. And the overwhelming majority of people who write in these languages don't think about performance, so for whom would this feature be made? - Mike