In a canonical hash table, elements whose hashes collide are placed in a linked list, which means that searching within that container is O(n). Why not use a different data structure for the container? For example, a red-black tree, or another hash table with a different hash function. Search and deletion would then be cheaper; true, insertion becomes noticeably more expensive. Does the cost of inserting new items outweigh the cost of searching? Are there libraries in any modern language that let you choose the container used for collisions? For example, one could imagine this case: first I fill the hash table, and then I know for certain that fast lookup will matter a lot, so I convert the buckets into more suitable containers. Or I accept the cost of insertion up front, because search is still far more important to me.
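For illustration, here is a minimal sketch of what such a "pluggable" chained hash table could look like (Java; a hypothetical class, not an existing library API): the caller supplies the per-bucket container, so list buckets can be swapped for tree-based ones, trading cheaper lookups in long chains for more expensive insertion.

    import java.util.*;
    import java.util.function.Supplier;

    class PluggableHashSet<T> {
        private final Collection<T>[] buckets;
        private final Supplier<Collection<T>> bucketFactory;

        @SuppressWarnings("unchecked")
        PluggableHashSet(int capacity, Supplier<Collection<T>> bucketFactory) {
            this.buckets = new Collection[capacity];
            this.bucketFactory = bucketFactory;
        }

        private int index(T value) {
            return Math.floorMod(value.hashCode(), buckets.length);
        }

        void add(T value) {
            int i = index(value);
            if (buckets[i] == null) buckets[i] = bucketFactory.get();
            buckets[i].add(value);   // insertion cost depends on the chosen container
        }

        boolean contains(T value) {
            Collection<T> bucket = buckets[index(value)];
            // With a list bucket this scan is linear in the chain length;
            // with a tree-based bucket it would be logarithmic.
            return bucket != null && bucket.contains(value);
        }
    }

    // Usage: list buckets vs. tree buckets for the same table.
    // new PluggableHashSet<Integer>(1024, ArrayList::new);
    // new PluggableHashSet<Integer>(1024, TreeSet::new);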
- To begin with, lists for resolving collisions are not the only method. - VladD
- @VladD, not the only one, but are there ways to change it? Obviously I can write my own class, but does really nobody provide a flexible hash table that I can tune to my needs, understanding all the pros and cons? - m9_psy
- No, no, search is still O(1). Once the table's fill ratio exceeds the load factor, the table is rebuilt, so the O(1) asymptotics is preserved. In other words, as the table grows, the number of buckets grows from time to time (a small sketch of this rehash step follows after the comments). - VladD
- Good, interesting question. If you are really unlucky and all the data falls into a small number of "buckets" (chains), it makes more sense to think about dynamically tuning the hash function (with a subsequent reorganization of the table) than about trees for resolving collisions. In general, when thinking about optimization, do not forget about the memory taken up by bookkeeping data (pointers, etc.), the size of the caches, and the fact that on modern machines sequential access is much faster than random access (even to data already in the cache). Of course, I am not talking about library implementations here... - avp
- With a properly chosen load factor there are few collisions, and a chain holds only a couple of elements on average. In that case a linear scan will most likely be faster than lookup in a tree: a tree only starts to pay off for search with a large number of elements, since it needs more complex logic (not an equality check, but a less/greater comparison). - VladD
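A minimal sketch of the rehash step mentioned in the comments (hypothetical Java code, assuming simple list buckets of int keys): once the number of stored elements exceeds capacity times the load factor, the bucket array is doubled and every key is redistributed, which keeps the average chain length bounded.

    import java.util.*;

    class RehashSketch {
        // Doubles the bucket array and redistributes all keys.
        static List<List<Integer>> grow(List<List<Integer>> oldBuckets) {
            int newCapacity = oldBuckets.size() * 2;
            List<List<Integer>> newBuckets = new ArrayList<>();
            for (int i = 0; i < newCapacity; i++) {
                newBuckets.add(new ArrayList<>());
            }
            for (List<Integer> bucket : oldBuckets) {
                for (int key : bucket) {
                    int slot = Math.floorMod(Integer.hashCode(key), newCapacity);
                    newBuckets.get(slot).add(key);
                }
            }
            return newBuckets;   // average chain length is roughly halved
        }
    }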
2 answers
Search will be O(n) only with an extremely bad hash, when all elements end up with the same hash value. With a properly chosen hash function you get O(1).
However, the standard does not prescribe the algorithms :), so yes, it is quite possible to use other methods of collision resolution. The list can be replaced by a tree, an array, or even another hash table with a different hash function - after all, perfect hashing (see, for example, Cormen et al., Introduction to Algorithms) does exactly that.
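As a real-world example of the "tree instead of list" idea, java.util.HashMap has done this automatically since JDK 8: a bin switches from a linked list to a balanced tree once its chain grows long enough. And here is a toy sketch of the "hash table inside the bucket" idea, loosely in the spirit of the perfect hashing mentioned above (heavily simplified, hypothetical Java code; keys are assumed distinct and non-zero): a bucket with k keys stores them in a secondary table of size k*k, re-rolling a random multiplier until the secondary hash is collision-free.

    import java.util.*;

    class SecondaryBucket {
        private final int[] table;      // 0 doubles as "empty" to keep the toy short
        private final int multiplier;

        SecondaryBucket(int[] keys) {
            int size = Math.max(1, keys.length * keys.length);
            Random rnd = new Random();
            int a;
            int[] candidate;
            do {
                a = 1 + rnd.nextInt(1_000_003);       // pick a new secondary hash
                candidate = tryPlace(keys, a, size);  // null means a collision occurred
            } while (candidate == null);
            this.table = candidate;
            this.multiplier = a;
        }

        private static int[] tryPlace(int[] keys, int a, int size) {
            int[] t = new int[size];
            for (int key : keys) {
                int slot = Math.floorMod(a * key, size);
                if (t[slot] != 0) return null;        // collision: caller retries
                t[slot] = key;
            }
            return t;
        }

        boolean contains(int key) {                   // one probe, no chain to scan
            return table[Math.floorMod(multiplier * key, table.length)] == key;
        }
    }

The price, as the answer says, is paid at construction time: building the collision-free secondary table is far more expensive than appending to a list.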
The question is about the constants hidden inside O(). With short chains, search (and especially insertion, when it matters) may be faster, thanks to a small constant, than in an asymptotically better but more complex structure.
Moreover, other factors start to play a role here, such as processor cache usage and similar not-quite-algorithmic details.
So, to squeeze out maximum performance there is probably only one way - practical experiment - and it may give a different answer for every task + machine combination...
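A crude timing sketch of that "measure it" advice (hypothetical Java code; a serious comparison would need a proper benchmarking harness with warmup, such as JMH): look up keys in a short list "chain" and in a TreeSet holding the same few elements, and compare the elapsed time on your own machine.

    import java.util.*;

    class ChainVsTreeTiming {
        public static void main(String[] args) {
            List<Integer> chain = new ArrayList<>(List.of(3, 1, 4, 1, 5));
            TreeSet<Integer> tree = new TreeSet<>(chain);

            long t0 = System.nanoTime();
            boolean sink = false;
            for (int i = 0; i < 10_000_000; i++) sink ^= chain.contains(i & 7);
            long t1 = System.nanoTime();
            for (int i = 0; i < 10_000_000; i++) sink ^= tree.contains(i & 7);
            long t2 = System.nanoTime();

            System.out.println("list chain: " + (t1 - t0) / 1_000_000 + " ms, "
                    + "tree: " + (t2 - t1) / 1_000_000 + " ms, sink=" + sink);
        }
    }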
It all depends on the problem to be solved and for each task you can choose the appropriate data structure.
The fastest collection is a plain one-dimensional array: accessing an element costs little more than a direct memory access, so the "ideal" hash collection for the int type would be an array of size int.Max.
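A minimal sketch of that observation (hypothetical Java class; since a boolean array of Integer.MAX_VALUE elements would need about 2 GB, a bounded non-negative key range is assumed instead): with direct addressing the "hash" of an int key is the key itself, and membership is a single array access.

    class DirectAddressSet {
        private final boolean[] present;

        DirectAddressSet(int maxKeyExclusive) {        // keys must be in [0, maxKeyExclusive)
            present = new boolean[maxKeyExclusive];
        }

        void add(int key)         { present[key] = true; }
        boolean contains(int key) { return present[key]; }
    }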
From my own experiments (I have written about a dozen different hash table variants), the most universal option is precisely the "canonical" one: given a well-distributed hash, it runs at roughly O(1) per operation. I repeat: it is the most universal.
Of course, in its pure form an O(1) hash table access is about 3 times slower than an O(1) array access because of the extra internal checks, but systems do not yet have infinite memory with instant allocation of any volume, so you cannot simply create arrays of maximum size.
On the other hand, if the table size is strictly limited, say, to 64 buckets, and the hashes of the data being inserted are mostly multiples of 64, the usual variant gives ~O(n) for every operation and all the benefit of a hash table is lost. In that case it is simpler to use a plain binary tree.
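A worked example of that degenerate case (hypothetical Java snippet): with 64 buckets and hash values that are all multiples of 64, every element maps to bucket 0, so the table collapses into a single O(n) chain.

    class WorstCaseDemo {
        public static void main(String[] args) {
            int buckets = 64;
            for (int hash : new int[]{0, 64, 128, 192, 256, 320}) {
                // hash % 64 is always 0 here, so all elements share one bucket
                System.out.println("hash " + hash + " -> bucket " + (hash % buckets));
            }
        }
    }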