What function is used to compute a hash in std::hash, and how much better is it than the usual CRC32? Priorities: fewer collisions, faster computation. Target: a 32-bit system. Excerpts from the standard as an answer are not welcome.

  • @Softinaria, it is clear (judging by the answer you found on your own) that what you would actually like is to discuss which hash function is better, using CRC32 as a starting point for comparison. Unfortunately, the rules here are no longer such (and the current audience basically matches them) that one can ask questions whose answers are ambiguous and, moreover, surely opinion-based... So accept my sincere condolences. - avp
  • @avp Yes, I wanted to discuss which is better. Doesn't that follow from the topic of the question? And I didn't pick crc32 and std::hash just as a seed for discussion: both return an ordinary number, so the question arises which is better for computing a checksum. The standard solution is more convenient, but the standard does not specify a particular function, and there is no guarantee it will not change later. If, say, an "old" client checks by a stored hash whether the same file exists on a "new" server, the result is garbage (just an example). At the same time, crc32 is more tested. - user212277
  • What does "more tested" mean? And MurmurHash (Bernstein, etc.), have you considered those? - avp
  • @avp Well, obviously, by the phrase "more tested" I did not mean the properties of the algorithms themselves; I meant exactly that use case. The point is that the function used by the standard library may change, and only time will tell. - user212277
  • By the way, read the comments. Are you worried about the number of collisions? If the algorithm allows them at all (i.e., a collision is a normal case), then a few percent of collisions usually has no effect on total processing time. Note: MurmurHash is only that fast on systems that allow unaligned memory access (x86); on ARM, for example, it will be slower. - avp

3 answers

Basis - 32-bit system.

Do you understand that crc32 and the bitness of your system are unrelated? crc32 was used even on 16-bit MS-DOS systems. The 32 is just the size of the resulting hash in bits.

What function is used to create a hash in std::hash

std::hash is not a function but a template, so different types are hashed with different algorithms. For example, std::hash applied to the int value 1 will return 1, since the hash of an integer is the integer itself.

In addition, you did not say which implementation of the C++ standard library you are using. In theory, one can imagine a library that uses crc32 for all hashing (and performs very poorly).

crc32 is used for small data and is fast. md5 is slower, but it is much harder to run into a collision (32 bits vs. 128 bits in md5).

Reply to comments:

I compared crc32 with the C++11 hash, which in turn depends on the bitness of the OS. For a correct comparison they should have the same bit width.

It depends first on the ISO standard, then on the compiler vendor: where the standard does not specify it, how hashing is implemented is their business. And yes, std::hash does depend on the bitness of the system; you are right.

The actual hash functions are implementation-defined, which means clang, msvc, gcc, and everyone else may implement them differently. Did I ask about md5 and its speed? It is outdated, and where I need hashes with minimal collisions I will not use it.

What news. md5 is vulnerable to deliberate attacks (a collision can be constructed intentionally), so it is outdated as a cryptographic hash (thanks to @D-side for the correction), but not as a general hashing function. (A good md5/sha1 comparison.)

The default function GCC uses for hashing strings is MurmurHashUnaligned2.

And here is an excellent comparison of various hashing algorithms.

  • @strangeqargo Regarding the edit... Yes, the hash algorithm is not standardized, but it must return size_t, a number whose size depends on the bitness. So whatever algorithm the compiler vendor implements, the result is a number. In my subjective opinion, if two hash functions produce numbers of the same width, the collision probability will be about the same. - Max ZS
  • @strangeqargo But the processing speed will of course depend directly on the algorithm. Even for the same crc32 there can be differences, for example whether the table is computed on the fly or built in advance. - Max ZS
  • @strangeqargo Yes, I understand; that is exactly why the question was raised. There is one more thing, though. I would like to use the standard function, but it turns out it is not really "standardized". I.e., if it is used on both clients and the server, and over time the server is recompiled on an updated OS and the hash changes, the result will be garbage. - user212277
  • 2
    "outdated as a function for encryption" - I will help you a little with terms, this is called a "cryptographic hash" , it still does not have a relationship to encryption. - D-side
  • @strangeqargo I compared std::hash and crc32 on a byte array (effectively on strings). My tests probably can't be called rigorous, but std::hash really does beat crc32 in speed. Collisions were essentially zero for both; only once did crc32 produce two collisions on 100k strings of 2k characters each. At smaller volumes neither algorithm produced collisions. The shorter the strings, the smaller the speed difference; on words and short phrases they are about equal (which is logical, since most of the cost presumably goes to various kinds of initialization). - user212277

Obviously, the answer depends on the type of data being hashed, the standard library implementation, the bitness, etc. If you use libstdc++, then for std::string you end up in the _Hash_bytes function from here: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/libsupc%2B%2B/hash_bytes.cc

Obviously, this is something quite simple, chosen for performance reasons. No CRC. This hash hardly even has a name.

  • 3
    The name is specified in the comments to the code: 'Based on the public domain MurmurHashUnaligned2, by Austin // Appleby. murmurhash.googlepages.com ' - αλεχολυτ am
  • Yes, αλεχολυτ is right, this is MurmurHash2. That also followed from the link in my deleted comments. But the question remains open: which is better? I don't see the answer in this answer. - user212277
  • MurmurHash2 is faster (it works on blocks of 32 or 64 bits). CRC is easier to implement in hardware (not relevant in a C++ context). That's the whole difference. - Chorkov
  • @Chorkov embedded.com/design/programming-languages-and-tools/4438660/… good article on the topic - strangeqargo

Based on the comments, as I understand it, std::hash for strings will use the MurmurHashUnaligned2 algorithm. For plain integers there is no algorithm as such: the result is the number itself. So for numbers std::hash will certainly be faster than crc32 and essentially collision-free. With strings the situation is different, so I suppose the interesting comparison is between MurmurHashUnaligned2 and crc32. Since both algorithms produce numbers of the same bit width (as stated in the question), I think the collision probability will be roughly equal. In general, a good comparison of algorithms by speed and collisions across data types can be found here; both crc32 and murmur2 are present in the test.

  • Thanks for the answer. Well, judging by the tests strangeqargo also cited, it turns out that murmurhash2 is faster than crc32 but loses on the number of collisions. It always seemed to me that a hashing algorithm existed for a reason and should beat the checksum on collisions :( - user212277
  • The priorities for the standard library were, I think: speed / memory / collisions. I don't know why the committee decided so, but presumably they had their reasons, and you can probably even find the rationale. - strangeqargo
  • @strangeqargo Well, no need to see a fortune-teller for that :) Clearly the standard's choice came from considerations of their own, and judging by the tests, collisions were obviously "not held in high esteem"; rather, speed and memory were. For crc32, the generated table alone eats a kilobyte. - Max ZS