MD5, unicode and multibyte

Question

The question is simple, but I am not well versed in this area, so I would like to hear from people with experience.

I have a simple class whose constructor accepts any string and hashes it with the MD5 algorithm. I decided to compile the project with the UNICODE setting and compare the results with the MULTIBYTE build, so I reworked the class for Unicode (it was originally written for the multibyte encoding). The two builds produce different output for the same string. I know that a Unicode character occupies 2 bytes, so the character codes differ. Is MD5 supposed to give a different result in this case? Does the algorithm work with Unicode at all, or am I using it incorrectly?

Accepted Answer · 2016-07-21T10:22:35

MD5 is a message hashing algorithm, but nothing says that the message must be text. MD5 (like other similar algorithms) operates on a stream of data (bits, bytes), not on characters. The same string in a different encoding is a different byte sequence, so the result you got is entirely expected.

PS: There is no point in looking for another algorithm that would produce the same result for the same string in different encodings. It is easier to convert everything to a single encoding.
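To illustrate the point about bytes, here is a minimal Python sketch (Python only for brevity, it is not the asker's C++ class). It assumes the multibyte build stores text in an ANSI code page (cp1251 is picked purely as an example) and the Unicode build stores it as UTF-16LE; neither encoding is confirmed in the question.

```python
import hashlib

text = "abc"

# Assumption: "multibyte" build -> ANSI code page (cp1251 chosen as an example),
# "Unicode" build -> UTF-16LE (2 bytes per character for this text).
narrow_bytes = text.encode("cp1251")
wide_bytes = text.encode("utf-16-le")

# MD5 only ever sees these byte sequences, never the characters themselves.
narrow_md5 = hashlib.md5(narrow_bytes).hexdigest()
wide_md5 = hashlib.md5(wide_bytes).hexdigest()

print(narrow_bytes)            # b'abc'
print(wide_bytes)              # b'a\x00b\x00c\x00'
print(narrow_md5)              # 900150983cd24fb0d6963f7d28e17f72
print(narrow_md5 == wide_md5)  # False
```

Even for plain ASCII text the two byte sequences differ (3 bytes versus 6), so the digests differ too.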

This is a little off topic, but can you recommend a library that implements such algorithms?
I used to use beecrypt, but lately, when I need a particular algorithm, I find the 1-2 source files for it online and include them in the project.
Over time, all the basic algorithms I need have accumulated that way.

KoVadim · Answer 2 · 2016-07-21T10:18:21

MD5 works with bytes. It knows nothing about characters or encodings, so it is entirely expected that the same text produces different results in different encodings. If you want the result to be the same, agree within the program that MD5 is always computed over one fixed encoding (a specific Unicode encoding or any other), and convert the string to that encoding before computing the hash.
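A rough Python sketch of that agreement might look like the following. The helper name `md5_of_raw`, the choice of UTF-8 as the agreed encoding, and the two source encodings are all assumptions made for illustration.

```python
import hashlib

def md5_of_raw(raw: bytes, source_encoding: str, agreed_encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode raw bytes, re-encode to the agreed encoding, then hash."""
    text = raw.decode(source_encoding)
    return hashlib.md5(text.encode(agreed_encoding)).hexdigest()

# The same text as each build might store it (assumed encodings).
raw_multibyte = "привет".encode("cp1251")
raw_unicode = "привет".encode("utf-16-le")

digest_a = md5_of_raw(raw_multibyte, "cp1251")
digest_b = md5_of_raw(raw_unicode, "utf-16-le")

print(raw_multibyte == raw_unicode)  # False: the stored bytes differ
print(digest_a == digest_b)          # True: the hashed bytes are identical
```

Which encoding is agreed on matters less than applying it consistently everywhere the hash is computed or compared.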

Well, not exactly bytes, rather bits, although most implementations of course accept bytes.
If you look at the specification, then yes, it talks about bits everywhere, but at every stage they come in whole multiples of a byte. And in most implementations a byte is exactly 8 bits.
I meant that MD5 gives different results for a single bit 1 and for the byte 0x01.
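The difference between a single bit and the byte 0x01 is already visible in the padding step. The sketch below follows the padding described in RFC 1321 (append a 1 bit, then 0 bits up to 448 mod 512, then the original length in bits as a 64-bit little-endian value). It is not a full MD5 implementation, and `md5_pad` is a hypothetical name.

```python
def md5_pad(bits: str) -> bytes:
    """Pad a message given as a string of '0'/'1' characters, RFC 1321 style."""
    length = len(bits)
    padded = bits + "1"               # the mandatory single 1 bit
    while len(padded) % 512 != 448:   # then 0 bits up to 448 mod 512
        padded += "0"
    # Pack bits into bytes, most significant bit of each byte first.
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    # Append the original length in bits, 64-bit little-endian.
    return body + (length % 2**64).to_bytes(8, "little")

one_bit = md5_pad("1")           # a message that is the single bit 1
one_byte = md5_pad("00000001")   # a message that is the byte 0x01

print(one_bit[:2].hex(), one_bit[-8:].hex())    # c000 0100000000000000
print(one_byte[:2].hex(), one_byte[-8:].hex())  # 0180 0800000000000000
```

The two padded blocks differ both in their leading bytes and in the length field, so the hash computation starts from different input in each case.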
