Converting two-byte characters in PHP

Question

Please tell me how to convert a two-byte character from one encoding to another, from UTF-8 to cp1251, in PHP. That is, say I want to convert the letter "g" from UTF-8 to cp1251. As I understand it, I get two characters, "P" and "i", i.e. two bytes. How do I then combine them, or convert them to a decimal value, and perform some operation on it, for example subtract 848, to get the corresponding code in cp1251?

I am interested in the method itself, not in ready-made functions.

Accepted Answer · 2013-02-07T21:51:56

@platedz, it does not matter that it is PHP.

UTF-8 characters must first be decoded into UCS codes, and those are then mapped (where possible) to cp1251. Naturally, not every UCS code (for example most accented Latin-1 letters, box-drawing characters, hieroglyphs, etc.) can be represented in cp1251.
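The second half of that pipeline (UCS to cp1251) can be sketched for the Russian letters, where the "subtract 848" idea from the question applies: А..я occupy U+0410..U+044F in UCS and 0xC0..0xFF in cp1251, and 0x410 - 0xC0 = 848. The function name `ucsToCp1251` is hypothetical, and the sketch covers only ASCII, that contiguous range, Ё, ё and №; a real converter needs the complete cp1251 table.

```php
<?php
// Hypothetical helper: maps a UCS code to a cp1251 byte value,
// or returns null when this sketch does not know the mapping.
function ucsToCp1251(int $ucs): ?int
{
    if ($ucs < 0x80) {
        return $ucs;                 // ASCII is identical in cp1251
    }
    if ($ucs >= 0x410 && $ucs <= 0x44F) {
        return $ucs - 848;           // А..я: 0x410..0x44F -> 0xC0..0xFF
    }
    // A few characters outside the contiguous range (not a full table).
    $extra = [0x401 => 0xA8, 0x451 => 0xB8, 0x2116 => 0xB9]; // Ё, ё, №
    return $extra[$ucs] ?? null;
}

printf("0x%02X\n", ucsToCp1251(0x410)); // prints 0xC0
```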

How to convert UTF-8 to UCS.

Look at the first (most significant) bit of the UTF-8 byte. If it is 0, the UCS code equals the value of that byte (this is ASCII).

If the first byte of a character begins with the bits 10, or its value is 0xff or 0xfe, the UTF-8 is malformed.
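These first-byte checks can be sketched as a small classifier. The function name `classifyUtf8Byte` and its return strings are made up for illustration, and the sketch follows only the rules stated above; current UTF-8 additionally never uses the bytes 0xC0, 0xC1 and 0xF5..0xFF.

```php
<?php
// Hypothetical helper: classifies one byte value (0..255)
// according to the rules described above.
function classifyUtf8Byte(int $byte): string
{
    if (($byte & 0x80) === 0) {
        return 'ascii';          // 0xxxxxxx: the byte is the UCS code
    }
    if (($byte & 0xC0) === 0x80) {
        return 'continuation';   // 10xxxxxx: an error as the first byte
    }
    if ($byte === 0xFE || $byte === 0xFF) {
        return 'invalid';        // never appear in UTF-8
    }
    return 'lead';               // 11xxxxxx: starts a multi-byte sequence
}

echo classifyUtf8Byte(0xD0), "\n"; // prints lead
```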

Now examine the high bits of the first byte: a run of 1s followed by a single 0. The number of 1s equals the number of UTF-8 bytes that encode the UCS code. The remaining bits of the first byte are the high bits of the encoded UCS code. Every following byte of the character must begin with 10, and its remaining 6 bits carry the next part of the code.
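The rule above can be sketched as a decoder for a single character. `decodeUtf8Char` is a hypothetical name; the sketch accepts sequences of 2 to 4 bytes and does not reject overlong encodings or surrogate codes, which a strict decoder would also have to do.

```php
<?php
// Hypothetical helper: decodes the first UTF-8 character of $s.
// Returns [UCS code, number of bytes used], or null on malformed input.
function decodeUtf8Char(string $s): ?array
{
    if ($s === '') {
        return null;
    }
    $lead = ord($s[0]);
    if ($lead < 0x80) {
        return [$lead, 1];                 // ASCII
    }
    // Count the leading 1 bits of the first byte.
    $len = 0;
    for ($mask = 0x80; $mask > 0 && ($lead & $mask) !== 0; $mask >>= 1) {
        $len++;
    }
    if ($len < 2 || $len > 4 || strlen($s) < $len) {
        return null;                       // stray 10xxxxxx, too long, or truncated
    }
    // The bits after the terminating 0 are the high bits of the code.
    $ucs = $lead & (0x7F >> $len);
    for ($i = 1; $i < $len; $i++) {
        $b = ord($s[$i]);
        if (($b & 0xC0) !== 0x80) {
            return null;                   // following bytes must start with 10
        }
        $ucs = ($ucs << 6) | ($b & 0x3F);  // append 6 payload bits
    }
    return [$ucs, $len];
}

[$code, $used] = decodeUtf8Char("\xD0\x90");
printf("0x%X in %d bytes\n", $code, $used); // prints 0x410 in 2 bytes
```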

The basic Cyrillic block (U+0400 to U+04FF) is encoded with 2 UTF-8 bytes. For example, the Russian letter А (UCS code 0x410) is the 2 bytes 0xd0 0x90 in UTF-8.

In binary that is 1101 0000 1001 0000. Write it as 110 10000 10 010000 (the leading 110 means the character takes 2 bytes in UTF-8). Take the 11 payload bits (5 from the first byte and 6 from the second) and form the UCS code 10000010000, or, split into nibbles, 100 0001 0000, i.e. 0x410.

Another example is the numero sign №, which takes 3 bytes.

 № in UTF-8: 0xe2 0x84 0x96
 1110 0010  1000 0100  1001 0110    (the leading 1110 means 3 bytes)
 0010  00 0100  01 0110             (4 + 6 + 6 payload bits)
 0010 0001 0001 0110 == 0x2116
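The bit tables in these two examples can be reproduced with a short script, which is a convenient way to check the arithmetic by hand. The helper name `utf8Bits` is made up for this sketch.

```php
<?php
// Hypothetical helper: renders each byte of a string as 8 binary digits.
function utf8Bits(string $s): string
{
    $parts = [];
    foreach (str_split($s) as $ch) {
        $parts[] = str_pad(decbin(ord($ch)), 8, '0', STR_PAD_LEFT);
    }
    return implode(' ', $parts);
}

echo utf8Bits("\xE2\x84\x96"), "\n"; // prints 11100010 10000100 10010110

// 3-byte case: 4 bits from the first byte, 6 from each following byte.
$ucs = ((0xE2 & 0x0F) << 12) | ((0x84 & 0x3F) << 6) | (0x96 & 0x3F);
printf("0x%04X\n", $ucs);            // prints 0x2116
```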

In fact, it is faster to write a program (I find C easier, but you are interested in PHP) than to explain it in words.

For a 2-byte sequence in str[], take the low 5 bits of the first byte into b1 and the low 6 bits of the second byte into b2:

 int b1, b2, ucs;
 b1 = str[0] & 0x1f;      /* low 5 bits of the first byte */
 b2 = str[1] & 0x3f;      /* low 6 bits of the second byte */
 ucs = (b1 << 6) | b2;    /* 11-bit UCS code */

Or, if PHP has no bitwise operations (I hope it at least has a remainder operator), then:

 /* assumes str[] holds unsigned byte values (0..255) */
 b1 = str[0] % 32;
 b2 = str[1] % 64;
 ucs = b1*64 + b2;
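A PHP rendering of the two C fragments, as a sketch: PHP does have the bitwise operators `&`, `|` and `<<`, and `ord()` gives the numeric value of a single byte of a string. The variable names follow the C code; `$ucs2` is introduced here only to hold the arithmetic variant.

```php
<?php
$str = "\xD0\x90";            // UTF-8 bytes of the Russian letter А

$b1  = ord($str[0]) & 0x1F;   // low 5 bits of the first byte
$b2  = ord($str[1]) & 0x3F;   // low 6 bits of the second byte
$ucs = ($b1 << 6) | $b2;

// Same result using only remainder, multiplication and addition.
$ucs2 = (ord($str[0]) % 32) * 64 + (ord($str[1]) % 64);

printf("0x%X 0x%X\n", $ucs, $ucs2); // prints 0x410 0x410
```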

Mr Trololo · Answer 2 · 2013-02-06T14:13:46

Believe me, it is easier to use a "ready-made function" for this: mb_convert_encoding or iconv. But if you still want to suffer, study these references: one, two, three, four.
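For comparison, a minimal sketch of the ready-made route. It assumes the iconv and mbstring extensions are available (hence the `function_exists` guards) and that the encoding names `CP1251` and `Windows-1251` are recognised by the local builds, which is usual but depends on the platform.

```php
<?php
$utf8 = "\xD0\x90";  // the Russian letter А in UTF-8

if (function_exists('iconv')) {
    // iconv() returns false if the conversion fails.
    $viaIconv = iconv('UTF-8', 'CP1251', $utf8);
    var_dump(bin2hex($viaIconv));
}

if (function_exists('mb_convert_encoding')) {
    // Argument order: string, target encoding, source encoding.
    $viaMb = mb_convert_encoding($utf8, 'Windows-1251', 'UTF-8');
    var_dump(bin2hex($viaMb));
}
```

Both calls should yield the single byte 0xC0 for this input, matching the 0x410 - 848 calculation.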

Denis Kotlyarov · Answer 3 · 2016-07-14T18:45:30

The question is old, of course, but people rush in to write who knows what. PHP has a standard function for converting a string into an array of bytes and back.

 unpack('C*', $buffer) returns an array of ints holding the plain byte values (PHP has no byte arrays); pack('C*', ...$ta) is the inverse function (pack takes the values as separate arguments, so the array has to be unpacked).
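A short round trip with these two functions, as a sketch. Note that `unpack('C*', ...)` returns an array whose keys start at 1, not 0; `array_values()` is used here only to make the argument unpacking explicit about order.

```php
<?php
$buffer = "\xD0\x90";                 // the Russian letter А in UTF-8

$bytes = unpack('C*', $buffer);       // [1 => 208, 2 => 144]

// Combine the two bytes into the UCS code, as in the accepted answer.
$ucs = (($bytes[1] & 0x1F) << 6) | ($bytes[2] & 0x3F);

// pack() takes the byte values as separate arguments.
$back = pack('C*', ...array_values($bytes));

printf("0x%X %s\n", $ucs, $back === $buffer ? 'round trip ok' : 'mismatch');
// prints 0x410 round trip ok
```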
