Please tell me how I can convert a two-byte character from one encoding to another, from utf-8 to cp1251, in PHP. That is, let's say I want to convert the letter "г" from utf-8 to cp1251. As I understand it, I get two characters, "Р" and "і", i.e. two bytes. How can I now combine them, or convert them to a decimal representation, and then perform some operation, for example subtract 848, to get the same character's code in cp1251?

I am interested in the method itself, not in ready-made functions.

  • Maybe iconv? - alexlz

3 answers

@platedz, whether it is PHP or not makes no difference here.

The utf-8 characters must first be decoded into ucs codes, and those are then (where possible) mapped to cp1251. Naturally, not every ucs code (for example: accented latin-1 letters, pseudographics, hieroglyphs, etc.) can be represented in cp1251.
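The second step (ucs to cp1251) can be sketched as below. This is only a partial illustration: `ucs_to_cp1251` is a hypothetical helper name, and it covers just ASCII, the contiguous range А..я (where the offset is the 848 from the question), and three single characters. A real converter would need a lookup table for the rest of the cp1251 code page.

```c
#include <stdint.h>

/* Hypothetical helper: map a ucs code to its cp1251 byte value,
   or return -1 if this sketch does not cover the character. */
int ucs_to_cp1251(uint32_t ucs)
{
    if (ucs < 0x80)                      /* ASCII is identical in cp1251 */
        return (int)ucs;
    if (ucs >= 0x410 && ucs <= 0x44F)    /* А..я map to 0xC0..0xFF */
        return (int)(ucs - 848);
    switch (ucs) {
    case 0x401:  return 0xA8;            /* Ё */
    case 0x451:  return 0xB8;            /* ё */
    case 0x2116: return 0xB9;            /* № */
    default:     return -1;              /* not covered by this sketch */
    }
}
```

For example, "г" has ucs code 0x433 (1075), and 1075 - 848 = 227 = 0xE3, which is its cp1251 byte.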

How to convert utf-8 to ucs:

We look at the first (most significant) bit of the utf-8 byte. If it is 0, the ucs code equals the value of this byte (this is ASCII).

If the first two bits of a leading byte are 10, or the byte value is 0xff or 0xfe, the utf-8 is malformed.

Now look at the high bits of the leading byte: a run of 1s followed by a single 0. The number of 1s equals the number of utf-8 bytes that encode the ucs code. The remaining bits of the leading byte are the high bits of the ucs code. Every following byte of the character must begin with 10, and its remaining 6 bits carry the next part of the ucs code.

All Cyrillic letters that cp1251 covers are encoded with 2 utf-8 bytes. For example, the Russian А (ucs code 0x410) is the 2 bytes 0xd0 0x90 in utf-8:

    1101 0000 1001 0000
    write it like this (the leading 110 means there are 2 utf-8 bytes in total)
    110 10000 10 010000
    take 11 bits (5 from the first byte and 6 from the second) and form the ucs code from them
    10000010000
    or, split into nibbles,
    100 0001 0000
    i.e. 0x410

Another example is the symbol №.

    № in utf-8 is 0xe2 0x84 0x96
    1110 0010 1000 0100 1001 0110
    0010 00 0100 01 0110
    0010 0001 0001 0110 == 0x2116
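The rules and the two worked examples above can be put together in a small decoder. This is a minimal sketch, not a hardened implementation: the name `utf8_decode` and its signature are my own choice, it rejects every leading byte from 0xf8 up (not only 0xfe and 0xff), and it does not check for overlong encodings or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one utf-8 sequence starting at s (n bytes available).
   Stores the ucs code in *ucs and returns the number of bytes consumed,
   or 0 if the input is malformed or truncated. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *ucs)
{
    size_t len, i;
    uint32_t cp;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                    /* 0xxxxxxx: ASCII */
        *ucs = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {          /* 110xxxxx: 2 bytes, 5 payload bits */
        len = 2; cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3 bytes, 4 payload bits */
        len = 3; cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4 bytes, 3 payload bits */
        len = 4; cp = s[0] & 0x07;
    } else {
        return 0;                         /* stray 10xxxxxx, or 0xf8..0xff */
    }
    if (n < len)
        return 0;                         /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)        /* continuation must be 10xxxxxx */
            return 0;
        cp = (cp << 6) | (s[i] & 0x3F);   /* append 6 payload bits */
    }
    *ucs = cp;
    return len;
}
```

Fed the bytes 0xd0 0x90 it should yield 0x410, and for 0xe2 0x84 0x96 it should yield 0x2116, matching the bit-by-bit walk-throughs.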

In fact, it is faster to write a program (I find C easier, but you are interested in PHP) than to explain it in words.

For a 2-byte sequence in str[], take the low 5 bits of the first byte into b1 and the low 6 bits of the second byte into b2:

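    int b1, b2, ucs;
    b1 = str[0] & 0x1f;      /* low 5 bits of the leading byte */
    b2 = str[1] & 0x3f;      /* low 6 bits of the continuation byte */
    ucs = (b1 << 6) | b2;    /* 11-bit ucs code */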

or, if there are no bitwise operations in PHP (I hope there is at least a remainder operator), then

    /* str[] must hold unsigned byte values (0..255) for % to work here */
    b1 = str[0] % 32;
    b2 = str[1] % 64;
    ucs = b1*64 + b2;
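To tie this back to the question, here is a rough sketch of a whole-string conversion that uses the same two-byte arithmetic and then subtracts 848. The function name `utf8_to_cp1251_basic` is made up for illustration, and the sketch is deliberately limited: it handles only ASCII and the letters А..я, and writes '?' for everything else (including Ё, ё, № and any malformed input).

```c
#include <stddef.h>

/* Convert utf-8 text to cp1251, handling only ASCII and the two-byte
   range А..я (0x410..0x44F); anything else becomes '?'.
   dst must have room for at least n bytes. Returns bytes written. */
size_t utf8_to_cp1251_basic(const unsigned char *src, size_t n,
                            unsigned char *dst)
{
    size_t i = 0, out = 0;

    while (i < n) {
        unsigned char c = src[i];
        if (c < 0x80) {                              /* ASCII: copy as is */
            dst[out++] = c;
            i += 1;
        } else if ((c & 0xE0) == 0xC0 && i + 1 < n &&
                   (src[i + 1] & 0xC0) == 0x80) {    /* two-byte sequence */
            unsigned ucs = ((c & 0x1Fu) << 6) | (src[i + 1] & 0x3Fu);
            dst[out++] = (ucs >= 0x410 && ucs <= 0x44F)
                             ? (unsigned char)(ucs - 848)
                             : '?';
            i += 2;
        } else {                                     /* longer or malformed */
            dst[out++] = '?';
            i += 1;
            while (i < n && (src[i] & 0xC0) == 0x80) /* skip continuation bytes */
                i++;
        }
    }
    return out;
}
```

For the letter "г" (bytes 0xd0 0xb3) this gives (16 << 6) | 51 = 0x433 = 1075, and 1075 - 848 = 227 = 0xe3, the cp1251 code the question was after.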

Believe me, it is easier to use a "ready-made function" for this: mb_convert_encoding or iconv. Well, if you still want to suffer, then study the information: one, two, three, four.

Of course the question is old, but many people will rush to write who knows what. PHP has a standard function for converting a string into an array of bytes and back.

unpack('C*', $buffer) gives you an array of ints holding the plain byte values (PHP has no byte arrays); pack('C*', ...$ta) is the inverse function (pack takes the values as separate arguments, so the array has to be unpacked with "...").