For example, the UTF-8 character Я is encoded as 1101 0000 1010 1111 (bytes 0xD0 0xAF), and after decoding, the code point is 100 0010 1111 (0x42F, i.e. U+042F). So who is responsible for turning this code point into a symbol and drawing it on the screen? That is, do Windows and Linux have Unicode tables in which this code point is looked up and then the corresponding glyph is drawn? On Linux everything is in UTF-8, while for Windows API functions do I need to pass UTF-16 with surrogate pairs?
2 answers
This answer is taken from the comments:
The code point is not translated any further. 100 0010 1111 (U+042F) is the code of the character Я, the one and only.
Some fonts contain the glyph for this character, some do not.
The font rendering engine is responsible for converting a code point into an image. Since fonts in modern systems are indexed by Unicode code points, the software that parses the string is expected to be able to deal with whatever encoding is actually used. And, by the way, it does not matter in principle whether a Unicode encoding is used or something else: text display software must be able to convert a string of bytes into a sequence of Unicode code points. What the string is in makes no difference: UTF-8, UTF-16, or some legacy 8-bit code page.
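To make that last point concrete, here is a minimal sketch (in C, my illustration rather than anything from the original answer) of how display software turns the UTF-8 byte string from the question into a Unicode code point before the glyph lookup; real code would handle 1- to 4-byte sequences and invalid input:

```c
#include <stdio.h>

/* Minimal sketch: decode one 2-byte UTF-8 sequence into a code point.
 * Real text-rendering code handles 1..4-byte sequences and error cases. */
int main(void) {
    const unsigned char s[] = { 0xD0, 0xAF };           /* "Я" in UTF-8 */
    /* 110xxxxx 10yyyyyy  ->  xxxxxyyyyyy */
    unsigned int cp = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
    printf("U+%04X\n", cp);                             /* prints U+042F */
    return 0;
}
```

Only after this step does the renderer look up the glyph for U+042F in the font.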
"On Linux, everything is in UTF-8" is a myth. Undoubtedly, the kernel supports UTF-8 in many situations, but inside the core text console software, the characters are processed as 16-bit Unicode code points, i.e. in UCS-2, and they are usually stored (see /dev/vcsa n ) - in 8 or 9-bit code pages. File names? I don’t know where in the kernel and since when UTF-8 has been used for them; back in 2.x it was possible to specify conversion to different code pages when mounting FAT and SMB. As in modern Linux - add experts, if you know. And the application software in tty can work in any encoding for which it was not too lazy to create a locale.
Using UTF-16LE in the Windows kernel speeds things up when displaying characters of the Basic Multilingual Plane (BMP; code points below 0x10000), since in that case a 16-bit UTF-16 code unit is identical to the code point. When Windows stumbles upon a surrogate pair, then yes, it is converted into a 32-bit code and the font lookup is performed on that.
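A small sketch of the conversion described here (my illustration, not from the answer): a BMP code unit already is the code point, while a surrogate pair is combined into a 32-bit value before the font lookup. The emoji value U+1F600 is just an example of a character outside the BMP.

```c
#include <stdio.h>

/* Combine UTF-16 code units into a code point: BMP units pass through,
 * a high/low surrogate pair is folded into one value above 0xFFFF. */
static unsigned int from_utf16(unsigned short hi, unsigned short lo) {
    if (hi >= 0xD800 && hi <= 0xDBFF)                    /* high surrogate */
        return 0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u);
    return hi;                                           /* BMP character  */
}

int main(void) {
    printf("U+%04X\n", from_utf16(0x042F, 0));           /* Я, BMP: U+042F */
    printf("U+%04X\n", from_utf16(0xD83D, 0xDE00));      /* pair: U+1F600  */
    return 0;
}
```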
- As far as I know, file names (at least on an FS like ext4) may contain any bytes (except, of course, / and 0x00), and nothing converts encodings between open()/fopen() and what ends up in the directory on disk. But names on mounted "foreign" file systems do get converted to UTF-8 for me (I suspect it depends on the locale; can anyone check? I am too lazy to experiment). - avp
- @avp: I doubt it is the locale, because the locale does not recode file names on their way to system calls (in other words, whatever byte string goes into open() is exactly what reaches the kernel). And the locale does not affect the kernel at all, since it is implemented in userspace. Some file systems (including the ones I mentioned) have -o iocharset= and -o codepage= options when mounting. - Incnis Mrsi
- Right now I have (mount output) ushare on /media/sf_ushare type vboxsf (gid=1001,rw), shared between Windows and Linux via VirtualBox. Should I assume that UTF-8 is hard-wired into VBoxAdditions (apparently in the open() implementation for this particular FS in the VFS)? But in principle locale data (such as the environment variable LANG=en_US.UTF-8) from userspace is accessible inside the kernel (to what extent one may use it there is another question). - avp
- @avp: If I understand what this is about, the host system mounts the virtual Windows file system, acting as a file server. Since in Windows NT+ file names are encoded in UTF-16, it is logical to convert them to UTF-8 rather than to some restrictive 8-bit encoding, to avoid inaccessible files and other troubles. Yes, the kernel can peek at the locale. But its behaviour should be based not on the whim of a system programmer but on semantics. It so happens that open() on UNIX systems takes a string of bytes, not characters. Changing this for all applications is unrealistic, and doing it piecemeal means glitches and security holes. - Incnis Mrsi
- Passing bytes through as-is is fine. But the fact that I cannot put a file with a name that is not in UTF-8 (for example \377.txt, which is я.txt in cp1251) onto the Windows side, because such a name cannot be recoded into Windows' UTF-16, is not normal. However, I hope that soon, after the blunder with the spying Windows 10, Windows will be something we only remember. - avp
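To illustrate the point in the comments above (a sketch under the assumptions stated there, not something from the thread itself): open() takes a plain byte string, so on ext4 a name like \377.txt is accepted as-is, while on a mount that must recode names to UTF-16 (vboxsf, NTFS, ...) the same call may fail because 0xFF is not valid UTF-8.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Try to create a file whose name is a raw byte string that is not
 * valid UTF-8. On ext4 this succeeds; on a mount that recodes names
 * to UTF-16 the call may fail (e.g. with EINVAL). */
int main(void) {
    const char name[] = "\377.txt";        /* 0xFF == я in cp1251 */
    int fd = open(name, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        perror("open");
    else {
        printf("created %s\n", name);
        close(fd);
        unlink(name);                      /* clean up the test file */
    }
    return 0;
}
```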
100 0010 1111 (U+042F) is the code of the character Я, the one and only. Some fonts contain a glyph for this character, some do not. - Egor Skriptunoff