Determining the encoding of a string. By any means

Question

System.out.println(tag.getFirst(FieldKey.ALBUM));

I get a string and print it to the screen: ??????? ???? ?????? ??????? ???? ??????

By trial and error I found that this is UTF-16 (probably?!)

 System.out.println(new String(tag.getFirst(FieldKey.ALBUM).getBytes("UTF-16"), "windows-1251"));

It printed: юя С н а ч а л а Б ы л о С о л н ц е ach a юя С н а ч а л а Б ы л о С о л н ц е . Note: any other text can be copied, but this text from the console cannot be pasted here.

The first question: what is the "юя" at the beginning of the line? On the phone, the track's tags do not contain these letters.

I wrote this code:

 byte[] masByte = tag.getFirst(FieldKey.ALBUM).getBytes("UTF-16");
 for (int i = 0; i < masByte.length; i++) {
     System.out.print(masByte[i]);
 }

Displays: -2-10-470-190-320-90-320-210-320320-630-50-210-180320-470-180-210-190-100-27

The main question: is there a library or an algorithm for detecting encodings? P.S. It should detect any popular encoding.

Please clarify: which library do you use to extract tags from the file?
Replace the byte output loop with System.out.println(Arrays.toString(masByte));

VladD · Answer 1 · 2015-10-02T11:42:50

There is no such thing as a "string encoding".

There is

either a string,
or a set of bytes that represents that string, plus an encoding.

The problem of determining the encoding of a set of bytes cannot be solved correctly in general. You can run a frequency analysis, but that is a terrible, ugly solution that does not work (except in easy cases).
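For illustration only, here is a minimal sketch of what such a frequency heuristic might look like. It is not a real detector: the class NaiveCharsetGuess and its methods are hypothetical names, and the scoring rule (count how many decoded characters are among the six most frequent Russian letters о, е, а, и, н, т) is an assumption chosen for brevity. The letters are written as Unicode escapes so the source stays ASCII-only.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

class NaiveCharsetGuess {
    // The six most frequent Russian letters in lowercase (U+043E, U+0435, U+0430, U+0438, U+043D, U+0442).
    static final String COMMON = "\u043e\u0435\u0430\u0438\u043d\u0442";

    // Decodes the bytes with the given charset and counts the frequent letters in the result.
    static int score(byte[] data, Charset cs) {
        int hits = 0;
        for (char c : new String(data, cs).toCharArray()) {
            if (COMMON.indexOf(c) >= 0) {
                hits++;
            }
        }
        return hits;
    }

    // Returns the candidate with the highest score; ties go to the earlier candidate.
    static Charset guess(byte[] data, List<Charset> candidates) {
        Charset best = candidates.get(0);
        for (Charset cs : candidates) {
            if (score(data, cs) > score(data, best)) {
                best = cs;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The non-zero bytes from the dump in the question, without the leading -2 -1.
        byte[] title = {-47, -19, -32, -9, -32, -21, -32, 32, -63, -5, -21, -18,
                        32, -47, -18, -21, -19, -10, -27};
        List<Charset> candidates = Arrays.asList(
                Charset.forName("windows-1251"), Charset.forName("KOI8-R"));
        System.out.println(guess(title, candidates));
    }
}
```

On this input the windows-1251 decoding scores higher than the KOI8-R one, because the same bytes read as KOI8-R come out mostly as uppercase letters. With only two candidates and clean single-language text this is one of the easy cases; it says nothing about arbitrary input.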

If tag.getFirst returns a string, it must be a valid string. If it is not, take it up with the developers of the library.

"does not work (except in easy cases)": rather the opposite, it almost always works.
@Qwertiy: Because simple tests mostly hit the easy cases?
@VladD, and is an easy case one where the whole text is long and in a single encoding?
As far as I remember, the statistics of a language already show up within a few dozen characters of meaningful text.
For example, when the Russian text contains no English insertions.
Plus it is very difficult to tell English apart from other European languages this way, so auto-selecting the encoding can go wrong.
And what about English letters, digits, punctuation, and so on?
They are simply not taken into account when collecting the statistics.
But separating pieces of cp1251 from koi8-r (or choosing between them for a short text) is very difficult.

Qwertiy ♦ · Answer 2 · 2015-10-02T12:05:15

юя is the byte order mark: -2 -1. The gap between the letters is not a space but a character with code 0, simply because UTF-16 uses 2 bytes per character. But this is not genuine UTF-16 text: Russian letters have completely different codes there. This is Win1251 with zero bytes inserted between the bytes, output as UTF-16 with a BOM.
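A small sketch that replays this reading on the first bytes of the dump from the question. The class BomDemo and its helpers are hypothetical names introduced only for illustration; the byte values are copied from the dump, split into -2, -1 followed by pairs of 0 and one non-zero byte.

```java
import java.nio.charset.Charset;

class BomDemo {
    // Start of the dump from the question: BOM (-2, -1), then pairs of (0, Win1251 byte).
    static final byte[] DUMP = {-2, -1, 0, -47, 0, -19, 0, -32, 0, -9, 0, -32, 0, -21, 0, -32};

    // What the second println in the question effectively does: view the bytes as windows-1251.
    static String asCp1251(byte[] bytes) {
        return new String(bytes, Charset.forName("windows-1251"));
    }

    // Skips the 2-byte BOM and keeps only the second (low) byte of every 2-byte unit.
    static byte[] lowBytes(byte[] utf16WithBom) {
        byte[] out = new byte[(utf16WithBom.length - 2) / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = utf16WithBom[2 + 2 * i + 1];
        }
        return out;
    }

    public static void main(String[] args) {
        String shown = asCp1251(DUMP);
        // The BOM bytes 0xFE 0xFF are the letters U+044E U+044F in windows-1251.
        System.out.println(shown.startsWith("\u044e\u044f"));
        // Every second character of the rest is NUL (code 0), not a space.
        System.out.println((int) shown.charAt(2));
        // The low bytes alone decode to the first word of the album title.
        System.out.println(asCp1251(lowBytes(DUMP)));
    }
}
```

The first two lines print true and 0. What the last line looks like depends on the console encoding, which is the same problem the question started with.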

Bytes should be handled in byte arrays, and strings in strings. Stuffing bytes into strings and then trying to do something with them is a bad idea. And nobody can figure out for you which specific workaround your particular case needs.
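As a hedged example of such a case-specific workaround: the dump in the question is consistent with the library having decoded Win1251 bytes as ISO-8859-1, so that every byte became one character with the same code. That is an assumption, not something the thread confirms. Under that assumption the original bytes can be recovered without a BOM or zero bytes and decoded again. TagRepair and repair are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TagRepair {
    // Undoes a wrong ISO-8859-1 decode of windows-1251 bytes (assumed cause, see above).
    // Only meaningful if every char is <= U+00FF; other chars are replaced with '?'.
    static String repair(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1); // one byte per char, no BOM
        return new String(raw, Charset.forName("windows-1251"));
    }

    public static void main(String[] args) {
        // The chars U+00D1 U+00ED U+00E0 U+00F7 U+00E0 U+00EB U+00E0, i.e. the
        // non-zero bytes at the start of the question's dump taken as char codes.
        String fromLibrary = "\u00d1\u00ed\u00e0\u00f7\u00e0\u00eb\u00e0";
        System.out.println(repair(fromLibrary));
    }
}
```

In the question's code this would correspond to calling getBytes("ISO-8859-1") instead of getBytes("UTF-16"). It fixes only this one kind of damage; a tag that was stored in a different encoding would need a different pair of charsets.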
