System.out.println(tag.getFirst(FieldKey.ALBUM)); 

I get a string and print it to the screen: ??????? ???? ??????

By trial and error I found that this is probably UTF-16:

 System.out.println(new String(tag.getFirst(FieldKey.ALBUM).getBytes("UTF-16"), "windows-1251")); 

This prints: юя С н а ч а л а Б ы л о С о л н ц е. Note: any other text copies fine, but this output from the console cannot be pasted here as is.

First question: what is the "юя" at the beginning of the line? On the phone, the track's tags show no such letters.

I wrote this code:

 byte masByte[] = tag.getFirst(FieldKey.ALBUM).getBytes("UTF-16"); for (int i = 0; i < masByte.length; i++) { System.out.print(masByte[i]); } 

It prints: -2-10-470-190-320-90-320-210-320320-630-50-210-180320-470-180-210-190-100-27

The main question: is there a library or an algorithm for detecting encodings? P.S. It should be able to detect any popular encoding.

  • Please clarify: which library do you use to extract tags from the file? - Sergey Mitrofanov
  • Replace the byte-printing loop with System.out.println(Arrays.toString(masByte)); it is shorter and more readable. As it is, your bytes are stuck together and it is unclear where one ends and the next begins. - Sergey Mitrofanov
  • "System.out.print(masByte[i]);" - and who is going to put spaces between them? -2 -1 0 -47 0 -19 0 ... - Qwertiy
  • The Jaudiotagger library - Eugene
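
The suggestion in the comments above can be sketched as follows. The byte values are the first six from the question's output, hard-coded here as an assumption so that the snippet runs without a tag file; `BytePrinter` and `format` are illustrative names, not part of any library.

```java
import java.util.Arrays;

public class BytePrinter {
    // Formats bytes as a readable, comma-separated list of signed values.
    public static String format(byte[] bytes) {
        return Arrays.toString(bytes);
    }

    public static void main(String[] args) {
        // First six bytes from the question's output (hard-coded for illustration).
        byte[] masByte = {-2, -1, 0, -47, 0, -19};
        System.out.println(format(masByte)); // prints [-2, -1, 0, -47, 0, -19]
    }
}
```

With the separators in place, the pattern of a two-byte prefix followed by alternating zero bytes is visible at a glance.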

2 answers

There is no such thing as a "string encoding".

There is:

  • either a string
  • or a set of bytes that represents that string, plus the encoding.
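
A minimal sketch of that distinction, assuming a JVM that ships the windows-1251 charset (common, but not guaranteed by the Java specification). The sample word and the names `BytesVsString` and `encodedLength` are illustrative only; the word is written with Unicode escapes to keep the source ASCII-only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BytesVsString {
    // Encodes the string with the given charset and returns the number of bytes produced.
    public static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u0421\u043E\u043B\u043D\u0446\u0435"; // six Cyrillic letters
        Charset cp1251 = Charset.forName("windows-1251");

        System.out.println(encodedLength(text, cp1251));                    // 6: one byte per letter
        System.out.println(encodedLength(text, StandardCharsets.UTF_8));    // 12: two bytes per letter
        System.out.println(encodedLength(text, StandardCharsets.UTF_16BE)); // 12: two bytes per letter, no BOM

        // Bytes turn back into the same string only with the charset that produced them.
        byte[] raw = text.getBytes(cp1251);
        System.out.println(new String(raw, cp1251).equals(text));                      // true
        System.out.println(new String(raw, StandardCharsets.ISO_8859_1).equals(text)); // false
    }
}
```

The string itself never changes; only its byte representation does, which is why an encoding is a property of the bytes and not of the string.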

Determining the encoding of a set of bytes cannot be done reliably in the general case. You can do a frequency analysis, but that is a terrible, ugly solution, not working (except in easy cases).

If tag.getFirst returns a string, it must be a valid string. If it is not, take it up with the developers of the library.

  • "not working (except in easy cases)" - rather the opposite: it almost always works. - Qwertiy
  • @Qwertiy: Because simple tests mostly hit the easy cases? :-P - VladD
  • @VladD, and is an easy case one where the whole text is long and in a single encoding? As far as I remember, a language's statistics already show up within a few dozen characters of meaningful text. - avp
  • @avp: Yeah. For example, when the Russian text has no English inserts. Plus it is very hard to tell English apart from other European languages this way, so auto-detecting the encoding may fail. - VladD
  • And how can English letters, digits, punctuation and so on get in the way? They are simply ignored when collecting statistics. But telling cp1251 apart from koi8-r (or choosing between them for a short text) is very difficult. - avp 4:26 pm
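
The frequency analysis debated in these comments could look roughly like the sketch below. It is a deliberately naive illustration, not a recommended detector: it decodes the bytes with each candidate charset and counts hits among six of the most frequent Russian letters (lowercase only). The class and method names are made up for this example, and it assumes the JVM provides the windows-1251 and KOI8-R charsets.

```java
import java.nio.charset.Charset;

public class NaiveCharsetGuess {
    // Six of the most frequent Russian letters, lowercase only (escaped to keep the source ASCII).
    private static final String FREQUENT = "\u043E\u0435\u0430\u0438\u043D\u0442";

    // Counts decoded characters that are frequent Russian letters.
    // Latin letters, digits and punctuation never match, so they are simply ignored.
    public static int score(byte[] data, Charset charset) {
        String decoded = new String(data, charset);
        int hits = 0;
        for (int i = 0; i < decoded.length(); i++) {
            if (FREQUENT.indexOf(decoded.charAt(i)) >= 0) {
                hits++;
            }
        }
        return hits;
    }

    // Returns the candidate with the highest score; ties go to the earlier candidate.
    public static Charset guess(byte[] data, Charset... candidates) {
        Charset best = candidates[0];
        int bestScore = score(data, best);
        for (int i = 1; i < candidates.length; i++) {
            int current = score(data, candidates[i]);
            if (current > bestScore) {
                best = candidates[i];
                bestScore = current;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        Charset koi8r = Charset.forName("KOI8-R");
        // The album title from the question, written with Unicode escapes.
        String title = "\u0421\u043D\u0430\u0447\u0430\u043B\u0430 \u0411\u044B\u043B\u043E "
                + "\u0421\u043E\u043B\u043D\u0446\u0435";
        System.out.println(guess(title.getBytes(koi8r), cp1251, koi8r)); // KOI8-R
    }
}
```

It works on this sample only because cp1251 and koi8-r place lowercase and uppercase letters in opposite halves of the upper byte range, so the wrong decoding yields mostly uppercase letters. An all-caps or very short text would defeat it, which is the kind of failure the comments argue about.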

юя is the byte order mark: -2 -1. The gap between the letters is not a space but a character with code 0, simply because UTF-16 uses 2 bytes per character. But this is not genuine UTF-16 text: there Russian letters have completely different codes. This is Win1251 with zeros inserted between the bytes, presented as UTF-16 with a BOM.
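
The effect can be reproduced, and reversed, in a few lines. This is a sketch under an assumption: that the library decoded windows-1251 tag bytes as ISO-8859-1. The byte dump in the question is consistent with that, but does not prove it. `MojibakeDemo`, `corrupt` and `repair` are illustrative names, and the text is written with Unicode escapes to keep the source ASCII-only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MojibakeDemo {
    private static final Charset CP1251 = Charset.forName("windows-1251");

    // Simulates the suspected fault: windows-1251 bytes decoded as if they were ISO-8859-1.
    public static String corrupt(String original) {
        return new String(original.getBytes(CP1251), StandardCharsets.ISO_8859_1);
    }

    // Reverses it: recover the raw bytes one-to-one, then decode them as windows-1251.
    public static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), CP1251);
    }

    public static void main(String[] args) {
        String firstLetters = "\u0421\u043D\u0430"; // first three letters of the album title
        String garbled = corrupt(firstLetters);

        // Encoding to UTF-16 prepends the BOM (FE FF), then writes two bytes per char.
        byte[] utf16 = garbled.getBytes(StandardCharsets.UTF_16);
        System.out.println(Arrays.toString(utf16)); // [-2, -1, 0, -47, 0, -19, 0, -32]

        System.out.println(repair(garbled).equals(firstLetters)); // true
    }
}
```

The printed bytes match the start of the dump in the question. Interpreted as windows-1251, FE FF reads as "юя" and each zero byte becomes an invisible character between the letters. Note that `repair` is a workaround for one specific fault; it would damage a string that was decoded correctly in the first place.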

Bytes should be handled in byte arrays, and strings in strings. Stuffing bytes into strings and then trying to do something with them is bad. And nobody will figure out for you which specific workaround you need to write for your particular case.