There is a string "ЇаЁўҐўҐ". How do I read it? That is, it needs to be converted from one encoding to another; how do I do that in C#?


Addition: to answer the question of why this is needed and where such strings come from.
If you take old floppy disks with files created in MS-DOS in the last century, the file names look roughly like “aYo.txt” when viewed in Windows.


3 answers

Specify the encoding when reading the contents of the file. That is, to read from 866 ("transcoding" while reading), it is enough to specify the Encoding:

    File.WriteAllText("test.txt", "тест!", Encoding.GetEncoding(866));
    var text = File.ReadAllText("test.txt", Encoding.GetEncoding(866));
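One practical detail, as a hedged sketch: on .NET Core and .NET 5+ the legacy code pages such as 866 are not available until the code-pages provider is registered, while classic .NET Framework ships them built in. The class name `Cp866Reader` and the method `ReadDosText` below are hypothetical names used only for illustration.

```csharp
using System.IO;
using System.Text;

public static class Cp866Reader
{
    // Hypothetical helper: reads a text file as code page 866 (DOS Cyrillic).
    public static string ReadDosText(string path)
    {
        // Assumption: running on .NET Core / .NET 5+, where GetEncoding(866)
        // is not supported until this provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return File.ReadAllText(path, Encoding.GetEncoding(866));
    }
}
```

Calling `Cp866Reader.ReadDosText("test.txt")` on the file written above should give back the original text with no conversion step.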

If you have a special case, for example you received the already corrupted text as a string, then it is enough to turn it back into bytes using the wrong encoding and then read those bytes using the correct one:

    static void Main(string[] args)
    {
        string bad = "ЇаЁўҐв";
        string good = Convert(bad, 1251, 866);
    }

    static string Convert(string source, int from, int to)
    {
        // Recover the original bytes by encoding with the wrong code page...
        byte[] bytes = Encoding.GetEncoding(from).GetBytes(source);
        // ...then decode them with the correct one.
        return Encoding.GetEncoding(to).GetString(bytes);
    }

Admittedly, this only works if reading the bytes in the wrong encoding turns out (by a happy coincidence!) to be reversible. Below is an example where it is not.
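The reversibility condition can be checked before trusting the result. The sketch below is one possible way to do it, not part of the original answer: it encodes the garbled string with the wrong code page, decodes it again with the same code page, and only repairs the text if nothing was lost. `MojibakeRepair` and `TryRepair` are hypothetical names, and the provider registration assumes .NET Core / .NET 5+.

```csharp
using System.Text;

public static class MojibakeRepair
{
    // Hypothetical helper: repairs text that was decoded with wrongCodePage
    // but was really encoded in rightCodePage. Returns false when the garbled
    // string cannot be mapped back to bytes without loss.
    public static bool TryRepair(string garbled, int wrongCodePage,
                                 int rightCodePage, out string repaired)
    {
        // Assumption: .NET Core / .NET 5+, where legacy code pages need this.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding wrong = Encoding.GetEncoding(wrongCodePage);
        Encoding right = Encoding.GetEncoding(rightCodePage);

        byte[] bytes = wrong.GetBytes(garbled);
        // If characters were replaced during encoding (for example by '?'),
        // decoding the bytes again will not reproduce the input.
        bool lossless = wrong.GetString(bytes) == garbled;
        repaired = lossless ? right.GetString(bytes) : string.Empty;
        return lossless;
    }
}
```

For the string from the example, `TryRepair("ЇаЁўҐв", 1251, 866, out var fixedText)` succeeds; a string that already contains the replacement character U+FFFD fails the check, because that character has no byte in code page 1251.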


Regarding "transcoding":

You are trying to fix the consequences, not the problem itself.

How does this problem arise:

  1. You have an old file encoded in 866.
  2. You read it into a string without specifying an encoding. The system does not find a BOM and reads the file as UTF-8 (File.ReadAllText falls back to UTF-8 when there is no BOM, not to Encoding.Default).
  3. You then try to "transcode" the string you have read.

Example:

    // Create an "old" file with content in 866
    File.WriteAllText("test.txt", "тест!", Encoding.GetEncoding(866));
    // Open it without specifying the encoding and see mojibake:
    Console.WriteLine(File.ReadAllText("test.txt"));

The solution you are trying to apply is "convert the string". That is, you hope the following code works:

    static void Main(string[] args)
    {
        // Create an "old" file with content in 866
        File.WriteAllText("test.txt", "тест!", Encoding.GetEncoding(866));
        // Open it without specifying the encoding and see mojibake:
        var text = File.ReadAllText("test.txt");
        Console.WriteLine(text);
        text = Convert(text, 866, 1251);
        Console.WriteLine(text);
    }

    static string Convert(string source, int from, int to)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(source);
        byte[] newBytes = Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding(from), bytes);
        string newStr = Encoding.GetEncoding(to).GetString(newBytes);
        return newStr;
    }

This solution has a weak point: it assumes that strings in .NET are just a kind of byte array. That is, no matter how the string was read, it can supposedly be converted back into the same bytes it was read from. In fact, this is not so. The example above does not work.
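To illustrate why a .NET string is not a stored copy of the file's bytes, here is a minimal sketch (the names `StringVsBytes` and `ByteCounts` are hypothetical, and the provider registration assumes .NET Core / .NET 5+). The same string produces a different number of bytes depending on which encoding is asked to serialize it:

```csharp
using System.Text;

public static class StringVsBytes
{
    // Returns the size of `text` in bytes when encoded as
    // code page 866, UTF-8 and UTF-16, in that order.
    public static int[] ByteCounts(string text)
    {
        // Assumption: .NET Core / .NET 5+, where legacy code pages need this.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return new[]
        {
            Encoding.GetEncoding(866).GetByteCount(text),
            Encoding.UTF8.GetByteCount(text),
            Encoding.Unicode.GetByteCount(text)
        };
    }
}
```

For "тест!" this gives 5, 9 and 10 bytes respectively: one byte per character in 866, two bytes per Cyrillic letter plus one for "!" in UTF-8, and two bytes per character in UTF-16. The string itself only remembers the characters, not which of these byte sequences it came from.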

If you guess the encoding of the file wrong when reading, you will not even be able to write the same content back.

    File.WriteAllText("test.txt", "тест!", Encoding.GetEncoding(866));
    var text = File.ReadAllText("test.txt");
    File.WriteAllText("test2.txt", text);

Suddenly, this code produces two different files, although there was no "transcoding".
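The reason the two files differ can be sketched at the byte level. This is an illustration under the assumption that the file is decoded as UTF-8 when no encoding is given; `LossyDecodeDemo` and `SurvivesUtf8RoundTrip` are hypothetical names. Bytes that do not form valid UTF-8 sequences are replaced with U+FFFD during decoding, so the original bytes are gone before anything is written back:

```csharp
using System.Linq;
using System.Text;

public static class LossyDecodeDemo
{
    // Decodes the bytes as UTF-8, encodes the resulting string as UTF-8 again,
    // and reports whether the original bytes came back unchanged.
    public static bool SurvivesUtf8RoundTrip(byte[] original)
    {
        // Invalid sequences become U+FFFD here; no exception is thrown.
        string text = Encoding.UTF8.GetString(original);
        byte[] written = Encoding.UTF8.GetBytes(text);
        return original.SequenceEqual(written);
    }
}
```

The cp866 bytes of "тест!" (E2 A5 E1 E2 21) are not valid UTF-8, so the check returns false for them, while plain ASCII bytes pass it.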


  • Well, if decoding in the "wrong" code page was lossless, it can be reversed. - VladD
  • @VladD thanks, added a note - PashaPash
  • Cool, thanks! - VladD
  • @PashaPash "File.ReadAllText("test.txt") ... Suddenly": the point is that you write in 866 and read as UTF8. See the source. - Stack
  • @Stack and the problem you described in the question is precisely this. Someone somewhere wrote a file in 866 (or another old encoding). It was then read with the wrong encoding specified (or with none at all, so it was read as UTF-8), and the result was mojibake. The essence of the answer: read in the correct encoding from the start by specifying it when reading, and do not try to fix the consequences by "conversion". - PashaPash

    string Convert(string source, int from, int to)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(source);
        byte[] newBytes = Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding(from), bytes);
        string newStr = Encoding.GetEncoding(to).GetString(newBytes);
        return newStr;
    }

Usage:

    string str = "привет";
    string result = Convert(str, 866, 1251);     // => ЇаЁўҐв
    string result2 = Convert(result, 1251, 866); // => привет
  • Converting to UTF8 is unnecessary in the Convert method. You can get the bytes in the desired encoding directly with Encoding.GetEncoding(from).GetBytes(source) and do without calling Encoding.Convert. - PashaPash

There is a string "ЇаЁўҐўҐ". How to read it?

It is your "привет" ("hi"). Code pages 1251 and 866 are both single-byte, both support Cyrillic, and for this range of codes a wrong interpretation is lossless in either direction (866-1251, 1251-866).

If you only need to read, you do not need to convert anything. It is enough to choose the correct code page for interpreting the text (as colleagues noted earlier, in your case it is cp866) and specify it when reading from a byte array or from a stream.
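A minimal sketch of both variants, assuming the data really is cp866 and the code runs on .NET Core / .NET 5+ (hence the provider registration). The names `Cp866Decoding`, `FromBytes` and `FromStream` are hypothetical:

```csharp
using System.IO;
using System.Text;

public static class Cp866Decoding
{
    private static Encoding Dos866()
    {
        // Assumption: .NET Core / .NET 5+, where legacy code pages need this.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(866);
    }

    // Interprets a byte array as cp866 text.
    public static string FromBytes(byte[] data)
    {
        return Dos866().GetString(data);
    }

    // Reads a whole stream as cp866 text.
    public static string FromStream(Stream stream)
    {
        using (var reader = new StreamReader(stream, Dos866()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With the bytes AF E0 A8 A2 A5 E2, both methods return "привет"; no intermediate string in another encoding is ever created.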

The options are listed above. Just do not use Convert, otherwise you will get the same thing back, since Convert maps each character to its counterpart in the other encoding rather than replacing characters.