I ran into the following problem: I have a file in ANSI encoding that I read line by line, but as far as I understand, in C# each line that is read is stored as Unicode, and I see garbage characters in it. Here is what I do (using the first line of the file as an example):

    StreamReader read = new StreamReader(@"D:\(path)");
    StreamWriter write = new StreamWriter(@"D:\(path)");
    Encoding ANSI = Encoding.GetEncoding(1252);
    Encoding UTF8 = Encoding.UTF8;
    byte[] utf8_bytes, ansi_bytes;
    utf8_bytes = UTF8.GetBytes(read.ReadLine());
    ansi_bytes = Encoding.Convert(UTF8, ANSI, utf8_bytes);
    string ansi_str = ANSI.GetString(ansi_bytes);
    write.WriteLine(ansi_str);
    read.Close();
    write.Close();

But for some reason this does not work: the new file still shows a lot of meaningless question marks. Thanks in advance.

  • By the way, you can set the encoding directly in the StreamReader / StreamWriter constructors, which makes everything much easier, but I would like to understand why my approach does not work. - nvse
  • Show a sample of the input file's text. I suspect it is Cyrillic (Windows-1251) while you are converting with 1252, which would explain the garbage in place of the Russian letters. What is the ultimate goal of these manipulations? - mantigatos
  • 1) Yes, I mixed them up: ANSI here is 1251, but Encoding ANSI = Encoding.GetEncoding(1251) does not work either. The file itself is an ANSI Russian dictionary (at least that is the encoding Notepad++ reports). 2) The final goal: there is a dictionary of synonyms (~20k words), and from each line I remove the definitions of the terms, leaving only the words themselves. - nvse
  • I just have a feeling that when the text is written to the file, it is automatically converted back to Unicode. - nvse
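
The constructor approach mentioned in the first comment could look roughly like the sketch below. It assumes the file really is Windows-1251, uses C# 9 top-level statements on .NET 5 or later, and substitutes hypothetical temp files and made-up sample words for the elided D:\(path) files.

```csharp
using System;
using System.IO;
using System.Text;

// On .NET Core / .NET 5+ the legacy Windows code pages must be registered first;
// on .NET Framework this line is unnecessary.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Assumption: the dictionary is Windows-1251 (Cyrillic), as the comments suggest.
Encoding ansi = Encoding.GetEncoding(1251);

// Hypothetical temp files stand in for the elided D:\(path) files.
string inputPath = Path.Combine(Path.GetTempPath(), "dictionary_in.txt");
string outputPath = Path.Combine(Path.GetTempPath(), "dictionary_out.txt");

// Create a small ANSI sample so the sketch is self-contained (made-up words).
File.WriteAllBytes(inputPath, ansi.GetBytes("слово\r\nсиноним\r\n"));

// Passing the encoding to both constructors makes the reader decode the ANSI
// bytes correctly and makes the writer encode the strings back to ANSI.
using (StreamReader reader = new StreamReader(inputPath, ansi))
using (StreamWriter writer = new StreamWriter(outputPath, false, ansi))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        writer.WriteLine(line);
    }
}
```

With both encodings set explicitly, no manual byte conversion is needed between reading and writing.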

2 answers

Take a closer look at the code:

  1. The text is read from the file as if it were UTF-8 (the default for StreamReader), even though the file is actually ANSI.
  2. The text is stored in a UTF-16LE string (the System.String class in .NET).
  3. The text is encoded to UTF-8 as a byte array.
  4. Those bytes are converted from UTF-8 to ANSI, yielding another byte array.
  5. The ANSI bytes are decoded back to UTF-16LE to be stored in a System.String.
  6. The text (now in a System.String) is written with StreamWriter, which writes UTF-8 by default.

In total, the text goes through the following transformations:

UTF-8 (StreamReader) -> UTF-16LE (System.String) -> UTF-8 (byte[]) -> ANSI (byte[]) -> UTF-16LE (System.String) -> UTF-8 (StreamWriter)

I think it is now obvious where the problem is.
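
The chain above can be reproduced in memory. This is only a sketch under assumptions: the file is taken to be Windows-1251, the sample word is made up, and the code uses C# 9 top-level statements on .NET 5 or later. Decoding the ANSI bytes as UTF-8 in the very first step already destroys the letters, so no later conversion can bring them back.

```csharp
using System;
using System.Text;

// On .NET Core / .NET 5+ the legacy Windows code pages must be registered first;
// on .NET Framework this line is unnecessary.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding ansi = Encoding.GetEncoding(1251);

// Bytes as they would sit in an ANSI (Windows-1251) file; the word is made up.
byte[] fileBytes = ansi.GetBytes("слово");

// Steps 1-2: a StreamReader without an explicit encoding decodes the bytes as
// UTF-8. These bytes are not valid UTF-8, so they turn into U+FFFD characters.
string misread = Encoding.UTF8.GetString(fileBytes);

// Steps 3-5: UTF-8 bytes -> ANSI bytes -> string. U+FFFD has no Windows-1251
// equivalent, so the encoder substitutes a fallback character (typically '?').
byte[] utf8Bytes = Encoding.UTF8.GetBytes(misread);
byte[] ansiBytes = Encoding.Convert(Encoding.UTF8, ansi, utf8Bytes);
string result = ansi.GetString(ansiBytes);

// Decoding the original bytes with the right encoding keeps the text intact.
string correct = ansi.GetString(fileBytes);

Console.WriteLine(misread.IndexOf('\uFFFD') >= 0); // replacement characters present
Console.WriteLine(result);
Console.WriteLine(correct);
```

The only lossy step is the first decode; everything after it faithfully carries the already damaged text.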

  • Yes, that is clear, thank you. The only question is: how do I convert the ANSI byte array to a string in that same encoding rather than UTF-16LE? As far as I understand, in my example ansi_str will hold a UTF-16LE string. - nvse
  • A System.String in .NET is always UTF-16LE encoded; the Encoding.GetString() method performs that conversion. - AlexeyM
  • Now I am confused: if System.String in .NET is always UTF-16LE and Encoding.GetString() returns a System.String, then do I need to write the byte array directly to the stream somehow? - nvse
  • Yes, exactly. Create a stream for writing to the file (FileStream) and use the Stream.Write() method. - AlexeyM
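
A minimal sketch of the FileStream suggestion from the last comment, assuming Windows-1251 as the target encoding and C# 9 top-level statements on .NET 5 or later. The output path and the sample word are hypothetical stand-ins, not taken from the question.

```csharp
using System;
using System.IO;
using System.Text;

// On .NET Core / .NET 5+ the legacy Windows code pages must be registered first;
// on .NET Framework this line is unnecessary.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding ansi = Encoding.GetEncoding(1251);

// Hypothetical output file standing in for the elided D:\(path).
string outputPath = Path.Combine(Path.GetTempPath(), "dictionary_bytes.txt");

// A .NET string (UTF-16 in memory); the sample word is made up.
string line = "слово";

// Encode the string into ANSI bytes, appending the line break manually
// because Stream.Write does not add one.
byte[] ansiBytes = ansi.GetBytes(line + Environment.NewLine);

// Write the raw bytes; no further encoding conversion happens here.
using (FileStream stream = new FileStream(outputPath, FileMode.Create, FileAccess.Write))
{
    stream.Write(ansiBytes, 0, ansiBytes.Length);
}
```

Because the bytes go to the stream untouched, the file ends up in whatever encoding produced the byte array.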
    StreamReader read = new StreamReader(@"D:\(path)");
    StreamWriter write = new StreamWriter(@"D:\(path)");
    Encoding ANSI = Encoding.GetEncoding(1251);
    Encoding UTF8 = Encoding.UTF8;
    byte[] utf8_bytes, ansi_bytes;
    utf8_bytes = UTF8.GetBytes(read.ReadLine());
    ansi_bytes = Encoding.Convert(UTF8, ANSI, utf8_bytes);
    string ansi_str = ANSI.GetString(ansi_bytes);
    write.WriteLine(ansi_str);
    read.Close();
    write.Close();

Try 1251 instead of 1252; then the Russian characters will be displayed correctly.