Encoding: UTF8 -> ANSI

Question

I ran into the following problem: I have a file in ANSI encoding and I read it line by line. As far as I understand, in C# each line that is read gets stored as Unicode, and I see garbled characters in it. Here is what I do (using the first line of the file as an example):

StreamReader read = new StreamReader(@"D:\(path)");
StreamWriter write = new StreamWriter(@"D:\(path)");
Encoding ANSI = Encoding.GetEncoding(1252);
Encoding UTF8 = Encoding.UTF8;
byte[] utf8_bytes, ansi_bytes;
utf8_bytes = UTF8.GetBytes(read.ReadLine());
ansi_bytes = Encoding.Convert(UTF8, ANSI, utf8_bytes);
string ansi_str = ANSI.GetString(ansi_bytes);
write.WriteLine(ansi_str);
read.Close();
write.Close();

But for some reason this does not work: the new file still shows a lot of unintelligible question marks. Thanks in advance.

By the way, the encoding can be passed directly to the StreamReader / StreamWriter constructor, which would make everything much simpler, but I would like to understand why my approach does not work.
I suspect the file is in Cyrillic (windows-1251) while you are working with 1252 (hence the gibberish in place of the Russian letters). What is the ultimate goal of these manipulations?
1) Yes, I mixed them up, ANSI here means 1251, but Encoding ANSI = Encoding.GetEncoding(1251) does not work either. The file itself is a Russian word dictionary in ANSI (at least that is what Notepad++ reports). 2) The final goal: there is a dictionary of synonyms (~20k words), and from each line I remove the definitions of the terms, leaving only the words themselves.
I just have the feeling that when the text is written to the file, it is automatically converted back to Unicode.

AlexeyM · Answer 1 · 2012-05-01T09:53:13

Take a closer look at what the code does:

it reads text from the file, decoding it as UTF-8 (the StreamReader default)
it stores the text in a UTF-16LE string (the System.String class in .NET)
it gets the UTF-8 representation of the text as a byte array
it converts the text from UTF-8 to ANSI, producing another byte array
it converts the ANSI text back to UTF-16LE to store it in a System.String
it writes the text (now in a System.String) with a StreamWriter, which writes UTF-8 by default

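In total, the text goes through the following transformations:

UTF-8 (StreamReader) -> UTF-16LE (System.String) -> UTF-8 (byte[]) -> ANSI (byte[]) -> UTF-16LE (System.String) -> UTF-8 (StreamWriter)

I think it is now obvious what the problem is: the ANSI bytes of the file are decoded as UTF-8 in the very first step, and the last step writes UTF-8 again, so the conversions in between cannot help.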
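To make the first step of that chain concrete, here is a minimal sketch that performs it in memory instead of on a file. The class and method names (EncodingChainDemo, DecodeAnsiBytesAsUtf8) are made up for illustration, and the sketch assumes .NET Core 3.0 or later, where code page 1251 has to be registered through CodePagesEncodingProvider first; on the .NET Framework that registration line should not be needed.

```csharp
using System;
using System.Text;

public static class EncodingChainDemo
{
    // Encodes text as Windows-1251 bytes, then decodes those bytes as UTF-8,
    // imitating a StreamReader that was not given an explicit encoding.
    public static string DecodeAnsiBytesAsUtf8(string text)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where code page 1251 is
        // only available after this provider has been registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding ansi = Encoding.GetEncoding(1251);

        byte[] ansiBytes = ansi.GetBytes(text);     // what is really in the file
        return Encoding.UTF8.GetString(ansiBytes);  // what the reader hands back
    }

    public static void Main()
    {
        // "\u0441\u043B\u043E\u0432\u043E" is the Russian word for "word".
        string damaged = DecodeAnsiBytesAsUtf8("\u0441\u043B\u043E\u0432\u043E");

        // Print the code points; invalid UTF-8 input shows up as U+FFFD.
        foreach (char c in damaged)
        {
            Console.Write("U+{0:X4} ", (int)c);
        }
        Console.WriteLine();
    }
}
```

Once the letters have been replaced by U+FFFD, no later conversion can bring them back. An encoder for a single-byte code page has no byte for U+FFFD and by default substitutes a replacement character, which would be consistent with the question marks seen in the output file.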

Yes, that is clear, thank you. One thing though: how do I convert the ANSI byte array into a string in the corresponding encoding rather than UTF-16LE?
That is, if I understand correctly, in my example ansi_str will hold a UTF-16LE string.
A System.String in .NET is always UTF-16LE encoded.
Now I am confused: if a System.String in .NET is always UTF-16LE and Encoding.GetString() returns a System.String, then I have to write the byte array to the stream directly somehow?
Create a stream for writing to the file (FileStream) and use the Stream.Write() method.
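The FileStream suggestion could look roughly like the sketch below. RawByteWriter and WriteLineAsAnsi are hypothetical names, the temp-file path stands in for the elided D:\ path, and the CodePagesEncodingProvider registration again assumes .NET Core 3.0 or later.

```csharp
using System;
using System.IO;
using System.Text;

public static class RawByteWriter
{
    // Encodes one line (plus a line break) with the given encoding and writes
    // the resulting bytes to the stream as-is, with no StreamWriter in between.
    public static void WriteLineAsAnsi(Stream output, string line, Encoding ansi)
    {
        byte[] bytes = ansi.GetBytes(line + Environment.NewLine);
        output.Write(bytes, 0, bytes.Length);
    }

    public static void Main()
    {
        // Assumption: needed on .NET Core 3.0+ / .NET 5+ for code page 1251.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding ansi = Encoding.GetEncoding(1251);

        // Hypothetical output location used only for this demo.
        string path = Path.Combine(Path.GetTempPath(), "ansi_demo.txt");
        using (FileStream fs = new FileStream(path, FileMode.Create, FileAccess.Write))
        {
            WriteLineAsAnsi(fs, "\u0441\u043B\u043E\u0432\u043E", ansi);
        }

        // One byte per Cyrillic letter plus the line break.
        Console.WriteLine(new FileInfo(path).Length);
    }
}
```

The string stays UTF-16 in memory the whole time; the target encoding only matters at the moment the characters are turned into bytes for the file.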

klutch1991 · Answer 2 · 2017-03-27T08:25:31

StreamReader read = new StreamReader(@"D:\(path)");
StreamWriter write = new StreamWriter(@"D:\(path)");
Encoding ANSI = Encoding.GetEncoding(1251);
Encoding UTF8 = Encoding.UTF8;
byte[] utf8_bytes, ansi_bytes;
utf8_bytes = UTF8.GetBytes(read.ReadLine());
ansi_bytes = Encoding.Convert(UTF8, ANSI, utf8_bytes);
string ansi_str = ANSI.GetString(ansi_bytes);
write.WriteLine(ansi_str);
read.Close();
write.Close();

Try 1251 instead of 1252, then Russian characters will be displayed correctly.
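Going by Answer 1, changing the code page alone may not be enough, because the StreamReader and StreamWriter here would still default to UTF-8. The simpler route mentioned in the question's comments, passing the encoding to both constructors, might look like the sketch below. AnsiFileCopy and CopyLines are illustrative names, the temp files stand in for the elided D:\ paths, and the provider registration assumes .NET Core 3.0 or later.

```csharp
using System;
using System.IO;
using System.Text;

public static class AnsiFileCopy
{
    // Reads every line with the given encoding and writes it back out with
    // the same encoding; no manual byte-array conversion is involved.
    public static void CopyLines(string inputPath, string outputPath, Encoding ansi)
    {
        using (StreamReader reader = new StreamReader(inputPath, ansi))
        using (StreamWriter writer = new StreamWriter(outputPath, false, ansi))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Any per-line processing of the dictionary would go here.
                writer.WriteLine(line);
            }
        }
    }

    public static void Main()
    {
        // Assumption: needed on .NET Core 3.0+ / .NET 5+ for code page 1251.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding ansi = Encoding.GetEncoding(1251);

        // Hypothetical temp files used only for this demo.
        string input = Path.Combine(Path.GetTempPath(), "dict_in.txt");
        string output = Path.Combine(Path.GetTempPath(), "dict_out.txt");
        File.WriteAllText(input, "\u0441\u043B\u043E\u0432\u043E" + Environment.NewLine, ansi);

        CopyLines(input, output, ansi);
        Console.WriteLine(File.ReadAllText(output, ansi) == File.ReadAllText(input, ansi));
    }
}
```

With both ends using the same single-byte encoding, the text is decoded correctly on the way in and encoded to the same bytes on the way out.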
