The task is this. There is a word, for example "Паранoрмальное", in which one of the letters was typed in the English layout. I need to determine the layout of the word (by whichever layout most of its characters belong to), find the characters that come from a different layout (I think Unicode has look-alike characters between more than just the Russian and English layouts), and replace them with the look-alike characters of the layout that most characters belong to. Can you suggest a quick way to do this in C#?
2 answers
To find letters that belong to a particular language, you can use Unicode named blocks in regular expressions.
The conversion tables, however, have to be compiled by hand (the dictionaries in the example contain only two look-alike letters).
    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;

    string text = "Паранoрмальное";

    // Hand-built look-alike tables (only two pairs here).
    var basicLatinToCyrillicDictionary = new Dictionary<char, char> { ['a'] = 'а', ['o'] = 'о' };
    var cyrillicToBasicLatinDictionary = new Dictionary<char, char> { ['а'] = 'a', ['о'] = 'o' };

    string basicLatinPattern = @"\p{IsBasicLatin}";
    string cyrillicPattern = @"\p{IsCyrillic}";

    var basicLatinMatches = Regex.Matches(text, basicLatinPattern);
    var cyrillicMatches = Regex.Matches(text, cyrillicPattern);

    int basicLatinCount = basicLatinMatches.Count;
    int cyrillicCount = cyrillicMatches.Count;

    var sb = new StringBuilder(text);

    if (cyrillicCount > basicLatinCount)
    {
        foreach (Match m in basicLatinMatches)
        {
            char basicLatinChar = m.Value[0];
            // Skip characters that have no look-alike instead of throwing KeyNotFoundException.
            if (basicLatinToCyrillicDictionary.TryGetValue(basicLatinChar, out char cyrillicChar))
                sb.Replace(basicLatinChar, cyrillicChar, m.Index, 1);
        }
    }
    else
    {
        // reverse replacement
    }

    text = sb.ToString();
    Console.WriteLine(text);

If there are more than two languages, the number of conversion tables gets out of hand ...
I took a closer look at the code point ranges: the BasicLatin block contains more than just letters (digits, punctuation, and control characters fall into it too). So the method may not be suitable as it stands.
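One possible way around the BasicLatin concern is .NET's character class subtraction, which can remove everything that is not a letter from a named block. The following is only a sketch: the class name `ScriptCounter` and its two methods are made up for illustration and are not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: counts only the letters of each block.
public static class ScriptCounter
{
    // "[block-[\P{L}]]" means: the block minus every non-letter,
    // so digits, spaces, and punctuation are no longer counted.
    private static readonly Regex LatinLetters =
        new Regex(@"[\p{IsBasicLatin}-[\P{L}]]");

    private static readonly Regex CyrillicLetters =
        new Regex(@"[\p{IsCyrillic}-[\P{L}]]");

    public static int CountLatin(string text)
    {
        return LatinLetters.Matches(text).Count;
    }

    public static int CountCyrillic(string text)
    {
        return CyrillicLetters.Matches(text).Count;
    }
}
```

Calling `ScriptCounter.CountLatin(text)` and `ScriptCounter.CountCyrillic(text)` in place of the two `Matches(...).Count` values should keep digits and punctuation from tipping the majority toward Latin.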
- Of course, there may be several languages, so there would be a lot of such dictionaries. Perhaps it would be better to use a Tuple or DataTable rather than a Dictionary? - iRumba
- And how am I supposed to type characters from other languages? I only have 2 layouts on my keyboard. I would need the character codes. - iRumba
- @iRumba: Ctrl-C / Ctrl-V? - VladD
- @VladD, you're making fun of me :( - iRumba
- @iRumba: Why making fun? If a ready-made table exists, why not copy it into the code? Not one character at a time, but all at once. - VladD
|
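Use character codes: English letters lie between 65 and 122 (decimal), and Russian letters between 192 and 255 (decimal; these are Windows-1251 codes, since ASCII itself stops at 127). Here you can see the codes.
You just need to compare each character.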
    using System.Text;

    string value = "Паранoрмальное";
    // Convert the string into a byte[].
    // Note: Encoding.ASCII turns every non-ASCII character, Cyrillic included, into '?' (63).
    byte[] asciiBytes = Encoding.ASCII.GetBytes(value);

- Why jump straight to ASCII? What advantages does this encoding have over Unicode? - Pavel Mayorov
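The idea of comparing each character by its code can be tried without any byte conversion, because a C# `char` already carries its UTF-16 code unit. The sketch below is an assumption-laden illustration, not the answerer's code: `CodePointClassifier`, `Script`, `Classify`, and `DominantScript` are invented names, and only the basic Russian alphabet (U+0410 to U+044F plus Ё/ё) is treated as Cyrillic.

```csharp
using System;

// Hypothetical helper: classifies characters by comparing their codes directly.
public static class CodePointClassifier
{
    public enum Script { Latin, Cyrillic, Other }

    public static Script Classify(char c)
    {
        // A..Z and a..z only; the codes between them (91..96) are not letters.
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
            return Script.Latin;

        // U+0410..U+044F covers the Russian alphabet except Ё (U+0401) and ё (U+0451).
        if ((c >= '\u0410' && c <= '\u044F') || c == '\u0401' || c == '\u0451')
            return Script.Cyrillic;

        return Script.Other;
    }

    // Returns the script most letters belong to, or Other on a tie or when there are no letters.
    public static Script DominantScript(string word)
    {
        int latin = 0, cyrillic = 0;
        foreach (char c in word)
        {
            switch (Classify(c))
            {
                case Script.Latin: latin++; break;
                case Script.Cyrillic: cyrillic++; break;
            }
        }

        if (latin == cyrillic) return Script.Other;
        return latin > cyrillic ? Script.Latin : Script.Cyrillic;
    }
}
```

For the word from the question, `CodePointClassifier.DominantScript(value)` should report `Script.Cyrillic`, and any character for which `Classify` returns `Script.Latin` is then a candidate for replacement.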
|
ABEKMHOPCTXaeopcyx. - Sasha Chernykh
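The list above, together with the suggestion to paste a ready-made table all at once, could be turned into both conversion dictionaries from two parallel strings. This is a minimal sketch under assumptions: the Cyrillic string is my own pairing of visually similar letters for the Latin list, and `LookalikeTable`, `Build`, and `Apply` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: builds both conversion tables from one pasted pair of strings.
public static class LookalikeTable
{
    // Position i in one string is the visual twin of position i in the other.
    private const string Latin = "ABEKMHOPCTXaeopcyx";
    private const string Cyrillic = "АВЕКМНОРСТХаеорсух";

    public static readonly Dictionary<char, char> LatinToCyrillic = Build(Latin, Cyrillic);
    public static readonly Dictionary<char, char> CyrillicToLatin = Build(Cyrillic, Latin);

    private static Dictionary<char, char> Build(string from, string to)
    {
        var map = new Dictionary<char, char>();
        for (int i = 0; i < from.Length; i++)
            map[from[i]] = to[i];
        return map;
    }

    // Replaces every character that has an entry in the map; all others are kept as is.
    public static string Apply(string text, Dictionary<char, char> map)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
            sb.Append(map.TryGetValue(c, out char twin) ? twin : c);
        return sb.ToString();
    }
}
```

Once the majority layout is known, `LookalikeTable.Apply(text, LookalikeTable.LatinToCyrillic)` (or the reverse table) does the replacement. Supporting another script would mean pasting one more pair of strings rather than typing characters one by one.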