For example, there are letters that are similar between the Russian and Latin alphabets: a, H, and so on ...

There is some dictionary.

There is some database where information could be entered incorrectly => if you perform a search in the dictionary, you will not be able to find it.

It is necessary to sort through all combinations of similar Latin and Cyrillic characters.

For example, the word was given Rhinoceros. (A false example)

You need to generate sequences (black is highlighted in Latin):

H orozh
But about sorog
Nose about horn
Nosor about g
But sorog

Can you suggest an abstract algorithm to solve this problem?

The first thing that comes to mind is to create a dictionary and write down the positions of the characters that I am going to change.

Next, I do not know which way to go.

  • And why do you need all the options. what is the ultimate goal. I see that if we need to find such words, then we will look for words consisting of two types of letters with a simple regular schedule. If you need to find real words, then we bring all the letters into one type and look for (in the dictionary we are looking for, maybe the same we give everything to one letter) - Mike
  • 2
    IMHO no generation is necessary. It is necessary to write a function to compare 2 strings character by character, and if 2 characters from the same family of similar characters, they are equal - tym32167
  • Right. Write Comparer to solve your problem. - Andrey NOP
  • @ tym32167, I do not want to load the entire dictionary into memory and sort through line by line. Faster queries will be faster to the database, since the profit from the indices will be. - iluxa1810
  • @Mike, I think that makes sense. I will create a search field in the dictionary, which is reduced to one type of characters, and the word itself will also be reduced to one type. - iluxa1810

2 answers 2

Well, like so.

You find groups of “identical” characters, and you compose a mapping of a character to its group. (To optimize, you can omit singleton groups.) This is done once, before the program starts.

Then go through the characters of the line. For each of them you have a set of characters, you can replace it with a short one. Your code will be:

var sb = new StringBuilder(s); var results = GetAllCombinations(sb, 0); 
 IEnumerable<string> GetAllCombinations(StringBuilder sb, int index) { if (index >= sb.Length) { yield return sb.ToString(); yield break; } char currC = sb[index]; List<char> table = GetSimilarSymbols(currC); if (table == null) table = new List<char>() { currC }; foreach (var c in table) { sb[index] = c; foreach (var s in GetAllCombinations(sb, index + 1)) yield return s; } sb[index] = currC; } 
  • I'm not sure, but it looks like you are going from left to right changing the characters => in the sequence there will be no case when only the last letter has changed - iluxa1810
  • @ iluxa1810: Why not? In each recursive call, GetAllCombinations will have one element in the table , except the last. - VladD 5:58 pm
  1. Identify reference letters once.
  2. Make a table of replacements of non-reference letters with reference ones.
  3. All words in the dictionary are replaced by spelling through reference letters.
  4. Replace all non-reference letters in the received text with reference ones.
  5. Perform a simple dictionary search (the match is now unequivocal).