I have an online dictionary of the Kabardino-Circassian language built on Django. The language uses the Cyrillic alphabet, but with a twist: some of its letters are written as letter combinations, for example: "kkh", "tl", "pl", "g", "hu", etc. Another peculiarity is that the alphabetical order differs from the standard Cyrillic order (see the Wikipedia article). This leaves me with two problems:

Problem 1: The program has to treat each of these letter combinations as a single letter.

Problem 2: The words in the database have to be sorted according to the order of the Kabardino-Circassian alphabet.

I am a beginner in Python and don't even know where to start with this problem, so I am asking for your help and advice, colleagues.

  • Perhaps a good solution would be to "translate" each word into a sortable version in advance. That is, two values are stored in the database: the word itself and a variant in which every letter combination is replaced with a single character. You then sort by that second field. But wait for the experts to weigh in. - ReinRaus
  • Develop a collation and an encoding for the language and get them adopted in all the standards. Unicode has all kinds of collations and locales. Then, in new versions of databases, you can specify the corresponding locale and everything will be sorted by itself. - Sergey
  • Good idea, @ReinRaus, but since there are already more than 4 thousand words in the database, I need to find a way to automate this process. Still, it's a start, thanks! - Bootuz
  • @Sergey, that is probably the ideal option, but as a beginner I don't want to venture into that jungle just yet :) - Bootuz
  • Check whether a Unicode collator already exists for the Kabardino-Circassian language (here is an example for Russian: rank = icu.Collator.createInstance(icu.Locale('ru')).getSortKey ). If not, then: 1) split the input text into a sequence of letter combinations (by analogy with regex.findall(r'\X', text) ); 2) write a rank function that can compare individual letter combinations. - jfs

1 answer

  1. Write a function that splits a string into the letters of the language (including letter combinations).
  2. Build a dictionary that maps each letter of the language to its position in the alphabet.
  3. Write a function that walks through the two words being compared in parallel, splitting them into letters and comparing the letters' positions from the dictionary.

Code like this:

    import re

    # Position of each letter (or letter combination) in the alphabet.
    KAB_CHER_ORD = {'а': 1, 'э': 2, 'б': 3, 'в': 4, 'г': 5, 'гу': 6, ...}

    # Longer combinations must come before their prefixes in the alternation.
    _re_kab_cher_smb = re.compile('кхъу|хъу|кхъ|къу|кІу|....', re.I)

    def kab_cher_smb(w):
        '''Split a string into letters of the Kabardino-Circassian language.'''
        for m in _re_kab_cher_smb.finditer(w):
            yield m.group()

    def kab_cher_cmp(w1, w2):
        '''Compare two words in Kabardino-Circassian alphabetical order.

        @param w1: first word
        @param w2: second word
        @return: 0 if the words are equal, <0 if the first sorts earlier,
                 >0 if the second sorts earlier
        '''
        s1 = list(kab_cher_smb(w1.lower()))
        s2 = list(kab_cher_smb(w2.lower()))
        for c1, c2 in zip(s1, s2):
            if c1 != c2:
                return KAB_CHER_ORD[c1] - KAB_CHER_ORD[c2]
        # One word is a prefix of the other: the shorter one sorts first.
        return len(s1) - len(s2)
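A comparison function can be passed to sorted() through functools.cmp_to_key, but it is often simpler to turn each word into a sort key once. Below is a minimal sketch of that variant. The ALPHABET list is only an illustrative fragment (an assumption, to be replaced with the full alphabet in the right order), and the names ALPHABET, RANK and sort_key are hypothetical.

```python
import re

# Illustrative fragment of the alphabet in its own order (an assumption:
# replace it with the complete list of letters and letter combinations).
ALPHABET = ['а', 'э', 'б', 'в', 'г', 'гу', 'гъ', 'гъу', 'д']

# Each letter mapped to its position in the alphabet.
RANK = {letter: i for i, letter in enumerate(ALPHABET, start=1)}

# The regex tries alternatives left to right, so the longest letters go
# first: 'гъу' must be tried before 'гъ', and 'гъ' before 'г'.
_TOKEN_RE = re.compile(
    '|'.join(re.escape(s) for s in sorted(ALPHABET, key=len, reverse=True))
)

def sort_key(word):
    """Return the word as a tuple of letter positions.

    Tuples compare element by element, and a shorter tuple that is a
    prefix of a longer one sorts first, which is the order we want.
    """
    return tuple(RANK[m.group()] for m in _TOKEN_RE.finditer(word.lower()))

words = ['гуэ', 'да', 'гэ', 'эб', 'ба', 'гъуэ', 'гъэ']
print(sorted(words, key=sort_key))
# ['эб', 'ба', 'гэ', 'гуэ', 'гъэ', 'гъуэ', 'да']
```

The same key could also be computed once per word and saved in a second database column, as ReinRaus suggests in the comments, so that the database can do the ordering itself. A script that loops over the existing words would take care of the 4 thousand entries already stored.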

For a final solution, you need to decide what to do when you encounter a character that is not in the Kabardino-Circassian alphabet, such as a space or a digit. :) Also, in kab_cher_smb you can use a proper state machine instead of a regular expression. :)
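Here is one possible way to handle both points, as a rough sketch rather than a finished solution. A hand-written greedy scanner stands in for the regular expression, and characters outside the alphabet are kept and sorted before all letters by code point. That policy is just one assumption among several reasonable ones, the alphabet is again an illustrative fragment, and the names tokenize and sort_key are hypothetical.

```python
# Illustrative fragment of the alphabet (an assumption: use the full list).
ALPHABET = ['а', 'э', 'б', 'в', 'г', 'гу', 'гъ', 'гъу', 'д']
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
MAX_LEN = max(len(letter) for letter in ALPHABET)

def tokenize(word):
    """Greedy longest-match scan over the word.

    At each position, try the longest possible letter combination first.
    If nothing from the alphabet matches, the loop ends with a chunk of
    length 1, so a foreign character is yielded on its own.
    """
    i = 0
    while i < len(word):
        for size in range(min(MAX_LEN, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in RANK:
                break
        yield chunk
        i += len(chunk)

def sort_key(word):
    """Letters become (1, position); anything else becomes (0, code point).

    The leading 0 makes spaces, digits and other foreign characters sort
    before every letter of the alphabet.
    """
    return [
        (1, RANK[t]) if t in RANK else (0, ord(t))
        for t in tokenize(word.lower())
    ]

print(list(tokenize('гъуд 1')))          # ['гъу', 'д', ' ', '1']
print(sorted(['га', 'г1', 'г а'], key=sort_key))  # ['г а', 'г1', 'га']
```

Other policies are just as easy to plug in: skip foreign characters entirely, or raise an error so that bad entries in the dictionary are noticed early.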