Sort records in the database by non-standard alphabet

Question

I have an online dictionary of the Kabardino-Circassian language built with Django. The language uses a Cyrillic alphabet, but with a twist: some of its letters are written as letter combinations, for example "kkh", "tl", "pl", "g", "hu", etc. Another peculiarity is that the alphabetical order differs from the standard Cyrillic one (see the Wikipedia article). This gives two tasks:

Problem 1: The program must treat each of these letter combinations as a single letter.

Problem 2: The words in the database must be sorted according to the order of the Kabardino-Circassian alphabet.

I am a beginner Python developer and don't even know where to start with this problem, so I am asking for your help or advice, colleagues.

Perhaps a good solution would be to "translate" each word in advance into a sortable version.
That is, two values are stored in the database: the word itself and a variant of it in which each letter combination is replaced with a single character.
Develop a collation and an encoding for the language and get them accepted into all the standards.
Then, in new versions of databases, you can specify the corresponding locale and everything will sort itself.
Good idea, @ReinRaus, but since there are already more than 4 thousand words in the database, I need to find a way to automate this process.
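The stored sortable variant does not have to be filled in by hand. Below is a minimal sketch of one way to generate it, assuming a six-letter subset of the alphabet for illustration; the names `ALPHABET`, `ORD`, `LETTER_RE` and `make_sort_key` are hypothetical. Instead of replacing each letter with a single character, it writes each letter's number as two digits, so ordinary string comparison of the keys follows the alphabet.

```python
import re

# Illustrative subset of the alphabet, in alphabetical order (an assumption);
# the real table would list every letter, including all multi-character ones.
ALPHABET = ['а', 'э', 'б', 'в', 'г', 'гу']
ORD = {letter: i for i, letter in enumerate(ALPHABET, start=1)}

# Try longer letters first so that 'гу' is not read as 'г' followed by 'у'.
LETTER_RE = re.compile(
    '|'.join(sorted(map(re.escape, ALPHABET), key=len, reverse=True))
)


def make_sort_key(word):
    """Encode a word as zero-padded letter numbers, e.g. 'гуа' -> '0601'.

    Characters outside ALPHABET are silently skipped in this sketch.
    """
    letters = LETTER_RE.findall(word.lower())
    return ''.join('%02d' % ORD[letter] for letter in letters)


words = ['гуа', 'ба', 'гэ', 'эба', 'га']
# One pass over the existing words fills in the stored key for each of them.
sort_keys = {word: make_sort_key(word) for word in words}
ordered = sorted(words, key=sort_keys.get)
print(ordered)
```

In a Django project the key could live in an extra text column (say, a hypothetical `sort_key` field) that is filled in once for the existing words and then used in `order_by('sort_key')`.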
@Sergey, that is probably the ideal option, but as a beginner I don't want to venture into that jungle just yet :)
Check whether a Unicode collator for the Kabardino-Circassian language already exists (here is an example for Russian: rank = icu.Collator.createInstance(icu.Locale('ru')).getSortKey ).
If not, then: 1. split the input text into a sequence of letter combinations (by analogy with regex.findall(r'\X', text) ); 2. write a rank function that can compare individual letter combinations.

tonal · Accepted Answer · 2016-04-16T17:31:47

Write a function that splits a string into the letters of the language (including letter combinations).
Build a dictionary that assigns each letter of the language its ordinal number.
Write a function that walks through the two words being compared in parallel, splitting them into letters and comparing their numbers from the table.

Code like this:

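    import re

    # Ordinal number of every letter, including multi-character letters.
    # The rest of the table is elided in the answer.
    KAB_CHER_ORD = {'а': 1, 'э': 2, 'б': 3, 'в': 4, 'г': 5, 'гу': 6, ...}

    # Longer letter combinations come first so they match before their prefixes.
    # The rest of the pattern is elided in the answer.
    _re_kab_cher_smb = re.compile('кхъу|хъу|кхъ|къу|кІу|....', re.I)


    def kab_cher_smb(w):
        '''Split a string into letters of the Kabardino-Circassian alphabet.'''
        for m in _re_kab_cher_smb.finditer(w):
            yield m.group()


    def kab_cher_cmp(w1, w2):
        '''Compare two words in Kabardino-Circassian alphabetical order.

        @param w1: first word
        @param w2: second word
        @return: 0 if the words are equal, <0 if the first sorts earlier,
                 >0 if the second sorts earlier
        '''
        s1 = list(kab_cher_smb(w1.lower()))
        s2 = list(kab_cher_smb(w2.lower()))
        for c1, c2 in zip(s1, s2):
            if c1 == c2:
                continue
            return KAB_CHER_ORD[c1] - KAB_CHER_ORD[c2]
        # One word is a prefix of the other: the shorter word sorts first.
        return len(s1) - len(s2)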
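In Python 3 a comparison function like this cannot be passed to `sorted()` directly; `functools.cmp_to_key` wraps it into a key. The sketch below shows the idea with a hypothetical six-letter table and a simplified comparator, `cmp_words`, standing in for the full `KAB_CHER_ORD` and `kab_cher_cmp`.

```python
import re
from functools import cmp_to_key

# Hypothetical six-letter table standing in for the full KAB_CHER_ORD.
ORD = {'а': 1, 'э': 2, 'б': 3, 'в': 4, 'г': 5, 'гу': 6}
# 'гу' is tried before the single letters so it is matched as one letter.
LETTER_RE = re.compile('гу|[аэбвг]')


def letters(word):
    """Split a word into letters of the illustrative alphabet."""
    return LETTER_RE.findall(word.lower())


def cmp_words(w1, w2):
    """Negative if w1 sorts first, positive if w2 sorts first, 0 if equal."""
    s1, s2 = letters(w1), letters(w2)
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            return ORD[c1] - ORD[c2]
    # All shared positions are equal: the shorter word sorts first.
    return len(s1) - len(s2)


words = ['гуа', 'га', 'гэ', 'г']
ordered = sorted(words, key=cmp_to_key(cmp_words))
print(ordered)
```

A plain `sorted(words)` would place 'гуа' before 'гэ', because it compares code points and sees 'гу' as two separate letters.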

For a finished solution, you still need to decide what to do with characters that are not part of the Kabardino-Circassian alphabet, such as a space or a digit. :) Also, in kab_cher_smb you can use a proper state machine instead of a regular expression. :)
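One possible policy for such characters, sketched here only as an assumption: give the regular expression a catch-all alternative and sort every unknown character before all letters, ordered among themselves by code point. The table is again a hypothetical six-letter subset, and `TOKEN_RE` and `sort_key` are illustrative names.

```python
import re

# Illustrative six-letter subset of the alphabet (an assumption).
ORD = {'а': 1, 'э': 2, 'б': 3, 'в': 4, 'г': 5, 'гу': 6}
# The final '.' catches any single character that is not a known letter.
TOKEN_RE = re.compile('гу|[аэбвг]|.', re.S)


def sort_key(text):
    """Build a list of (rank, code point) pairs usable as a sort key."""
    key = []
    for token in TOKEN_RE.findall(text.lower()):
        if token in ORD:
            key.append((ORD[token], 0))
        else:
            # Assumed policy: unknown characters sort before every letter,
            # ordered among themselves by code point.
            key.append((0, ord(token)))
    return key


entries = ['гу а', 'га', '2а', 'гуа']
ordered = sorted(entries, key=sort_key)
print(ordered)
```

Other policies are just as plausible, for example ignoring spaces and punctuation entirely; which one fits depends on how the dictionary entries should be presented.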
