There is a text: a set of sentences. Its words are reduced to their base form by a stemmer.

Input: a string, usually about 20 kB of text. It is an article made up of sentences in Russian. The encoding (UTF-8) and similar parameters can be anything; that is hardly essential.

Example: «У Олега появился друг Олег» (“Oleg has a friend Oleg”)

After the stemmer runs, you get an object, and it is not yet clear how best to store it. For example, it could be a list or a tuple: ['Y', 'Oleg', 'appear', 'Oleg']. By processing this list we find the most frequent word.
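The frequency count over the stemmer output can be sketched with `collections.Counter`. This is only an illustration: the helper name `most_frequent` is hypothetical, and the input is the example list from the question, not real mystem output.

```python
from collections import Counter

def most_frequent(lemmas):
    """Return the lemma that occurs most often, or None for empty input."""
    counts = Counter(lemmas)
    if not counts:
        return None
    # most_common(1) returns a list with one (lemma, count) pair.
    return counts.most_common(1)[0][0]

# Example list taken from the question above.
keyword = most_frequent(['Y', 'Oleg', 'appear', 'Oleg'])
print(keyword)
```

A list is enough here because only the counts matter; the position of each lemma becomes relevant only when the keywords have to be mapped back to the source text.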

Output: you need to get a string (the source text) in which the keywords are wrapped in HTML tags, for example bold tags: «У <b>Олега</b> появился друг <b>Олег</b>»

What is the best way to do this? The only thing that comes to mind is to keep the original words in a list as well; then, after stemming, knowing each word's index in the list, wrap the right words in tags, and finally join the list back into plain text.
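A minimal sketch of this index-based idea, under stated assumptions: `lemmatize` and `TOY_LEMMAS` are hypothetical stand-ins for a real lemmatizer such as mystem and cover only the example sentence, and `highlight` is an illustrative name.

```python
import re

# Hypothetical stand-in for a real lemmatizer: a tiny lookup table
# that covers only the words of the example sentence.
TOY_LEMMAS = {"олега": "олег", "олег": "олег", "появился": "появиться"}

def lemmatize(word):
    """Return the toy lemma of a word (lower-cased if it is unknown)."""
    lowered = word.lower()
    return TOY_LEMMAS.get(lowered, lowered)

def highlight(text, keywords, tag="b"):
    """Wrap every word whose lemma is in `keywords` in the given HTML tag."""
    # The capturing group makes re.split keep the words themselves:
    # odd indices hold words, even indices hold the separators between them,
    # so joining the parts reproduces the original text exactly.
    parts = re.split(r"(\w+)", text)
    for i in range(1, len(parts), 2):
        if lemmatize(parts[i]) in keywords:
            parts[i] = f"<{tag}>{parts[i]}</{tag}>"
    return "".join(parts)

print(highlight("У Олега появился друг Олег", {"олег"}))
```

Because the separators are kept in the same list as the words, punctuation and spacing survive unchanged, and no separate index bookkeeping is needed. The sketch does not escape HTML special characters in the source text.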

Maybe there is a simpler option?

    1 answer

    Since a stemmer aims to find the stem of a word, and the stem is the unchanging part of the word, in my opinion a good approach would be to mark up the text with a regular expression. The pattern would look something like [^\s]*Олег[^\s]* . To wrap the matches in tags, you will need to use groups.
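A rough sketch of the regular-expression approach with a capturing group, using Python's `re` module. The function name `wrap_stem` is hypothetical, and `\S` is used as the shorthand for `[^\s]`.

```python
import re

def wrap_stem(text, stem, tag="b"):
    """Wrap every whitespace-delimited token that contains `stem` in a tag."""
    # re.escape guards against regex metacharacters in the stem;
    # IGNORECASE lets a lower-cased stem match capitalized words.
    pattern = re.compile(r"(\S*" + re.escape(stem) + r"\S*)", re.IGNORECASE)
    # \1 in the replacement refers to the captured token.
    return pattern.sub(rf"<{tag}>\1</{tag}>", text)

print(wrap_stem("У Олега появился друг Олег", "олег"))
```

Two caveats follow from the pattern itself: it also captures punctuation attached to the word (for example a trailing comma), and it matches the stem anywhere inside a token, so unrelated words that merely contain the same letters would be wrapped too.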

    • “Since a stemmer aims to find the stem of a word, and the stem is the unchanging part of the word”: I probably used the word "stemmer" incorrectly; apparently you are right that a stemmer is the thing that cuts off suffixes and endings. It seems to me that a lemmatizer (?) is something smart enough to reduce a word to its base form. For example, "been" -> "to be" is clearly not simple truncation. But I use Yandex mystem, and for some reason they call their product a stemmer. - treugolnik
    • Then I think the option you suggested in the question will do. I don't know what principle mystem works on, but I would also preprocess the text first, removing stop words from it (prepositions, conjunctions, numerals, etc.) - rusnasonov
    • @Nexus Yes, when searching for the most frequent word I remove stop words, and in general everything up to 3 characters long is thrown out. - treugolnik
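The filtering described in the comments could look roughly like this. It is a sketch under assumptions: `STOP_WORDS` is a deliberately tiny hypothetical list, `top_keywords` is an illustrative name, and "up to 3 characters" is read as dropping words of length 3 or less.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"который", "также", "чтобы"}

def top_keywords(lemmas, n=1, min_len=4):
    """Return the n most frequent lemmas after dropping short and stop words."""
    kept = [w for w in lemmas if len(w) >= min_len and w not in STOP_WORDS]
    return [word for word, _ in Counter(kept).most_common(n)]

print(top_keywords(["у", "олег", "появиться", "друг", "олег", "у", "у"]))
```

Filtering before counting keeps a short, frequent preposition such as "у" from being chosen as the keyword.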