There is a file with different lines. For example:

lol kek ferrals0 ferrals100 ferrals102 ferrals107 ferrals108 ferrals110 ferrals111 ferrals114 buba6 

It is required to delete the lines that are cleared only by a few characters on the right and left of the line and leave only 1 immutable word. The correct result is:

 lol kek ferrals buba6 
  • And what if in the input file will appear more "lir" and "kat"? Leave instead of pairs lir / lol and kek / kat "l" and "k" respectively? - Alexander Prokoshev
  • @AlexanderProkoshev As a rule, there are lines with varying "many" characters (about 10-20 occurrences) and such as lol and kek 1 piece. - LorDo
  • one
    ferrals107, ferralscuko and ferrals10ferrals - the same thing? And if there is a ferraluminium - fold them all up to "ferral"? If not, why not? In general, the criterion "distinguished by several characters on the right and left," in my opinion, requires serious clarification. - Alexander Prokoshev
  • The "few" characters can be the length of the entire string, for example. - Hellseher
  • one
    A note on the verification text, from where did the line "ferrals" come from? - Hellseher

1 answer 1

It is clear that no strictly algorithm can be proposed here. You can try to use a variant of the algorithm "collapse into heaps" (clustering) proposed by M. Weinzweig and M. Bongard in 1973. Well, I will not describe the theory here (it is not very simple), but you can estimate the implementation.

  1. Create a function that calculates the difference between two words. For example, it can work on the following principle:
    • If words of the same length, then pairwise match letters and each not matching increases the difference value. The closer the compared letters to the center of the word, the greater the weight of their difference.
    • If words of different lengths, then the shorter "chase" along the longer one, complementing the free part with spaces. As a result, we take the minimum value of the difference function
  2. For all words in the list we calculate pairwise differences.
  3. Perform the actual collapse:
    • We take a random word from the initial list and collect a new list, which includes those words, the difference values ​​with which there is less than some "delta".
    • Selected words are removed from the source list.
    • Repeat the process until the source list is empty.
  4. In each heap, choose the shortest word.

It is clear that bash is not write. But you require the elements of AI, in fact.