Remove duplicate lines by linux mask

Question

There is a file with different lines. For example:

lol kek ferrals0 ferrals100 ferrals102 ferrals107 ferrals108 ferrals110 ferrals111 ferrals114 buba6

It is required to delete the lines that are cleared only by a few characters on the right and left of the line and leave only 1 immutable word. The correct result is:

 lol kek ferrals buba6

And what if in the input file will appear more "lir" and "kat"?
Leave instead of pairs lir / lol and kek / kat "l" and "k" respectively?
@AlexanderProkoshev As a rule, there are lines with varying "many" characters (about 10-20 occurrences) and such as lol and kek 1 piece.
ferrals107, ferralscuko and ferrals10ferrals - the same thing?
And if there is a ferraluminium - fold them all up to "ferral"?
In general, the criterion "distinguished by several characters on the right and left," in my opinion, requires serious clarification.
The "few" characters can be the length of the entire string, for example.
A note on the verification text, from where did the line "ferrals" come from?

Sergey Sergey 6.372 four nineteen · Accepted Answer · 2018-08-20T03:06:33

It is clear that no strictly algorithm can be proposed here. You can try to use a variant of the algorithm "collapse into heaps" (clustering) proposed by M. Weinzweig and M. Bongard in 1973. Well, I will not describe the theory here (it is not very simple), but you can estimate the implementation.

Create a function that calculates the difference between two words. For example, it can work on the following principle:
- If words of the same length, then pairwise match letters and each not matching increases the difference value. The closer the compared letters to the center of the word, the greater the weight of their difference.
- If words of different lengths, then the shorter "chase" along the longer one, complementing the free part with spaces. As a result, we take the minimum value of the difference function
For all words in the list we calculate pairwise differences.
Perform the actual collapse:
- We take a random word from the initial list and collect a new list, which includes those words, the difference values with which there is less than some "delta".
- Selected words are removed from the source list.
- Repeat the process until the source list is empty.
In each heap, choose the shortest word.

It is clear that bash is not write. But you require the elements of AI, in fact.

Remove duplicate lines by linux mask

1 answer 1

More articles: