There is a file file.txt containing lines:

536|various text|some more text|some numbers
6352|various text|some more text|some numbers
723|various text|some more text|some numbers
37|various text|some more text|some numbers
...and so on.

I need to delete every line whose number at the very beginning of the line (the part before the first "|" character) already occurs in the same position in another line.

That is, remove duplicates, where a line counts as a duplicate when only its first "column" matches, not necessarily the whole line; the whole line must still be deleted.

Not every copy should be deleted: one of them must remain.

There can be more than two such "duplicates" of a single line in a file.

  • Most of your questions are about regex, so note that the job of a regex is to find a sequence matching a pattern, not to find duplicates. The most a regex can do here is isolate the text from the beginning of the line up to the "|" character. - nick_n_a

1 answer

Doing this with sed is rather awkward.

With sort it is trivial:

 $ sort -t '|' -k 1,1 -u file > result
  • -t char - field separator
  • -k 1,1 - sort by the first field only
  • -u - keep only one line for each distinct key

See man sort for details.
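
As a quick check, the command can be tried on a small sample. The file names sample.txt and result.txt and the sample contents are made up for illustration, and LC_ALL=C is an assumption added here so that the byte-wise ordering is the same on every system.

```shell
# Hypothetical sample: the key 536 appears twice, 37 and 6352 once each.
printf '%s\n' \
  '536|text a|more text|111' \
  '37|text b|more text|222' \
  '536|text c|more text|333' \
  '6352|text d|more text|444' > sample.txt

# Keep one line per distinct first field; the output is sorted by that field.
LC_ALL=C sort -t '|' -k 1,1 -u sample.txt > result.txt
cat result.txt
```

The result should contain three lines, one each for the keys 37, 536 and 6352, in string order rather than in the order of the input. Which of the two 536 lines survives may depend on the sort implementation.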


If it is important to preserve the original order of the lines, the command is slightly longer:

 $ nl -s '|' file | sort -t '|' -k 2,2 -u | sort -n | cut -f 2- -d '|' > result
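
To see why this pipeline keeps the input order, it may help to run it one stage at a time. This is only a sketch: the file names sample.txt, numbered.txt, unique.txt and result.txt and the sample contents are hypothetical, and LC_ALL=C is an added assumption to make the key comparison byte-wise.

```shell
# Hypothetical sample: the key 536 appears on lines 1 and 3.
printf '%s\n' \
  '536|text a|more text|111' \
  '37|text b|more text|222' \
  '536|text c|more text|333' \
  '6352|text d|more text|444' > sample.txt

# Stage 1: nl prepends the line number as a new first field.
nl -s '|' sample.txt > numbered.txt

# Stage 2: deduplicate on what is now field 2 (the original first field).
LC_ALL=C sort -t '|' -k 2,2 -u numbered.txt > unique.txt

# Stage 3: sort by the leading line number to restore the input order,
# then cut the number off again.
sort -n unique.txt | cut -d '|' -f 2- > result.txt
cat result.txt
```

With GNU sort, -u outputs the first line of each run of equal keys, so the 536 line from line 1 should be the one that survives; other implementations may keep a different copy. Note also that nl does not number empty lines by default, so the sketch assumes the input has none.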