Remove duplicate lines from a file, keeping order

Question

There is a file with lines, some of which are duplicates.

How can I remove the duplicates without changing the order of the lines, using GNU coreutils utilities from the GNU operating system (compatibility with the POSIX standard is not required)?

Accepted Answer · 2016-06-22T12:53:59

For example, you can use this pipeline:

$ nl -b a source-file | sort -k 2 -u | sort -n | cut -f 2- > output-file

Explanation:

nl - copies stdin (or the files given as arguments) to stdout, adding consecutive numbers at the beginning of the lines; by default the number is separated from the rest of the line by a tab character. Also by default, empty lines are left unnumbered; the -b a option numbers every line
sort -k 2 -u - sorts the list by the key that starts at the second field and runs to the end of the line ( -k 2 ) and removes lines whose keys are duplicates ( -u ), ignoring the first field. By default sort starts a new field at each transition from a non-blank to a blank character, so here the key is everything after the line number. Of each group of lines with equal keys, GNU sort keeps the first. sort cannot simply "remove duplicates" without also sorting
sort -n - sorts the list numerically ( -n ); since the lines begin with numbers, the result is the list of lines in their original order (only with gaps in the numbering)
cut -f 2- - keeps only the fields from the second one onward ( -f 2- ); for cut, fields are separated by tabs by default.
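
The stages described above can be traced on a small sample. This is only a minimal sketch: the sample data and the variable names input and result are made up for illustration, and GNU sort is assumed, since it is GNU sort that keeps the first line of each group with equal keys.

```shell
# Hypothetical sample input: "b" and "a" each occur twice.
input=$(printf 'b\na\nb\nc\na\n')

# 1. nl -b a     : prefix every line (including empty ones) with its number
# 2. sort -k 2 -u: sort by content and drop later copies of equal lines
# 3. sort -n     : restore the original order using the line numbers
# 4. cut -f 2-   : strip the numbers again
result=$(printf '%s\n' "$input" \
  | nl -b a \
  | sort -k 2 -u \
  | sort -n \
  | cut -f 2-)

printf '%s\n' "$result"
```

With this input the pipeline should print b, a and c, one per line: the first occurrence of each line, in the original order.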

Additional reading:

 $ info coreutils

If the info program is not installed, you can read the online documentation or the individual man pages (which usually contain less information):

 $ man nl
 $ man sort
 $ man cut

Vladimir Gamalyan · Answer 2 · 2016-06-22T13:25:34

Awk option

 $ awk '!a[$0]++' source-file

Here each line of the file becomes a key of the associative array a . The first time a line is encountered, the array has no such element yet, so a[$0] evaluates to zero and the negation ! yields true: the line is printed. Because of the post-increment, the element is non-zero afterwards, so when the same line is encountered again the expression evaluates to false and the line is skipped.
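
The same logic can be spelled out with an explicit condition, which may be easier to read than the one-liner. This is a minimal sketch: the array name seen, the sample data and the variable result are illustrative only and not part of the original answer.

```shell
# Hypothetical sample input with repeated lines.
result=$(printf 'b\na\nb\nc\na\n' | awk '
  {
    # seen[$0] is unset (treated as 0) the first time a line appears,
    # so the line is printed only on its first occurrence.
    if (!seen[$0]) print
    # Count the occurrence so that later copies are skipped.
    seen[$0]++
  }')

printf '%s\n' "$result"
```

Note that awk keeps every distinct line in memory, which may matter for very large files.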

I think the fact that awk is not part of GNU coreutils can be neglected.
@alexanderbarakin indeed, it is not; for some reason I thought awk was part of coreutils.
The GNU operating system does, of course, have its own awk implementation, but it is a separate package that is not directly related to coreutils.
