Which tool can merge several dictionaries into one file, then sort it and remove duplicates?

There are 100+ dictionaries (txt, dic, doc), more than 300 GB in total.

The tool must support UTF-8 and must not strip trailing spaces at the end of a line.

  • For example, the sort program: $ sort -u files > result (the files must, of course, be plain text). - aleksandr barakin 5:03 pm
  • @aleksandrbarakin For my task it would rather be cat dict/*.* | sort | uniq > output.txt, but can it handle such a large amount of data? Execution speed is the priority. - Andrew
  • Needlessly increasing the number of processes from one (my suggestion) to three (yours) will certainly not speed things up. As for whether your disk can handle this volume of data, you know better than I do. - aleksandr barakin 5:38 pm
  • @aleksandrbarakin Indeed, I didn't know that sort has a -u option. - Andrew

1 answer

If these are text files, the sort program is enough:

 $ sort -u file(s) > result

The -u option means "remove duplicates" ("keep only unique lines").
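A minimal sketch of how this behaves with the two requirements from the question (UTF-8 and trailing spaces). The file names and sample words are made up for illustration. Setting LC_ALL=C is an assumption on my part, not part of the original command: it makes sort compare raw bytes, which leaves UTF-8 lines untouched and orders them by code point rather than by linguistic rules, and is usually faster. Drop it if you need locale-aware ordering.

```shell
# Hypothetical sample data in a scratch directory (names are illustrative).
workdir=$(mktemp -d)
printf 'beta\nalpha\nbeta \n' > "$workdir/a.txt"
printf 'alpha\ngamma\nbeta\n' > "$workdir/b.txt"

# LC_ALL=C compares raw bytes, so 'beta' and 'beta ' (with a trailing space)
# are different lines and both survive; exact duplicates are dropped.
LC_ALL=C sort -u "$workdir/a.txt" "$workdir/b.txt" > "$workdir/result.txt"

cat "$workdir/result.txt"
```

The result contains four lines: alpha, beta, beta followed by a space, and gamma. sort does not strip anything from a line; it only reorders lines and drops exact duplicates.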


As for the required resources, see the answers to this question: How could the UNIX sort command sort a very large file?

Briefly: an external sort is used (based on an n-way merge), which means that the file system holding the temporary directory ($TMPDIR, or /tmp, or the one given explicitly with -T directory) must have (as far as I understand) at least as much free space for temporary files as the size of the original data.
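A rough sketch of checking that condition before starting a long run. The function name merge_dicts and its arguments are hypothetical, and the "free space must be at least the input size" rule is the estimate given above, not an exact requirement. It assumes a sort that supports -T (GNU and BSD sort both do).

```shell
# merge_dicts SRC_DIR TMP_DIR OUT_FILE  (hypothetical helper)
merge_dicts() {
    src_dir=$1; tmp_dir=$2; out_file=$3

    # Compare the input size with the free space where temp files will go (KiB).
    need_kb=$(du -sk "$src_dir" | cut -f1)
    free_kb=$(df -Pk "$tmp_dir" | awk 'NR==2 {print $4}')
    if [ "$free_kb" -lt "$need_kb" ]; then
        echo "not enough temp space: need ${need_kb}K, have ${free_kb}K" >&2
        return 1
    fi

    # -T: directory for sort's temporary files during the external merge.
    # LC_ALL=C (an assumption, see the lead-in) forces byte-wise comparison.
    LC_ALL=C sort -u -T "$tmp_dir" "$src_dir"/* > "$out_file"
}
```

It could be called as merge_dicts dict /mnt/big/tmp output.txt, where the paths are placeholders. The output file should live outside SRC_DIR so that it is not picked up by the glob. GNU sort additionally offers -S (in-memory buffer size) and --parallel (number of sort threads), which may help with speed on a data set this large.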