Sorting unique strings

Question

Good day to all!

The problem is the following: there is a text file ( example.txt ), in which there are about half a million lines (both unique and not). It is necessary to write from it to the new file ( out.txt ) the top of duplicate lines + the number of repetitions of this line in the original file.

Please help with the implementation. I would be very grateful.

Community spirit ♦ one · Answer 1 · 2015-10-18T18:06:06

To find the 10 most frequent lines in a file and print them in lexicographical order:

 #!/usr/bin/env python3 import sys from collections import Counter c = Counter(line.strip() for line in sys.stdin) for line, n in sorted(c.most_common(10)): print(n, line)

Example, if you save this code in the топ-cтрок file and give permission to execute chmod +x топ-cтрок :

 $ ./топ-cтрок <example.txt >out.txt

Or on Windows (if .py not present in %PATHEXT% ) :

 C:\> py топ-cтрок <example.txt >out.txt

To sort the input file that does not fit in memory, you can use temporary files . To find duplicates in an already sorted file, you can use itertools.groupby() .

The bytes in the file are decoded into Unicode text using sys.stdin.encoding or if the reading comes from a file opened inside the script ( open(filename) ), then the locale.getpreferredencoding(False) encoding is used (something like cp1251 on Windows and utf-8 on Mac, Linux). If you need to use a different encoding, then you must set the encoding parameter of the open() call or set the PYTHONIOENCODING environment variable to change the input / output encoding (stdin / stdout / stderr).

By default, sorting occurs in lexicographical order for Unicode characters ( ord(c) ), for example, the sort order of large ( А ) and small letters ( а ) may differ from the results of the sort utility, which uses the current default locale (compare the results: sort and LC_ALL=C sort ). To take into account the characteristics of sorting in a given language, you can use russian_strings.sort(key=icu.Collator.createInstance(icu.Locale('ru')).getSortKey) . It is also possible to normalize strings ( unicodedata.normalize() ) in order to u'\u0451' ( u'\u0451' ) and (( u'\u0435\u0308' ) in the u'\u0435\u0308' .

If there are problems with printing Unicode strings on Windows, then install the win-unicode-console package, see: How can I output a Unicode string from Windows to the Windows console? On other systems, it is enough to configure the locale and / or set the LANG , LC_ALL , LC_CTYPE , PYTHOIOENCODING (the latter - if the output is redirected), for example:

 $ LC_ALL=C.UTF-8 ./топ-cтрок <example.txt

user6550 · Answer 2 · 2013-04-09T13:28:19

Get a dictionary in which the key is a string, the value is the number of repetitions of the string. We read a line, there is no such in the dictionary - we create a key with the value 1. There is - we increase the value. Then we sort and display the key: value pairs.

Code examples: Writing and reading files Dictionaries and sorting them

inso inso 1,116 5 silver marks 7 bronze marks · Answer 3 · 2013-04-09T14:15:06

Under * nix, you can:

 cat example.txt | sort | uniq -c | sort -nr > out.txt

If you need only 10 top, for example:

 cat example.txt | sort | uniq -c | sort -nr | head -n 10 > out.txt

Notify the speed of such a command, the use of RAM while running for your million lines.

Sorting unique strings

3 answers 3

More articles: