To find the 10 most frequent lines in a file and print them in lexicographical order:
#!/usr/bin/env python3
import sys
from collections import Counter

c = Counter(line.strip() for line in sys.stdin)
for line, n in sorted(c.most_common(10)):
    print(n, line)
For example, if you save this code in a file named топ-cтрок and make it executable (chmod +x топ-cтрок):
$ ./топ-cтрок <example.txt >out.txt
Or on Windows (if .py is not listed in %PATHEXT%):
C:\> py топ-cтрок <example.txt >out.txt
To sort an input file that does not fit in memory, you can use temporary files. To find duplicates in an already sorted file, you can use itertools.groupby().
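The two ideas above can be combined roughly as follows. This is only a sketch, not a tuned implementation: the function names (sorted_chunks, count_sorted, external_counts) and the chunk_size parameter are made up for illustration, and it assumes the lines have already had their trailing newline removed.

```python
import heapq
import itertools
import tempfile


def sorted_chunks(lines, chunk_size):
    """Yield temporary files, each holding one sorted chunk of the input."""
    it = iter(lines)
    while True:
        chunk = sorted(itertools.islice(it, chunk_size))
        if not chunk:
            return
        f = tempfile.TemporaryFile("w+", encoding="utf-8", newline="\n")
        f.writelines(line + "\n" for line in chunk)
        f.seek(0)  # rewind so the chunk can be read back during the merge
        yield f


def count_sorted(lines):
    """Yield (line, count) for each run of equal adjacent lines."""
    for key, group in itertools.groupby(lines):
        yield key, sum(1 for _ in group)


def external_counts(lines, chunk_size=100_000):
    """Count duplicate lines while keeping only one chunk in memory at a time."""
    files = list(sorted_chunks(lines, chunk_size))
    try:
        # heapq.merge lazily merges the already sorted chunk files.
        merged = heapq.merge(*((l.rstrip("\n") for l in f) for f in files))
        yield from count_sorted(merged)
    finally:
        for f in files:
            f.close()
```

The 10 most frequent lines could then be picked with something like heapq.nlargest(10, external_counts(stripped_lines), key=lambda pair: pair[1]), where stripped_lines is any iterable of lines without trailing newlines.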
The bytes of the file are decoded into Unicode text using sys.stdin.encoding. If the script instead reads from a file that it opens itself (open(filename)), the locale.getpreferredencoding(False) encoding is used (something like cp1251 on Windows and utf-8 on Mac and Linux). To use a different encoding, pass the encoding parameter to open(), or set the PYTHONIOENCODING environment variable to change the encoding of the standard streams (stdin / stdout / stderr).
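As an illustration of passing the encoding explicitly, here is a minimal sketch of the same counting logic reading from a named file instead of stdin. The function name top_lines and its parameters are hypothetical, chosen only for this example.

```python
from collections import Counter


def top_lines(filename, n=10, encoding="utf-8"):
    """Return the n most frequent lines of a file, sorted lexicographically."""
    # The explicit encoding overrides the locale-dependent default of open().
    with open(filename, encoding=encoding) as f:
        counts = Counter(line.strip() for line in f)
    return sorted(counts.most_common(n))
```

For a file saved in a legacy Windows code page, the call might look like top_lines("example.txt", encoding="cp1251").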
By default, strings are sorted lexicographically by Unicode code point (ord(c)). For example, the relative order of uppercase (А) and lowercase (а) letters may differ from the results of the sort utility, which uses the current default locale (compare the results of sort and LC_ALL=C sort). To take the sorting rules of a particular language into account, you can use russian_strings.sort(key=icu.Collator.createInstance(icu.Locale('ru')).getSortKey). It is also possible to normalize strings (unicodedata.normalize()) so that ё written as a single code point (u'\u0451') and ё written as е plus a combining diaeresis (u'\u0435\u0308') end up in the same form.
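A small sketch of both points, using only the standard library. The helper name nfc is made up for this example, and NFC is just one of the available normalization forms.

```python
import unicodedata

composed = "\u0451"          # ё as a single code point
decomposed = "\u0435\u0308"  # е followed by a combining diaeresis


def nfc(s):
    """Normalize a string to the composed form (NFC)."""
    return unicodedata.normalize("NFC", s)


# Without normalization the two spellings are different strings,
# so a Counter would count them as different lines.
same_before = composed == decomposed
same_after = nfc(composed) == nfc(decomposed)

# The default sort compares code points, so all uppercase Cyrillic
# letters come before all lowercase ones.
by_code_point = sorted(["а", "Б", "А"])
```

In the counting script, normalization could be applied to each line before it is passed to Counter, for example nfc(line.strip()).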
If printing Unicode strings on Windows causes problems, install the win-unicode-console package; see: How can I output a Unicode string from Windows to the Windows console? On other systems, it is enough to configure the locale and/or set the LANG, LC_ALL, LC_CTYPE, or PYTHONIOENCODING environment variables (the last one if the output is redirected), for example:
$ LC_ALL=C.UTF-8 ./топ-cтрок <example.txt
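To check which encodings Python actually picked up from such settings, a short diagnostic like the one below may help. It is only a sketch; the function name io_encodings and the dictionary keys are made up for this example.

```python
import locale
import sys


def io_encodings():
    """Return the encodings Python currently uses for text I/O."""
    return {
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        "stderr": getattr(sys.stderr, "encoding", None),
        # Used by open() when no encoding argument is given.
        "open() default": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    # Print to stderr so the report is visible even if stdout is redirected.
    for name, enc in io_encodings().items():
        print(f"{name}: {enc}", file=sys.stderr)
```

Running it with and without the environment variables set (and with the output redirected to a file) shows how each setting affects the streams.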