To find the 10 most frequent lines in a file and print them in lexicographical order:
#!/usr/bin/env python3
import sys
from collections import Counter

c = Counter(line.strip() for line in sys.stdin)
for line, n in sorted(c.most_common(10)):
    print(n, line)
For example, if you save this code in a file named топ-cтрок and make it executable (chmod +x топ-cтрок):
$ ./топ-cтрок <example.txt >out.txt
Or on Windows (if .py is not present in %PATHEXT%):
C:\> py топ-cтрок <example.txt >out.txt
To sort an input file that does not fit in memory, you can use temporary files. To find duplicates in an already sorted file, you can use itertools.groupby().
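A minimal sketch of the groupby() approach, assuming the input is already sorted so that equal lines are adjacent. The helper names count_sorted_lines and top_counts are hypothetical and are not part of the original script:

```python
import heapq
from itertools import groupby


def count_sorted_lines(lines):
    """Yield (line, count) pairs for an already sorted iterable of lines.

    Only one run of equal lines is examined at a time, so the whole
    input never has to be held in memory.
    """
    # Remove only the trailing newline so that lines which were adjacent
    # in the sorted input stay adjacent after cleaning.
    cleaned = (line.rstrip("\n") for line in lines)
    for line, group in groupby(cleaned):
        yield line, sum(1 for _ in group)


def top_counts(lines, n=10):
    """Return the n most frequent (line, count) pairs, most frequent first."""
    return heapq.nlargest(n, count_sorted_lines(lines), key=lambda pair: pair[1])


if __name__ == "__main__":
    sample = ["a\n", "a\n", "b\n", "c\n", "c\n", "c\n"]
    for line, n in sorted(top_counts(sample, n=2)):
        print(n, line)
```

Because groupby() only groups adjacent items, this gives wrong counts on unsorted input; in that case the Counter version is the simpler choice.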
The bytes in the file are decoded into Unicode text using sys.stdin.encoding. If the input instead comes from a file opened inside the script (open(filename)), the locale.getpreferredencoding(False) encoding is used (something like cp1251 on Windows and utf-8 on Mac and Linux). If you need a different encoding, pass the encoding parameter to the open() call, or set the PYTHONIOENCODING environment variable to change the input/output encoding (stdin/stdout/stderr).
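As an illustration of choosing the encoding explicitly rather than relying on the defaults, here is a sketch. The function names are hypothetical, and cp1251 is used only as an example encoding:

```python
import io
from collections import Counter


def count_lines(text_stream):
    """Count stripped lines from an already decoded text stream."""
    return Counter(line.strip() for line in text_stream)


def count_lines_in_file(path, encoding="utf-8"):
    """Open path with an explicit encoding instead of the locale default."""
    with open(path, encoding=encoding) as f:
        return count_lines(f)


def count_lines_in_binary(binary_stream, encoding="utf-8"):
    """Decode a binary stream (for example sys.stdin.buffer) explicitly."""
    wrapper = io.TextIOWrapper(binary_stream, encoding=encoding)
    try:
        return count_lines(wrapper)
    finally:
        # Detach so that the underlying binary stream is not closed
        # when the wrapper is garbage-collected.
        wrapper.detach()


if __name__ == "__main__":
    raw = "привет\nпривет\nмир\n".encode("cp1251")
    print(count_lines_in_binary(io.BytesIO(raw), encoding="cp1251"))
```

Passing the encoding in code makes the script behave the same way regardless of the locale or the environment variables of the machine it runs on.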
By default, strings are sorted lexicographically by Unicode code point (ord(c)). For example, the relative order of uppercase (А) and lowercase (а) letters may differ from the output of the sort utility, which uses the current locale (compare the results of sort and LC_ALL=C sort). To take language-specific collation rules into account, you can use russian_strings.sort(key=icu.Collator.createInstance(icu.Locale('ru')).getSortKey). It is also possible to normalize strings (unicodedata.normalize()) so that the precomposed ё (u'\u0451') and the decomposed form, е followed by a combining diaeresis (u'\u0435\u0308'), are treated as the same string.
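The effect of normalization can be sketched as follows. The choice of the NFC form and the helper name nfc are assumptions made for this illustration; another normalization form may suit a particular data set better:

```python
import unicodedata

composed = "\u0451"          # ё as a single code point
decomposed = "\u0435\u0308"  # е followed by a combining diaeresis


def nfc(s):
    """Return s in Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", s)


# The two spellings look identical but are different strings.
print(composed == decomposed)            # False
# After normalization they compare equal and are counted as one line.
print(nfc(composed) == nfc(decomposed))  # True

words = ["\u0451\u0436", "\u0435\u0308\u0436", "\u0435\u0436"]
# Normalizing before deduplicating merges the two spellings of the first word.
print(sorted(set(nfc(w) for w in words)))
```

In the counting script, the same idea would mean applying nfc() to each line before passing it to Counter.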
If there are problems printing Unicode strings on Windows, install the win-unicode-console package; see: How can I output a Unicode string from Windows to the Windows console? On other systems, it is enough to configure the locale and/or set the LANG, LC_ALL, LC_CTYPE, or PYTHONIOENCODING environment variables (the last one if the output is redirected), for example:
$ LC_ALL=C.UTF-8 ./топ-cтрок <example.txt