Good day to all!

The problem is as follows: there is a text file (example.txt) containing about half a million lines, some unique and some repeated. I need to write the most frequently repeated lines to a new file (out.txt), together with the number of times each line occurs in the original file.

Please help with the implementation. I would be very grateful.

    3 answers

    To find the 10 most frequent lines in a file and print them in lexicographical order:

     #!/usr/bin/env python3
     import sys
     from collections import Counter

     c = Counter(line.strip() for line in sys.stdin)
     for line, n in sorted(c.most_common(10)):
         print(n, line)

    For example, if you save this code in a file named топ-cтрок and make it executable with chmod +x топ-cтрок:

     $ ./топ-cтрок <example.txt >out.txt 

    Or on Windows (if .py is not listed in %PATHEXT%):

     C:\> py топ-cтрок <example.txt >out.txt 

    To sort an input file that does not fit in memory, you can use temporary files. To find duplicates in an already sorted file, you can use itertools.groupby().
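As a minimal sketch of the itertools.groupby() idea, assuming the lines have already been sorted (the sorting step itself, with or without temporary files, is not shown): the helper name count_sorted_lines and the sample data are hypothetical and only there for illustration.

```python
import heapq
from itertools import groupby


def count_sorted_lines(sorted_lines):
    """Yield (line, count) pairs from an iterable of already sorted lines."""
    # groupby() only merges adjacent equal items, which is why the
    # input has to be sorted first.
    for line, group in groupby(sorted_lines):
        yield line, sum(1 for _ in group)


# Hypothetical sample standing in for the lines of a sorted file.
sample = ["a", "a", "b", "c", "c", "c"]

counts = list(count_sorted_lines(sample))

# Keep only the 2 most frequent lines without sorting every pair.
top = heapq.nlargest(2, count_sorted_lines(sample), key=lambda pair: pair[1])
print(top)
```

Because groupby() and heapq.nlargest() both consume their input lazily, this approach should only need to hold one group and the current top entries in memory at a time.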

    The bytes in the file are decoded into Unicode text using sys.stdin.encoding; if the script instead reads from a file that it opens itself (open(filename)), the locale.getpreferredencoding(False) encoding is used (something like cp1251 on Windows and utf-8 on Mac and Linux). To use a different encoding, pass the encoding parameter to the open() call, or set the PYTHONIOENCODING environment variable to change the input/output encoding of the standard streams (stdin / stdout / stderr).
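A small sketch of passing the encoding explicitly to open(), assuming the input is known to be cp1251. The helper name count_lines_in_file and the temporary demo file are hypothetical; they only make the example self-contained.

```python
import os
import tempfile
from collections import Counter


def count_lines_in_file(filename, encoding):
    """Count stripped lines of a file decoded with an explicit encoding."""
    with open(filename, encoding=encoding) as file:
        return Counter(line.strip() for line in file)


# Hypothetical demo: write a small cp1251 file, then read it back
# with the same encoding instead of relying on the locale default.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w", encoding="cp1251") as file:
        file.write("привет\nмир\nпривет\n")
    counts = count_lines_in_file(path, encoding="cp1251")

print(counts.most_common(1))
```

Stating the encoding in the call makes the result independent of the machine's locale settings.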

    By default, strings are sorted lexicographically by Unicode code point (ord(c)), so the relative order of, for example, uppercase (А) and lowercase (а) letters may differ from the output of the sort utility, which uses the current locale (compare the results of sort and LC_ALL=C sort). To apply the collation rules of a particular language, you can use russian_strings.sort(key=icu.Collator.createInstance(icu.Locale('ru')).getSortKey). It is also possible to normalize strings (unicodedata.normalize()) so that the precomposed ё (u'\u0451') and its decomposed form (u'\u0435\u0308') are treated as the same string.
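To illustrate the normalization point, here is a short sketch using unicodedata.normalize() with the NFC form. Choosing NFC is an assumption (NFD would also make the two spellings compare equal), and the helper name normalize_line is hypothetical.

```python
import unicodedata
from collections import Counter


def normalize_line(line):
    """Return the line in NFC form so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", line)


precomposed = "\u0451"       # ё as a single code point
decomposed = "\u0435\u0308"  # е followed by a combining diaeresis

# Without normalization the two strings count as different lines.
raw_counts = Counter([precomposed, decomposed])

# After normalization both spellings collapse into one key.
normalized_counts = Counter(normalize_line(s) for s in (precomposed, decomposed))

print(len(raw_counts), len(normalized_counts))
```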


    If you have problems printing Unicode strings on Windows, install the win-unicode-console package; see: How can I output a Unicode string from Windows to the Windows console? On other systems, it is enough to configure the locale and/or set the LANG, LC_ALL, LC_CTYPE, or PYTHONIOENCODING environment variables (the last one if the output is redirected), for example:

     $ LC_ALL=C.UTF-8 ./топ-cтрок <example.txt 

      Build a dictionary in which each key is a line and each value is the number of times that line occurs. Read a line: if it is not yet in the dictionary, create a key with the value 1; if it is, increment the value. Then sort and print the key: value pairs.
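The steps described above could be sketched roughly as follows. The function name count_with_dict and the sample list are hypothetical; in practice the lines would come from the opened example.txt.

```python
def count_with_dict(lines):
    """Count occurrences of each line using a plain dictionary."""
    counts = {}
    for line in lines:
        key = line.rstrip("\n")
        if key in counts:
            counts[key] += 1  # seen before: increase the value
        else:
            counts[key] = 1   # first occurrence: create the key
    return counts


# Hypothetical sample standing in for the lines of example.txt.
sample = ["b\n", "a\n", "b\n", "c\n", "b\n", "a\n"]

line_counts = count_with_dict(sample)

# Sort by number of repetitions, most frequent first.
ranked = sorted(line_counts.items(), key=lambda item: item[1], reverse=True)
for line, n in ranked:
    print(f"{line}: {n}")
```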

      On *nix, you can:

       cat example.txt | sort | uniq -c | sort -nr > out.txt 

      If you only need the top 10, for example:

       cat example.txt | sort | uniq -c | sort -nr | head -n 10 > out.txt 
      • Indeed, that is much easier. Thanks. - Pathfinder
      • Please report how fast this command runs and how much RAM it uses on your million lines. - pincher1519