We have a text file of the form:

foo
two word
bar
#
cat
tea - five o'clock
666

These are not necessarily words: a line can contain any characters, and one line can hold several words separated by spaces. The text file can be large or even huge, up to hundreds of gigabytes or a couple of terabytes.

What I want: output random combinations of the lines of such a list, with the number of lines per combination set to a required length (see the example below).

You can take this script as a basis. It does everything that is needed, but without randomization: it outputs the combinations in a fixed order, going through all the options starting from the first line, like a brute-force generator. It also has a built-in option for choosing the number of lines per combination, given as a minimum and a maximum. Run it like this: python3 script.py -f spisokslov.txt -min 2 -max 3 and it will produce the output:

 foo two word
 foo bar
 foo #
 foo cat
 foo tea - five o'clock
 foo 666
 two word foo
 two word bar
 two word #
 * (some lines removed here to shorten the example; the last line is:)
 666 tea - five o'clock cat

The script inserts one space between the lines it joins, but puts no space after the last one in the output line. This is optimal: if you wish, you can then strip the spaces in a pipe. In general, I need exactly the same thing, only with random output added, so that the output is not sequential but chaotic, with the same option for choosing the number of lines: -min 2 -max 3

 tea - five o'clock two word # foo foo 666 # cat foo two word bar tea - five o'clock 

The script produces only combinations without duplicates (unless, of course, the duplicates are in the text file itself). It is desirable to keep this property, but if that is difficult with randomization, it can be dropped.

As an alternative, you can take as a basis the principle of the combinator.bin and combinator3.bin utilities from the hashcat-utils set in Kali Linux. They also iterate sequentially, combining the lines of 2 or at most 3 files: ./combinator3.bin spisokslov1.txt spisokslov2.txt spisokslov3.txt (here: combinator3.c). Maybe this would be easier: we create several separate text files, one per position in the combination, then randomly select a line from the first list, then randomly from the second, and so on. However, for large lists these files then take up a lot of space ...

If the minimum and maximum number of lines complicates things, it can be dropped; one fixed length is then sufficient. The script must have its own loop for outputting lines, either running essentially forever or stopping at a specified maximum number of output lines.
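For a list small enough to fit in memory, the requested behaviour could be sketched roughly as follows. This is only an illustration, not the script mentioned above: the function name random_line_combinations and its parameters are made up here, the whole list is assumed to be loaded into memory (which does not hold for terabyte-sized files), and nothing prevents the same combination from being output twice.

```python
import itertools
import random


def random_line_combinations(lines, min_len, max_len, limit=None, rng=random):
    """Yield random space-joined combinations of distinct lines.

    Hypothetical helper: `lines` is an in-memory list, `limit` is the
    maximum number of output lines (None means run forever).
    """
    counter = itertools.count() if limit is None else range(limit)
    for _ in counter:
        # Pick the combination length, then that many distinct lines
        # in random order (no line repeats inside one combination).
        length = rng.randint(min_len, max_len)
        yield " ".join(rng.sample(lines, length))


if __name__ == "__main__":
    words = ["foo", "two word", "bar", "#", "cat", "tea - five o'clock", "666"]
    for combo in random_line_combinations(words, 2, 3, limit=5):
        print(combo)
```

Choosing the length uniformly between the minimum and maximum is a simplification: since there are more long combinations than short ones, this is not a uniform draw over all possible combinations.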

In any case, I will be glad of any random option, if someone can help. Thanks.

  • What exactly are your difficulties? - user218976
  • My beginner-level skills are not enough to rework it - TWOfish

1 answer

There is a long sequence (A,B,C,D...)

You need to create random combinations of length from Min to Max.

I assume that access to the sequence is sequential rather than random, so picking from a random position is difficult (to select the line with a given number from a very long file, you have to re-read the entire file up to that position or store indexes).
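As an illustration of the "store indexes" option, here is a minimal sketch that scans the file once to record where each line starts and then jumps straight to any line. The names build_line_index and read_line_at are made up for this example, and UTF-8 encoding is an assumption. Note that for a file with billions of lines the index itself becomes large and would probably have to live on disk rather than in a Python list.

```python
def build_line_index(path):
    """Scan the file once and return the byte offset of every line start."""
    offsets = []
    position = 0
    with open(path, "rb") as handle:
        for raw_line in handle:
            offsets.append(position)
            position += len(raw_line)
    return offsets


def read_line_at(path, offsets, line_number):
    """Return line `line_number` (0-based) without reading the lines before it."""
    with open(path, "rb") as handle:
        handle.seek(offsets[line_number])
        # UTF-8 is an assumption about the file's encoding.
        return handle.readline().decode("utf-8").rstrip("\r\n")
```

With such an index, a random line is read_line_at(path, offsets, random.randrange(len(offsets))).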

Create a list of lists of lists. The size of the first dimension is Max + 1. The top-level entry with index k stores the lists whose innermost lists have length k (let's call them lists of rank k).

First we have a single empty list.
We go through the sequence.
Select the next element with probability q (or skip it with probability 1 - q). (If we want to create all possible combinations, a myriad of them, then q = 1.)

Now we insert the selected element into each position of each available list of rank lower than Max (including the empty one), each insertion being made with probability p. When an insertion is made, the unmodified copy of the list is also kept, and the new list goes to the lists of the next higher rank.

For example, one of the available lists [C, A, B]. The element X can be inserted into 4 places: [X, C, A, B], [C, X, A, B], [C, A, X, B], [C, A, B, X].

If, in addition to insertion, you also allow replacement ([C, A, B] => [C, X, B]), the distribution will be somewhat different, but duplicates become possible.

An example of generating all combinations of length 0 to 3 (i.e., every choice is made with probability 1). Here I did not split the lists by top level, only sorted them by rank:

 []
 [] [A]
 [] [B] [A] [A,B] [B,A]
 [] [C] [B] [A] [C,B] [B,C] [C,A] [A,C] [A,B] [B,A] [C,A,B] [A,C,B] [A,B,C] [C,B,A] [B,C,A] [B,A,C]

The third stage as a list of lists of lists:

 [[[]], [[C] [B] [A]], [[C,B] [B,C] [C,A] [A,C] [A,B] [B,A]], [[C,A,B] [A,C,B] [A,B,C] [C,B,A] [B,C,A] [B,A,C]]] 
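The insertion procedure described above can be sketched in Python roughly as follows. This is one possible reading of the description, not a reference implementation: the function name insertion_combinations is made up, the probability p is applied independently to every insertion position, and all lists are kept in memory, so it is only practical for short inputs or small q and p.

```python
import random


def insertion_combinations(items, max_len, q=1.0, p=1.0, rng=random):
    """Build lists of rank 0..max_len by inserting elements one at a time.

    by_rank[k] holds the lists of length k built so far.
    """
    by_rank = [[] for _ in range(max_len + 1)]
    by_rank[0].append([])  # start with a single empty list
    for item in items:
        if rng.random() >= q:
            continue  # skip this element with probability 1 - q
        # Go from high rank to low so that lists created with this
        # element are not extended by the same element again.
        for rank in range(max_len - 1, -1, -1):
            for existing in list(by_rank[rank]):
                for pos in range(rank + 1):
                    if rng.random() < p:
                        # The original list stays where it is; the
                        # extended copy moves up one rank.
                        extended = existing[:pos] + [item] + existing[pos:]
                        by_rank[rank + 1].append(extended)
    return by_rank


if __name__ == "__main__":
    for rank, lists in enumerate(insertion_combinations("ABC", 3)):
        print(rank, lists)
```

With q = p = 1 and the input A, B, C this yields the same sets of lists per rank as the third stage shown above, though possibly in a different order within each rank.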

Edit
If the assumption of sequential access is incorrect, i.e. random access is available, then it is enough to calculate the number N of combinations of the desired length, generate a random number R in the range of N, and output the R-th combination.
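A minimal sketch of this idea, assuming that a "combination" is an ordered selection of k distinct lines (as in the example output of the question), so that N = n! / (n - k)!. The function names are made up for the example, math.perm requires Python 3.8 or later, and the helper keeps a list of all n indices in memory, so it is meant for moderate n only.

```python
import math
import random


def unrank_arrangement(n, k, rank):
    """Return the rank-th (0-based, lexicographic) ordered selection
    of k distinct indices from range(n)."""
    remaining = list(range(n))
    result = []
    for i in range(k):
        # Number of arrangements that share the same choice at position i.
        block = math.perm(n - i - 1, k - i - 1)
        index, rank = divmod(rank, block)
        result.append(remaining.pop(index))
    return result


def random_arrangement(n, k, rng=random):
    """Pick R uniformly in [0, N) and return the R-th arrangement."""
    total = math.perm(n, k)  # N = n! / (n - k)!
    return unrank_arrangement(n, k, rng.randrange(total))


if __name__ == "__main__":
    lines = ["foo", "two word", "bar", "#", "cat", "tea - five o'clock", "666"]
    picked = random_arrangement(len(lines), 3)
    print(" ".join(lines[i] for i in picked))
```

For a single random draw, random.sample(range(n), k) gives an equivalent uniformly distributed result without materializing the index list; explicit ranks become useful when you want to track which R values have already been output in order to avoid duplicates.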

  • For now, a question that is not directly about your answer. I gave a link to combinator3.c on GitHub. It appears to be written in C. I don't know whether this is feasible: could its code simply be modified so that it works not only with three lists, but with any required number? In some cases that functionality is needed, and in others the one described in this question. - TWOfish 1:16 pm
  • Apparently it is possible (recursion is probably the easiest way), but the code is not commented. In Python it would be easier to write using itertools.product - MBo
  • Found a similar question. Stackoverflow.com/questions/480219/… - TWOfish
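
The itertools.product suggestion from the comments could look roughly like this for an arbitrary number of word lists. This is a sketch, not a port of combinator3.c: the function names and the separator parameter are made up for the example, and itertools.product stores every input list in memory, so it suits lists of moderate size only.

```python
import itertools


def read_lines(path):
    """Load a word list, one entry per line (UTF-8 is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\r\n") for line in handle]


def combine_lists(word_lists, separator=" "):
    """Yield every combination of one entry from each list, in order:
    the last list changes fastest, like nested loops."""
    for parts in itertools.product(*word_lists):
        yield separator.join(parts)


if __name__ == "__main__":
    lists = [["foo", "bar"], ["#"], ["cat", "666"]]
    for line in combine_lists(lists):
        print(line)
```

Like the utilities discussed above, this iterates sequentially rather than randomly.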