To read the first line and select two more random lines from a small file:
#!/usr/bin/env python3
import random

with open('input.txt') as file:
    lines = [next(file)] + random.sample(list(file), 2)
print(*map(str.strip, lines))
next(file) reads the first line from the file (in Python, file objects are iterators over their lines). random.sample() selects two items from the list without replacement. If no word is repeated in the input file, the result always contains unique words.
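As a quick illustration of how random.sample() picks without replacement, here is a minimal sketch with an in-memory list (the words are made up):

import random

lines = ['first\n', 'apple\n', 'banana\n', 'cherry\n']
picked = [lines[0]] + random.sample(lines[1:], 2)  # two distinct items, no replacement
print(*map(str.strip, picked))  # e.g.: first cherry apple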
If words can be repeated in a file, then set() can be used so that only unique words remain:
#!/usr/bin/env python3
import random

with open('input_with_dups.txt') as file:
    first_word = next(file).strip()
    words = set(map(str.strip, file)) - {first_word}  # unique words
# NOTE: random.sample() is used instead of taking the first items of the set,
# to avoid relying on PYTHONHASHSEED behavior (set iteration order)
print(first_word, *random.sample(list(words), 2))
In this case, the probability that a word is selected does not depend on how often it occurs in the file — all words (except the first) have the same weight.
str.strip() is used to remove whitespace (including the trailing newline) from the input lines so that each line is reduced to the word itself; otherwise 'word', 'word\n', and 'word ' would be treated as different words.
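For example (the list below is only for illustration):

raw = ['word', 'word\n', 'word ']
print(len(set(raw)))                   # 3 -- treated as three different words
print(len(set(map(str.strip, raw))))   # 1 -- the same word after stripping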
If the file is large but contains only distinct words, then you can use the reservoir_sample() function, which implements the linear-time Algorithm R:
#!/usr/bin/env python3
with open('input.txt') as file:
    lines = [next(file)] + reservoir_sample(file, 2)
print(*map(str.strip, lines))
This solution does not read the entire file into memory at once and therefore works even for large files. Here reservoir_sample() is:
import itertools
import random

def reservoir_sample(iterable, k, randrange=random.randrange, shuffle=random.shuffle):
    """Select *k* random elements from *iterable*.

    Uses O(n) Algorithm R https://en.wikipedia.org/wiki/Reservoir_sampling
    """
    it = iter(iterable)
    sample = list(itertools.islice(it, k))  # fill the reservoir
    if len(sample) < k:
        raise ValueError("Sample larger than population")
    shuffle(sample)
    for i, item in enumerate(it, start=k + 1):
        j = randrange(i)  # random [0..i)
        if j < k:
            sample[j] = item  # replace item with gradually decreasing probability
    return sample
The probability of choosing any given line in the file is constant and equal to k / n, where n is the number of lines in the file.
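This property can be sanity-checked empirically by running reservoir_sample() from above many times over a small in-memory list and counting how often each line is chosen (the list and counts below are illustrative):

import collections

population = ['line%d\n' % i for i in range(10)]  # stand-in for a 10-line file
counts = collections.Counter()
for _ in range(10_000):
    counts.update(reservoir_sample(population, 2))
print(counts)  # each line should be chosen roughly 10_000 * 2/10 = 2_000 times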
In the general case (the words may be repeated in the input file and the file may be large), the reservoir_sample() algorithm has to be modified so that only elements that have not already been selected are considered:
#!/usr/bin/env python3
import itertools
import random

def choose_uniq(iterable, k, chosen, randrange=random.randrange):
    j0 = len(chosen)
    it = (x for x in iterable if x not in chosen)
    for x in itertools.islice(it, k):  # NOTE: add one by one
        chosen.append(x)
    if len(chosen) < (j0 + k):
        raise ValueError("Sample larger than population")
    for i, item in enumerate(it, start=k + 1):
        j = randrange(i)  # random [0..i)
        if j < k:
            # replace item with gradually decreasing probability
            chosen[j0 + j] = item

with open('input_with_dups.txt') as file:
    chosen_words = [next(file).strip()]  # first word
    choose_uniq(map(str.strip, file), 2, chosen_words)
print(*chosen_words)
(x for x in iterable if x not in chosen) filters out the already selected elements. This works because the elements are generated "lazily", one at a time. Since k == 2 in this case, x not in chosen is a quick operation even on a list. For large k, you can maintain a set alongside the list to get O(1) membership tests, as in the sketch below.
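A possible sketch of such a variant, keeping an auxiliary set in sync with the chosen list (choose_uniq_fast and its details are illustrative, not part of the code above):

import itertools
import random

def choose_uniq_fast(iterable, k, chosen, randrange=random.randrange):
    """Like choose_uniq(), but membership tests use a set for O(1) lookups."""
    j0 = len(chosen)
    seen = set(chosen)  # auxiliary set mirroring *chosen*
    it = (x for x in iterable if x not in seen)
    for x in itertools.islice(it, k):  # fill the reservoir one by one
        chosen.append(x)
        seen.add(x)
    if len(chosen) < (j0 + k):
        raise ValueError("Sample larger than population")
    for i, item in enumerate(it, start=k + 1):
        j = randrange(i)  # random [0..i)
        if j < k:
            seen.discard(chosen[j0 + j])  # keep the set in sync on replacement
            seen.add(item)
            chosen[j0 + j] = item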
choose_uniq() does not behave like random.sample(), so the shuffle() call is removed. The resulting distribution is not quite uniform: depending on the order of the lines in the source file, a frequently repeated line can be chosen more often than if only unique words were considered (that is, the result differs from the set(map(str.strip, file)) - {first_word} solution above).
If a uniform distribution is required (all unique words are selected with the same probability), then for large files that do not fit in memory you can use external sorting, which makes it possible to eliminate duplicates without additional memory costs (in O(1) additional memory), for example with itertools.groupby(); this in turn allows reservoir_sample() to be used again without changes.
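A possible sketch of this approach, assuming the file has already been sorted externally (for example with the Unix sort utility into a hypothetical sorted_words.txt) and reusing reservoir_sample() from above:

import itertools

with open('input_with_dups.txt') as file:
    first_word = next(file).strip()

with open('sorted_words.txt') as file:  # e.g.: sort input_with_dups.txt > sorted_words.txt
    words = (word for word, _ in itertools.groupby(map(str.strip, file)))  # adjacent duplicates collapse
    unique_words = (w for w in words if w != first_word)  # exclude the first word
    print(first_word, *reservoir_sample(unique_words, 2))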
If a strictly uniform distribution is not required, then to avoid reading the entire (potentially large) file, you can pick words at random positions in it. For convenience, the mmap module can be used, which allows treating the file as a string (a sequence of bytes), even if the file is larger than the available memory:
#!/usr/bin/env python3
import locale
import mmap
import random
import re

with open('input_with_dups.txt', 'rb') as file, \
        mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
    first_nonspace_pos = re.search(br'\S', s).start()  # skip leading whitespace
    chosen = set([get_word(s, first_nonspace_pos), b''])  # get the 1st word
    while len(chosen) != 4:  # add two more random non-empty words
        chosen.add(get_word(s, random.randrange(len(s))))

encoding = locale.getpreferredencoding(False)
print(*[w.decode(encoding) for w in chosen if w])
where get_word() returns a word from the line near the specified position in the file:
def get_word(s, position, newline=b'\n'):
    """Return a word from the line in *s* around *position*."""
    i = s.rfind(newline, 0, position)  # find the newline on the left
    start = (i + 1) if i != -1 else 0
    i = s.find(newline, position)  # find the newline on the right
    end = i if i != -1 else len(s)
    return s[start:end].strip()  # whitespace is not part of a word, strip it
The file may contain empty lines (or lines containing only spaces); the code with first_nonspace_pos and b'' makes it possible to avoid selecting an empty word. The code assumes that there are more than two different words in the input file, otherwise an infinite loop is possible. Unicode whitespace (such as U+00A0) is not taken into account.
The probability of choosing a word in this case may depend on the lengths of the words, the frequency of their repetition in the file, and even on the encoding used (that is, the distribution is not uniform).