To read the first line and select two more random lines from a small file:
#!/usr/bin/env python3
import random

with open('input.txt') as file:
    lines = [next(file)] + random.sample(list(file), 2)
print(*map(str.strip, lines))
next(file) reads the first line from the file (in Python, file objects are iterators over their lines). random.sample() selects two items from the list without replacement. If no word is repeated in the input file, the result always contains unique words.
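As a quick illustration of how random.sample() picks without replacement, here is a minimal sketch with an in-memory list (the words are made up):

import random

lines = ['first\n', 'apple\n', 'banana\n', 'cherry\n']
picked = [lines[0]] + random.sample(lines[1:], 2)  # two distinct items, no replacement
print(*map(str.strip, picked))  # e.g.: first cherry apple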
If words can be repeated in a file, then set() can be used so that only unique words remain:
#!/usr/bin/env python3
import random

with open('input_with_dups.txt') as file:
    first_word = next(file).strip()
    words = set(map(str.strip, file)) - {first_word}  # unique words
# NOTE: random.sample() is used instead of taking the first items of the set,
# to avoid relying on PYTHONHASHSEED behavior (set iteration order)
print(first_word, *random.sample(list(words), 2))
In this case, the probability that a word is selected does not depend on how often it occurs in the file — all words (except the first) have the same weight.
str.strip() is used to remove whitespace (including the trailing newline) from the input lines so that each line is reduced to the word itself; otherwise 'word', 'word\n', and 'word ' would be treated as different words.
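For example (the list below is only for illustration):

raw = ['word', 'word\n', 'word ']
print(len(set(raw)))                   # 3 -- treated as three different words
print(len(set(map(str.strip, raw))))   # 1 -- the same word after stripping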
If the file is large but contains only distinct words, then you can use the reservoir_sample() function, which implements the linear-time Algorithm R:
#!/usr/bin/env python3
with open('input.txt') as file:
    lines = [next(file)] + reservoir_sample(file, 2)
print(*map(str.strip, lines))
This solution does not read the entire file into memory at once and therefore works even for large files. Here reservoir_sample() is:
import itertools
import random

def reservoir_sample(iterable, k, randrange=random.randrange, shuffle=random.shuffle):
    """Select *k* random elements from *iterable*.

    Uses O(n) Algorithm R https://en.wikipedia.org/wiki/Reservoir_sampling
    """
    it = iter(iterable)
    sample = list(itertools.islice(it, k))  # fill the reservoir
    if len(sample) < k:
        raise ValueError("Sample larger than population")
    shuffle(sample)
    for i, item in enumerate(it, start=k + 1):
        j = randrange(i)  # random [0..i)
        if j < k:
            sample[j] = item  # replace item with gradually decreasing probability
    return sample
The probability of choosing any given line in the file is constant and equal to k / n, where n is the number of lines in the file.
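This property can be sanity-checked empirically by running reservoir_sample() from above many times over a small in-memory list and counting how often each line is chosen (the list and counts below are illustrative):

import collections

population = ['line%d\n' % i for i in range(10)]  # stand-in for a 10-line file
counts = collections.Counter()
for _ in range(10_000):
    counts.update(reservoir_sample(population, 2))
print(counts)  # each line should be chosen roughly 10_000 * 2/10 = 2_000 times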
In the general case (the words may be repeated in the input file and the file may be large), the reservoir_sample() algorithm has to be modified so that only elements that have not already been selected are considered:
#!/usr/bin/env python3
import itertools
import random

def choose_uniq(iterable, k, chosen, randrange=random.randrange):
    j0 = len(chosen)
    it = (x for x in iterable if x not in chosen)
    for x in itertools.islice(it, k):  # NOTE: add one by one
        chosen.append(x)
    if len(chosen) < (j0 + k):
        raise ValueError("Sample larger than population")
    for i, item in enumerate(it, start=k + 1):
        j = randrange(i)  # random [0..i)
        if j < k:
            # replace item with gradually decreasing probability
            chosen[j0 + j] = item

with open('input_with_dups.txt') as file:
    chosen_words = [next(file).strip()]  # first word
    choose_uniq(map(str.strip, file), 2, chosen_words)
print(*chosen_words)
(x for x in iterable if x not in chosen) filters out the already selected elements. This works because the elements are generated "lazily", one at a time. Since k == 2 in this case, x not in chosen is a quick operation even on a list. For large k, you can maintain a set alongside the list to get O(1) membership tests, as in the sketch below.
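A possible sketch of such a variant, keeping an auxiliary set in sync with the chosen list (choose_uniq_fast and its details are illustrative, not part of the code above):

import itertools
import random

def choose_uniq_fast(iterable, k, chosen, randrange=random.randrange):
    """Like choose_uniq(), but membership tests use a set for O(1) lookups."""
    j0 = len(chosen)
    seen = set(chosen)  # auxiliary set mirroring *chosen*
    it = (x for x in iterable if x not in seen)
    for x in itertools.islice(it, k):  # fill the reservoir one by one
        chosen.append(x)
        seen.add(x)
    if len(chosen) < (j0 + k):
        raise ValueError("Sample larger than population")
    for i, item in enumerate(it, start=k + 1):
        j = randrange(i)  # random [0..i)
        if j < k:
            seen.discard(chosen[j0 + j])  # keep the set in sync on replacement
            seen.add(item)
            chosen[j0 + j] = item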
choose_uniq() does not behave like random.sample(), so the shuffle() call is removed. The resulting distribution is not quite uniform: depending on the order of the lines in the source file, a frequently repeated line can be chosen more often than if only unique words were considered (that is, the result differs from the set(map(str.strip, file)) - {first_word} solution above).
If a uniform distribution is required (all unique words are selected with the same probability), then for large files that do not fit in memory you can use external sorting, which makes it possible to eliminate duplicates without additional memory costs (in O(1) additional memory), for example with itertools.groupby(); this in turn allows reservoir_sample() to be used again without changes.
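A possible sketch of this approach, assuming the file has already been sorted externally (for example with the Unix sort utility into a hypothetical sorted_words.txt) and reusing reservoir_sample() from above:

import itertools

with open('input_with_dups.txt') as file:
    first_word = next(file).strip()

with open('sorted_words.txt') as file:  # e.g.: sort input_with_dups.txt > sorted_words.txt
    words = (word for word, _ in itertools.groupby(map(str.strip, file)))  # adjacent duplicates collapse
    unique_words = (w for w in words if w != first_word)  # exclude the first word
    print(first_word, *reservoir_sample(unique_words, 2))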
If a strictly uniform distribution is not required, then to avoid reading the entire (potentially large) file, you can pick words at random positions in it. For convenience, the mmap module can be used, which allows treating the file as a string (a sequence of bytes), even if the file is larger than the available memory:
#!/usr/bin/env python3
import locale
import mmap
import random
import re

with open('input_with_dups.txt', 'rb') as file, \
        mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
    first_nonspace_pos = re.search(br'\S', s).start()  # skip leading whitespace
    chosen = set([get_word(s, first_nonspace_pos), b''])  # get the 1st word
    while len(chosen) != 4:  # add two more random non-empty words
        chosen.add(get_word(s, random.randrange(len(s))))

encoding = locale.getpreferredencoding(False)
print(*[w.decode(encoding) for w in chosen if w])
where get_word() returns a word from the line near the specified position in the file:
def get_word(s, position, newline=b'\n'):
    """Return a word from the line in *s* around *position*."""
    i = s.rfind(newline, 0, position)  # find the newline on the left
    start = (i + 1) if i != -1 else 0
    i = s.find(newline, position)  # find the newline on the right
    end = i if i != -1 else len(s)
    return s[start:end].strip()  # whitespace is not part of a word, strip it
The file may contain empty lines (or lines containing only spaces); the code with first_nonspace_pos and b'' makes it possible to avoid selecting an empty word. The code assumes that there are more than two different words in the input file, otherwise an infinite loop is possible. Unicode whitespace (such as U+00A0) is not taken into account.
The probability of choosing a word in this case may depend on the lengths of the words, the frequency of their repetition in the file, and even on the encoding used (that is, the distribution is not uniform).