There is a folder containing several thousand files of the same type but with different names and sizes. When the program runs, the user enters a size in MB (for example, 10 MB). The program then copies randomly chosen files from the source folder to the destination folder until the size of the destination folder becomes greater than or equal to the entered value (for example, 10 MB).

  • Well, actually, the question already describes everything correctly, exactly in the order you would write the code, and it translates into code very easily. What problems exactly made you ask this question? - andreymal
  • (although there is one nuance, but it is insignificant and will be discussed later) - andreymal
  • @andreymal what exactly is the nuance? - MichaelPak
  • @MichaelPak whether to measure the directory size as the sum of the file sizes in bytes, or as the clusters they occupy on the file system. If you copy 10 megabytes of, say, 512-byte files, they will actually occupy about 80 megabytes when clusters are 4 kilobytes each .) - andreymal
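The cluster effect andreymal describes can be sketched like this (a minimal illustration assuming a 4 KiB cluster; the actual cluster size depends on the file system):

```python
import math

def on_disk_size(file_size, cluster_size=4096):
    """Bytes a file actually occupies when storage is allocated
    in whole clusters: size rounded up to a cluster multiple."""
    return math.ceil(file_size / cluster_size) * cluster_size

# 10 MiB worth of 512-byte files...
n_files = (10 * 1024 * 1024) // 512   # 20480 files
actual = n_files * on_disk_size(512)  # each one occupies a full 4 KiB cluster
print(actual // (1024 * 1024))        # -> 80 (MiB actually occupied on disk)
```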

2 answers

Something like this:

```python
import os, random, shutil

old_dir = "old/"  # the old (source) directory
new_dir = "new/"  # the new (destination) directory
size = 10 * 1024 * 1024  # getsize() returns bytes

file_list = os.listdir(old_dir)  # list of files from the old directory
while sum(os.path.getsize(f) for f in os.listdir(new_dir)) < size:
    file = file_list.pop(random.randint(0, len(file_list)))
    shutil.copy(old_dir + file, new_dir + file)
```
  • 1- This will not work: os.listdir() returns bare file names relative to the directory, so os.path.getsize(f) looks in the current working directory. 2- L.pop(randint(0, len(L))) will not work, because len(L) is an invalid index; randrange() can be used instead, or it is simpler to call shuffle(L) once. 3- This is an O(n*k) (quadratic) algorithm (n is the number of files in the old directory, k is the number of files in the new directory). It can be improved to an O(n) linear algorithm - jfs
  • @jfs, firstly, you can simply run the script from the root directory. And secondly: yes, your version is much more efficient, +1 ;) - MichaelPak
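Incorporating jfs's three points, the first answer could be sketched as follows (a hypothetical corrected version, not the original author's code: full paths via os.path.join, and a single shuffle instead of repeated randint):

```python
import os
import random
import shutil

def copy_random_files(old_dir, new_dir, limit):
    """Copy randomly chosen files from old_dir to new_dir until
    the total size of the copied files reaches limit bytes."""
    names = os.listdir(old_dir)
    random.shuffle(names)  # one shuffle up front: O(n), no index errors
    copied = 0
    for name in names:
        src = os.path.join(old_dir, name)  # full path, not a bare name
        shutil.copy(src, os.path.join(new_dir, name))
        copied += os.path.getsize(src)
        if copied >= limit:
            break
    return copied
```

The running total also avoids re-scanning the destination directory on every iteration, which is what made the original loop quadratic.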

To copy randomly selected *.type files from the source folder to the destination folder, so that the total size of the copied files does not exceed the limit in bytes given on the command line (limit = 10 MiB by default):

```python
#!/usr/bin/env python3
"""Usage: copy-random-files [<size-limit-MiB>]"""
import random
import shutil
import sys
from pathlib import Path

MiB = 1024 * 1024
limit = (int(sys.argv[1]) if len(sys.argv) > 1 else 10) * MiB

size = 0
paths = list(Path("source").glob("*.type"))
random.shuffle(paths)
for path in paths:
    size += path.stat().st_size
    if size > limit:
        break
    shutil.copy(str(path), str(Path("destination") / path.name))
```

This algorithm is O(n) linear in memory and time, where n is the total number of *.type files in the source folder.

  • "set() guarantees a random order every time the program is started" - I'm not sure it is a good idea to rely so brazenly on a specific implementation - PyPy3 produces the same ordering on every run (although I did not find pathlib in it, but that is another story) - andreymal
  • @andreymal: PyPy supports pure Python 2 code well; its Python 3 support, which my answer obviously requires, is poor. That is a PyPy problem (my code does not depend on CPython implementation details). You could easily write an answer that works on both Python 2 and 3 (for example, using random.shuffle()). The purpose of the code here is to be executable pseudo-code (it is enough that it reads well and works on standard Python). - jfs
  • For "executable pseudo-code", that is all the worse. - andreymal
  • 1- "that is all the worse" - explain what you mean, point by point. 2- "Like the use of <<" - that is exactly why 10 MiB is written next to it, for beginners (experienced developers read 1 << 20 just fine). - jfs
  • I mean that if you are writing pseudo-code, rather than a production-ready super-version, then you need to write it as clearly and simply as possible - andreymal