The site is a bulletin board; all images are stored in one folder, around 40 thousand of them. Images are served directly by URL (src="/imgs/1253573.jpg"). Is there any point in spreading them across subdirectories (for example, by site subsection)? Would there be a noticeable gain in resources or download speed? Are there any recommendations on the number of files in a directory beyond which speed drops or load increases?
- Good question. I think that as long as you don't produce directory listings (your mechanism that serves the pictures doesn't do that, right?), there should be no particular problems. - approximatenumber
- Why not test it? Then you'll know for sure :) - Nick Volynkin ♦
5 answers
Introduction
It's definitely worth splitting once the number of files exceeds 5-7 thousand. With 10k files in one folder the file system usually starts to struggle: opening a file takes longer and the load on the disk grows. With 10k files in a folder, even deleting everything becomes expensive: a plain rm * simply does not work, because the shell tries to expand * into a list of names, building that list takes a long time, and then it turns out to be too long for the command to be executed (true, this was partially fixed around 2011-2012, but it is still painful). I have had experience with setups like this and made several attempts to redo them "properly".
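As an aside (not from the original answer), here is a sketch of how such a directory can be emptied without hitting the shell's argument-list limit: iterating over entries with Python's os.scandir streams them one at a time instead of expanding a glob. The path in the usage comment is hypothetical.

```python
import os

def purge_directory(path):
    """Delete every regular file in `path`, one entry at a time.

    os.scandir yields entries lazily, so nothing like the huge argument
    list produced by the shell's `rm *` is ever built in memory.
    """
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                os.unlink(entry.path)

# purge_directory("/srv/imgs")  # hypothetical path
```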
Theory
Keep in mind that different file systems react differently. For example, NTFS tries to update the last access time when a folder is opened; with many files that takes a long time.
How many files can a directory theoretically hold?
- NTFS: the theoretical limit is 2^32. In practice, about 100k, though people report that everything becomes very sluggish. MSDN also says that GetTempFileName will not work if there are more than 65535 files in the directory for temporary files.
- ext3: it is complicated; the maximum number of files has to be checked for the particular configuration.
- ext4: the same as NTFS, about 4 billion.
You can also read another question on SO: "Maximum number of files in a folder on Linux and Windows".
Method one
Splitting by date is a poor approach: it is not uniform. There will be folders with lots of files and folders with almost none, and folders themselves are not free in most file systems. The second drawback is the difficulty of sharding: once the total size of the files starts to exceed the size of one file system, problems begin. Another drawback is duplicates, which are very hard to track. The only real plus of this method is that old files are easy to find.
Method two
But there is a better way, one that has been tested and is used by many. Before a file is placed in the "storage", its md5/sha1 hash is computed, and a "storage path" is formed from that hash. I used the following scheme: the first two characters of the hash name the top-level folder, inside it a folder named after the next two characters is created, and the file itself is stored inside that folder. Example: suppose there is a file "test.jpg" whose contents hash (md5) to "b3e6b7290309113a2d2b392bf1e2084e", and the storage root is /srv/archiv. Then the file will be stored like this:
/srv/archiv/b3/e6/b7290309113a2d2b392bf1e2084e_test.jpg
I kept the original file name as well, since sometimes it is needed. But you can store the file names in the database instead and leave them out of the path.
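A minimal sketch of this scheme in Python (not from the original answer; the storage root /srv/archiv comes from the example above, while the function names and the 1 MB read chunk size are my own assumptions):

```python
import hashlib
import os
import shutil

STORAGE_ROOT = "/srv/archiv"  # storage root from the example above

def storage_path(src_path):
    """Build the sharded path: <root>/<hash[0:2]>/<hash[2:4]>/<hash[4:]>_<original name>."""
    h = hashlib.md5()
    with open(src_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MB chunks
            h.update(chunk)
    digest = h.hexdigest()
    name = f"{digest[4:]}_{os.path.basename(src_path)}"
    return os.path.join(STORAGE_ROOT, digest[:2], digest[2:4], name)

def store(src_path):
    """Copy a file into the hash-sharded storage and return its new path."""
    dst = storage_path(src_path)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(src_path, dst)
    return dst

# store("test.jpg") -> "/srv/archiv/b3/e6/b7290309113a2d2b392bf1e2084e_test.jpg"
```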
Pluses of this method:
- even distribution of files across directories (guaranteed by the hash function).
- files are easy to spread across servers: for example, the first two characters can determine the server name in a cluster (and those remote storages are simply mounted in).
- duplicates are easily detected - they will have the same hash.
- if we assume that the file system comfortably holds 1000 files per folder, the whole scheme can hold about 65 million files (256 * 256 * 1000). Even with minimum-size files (4 KB) that is at least about 250 gigabytes (plus file system overhead, usually 10-20 percent of the size).
- using a database together with this method greatly expands the possibilities.
Minuses:
- Hashing is not free.
- theoretically, two different files may end up with the same hash.
- It’s hard to remove old files, but find can help.
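For reference (not part of the original answer), the kind of cleanup that find does can be sketched in Python by walking the tree and checking modification times; the age threshold in the usage comment is arbitrary:

```python
import os
import time

def remove_older_than(root, days):
    """Delete files under `root` whose mtime is older than `days` days,
    roughly what `find root -type f -mtime +days -delete` would do."""
    cutoff = time.time() - days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)

# remove_older_than("/srv/archiv", 365)  # example threshold: one year
```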
Turnkey solutions
There is an example implementation in PHP: habr.
- The method is good, of course (yes, like git), but for a dump of files that no one is ever going to delete it is overly complicated. Also, computing a hash from a file in order to save it works, but fetching a file by its numeric id from the database no longer does; the database would need to be reworked. - Qwertiy ♦
I believe it should be split into directories, but not by site sections; by dates instead, and definitely year-first: 2016-11-21 (or by month, as Sergey suggests). Using nginx to serve the static content (those same pictures) is a must: they will be delivered to the user much faster.
It all depends on the file system and its settings. For example, in ext4 you can run out of inodes; see https://toster.ru/q/122827 or https://www.linux.org.ru/forum/general/4496222
It makes more sense to do it in this form: /2014/02/*.* . That way it is more structured.
In fact, in my opinion, it makes no difference. In practice there are directories with more than 400,000 files, and nothing bad happens. Of course, if you log in via FTP and try to browse the files, the response will take a long time...
If the file names are consecutive numbers, I suggest simply putting 1000 files per folder, or choosing a suitable value based on performance measurements.
For content that is only added, archive-style, and never deleted, this approach is ideal: an equal number of files per folder, and it is trivial both to work out where to put a file and to fetch it by its id from the database.
If files do get deleted, you can try swapping the quotient and the remainder of the division: for example, store file 12345 as 345/12 instead of 12/345. By probability theory the folders will end up roughly equal in size regardless of deletions (after all, you can hardly claim that files with even ids are deleted more often than files with odd ones, right?).
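A minimal sketch of this id-based layout (not from the original answer; the 1000-files-per-folder constant comes from the suggestion above, the function name is mine):

```python
def path_for_id(file_id, per_folder=1000, swap=True):
    """Map a numeric id to a relative folder/name pair.

    swap=False: 12345 -> "12/345"  (folder = quotient, name = remainder)
    swap=True:  12345 -> "345/12"  (remainder becomes the folder, which keeps
                                    folders evenly filled even after deletions)
    """
    quotient, remainder = divmod(file_id, per_folder)
    folder, name = (remainder, quotient) if swap else (quotient, remainder)
    return f"{folder}/{name}"

# path_for_id(12345)              -> "345/12"
# path_for_id(12345, swap=False)  -> "12/345"
```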