I would like to ask some questions about building directories and storing files in them (pictures, for example).

Before asking, I read quite a few articles on the subject, and I still have some questions.

It is no secret that in large projects it is inadvisable to put all user-uploaded files into one folder, so you need to spread them across directories. But how do you do that correctly?

Many advise the following approach, which I will describe briefly: generate the file name using an MD5 hash. Then take the first 2 characters of the name, create a folder with that name, and put into it all the files whose names start with the same characters. This way we get 256 folders, each of which can hold, for example, 1000 files. Of course, you can add nested levels using characters 3 and 4, then 5 and 6, and so on.
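The scheme described above could be sketched roughly as follows. This is only an illustration: the function name `shard_path` and its parameters are made up here, and hashing the original file name is an assumption, since the description does not say what exactly is fed to MD5 (in practice it would need to be something unique per file).

```python
import hashlib
from pathlib import PurePosixPath


def shard_path(original_name: str, levels: int = 1, width: int = 2) -> PurePosixPath:
    """Build a relative path like 'ab/ab12...ef.jpg' from an MD5-based name."""
    # Assumption: the hash input is the original file name.
    digest = hashlib.md5(original_name.encode("utf-8")).hexdigest()
    suffix = PurePosixPath(original_name).suffix  # keep the extension, e.g. '.jpg'
    # One directory per level: characters 0-1, then 2-3, then 4-5, ...
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(*parts, digest + suffix)


print(shard_path("photo.jpg"))            # one level: 256 possible folders
print(shard_path("photo.jpg", levels=2))  # two levels: 256 * 256 possible folders
```

With `width=2` each level has at most 256 subfolders, because two hexadecimal characters give 16 * 16 combinations.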

But I cannot understand what happens if one of the folders fills up much earlier than the others. What should be done then?

My own idea (I doubt I am the first to think of it): create a folder named after the current year, inside it a folder for the current month, and inside that a folder for the day of the month. In the day folder, create a folder named 1 and fill it until it holds 1000 files, then create a folder named 2 and fill that one, and so on in the same way.

It turns out that if a day folder holds up to a thousand folders and each of those holds up to a thousand files, we can store up to 1,000,000 files per day, or roughly 30,000,000 files in one month folder. When the month is over we move on to the next one; when the year is over, to the next year. The file system in this case never has to sift through tons of files: the maximum number of entries it has to deal with in any one folder is 1000.

For clarity: 2015/04/15/1

We fill the folder named 1 with up to 1000 files, then create a folder named 2, then 3, and so on up to 1000.

When the day is over, we create a folder 16 inside folder 04 and carry on by the same principle. Of course, folders are created only when files are actually uploaded into them.
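As a rough sketch of the date-based scheme: the function below picks the target folder from the date and the number of files already stored that day. The name `dated_folder` and the idea of passing the day's running count as an argument are assumptions made for illustration; where that counter lives (a database, a counter file) is not specified in the description.

```python
from datetime import date

FILES_PER_FOLDER = 1000  # limit taken from the description above


def dated_folder(day: date, files_stored_today: int) -> str:
    """Return the folder for the next upload, e.g. '2015/04/15/1'."""
    # Numbered folders start at 1; a new one begins every 1000 files.
    bucket = files_stored_today // FILES_PER_FOLDER + 1
    return f"{day:%Y/%m/%d}/{bucket}"


print(dated_folder(date(2015, 4, 15), 0))     # first upload of the day
print(dated_folder(date(2015, 4, 15), 1000))  # the 1001st upload of the day
```

Note that the path depends on when the file was uploaded and on the counter value, so it has to be recorded somewhere to find the file again later.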

What are your thoughts on this? What file storage structure do you use, and what makes it good? Is the structure I described acceptable from your point of view? What do you think about its performance?

  • In general, folders were invented for people, so that they could group and structure files. The file system could not care less how files are arranged there. They are simply spread over the entire disk space. These days, access speed (read, write) is a characteristic of the storage device. - 0x5a4d
  • Well, that's what I mean. What is the best way to store files so that reading, writing, and deleting are easiest for the file system? - Hit-or-miss
  • I repeat once again: the file system does not care. A folder is an abstract concept. If you want speed, then pay for SSD hosting. - 0x5a4d
  • Then why is reading from a directory with 1000 files much faster than reading from a directory with 50,000 files? - Hit-or-miss
  • Your directory structure (year / month / day / ...) will also need an index file (or rather a structure of several such files) to find a file by name. Otherwise, to confirm that a requested file is not in the system, you would have to browse the contents of the entire directory tree. / Regarding the last question (in the comments): the time to search for a file in a directory does not depend on the file size. - avp

3 answers

This answer is quite general and is not tied to PHP.

If you need to find files by name, I would still try to build the structure on MD5 (or another hash function) in hexadecimal form (for MD5, the whole name is 32 characters from 0 to f), though not with a rigid set of directory levels but with a dynamic one.

Directory names are formed from, for example, triples of characters. Accordingly, a directory can contain up to 4096 other directories.

To begin with, we place the files themselves (named by their MD5) directly in the directory. You can also add a service file there that maps each hash to the original name (useful if you want to know the real names of the stored files) and to synonyms (if they occur). However, the structure of such a file is a separate issue.

When 4096 files have accumulated in the directory, we reorganize: we create directories named after the first three characters of the file names and move the files into them.

I hope the rest is obvious.
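The dynamic reorganization described in this answer might look something like the sketch below. All names (`store`, `locate`, `_split`) are hypothetical, the service file is left out, and the code assumes every name is a fixed-length 32-character hex hash. It also treats any directory that already contains subdirectories as "split", which is one possible reading of the answer rather than something it states.

```python
import os
import shutil

MAX_FILES = 4096  # threshold from the answer above
PREFIX_LEN = 3    # three hex characters per directory level


def _prefix(name: str, depth: int) -> str:
    """Characters of the hash used as the directory name at this depth."""
    return name[depth * PREFIX_LEN:(depth + 1) * PREFIX_LEN]


def _split(directory: str, depth: int) -> None:
    """Move every file in `directory` into a subdirectory named by its prefix."""
    for entry in list(os.scandir(directory)):
        target = os.path.join(directory, _prefix(entry.name, depth))
        os.makedirs(target, exist_ok=True)
        shutil.move(entry.path, os.path.join(target, entry.name))


def store(root: str, name: str, data: bytes, max_files: int = MAX_FILES) -> str:
    """Write `data` under the hash-derived `name`, splitting full directories."""
    os.makedirs(root, exist_ok=True)
    current, depth = root, 0
    while True:
        entries = list(os.scandir(current))
        if len(entries) >= max_files and not any(e.is_dir() for e in entries):
            _split(current, depth)  # a full leaf directory gets reorganized
            entries = list(os.scandir(current))
        if any(e.is_dir() for e in entries):  # already split: go one level down
            current = os.path.join(current, _prefix(name, depth))
            os.makedirs(current, exist_ok=True)
            depth += 1
            continue
        path = os.path.join(current, name)
        with open(path, "wb") as fh:
            fh.write(data)
        return path


def locate(root: str, name: str) -> str:
    """Return the path where `name` is (or would be) stored."""
    current, depth = root, 0
    while _prefix(name, depth) and os.path.isdir(
        os.path.join(current, _prefix(name, depth))
    ):
        current = os.path.join(current, _prefix(name, depth))
        depth += 1
    return os.path.join(current, name)
```

Listing the directory on every write is the simplest way to detect a full folder in a sketch like this; a real system would more likely keep the counts elsewhere.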

    What happens if one of the folders fills up much earlier than the others?

    Strange question. Why would one folder fill up earlier than the others? This is a random-looking hash, not a user-supplied file name starting with IMG_. Yes, the hash has to be generated sensibly, but that is a hash generation problem, not a flaw in the principle itself. So I do not understand the logic behind this argument. It is like sitting down to eat borscht and giving up the spoon just because someone, perhaps, does not know how to hold it. You teach the one clumsy person how to hold it; you do not make everyone slurp over the edge of the bowl.
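One way to check the claim about even distribution is to hash a batch of deliberately similar names and count how many land under each two-character prefix. This is only a quick experiment with made-up names of the form `IMG_000123.jpg`; the function name `prefix_counts` is hypothetical.

```python
import hashlib
from collections import Counter


def prefix_counts(n_files: int, prefix_len: int = 2) -> Counter:
    """Count how many of n_files synthetic names fall under each MD5 prefix."""
    counts = Counter()
    for i in range(n_files):
        # Sequential, near-identical names: a bad case for naive name-based sharding.
        digest = hashlib.md5(f"IMG_{i:06d}.jpg".encode()).hexdigest()
        counts[digest[:prefix_len]] += 1
    return counts


counts = prefix_counts(256_000)
# Number of folders used, then the emptiest and the fullest folder.
print(len(counts), min(counts.values()), max(counts.values()))
```

With 256,000 files and 256 prefixes, the average is 1000 files per folder, and the smallest and largest counts should stay within a few percent of that.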

    The simpler the system, the better it works.
    In a simple system, there is nothing to break.

    If you choose an algorithm up front that requires no recalculation or reorganization, it will work without having to recount the files in a folder every time (!).

    The best option is md5() of the file content. It not only gives an effectively uniform distribution but also saves space when there are duplicates (only be careful when deleting: check whether other records refer to the same file).
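A minimal sketch of content-based naming with duplicate handling, under several assumptions: the two-level `ab/cd/` layout, the function names, and the in-memory `refs` dictionary standing in for whatever records (for example, database rows) point at each stored file are all illustrative choices, not part of the answer.

```python
import hashlib
import os


def store_by_content(root: str, data: bytes, refs: dict, ext: str = "") -> str:
    """Store data under the MD5 of its content and count one more reference."""
    digest = hashlib.md5(data).hexdigest()
    directory = os.path.join(root, digest[:2], digest[2:4])
    path = os.path.join(directory, digest + ext)
    if not os.path.exists(path):  # a duplicate upload reuses the existing file
        os.makedirs(directory, exist_ok=True)
        with open(path, "wb") as fh:
            fh.write(data)
    refs[path] = refs.get(path, 0) + 1
    return path


def delete_reference(path: str, refs: dict) -> bool:
    """Drop one reference; remove the file only when nothing else points at it."""
    refs[path] -= 1
    if refs[path] > 0:
        return False  # other records still use this file
    del refs[path]
    os.remove(path)
    return True
```

Uploading the same bytes twice yields the same path and a reference count of 2, and the file disappears from disk only after the second `delete_reference` call.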

    Plus, you can convert the hash from base 16 to base 36, using all the letters of the alphabet rather than just the first 6. That shortens the hash and at the same time increases the number of possible values per character, reducing the required folder depth.
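The conversion could be sketched like this. Python has no built-in base-36 encoder, so `to_base36` below is a small hand-written helper (a hypothetical name), and padding the result to a fixed 25 characters is an assumption made so that prefixes always line up.

```python
import hashlib
import string

ALPHABET = string.digits + string.ascii_lowercase  # 36 symbols: 0-9, a-z
WIDTH = 25  # 36**25 > 2**128, so 25 digits fit any MD5 value


def to_base36(number: int, width: int = WIDTH) -> str:
    """Encode a non-negative integer in base 36, left-padded with zeros."""
    digits = []
    while number:
        number, rem = divmod(number, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits)).rjust(width, "0")


def md5_base36(data: bytes) -> str:
    """MD5 of the data as a 25-character base-36 string instead of 32 hex characters."""
    return to_base36(int(hashlib.md5(data).hexdigest(), 16))


print(md5_base36(b"example"))
```

One caveat that seems worth checking before relying on this: because 2**128 is only about 15 times 36**24, the first character of the padded string never goes beyond `f`, so taking the folder characters from the end of the string may spread files more evenly than taking them from the start.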

    • Well, the depth does not decrease by switching to base 36 or 64 (if we want to limit the number of files in each directory). - avp 7:09 pm
    • Ooh, a like for base 36, I use it too :) - David Manjula

    I once built a photo hosting site and did it as follows. By the site's design, each image has a link of the form example.com/vnjum, as in link shorteners. The picture is then saved at example.com/images/v/n/j/u/m/image.jpg. This way you can derive the link to the file itself from the link on the site, and vice versa.
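The mapping between the short link and the file path can be sketched in a few lines. The function names are hypothetical, and the alphanumeric check is an added assumption to keep characters such as `/` or `.` out of the path.

```python
from pathlib import PurePosixPath


def image_path(short_id: str, filename: str = "image.jpg") -> PurePosixPath:
    """Map a short id such as 'vnjum' to 'images/v/n/j/u/m/image.jpg'."""
    if not short_id.isalnum():
        raise ValueError("short id must be alphanumeric")
    # Each character of the id becomes one directory level.
    return PurePosixPath("images", *short_id, filename)


def short_id_from_path(path: PurePosixPath) -> str:
    """Recover the short id from a path produced by image_path."""
    return "".join(path.parts[1:-1])


print(image_path("vnjum"))                      # images/v/n/j/u/m/image.jpg
print(short_id_from_path(image_path("vnjum")))  # vnjum
```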


    Here is another method. Each document gets two identifiers: the upload time and a random number from 100,000 to 999,999.

    For example, say the time is 1540670259710 (a UNIX timestamp in milliseconds) and the random number is 125793.

    We convert the timestamp to base 36 (in JavaScript, .toString(36)) and get jnrvakn2. Then we split it into pieces of 1-2-2-3 characters: j - nr - va - kn2.

    Next, take our random number, 125793. In base 36, this is 2p29.

    We get the address /j/nr/va/kn2/2p29/filename.jpg
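The steps of this method can be reproduced with a short sketch. The names `to_base36` and `upload_path` are hypothetical, and the optional arguments exist only so that the example from the answer can be replayed with fixed values.

```python
import random
import string
import time

ALPHABET = string.digits + string.ascii_lowercase  # 0-9, a-z


def to_base36(number: int) -> str:
    """Encode a non-negative integer in base 36 (like JavaScript's toString(36))."""
    if number == 0:
        return "0"
    digits = []
    while number:
        number, rem = divmod(number, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def upload_path(filename: str, timestamp_ms: int = None, rand: int = None) -> str:
    """Build '/j/nr/va/kn2/2p29/filename.jpg'-style paths from time and a random number."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # current time in milliseconds
    if rand is None:
        rand = random.randint(100_000, 999_999)
    t = to_base36(timestamp_ms)
    # Split the timestamp into pieces of 1-2-2-3 characters.
    parts = [t[0], t[1:3], t[3:5], t[5:], to_base36(rand)]
    return "/" + "/".join(parts + [filename])


print(upload_path("filename.jpg", 1540670259710, 125793))
# /j/nr/va/kn2/2p29/filename.jpg
```

The 1-2-2-3 split relies on the timestamp being 8 characters long in base 36, which holds for millisecond timestamps until roughly the year 2059.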