How many files can I keep in the same directory in HDFS without running into problems?
And how do directories work in HDFS? As I understand it, they are virtual and exist only from the user's point of view?

    1 answer

    On the one hand, the number of files in HDFS is limited only by the configuration of the node running the NameNode daemon, which keeps all file system metadata in memory. The number of files in a single directory is additionally bounded by Java (in which Hadoop is written): for example, the method that lists the files in a directory returns an array, and, as you know, a Java array cannot hold more than Integer.MAX_VALUE elements (which is 2^31 − 1, and even that only if there is enough memory).
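
    To make the Java limitation concrete, here is a minimal sketch using the HDFS Java API: directory listings come back as a plain FileStatus[] array, so a single listing cannot exceed Integer.MAX_VALUE entries. The path "/user/data" is purely illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListDirectory {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // listStatus returns FileStatus[] -- an ordinary Java array,
            // bounded by Integer.MAX_VALUE (2^31 - 1) elements.
            FileStatus[] entries = fs.listStatus(new Path("/user/data"));
            System.out.println("Entries in directory: " + entries.length);
        }
    }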

    In any case, instead of a pile of small files it is better to keep one large file in HDFS, since every small file still occupies at least one block (is it worth spending a 64 MB block on every 10 KB file, and in three copies once replication is taken into account?).
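
    One common way to pack many small files into a single large file is a SequenceFile keyed by the original file name. The following is only a hedged sketch of that approach; the input directory and output path are made up for illustration.

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path target = new Path("/user/data/packed.seq"); // hypothetical output path
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(target),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (File f : new File("/tmp/small-files").listFiles()) { // hypothetical input dir
                    byte[] content = Files.readAllBytes(f.toPath());
                    // one record per small file: file name -> raw bytes
                    writer.append(new Text(f.getName()), new BytesWritable(content));
                }
            }
        }
    }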

    For more on the design of HDFS, I recommend reading here: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html .

    • A small file (smaller than the block size) does not take up disk space equal to the block size; it takes up only as much as it actually weighs, i.e. a 2 MB file will occupy exactly 2 MB, not the HDFS block size (see the sketch after these comments) - Andrey Gorodetsky
    • @AndreyGorodetsky Generally speaking, yes, I got a bit carried away there :) - Alex Chermenin
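
    The comment's point can be checked directly: getSpaceConsumed() reports the real usage across all replicas, so a 2 MB file with replication factor 3 shows roughly 6 MB, not 3 × 64 MB. A small sketch, with an assumed file path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SpaceUsage {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/data/small-file.bin"); // hypothetical file
            FileStatus status = fs.getFileStatus(file);
            ContentSummary summary = fs.getContentSummary(file);
            System.out.println("Logical size (bytes):   " + status.getLen());
            System.out.println("Block size (bytes):     " + status.getBlockSize());
            System.out.println("Space consumed (bytes): " + summary.getSpaceConsumed());
        }
    }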