The company decided to reorganize the data storage and processing system.

The essence of the problem: there is a huge number of machines, each of which generates several small files every day; in total, the number of files per day exceeds 100,000. The files all have nearly the same structure. Management wants these files to be combined into a single file (or loaded into a database) and analyzed on a Hadoop cluster. Since the data is structured, it seems most logical to do the analysis in Hive.

How can such files be merged? Upload everything to HDFS and aggregate it there? That takes far too long; Hadoop is not designed for this.

What would you advise? Ready-made solutions are preferred.

  • Is it the copying itself or the analysis that takes so long? For the analysis you could, in theory, look at Spark on top of Hive, for example. - Chubatiy
  • That is exactly the problem: the analysis takes too long, which is why we are looking for some kind of solution. Otherwise we could simply arrange the file ingestion through Flume, for example, and analyze the data from there. - Artem Reshetnikov
  • You could look at Filebeat to merge the contents of the files as they arrive. - Alex Chermenin

1 answer

Not surprisingly, everything is slow: Hadoop is poorly suited to processing heaps of small files. They not only put extra load on the NameNode, but each small file also spawns its own mapper, and this is a long-known problem.

The first solution: merge the small files into large ones. There are already several tools for this, for example filecrush; in general, this problem is easy to google as the "hadoop small file problem".
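As an illustration of the same idea (not the filecrush tool itself), here is a minimal Spark sketch: read the day's small files and rewrite them as a handful of large files registered as a Hive table. The paths, table name, file format and number of output files are assumptions you would adjust to your cluster.

```scala
import org.apache.spark.sql.SparkSession

object CompactSmallFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-small-files")
      .enableHiveSupport() // lets us write straight into a Hive table
      .getOrCreate()

    // Hypothetical input path: the day's ~100,000 small CSV files, all with the same structure
    val raw = spark.read
      .option("header", "true")
      .csv("hdfs:///incoming/2017-06-01/*.csv")

    // Collapse the data into a small number of large files
    // (16 here is arbitrary; tune it to your HDFS block size and data volume)
    raw.coalesce(16)
      .write
      .mode("overwrite")
      .format("parquet") // Parquet keeps the data structured and splittable
      .saveAsTable("analytics.daily_events") // hypothetical Hive table name

    spark.stop()
  }
}
```

After such a compaction step, Hive sees a few large, splittable files instead of 100,000 tiny ones, so the NameNode pressure and the one-mapper-per-file overhead go away.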

P.S. Note that this problem is mostly relevant when you use the MapReduce framework, because Hive can run not only on top of MapReduce but also on Tez, Spark and so on.
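For example, a minimal sketch that can be run in spark-shell: assuming the compacted data was registered as the (hypothetical) Hive table from the snippet above, Spark SQL can query it through the Hive metastore without going through MapReduce at all. The column name machine_id is likewise just an assumption for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Query the Hive table with Spark instead of Hive-on-MapReduce
val spark = SparkSession.builder()
  .appName("query-hive-with-spark")
  .enableHiveSupport() // connects to the existing Hive metastore
  .getOrCreate()

spark.sql(
  "SELECT machine_id, count(*) AS events " +
  "FROM analytics.daily_events GROUP BY machine_id"
).show()
```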