The company has decided to reorganize its data storage and processing system.
The essence of the question: there is a huge number of machines, each generating several small files every day; in total, more than 100,000 files per day. The files all share roughly the same structure. Management wants these files combined into a single file (or loaded into a database) and analyzed on a Hadoop cluster. Since the data is structured, Hive seems the most logical tool for the analysis.
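For context, the kind of pre-merge step I have in mind is something like the minimal sketch below. All paths and the chunk size are hypothetical placeholders, and it assumes the files can be safely concatenated byte-for-byte (no per-file headers):

```python
import glob
import os
import shutil

# Hypothetical layout; adjust to the real one.
SRC_DIR = "/data/incoming/2024-01-01"   # one day's small files
DST_DIR = "/data/merged/2024-01-01"     # output: a few large files
FILES_PER_CHUNK = 10_000                # small files per merged output file

os.makedirs(DST_DIR, exist_ok=True)
paths = sorted(glob.glob(os.path.join(SRC_DIR, "*")))

for start in range(0, len(paths), FILES_PER_CHUNK):
    out_path = os.path.join(DST_DIR, f"part-{start // FILES_PER_CHUNK:05d}")
    with open(out_path, "wb") as out:
        for p in paths[start:start + FILES_PER_CHUNK]:
            # The files share the same structure, so plain byte
            # concatenation is assumed to be safe here.
            with open(p, "rb") as src:
                shutil.copyfileobj(src, out)
```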
How can files like these be merged? Upload everything to HDFS and aggregate there? That takes too long, and HDFS is not designed for huge numbers of small files: the NameNode keeps metadata for every file in memory, so 100,000+ new files a day would quickly become a problem.
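For completeness, this is the variant I'm doubting: uploading the (pre-merged) files to HDFS and pointing a Hive external table at them. A rough sketch, with all paths and the table schema invented for illustration; `hdfs dfs -mkdir -p` and `hdfs dfs -put` are the standard HDFS CLI commands:

```python
import subprocess

# Hypothetical paths.
LOCAL_DIR = "/data/merged/2024-01-01"
HDFS_PARENT = "/warehouse/machine_logs"

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_PARENT], check=True)
# -put copies the local directory into HDFS,
# landing at /warehouse/machine_logs/2024-01-01
subprocess.run(["hdfs", "dfs", "-put", LOCAL_DIR, HDFS_PARENT], check=True)

# A Hive external table could then be pointed at the uploaded data,
# e.g. (schema invented for illustration):
#   CREATE EXTERNAL TABLE machine_logs (ts STRING, metric STRING, value DOUBLE)
#   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
#   LOCATION '/warehouse/machine_logs/2024-01-01/';
```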
What would you advise? Ready-made solutions are preferred.