I've started on an analytics project. The use case is to understand customer buying patterns, and the data sources are web logs and relational databases (which hold the product master and the customer master). The relational database team and the Hadoop team are entirely separate. During the architecture discussion it was decided that the master data (Product, Customer) would be a one-time load, that incremental updates would be a daily Sqoop from Oracle to HDFS, and that Hive would be used to generate a current view (with all the latest product changes). I have started with the product details:
- The product master is approximately 10 GB on the Oracle side.
- The daily increment varies from 5 MB to 100 MB.
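The daily pull is a plain Sqoop incremental import, roughly like the sketch below (connection details, table, and column names are illustrative, not our actual setup):

```
# Daily incremental pull of the product master from Oracle into HDFS.
# Each run appends only the rows changed since --last-value.
sqoop import \
  --connect jdbc:oracle:thin:@//oracle-host:1521/ORCL \
  --username etl_user \
  --password-file /user/etl/.oracle_pwd \
  --table PRODUCT_MASTER \
  --incremental lastmodified \
  --check-column LAST_UPDATED \
  --last-value "2016-01-01 00:00:00" \
  --target-dir /data/product/increments \
  --append \
  --num-mappers 4
```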
Based on my understanding, creating such small files every day would put a load on the NameNode in the long run.
Has anybody come across such a scenario, and how are you handling it?
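For reference, the "current view" I have in mind is along these lines (table and column names are illustrative; `product_all` is assumed to be the union of the base load and the daily increments):

```sql
-- Current view: keep only the latest version of each product,
-- picking the newest record by its last-updated timestamp.
CREATE VIEW product_current AS
SELECT product_id, product_name, price, last_updated
FROM (
    SELECT product_id, product_name, price, last_updated,
           ROW_NUMBER() OVER (PARTITION BY product_id
                              ORDER BY last_updated DESC) AS rn
    FROM product_all
) latest
WHERE rn = 1;
```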
I don't see any problem yet. If you are starting with one big file and adding one file each day, you will end up with only a few hundred files after a year (on the order of 1,000 if each daily Sqoop job writes a file per mapper), which isn't a problem, at least not for the NameNode.
Still, it's not optimal to hold small files in HDFS, no matter the amount.
I'd recommend you take an application-level approach to this and merge the files after enough time has passed. For example:
Combine the two tables (the base table and the daily increments) using `UNION ALL` in a view, and create a loading process to merge the data from the increments back into the master table once in a while, when possible. If you do encounter NameNode contention from too many small files, you can read about ways of solving the "small files problem" here.