Best way to automate getting data from CSV files into a data lake


I need to get data from CSV files (daily extracts from different business databases) into HDFS, then move it to HBase, and finally load aggregations of this data into a data mart (SQL Server).

I would like to know the best way to automate this process (using Java or Hadoop tools).


There are 2 answers

OneCricketeer

Little to no coding required? In no particular order:

  • Talend Open Studio
  • StreamSets Data Collector
  • Apache NiFi

Assuming you can set up a Kafka cluster, you can try Kafka Connect.

If you want to program something, probably Spark. Otherwise, pick your favorite language. Schedule the job via Oozie.
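For the "pick your favorite language" route, here is a minimal plain-Java sketch of the aggregation step only. It is not Spark and it does not touch HDFS, HBase, or SQL Server. The class name, the method name, and the two-column layout (a key column, then a numeric column, with a header row) are assumptions made for illustration, since the question does not describe the files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; none of these names come from the question.
class DailyCsvAggregator {

    // Sums the second column for each distinct value of the first column.
    // Assumes the first line is a header and fields are plain
    // comma-separated values without quoting or embedded commas.
    static Map<String, Double> sumByKey(List<String> lines) {
        Map<String, Double> totals = new TreeMap<>();
        // Skip the header row (if there is one) and walk the data rows.
        for (String line : lines.subList(Math.min(1, lines.size()), lines.size())) {
            if (line.isBlank()) {
                continue; // ignore empty lines, e.g. a trailing newline
            }
            String[] fields = line.split(",", -1);
            String key = fields[0].trim();
            double amount = Double.parseDouble(fields[1].trim());
            totals.merge(key, amount, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DailyCsvAggregator <extract.csv>");
            return;
        }
        // Print one "key,total" line per distinct key.
        sumByKey(Files.readAllLines(Path.of(args[0])))
            .forEach((key, total) -> System.out.println(key + "," + total));
    }
}
```

A real job would more likely use a proper CSV parser and write the totals to the data mart over JDBC; the scheduler would then simply invoke whatever command runs the job.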

If you don't need the raw data in HDFS, you can load directly into HBase.

Robin Moffatt

I'd echo the comment above about Kafka Connect, which is part of Apache Kafka. With it, you use configuration files alone to stream from your sources. You can then use KSQL to create derived, enriched, or aggregated streams, and stream those to HDFS, Elasticsearch, HBase, JDBC, and so on.
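To give a feel for what "just configuration files" means, here is a sketch of a standalone source connector config using the demo file connector that is included in the Apache Kafka distribution. The connector name, file path, and topic name are placeholders, not values from the question. Note that this connector emits each line of the file as one string record and does not parse CSV columns, so a real pipeline would probably use a dedicated CSV-aware connector instead.

```properties
# Connector instance name (placeholder)
name=daily-csv-source
# Demo file connector bundled with Apache Kafka; emits each line as one record
connector.class=FileStreamSource
tasks.max=1
# Placeholder path to a daily extract
file=/data/extracts/daily.csv
# Placeholder topic that receives the lines
topic=daily-extracts
```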

There's a list of Kafka Connect connectors here.

This blog series walks through the basics: