Data ingestion in Hadoop using Distcp


I understand that distcp is used for inter- and intra-cluster data transfer. Is it possible to use distcp to ingest data from the local file system into HDFS? I understand that you can use file:///.... to point to a local file outside of HDFS, but how reliable and fast is that compared to an inter- or intra-cluster transfer?


There is 1 answer

RojoSam

DistCp is a MapReduce job that runs inside the Hadoop cluster. From the cluster's perspective, your machine's file system is not local: a file:/// path would be resolved on the nodes running the map tasks, not on your machine. So you can't use your local file system with distcp. An alternative is to set up an FTP server on your machine that the Hadoop cluster can read from. Performance then depends on the network and the protocol used (FTP with Hadoop performs very poorly).

For small amounts of data, the hdfs dfs -put command could be a better choice, but it does not copy in parallel the way distcp does.
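If the lack of parallelism in hdfs dfs -put is the main concern, one possible workaround is to start several put commands at once from the client machine. The sketch below is only an illustration of that idea, not something distcp or Hadoop provides: the function names `build_put_command` and `parallel_put`, the `workers` count, and the `runner` parameter are all made up here, and it assumes the `hdfs` client is installed and configured on the machine that holds the files. Throughput will still be limited by that one machine's disk and network.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_put_command(local_path, hdfs_dir):
    """Return the argv for copying one local file into an HDFS directory."""
    # -f overwrites an existing destination file; drop it to fail instead.
    return ["hdfs", "dfs", "-put", "-f", str(local_path), hdfs_dir]


def parallel_put(local_files, hdfs_dir, workers=4, runner=subprocess.run):
    """Run one `hdfs dfs -put` per file, at most `workers` at a time.

    `runner` is a hypothetical hook so the command runner can be swapped
    out (for example in a dry run); it defaults to subprocess.run.
    Returns the files whose put command exited with a non-zero status.
    """
    def upload(path):
        result = runner(build_put_command(path, hdfs_dir))
        return path, result.returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(upload, local_files))
    return [path for path, code in results if code != 0]
```

Calling `parallel_put(["a.csv", "b.csv"], "/user/me/ingest")` (hypothetical file names and target directory) would launch one put per file and return the list of files that failed, which you could then retry.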