I am currently working with the s3a adapter of Hadoop/HDFS to upload a number of files from a Hive database to a particular S3 bucket. I'm getting nervous because I can't find anything online about specifying a bunch of filepaths (not directories) for copying via distcp.
I have set up my program to collect an array of filepaths using a function, inject them all into a distcp command, and then run the command:
import logging
import subprocess

logger = logging.getLogger(__name__)

# ...inside the uploader method:
files = self.get_files_for_upload()
if not files:
    logger.warning("No recently updated files found. Exiting...")
    return
# Fully qualify each HDFS path with the cluster's nameservice
full_path_files = [f"hdfs://nameservice1{file}" for file in files]
s3_dest = "path/to/bucket"
# One distcp invocation listing every source file explicitly
cmd = f"hadoop distcp -update {' '.join(full_path_files)} s3a://{s3_dest}"
logger.info(f"Preparing to upload Hive data files with cmd: \n{cmd}")
result = subprocess.run(cmd, shell=True, check=True)
This basically just creates one long distcp command with 15-20 different filepaths. Will this work? Should I be using the -cp or -put commands instead of distcp?
(It doesn't make sense to me to copy all these files to their own directory and then distcp that entire directory, when I can just copy them directly and skip those steps...)
-cp and -put would require you to download the HDFS files and then upload them to S3. That would be a lot slower, since distcp runs the copy as a distributed MapReduce job across the cluster instead of funneling everything through one client.
I see no immediate reason why this wouldn't work. However, reading over the documentation, I would recommend using the -f flag instead, which points distcp at a file containing the list of source paths. E.g.:
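Something like this (a minimal sketch reusing the nameservice and placeholder bucket path from your snippet; the /tmp/distcp_srclist locations are arbitrary choices):

import subprocess

files = self.get_files_for_upload()  # as in your snippet
full_path_files = [f"hdfs://nameservice1{file}" for file in files]

# Write one source URI per line to a local staging file
with open("/tmp/distcp_srclist", "w") as f:
    f.write("\n".join(full_path_files) + "\n")

# distcp reads the list from a filesystem URI, so stage it on HDFS
subprocess.run(
    "hdfs dfs -put -f /tmp/distcp_srclist hdfs://nameservice1/tmp/distcp_srclist",
    shell=True, check=True,
)

# -f replaces the 15-20 inline source paths with the list file
s3_dest = "path/to/bucket"
cmd = f"hadoop distcp -update -f hdfs://nameservice1/tmp/distcp_srclist s3a://{s3_dest}"
subprocess.run(cmd, shell=True, check=True)

The command line stays the same length no matter how many files you copy, and the -update semantics are unchanged.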
If all the files were already in their own directory, then you should just copy the directory, like you said.
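In that case the whole thing collapses to a single directory-to-bucket copy (with /path/to/dir standing in for the hypothetical directory):

# Hypothetical: all files already live under one HDFS directory
cmd = "hadoop distcp -update hdfs://nameservice1/path/to/dir s3a://path/to/bucket"
subprocess.run(cmd, shell=True, check=True)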