How can I save data from HDFS to Amazon S3?


I am working on web archives and extracting some data. Initially I stored this data as text files in my HDFS, but because of its massive size I now have to store the output in Amazon S3 buckets. How can I achieve this? I have tried the s3a connector, but it throws an error saying the credentials are wrong. The text output is in the TBs; is there any way I can keep storing it in HDFS as before, upload it to S3, and then delete it from HDFS, or is there another effective way of doing this?

for bucket in buckets[4:5]:
    # List the WARC files in this bucket
    filenames = get_bucket_warcs(bucket)
    print("==================================================")
    print(f"bucket: {bucket}, filenames: {len(filenames)}")
    print("==================================================")

    # Accumulators updated inside get_jsonld_records
    jsonld_count = sc.accumulator(0)
    records_count = sc.accumulator(0)
    exceptions_count = sc.accumulator(0)

    # One partition per WARC file
    rdd_filenames = sc.parallelize(filenames, len(filenames))
    rdd_jsonld = rdd_filenames.flatMap(lambda f: get_jsonld_records(bucket, f))

    # Currently writes to HDFS; this is the output I want to land in S3 instead
    rdd_jsonld.saveAsTextFile(f"{hdfs_path}/webarchive-jsonld-{bucket}")

    print(f"records processed: {records_count.value}", f"jsonld: {jsonld_count.value}", f"exceptions: {exceptions_count.value}")

# Stop the SparkContext only after all buckets are processed
# (inside the loop it would be gone after the first bucket)
sc.stop()

This is my code, and I would like to save rdd_jsonld to an Amazon S3 bucket.


1 Answer

Answer by stevel:

If the s3a connector is reporting that the credentials are wrong, then either you haven't set up the credentials at all or you have configured the client to talk to the wrong public/private S3 store.
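As a minimal sketch, assuming you are on the Hadoop S3A connector and want to pass static access keys through the Spark configuration (the key values and the output bucket name below are placeholders, not from your code):

from pyspark import SparkConf, SparkContext

# Minimal sketch: pass S3A credentials via the Hadoop configuration.
# Replace the placeholder values with your own keys, or use another
# credential provider (environment variables, instance profiles, etc.).
conf = (
    SparkConf()
    .setAppName("webarchive-jsonld")
    .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    # Only needed for a non-default or private S3 endpoint:
    # .set("spark.hadoop.fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
)
sc = SparkContext(conf=conf)

# Once authentication works, the RDD can be written straight to S3
# instead of HDFS (bucket name is a placeholder):
# rdd_jsonld.saveAsTextFile(f"s3a://my-output-bucket/webarchive-jsonld-{bucket}")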

Look up the online documentation for the S3 connector you are using (Hadoop S3A or EMR's S3 connector) and read it, especially the sections on authentication and troubleshooting.
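If writing directly to S3 from the job stays problematic, the staged approach mentioned in the question (write to HDFS, copy to S3, then delete from HDFS) can be scripted with Hadoop's distcp tool. A rough sketch, reusing the hdfs_path and bucket names from your loop and a placeholder output bucket:

import subprocess

# Stage-then-copy sketch: the job writes to HDFS as before, then the
# finished output directory is copied to S3 with distcp and removed
# from HDFS. Paths and the output bucket name are placeholders.
hdfs_dir = f"{hdfs_path}/webarchive-jsonld-{bucket}"
s3_dir = f"s3a://my-output-bucket/webarchive-jsonld-{bucket}"

subprocess.run(["hadoop", "distcp", hdfs_dir, s3_dir], check=True)
subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", hdfs_dir], check=True)

Note that distcp still goes through the same S3A client, so the credentials have to be fixed either way.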