HDFS commands in a PySpark script


I am writing a simple PySpark script to copy HDFS files and folders from one location to another. I have gone through many docs and answers available online, but I could not find a way to copy folders and files using PySpark, or to execute HDFS commands (in particular, copy commands) from PySpark.

Below is my code

# Access the Hadoop FileSystem API through the JVM gateway
hadoop = sc._jvm.org.apache.hadoop
Path = hadoop.fs.Path
FileSystem = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
fs = FileSystem.get(conf)
source = Path('/user/xxx/data')
destination = Path('/user/xxx/data1')

if fs.exists(source):
    for f in fs.listStatus(source):
        print('File path', str(f.getPath()))
        # **** how to use a copy command here?

Thanks in advance


1 Answer

Answered by OneCricketeer:

Create a new Java object for the FileUtil class and use its copy methods, rather than shelling out to hdfs script commands.

How to move or copy file in HDFS by using JAVA API
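For example, here is a minimal sketch building on the code in the question (same sc and same /user/xxx paths). It uses the standard Hadoop FileUtil.copy(srcFS, src, dstFS, dst, deleteSource, conf) static method via the py4j gateway:

# Access FileUtil through the same JVM gateway as in the question
hadoop = sc._jvm.org.apache.hadoop
conf = hadoop.conf.Configuration()
fs = hadoop.fs.FileSystem.get(conf)

source = hadoop.fs.Path('/user/xxx/data')
destination = hadoop.fs.Path('/user/xxx/data1')

if fs.exists(source):
    # Copies the whole directory tree; deleteSource=False keeps the source
    # (pass True to get move semantics instead of copy)
    hadoop.fs.FileUtil.copy(fs, source, fs, destination, False, conf)

Note that FileUtil.copy handles directories recursively, so the listStatus loop from the question is only needed if you want to filter or copy individual files.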

It might be better to just use distcp rather than Spark, though; otherwise, you'll run into race conditions if you try to run that code with multiple executors.
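For reference, the equivalent distcp invocation for the paths in the question would look something like this (run from a shell, not from Spark):

hadoop distcp /user/xxx/data /user/xxx/data1

distcp runs the copy as a distributed MapReduce job, so it parallelizes safely across the cluster without the coordination problems of copying from multiple Spark executors.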