I am writing a simple PySpark script to copy HDFS files and folders from one location to another. I have gone through many docs and answers available online, but I could not find a way to copy folders and files using PySpark, or a way to execute HDFS commands from PySpark (in particular, copying folders and files).
Below is my code:
hadoop = sc._jvm.org.apache.hadoop
Path = hadoop.fs.Path
FileSystem = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
fs = FileSystem.get(conf)

source = Path('/user/xxx/data')
destination = Path('/user/xxx/data1')

if fs.exists(source):
    for f in fs.listStatus(source):
        print('File path', str(f.getPath()))
        # **** how to use copy command here ?
Thanks in advance
Create a new Java object for the FileUtil class and use its copy methods, rather than shelling out to the hdfs script commands.
How to move or copy file in HDFS by using JAVA API
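For example, something along these lines should work from PySpark (a minimal sketch built on the objects you already have; the paths are the ones from your question, and the False argument is FileUtil.copy's deleteSource flag, so the source is kept rather than moved):

hadoop = sc._jvm.org.apache.hadoop
conf = hadoop.conf.Configuration()
fs = hadoop.fs.FileSystem.get(conf)

source = hadoop.fs.Path('/user/xxx/data')
destination = hadoop.fs.Path('/user/xxx/data1')

# FileUtil.copy(srcFS, src, dstFS, dst, deleteSource, conf)
# copies a file or directory (recursively); deleteSource=False keeps the original
hadoop.fs.FileUtil.copy(fs, source, fs, destination, False, conf)

Since FileUtil.copy handles directories recursively, you don't need the listStatus loop at all for a plain copy.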
It might be better to just use distcp rather than Spark, though; otherwise, you'll run into race conditions if you try to run that code with multiple executors.
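If you go the distcp route, it's a shell command run outside of Spark, roughly like this (using the paths from the question; it launches a MapReduce job to do the copy in parallel):

hadoop distcp /user/xxx/data /user/xxx/data1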