Accessing a CSV file placed in HDFS using Spark


I have placed a CSV file into the HDFS filesystem using the hadoop fs -put command. I now need to access that CSV file using PySpark. My code is something like:

`plaintext_rdd = sc.textFile('hdfs://x.x.x.x/blah.csv')`

I am a newbie to HDFS. How do I find the address that should go in place of hdfs://x.x.x.x?

Here's the output when I run the following:

hduser@remus:~$ hdfs dfs -ls /input

Found 1 items
-rw-r--r--   1 hduser supergroup        158 2015-06-12 14:13 /input/test.csv

Any help is appreciated.


There are 3 answers

Abhishek Choudhary (Best Answer)

You need to provide the full path of your file in HDFS, and the URL is defined in your Hadoop configuration, in core-site.xml or hdfs-site.xml.

Check your core-site.xml and hdfs-site.xml to get the details of the URL (typically the fs.defaultFS property in core-site.xml).
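
If the cluster is running, you can also print the configured default filesystem URL from the command line with the standard hdfs getconf utility:

hdfs getconf -confKey fs.defaultFS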

An easy way to find the URL is to open the HDFS (NameNode) web UI in your browser and get the path from there.

If you are using an absolute path on your local file system, use file:///<your path>.
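
For example, a minimal PySpark sketch, assuming fs.defaultFS is hdfs://localhost:9000 (replace the host and port with whatever your configuration actually says):

# full HDFS URL; the host and port below are assumptions taken from fs.defaultFS
plaintext_rdd = sc.textFile('hdfs://localhost:9000/input/test.csv')
print(plaintext_rdd.count())  # number of lines in the CSV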
vvladymyrov

Try specifying the absolute path without hdfs://:

plaintext_rdd = sc.textFile('/input/test.csv')

When Spark runs on the same cluster as HDFS, it uses hdfs:// as the default filesystem, so the scheme and NameNode address can be omitted.
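
Once the file is loaded, a minimal sketch of turning each line into a list of fields (assuming a simple comma-separated layout with no quoted commas) could be:

plaintext_rdd = sc.textFile('/input/test.csv')
# naive split on commas; this will break on quoted fields containing commas
rows = plaintext_rdd.map(lambda line: line.split(','))
print(rows.first())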

Sairam Asapu

Start spark-shell or spark-submit pointing to the package that can read CSV files, like below:

spark-shell  --packages com.databricks:spark-csv_2.11:1.2.0

And in the Spark code, you can read the CSV file as below:

val data_df = sqlContext.read.format("com.databricks.spark.csv")
              .option("header", "true")
              .schema(<pass schema if required>)
              .load(<location in HDFS/S3>)
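
For PySpark, as in the original question, a rough equivalent, assuming the same spark-csv package is passed via --packages and a hypothetical HDFS path, would be:

# the host, port and path below are placeholders; adjust to your cluster
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("hdfs://localhost:9000/input/test.csv")
df.show()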