Code snippet :
val inp = sc.textFile("C:\\mk\\logdir\\foldera\\foldera1\\log.txt").collect.mkString(" ")
I know the above code reads the entire file, combines its lines into one string, and does so on the driver node (a single execution, not a parallel one).
val inp = sc.textFile("C:\\mk\\logdir\\*\\*\\log.txt")
code block{ }
sc.stop
Q1) Here I am reading multiple files (present in the above folder structure). I believe in this case each file will become a partition, be sent to a separate node & be processed in parallel. Am I correct in my understanding? Can someone confirm this? Or is there any way I can verify it systematically?
val inp = sc.textFile("C:\\mk\\logdir\\*\\*\\log.txt")
val cont = inp.collect.mkString(" ")
code block{ }
sc.stop
Q2) How does Spark handle this case? Though I am doing collect, I assume it will not collect all content from all files, but just one file. Am I right? Can someone help me understand this?
Thank you very much in advance for your time & help.
ANSWER :
SparkContext's textFile method, i.e., sc.textFile, creates an RDD with each line as an element. If there are 10 files in your yourtextfilesfolder folder, 10 partitions will be created. You can verify the number of partitions via rdd.partitions.length. However, partitioning is determined by data locality, which may result in too few partitions by default. AFAIK there is no guarantee that exactly one partition will be created per file; please see the code of SparkContext.textFile and its minPartitions parameter, the suggested minimum number of partitions for the resulting RDD.
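For example, a minimal check, reusing the wildcard path from the question:
val inp = sc.textFile("C:\\mk\\logdir\\*\\*\\log.txt")
println(inp.partitions.length) // typically at least one partition per matched file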
For a better understanding, see the method below; you can pass minPartitions as its second argument.
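For reference, its definition in SparkContext.scala looks roughly like this (Spark 2.x; the scaladoc comment is omitted here):
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}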
ANSWER :
Your RDD is constructed from multiple text files, so collect will bring the elements of all partitions, from all the files, to the driver, not one file at a time.
You can verify this by inspecting the result of rdd.collect.
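A minimal sketch of that verification (the wildcard path is taken from the question):
val inp = sc.textFile("C:\\mk\\logdir\\*\\*\\log.txt")
val lines = inp.collect() // brings every line of every matched file to the driver
println(lines.length)     // total line count across all files, not just one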
However, if you want to read multiple text files whole, you can also use wholeTextFiles. Please see the @note on that method: small files are preferred; large files are also allowable, but may cause bad performance. See spark-core-sc-textfile-vs-sc-wholetextfiles.
Doc :
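A sketch of the signature (from SparkContext.scala; scaladoc omitted):
def wholeTextFiles(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[(String, String)]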
Examples :
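A small usage sketch (reusing the question's path pattern; the printed summary is just illustrative):
val files = sc.wholeTextFiles("C:\\mk\\logdir\\*\\*\\log.txt")
// Each element is a (fullFilePath, fileContent) pair.
files.collect().foreach { case (name, content) =>
  println(s"$name -> ${content.length} chars")
}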
In your case I'd prefer SparkContext.wholeTextFiles, where after collect you get the filename and its content together, as described above, if that's the thing you wanted.