Using Spark 2.1.1., I have an N-row csv as 'fileInput'
colname datatype elems start end
colA float 10 0 1
colB int 10 0 9
I have successfully made an array of sql.rows ...
val df = spark.read.format("com.databricks.spark.csv").option("header", "true").load(fileInput)
val rowCnt:Int = df.count.toInt
val aryToUse = df.take(rowCnt)
Array[org.apache.spark.sql.Row] = Array([colA,float,10,0,1], [colB,int,10,0,9])
Against those Rows and using my random-value-generator scripts, I have successfully populated an empty ListBuffer[Any] ...
res170: scala.collection.mutable.ListBuffer[Any] = ListBuffer(List(0.24455154, 0.108798146, 0.111522496, 0.44311434, 0.13506883, 0.0655781, 0.8273762, 0.49718297, 0.5322746, 0.8416396), List(1, 9, 3, 4, 2, 3, 8, 7, 4, 6))
Now, I have a mixed-type ListBuffer[Any] with different typed lists. . How do iterate through and zip these? [Any] seems to defy mapping/zipping. I need to take N lists generated by the inputFile's definitions, then save them to a csv file. Final output should be:
ColA, ColB
0.24455154, 1
0.108798146, 9
0.111522496, 3
... etc
The inputFile can then be used to create any number of 'colnames's, of any 'datatype' (I have scripts for that), of each type appearing 1::n times, of any number of rows (defined as 'elems'). My random-generating scripts customize the values per 'start' & 'end', but these columns are not relevant for this question).
I think the
RDD.zipWithUniqueId()
orRDD.zipWithIndex()
methods can perform what you wanna do.Please refer to official documentation for more information. hope this help you