I have a sequence file whose values look like
(string_value, json_value)
I don't care about the string value.
In Scala I can read the file by
val reader = sc.sequenceFile[String, String]("/path...")
val data = reader.map{case (x, y) => (y.toString)}
val jsondata = spark.read.json(data)
I am having a hard time converting this to PySpark. I have tried using
reader= sc.sequenceFile("/path","org.apache.hadoop.io.Text", "org.apache.hadoop.io.Text")
data = reader.map(lambda x,y: str(y))
jsondata = spark.read.json(data)
The errors are cryptic, but I can provide them if that helps. My question is: what is the right syntax for reading these sequence files in PySpark 2?
I think I am not converting the tuple elements to strings correctly. I get similar errors if I do something simple like
m = sc.parallelize([(1, 2), (3, 4)])
m.map(lambda x,y: y.toString).collect()
or
m = sc.parallelize([(1, 2), (3, 4)])
m.map(lambda x,y: str(y)).collect()
Thanks!
 
The fundamental problem with your code is the function you use. A function passed to map should take a single argument: PySpark passes each element, here a (key, value) tuple, as one argument, so a two-argument lambda fails.
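For example, something along these lines should work (a sketch, using the reader RDD from your sequenceFile call). Use either:
data = reader.map(lambda x: str(x[1]))  # x is the whole (key, value) tuple; keep only the value as a string
or just:
data = reader.values().map(str)  # drop the keys first, then convert each value to a string
After that, spark.read.json(data) behaves the same as in your Scala version.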
As long as
keyClassandvalueClassmatch the data this should be all you need here and there should be no need for additional type conversions (this is handled internally bysequenceFile). Write in Scala:Read in Python:
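For a quick end-to-end check (a sketch; the /tmp/foo path and the sample JSON strings below are just placeholders), write in Scala:
// save two (key, json) pairs as a SequenceFile of Text writables
sc.parallelize(Seq(("a", """{"k": 1}"""), ("b", """{"k": 2}""")))
  .saveAsSequenceFile("/tmp/foo")
and read in Python:
# keyClass / valueClass match the Text writables produced above
rdd = sc.sequenceFile("/tmp/foo", "org.apache.hadoop.io.Text", "org.apache.hadoop.io.Text")
jsondata = spark.read.json(rdd.values().map(str))
jsondata.show()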
Note: legacy Python versions (Python 2) support tuple parameter unpacking in lambdas:
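For example (Python 2 only; this is a SyntaxError in Python 3):
data = reader.map(lambda (_, v): str(v))  # the lambda unpacks the (key, value) tuple in its signature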
Don't use it for code that should be forward compatible.