I have a sequence file whose values look like
(string_value, json_value)
I don't care about the string value.
In Scala I can read the file by
val reader = sc.sequenceFile[String, String]("/path...")
val data = reader.map{case (x, y) => (y.toString)}
val jsondata = spark.read.json(data)
I am having a hard time converting this to PySpark. I have tried using
reader= sc.sequenceFile("/path","org.apache.hadoop.io.Text", "org.apache.hadoop.io.Text")
data = reader.map(lambda x,y: str(y))
jsondata = spark.read.json(data)
The errors are cryptic, but I can provide them if that helps. My question is: what is the right syntax for reading these sequence files in PySpark 2?
I think I am not converting the array elements to strings correctly. I get similar errors if I do something simple like
m = sc.parallelize([(1, 2), (3, 4)])
m.map(lambda x,y: y.toString).collect()
or
m = sc.parallelize([(1, 2), (3, 4)])
m.map(lambda x,y: str(y)).collect()
Thanks!
The fundamental problem with your code is the function you use. A function passed to map should take a single argument (the whole (key, value) tuple), not two.
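Use either (sketched against the reader RDD from your snippet, where x is the (key, value) pair):

data = reader.map(lambda x: x[1])

or just:

data = reader.values()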
As long as keyClass and valueClass match the data, this should be all you need here, and there should be no need for additional type conversions (this is handled internally by sequenceFile). To illustrate, write in Scala:
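(a minimal round-trip sketch; the path /tmp/sequence-example and the sample JSON strings are made up for illustration)

sc.parallelize(Seq(("key1", """{"a": 1}"""), ("key2", """{"a": 2}""")))
  .saveAsSequenceFile("/tmp/sequence-example")

Read in Python:

reader = sc.sequenceFile(
    "/tmp/sequence-example",
    "org.apache.hadoop.io.Text",
    "org.apache.hadoop.io.Text")
jsondata = spark.read.json(reader.values())
jsondata.show()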
Note: legacy Python versions support tuple parameter unpacking:
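(Python 2 only; this is a SyntaxError in Python 3)

data = reader.map(lambda (_, v): v)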
Don't use it for code that should be forward compatible.