I have a large indexed LZO file in HDFS that I would like to read into a Spark DataFrame. The file contains one JSON document per line.
posts_dir='/data/2016/01'
posts_dir contains the following files:
/data/2016/01/posts.lzo
/data/2016/01/posts.lzo.index
The following works, but it doesn't make use of the index and therefore takes a long time, since it ends up using only one mapper.
posts = spark.read.json(posts_dir)
Is there a way to make it utilize the index?
I solved this by first creating an RDD that recognizes the index, and then using the
from_json
function to turn each line into a StructType
, effectively producing the same result as spark.read.json(...)
I am not aware of a better or more straightforward way.