Parsing Protobuf ByteString in Spark not working after creating Encoder


I'm trying to parse protobuf (proto3) data in Spark 2.4 and I'm having trouble with the ByteString type. I've generated the case class with the ScalaPB library and loaded the jar into a spark shell. I've also tried creating an implicit encoder for the type, but I still get the following error:

java.lang.UnsupportedOperationException: No Encoder found for com.google.protobuf.ByteString

Here is what I've tried so far:

import proto.Event._ // my proto case class
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders.kryo

// Register our UDTs to avoid "<none> is not a term" error:
EventProtoUdt.register()

val inputFile = "data.avro"

object ByteStringEncoder {
  implicit def byteStringEncoder: Encoder[com.google.protobuf.ByteString] =
    org.apache.spark.sql.Encoders.kryo[com.google.protobuf.ByteString]
}

import ByteStringEncoder._
import spark.implicits._

def parseLine(s: String): Event =
  Event.parseFrom(org.apache.commons.codec.binary.Base64.decodeBase64(s))

import scalapb.spark._
val eventsDf = spark.read.format("avro").load(inputFile)

val eventsDf2 = eventsDf.map(row => row.getAs[Array[Byte]]("Body")).map(Event.parseFrom(_))

Any help is appreciated.


1 Answer

Answered by thesamet:

This issue has been fixed in sparksql-scalapb 0.9.0. Please see the updated documentation on setting up the imports so that an Encoder for ByteString is picked up by implicit search.
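
For reference, a minimal sketch of the setup the updated documentation describes, assuming sparksql-scalapb 0.9.0+ is on the classpath and proto.Event is the ScalaPB-generated class from the question. The scalapb.spark.Implicits._ import is meant to replace both the hand-rolled kryo encoder and spark.implicits._ (importing both can cause ambiguous-implicit errors):

// A minimal sketch, assuming sparksql-scalapb 0.9.0+ and the question's Event class.
import org.apache.spark.sql.SparkSession
import scalapb.spark.Implicits._ // supplies Encoders for ScalaPB messages, ByteString fields included

import proto.Event

val spark = SparkSession.builder.getOrCreate()

// Parsing inside a single map means only an Encoder[Event] is needed, which the
// imported implicits provide; no custom kryo encoder for ByteString is required.
val events = spark.read.format("avro").load("data.avro")
  .map(row => Event.parseFrom(row.getAs[Array[Byte]]("Body")))

With this in place the resulting Dataset[Event] uses sparksql-scalapb's own Encoder rather than kryo, so ByteString fields should no longer trigger the UnsupportedOperationException above.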