How to split RDD rows by commas when there is no value between them?


I'm trying to split the RDD row below into five columns:

val test = [hello,one,,,]

val rddTest = test.rdd
val Content = rddTest.map(_.toString().replace("[", "").replace("]", ""))
      .map(_.split(","))
      .map(e ⇒ Row(e(0), e(1), e(2), e(3), e(4), e(5)))

When I execute this, I get a "java.lang.ArrayIndexOutOfBoundsException" because there are no values between the last three commas.

Any ideas on how to split the data?


There are 2 answers

3
Suhas NM On

Your code is correct, but after splitting you are trying to access 6 elements instead of 5.

Change

.map(e ⇒ Row(e(0), e(1), e(2), e(3), e(4), e(5)))

to

.map(e ⇒ Row(e(0), e(1), e(2), e(3), e(4)))

UPDATE

By default, String.split drops trailing empty strings from the result. That is why your array has only 2 elements, and why indexing e(2) and beyond fails. To keep the empty fields, pass a negative limit as the second argument:

val Content = rddTest.map(_.toString().replace("[", "").replace("]", ""))
      .map(_.split(",",-1))
      .map(e ⇒ Row(e(0), e(1), e(2), e(3), e(4)))

Note the second argument to split: with a limit of -1, all fields are retained, including the trailing empty ones.
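
The difference between the two forms of split can be checked without Spark. The following is a minimal sketch in plain Scala; the object name SplitLimitDemo and the sample string are illustrative assumptions, with the string standing in for the question's row after the brackets have been stripped.

```scala
object SplitLimitDemo {
  // Assumed sample: the question's row with "[" and "]" already removed.
  val line: String = "hello,one,,,"

  // Default split (limit 0): trailing empty strings are discarded.
  val defaultParts: Array[String] = line.split(",")

  // Negative limit: every field is kept, including trailing empty ones.
  val allParts: Array[String] = line.split(",", -1)

  def main(args: Array[String]): Unit = {
    println(s"default split: ${defaultParts.length} fields")
    println(s"split with -1: ${allParts.length} fields")
  }
}
```

With the default form, only "hello" and "one" survive, so e(2) is already out of bounds. With the negative limit, the array has five elements and e(0) through e(4) are all safe to read.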

0
Lamanus On

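It's a bit dirty, but you can do it with several replacements: pad every comma with spaces so that no field is empty when split, then strip the spaces afterwards.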

val test = sc.parallelize(List("[hello,one,,,]"))

test.map(_.replace("[", "").replace("]", "").replaceAll(",", " , "))
    .map(_.split(",").map(_.replace(" ", "")))
    .toDF().show(false)

+------------------+
|value             |
+------------------+
|[hello, one, , , ]|
+------------------+
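
To see why the padding approach keeps all five fields, here is a plain-Scala sketch of the same steps without Spark. The object name PaddedSplitDemo and the intermediate value names are illustrative assumptions, not part of the answer above.

```scala
object PaddedSplitDemo {
  // The question's row, represented as a string.
  val raw: String = "[hello,one,,,]"

  // Strip the brackets, then surround each comma with spaces.
  val padded: String =
    raw.replace("[", "").replace("]", "").replaceAll(",", " , ")

  // Each field now contains at least one space, so the default split
  // has no trailing empty strings to discard.
  val fields: Array[String] = padded.split(",").map(_.replace(" ", ""))

  def main(args: Array[String]): Unit = {
    println(fields.length)
    println(fields.mkString("[", ", ", "]"))
  }
}
```

One caveat with this approach: replace(" ", "") also removes spaces inside a value, so a field such as "hello world" would come back as "helloworld". Using trim in place of that replace call would keep interior spaces while still removing the padding.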