Mock a Spark RDD in the unit tests


Is it possible to mock an RDD without using a SparkContext?

I want to unit test the following utility function:

 def myUtilityFunction(data1: org.apache.spark.rdd.RDD[myClass1], data2: org.apache.spark.rdd.RDD[myClass2]): org.apache.spark.rdd.RDD[myClass1] = {...}

So I need to pass data1 and data2 to myUtilityFunction. How can I create data1 as a mock org.apache.spark.rdd.RDD[myClass1], instead of creating a real RDD from a SparkContext? Thank you!

There are 2 answers

eliasah (accepted answer)

I totally agree with @Holden on that!

Mocking RDDs is difficult; executing your unit tests in a local Spark context is preferred, as recommended in the programming guide.

I know this may not technically be a unit test, but it is hopefully close enough.

Unit Testing

Spark is friendly to unit testing with any popular unit test framework. Simply create a SparkContext in your test with the master URL set to local, run your operations, and then call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or the test framework’s tearDown method, as Spark does not support two contexts running concurrently in the same program.
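For illustration, here is a minimal ScalaTest sketch of that pattern. The case classes and the body of myUtilityFunction below are hypothetical stand-ins for the question's types; only the setup/teardown pattern is the point.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    // Hypothetical stand-ins for the question's myClass1 / myClass2.
    case class MyClass1(id: Int, value: String)
    case class MyClass2(id: Int, weight: Double)

    class MyUtilityFunctionSuite extends FunSuite with BeforeAndAfterAll {

      @transient private var sc: SparkContext = _

      // Placeholder body so the sketch is self-contained; in practice you
      // would test the real myUtilityFunction from the question instead.
      def myUtilityFunction(data1: RDD[MyClass1], data2: RDD[MyClass2]): RDD[MyClass1] = {
        val ids = data2.map(_.id).collect().toSet
        data1.filter(r => ids.contains(r.id))
      }

      override def beforeAll(): Unit = {
        // Master URL set to local, as the programming guide recommends.
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
      }

      override def afterAll(): Unit = {
        // Tear the context down here: Spark does not support two
        // concurrent contexts in the same program.
        if (sc != null) sc.stop()
      }

      test("keeps only the elements of data1 whose ids appear in data2") {
        val data1 = sc.parallelize(Seq(MyClass1(1, "a"), MyClass1(2, "b")))
        val data2 = sc.parallelize(Seq(MyClass2(1, 0.5)))

        assert(myUtilityFunction(data1, data2).collect().map(_.id).toSeq == Seq(1))
      }
    }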

But if you are really interested and you still want to try mocking RDDs, I'd suggest reading the ImplicitSuite test code in Spark's source tree.

The only reason they pseudo-mock the RDD there is to test whether an implicit conversion works well with the compiler; they don't actually need a real RDD.

def mockRDD[T]: org.apache.spark.rdd.RDD[T] = null

And it's not even a real mock: it just returns a null reference typed as RDD[T], so it must never actually be used at runtime.
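To illustrate what that buys you (a sketch using the question's hypothetical myUtilityFunction and classes): such a pseudo-mock only lets the compiler check the call site; actually invoking anything on it would fail.

    // Pseudo-mock in the style of ImplicitSuite: satisfies the type
    // checker, but dereferencing it throws a NullPointerException.
    def mockRDD[T]: org.apache.spark.rdd.RDD[T] = null

    // Never called at runtime; it exists only so the compiler verifies
    // that myUtilityFunction accepts these RDD types.
    def compileCheck(): org.apache.spark.rdd.RDD[myClass1] =
      myUtilityFunction(mockRDD[myClass1], mockRDD[myClass2])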

Holden

RDDs are pretty complex; mocking them is probably not the best way to create test data. Instead, I'd recommend using sc.parallelize with your data. I also think (though I'm somewhat biased) that https://github.com/holdenk/spark-testing-base can help by providing a trait to set up and tear down the SparkContext for your tests.
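As a sketch of that approach (again with hypothetical stand-in classes, assuming spark-testing-base is on the test classpath and that a myUtilityFunction like the question's is in scope), the SharedSparkContext trait provides a ready-made sc and handles setup and teardown for you:

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.scalatest.FunSuite

    // Hypothetical stand-ins for the question's myClass1 / myClass2.
    case class MyClass1(id: Int, value: String)
    case class MyClass2(id: Int, weight: Double)

    class MyUtilitySharedContextSuite extends FunSuite with SharedSparkContext {
      test("myUtilityFunction on a small parallelized dataset") {
        // SharedSparkContext sets up and tears down `sc` around the suite.
        val data1 = sc.parallelize(Seq(MyClass1(1, "a"), MyClass1(2, "b")))
        val data2 = sc.parallelize(Seq(MyClass2(1, 0.5)))

        // myUtilityFunction: the question's function (or the placeholder
        // from the earlier sketch), assumed to be in scope here.
        val result = myUtilityFunction(data1, data2).collect()
        assert(result.map(_.id).toSet == Set(1))
      }
    }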