I'm trying to run a PySpark UDF and measure code coverage inside the UDF itself. It isn't fully working yet, but I've found something that might hint at how to make it work.
I'm using Python 3.8, PySpark 3.2.3 and coverage 6.5.0.
Example:
file: module1.py
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
spark = SparkSession.builder \
    .appName("SimpleSparkUDF") \
    .getOrCreate()
data = [("John", 25)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
def add_one_udf(age):
    import os
    import subprocess
    import coverage
    import module2

    # Set up coverage: write a temporary config file in the executor's working directory
    current_directory = os.getcwd()
    content = f"""
[run]
branch = True
cover_pylib = False
concurrency = multiprocessing,thread
parallel = True
data_file = {current_directory}/.coverage
"""
    with open("coveragerc_temp", "w") as file:
        file.write(content.strip())
    os.environ["COVERAGE_PROCESS_START"] = "coveragerc_temp"
    cov = coverage.process_startup()
    print("This line isn't covered by coverage.py")
    module2.foo()
    cov.stop()
    cov.save()
    coverage_data_files = [
        f"{current_directory}/{name}" for name in os.listdir(current_directory) if name.startswith(".coverage")
    ]
    # Send the .coverage files back to my local machine
    subprocess.run(["scp", "-o", "StrictHostKeyChecking=no", *coverage_data_files, os.environ["localhost"]])
    return age + 1
add_one_udf_spark = udf(add_one_udf, IntegerType())
result = df.withColumn("AgePlusOne", add_one_udf_spark(df["Age"]))
result.show()
file: module2.py
def foo():
    print("This line is covered by coverage.py")
    print("This line is covered by coverage.py")
    print("This line is covered by coverage.py")
On my local machine, I received one .coverage file. After mapping the paths to my local machine and running coverage combine followed by coverage report, I can see that the lines from module2.py are covered (75% of the lines, everything except the function signature). However, module1.py doesn't appear to be covered at all. I also tried debugging with the trace option in the [debug] section of the config, and module1.py isn't mentioned there either.
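A minimal sketch of the path-mapping and combine step described above, assuming the remapping is done through coverage.py's [paths] setting (the first entry is the local directory, later entries are aliases that get rewritten to it). The helper names, the config file name and the directory paths are hypothetical placeholders, not the exact setup used here.

```python
import subprocess
from pathlib import Path


def write_combine_rc(local_src: str, remote_src: str, rc_path: str = "coveragerc_combine") -> str:
    """Write a coverage config whose [paths] section maps remote_src onto local_src.

    Both directories are placeholders; substitute the real local and executor paths.
    """
    content = (
        "[paths]\n"
        "source =\n"
        f"    {local_src}\n"   # canonical (local) location, listed first
        f"    {remote_src}\n"  # alias as recorded on the executor
    )
    Path(rc_path).write_text(content)
    return content


def combine_and_report(rc_path: str = "coveragerc_combine") -> None:
    """Merge the .coverage.* data files in the current directory and print a report."""
    subprocess.run(["coverage", "combine", f"--rcfile={rc_path}"], check=True)
    subprocess.run(["coverage", "report", f"--rcfile={rc_path}"], check=True)
```

Calling write_combine_rc("/home/me/project", "/path/on/executor") and then combine_and_report() from the directory holding the copied data files would reproduce the combine + report step; both paths in that call are made up for illustration.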
Does anyone have insights into why module1.py isn't covered at all, unlike module2.py?