I am trying to get OpenLineage information from a pyspark program. As an MVP I'm trying to run spark locally on my machine (this works) and somehow log the openlineage messages that would be sent to an openlineage server. However, I don't seem to get this right. In the logs when running the spark-submit job nothing openlineage related is ever mentioned.
I have the following file openlineage_spark.py
:
from pyspark.sql import SparkSession
if __name__ == "__main__":
spark = SparkSession.builder \
.appName("PySpark OpenLineage Example") \
.config("spark.driver.extraJavaOptions", "-Dlog4j.configuration=file:log4j-openlineage.xml") \
.config("openlineage.url", "http://localhost:5000") \
.config("openlineage.backend", "io.openlineage.spark.agent.backends.ConsoleLoggingBackend") \
.config("openlineage.namespace", "test") \
.getOrCreate()
# Example PySpark job
df = spark.read.text("some_input_file.txt")
df.show()
# Perform some transformation
transformed_df = df.selectExpr("length(value) as length")
transformed_df.show()
# Write data
transformed_df.write.mode('overwrite').csv("some_output_file.csv")
spark.stop()
And I call this with:
spark-submit --jars openlineage-spark-1.4.1.jar --files log4j-openlineage.xml openlineage_spark.py
The result of this spark-submit
is all the logging you would normally expect, the dataframes are printed etc. However, nothing is mentioned about the lineage.
The content of the xml files looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="warn">
<Appenders>
<Console name="Console" target="SYSTEM_OUT">
<PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
</Console>
</Appenders>
<Loggers>
<Logger name="io.openlineage.spark.agent" level="info">
<AppenderRef ref="Console"/>
</Logger>
<Root level="error">
<AppenderRef ref="Console"/>
</Root>
</Loggers>
</Configuration>
Which should log the openlineage stuff to the console?
Can anyone help me to set up an MVP in which I can clearly see something openlineage related being logged or send?
Here are the steps you need to follow to get OpenLineage working. It took me sometime to get it working properly.
Following this guide :
https://openlineage.io/docs/development/ol-proxy
This will start your proxy local OpenLineage server whose output you can see in the console. Make sure you have Java 11 installed and no other java version. Because it works only with Java 11.
Once the above server is started without any error. Run the following pyspark program. The job events will be sent to the above server through the http api backend and the server will output the events in its console output which you can see on the terminal.
Console output :
Tail end of the output on my terminal.