Delete records from table before writing dataframe - pyspark


I'm trying to delete records from my table before writing data into it from a dataframe. It's not working for me ... What am I doing wrong?

Goal: "delete from xx_files_tbl" before writing new dataframe to table.
 
query = "(delete from xx_files_tbl)"
spark.write.format("jdbc")\
            .option("url", "jdbc:sqlserver://"+server+":1433;databaseName="+db_name)\
            .option("driver", driver_name)\
            .option("dbtable", query)\
            .option("user", user)\
            .option("password", password)\
            .option("truncate", "true")\
            .save()

Thanks.


There are 4 answers

SD3 (Best Answer)

Instead of deleting the data in the SQL Server table before writing your dataframe, you can write your dataframe directly with .mode("overwrite") and .option("truncate", "true").

https://learn.microsoft.com/en-us/sql/big-data-cluster/spark-mssql-connector?view=sql-server-ver15
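
A minimal sketch of that approach, assuming df is the DataFrame you want to write and reusing the connection variables from the question; with overwrite mode plus truncate, the connector truncates the existing table rather than dropping and recreating it:

df.write.format("jdbc")\
            .option("url", "jdbc:sqlserver://"+server+":1433;databaseName="+db_name)\
            .option("driver", driver_name)\
            .option("dbtable", "xx_files_tbl")\
            .option("user", user)\
            .option("password", password)\
            .option("truncate", "true")\
            .mode("overwrite")\
            .save()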

Pavan Kumar Gopidesu

You cannot delete the data, as DataFrames are immutable. You can apply a filter, create a new DataFrame, and write that to your location. Something like this should help, I think (a sketch of the write step follows the line below):

newdf=spark.sql("select * from xx_files_tbl WHERE value <= 1")
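
To complete the write step, a short sketch reusing the connection options from the question; the target table name here is only a placeholder:

newdf.write.format("jdbc")\
            .option("url", "jdbc:sqlserver://"+server+":1433;databaseName="+db_name)\
            .option("driver", driver_name)\
            .option("dbtable", "xx_files_tbl_filtered")\
            .option("user", user)\
            .option("password", password)\
            .mode("overwrite")\
            .save()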

Aleksejs R

The Spark documentation says that dbtable is used for passing the table that should be read from or written to. A query with a FROM clause can be used only while reading data with the JDBC connector. (Resource: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html)

My suggestion would be either to use the overwrite write mode or to open a separate connection for the deletion. Spark is not required for deleting data from the SQL Server table: it is enough to use a Python database connector (for example pyodbc) or to open a separate JDBC connection.
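
For the second option, a minimal sketch of deleting the rows over a separate connection before the Spark write; it assumes pyodbc and the "ODBC Driver 17 for SQL Server" are installed, and reuses the connection variables from the question:

import pyodbc

# Open a plain database connection, independent of Spark
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=" + server + ";DATABASE=" + db_name + ";"
    "UID=" + user + ";PWD=" + password
)
cursor = conn.cursor()
cursor.execute("delete from xx_files_tbl")  # clear the table first
conn.commit()
conn.close()

# ...then write the dataframe with the JDBC connector as usual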

Ravi Sampara

Executing DML operations like this is a standing limitation of PySpark. To work around it, I created a simple stored procedure in SQL Server that accepts any DML statement as a parameter, and I call that procedure from PySpark to run the DML operations in SQL Server. It has been working fine for me so far.

Create PROCEDURE DBO.dml_operations (@query varchar(2500))
AS
BEGIN
    SET NOCOUNT ON;

    -- print(@query) if you want to see how this is being passed on to the procedure
    exec(@query)
    select 0

END
GO

declare @query varchar(2500)
set @query = 'update <> set << my field >> = 4.33 where << char field >> = ''Something'''

exec DBO.dml_operations @query

I know there are different ways to run stored procedures from PySpark.

Let me know if you want to know how to run stored procedures from PySpark; one possible approach is sketched below.
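
In case it helps, here is a minimal sketch of one way to do it, calling the procedure through the JVM's java.sql.DriverManager via py4j; it assumes the SQL Server JDBC driver jar is already on the Spark driver's classpath and reuses the connection variables from the question:

jdbc_url = "jdbc:sqlserver://" + server + ":1433;databaseName=" + db_name

# Borrow the JVM that PySpark is already running to open a plain JDBC connection
jvm = spark.sparkContext._gateway.jvm
conn = jvm.java.sql.DriverManager.getConnection(jdbc_url, user, password)
stmt = conn.createStatement()

# Run any DML through the stored procedure defined above
stmt.execute("exec DBO.dml_operations 'delete from xx_files_tbl'")

stmt.close()
conn.close()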