I am trying to update a few fields of each row of a big MySQL table (close to 500 million rows). The table doesn't have any primary key (or only a string primary key, like a UUID). I don't have enough executor memory to read and hold the entire data at once. Can anyone please let me know what my options are for processing such a table?
Below is the schema:
CREATE TABLE Persons ( Personid varchar(255) NOT NULL, LastName varchar(255) NOT NULL, FirstName varchar(255) DEFAULT NULL, Email varchar(255) DEFAULT NULL, Age int(11) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The Spark code is as follows:
SparkSession spark = SparkSession.builder().master("spark://localhost:7077").appName("KMASK").getOrCreate();
// JDBC connection settings (url, dbtable, user, password) are held in `options`
Dataset<Row> rawDataFrame = spark.read().format("jdbc").options(options).load();
rawDataFrame.createOrReplaceTempView("data");
// encrypt is a UDF registered on the session
String sql = "select Personid, LastName, FirstName, encrypt(Email) as Email, Age from data";
Dataset<Row> newData = spark.sql(sql);
newData.write().mode(SaveMode.Overwrite).format("jdbc").options(options).save();
This table has around 150 million records and the data is around 6 GB in size. My executor memory is just 2 GB. Can I process this table using Spark JDBC?
Ideally you can alter the Spark JDBC fetchsize option to reduce/increase how many records are fetched and processed each time. Partitioning the data can also help to reduce shuffles and additional overhead. Since you have Age as a numerical field, you can also process the data in partitions determined by Age: first determine the min and max age, then use the Spark JDBC options, notably (see the sketch after this list):
partitionColumn: Age
lowerBound: the min age you identified
upperBound: the max age you identified
numPartitions: really depends on the number of cores and worker nodes, but more hints and links are here
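For example, a minimal sketch of such a partitioned read with fetchsize set; the JDBC URL, credentials, bounds and partition count are placeholders you would replace with your own values:
Dataset<Row> partitionedRead = spark.read()
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/mydb")  // placeholder URL
    .option("user", "user")                             // placeholder credentials
    .option("password", "password")
    .option("dbtable", "Persons")
    .option("fetchsize", "10000")        // rows fetched per round trip
    .option("partitionColumn", "Age")    // numeric column used to split the read
    .option("lowerBound", "1")           // the min age you identified
    .option("upperBound", "100")         // the max age you identified
    .option("numPartitions", "20")       // tune to your cores and workers
    .load();
Each partition then reads only its own Age range, so a 2 GB executor never has to hold the whole table at once.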
You may also use custom queries, via the query option, to select and update only a batch of records at a time that fits in memory. NB: when using the query option you should not use the dbtable option.
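A rough sketch of a batched read with the query option (available since Spark 2.4); the WHERE clause, URL and credentials are hypothetical, and you would loop over whatever ranges fit your memory budget:
String batchQuery = "select Personid, LastName, FirstName, Email, Age from Persons where Age between 20 and 29";
Dataset<Row> batch = spark.read()
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/mydb")  // placeholder URL
    .option("user", "user")                             // placeholder credentials
    .option("password", "password")
    .option("query", batchQuery)   // mutually exclusive with dbtable
    .option("fetchsize", "10000")
    .load();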