I have a dataset in S3 stored as gzipped text (.gz), and I am using spark.read.csv to read it into Spark.
This is about 100GB of data with 150 columns, but I only need 5 of them, so I select just those 5 columns to reduce the breadth of the data.
In this scenario, does Spark scan the complete 100GB of data, or does it smartly read only those 5 columns without scanning the rest (as it would with a columnar format)?
Any help on this would be appreciated.
from pyspark.sql.functions import col

imp_feed = (spark.read.csv('s3://mys3-loc/input/', schema=impressionFeedSchema, sep='\t')
                 .where(col('dayserial_numeric').between(start_date_imp, max_date_imp))
                 .select('col1', 'col2', 'col3', 'col4'))
No: because the data is row-oriented CSV inside non-splittable gzip files, Spark has to decompress and scan the entire 100GB even though you only keep 5 columns. So make step 1 of your workflow reading in the CSV files and saving them as Snappy-compressed ORC or Parquet.
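Roughly like this sketch (the s3://mys3-loc/input_parquet/ output path is just a placeholder; impressionFeedSchema, start_date_imp and max_date_imp are the names from your question):

from pyspark.sql.functions import col

# One-time conversion: a full scan of the gzipped CSV is unavoidable here,
# but every later job benefits from the columnar layout.
(spark.read.csv('s3://mys3-loc/input/', schema=impressionFeedSchema, sep='\t')
      .write
      .option('compression', 'snappy')   # snappy is Parquet's default in Spark; set explicitly for clarity
      .parquet('s3://mys3-loc/input_parquet/'))

# Later reads only touch the requested columns on disk, and the date filter
# can be pushed down to skip whole row groups using Parquet's min/max stats.
imp_feed = (spark.read.parquet('s3://mys3-loc/input_parquet/')
                 .where(col('dayserial_numeric').between(start_date_imp, max_date_imp))
                 .select('col1', 'col2', 'col3', 'col4'))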
Then go to whoever creates those files and tell them to stop it. At the very least they should switch to Avro + Snappy: Avro compresses per block, so the files stay splittable, which makes the initial parse and the migration to a columnar format much easier to parallelise.
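On the producer side that change is small. A rough sketch, assuming the external spark-avro package is on the classpath and that source_df and the output path are placeholders:

# Needs the spark-avro package, e.g.
#   spark-submit --packages org.apache.spark:spark-avro_2.12:<your-spark-version> ...
(source_df.write
    .format('avro')
    .option('compression', 'snappy')   # block-level compression, files remain splittable
    .save('s3://mys3-loc/input_avro/'))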