I'm trying to read data from a specific folder in my S3 bucket. The data is in Parquet format, and I'm reading it with awswrangler:
import awswrangler as wr

# read the Parquet dataset under this S3 prefix into a pandas DataFrame
data = wr.s3.read_parquet("s3://bucket-name/folder/with/parquet/files/", dataset=True)
This returns a pandas dataframe:
client_id   center  client_lat  client_lng  inserted_at  matrix_updated
0700292081  BFDR    -23.6077    -46.6617    2021-04-19   2021-04-19
7100067781  BFDR    -23.6077    -46.6617    2021-04-19   2021-04-19
7100067787  BFDR    -23.6077    -46.6617    2021-04-19   2021-04-19
However, instead of a pandas DataFrame I would like to store the data retrieved from my S3 bucket in a Spark DataFrame. I've tried doing this (which is my own question), but it doesn't seem to be working correctly.
Is there any way I could load this data into a Spark DataFrame using awswrangler? If you have an alternative approach, I would also like to read about it.
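For reference, the kind of conversion I have in mind looks roughly like the sketch below; it only illustrates the idea (the SparkSession setup and names are simplified placeholders, not my exact code):

import awswrangler as wr
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-parquet-to-spark").getOrCreate()

# read the Parquet dataset from S3 into pandas, as above
pdf = wr.s3.read_parquet("s3://bucket-name/folder/with/parquet/files/", dataset=True)

# hand the pandas DataFrame to Spark
sdf = spark.createDataFrame(pdf)
sdf.show(5)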
I didn't use awswrangler. Instead, I used the following code, which I found on this GitHub:
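Something roughly along these lines, reading the Parquet files directly with Spark over the s3a connector (the hadoop-aws package version and session settings below are placeholders you will need to adapt to your own setup):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-parquet-direct-read")
    # package coordinates are an assumption; match them to your Spark/Hadoop build
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
    .getOrCreate()
)

# note the s3a:// scheme used by the Hadoop S3 connector; AWS credentials are
# assumed to come from the environment, credentials file, or instance profile
df = spark.read.parquet("s3a://bucket-name/folder/with/parquet/files/")
df.show(5)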