Load Snowflake data into Pandas dataframe using AWS Sagemaker


I'm attempting to read data from Snowflake into a Pandas DataFrame using Jupyter Lab on AWS SageMaker. I can successfully load 100 rows, but not 2.2 million rows.

  • I gave up loading the complete data after ~20 minutes
  • On my local computer it takes 4 minutes; if I switch to fetching in batches on my local computer it takes longer (~10 minutes), but I could try this on SageMaker? (A batch-fetching sketch follows the code block below.)
  • The command is sent successfully to Snowflake (checked via the query history log)
  • Materializing the underlying view into a table on Snowflake showed a size of 222.5 MB

Steps:

  • Run: !pip install snowflake-connector-python[pandas]

  • Restart kernel

  • Execute:

      import snowflake.connector

      ctx = snowflake.connector.connect(
          user='<username>',
          account='<account>',
          password='<password>',
          warehouse='<wh>',
          database='<db>',
          role='<role>',
          schema='<schema>'
      )
      cur = ctx.cursor()
      sql = """
              SELECT * FROM <db>.<schema>.<view>
      """
      cur.execute(sql)
      # Fetch the result set from the cursor and deliver it as a Pandas DataFrame.
      df = cur.fetch_pandas_all()

      cur.close()
      ctx.close()
    
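For reference, the batch-fetching approach mentioned in the list above could look roughly like this. This is only a sketch: it reuses the same `ctx` connection and placeholder names from the snippet above, and relies on `fetch_pandas_batches()`, which yields the result set as a sequence of smaller DataFrames instead of one large one.

      import pandas as pd

      # Reuses the 'ctx' connection created in the snippet above.
      cur = ctx.cursor()
      cur.execute("SELECT * FROM <db>.<schema>.<view>")

      # Concatenate the per-batch DataFrames into one; if even that exceeds memory,
      # each batch could instead be processed or written to disk individually.
      df = pd.concat(cur.fetch_pandas_batches(), ignore_index=True)

      cur.close()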

Are there any specific configurations I need to use to make this work?


1 Answer

Answered by user1420372

In this case, sizing up the on-demand notebook instance from ml.tx.xlarge (2 vCPU, 8 GiB) worked. See Amazon SageMaker Pricing for notebook instance CPU/memory specifications.
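As a rough sanity check (not part of the original post), the in-memory footprint of the loaded DataFrame can be inspected directly; the 222.5 MB reported by Snowflake is compressed columnar storage, so the pandas DataFrame can occupy considerably more memory than that.

      # 'df' is the DataFrame returned by fetch_pandas_all() in the question's code.
      # deep=True also counts object/string columns, which usually dominate memory use.
      mem_gib = df.memory_usage(deep=True).sum() / 1024**3
      print(f"DataFrame occupies ~{mem_gib:.2f} GiB in memory")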

In an earlier attempt to fix the problem, we had increased the volume size, but that only provides storage for data and didn't help with memory (see Customize your notebook volume size, up to 16 TB, with Amazon SageMaker for more details on storage volumes); so we were able to decrease the volume size back from 50 GB EBS to 10 GB EBS.

Memory can be monitored by opening a terminal from the Jupyter interface and running the Linux command free.
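The same check can also be run directly from a notebook cell via IPython's shell escape (assuming a Linux-based notebook instance):

      # '!' is IPython's shell escape; 'free -m' reports total/used/free memory in MiB.
      !free -m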