Load Python Pickle File from S3 Bucket to Sagemaker Notebook

I have attempted the code from many posts on how to load a pickle file (1.9 GB) from an S3 bucket, but none of it seems to work on our AWS SageMaker notebook instance. The notebook volume size is 50 GB.

Some of the methods attempted:

Method 1

import io
import pickle
import boto3

client = boto3.client('s3')
bytes_buffer = io.BytesIO()
client.download_fileobj(Bucket=my_bucket, Key=my_key_path, Fileobj=bytes_buffer)

bytes_buffer.seek(0)  # rewind to the start of the buffer before reading
byte_value = pickle.load(bytes_buffer)

This gives an error (screenshot not preserved; per the answer below, it was a memory error).

Method 2: This actually gets me something back with no error:

import io
import sys
import boto3

client = boto3.client('s3')
bytes_buffer = io.BytesIO()
client.download_fileobj(Bucket=my_bucket, Key=my_key_path, Fileobj=bytes_buffer)
byte_value = bytes_buffer.getvalue()  # copies the buffer contents into a bytes object

sys.getsizeof(byte_value) / (1024 ** 3)  # object size in GiB

This returns 1.93.

But how do I convert byte_value back into the original pickled object? I tried this:

pickled_data = pickle.loads(byte_value)

But the kernel "crashed": it went idle and I lost all variables.
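A way to check whether memory is the culprit before retrying (a minimal sketch, assuming the psutil package is available in the notebook environment and reusing byte_value from Method 2):

import pickle
import psutil

available_gib = psutil.virtual_memory().available / (1024 ** 3)
payload_gib = len(byte_value) / (1024 ** 3)

# Unpickling needs room for the reconstructed object on top of the
# raw bytes already held in memory, so require some headroom first.
if available_gib > 2 * payload_gib:
    pickled_data = pickle.loads(byte_value)
else:
    print(f"Only {available_gib:.1f} GiB available for a {payload_gib:.1f} GiB payload")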

1 Answer

user1420372

(In hindsight the solution was obvious, but it wasn't to me on my first day in the AWS SageMaker world.) A memory error means you need to increase the size of your notebook instance.

In this case, sizing up the On-Demand Notebook Instance from ml.tx.xlarge (2 vCPU, 8 GiB) to ml.tx.2xlarge (4 vCPU, 16 GiB) worked. See Amazon SageMaker Pricing for notebook instance CPU/memory specifications.

In an earlier attempt to fix the problem, we had increased the volume size, but that is storage for data and didn't help with memory (see Customize your notebook volume size, up to 16 TB, with Amazon SageMaker for more details on storage volumes). Once the instance was resized, we were able to decrease the volume size from 50 GB EBS to 10 GB EBS.

Memory can be monitored by opening a terminal from the Jupyter interface and running the Linux command free.
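For example, from a notebook cell (the "!" prefix hands the line to the shell, so no separate terminal is needed):

!free -h  # human-readable totals for used and available memory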

To load the pickled dataframe, I simply used the solution from @kindjacket in this post: How to load a pickle file from S3 to use in AWS Lambda?, which was as follows:

import pickle
import boto3

s3 = boto3.resource('s3')
# Read the object body directly from S3 and unpickle it in one call
my_pickle = pickle.loads(s3.Bucket("bucket_name").Object("key_to_pickle.pickle").get()['Body'].read())
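If memory is still tight, a variant worth considering (a sketch; the local path is hypothetical): download the object to the notebook's EBS volume first and unpickle from the open file, so pickle reads the bytes from disk instead of holding a full in-memory copy alongside the object it is building:

import pickle
import boto3

s3 = boto3.client('s3')
local_path = '/tmp/data.pkl'  # hypothetical path on the notebook's EBS volume
s3.download_file('bucket_name', 'key_to_pickle.pickle', local_path)

with open(local_path, 'rb') as f:
    my_pickle = pickle.load(f)  # reads from the file incrementally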