I have attempted the code from many posts on how to load a pickle file (1.9 GB) from an S3 bucket, but none seem to work on our notebook instance in AWS SageMaker. The notebook instance's storage volume is 50 GB.
Some of the methods attempted:
Method 1
import io
import pickle
import boto3
client = boto3.client('s3')
bytes_buffer = io.BytesIO()
# Download the S3 object into an in-memory buffer, then unpickle from it
client.download_fileobj(Bucket=my_bucket, Key=my_key_path, Fileobj=bytes_buffer)
bytes_buffer.seek(0)
byte_value = pickle.load(bytes_buffer)
This gives an error.
Method 2
This actually gets me something back with no error:
client = boto3.client('s3')
bytes_buffer = io.BytesIO()
client.download_fileobj(Bucket=my_bucket, Key=my_key_path, Fileobj=bytes_buffer)
byte_value = bytes_buffer.getvalue()
import sys
sys.getsizeof(byte_value)/(1024**3)
This returns 1.93, i.e. byte_value is roughly 1.93 GiB.
But how do I convert byte_value back into the original (unpickled) object? I tried this:
pickled_data = pickle.loads(byte_value)
But the kernel "crashed": it went idle and I lost all my variables.
(In hindsight the solution was obvious, but it wasn't to me on my first day in the AWS SageMaker world) ... a memory error means you need a notebook instance with more memory.
In this case, sizing up the On-Demand Notebook Instance from ml.tx.xlarge (2 vCPU, 8 GiB) to ml.tx.2xlarge (4 vCPU, 16 GiB) worked. See Amazon SageMaker Pricing for notebook instance CPU/memory specifications.
In an earlier attempt to fix the problem, we had increased the volume size, but that only provides storage for data and doesn't help with memory (see Customize your notebook volume size, up to 16 TB, with Amazon SageMaker for more details on the storage volume); so we were able to decrease the volume size from 50 GB EBS back down to 10 GB EBS.
Memory can be monitored by opening a terminal from the Jupyter interface and typing the Linux command
free
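If you'd rather check from inside the notebook itself, a small sketch like the following reads /proc/meminfo (present on the Linux hosts that notebook instances run on); the helper name is just for illustration:

def available_memory_gib():
    # /proc/meminfo has lines such as "MemAvailable:  12345678 kB"
    with open("/proc/meminfo") as f:
        meminfo = dict(line.split(":") for line in f)
    kb = int(meminfo["MemAvailable"].strip().split()[0])  # value is reported in kB
    return kb / (1024 ** 2)  # convert to ~GiB

print(f"{available_memory_gib():.2f} GiB available")

Calling this before and after pickle.load makes it easy to see how close you are to the instance's limit.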
To load the pickled dataframe, I simply used the solution from @kindjacket in this post: How to load a pickle file from S3 to use in AWS Lambda?, which was as follows:
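From memory, that answer boils down to reading the whole object body via boto3's resource API and unpickling it in one call; the sketch below reuses the my_bucket and my_key_path names from the question, so adjust them to your own:

import pickle
import boto3

s3 = boto3.resource('s3')
# Read the full object body from S3 and unpickle it into the dataframe
df = pickle.loads(s3.Bucket(my_bucket).Object(my_key_path).get()['Body'].read())

Note that this still pulls the full ~1.9 GB into memory before unpickling (plus the memory for the deserialized dataframe itself), which is exactly why the larger instance was needed.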