I've been trying to get SageMaker to run my job for a week and keep running into the same issue no matter what I try.
Folder structure:

```
root/
├── __init__.py
├── train.py
└── scripts/    # contains my_dataloader.py and my_model.py
```
train.py imports the two scripts. All of these files are packaged into a tarball in my S3 bucket.
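For context, this is roughly how I package and upload the code (a sketch; the bucket key shown here is illustrative, not my exact value):

```python
import tarfile
import boto3

# Build the source tarball: entry point, package marker, and the scripts/ folder
with tarfile.open("sourcedir.tar.gz", "w:gz") as tar:
    tar.add("__init__.py")
    tar.add("train.py")
    tar.add("scripts")

# Upload it to S3 (bucket/key are illustrative)
boto3.client("s3").upload_file(
    "sourcedir.tar.gz", "sagemaker-score", "code/sourcedir.tar.gz"
)
```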
No matter what I try, I keep getting the following error:
```
AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "ModuleNotFoundError: No module named 'my_dataloader'"
Command "/opt/conda/bin/python3.8 train.py --num-epochs 10",
exit code: 1
```
This is my estimator:
```python
import sagemaker
from sagemaker.pytorch import PyTorch

# Get the SageMaker execution role
role = sagemaker.get_execution_role()
print(role)

# Create a SageMaker Estimator
estimator = PyTorch(
    entry_point='train.py',                     # Entry point file
    role=role,                                  # IAM role obtained above
    framework_version='1.10.0',                 # Update the version as necessary
    py_version='py38',                          # Python version
    instance_count=1,                           # Number of instances to use for training
    instance_type='ml.p3.2xlarge',              # Type of instance to use for training
    hyperparameters={
        'num-epochs': 10,                       # Matching the parameter in train.py
    },
    output_path='s3://sagemaker-score/output',  # S3 location for output artifacts
    debugger_hook_config=False,                 # Adjust based on whether you need the debugger hook
)

# The fit method points to the S3 bucket and prefix where the training data is located.
estimator.fit({'training': 's3://sagemaker-studio-v6q9hsuoomd/labeled/'})
```
This is the relevant part of train.py that performs the imports:
```python
import argparse
import os
import sys

import torch
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
from sklearn.metrics import mean_absolute_error
from torch.optim.lr_scheduler import ReduceLROnPlateau


# Parse arguments passed to the script
def parse_arguments():
    parser = argparse.ArgumentParser()
    # SageMaker container environment
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--data-dir', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
    parser.add_argument('--num-epochs', type=int, default=10)
    return parser.parse_args()


def main(args):
    # Log the current working directory and files at the start
    print("Starting training script...")
    print("Current working directory at start: ", os.getcwd())
    print("Files in current directory at start: ", os.listdir('.'))

    # Determine the device to run the model on (GPU or CPU)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # extracted_model_path should point to the directory containing my_model.py and my_dataloader.py.
    # This is set up by SageMaker when 'source_dir' is specified in the estimator.
    extracted_model_path = '/opt/ml/code/scripts'  # Adjusted to point to the 'scripts' subfolder

    # Insert the path so the Python interpreter knows where to find the modules
    sys.path.insert(0, extracted_model_path)

    # Log the current working directory and files before imports
    print("Before imports...")
    print("Current working directory: ", os.getcwd())
    print("Files in current directory: ", os.listdir('.'))

    # Import the dataloader and model from the 'scripts' subfolder after setting sys.path
    from my_dataloader import load_enhanced_data_dataset
    from my_model import PhotoQualityNet
```
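For debugging, I'm considering adding a check like this right before the imports to confirm what actually lands in the container (not in my script yet, just a sketch):

```python
import os

code_dir = '/opt/ml/code'
scripts_dir = os.path.join(code_dir, 'scripts')

# Print what actually exists in the container so the failing import can be traced
for path in (code_dir, scripts_dir):
    if os.path.isdir(path):
        print(path, "->", sorted(os.listdir(path)))
    else:
        print(path, "does not exist in this container")
```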
What can I do to get my job to run correctly with the imported files?
I've tried:

- Moving the imported files from the root directory to a subfolder
- Renaming the files
- Defining `source_dir` in the estimator (sketch of what I mean below)
- Validating that my AWS roles have the right permissions

What am I missing?
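For reference, this is roughly what I mean by "defining `source_dir`": the same estimator with `source_dir` pointed at my project root. This is a sketch from memory, so the exact values may not match what I actually ran:

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train.py',
    source_dir='.',                  # project root containing train.py and scripts/
    role=role,
    framework_version='1.10.0',
    py_version='py38',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    hyperparameters={'num-epochs': 10},
    output_path='s3://sagemaker-score/output',
    debugger_hook_config=False,
)
```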