SageMaker PyTorch Job Not Finding Custom Dependency Files


I've been trying to get SageMaker to run my job for a week and keep running into the same issue no matter what I try.

Folder Structure

Root:

__init__.py
train.py
/scripts (contains my_dataloader.py and my_model.py)

train.py imports the two modules in /scripts. All of these files are packed in a tarball in my S3 bucket.
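One thing that may be worth checking is whether the tarball really contains the layout above, with scripts/ sitting next to train.py. The following is only a minimal sketch: it builds an in-memory tarball of empty placeholder files that mimics the structure described in this question and lists its members. The helper names list_tar_members and build_example_tarball are made up for illustration; in practice the same listing function could be pointed at the bytes of the real tarball downloaded from S3.

```python
import io
import tarfile


def list_tar_members(tar_bytes):
    """Return the sorted member names of a gzip-compressed tarball."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as tar:
        return sorted(tar.getnames())


def build_example_tarball():
    """Build an in-memory tarball of empty placeholder files.

    The layout mirrors the folder structure described in the question;
    it is a stand-in for the real tarball, not a copy of it.
    """
    names = (
        "__init__.py",
        "train.py",
        "scripts/my_dataloader.py",
        "scripts/my_model.py",
    )
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in names:
            info = tarfile.TarInfo(name=name)
            info.size = 0  # empty placeholder file
            tar.addfile(info, io.BytesIO(b""))
    return buf.getvalue()


members = list_tar_members(build_example_tarball())
print(members)
```

If the real tarball shows the two modules under a different prefix (or not at all), the import path used in train.py would need to match whatever the listing reports.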

No matter what I try, I keep getting the following error:

AlgorithmError: ExecuteUserScriptError: 
ExitCode 1 
ErrorMessage "ModuleNotFoundError: No module named 'my_dataloader' " 
Command "/opt/conda/bin/python3.8 train.py --num-epochs 10", 
exit code: 1

This is my estimator:

import sagemaker
from sagemaker.pytorch import PyTorch

# Get the SageMaker execution role
role = sagemaker.get_execution_role()
print(role)

# Create a SageMaker Estimator
estimator = PyTorch(
    entry_point='train.py',                             # Your entry point file
    role=role,                                          # IAM role obtained above
    framework_version='1.10.0',                         # Update the version as necessary
    py_version='py38',                                  # Python version
    instance_count=1,                                   # Number of instances to use for training
    instance_type='ml.p3.2xlarge',                      # Type of instance to use for training
    hyperparameters={
        'num-epochs': 10,                               # Matching the parameter in train.py
    },
    output_path='s3://sagemaker-score/output',          # S3 location for output artifacts
    debugger_hook_config=False,                         # Adjust based on whether you need the debugger hook
)

# The fit method should point to the S3 bucket and prefix where your training data is located.
estimator.fit({'training': 's3://sagemaker-studio-v6q9hsuoomd/labeled/'})
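Note that the estimator above does not pass source_dir, so as far as I understand only train.py would be packaged and copied into the container. Below is a hedged sketch of the same configuration with source_dir added. The helper names build_estimator_kwargs and make_estimator are hypothetical, and the source_dir value "." is an assumption (a local folder holding train.py and scripts/). The sagemaker import is deferred into make_estimator so the keyword arguments can be inspected without the SDK installed. Whether this alone resolves the error is not confirmed here.

```python
def build_estimator_kwargs(role, source_dir):
    """Collect the PyTorch estimator arguments from the question,
    plus source_dir so the whole folder is shipped to the container."""
    return {
        "entry_point": "train.py",      # path relative to source_dir
        "source_dir": source_dir,       # assumption: folder with train.py and scripts/
        "role": role,
        "framework_version": "1.10.0",
        "py_version": "py38",
        "instance_count": 1,
        "instance_type": "ml.p3.2xlarge",
        "hyperparameters": {"num-epochs": 10},
        "output_path": "s3://sagemaker-score/output",
        "debugger_hook_config": False,
    }


def make_estimator(role, source_dir="."):
    """Create the estimator; requires the sagemaker SDK to be installed."""
    from sagemaker.pytorch import PyTorch  # imported lazily on purpose

    return PyTorch(**build_estimator_kwargs(role, source_dir))


# Inspect the arguments without contacting AWS (placeholder role string).
kwargs = build_estimator_kwargs(role="example-role", source_dir=".")
print(kwargs["entry_point"], kwargs["source_dir"])
```

With source_dir set, entry_point is interpreted relative to that folder, so scripts/ would be expected to land next to train.py inside the container.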

This is the relevant part of train.py that performs the imports:

import argparse
import os
import sys
import torch
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
from sklearn.metrics import mean_absolute_error
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Parse arguments passed to the script
def parse_arguments():
    parser = argparse.ArgumentParser()

    # SageMaker Container environment
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--data-dir', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
    parser.add_argument('--num-epochs', type=int, default=10)

    return parser.parse_args()

def main(args):
    # Log the current working directory and files at the start
    print("Starting training script...")
    print("Current working directory at start: ", os.getcwd())
    print("Files in current directory at start: ", os.listdir('.'))

    # Determine the device to run the model on (GPU or CPU)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # extracted_model_path should point to the directory containing my_model.py and my_dataloader.py
    # This is set by SageMaker when the 'source_dir' is specified in the Estimator
    extracted_model_path = '/opt/ml/code/scripts'  # Adjusted to point to the 'scripts' subfolder

    # Insert the path so the Python interpreter knows where to find your modules
    sys.path.insert(0, extracted_model_path)

    # Log the current working directory and files before imports
    print("Before imports...")
    print("Current working directory: ", os.getcwd())
    print("Files in current directory: ", os.listdir('.'))

    # Import dataloader and model from the 'scripts' subfolder after setting the correct sys.path
    from my_dataloader import load_enhanced_data_dataset
    from my_model import PhotoQualityNet
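The hardcoded '/opt/ml/code/scripts' path above only helps if that folder actually exists in the container. A possible alternative, sketched here under the assumption that scripts/ is shipped alongside train.py, is to resolve the folder relative to the entry script and fail loudly when it is missing. The function add_scripts_dir_to_path and the demo module demo_dataloader are hypothetical names used only for this illustration.

```python
import importlib
import os
import sys
import tempfile


def add_scripts_dir_to_path(entry_file, subfolder="scripts"):
    """Put <dir of entry_file>/<subfolder> on sys.path and return it.

    Raises FileNotFoundError if the folder does not exist, which makes a
    missing upload visible before the import fails.
    """
    base_dir = os.path.dirname(os.path.abspath(entry_file))
    scripts_dir = os.path.join(base_dir, subfolder)
    if not os.path.isdir(scripts_dir):
        raise FileNotFoundError(f"Expected a folder at {scripts_dir}")
    if scripts_dir not in sys.path:
        sys.path.insert(0, scripts_dir)
    return scripts_dir


# Demo in a temporary folder that imitates the layout from the question.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "scripts"))
    with open(os.path.join(tmp, "scripts", "demo_dataloader.py"), "w") as f:
        f.write("def load():\n    return 'loaded'\n")

    added = add_scripts_dir_to_path(os.path.join(tmp, "train.py"))
    demo = importlib.import_module("demo_dataloader")
    result = demo.load()
    sys.path.remove(added)  # clean up the temporary entry

print(result)
```

Inside train.py this would be called as add_scripts_dir_to_path(__file__) before the two imports. If it raises, the scripts folder was never copied into the container, which points back at how the estimator packages the code rather than at sys.path.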

What can I do to get my job to run correctly with the imported files?

I've tried:

Moving the imported files from the root directory to a subfolder.
Renaming the files.
Defining source_dir in the estimator.
Validating that the right permissions are on my AWS roles.

What am I missing?
