Improving Train Punctuality Prediction Using a Transformer Model: Model Setup and Performance Issues


I am working on a project to predict train punctuality, measured in minutes, at each station along various trips. I initially used a Deep Feedforward Neural Network (DFFNN) for this task, but I wanted to explore a Transformer model to potentially improve the predictions. I organized my dataset into per-trip sequences and handled the varying number of stations per trip with padding. Given the sequential nature of the data, I assumed that an encoder-only Transformer would suffice for this task.
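To make the padding scheme concrete before the actual code, here is a tiny toy illustration (the two trip lengths and the single dummy feature are made up, not from my real data):

import numpy as np

# Two toy trips with 3 and 5 stations and one dummy feature per station.
toy_sequences = [np.arange(3).reshape(3, 1), np.arange(5).reshape(5, 1)]
toy_max_len = max(len(s) for s in toy_sequences)  # 5

# Pad every trip to the same length and record which positions are real data.
toy_padded = [
    np.concatenate([s, np.zeros((toy_max_len - len(s), s.shape[1]))]) for s in toy_sequences
]
toy_masks = [
    np.concatenate([np.ones(len(s)), np.zeros(toy_max_len - len(s))]) for s in toy_sequences
]
# toy_masks[0] -> [1. 1. 1. 0. 0.]  (last two positions are padding)
# toy_masks[1] -> [1. 1. 1. 1. 1.]  (no padding needed)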

During training I have not observed significant improvement in the loss, even after numerous epochs, which makes me question whether my Transformer setup is appropriate for this particular problem. I have also integrated Optuna for hyperparameter tuning to help find a good configuration.

import math
import time

import numpy as np
import optuna
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

# Train on GPU if one is available (assumed setup).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# df is the prepared DataFrame (one row per station stop); group it into trips.
grouped_data = df.groupby([<features to identify trips>])
grouped_data.ngroups

sequences = []
targets = []
masks = []

for trip, group in grouped_data:
    sequence = group[
        [
            <features>
        ]
    ]

    sequences.append(sequence.values)
    targets.append(group[<target>].values)

    # Create a mask for this sequence (1 for data, 0 for padding)
    mask = np.ones(len(sequence), dtype=np.float32)
    masks.append(mask)

max_seq_length = max(len(sequence) for sequence in sequences)
padding_value = 0  # Assume 0 is used for padding
padded_sequences = []
padded_targets = []
padded_masks = []  # List for padded masks

for sequence, target, mask in zip(sequences, targets, masks):
    padding_length = max_seq_length - len(sequence)

    # Pad sequence and target as before
    sequence_padding = np.full((padding_length, sequence.shape[1]), padding_value)
    target_padding = np.full(padding_length, padding_value)
    padded_sequence = np.concatenate((sequence, sequence_padding), axis=0)
    padded_target = np.concatenate((target, target_padding), axis=0)

    # Pad mask
    mask_padding = np.zeros(padding_length, dtype=np.float32)  # Padding for mask is 0
    padded_mask = np.concatenate((mask, mask_padding), axis=0)

    padded_sequences.append(padded_sequence)
    padded_targets.append(padded_target)
    padded_masks.append(padded_mask)  # Add the padded mask to the list

padded_sequences = np.array(padded_sequences)
padded_targets = np.array(padded_targets)
padded_masks = np.array(padded_masks)  # Convert padded masks list to a numpy array
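# Resulting shapes:
#   padded_sequences: (num_trips, max_seq_length, num_features)
#   padded_targets:   (num_trips, max_seq_length)
#   padded_masks:     (num_trips, max_seq_length)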

(
    train_sequences,
    test_sequences,
    train_masks,
    test_masks,
    train_targets,
    test_targets,
) = train_test_split(
    padded_sequences, padded_masks, padded_targets, test_size=0.2, random_state=42
)

train_sequence_tensor = torch.tensor(train_sequences, dtype=torch.float32)
train_mask_tensor = torch.tensor(train_masks, dtype=torch.bool)
train_target_tensor = torch.tensor(train_targets, dtype=torch.float32)

test_sequence_tensor = torch.tensor(test_sequences, dtype=torch.float32)
test_mask_tensor = torch.tensor(test_masks, dtype=torch.bool)
test_target_tensor = torch.tensor(test_targets, dtype=torch.float32)

train_dataset = TensorDataset(
    train_sequence_tensor, train_mask_tensor, train_target_tensor
)
test_dataset = TensorDataset(test_sequence_tensor, test_mask_tensor, test_target_tensor)

batch_size = 32
train_dataloader = DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True
)
test_dataloader = DataLoader(
    test_dataset, batch_size=batch_size, shuffle=False, pin_memory=True
)

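# Sinusoidal positional encoding (as in "Attention Is All You Need"): adds position
# information to the embedded features so the encoder can distinguish stations by
# their order within a trip.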
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # shape (1, max_len, d_model) so it broadcasts over the batch
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model) because the encoder uses batch_first=True,
        # so index the encoding by sequence position (dim 1), not by batch index.
        x = x + self.pe[:, : x.size(1), :]
        return x


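# Encoder-only Transformer: linear feature embedding -> positional encoding ->
# stacked self-attention encoder layers -> linear head that outputs one
# punctuality value per station.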
class TransformerModel(nn.Module):
    def __init__(
        self,
        input_dim,
        model_dim,
        num_heads,
        num_layers,
        dropout_rate=0.1,
        max_len=5000,
    ):
        super(TransformerModel, self).__init__()
        self.model_dim = model_dim
        self.feature_embedding = nn.Linear(input_dim, model_dim)
        self.positional_encoding = PositionalEncoding(model_dim, max_len)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=num_heads, dropout=dropout_rate, batch_first=True
        )
        self.transformer_encoder = nn.TransformerEncoder(
            encoder_layer, num_layers=num_layers
        )
        self.output_layer = nn.Linear(model_dim, 1)

    def forward(self, src, src_mask=None):
        src = self.feature_embedding(src)
        src = self.positional_encoding(src)
        output = self.transformer_encoder(src, src_key_padding_mask=src_mask)
        output = self.output_layer(output)
        output = output.squeeze(-1)
        return output


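# Optuna objective: samples hyperparameters, trains for a fixed number of epochs,
# reports the validation MSE each epoch (for pruning), and returns the final
# validation loss.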
def objective(trial):
    start_time = time.time()
    max_duration = 500  # seconds

    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    model_dim = trial.suggest_categorical("model_dim", [128, 256, 512])
    num_heads = trial.suggest_categorical("num_heads", [4, 8, 16])
    num_layers = trial.suggest_int("num_layers", 1, 4)
    dropout_rate = trial.suggest_float("dropout_rate", 0.1, 0.5)
    weight_decay = trial.suggest_float("weight_decay", 1e-5, 1e-1, log=True)

    input_dim = len(padded_sequences[0][0])  # Number of features in the input
    model = TransformerModel(
        input_dim, model_dim, num_heads, num_layers, dropout_rate
    ).to(device)

    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

    num_epochs = 10
    first_epoch_loss = None
    for epoch in range(num_epochs):
        model.train()
        train_loss = 0.0
        for sequences, masks, targets in train_dataloader:
            sequences = sequences.to(device)
            masks = masks.to(device)
            targets = targets.to(device)

            optimizer.zero_grad()
            outputs = model(
                sequences, ~masks
            )  # Inverting masks for src_key_padding_mask (True marks padded positions)
            loss = criterion(outputs[masks], targets[masks])  # Loss only on real (non-padded) positions
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        if epoch == 0:
            first_epoch_loss = train_loss / len(train_dataloader)

        scheduler.step()

        # Validation phase
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for sequences, masks, targets in test_dataloader:
                sequences = sequences.to(device)
                masks = masks.to(device)
                targets = targets.to(device)

                outputs = model(
                    sequences, ~masks
                )  # Inverting masks for src_key_padding_mask
                loss = criterion(outputs[masks], targets[masks])  # Apply masks
                val_loss += loss.item()

        val_loss /= len(test_dataloader)
        trial.report(val_loss, epoch)

        val_rmse = torch.sqrt(torch.tensor(val_loss))
        trial.set_user_attr("val_rmse", val_rmse.item())

        if trial.should_prune():
            raise optuna.TrialPruned()

        elapsed_time = time.time() - start_time
        if elapsed_time > max_duration:
            print(f"Pruning trial due to timeout: elapsed time {elapsed_time}s")
            raise optuna.TrialPruned()

        print(f"Epoch: {epoch + 1}, Validation Loss: {val_loss}")

    print(
        f"Trial {trial.number}: First Epoch Loss = {first_epoch_loss}, Final Epoch Loss = {val_loss}"
    )

    return val_loss


study = optuna.create_study(direction="minimize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=20)
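
After the study finishes, the best trial can be inspected with standard Optuna attributes, for example:

best = study.best_trial
print("Best validation MSE:", best.value)
print("Best validation RMSE (user attr):", best.user_attrs.get("val_rmse"))
print("Best hyperparameters:", best.params)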

Here's a brief overview of my approach:

  • Data Preparation: I grouped my dataset by trips and created sequences for each trip, along with corresponding targets and masks to handle variable sequence lengths through padding.

  • Model Setup: I implemented an encoder-only Transformer model, considering the sequential nature of the data. The model includes a feature embedding layer, positional encoding, and an encoder with specified layers and heads.

  • Training and Tuning: Using PyTorch, I prepared the data for training and used Optuna for hyperparameter tuning, expecting it to find a good configuration for the model. Despite these efforts, performance hasn't improved as expected during training: I anticipated a noticeable reduction in loss over the epochs, reflecting a growing ability to predict train punctuality, but the loss appears stagnant even over a large number of epochs (a rough baseline for comparison is sketched after this list).
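
As a sanity check on what "stagnant" means here, a constant mean-delay predictor gives a rough reference point. This is only a minimal sketch using the padded arrays built above, not part of my actual pipeline:

mask_bool = padded_masks.astype(bool)
observed = padded_targets[mask_bool]  # all real (non-padded) punctuality values
baseline_mse = ((observed - observed.mean()) ** 2).mean()
print("Constant-mean baseline MSE:", baseline_mse)
# If the Transformer's validation MSE never drops clearly below this value,
# it is not learning anything beyond the average delay.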

Could there be an issue with how I've set up my Transformer model for this specific problem, or might there be other aspects of my approach that are hindering the model's learning? I'm particularly concerned about the encoder-only design, the handling of padded positions, and the way I've implemented the positional encoding and feature embedding in the context of predicting time-based outcomes. Any insights or suggestions to improve the model's performance would be greatly appreciated.
