Integrating Target Variable with Features Before Splitting Sequences Causes Shape Mismatch in ML Model Input

I'm working on a machine learning project for time series analysis, where I need to preprocess sensor data to predict a target variable ('VB'). My approach involves splitting the dataset into sequences, but I'm facing an issue with integrating the target variable into these sequences before splitting.

Here's an overview of my preprocessing steps:

1. Load a CSV dataset and split it based on unique 'run' values.
2. Select a single feature ('smcAC') and the target variable ('VB') from the dataset.
3. Scale the feature data using MinMaxScaler.
4. Combine the scaled features and targets into a single array.
5. Split the combined array into sequences where each sequence includes both features and the target.

Below is the relevant part of my code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Load the csv dataset and preprocess
# [Code omitted for brevity]

# Combine feature and target data
def combine_features_targets(features, targets):
    # np.asarray also accepts a pandas Series, which has no reshape method
    targets = np.asarray(targets).reshape(-1, 1)
    return np.hstack((features, targets))  # Add targets as the last column

X_train_combined = combine_features_targets(X_train_scaled, y_train)
X_test_combined = combine_features_targets(X_test_scaled, y_test)

# Split combined data into sequences
def split_sequences(sequences, n_steps):
    X, y = [], []
    for i in range(len(sequences) - n_steps + 1):
        end_ix = i + n_steps
        seq_x = sequences[i:end_ix, :-1]  # Exclude the target column from features
        seq_y = sequences[end_ix - 1, -1]  # Target value is the last entry of the sequence
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)

X_train_reshaped, y_train_split = split_sequences(X_train_combined, chunk_size)
X_test_reshaped, y_test_split = split_sequences(X_test_combined, chunk_size)

# Shape checks
print("X_train_scaled shape:", X_train_scaled.shape)
print("X_test_scaled shape:", X_test_scaled.shape)
# [Further shape checks omitted for brevity]

I'm concerned that the way I'm integrating and splitting the data might cause issues with the shape expected by my model, especially since the target variable is included in the sequences. How can I ensure that the target variable is correctly included in each sequence without causing shape mismatches in my model input? Is there a more efficient way to structure this data preparation step?
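One way to check the concern above is to run the same windowing logic on a small synthetic array and inspect the resulting shapes. The sketch below is only an illustration: the array sizes, the `n_steps` value, and the synthetic `features`/`targets` arrays are assumptions standing in for the omitted data, and `split_sequences` is reproduced from the code above so the block runs on its own.

```python
import numpy as np

def split_sequences(sequences, n_steps):
    """Window a 2D array whose last column is the target."""
    X, y = [], []
    for i in range(len(sequences) - n_steps + 1):
        end_ix = i + n_steps
        X.append(sequences[i:end_ix, :-1])   # feature columns only
        y.append(sequences[end_ix - 1, -1])  # target at the window's last row
    return np.array(X), np.array(y)

# Synthetic stand-ins (assumptions): 10 rows, 1 scaled feature, 1 target.
n_rows, n_steps = 10, 4
features = np.linspace(0.0, 1.0, n_rows).reshape(-1, 1)
targets = np.arange(n_rows, dtype=float)

combined = np.hstack((features, targets.reshape(-1, 1)))
X_seq, y_seq = split_sequences(combined, n_steps)

print(X_seq.shape)  # (7, 4, 1) -> (windows, n_steps, n_features)
print(y_seq.shape)  # (7,)      -> one target per window
```

Under these assumptions the target column never ends up inside `X_seq`, because the `:-1` slice drops it; the feature array has the `(samples, timesteps, features)` layout that sequence models commonly expect, and the number of windows is `n_rows - n_steps + 1`.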

I have also tried rewriting the code with a different approach, building two separate sequences, one for the X set and one for the y set.
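A minimal sketch of what windowing the features and targets separately could look like, without stacking them into one array first. This is not the code I actually ran: the helper name `make_windows` and the synthetic arrays are hypothetical, and it assumes NumPy 1.20 or later, where `numpy.lib.stride_tricks.sliding_window_view` is available.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(features, targets, n_steps):
    """Hypothetical helper: window features and align one target per window."""
    features = np.asarray(features)
    targets = np.asarray(targets).ravel()
    if features.ndim == 1:
        features = features.reshape(-1, 1)
    # sliding_window_view puts the window axis last:
    # (n_windows, n_features, n_steps)
    windows = sliding_window_view(features, n_steps, axis=0)
    X = windows.transpose(0, 2, 1)  # -> (n_windows, n_steps, n_features)
    y = targets[n_steps - 1:]       # target at the last row of each window
    return X, y

# Synthetic stand-ins (assumptions): 10 rows, 2 features.
features = np.arange(20, dtype=float).reshape(10, 2)
targets = np.arange(10, dtype=float) * 10

X, y = make_windows(features, targets, n_steps=4)
print(X.shape)  # (7, 4, 2)
print(y.shape)  # (7,)
```

The returned `X` is a read-only view on the original array, so calling `X.copy()` before handing it to a model may be necessary if the framework needs a writable or contiguous array. Keeping the targets in their own array also avoids the stack-then-slice step entirely.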
