I have a dataset describing airport airside operations, with 23 independent variables and 1 target value: the percentage of regulated aircraft (delayed more than 15 min). I used the code described in the post, which applies different regression models (random forest, linear regression, GradientBoostingRegressor, XGBRegressor), and got the predictions below. Can I improve them?
These are my prediction tests.
Model   MSE    R2     Within 10% error (%)
LR      0.39   0.42   27.65
RF      0.13   0.81   49.16
SVR     0.19   0.72   42.46
GB      0.19   0.72   44.13
GBX     0.14   0.79   51.12
I think the problem is my data: some of the features are multicollinear, and most of the numerical features are not scaled. I have 1788 rows (78 airports over 2 years (24 months)), but most of the values stay the same across the 24 months for a given airport (e.g. number of runways, number of parking positions, etc.). Examples of my data:
• Level of slot coordination- 2
• Number of parking positions 176
• Total arrivals and departures (TAD)- 10000
• Terminal capacity (millions)- 14
• Global yearly current capacity (movements/year)- 210000
• Number of runway configurations-2
• Number of runways-2
Type of runway configuration, represented with 0 and 1:
· RC Cross 0
· RC Cross Parallel 0
· RC Parallel <=1000 0
· RC Parallel>1000 0
· RC Single 0 (if the airport has one runway)
· RC V Formation 0
How many ILS systems the airport has:
· Instrument Landing System (ILS) CATI 0
· Instrument Landing System (ILS) CATII 2 (2 runways with 2 ILS systems)
· Instrument Landing System (ILS) CATIII 0
· Approach separation (nautical miles)- 5
· Runway capacity per hour (a/h)- 35
· Season_Summer (SS)- 0 (4 months are winter)
· Season_Winter (SW)-1
· Number of turnaround aircraft (Heavy) 430
· Number of turnaround aircraft (Heavy)
· Number of turnaround aircraft (Heavy)
The target value is a percentage (e.g. 23 %, 45 %); I converted it to a number.
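The suspected multicollinearity can be checked before any modelling. Below is a minimal sketch, assuming the features sit in a pandas DataFrame; the helper name `highly_correlated_pairs`, the 0.9 threshold, and the toy values are illustrative assumptions, not part of the real dataset. It lists feature pairs whose absolute Pearson correlation is at or above the threshold. Complementary dummy columns such as Season_Summer and Season_Winter are expected to show up, because one is fully determined by the other.

```python
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, |corr|) for feature pairs with |corr| >= threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            # Constant columns give NaN correlations; NaN >= threshold is False
            if value >= threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs


# Toy frame with made-up values, only to show the shape of the output
toy = pd.DataFrame({
    "Season_Summer": [1, 0, 1, 0, 1, 0],
    "Season_Winter": [0, 1, 0, 1, 0, 1],
    "Number of runways": [2, 2, 1, 1, 3, 2],
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

Dropping one column from each reported pair (for example one of the two season dummies) is a common first step for linear models; tree-based models are usually less sensitive to it.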
my code is this:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb
# Load data from Excel
data = pd.read_excel("PDPSAMO1.xlsx")
# Drop rows with NaN values
data = data.dropna()
# Separate features and target variable
X = data.drop(columns=['PDP'])
y = data['PDP']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
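# Define custom evaluation function for 10% error margin
def within_10_percent_error(y_true, y_pred):
    """Percentage of predictions whose relative error is at most 10%."""
    error_margin = 0.10
    # Relative error is undefined when y_true is 0; such rows count as misses
    relative_error = abs(y_true - y_pred) / y_true
    correct_predictions = (relative_error <= error_margin).sum()
    total_predictions = len(y_true)
    return (correct_predictions / total_predictions) * 100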
# Models to evaluate
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(),
    'SVR': SVR(),
    'Gradient Boosting': GradientBoostingRegressor(),
    'XGBoost': xgb.XGBRegressor()
}
# Dictionary to store the best models and their respective scores
best_models = {}
scores = {
    'MSE': mean_squared_error,
    'R2': r2_score,
    'Within 10% Error': within_10_percent_error
}
# Grid search for hyperparameter tuning and model selection
for model_name, model in models.items():
    print(f"Training {model_name}...")
    param_grid = {}  # Define hyperparameters grid for each model if needed
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                               scoring='neg_mean_squared_error', cv=5)
    grid_search.fit(X_train_scaled, y_train)
    best_model = grid_search.best_estimator_
    best_models[model_name] = best_model
    # Evaluate model performance on test set
    y_pred = best_model.predict(X_test_scaled)
    print(f"Model: {model_name}")
    for metric_name, metric_func in scores.items():
        score = metric_func(y_test, y_pred)
        print(f"{metric_name}: {score:.2f}")
    print("-------------------")
# Compare models and print the best model
best_accuracy = 0
best_model_name = ''
for model_name, model in best_models.items():
    y_pred = model.predict(X_test_scaled)
    accuracy = within_10_percent_error(y_test, y_pred)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model_name = model_name
print(f"Best Model: {best_model_name} with {best_accuracy:.2f}% Accuracy Within 10% Error Margin")
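Because `param_grid` is empty in the loop above, GridSearchCV only cross-validates each model's default settings. Below is a minimal sketch of a non-empty grid for the random forest, with the scaler placed inside a Pipeline so it is refit on each cross-validation fold. The synthetic data, the step names "scaler" and "model", and the grid values are assumptions for illustration only, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(random_state=42)),
])

# Keys follow the "<step name>__<parameter>" convention of sklearn pipelines
param_grid = {
    "model__n_estimators": [20, 50],
    "model__max_depth": [None, 5],
}

grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid,
                           scoring="neg_mean_squared_error", cv=5)
grid_search.fit(X_demo, y_demo)
best_params = grid_search.best_params_
print(best_params, grid_search.best_score_)
```

Each model in the `models` dictionary would need its own grid, since parameter names differ between estimators.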
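Since most features are constant for an airport across its 24 months, a random row-level split puts rows from the same airport in both the training and test sets, which may make the test scores look better than they would be for an unseen airport. Below is a minimal sketch of a group-aware split, assuming the data has an airport identifier column; the column name "Airport" and all toy values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 10 hypothetical airports x 24 months, values are made up
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Airport": np.repeat([f"A{i}" for i in range(10)], 24),
    "Number of runways": np.repeat(rng.integers(1, 4, size=10), 24),
    "TAD": rng.integers(1000, 20000, size=240),
    "PDP": rng.uniform(0, 60, size=240),
})

# The airport identifier is used only for grouping, not as a feature
X = toy.drop(columns=["PDP", "Airport"])
y = toy["PDP"]
groups = toy["Airport"]

# Hold out roughly 20% of the airports, keeping each airport's rows together
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

train_airports = set(groups.iloc[train_idx])
test_airports = set(groups.iloc[test_idx])
print(sorted(test_airports))
```

The same idea carries over to cross-validation by passing `cv=GroupKFold(n_splits=5)` to GridSearchCV and `groups=...` to its `fit` call. Whether this is the right evaluation depends on the goal: predicting future months for known airports is a different task from predicting for an airport the model has never seen.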