Mahalanobis distance computation in Python


When I compute the Mahalanobis distance on the following (poorly correlated) dataframe, I get strange distance values. Here is the Python code:

dataframe

data = { 'Price': [100000, 800000, 650000, 700000, 
               860000, 730000, 400000, 870000, 
               780000, 400000], 
     'Distance': [16000, 60000, 300000, 10000, 
                  252000, 350000, 260000, 510000, 
                  2000, 5000], 
     'Emission': [300, 400, 1230, 300, 400, 104, 
                  632, 221, 142, 267], 
     'Performance': [60, 88, 90, 87, 83, 81, 72,  
                     91, 90, 93], 
     'Mileage': [76, 89, 89, 57, 79, 84, 78, 99,  
                 97, 99] 
       } 

import libraries

import numpy as np 
import pandas as pd  
from scipy import stats

create dataset

df = pd.DataFrame(data,columns=['Price', 'Distance', 
                            'Emission','Performance', 
                            'Mileage']) 

compute the correlation matrix

df.corr(numeric_only=True)

the Mahalanobis distance function

def calculateMahalanobis(y=None, data=None, cov=None): 

    y_mu = y - np.mean(data) 
    if not cov: 
        cov = np.cov(data.values.T) 
    inv_covmat = np.linalg.inv(cov) 
    left = np.dot(y_mu, inv_covmat) 
    mahal = np.dot(left, y_mu.T) 
    return mahal.diagonal() 
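
Two things in the function above are worth checking. First, `np.mean(data)` on a DataFrame: depending on the pandas version (2.0 and later, as far as I can tell), it reduces over the whole frame and returns a single number instead of one mean per column, so every feature ends up centred on the same grand mean. Second, `if not cov:` raises an error as soon as an actual covariance array is passed in. Below is a minimal sketch that avoids both, assuming column-wise centring is what was intended. The name `mahalanobis_squared` and the small `demo` frame are hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd


def mahalanobis_squared(y, data, cov=None):
    """Squared Mahalanobis distance of each row of y from the mean of data."""
    cols = list(data.columns)
    # Centre each column on its own mean (axis=0), not on one grand mean.
    # Selecting y[cols] also ignores any extra columns present in y.
    y_mu = y[cols].to_numpy(dtype=float) - data.mean(axis=0).to_numpy(dtype=float)
    if cov is None:  # "not cov" is ambiguous when cov is an array
        cov = np.cov(data.to_numpy(dtype=float), rowvar=False)
    inv_covmat = np.linalg.inv(cov)
    # Row-wise quadratic form; avoids building the full n x n matrix
    # only to read its diagonal.
    return np.einsum("ij,jk,ik->i", y_mu, inv_covmat, y_mu)


# Hypothetical demo data, not the dataset from the question.
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
                     "b": [2.0, 1.0, 4.0, 3.0, 6.0]})
d2 = mahalanobis_squared(demo, demo)
```

Note that this quadratic form is the squared distance, as in the original function. If the reference values are plain Mahalanobis distances, take `np.sqrt` of the result before comparing.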

create new column in dataframe that contains Mahalanobis distance for each row

df['MahalanobisDistance'] = calculateMahalanobis(y=df, data=df[['Price', 'Distance', 'Emission','Performance', 'Mileage']]) 

display the dataframe

print(df)

[Image: output of print(df), including the MahalanobisDistance column]

All the distances in the last column are equal and very large. Why? I checked the function carefully and it seems correct. By contrast, the values for the 10 rows are expected to be the following (from a reliable source):

[Image: expected Mahalanobis distance values]
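
One way to find out which set of numbers is right is to compare against `scipy.spatial.distance.mahalanobis`, which takes two vectors plus the inverse covariance matrix and returns the (non-squared) distance. The sketch below rebuilds the dataset from the question and computes one reference distance per row. The variable names `X`, `mu`, `VI` and `reference` are my own, and it assumes the distances are measured from the column means using the sample covariance.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import mahalanobis

# Dataset copied from the question.
data = {
    'Price': [100000, 800000, 650000, 700000, 860000,
              730000, 400000, 870000, 780000, 400000],
    'Distance': [16000, 60000, 300000, 10000, 252000,
                 350000, 260000, 510000, 2000, 5000],
    'Emission': [300, 400, 1230, 300, 400, 104, 632, 221, 142, 267],
    'Performance': [60, 88, 90, 87, 83, 81, 72, 91, 90, 93],
    'Mileage': [76, 89, 89, 57, 79, 84, 78, 99, 97, 99],
}
df = pd.DataFrame(data)

X = df.to_numpy(dtype=float)
mu = X.mean(axis=0)                           # one mean per column
VI = np.linalg.inv(np.cov(X, rowvar=False))   # inverse sample covariance

# One distance per row, measured from the column means.
reference = np.array([mahalanobis(row, mu, VI) for row in X])
print(reference)        # plain distances
print(reference ** 2)   # squared distances, comparable to the function's output
```

If the function's output differs from `reference ** 2`, the centring step is the first place to look.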
