Euclidean Distance for Arrays of 3D points in Python


I have two .csv files of 3D points (numeric coordinate data) with associated attribute data (strings and numbers). I need to calculate the Euclidean distance between every point in the first file and every point in the second, and keep the attribute data for both points alongside each distance. I have a method that works, but it uses a nested loop, and I'm hoping there is a less resource-intensive way to do this. Here is the code I am currently using:

import pandas as pd
import numpy as np

# read .csv
dataset_1 = pd.read_csv(dataset1_path)  # dataset1_path: path to the first .csv file
dataset_2 = pd.read_csv(dataset2_path)  # dataset2_path: path to the second .csv file

# convert to numpy array
array_1 = dataset_1.to_numpy()
array_2 = dataset_2.to_numpy()

# define data types for new array. This includes the attribute data I want to maintain
data_type = np.dtype('f4, f4, f4, U10, U10, f4, f4, f4, U10, U10, U10, f4, f4, U10, U100')

#define the new array
new_array = np.empty((len(array_1)*len(array_2)), dtype=data_type)

#calculate the Euclidean distance between each set of 3D coordinates, and populate the new array with the results as well as data from the input arrays
number3 = 0
for number in range(len(array_1)):
        for number2 in range(len(array_2)):
                # store the distance under the name used in the tuple below
                dist = np.linalg.norm(array_1[number, 0:3]-array_2[number2, 0:3])
                new_array[number3] = (array_1[number, 0], array_1[number, 1], array_1[number, 2], array_1[number, 3], array_1[number, 7],
                 array_2[number2, 0], array_2[number2, 1],array_2[number2, 2], array_2[number2, 3], array_2[number2, 6], array_2[number2, 7],
                 array_2[number2, 12], array_2[number2, 13], dist,''.join(sorted((str(array_2[number2, 0]) + str(array_2[number2, 1]) + str(array_2[number2, 2]) + str(array_2[number2, 3])))))
                number3+=1   
                
#Convert results to pandas dataframe
new_df = pd.DataFrame(new_array)

I work with very large datasets, so if anyone could suggest a more efficient way to do this I would be very grateful.

Thanks,

The code presented above works for my problem, but I'm looking for a way to improve its efficiency.

Edit to show example input datasets (dataset_1 & dataset_2) and desired output dataset (new_df). The key is that for the output dataset I need to maintain the attributes from the input dataset associated with the Euclidean Distance. I could use scipy.spatial.distance.cdist to calculate the distances, but I'm not sure of the best way to maintain the attributes from the input data in the output data.



There is 1 answer

Daniel F

Two methods. Setup:

import numpy as np
import pandas as pd
import string
from scipy.spatial.distance import cdist

upper = list(string.ascii_uppercase)
lower = list(string.ascii_lowercase)

df1 = pd.DataFrame(np.random.rand(26,3), 
                   columns = lower[-3:], 
                   index = lower )

df2 = pd.DataFrame(np.random.rand(25,3), 
                   columns = lower[-3:], 
                   index = upper[:-1] )  #testing different lengths

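Method 1: a cross merge with .merge(*, how='cross') (available from pandas 1.2). I think this gives your intended output. cdist(df1, df2).flatten() runs row by row, which matches the row order of the cross merge (df1 varies slowest, df2 fastest):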

new_df = df1.reset_index().merge(df2.reset_index(), 
                              how = 'cross',
                              suffixes = ['1', '2'])
new_df['dist'] = cdist(df1, df2).flatten()
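
The setup above has only coordinate columns, whereas the question's data also carries attribute columns. Here is a minimal sketch of the same cross-merge idea for that case, assuming hypothetical column names (x, y, z, name, label) and tiny made-up frames: the distances are computed from the coordinate columns only, and the attributes ride along in the merge.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical stand-ins for dataset_1 and dataset_2: three coordinate
# columns plus one attribute column each (all names are assumptions).
pts1 = pd.DataFrame({"x": [0.0, 1.0], "y": [0.0, 0.0], "z": [0.0, 0.0],
                     "name": ["a", "b"]})
pts2 = pd.DataFrame({"x": [0.0, 0.0, 3.0], "y": [0.0, 2.0, 4.0],
                     "z": [0.0, 0.0, 0.0], "label": ["P", "Q", "R"]})

coords = ["x", "y", "z"]

# Cartesian product of the rows: pts1 varies slowest, pts2 fastest.
# Overlapping column names (x, y, z) receive the suffixes.
pairs = pts1.merge(pts2, how="cross", suffixes=("_1", "_2"))

# cdist returns a (len(pts1), len(pts2)) matrix; ravel() flattens it
# row by row, which lines up with the row order of the cross merge.
pairs["dist"] = cdist(pts1[coords].to_numpy(),
                      pts2[coords].to_numpy()).ravel()

print(pairs)
```

Note that the result has len(pts1) * len(pts2) rows, so for very large inputs the merged frame itself may become the memory bottleneck.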

Method 2: a 2D (unflattened) distance matrix that keeps the original data as MultiIndexes, with df1 on the rows and df2 on the columns:

new_df2 = pd.DataFrame(cdist(df1, df2), 
                   index = pd.MultiIndex.from_arrays(df1.reset_index().values.T, 
                                                     names = df1.reset_index().columns), 
                   columns = pd.MultiIndex.from_arrays(df2.reset_index().values.T, 
                                                     names = df2.reset_index().columns))