I have two .csv files of 3D points (numeric coordinate data) with associated attribute data (strings and numerics). I need to calculate the Euclidean distance between each point in one file and every point in the other, and keep each point's attribute data alongside the resulting distance. I have a method that works, but it uses a nested loop, and I'm hoping there is a less resource-intensive way to do this. Here is the code I am currently using:
import pandas as pd
import numpy as np
# read .csv
dataset_1 = pd.read_csv(dataset1_path)  # dataset1_path: path to the first .csv file
dataset_2 = pd.read_csv(dataset2_path)  # dataset2_path: path to the second .csv file
# convert to numpy array
array_1 = dataset_1.to_numpy()
array_2 = dataset_2.to_numpy()
# define data types for new array. This includes the attribute data I want to maintain
data_type = np.dtype('f4, f4, f4, U10, U10, f4, f4, f4, U10, U10, U10, f4, f4, U10, U100')
# define the new array (one row per pair of points)
new_array = np.empty((len(array_1)*len(array_2)), dtype=data_type)
# calculate the Euclidean distance between each pair of 3D coordinates, and populate the new array with the results as well as data from the input arrays
number3 = 0
for number in range(len(array_1)):
    for number2 in range(len(array_2)):
        # distance between the x, y, z columns of the two points
        dist = np.linalg.norm(array_1[number, 0:3] - array_2[number2, 0:3])
        new_array[number3] = (array_1[number, 0], array_1[number, 1], array_1[number, 2], array_1[number, 3], array_1[number, 7],
                              array_2[number2, 0], array_2[number2, 1], array_2[number2, 2], array_2[number2, 3], array_2[number2, 6], array_2[number2, 7],
                              array_2[number2, 12], array_2[number2, 13], dist,
                              ''.join(sorted(str(array_2[number2, 0]) + str(array_2[number2, 1]) + str(array_2[number2, 2]) + str(array_2[number2, 3]))))
        number3 += 1
#Convert results to pandas dataframe
new_df = pd.DataFrame(new_array)
I work with very large datasets, so if anyone could suggest a more efficient way to do this I would be very grateful.
Thanks,
The code presented above works for my problem, but I'm looking for something to improve efficiency
Edit to show example input datasets (dataset_1 & dataset_2) and desired output dataset (new_df). The key is that for the output dataset I need to maintain the attributes from the input dataset associated with the Euclidean Distance. I could use scipy.spatial.distance.cdist to calculate the distances, but I'm not sure of the best way to maintain the attributes from the input data in the output data.
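To make the cdist idea concrete, here is a minimal sketch of one way to carry the attributes along. The frames df1 and df2 and their column names (x, y, z, name, label) are made-up placeholders, not the real datasets. The attributes are lined up with the flattened distance matrix by repeating each row of the first frame and tiling the second.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical stand-ins for dataset_1 and dataset_2; column names are assumptions.
df1 = pd.DataFrame({"x": [0.0, 1.0], "y": [0.0, 0.0], "z": [0.0, 0.0],
                    "name": ["a", "b"]})
df2 = pd.DataFrame({"x": [0.0, 3.0, 0.0], "y": [0.0, 4.0, 0.0], "z": [0.0, 0.0, 2.0],
                    "label": ["p", "q", "r"]})

coords = ["x", "y", "z"]
# (len(df1), len(df2)) matrix of pairwise distances, computed without a Python loop
dist = cdist(df1[coords].to_numpy(), df2[coords].to_numpy())

# Each row of df1 is repeated len(df2) times and df2 is tiled len(df1) times,
# which matches the row-major order of dist.ravel().
left = df1.loc[df1.index.repeat(len(df2))].reset_index(drop=True).add_suffix("_1")
right = pd.concat([df2] * len(df1), ignore_index=True).add_suffix("_2")

pairs = pd.concat([left, right], axis=1)
pairs["dist"] = dist.ravel()
print(pairs)
```

The result has len(df1) * len(df2) rows, so memory still grows with the product of the two sizes; only the Python-level loop is removed.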
Two methods. Setup:
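A minimal stand-in for the setup, assuming two small frames with x, y, z coordinates and one attribute column each. The values and the column names name and label are placeholders for illustration, not the question's real data.

```python
import pandas as pd

# Hypothetical sample data; the real datasets have more attribute columns.
dataset_1 = pd.DataFrame({"x": [0.0, 1.0], "y": [0.0, 0.0], "z": [0.0, 0.0],
                          "name": ["a", "b"]})
dataset_2 = pd.DataFrame({"x": [0.0, 3.0, 0.0], "y": [0.0, 4.0, 0.0], "z": [0.0, 0.0, 2.0],
                          "label": ["p", "q", "r"]})
```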
Using .merge(*, how='cross'), this gives your intended output, I think.
A 2D 'ravelled' method that maintains the original data as MultiIndexes:
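Both approaches might look roughly like the sketch below. It assumes small made-up frames with x, y, z and one attribute column each (name and label are placeholder names), and pandas 1.2 or later for how='cross'. The first part builds one row per pair with a cross merge. The second keeps the distances as a 2D frame whose row and column labels are MultiIndexes built from the original rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frames; values and column names are assumptions.
dataset_1 = pd.DataFrame({"x": [0.0, 1.0], "y": [0.0, 0.0], "z": [0.0, 0.0],
                          "name": ["a", "b"]})
dataset_2 = pd.DataFrame({"x": [0.0, 3.0, 0.0], "y": [0.0, 4.0, 0.0], "z": [0.0, 0.0, 2.0],
                          "label": ["p", "q", "r"]})
coords = ["x", "y", "z"]

# Method 1: cross merge gives one row per (dataset_1 row, dataset_2 row) pair.
# Overlapping column names (x, y, z) receive the _1 / _2 suffixes.
cross = dataset_1.merge(dataset_2, how="cross", suffixes=("_1", "_2"))
delta = (cross[["x_1", "y_1", "z_1"]].to_numpy()
         - cross[["x_2", "y_2", "z_2"]].to_numpy())
cross["dist"] = np.linalg.norm(delta, axis=1)

# Method 2: 2D distance matrix via broadcasting; every original column
# is kept as a level of the row or column MultiIndex.
a = dataset_1[coords].to_numpy()
b = dataset_2[coords].to_numpy()
dist_2d = pd.DataFrame(
    np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1),
    index=pd.MultiIndex.from_frame(dataset_1),
    columns=pd.MultiIndex.from_frame(dataset_2),
)

print(cross)
print(dist_2d)
```

Ravelling the 2D matrix row by row (dist_2d.to_numpy().ravel()) gives the distances in the same order as the rows of the cross merge, so the two layouts hold the same information.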