My dataframe has N rows.
I have M centroids. Each centroid is the same shape as a dataframe-row.
I need to create an N-row by M-column matrix, where the m-th column is the result of applying the m-th centroid to every row of the dataframe.
My solution involves pre-creating the output matrix and filling it one column at a time as we manually iterate over centroids.
It feels clumsy, but I can't see how to do it 'properly'.
import numpy as np
import pandas as pd

df = pd.read_csv('test_data.csv')
centroids = df.sample(n=2)
centroids.reset_index(drop=True, inplace=True)

def getDistanceMatrix(df, centroids):
    # One row per dataframe row, one column per centroid.
    distanceMatrix = np.zeros((len(df), len(centroids)))
    # Distance = number of columns in which the row differs from the centroid.
    distFunc = lambda centroid, row: sum(centroid != row)
    iCentroid = 0
    for _, centroid in centroids.iterrows():
        distanceMatrix[:, iCentroid] = df.apply(
            lambda row: distFunc(centroid, row),
            axis=1
        )
        iCentroid += 1
    return distanceMatrix

distanceMatrix = getDistanceMatrix(df, centroids)
Here's an example test_data.csv with 9 rows:
A,B,C,D
1,2,1,1
2,1,1,2
1,2,3,4
2,2,1,2
2,3,3,4
1,1,3,1
4,2,1,2
2,3,3,3
4,1,1,2
It feels like some kind of outer-product-with-a-custom-function.
What's a good way to write this?
I mainly work with "vanilla numpy", so I cannot give a nice pandas-based solution. If these were plain numpy arrays I would do it as follows, though I am not sure whether converting from pandas adds any overhead:
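A minimal sketch of what a broadcasting-based version could look like. This is one plausible numpy approach, not necessarily the exact code originally posted here. The name `get_distance_matrix` is hypothetical, the data is the example CSV typed in as an array (with pandas it would come from `df.to_numpy()`), and the centroids are fixed to rows 0 and 2 instead of being sampled at random so that the result is reproducible.

```python
import numpy as np

# The nine rows of the example test_data.csv (columns A, B, C, D).
# With pandas, this array would come from df.to_numpy().
data = np.array([
    [1, 2, 1, 1],
    [2, 1, 1, 2],
    [1, 2, 3, 4],
    [2, 2, 1, 2],
    [2, 3, 3, 4],
    [1, 1, 3, 1],
    [4, 2, 1, 2],
    [2, 3, 3, 3],
    [4, 1, 1, 2],
])

# Assumption: fixed centroids (rows 0 and 2) for reproducibility;
# the question picks them with df.sample(n=2).
centroids = data[[0, 2]]


def get_distance_matrix(data, centroids):
    """Count, for every (row, centroid) pair, the columns that differ."""
    # data[:, None, :] has shape (N, 1, D); centroids[None, :, :] has shape (1, M, D).
    # Broadcasting compares every row with every centroid, giving (N, M, D) booleans.
    mismatches = data[:, None, :] != centroids[None, :, :]
    # Summing over the last axis counts the mismatching columns, giving (N, M).
    return mismatches.sum(axis=2)


distance_matrix = get_distance_matrix(data, centroids)
print(distance_matrix)
```

The inserted `None` axes are what make this behave like an "outer product with a custom function": no Python-level loop over rows or centroids is needed. The trade-off is the intermediate boolean array of shape (N, M, D), which may become large when N and M are both big. Note also that this returns integer counts, whereas the original function fills a float matrix.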
For reference, your code gives me: