How to perform an outer product with custom function (pandas/numpy)?

Question

163 views Asked by P i At 19 September 2022 at 08:20

My dataframe has N rows.

I have M centroids. Each centroid is the same shape as a dataframe-row.

I need to create an N-row by M-column matrix, where the m-th column is the result of applying the m-th centroid to every row of the dataframe.

My solution involves pre-creating the output matrix and filling it one column at a time as we manually iterate over centroids.

It feels clumsy. But I can't see clearly how to do it 'properly'.

    import numpy as np
    import pandas as pd

    df = pd.read_csv('test_data.csv')
    centroids = df.sample(n=2)
    centroids.reset_index(drop=True, inplace=True)

    def getDistanceMatrix(df, centroids):
        distanceMatrix = np.zeros((len(df), len(centroids)))

        distFunc = lambda centroid, row: sum(centroid != row)

        iCentroid = 0
        for _, centroid in centroids.iterrows():
            distanceMatrix[:, iCentroid] = df.apply(
                lambda row: distFunc(centroid, row),
                axis=1
            )
            iCentroid += 1

        return distanceMatrix

    distanceMatrix = getDistanceMatrix(df, centroids)

Here's an example test_data.csv with 9 rows:

    A,B,C,D
    1,2,1,1
    2,1,1,2
    1,2,3,4
    2,2,1,2
    2,3,3,4
    1,1,3,1
    4,2,1,2
    2,3,3,3
    4,1,1,2

It feels like some kind of outer-product-with-a-custom-function.

What's a good way to write this?

Answer

**André** · Answer 1 · 2022-09-19T10:01:33+00:00

I mainly work with plain NumPy, so I cannot offer a nice pandas-based solution. With NumPy arrays I would do it as follows, though I am not sure whether converting from pandas adds any overhead:

    # Convert to numpy arrays (as I'm not proficient with
    #   pandas dataframes (...yet))
    df_np = df.to_numpy()
    centroids_np = centroids.to_numpy()

    # Broadcast df_np to (2,9,4) and centroids_np to (2,1,4),
    #   then subtract the two.
    # The result is a (2,9,4) array, where:
    #   - axis=0 corresponds to the centroid of the difference
    #   - axis=1 corresponds to the element in the dataframe
    #   - axis=2 corresponds to the individual coordinates
    diff = np.broadcast_to(
        df_np,
        (centroids_np.shape[0], df_np.shape[0], df_np.shape[1])
    ) - centroids_np[:, None, :]

    # Convert to a binary distance
    diff = (diff != 0).astype(df_np.dtype)

    # Now sum along the coordinates
    distanceMatrix2 = np.sum(diff, axis=-1).T
    # array([[0, 2],
    #        [3, 3],
    #        [2, 2],
    #        [2, 4],
    #        [4, 3],
    #        [2, 0],
    #        [2, 4],
    #        [4, 3],
    #        [3, 3]], dtype=int64)
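
The explicit `np.broadcast_to` call and the subtraction are not strictly required: comparing the two arrays directly lets NumPy broadcast them. The following is a minimal sketch rather than part of the original answer. It assumes the sample data from the question and, purely for reproducibility, uses rows 0 and 5 as the two centroids (the question draws them at random with `df.sample`); the name `mismatch_counts` is illustrative.

```python
import numpy as np

# Sample data from the question (columns A, B, C, D).
df_np = np.array([
    [1, 2, 1, 1],
    [2, 1, 1, 2],
    [1, 2, 3, 4],
    [2, 2, 1, 2],
    [2, 3, 3, 4],
    [1, 1, 3, 1],
    [4, 2, 1, 2],
    [2, 3, 3, 3],
    [4, 1, 1, 2],
])

# Assumption: rows 0 and 5 act as the centroids (fixed for reproducibility).
centroids_np = df_np[[0, 5]]

# (9, 1, 4) != (1, 2, 4) broadcasts to a (9, 2, 4) boolean array;
# summing over the last axis counts the mismatching coordinates per pair.
mismatch_counts = (df_np[:, None, :] != centroids_np[None, :, :]).sum(axis=-1)

print(mismatch_counts.shape)  # (9, 2)
```

Because the row axis comes first here, no final transpose is needed. With a real dataframe, `df.to_numpy()` and `centroids.to_numpy()` would take the place of the hard-coded arrays.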

For reference, your code gives me:

    distanceMatrix = getDistanceMatrix(df, centroids)
    # array([[0., 2.],
    #        [3., 3.],
    #        [2., 2.],
    #        [2., 4.],
    #        [4., 3.],
    #        [2., 0.],
    #        [2., 4.],
    #        [4., 3.],
    #        [3., 3.]])
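
For a distance function that cannot be written as a single broadcast expression, one option is `np.vectorize` with a `signature`, which applies a function of two 1-D vectors to every (row, centroid) pair. This is a sketch under stated assumptions, not part of the original answer: the helper names `pairwise_apply` and `mismatch_count` are hypothetical, and rows 0 and 5 are again used as centroids. `np.vectorize` is a convenience wrapper that still loops in Python, so it should not be expected to run faster than the explicit loop in the question.

```python
import numpy as np

def pairwise_apply(func, rows, centroids):
    """Hypothetical helper: apply func(row, centroid) to every pair.

    rows has shape (N, K), centroids has shape (M, K); returns (N, M).
    """
    # The signature tells NumPy that func consumes two 1-D vectors of
    # equal length and returns a scalar.
    vectorized = np.vectorize(func, signature="(k),(k)->()")
    # Length-1 axes make the loop dimensions broadcast to (N, M).
    return vectorized(rows[:, None, :], centroids[None, :, :])

def mismatch_count(row, centroid):
    # Same distance as the question's distFunc: number of differing entries.
    return np.sum(row != centroid)

# Sample data from the question (columns A, B, C, D).
rows = np.array([
    [1, 2, 1, 1],
    [2, 1, 1, 2],
    [1, 2, 3, 4],
    [2, 2, 1, 2],
    [2, 3, 3, 4],
    [1, 1, 3, 1],
    [4, 2, 1, 2],
    [2, 3, 3, 3],
    [4, 1, 1, 2],
])

# Assumption: rows 0 and 5 act as the centroids (fixed for reproducibility).
centroids = rows[[0, 5]]

result = pairwise_apply(mismatch_count, rows, centroids)
print(result.shape)  # (9, 2)
```

Swapping `mismatch_count` for any other function of two equal-length vectors leaves the rest unchanged, which is the "outer product with a custom function" shape the question describes.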
