Minimum distances among a Euclidean distance matrix

752 views Asked by At

I have some code that will calculate the distances between every Cartesian coordinate in one matrix to every other coordinate in another. For each coordinate, the minimum distance will be returned along with the index positions for the coordinates that produced the minimum.

function MED3D(m1, m2)
    n1::Int = size(m1,1)
    Dist = SharedArray{Float64}((n1,3))
    @sync @distributed for k in 1:n1
        Dist[k,:] = MD3D(m1[k,:], m2, k)
    end
    return Dist
end

@everywhere function MD3D(v1, m2, k)
    dsum::Float64 = Inf
    dtemp::Float64 = Inf
    i = 0
    for j in 1:size(m2,1)
        @inbounds dtemp = sqrt((v1[1] - m2[j,1]) * (v1[1] - m2[j,1]) + (v1[2] - m2[j,2]) * (v1[2] - m2[j,2]) + (v1[3] - m2[j,3]) * (v1[3] - m2[j,3]))
        if dtemp < dsum
            dsum = dtemp
            i = j
        end
    end
    return [dsum, k, i]
end

m1 = rand(10,3)
m2 = rand(15,3)
results = MED3D(m1,m2)

While this works reasonably alright with smaller 3D point clouds, I am looking to increase the performance for large point clouds by using GPU-based analyses. However, using the more typical ways to do matrix operations in Julia seems not possible since I have to return the index positions and the minimum distance. I've tried several different ways to adopt CUarrays for this task, but so far they have all failed without using actual for loops. Also, a lot of ways to implement it appear exceptionally inefficient due to storing the distance matrix in memory, which quickly exceeds 128gb of ram for my particular data set.

Could someone please help me with how to properly implement this in Julia to operate on a GPU? Is CUarrays even the correct approach, or is it too abstract of a level given that I am returning indices in addition to a distance? I've tried to calculate the L2 norm using product and dot, but it's not exactly giving me what I require.

UPDATE:

Here is my failed attempt to GPUify the inner loop using broadcasting.

using CuArrays
function difff(m1,m2)
    n1 = size(m1,1)
    Dist = Array{Float64}(undef, n1,3)
    m2 = CuArray(m2)
    m1 = CuArray(m1)
    for z in 1:size(m1)
        v1 = transpose(m1[z,:])
        i = 0
        dsum::Float64 = Inf
        mi = v1 .- m2
        mi = mi .* mi
        mi = sum(mi, dims=2)
        mi = mi .^ 0.5
        mi = findmin(mi)
        i = mi[2][1]
        dsum = mi[1]
        @inbounds Dist[z,:] = [dsum,z,i]
    end
end

UPDATE:

Failed attempt #2. I tried to calculating the minimum distances and forgetting about the indices. This isn't ideal for my application, but I can live with it. However, this only works correctly if the first array has a single row. I tried to solve this by using mapslices, but it does not work.

using CuArray
a = rand(1,3)
b = rand(3,3)

a = CuArray(a)
b = CuArray(b)

function GK(m1, m2)
    reduce(min, sum((m1 .- m2) .^ 2,dims=2) .^ 0.5)
end

mapslices(GK(b), a, 2)

UPDATE:

Making progress by using an outer loop, but surely there is a better way to do this?

using CuArray
using BenchmarkTools
aa = rand(2,3)
bb = rand(5000000,3)

a = CuArray(aa)
b = CuArray(bb)

function GK(m1, m2)
    reduce(min, sum((m1 .- m2) .^ 2,dims=2) .^ 0.5)
end

function D(a,b)
    Dist = Array{Float64}(undef,size(a,1),1)
    for i in 1:size(a,1)
        Dist[i] = GK(a[i,:]',b)
    end
    return Dist
end

@benchmark test = D(a,b)
@benchmark test = D(aa,bb)

UPDATE:

Some benchmarking between my previous distributed version, modified distributed version, GPU version, and serial version.EDIT: After scaling to a 100 billion comparisons, the GPU version no longer outperforms my previous distributed version... Any thoughts on why this is????

using Distributed
using SharedArrays
using CuArrays
using BenchmarkTools

aa = rand(4,3)
bb = rand(500000,3)
a = CuArray(aa)
b = CuArray(bb)

function MED3D(m1, m2)
    n1::Int = size(m1,1)
    Dist = SharedArray{Float64}((n1,1))
    @sync @distributed for k in 1:n1
        Dist[k] = MD3D(m1[k,:]', m2)
    end
    return Dist
end

@everywhere function MD3D(v1, m2)
    dsum::Float64 = Inf
    dtemp::Float64 = Inf
    for j in 1:size(m2,1)
        @inbounds dtemp = sqrt((v1[1] - m2[j,1]) * (v1[1] - m2[j,1]) + (v1[2] - m2[j,2]) * (v1[2] - m2[j,2]) + (v1[3] - m2[j,3]) * (v1[3] - m2[j,3]))
        if dtemp < dsum
            dsum = dtemp
        end
    end
    return dsum
end

function MED3DGK(m1, m2)
    n1::Int = size(m1,1)
    Dist = SharedArray{Float64}((n1,1))
    @sync @distributed for k in 1:n1

        @inbounds Dist[k] = GK(m1[k,:]',m2)
    end
    return Dist
end

@everywhere function GK(m1, m2)
    reduce(min, sum((m1 .- m2) .^ 2,dims=2) .^ 0.5)
end

function D(a,b)
    Dist = Array{Float64}(undef,size(a,1),1)
    for i in 1:size(a,1)
        @inbounds Dist[i] = GK(a[i,:]',b)
    end
    return Dist
end

@benchmark test = D(a,b)
@benchmark test = D(aa,bb)
@benchmark test = MED3D(aa,bb)
@benchmark test = MED3DGK(aa,bb)

UPDATE:

Implementation using NearestNeighbors.jl with distributed processing. Any thoughts on how to make this even faster?:

function MED3D(m1, m2)
    m2 = Matrix(m2')
    kdtree = KDTree(m2)
    n1::Int = size(m1,1)
    Dist = SharedArray{Float64}((n1,1))
    Ind = SharedArray{Float64}((n1,1))
    @sync @distributed for k in 1:n1
        Ind[k,:], Dist[k,:] = knn(kdtree, m1[k,:], 1)
    end
    return [Ind,Dist]
end
1

There are 1 answers

0
jerlich On

I'm not sure if it will help in your case, but when taking a slice m1[k,:] by default julia copies that memory (although maybe it depends on what the knn function does with that slice.

Does it improve anything if you change it to knn(kdtree, @view m1[k,:], 1)