I have some code that will calculate the distances between every Cartesian coordinate in one matrix to every other coordinate in another. For each coordinate, the minimum distance will be returned along with the index positions for the coordinates that produced the minimum.
function MED3D(m1, m2)
n1::Int = size(m1,1)
Dist = SharedArray{Float64}((n1,3))
@sync @distributed for k in 1:n1
Dist[k,:] = MD3D(m1[k,:], m2, k)
end
return Dist
end
@everywhere function MD3D(v1, m2, k)
dsum::Float64 = Inf
dtemp::Float64 = Inf
i = 0
for j in 1:size(m2,1)
@inbounds dtemp = sqrt((v1[1] - m2[j,1]) * (v1[1] - m2[j,1]) + (v1[2] - m2[j,2]) * (v1[2] - m2[j,2]) + (v1[3] - m2[j,3]) * (v1[3] - m2[j,3]))
if dtemp < dsum
dsum = dtemp
i = j
end
end
return [dsum, k, i]
end
m1 = rand(10,3)
m2 = rand(15,3)
results = MED3D(m1,m2)
While this works reasonably alright with smaller 3D point clouds, I am looking to increase the performance for large point clouds by using GPU-based analyses. However, using the more typical ways to do matrix operations in Julia seems not possible since I have to return the index positions and the minimum distance. I've tried several different ways to adopt CUarrays for this task, but so far they have all failed without using actual for loops. Also, a lot of ways to implement it appear exceptionally inefficient due to storing the distance matrix in memory, which quickly exceeds 128gb of ram for my particular data set.
Could someone please help me with how to properly implement this in Julia to operate on a GPU? Is CUarrays even the correct approach, or is it too abstract of a level given that I am returning indices in addition to a distance? I've tried to calculate the L2 norm using product and dot, but it's not exactly giving me what I require.
UPDATE:
Here is my failed attempt to GPUify the inner loop using broadcasting.
using CuArrays
function difff(m1,m2)
n1 = size(m1,1)
Dist = Array{Float64}(undef, n1,3)
m2 = CuArray(m2)
m1 = CuArray(m1)
for z in 1:size(m1)
v1 = transpose(m1[z,:])
i = 0
dsum::Float64 = Inf
mi = v1 .- m2
mi = mi .* mi
mi = sum(mi, dims=2)
mi = mi .^ 0.5
mi = findmin(mi)
i = mi[2][1]
dsum = mi[1]
@inbounds Dist[z,:] = [dsum,z,i]
end
end
UPDATE:
Failed attempt #2. I tried to calculating the minimum distances and forgetting about the indices. This isn't ideal for my application, but I can live with it. However, this only works correctly if the first array has a single row. I tried to solve this by using mapslices, but it does not work.
using CuArray
a = rand(1,3)
b = rand(3,3)
a = CuArray(a)
b = CuArray(b)
function GK(m1, m2)
reduce(min, sum((m1 .- m2) .^ 2,dims=2) .^ 0.5)
end
mapslices(GK(b), a, 2)
UPDATE:
Making progress by using an outer loop, but surely there is a better way to do this?
using CuArray
using BenchmarkTools
aa = rand(2,3)
bb = rand(5000000,3)
a = CuArray(aa)
b = CuArray(bb)
function GK(m1, m2)
reduce(min, sum((m1 .- m2) .^ 2,dims=2) .^ 0.5)
end
function D(a,b)
Dist = Array{Float64}(undef,size(a,1),1)
for i in 1:size(a,1)
Dist[i] = GK(a[i,:]',b)
end
return Dist
end
@benchmark test = D(a,b)
@benchmark test = D(aa,bb)
UPDATE:
Some benchmarking between my previous distributed version, modified distributed version, GPU version, and serial version.EDIT: After scaling to a 100 billion comparisons, the GPU version no longer outperforms my previous distributed version... Any thoughts on why this is????
using Distributed
using SharedArrays
using CuArrays
using BenchmarkTools
aa = rand(4,3)
bb = rand(500000,3)
a = CuArray(aa)
b = CuArray(bb)
function MED3D(m1, m2)
n1::Int = size(m1,1)
Dist = SharedArray{Float64}((n1,1))
@sync @distributed for k in 1:n1
Dist[k] = MD3D(m1[k,:]', m2)
end
return Dist
end
@everywhere function MD3D(v1, m2)
dsum::Float64 = Inf
dtemp::Float64 = Inf
for j in 1:size(m2,1)
@inbounds dtemp = sqrt((v1[1] - m2[j,1]) * (v1[1] - m2[j,1]) + (v1[2] - m2[j,2]) * (v1[2] - m2[j,2]) + (v1[3] - m2[j,3]) * (v1[3] - m2[j,3]))
if dtemp < dsum
dsum = dtemp
end
end
return dsum
end
function MED3DGK(m1, m2)
n1::Int = size(m1,1)
Dist = SharedArray{Float64}((n1,1))
@sync @distributed for k in 1:n1
@inbounds Dist[k] = GK(m1[k,:]',m2)
end
return Dist
end
@everywhere function GK(m1, m2)
reduce(min, sum((m1 .- m2) .^ 2,dims=2) .^ 0.5)
end
function D(a,b)
Dist = Array{Float64}(undef,size(a,1),1)
for i in 1:size(a,1)
@inbounds Dist[i] = GK(a[i,:]',b)
end
return Dist
end
@benchmark test = D(a,b)
@benchmark test = D(aa,bb)
@benchmark test = MED3D(aa,bb)
@benchmark test = MED3DGK(aa,bb)
UPDATE:
Implementation using NearestNeighbors.jl with distributed processing. Any thoughts on how to make this even faster?:
function MED3D(m1, m2)
m2 = Matrix(m2')
kdtree = KDTree(m2)
n1::Int = size(m1,1)
Dist = SharedArray{Float64}((n1,1))
Ind = SharedArray{Float64}((n1,1))
@sync @distributed for k in 1:n1
Ind[k,:], Dist[k,:] = knn(kdtree, m1[k,:], 1)
end
return [Ind,Dist]
end
I'm not sure if it will help in your case, but when taking a slice
m1[k,:]
by default julia copies that memory (although maybe it depends on what theknn
function does with that slice.Does it improve anything if you change it to
knn(kdtree, @view m1[k,:], 1)