FastChain vs GPUs in DiffEqFlux

For GPU training of the model, I am using

dudt = Chain(Dense(3,100,tanh),
    Dense(100,3)) |> gpu

versus

CPU training

dudt = FastChain(FastDense(3,100,tanh),
                 FastDense(100,3))

Over 1000 iterations, FastChain is orders of magnitude faster than running on the GPU (a Tesla K40c). Is this expected behaviour? If not, could I be doing something wrong in how I implement the model on the GPU? An MWE for the GPU implementation follows:

using OrdinaryDiffEq, Flux, DiffEqFlux, CuArrays

function lorenz(du,u,p,t)
    σ = p[1]; ρ = p[2]; β = p[3]
    du[1] = σ*(u[2]-u[1])
    du[2] = u[1]*(ρ-u[3]) - u[2]
    du[3] = u[1]*u[2] - β*u[3]
    return 
end
u0 = Float32[1.0,0.0,0.0]               
tspan = (0.0,1.0)                      
para = [10.0,28.0,8/3]                      
prob = ODEProblem(lorenz, u0, tspan, para)  
t = range(tspan[1],tspan[2],length=101)
ode_data = Array(solve(prob,Tsit5(),saveat=t))
ode_data = cu(ode_data)

u0train = [1.0,0.0,0.0] |> gpu
tspantrain = (0.0,1.0)  
ttrain = range(tspantrain[1],tspantrain[2],length=101)  
dudt = Chain(Dense(3,100,tanh),
    Dense(100,3)) |> gpu
n_ode = NeuralODE(dudt, tspantrain, Tsit5(), saveat=ttrain)

function predict_n_ode(p)
  n_ode(u0train,p)
end

function loss_n_ode(p)
    pred = predict_n_ode(p) |> gpu
    loss = sum(abs2, pred .- ode_data)
    loss,pred
end
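
One note on the training call below: it references a callback cb that is not defined in the snippet. A minimal placeholder, assuming the callback only needs to print the loss, could be:

# Minimal placeholder callback (not part of the original snippet):
# print the current loss and return false so that training continues.
cb = function (p, l, pred)
    println("loss: ", l)
    return false
end
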

res1 = DiffEqFlux.sciml_train(loss_n_ode, n_ode.p, ADAM(0.01), cb=cb, maxiters = 1000)
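
For reference, a rough sketch of the FastChain/CPU side of the comparison, assuming the same data, loss, and optimizer as the GPU version:

# Sketch of the FastChain/CPU counterpart (assumes the same Lorenz data and loss):
dudt_cpu = FastChain(FastDense(3,100,tanh),
                     FastDense(100,3))
n_ode_cpu = NeuralODE(dudt_cpu, tspantrain, Tsit5(), saveat=ttrain)
ode_data_cpu = Array(solve(prob, Tsit5(), saveat=t))

function loss_cpu(p)
    pred = Array(n_ode_cpu(u0, p))
    sum(abs2, pred .- ode_data_cpu), pred
end

res_cpu = DiffEqFlux.sciml_train(loss_cpu, n_ode_cpu.p, ADAM(0.01), cb=cb, maxiters=1000)
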
1 Answer

Chris Rackauckas:

That model is too small for GPU parallelism to really make a difference. The neural network is essentially 3 matvecs: 100x3, 100x100, 3x100. The only one with a kernel that probably comes close to breaking even is the middle one, where a 100x100 matrix is multiplied by a length-100 vector.

For example, on my machine:

using BenchmarkTools, CuArrays
A = rand(100,100); x = rand(100);
@btime A*x; # 56.299 μs (1 allocation: 896 bytes)
gA = cu(A); gx = cu(x)
@btime gA*gx; # 12.499 μs (6 allocations: 160 bytes)

A = rand(100,3); x = rand(3);
@btime A*x; # 251.695 ns (1 allocation: 896 bytes)
gA = cu(A); gx = cu(x)
@btime gA*gx; # 12.212 μs (6 allocations: 160 bytes)

So while the speedup on the largest operation does exist, it's not enough to overcome the slowdown from putting the other small operations on the GPU. This is because GPUs have a high floor (on my machine, around 12 μs), so you have to make sure your problem is large enough for the GPU to really make sense. Generally, machine learning benefits from GPUs because it is dominated by large matrix multiplications, with layer sizes in the tens of thousands.
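
For instance, here is a rough sketch of the regime where the GPU does win. The 10_000x10_000 size is just illustrative, CuArrays.@sync is used so the timing includes the kernel itself rather than only the launch, and actual numbers depend on your hardware:

using BenchmarkTools, CuArrays
B = rand(Float32, 10_000, 10_000); y = rand(Float32, 10_000);
@btime B*y;                    # large matvec on the CPU
gB = cu(B); gy = cu(y);
@btime CuArrays.@sync gB*gy;   # same matvec on the GPU; @sync waits for the kernel
                               # so the full execution time is measured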