For GPU training of the model, I am using
dudt = Chain(Dense(3,100,tanh),
             Dense(100,3)) |> gpu
versus
CPU training
dudt = FastChain(FastDense(3,100,tanh),
                 FastDense(100,3))
Over 1000 iterations, the FastChain version is orders of magnitude faster than running on the GPU (a Tesla K40c). Is this expected behaviour, or could I be doing something wrong in how I implement the model on the GPU? An MWE for the GPU implementation follows:
using DifferentialEquations, DiffEqFlux, Flux, CUDA  # packages used by the MWE
# Lorenz system used to generate the training data
function lorenz(du,u,p,t)
    σ = p[1]; ρ = p[2]; β = p[3]
    du[1] = σ*(u[2]-u[1])
    du[2] = u[1]*(ρ-u[3]) - u[2]
    du[3] = u[1]*u[2] - β*u[3]
    return nothing
end
u0 = Float32[1.0,0.0,0.0]
tspan = (0.0,1.0)
para = [10.0,28.0,8/3]
prob = ODEProblem(lorenz, u0, tspan, para)
t = range(tspan[1],tspan[2],length=101)
ode_data = Array(solve(prob,Tsit5(),saveat=t))  # training data, generated on the CPU
ode_data = cu(ode_data)                         # then moved to the GPU
u0train = Float32[1.0,0.0,0.0] |> gpu  # Float32 so it matches ode_data and the network weights
tspantrain = (0.0,1.0)
ttrain = range(tspantrain[1],tspantrain[2],length=101)
dudt = Chain(Dense(3,100,tanh),
             Dense(100,3)) |> gpu
n_ode = NeuralODE(dudt, tspantrain, Tsit5(), saveat=ttrain)
function predict_n_ode(p)
    n_ode(u0train,p)
end
function loss_n_ode(p)
    pred = predict_n_ode(p) |> gpu
    loss = sum(abs2, pred .- ode_data)
    loss, pred
end
cb = (p, l, pred) -> (println(l); false)  # minimal placeholder callback; the original cb definition isn't shown
res1 = DiffEqFlux.sciml_train(loss_n_ode, n_ode.p, ADAM(0.01), cb=cb, maxiters=1000)
That model is too small for GPU parallelism to really make a difference. The neural network is essentially three matvecs (100x3, 100x100, 3x100), and the only kernel that probably comes close to breaking even is the middle one, where a 100x100 matrix is multiplied by a length-100 vector.
For example, on my machine the speedup only shows up on the largest of those kernels, which you can see by benchmarking the matvecs in isolation.
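A rough sketch of how to reproduce that comparison yourself (not the original benchmark output; it assumes BenchmarkTools.jl and CUDA.jl are installed and uses the layer sizes mentioned above):

using BenchmarkTools, CUDA

W1 = rand(Float32, 100, 3);   x1 = rand(Float32, 3)
W2 = rand(Float32, 100, 100); x2 = rand(Float32, 100)
W3 = rand(Float32, 3, 100);   x3 = rand(Float32, 100)

@btime $W1 * $x1    # 100x3 times a length-3 vector: tiny, stays fastest on the CPU
@btime $W2 * $x2    # 100x100 times a length-100 vector: the only kernel near break-even
@btime $W3 * $x3    # 3x100 times a length-100 vector: tiny again

gW2 = cu(W2); gx2 = cu(x2)
@btime CUDA.@sync $gW2 * $gx2    # the same 100x100 matvec on the GPU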
So while the speedup on the largest operation does exist, it's not enough to overcome the slowdown from putting all the other small operations on the GPU. This is because GPUs have a high fixed overhead per operation (on my machine around 12μs), so you have to make sure your problem is large enough for the GPU to really make sense. Machine learning generally benefits from GPUs because it's dominated by large matrix multiplications, with layer dimensions in the tens of thousands.
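For contrast, here is a sketch of the regime where the GPU does pay off (again assuming BenchmarkTools.jl and CUDA.jl; the 2048x2048 size is just an illustrative choice, not from the original post):

using BenchmarkTools, CUDA

n = 2048                        # large enough that the per-operation GPU overhead stops mattering
A = rand(Float32, n, n); B = rand(Float32, n, n)
@btime $A * $B                  # large matmul on the CPU

gA = cu(A); gB = cu(B)
@btime CUDA.@sync $gA * $gB     # the same matmul on the GPU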