I am trying to use the finite difference method with NumPy arrays, but the slicing is incredibly slow. Using lists is not viable since I am applying math operations. Equivalent MATLAB code executes in about 2 seconds, whereas the python code takes over 1 minute to finish. My python code is:
import numpy as np
from scipy import interpolate

def heston_explicit_nonuniform(S,V,K,T,Ns,Nv,kappa,theta,rho,sigma,r,q,dt,dv,ds):
    U = np.empty((Ns,Nv))
    # initial condition: payoff max(S - K, 0), the same in every variance column
    for s in range(Ns):
        for v in range(Nv):
            U[s,v] = np.maximum(S[s] - K, 0)
    for t in range(Nt - 1):
        # boundary conditions in S, and at v = Vmax
        for v in range(Nv-1):
            U[0,v] = 0
            U[Ns-1,v] = np.maximum(Smax - K, 0)
        for s in range(Ns):
            U[s,Nv-1] = np.maximum(S[s] - K, 0)
        u = np.copy(U)
        # v = 0 boundary
        for s in range(1,Ns-1):
            derV = (u[s,1] - u[s,0]) / (V[1] - V[0])
            derS = (u[s+1,0] - u[s-1,0]) / (S[s+1] - S[s-1])
            U[s,0] = u[s,0] + dt * (-r*u[s,0] + (r-q)*S[s]*derS + kappa*theta*derV)
        u = np.copy(U)
        # interior points
        for s in range(1,Ns-1):
            for v in range(1,Nv-1):
                derS = 0.5*(u[s+1,v] - u[s-1,v]) / (S[s+1] - S[s-1])
                derV = 0.5*(u[s,v+1] - u[s,v-1]) / (V[v+1] - V[v-1])
                derSS = (u[s+1,v] - 2*u[s,v] + u[s-1,v]) / ((S[s+1] - S[s])*(S[s] - S[s-1]))
                derVV = (u[s,v+1] - 2*u[s,v] + u[s,v-1]) / ((V[v+1] - V[v])*(V[v] - V[v-1]))
                derSV = (u[s+1,v+1] + u[s-1,v-1] - u[s-1,v+1] - u[s+1,v-1]) \
                        / (4 * (S[s+1] - S[s-1]) * (V[v+1] - V[v-1]))
                A = 0.5 * V[v] * (S[s]**2) * derSS
                B = rho*sigma*V[v]*S[s]*derSV
                C = 0.5*(sigma**2)*V[v]*derVV
                D = 0.5*(r-q)*S[s]*derS
                E = kappa*(theta - V[v])*derV
                L = dt * (A + B + C + D + E - r*u[s,v])
                U[s,v] = L + u[s,v]
        if t % 100 == 0:
            print(t)
    return U

Vmax = 0.5
Vmin = 0
Smin = 0
T = 0.15
K = 100.
Smax = 2*K
Vmax = 0.5
r = 0.02
q = 0.05
rho = -0.9
v0 = 0.05
sigma = 0.3
kappa = 1.5
theta = 0.04
L = 17
Nv = 39
Ns = 79
Nt = 3000
dt = T/Nt
Vi = np.linspace(0,Vmax,Nv)
Si = np.linspace(0,Smax,Ns)
dv = (Vmax-Vmin)/Nv
ds = (Smax-Smin)/Ns

call = heston_explicit_nonuniform(Si,Vi,K,T,Ns,Nv,kappa,theta,rho,sigma,r,q,dt,dv,ds)
data = interpolate.RectBivariateSpline(Si,Vi,call)
z_new = data(101.52,0.05412)
print('FDM: ', z_new[0,0])
The thing is, there is no slicing in your code, and that is exactly what is missing.
In numpy, you compute things on whole arrays, not element by element. In other words, you need to avoid for loops, which are pure python (and therefore very slow: the only reason python can be quite fast when using numpy is that most of the CPU time is spent in C code, not in python code). So, for example, to focus on a specific and quite easy part of your code, take this loop (the one that handles the v = 0 boundary, right after the first np.copy(U)):
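# copied from your function: the v = 0 boundary update
for s in range(1, Ns-1):
    derV = (u[s,1] - u[s,0]) / (V[1] - V[0])
    derS = (u[s+1,0] - u[s-1,0]) / (S[s+1] - S[s-1])
    U[s,0] = u[s,0] + dt * (-r*u[s,0] + (r-q)*S[s]*derS + kappa*theta*derV)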
It should not exist (or, to be more accurate, it should exist only inside numpy's internal code, in C). What you do here is, for each s between 1 and Ns-1, compute derV, which is a combination of the s-th elements of two vectors (u[:,1] and u[:,0], i.e. two columns of u) and a scalar (V[1] - V[0]). All together, that is Ns-2 computations over vectors of Ns-2 elements, and you can let numpy do that: derV then becomes an array holding the Ns-2 values you were computing one by one. So this part of the code can be turned into something like (the slice 1:-1 selects the interior points s = 1 .. Ns-2):
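# one array operation instead of Ns-2 scalar ones
derV = (u[1:-1, 1] - u[1:-1, 0]) / (V[1] - V[0])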
So we have "vectorized" the computation of the Ns-2 derV values. Likewise, derS can be computed for all the interior s at once:
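# u[2:,0] - u[:-2,0] is u[s+1,0] - u[s-1,0] for every interior s
derS = (u[2:, 0] - u[:-2, 0]) / (S[2:] - S[:-2])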
So both derivative lines are now whole-array operations, with no python loop left. And that last iterated computation of U[s,0] is also just Ns-2 independent computations, built from the Ns-2 u[s,0] values, the Ns-2 derS values, the Ns-2 derV values, the Ns-2 S values, and some scalars. So it can also be turned into a single array assignment:
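# whole v = 0 boundary row updated at once
U[1:-1, 0] = u[1:-1, 0] + dt * (-r*u[1:-1, 0] + (r-q)*S[1:-1]*derS + kappa*theta*derV)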
If I continue this for your whole code, the function becomes something like the sketch below: same algorithm, with the s and v loops replaced by slice arithmetic (the helper names Sc, Vc, dS and dV are just local shorthands I introduce to keep the lines readable, they are not in your original code).
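# Sketch of the fully vectorized function. It keeps your signature and, like your
# version, still reads Nt and Smax from the enclosing scope.
def heston_explicit_nonuniform(S, V, K, T, Ns, Nv, kappa, theta, rho, sigma, r, q, dt, dv, ds):
    # initial condition: payoff max(S - K, 0), identical in every variance column
    U = np.tile(np.maximum(S - K, 0)[:, None], (1, Nv))
    for t in range(Nt - 1):
        # boundary conditions
        U[0, :Nv-1] = 0
        U[Ns-1, :Nv-1] = np.maximum(Smax - K, 0)
        U[:, Nv-1] = np.maximum(S - K, 0)
        u = np.copy(U)
        # v = 0 boundary, all interior s at once
        derV = (u[1:-1, 1] - u[1:-1, 0]) / (V[1] - V[0])
        derS = (u[2:, 0] - u[:-2, 0]) / (S[2:] - S[:-2])
        U[1:-1, 0] = u[1:-1, 0] + dt * (-r*u[1:-1, 0] + (r-q)*S[1:-1]*derS + kappa*theta*derV)
        u = np.copy(U)
        # interior points, all (s, v) at once; S-related factors get a trailing axis
        # so that they broadcast along the v axis, V-related factors a leading axis
        Sc = S[1:-1, None]                 # interior S values, as a column
        Vc = V[None, 1:-1]                 # interior V values, as a row
        dS = (S[2:] - S[:-2])[:, None]     # S[s+1] - S[s-1]
        dV = (V[2:] - V[:-2])[None, :]     # V[v+1] - V[v-1]
        derS = 0.5 * (u[2:, 1:-1] - u[:-2, 1:-1]) / dS
        derV = 0.5 * (u[1:-1, 2:] - u[1:-1, :-2]) / dV
        derSS = (u[2:, 1:-1] - 2*u[1:-1, 1:-1] + u[:-2, 1:-1]) \
                / ((S[2:] - S[1:-1]) * (S[1:-1] - S[:-2]))[:, None]
        derVV = (u[1:-1, 2:] - 2*u[1:-1, 1:-1] + u[1:-1, :-2]) \
                / ((V[2:] - V[1:-1]) * (V[1:-1] - V[:-2]))[None, :]
        derSV = (u[2:, 2:] + u[:-2, :-2] - u[:-2, 2:] - u[2:, :-2]) / (4 * dS * dV)
        A = 0.5 * Vc * Sc**2 * derSS
        B = rho * sigma * Vc * Sc * derSV
        C = 0.5 * sigma**2 * Vc * derVV
        D = 0.5 * (r - q) * Sc * derS
        E = kappa * (theta - Vc) * derV
        L = dt * (A + B + C + D + E - r*u[1:-1, 1:-1])
        U[1:-1, 1:-1] = L + u[1:-1, 1:-1]
        if t % 100 == 0:
            print(t)
    return U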
It returns the exact same result (including the very last decimal place), because it is the exact same computation; the iterations are just done inside numpy's internal code instead of in python.
Note that I haven't even tried to understand the algorithm; I just translated it as I would any array-handling code, because all the computations I rewrote were batches of Ns-2, Nv-2 or (Ns-2)(Nv-2) independent computations (the order in which they were done wasn't important).
It would probably be interesting to go deeper into the algorithm itself, not just its coding. For example, I am pretty sure at least one of the two copy() calls could be avoided.
It would be harder (but probably not impossible) to also vectorize the for t loop, because that loop is different: the computation done at the t-th iteration depends on the result of the (t-1)-th iteration, which was not the case for the for s and for v loops. That may still be possible using cumulative operators such as cumsum or cumprod, which allow, in some specific cases, to use numpy "vectorization" (that is, to hijack numpy's internal for loop so it replaces yours) even when each iteration depends on the previous one. As a toy illustration of the pattern (not your PDE, just a recurrence where each step depends on the previous one):
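# toy example, unrelated to the Heston PDE: the recurrence x[t] = x[t-1] + a[t]
a = np.random.rand(1000)

x = np.empty_like(a)
x[0] = a[0]
for t in range(1, len(a)):    # pure-python loop, each step uses the previous one
    x[t] = x[t-1] + a[t]

x_vec = np.cumsum(a)          # same iteration, done inside numpy's C code
# np.allclose(x, x_vec) -> True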
Memory-wise, doing that for your problem would imply keeping a big array of the values for all (t, s, v) combinations; but from your example that is hardly 10 million values, which is really not a problem nowadays. But still, without even making any effort to adapt to the problem, by simply translating the for loops into numpy batch computations, we already gained a ×100 factor (on my slow machine, it went from precisely 100 seconds to 1 second; it is not every day that I see cases where the timing ratio is that clean :-) ).