I have the following subroutine, which is my stepper function in a bigger DEM program. It computes every interaction between particles i and j, then updates the forces. As a first effort, I am trying to parallelize this heavy O(N^2) loop before moving to fancier search algorithms.
But I can't find a way to make it work. I know it is due to variables being modified by different threads, but I don't know how to deal with that properly. One possibility would be to redefine all my arrays as matrices, but I still don't know if this is the correct way.
Also, note that I use include statements for my variables, since I have a lot of variables that need to be accessed and modified by many subroutines, and it felt like the easy way to make that work; I am open to suggestions.
Here is my subroutine:
subroutine stepper (tstep)

    use omp_lib

    implicit none

    include "parameter.h"
    include "CB_variables.h"
    include "CB_const.h"
    include "CB_bond.h"
    include "CB_forcings.h"

    integer :: i, j
    integer, intent(in) :: tstep

    ! reinitialize force arrays for contacts and bonds
    do i = 1, n
        mc(i)  = 0d0
        mb(i)  = 0d0
        fcx(i) = 0d0
        fcy(i) = 0d0
        fbx(i) = 0d0
        fby(i) = 0d0
    end do

    ! put yourself in the reference frame of the ith particle
    ! loop through all j particles and compute interactions
    !$omp parallel do schedule(dynamic) &
    !$omp private(i,j) &
    !$omp reduction(+:tfx,tfy,fcx,fcy,fbx,fby,m,mc,mb)
    do i = 1, n
        do j = i + 1, n

            ! compute relative position and velocity
            call rel_pos_vel (i, j)

            ! bond initialization
            if ( tstep .eq. 1 ) then
                if ( -deltan(i, j) .le. 5d-1 * r(i) ) then ! can be fancier
                    !bond (i, j) = 1
                end if
                call bond_properties (i, j)
            end if

            ! verify if two particles are colliding
            if ( deltan(i,j) .gt. 0 ) then
                call contact_forces (i, j)
                !call bond_creation (i, j) ! to implement

                ! update contact force on particle i by particle j
                fcx(i) = fcx(i) - fcn(i,j) * cosa(i,j)
                fcy(i) = fcy(i) - fcn(i,j) * sina(i,j)

                ! update moment on particle i by particle j due to tangent contact
                mc(i) = mc(i) - r(i) * fct(i,j) - mcc(i,j)

                ! Newton's third law
                ! update contact force on particle j by particle i
                fcx(j) = fcx(j) + fcn(i,j) * cosa(i,j)
                fcy(j) = fcy(j) + fcn(i,j) * sina(i,j)

                ! update moment on particle j by particle i due to tangent contact
                mc(j) = mc(j) - r(j) * fct(i,j) + mcc(i,j)
            end if

            ! compute forces from bonds between particles i and j
            if ( bond (i, j) .eq. 1 ) then
                call bond_forces (i, j)
                !call bond_breaking (i, j)

                ! update force on particle i by particle j due to bond
                fbx(i) = fbx(i) - fbn(i,j) * cosa(i,j) + &
                                  fbt(i,j) * sina(i,j)
                fby(i) = fby(i) - fbn(i,j) * sina(i,j) - &
                                  fbt(i,j) * cosa(i,j)

                ! update moment on particle i by particle j due to bond
                mb(i) = mb(i) - r(i) * fbt(i,j) - mbb(i, j)

                ! Newton's third law
                ! update force on particle j by particle i due to bond
                fbx(j) = fbx(j) + fbn(i,j) * cosa(i,j) - &
                                  fbt(i,j) * sina(i,j)
                fby(j) = fby(j) + fbn(i,j) * sina(i,j) + &
                                  fbt(i,j) * cosa(i,j)

                ! update moment on particle j by particle i due to bond
                mb(j) = mb(j) - r(i) * fbt(i,j) + mbb(j, i)
            end if

            ! compute sheltering height of particle j on particle i for air and water drag
            call sheltering(i, j)
        end do

        ! compute the total forcing from winds, currents and Coriolis on particle i
        call forcing (i)
        !call coriolis(i)

        ! reinitialize total force arrays before summing everything
        m(i)   = 0d0
        tfx(i) = 0d0
        tfy(i) = 0d0

        ! sum all forces on particle i together
        tfx(i) = fcx(i) + fbx(i) + fax(i) + fwx(i)
        tfy(i) = fcy(i) + fby(i) + fay(i) + fwy(i)

        ! sum all moments on particle i together
        m(i) = mc(i) + mb(i) + ma(i) + mw(i)
    end do
    !$omp end parallel do

    ! forces on side particles for experiments
    call experiment_forces

    ! integration in time
    call velocity
    call euler

end subroutine stepper
Some details on the called subroutines:
- rel_pos_vel(i, j) computes relative positions, velocities, angles, etc. between particles i and j. Everything inside is a 2D array from common blocks.
- bond_properties(i, j): same thing here.
- contact_forces(i, j): some local variables as well as some arrays from common blocks.
- bond_forces(i, j): same thing.
- sheltering and forcing: same.
I tried to put the different variables that are being changed (tfx, tfy, m, etc.) in the private clause, but without success. I don't think I understand the reduction clause well either. Also, I know that some constant arrays, like the radius r(i), are accessed by multiple threads, but I don't know how to deal with that.
In fact, when printing the number of threads in the parallel region (using omp_get_num_threads()), the output is 1, indicating that no parallelization is taking place.
Also, when trying to parallelize the first initialization loop as a first step towards understanding:
!$omp do private(i) schedule(dynamic) num_threads(10)
print*, omp_get_num_threads()
do i = 1, n
    mc(i)  = 0d0
    mb(i)  = 0d0
    fcx(i) = 0d0
    fcy(i) = 0d0
    fbx(i) = 0d0
    fby(i) = 0d0
end do
!$omp end do
it is still not working: the output of omp_get_num_threads() is 1, even though the number of threads is explicitly given as 10.
I am compiling using gfortran, with the flags -ffast-math, -O3, -fopenmp
As I wrote in a comment: before the end of the i do loop you are updating the tfx and tfy arrays, but only with the i index. Consequently, everything that has been computed with the j index does not participate here. So you should update them either inside the j do loop (and also for the j index), or in a separate loop after the parallel loop.
Here is a demonstration code:
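The original snippet is not reproduced here, so the following is a minimal reconstruction consistent with the trace further down: two arrays s and t, where s accumulates pairwise contributions for both i and j (as the forces do in stepper), and t(i) is read from s(i) before the end of the outer loop:

program demo
    implicit none
    integer, parameter :: N = 10
    integer :: i, j
    integer :: s(N), t(N)

    s(:) = 0
    t(:) = 0
    !$omp parallel do schedule(dynamic) reduction(+:s,t)
    do i = 1, N
        do j = i + 1, N
            ! pairwise contribution, applied to both partners
            s(i) = s(i) + 1
            s(j) = s(j) + 1
        end do
        ! under OpenMP, s here is a private partial copy, so s(i) is
        ! still missing the contributions accumulated by other threads
        t(i) = s(i)
    end do
    !$omp end parallel do
    print *, s
    print *, t
end program demo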
If I run it without OpenMP (or optionally with a single thread) the output is as expected:
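With the reconstructed demo above and N = 10, every element of both s and t comes out as N-1 = 9.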
Now if I run it with OpenMP I get garbage in t(:).
The solution is then obviously to update t only after the parallel loop:
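The corrected snippet is missing here as well; on the reconstructed demo it would amount to keeping only s in the reduction and filling t afterwards:

!$omp parallel do schedule(dynamic) reduction(+:s)
do i = 1, N
    do j = i + 1, N
        s(i) = s(i) + 1
        s(j) = s(j) + 1
    end do
end do
!$omp end parallel do
! t is computed from the fully reduced s, outside the parallel loop
do i = 1, N
    t(i) = s(i)
end do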
Explanation:
In the example above (with OpenMP and a reduction on t), let's assume that thread 0 processes the odd i iterations and thread 1 the even i iterations. Because of the reduction, s and t are implicitly private in the OpenMP region: let's denote the private versions s0, t0, s1, t1.

After i = 1 (thread 0) and its j loop:
s0(:) = [N-1, 1, 1, ..., 1]
t0(:) = [N-1, 0, 0, ..., 0]

After i = 2 (thread 1) and its j loop:
s1(:) = [0, N-2, 1, ..., 1]
t1(:) = [0, N-2, 0, ..., 0]

After i = 3 (thread 0) and its j loop:
s0(:) = [N-1, 1, N-2, 2, 2, ..., 2]
t0(:) = [N-1, 0, N-2, 0, 0, ..., 0]

After i = 4 (thread 1) and its j loop:
s1(:) = [0, N-2, 1, N-3, 1, ..., 1]
t1(:) = [0, N-2, 0, N-3, 0, ..., 0]

...

After the last iterations:
s0(:) = [N-1, 1, N-2, 2, N-3, 3, N-4, 4, ..., N/2-1, N/2]
t0(:) = [N-1, 0, N-2, 0, N-3, 0, N-4, 0, ..., N/2-1, 0]
s1(:) = [0, N-2, 1, N-3, 1, N-4, ..., N/2, N/2-1]
t1(:) = [0, N-2, 0, N-3, 0, N-4, ..., 0, N/2]

At the end of the parallel region, the reduction consists in s = s0 + s1 and t = t0 + t1. You can see that it goes well for s, but not for t.
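Applied to the stepper above, this means keeping the pairwise arrays (fcx, fcy, fbx, fby, mc, mb) in the reduction, but moving the sums into tfx, tfy and m to a separate loop after the parallel region. A sketch of that restructuring, assuming the called subroutines are thread-safe (i.e. they only write to the (i,j)-indexed slots of their common-block arrays):

!$omp parallel do schedule(dynamic) private(i,j) &
!$omp reduction(+:fcx,fcy,fbx,fby,mc,mb)
do i = 1, n
    do j = i + 1, n
        ! ... contact and bond computations as before ...
    end do
    call forcing (i)
end do
!$omp end parallel do

! total forces and moments, once all pair contributions are in
do i = 1, n
    tfx(i) = fcx(i) + fbx(i) + fax(i) + fwx(i)
    tfy(i) = fcy(i) + fby(i) + fay(i) + fwy(i)
    m(i)   = mc(i) + mb(i) + ma(i) + mw(i)
end do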