After reading this how-can-a-fortran-openacc-routine-call-another-fortran-openacc-routine, I am still puzzled by this OpenACC function call limitation.
Below is a modified nonsense code from the above linked post:
PROGRAM Test
IMPLICIT NONE
CONTAINS
SUBROUTINE OuterRoutine( N )
!$acc routine
IMPLICIT NONE
INTEGER :: N
real :: y
INTEGER :: i
DO i = 0, N
call InnerRoutine( y )
ENDDO
END SUBROUTINE OuterRoutine
subroutine InnerRoutine( y )
!$acc routine
IMPLICIT NONE
real :: y
END subroutine InnerRoutine
END PROGRAM Test
When I compiled it with nvfortran
version 20.7, I got
$ nvfortran -acc -Minfo routine.f90
outerroutine:
14, Generating acc routine seq
Generating Tesla code
22, Reference argument passing prevents parallelization: y
innerroutine:
27, Generating acc routine seq
Generating Tesla code
nvvmCompileProgram error 9: NVVM_ERROR_COMPILATION.
Error: /tmp/pgaccr22eZDXceweL.gpu (43, 14): parse invalid forward reference to function '_innerroutine_' with wrong type!
ptxas /tmp/pgaccH22eJTMb0hKD.ptx, line 1; fatal : Missing .version directive at start of file '/tmp/pgaccH22eJTMb0hKD.ptx'
ptxas fatal : Ptx assembly aborted due to errors
NVFORTRAN-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (routine_inline.f90: 1)
0 inform, 0 warnings, 1 severes, 0 fatal for
What is triggering the compilation error? As a comparison, the following code with acc function calls
module data
integer, parameter :: maxl = 100000
real, dimension(maxl) :: xstat
real, dimension(:), allocatable :: yalloc
!$acc declare create(xstat,yalloc)
logical :: IsUsed
!$acc declare create(IsUsed)
end module
module useit
use data
contains
subroutine compute(n)
integer :: n
integer :: i
!$acc parallel loop present(yalloc,xstat)
do i = 1, n
call iprocess(i, yalloc)
enddo
end subroutine
subroutine iprocess(i, yalloc)
!$acc routine seq
integer :: i
real,intent(out) :: yalloc(:)
if(IsUsed) call kernel(i,yalloc)
contains
subroutine kernel(i,yalloc)
!$acc routine seq
integer, intent(in) :: i
real,intent(out) :: yalloc(:)
yalloc(i) = 2*xstat(i)
end subroutine
end subroutine
end module
program main
use data
use useit
implicit none
integer :: nSize = 100
!---------------------------------------------------------------------------
call allocit(nSize)
call initialize
call compute(nSize)
!$acc update self(yalloc)
write(*,*) "yalloc(10)=",yalloc(10) ! 3
call finalize
contains
subroutine allocit(n)
integer :: n
allocate(yalloc(n))
end subroutine allocit
subroutine initialize
xstat = 1.0
yalloc = 1.0
IsUsed = .true.
!$acc update device(xstat,yalloc,IsUsed)
end subroutine initialize
subroutine finalize
deallocate(yalloc)
end subroutine finalize
end program main
can be compiled with OpenACC and run.
UPDATE: surprisingly enough, for the first piece of code, when I simply switched the order of the subroutines, it worked:
PROGRAM Test
IMPLICIT NONE
CONTAINS
subroutine InnerRoutine( y )
!$acc routine
IMPLICIT NONE
real :: y
END subroutine InnerRoutine
SUBROUTINE OuterRoutine( N )
!$acc routine
IMPLICIT NONE
INTEGER :: N
real :: y
INTEGER :: i
DO i = 0, N
call InnerRoutine( y )
ENDDO
END SUBROUTINE OuterRoutine
END PROGRAM Test
It seems really surprising to me that this particular one depends on the routine ordering. But then why does it work for my second example above?
It's a compiler device code generation error. When calling "InnerRoutine" from "OuterRoutine" the compiler is correctly adding the hidden argument to the parent's stack but the definition for the "InnerRoutine" is missing it as an actual argument. The error is the mismatch between the callee and caller.
I've added a problem report, TPR #29057. Unclear if it's a wider issue or an artifact of the small test case.
Note, be mindful of using contained device subroutines. Fortran allows access to a parent's local variables by passing in a pointer to the parent's stack. If the parent is on the host and the child on the device, directly accessing a parent's variables will cause runtime errors. For example, if "iprocess" was contained within "compute" and you accessed "i" directly, rather than it being passed as an argument, you'd get errors since the device can't access the host's stack.