Routines inside acc parallel region

1k views Asked by At

After reading this how-can-a-fortran-openacc-routine-call-another-fortran-openacc-routine, I am still puzzled by this OpenACC function call limitation.

Below is a modified nonsense code from the above linked post:

PROGRAM Test
IMPLICIT NONE

CONTAINS

 SUBROUTINE OuterRoutine( N )
 !$acc routine
   IMPLICIT NONE
   INTEGER :: N
   real :: y
   INTEGER :: i

      DO i = 0, N
         call InnerRoutine( y )
      ENDDO

 END SUBROUTINE OuterRoutine

 subroutine InnerRoutine( y )
 !$acc routine
   IMPLICIT NONE

   real :: y

 END subroutine InnerRoutine

END PROGRAM Test

When I compiled it with nvfortran version 20.7, I got

$ nvfortran -acc -Minfo routine.f90
outerroutine:
     14, Generating acc routine seq
         Generating Tesla code
     22, Reference argument passing prevents parallelization: y
innerroutine:
     27, Generating acc routine seq
         Generating Tesla code
nvvmCompileProgram error 9: NVVM_ERROR_COMPILATION.
Error: /tmp/pgaccr22eZDXceweL.gpu (43, 14): parse invalid forward reference to function '_innerroutine_' with wrong type!
ptxas /tmp/pgaccH22eJTMb0hKD.ptx, line 1; fatal   : Missing .version directive at start of file '/tmp/pgaccH22eJTMb0hKD.ptx'
ptxas fatal   : Ptx assembly aborted due to errors
NVFORTRAN-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (routine_inline.f90: 1)
  0 inform,   0 warnings,   1 severes, 0 fatal for

What is triggering the compilation error? As a comparison, the following code with acc function calls

module data
   integer, parameter :: maxl = 100000
   real, dimension(maxl) :: xstat
   real, dimension(:), allocatable :: yalloc
   !$acc declare create(xstat,yalloc)
   logical :: IsUsed
   !$acc declare create(IsUsed)
 end module
 
 module useit
   use data
 contains
   subroutine compute(n)
      integer :: n
      integer :: i
      !$acc parallel loop present(yalloc,xstat)
      do i = 1, n
         call iprocess(i, yalloc)
      enddo
   end subroutine
   
   subroutine iprocess(i, yalloc)
      !$acc routine seq
      integer :: i
      real,intent(out) :: yalloc(:)
      if(IsUsed) call kernel(i,yalloc)

      contains

      subroutine kernel(i,yalloc)
        !$acc routine seq
        integer, intent(in) :: i
        real,intent(out) :: yalloc(:)
        yalloc(i) = 2*xstat(i)
      end subroutine

   end subroutine 

 end module
 
 program main
 
   use data
   use useit
 
   implicit none
 
   integer :: nSize = 100
   !---------------------------------------------------------------------------
 
   call allocit(nSize)
   call initialize
 
   call compute(nSize)
 
   !$acc update self(yalloc) 
   write(*,*) "yalloc(10)=",yalloc(10) ! 3
 
   call finalize
   
 contains
   subroutine allocit(n)
     integer :: n
     allocate(yalloc(n))
   end subroutine allocit
   
   subroutine initialize
     xstat = 1.0
     yalloc = 1.0
     IsUsed = .true.
     !$acc update device(xstat,yalloc,IsUsed)
   end subroutine initialize
 
   subroutine finalize
 
     deallocate(yalloc)
     
   end subroutine finalize
   
 end program main

can be compiled with OpenACC and run.

UPDATE: surprisingly enough, for the first piece of code, when I simply switched the order of the subroutines, it worked:

PROGRAM Test
IMPLICIT NONE

CONTAINS

 subroutine InnerRoutine( y )
 !$acc routine
   IMPLICIT NONE

   real :: y

 END subroutine InnerRoutine

 SUBROUTINE OuterRoutine( N )
 !$acc routine
   IMPLICIT NONE
   INTEGER :: N
   real :: y
   INTEGER :: i

      DO i = 0, N
         call InnerRoutine( y )
      ENDDO

 END SUBROUTINE OuterRoutine

END PROGRAM Test

It seems really surprising to me that this particular one depends on the routine ordering. But then why does it work for my second example above?

1

There are 1 answers

6
Mat Colgrove On

It's a compiler device code generation error. When calling "InnerRoutine" from "OuterRoutine" the compiler is correctly adding the hidden argument to the parent's stack but the definition for the "InnerRoutine" is missing it as an actual argument. The error is the mismatch between the callee and caller.

I've added a problem report, TPR #29057. Unclear if it's a wider issue or an artifact of the small test case.

Note, be mindful of using contained device subroutines. Fortran allows access to a parent's local variables by passing in a pointer to the parent's stack. If the parent is on the host and the child on the device, directly accessing a parent's variables will cause runtime errors. For example, if "iprocess" was contained within "compute" and you accessed "i" directly, rather than it being passed as an argument, you'd get errors since the device can't access the host's stack.