I'm trying to work out the best schedule for a benchmark Halide code and I might be missing something because the timing results don't make much sense to me.
I'm using AOT compilation, and here's the algorithm part of the code:
ImageParam input1(type_of<float>(), 3);
ImageParam input2(type_of<float>(), 3); 
Func in1 = BoundaryConditions::constant_exterior(input1, 0.0f);
Func in2 = BoundaryConditions::constant_exterior(input2, 0.0f);   
f1(x, y, z) = (in1(x + 1, y, z) + in1(x, y, z) + in1(x - 1, y,z));
f2(x, y, z) = (in2(x + 2, y, z) + in2(x + 1, y, z) + in2(x, y, z) +in2(x - 1, y, z) + in2(x - 2, y, z));
res(x, y, z) = f1(x, y, z) + f1(x - 1, y, z) + f2(x - 1, y, z) + f2(x, y, z);
For the schedule this is what I have:
f1.store_at(res, y).compute_at(res, yi).vectorize(x, 8);
f2.store_at(res, y).compute_at(res, yi).vectorize(x, 8);
res.split(y, y, yi, 8).vectorize(x, 8).parallel(y);
res.print_loop_nest();
I use the current_time function to time the execution of my code. When I use the mentioned schedule for both f1 and f2 the execution time is more than when I use the schedule on only one of these Funcs. Considering the structure of the stencils I'd have thought scheduling both of them would improve performance. What am I missing here? Also when I print the loops to see the generated code:
  for k:
    parallel j.j:
      store f1:
        store f2:
          for j.in_y in [0, 7]:
            produce f1:
              for k:
                for j:
                  for i.i:
                    vectorized i.v122 in [0, 7]:
                      f1(...) = ...
            consume f1:
              produce f2:
                for k:
                  for j:
                    for i.i:
                      vectorized i.v126 in [0, 7]:
                        f2(...) = ...
              consume f2:
                for i.i:
                  vectorized i.v133 in [0, 7]:
                    result(...) = ...
consume result:
Is it just the indentation or is the produce f2 nested within produce f1? Any suggestions for a better schedule?
 
                        
I think this is probably memory-bandwidth limited. The few extra adds implied by inlining f1 or f2 into res don't really matter. Indeed, I get the best performance with the following schedule:
I.e. just pulling in a padded scanline of each inputs and then doing all the math inlined.
But it's barely any faster than yours. The difference might be noise. My full experiment:
https://gist.github.com/abadams/c2e6f67d79e1768af6db5afcabb1caab
The produce of f2 is nested inside the consume of f1. That's normal - it doesn't use f1, but it's used by something that does use f1, so that's a reasonable place for it to end up.