I'm using Halide to optimize a stencil computation, but scheduling it well is proving to be a challenge. Here's my Halide code (I'm using AOT compilation):
//Halide declarations:
Param<int> max_x, max_y, max_z;
Param<float> r;
Var x("x"), y("y"), z("z"), s("s"), xi("xi"), yi("yi"), m("m"), xo("xo"), yo("yo"), index("idx");
Func f1_unbound("f_ub"), f1_bound("f_b"), result("res");
ImageParam input1(type_of<float>(), 3);
ImageParam input2(type_of<float>(), 3);
Expr t_h;
//The algorithm:
t_h = input2(x, y, z) / (r * input1(x, y, z));
f1_unbound(x, y, z) = 100.0f / (t_h) * pow((t_h / 200.0f), 1.5f);
f1_bound(x, y, z) = BoundaryConditions::repeat_edge(f1_unbound, 1, max_x - 2, 1, max_y - 2, 1, max_z - 2)(x, y, z);
result(x, y, z) = 0.125f * (f1_bound(x, y, z) + f1_bound(x + 1, y, z) +
f1_bound(x, y + 1, z) + f1_bound(x + 1, y + 1, z) +
f1_bound(x, y, z + 1) + f1_bound(x + 1, y, z + 1) +
f1_bound(x, y + 1, z + 1) + f1_bound(x + 1, y + 1, z + 1));
f1_bound.split(x, x, xi, 32)
        .unroll(y, 2)
        .unroll(xi, 2)
        .vectorize(xi, 16)
        .compute_at(result, y)
        .store_at(result, y);
//f1_unbound.compute_root();
//f1_bound.vectorize(x, 16);
result.tile(x, y, x, y, xi, yi, 32, 8)
      .vectorize(xi, 16)
      .bound(x, 0, ((max_x + 1) / 2) * 2)
      .bound(y, 0, ((max_y + 1) / 2) * 2)
      .bound(z, 0, max_z + 1)
      .parallel(y);
result.print_loop_nest();
result.compile_to_static_library("v_halide", {input1, input2, r, max_x, max_y, max_z}, "v_halide");
std::cout << "Compiled to static library!" << std::endl;
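For context, this is roughly how I call the generated library from the host side. It's a sketch rather than my exact harness; the sizes are made up, and I'm assuming a Halide recent enough to ship HalideBuffer.h:

// Host-side usage sketch: the generated header declares v_halide() with the
// params in the order passed to compile_to_static_library, then the output.
#include "v_halide.h"
#include "HalideBuffer.h"

int main() {
    const int W = 512, H = 512, D = 64; // example sizes, not my real ones
    Halide::Runtime::Buffer<float> in1(W, H, D), in2(W, H, D), out(W, H, D);
    in1.fill(1.0f);
    in2.fill(2.0f);
    // Signature: (input1, input2, r, max_x, max_y, max_z, result)
    int err = v_halide(in1, in2, 0.5f, W, H, D, out);
    return err;
}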
I understand how the performance changes with different schedules when it comes to splitting/tiling and specifying where each function is evaluated. However, I'm having trouble with parallel performance. I've tried different schedules for parallelization, such as the one above, but I haven't found an efficient parallel schedule, and the profiling numbers are somewhat confusing. For the above schedule, as I increase the number of threads, "result" becomes slower and f1_bound becomes faster, while the number reported as "threads" (which, if I'm not mistaken, is the average number of active threads in each region) increases for both:
4 threads:

average threads used: 3.586322
heap allocations: 19500  peak heap usage: 116640 bytes
res: 0.946ms (33%) threads: 3.119
f1_b: 1.873ms (66%) threads: 3.823 peak: 116640 num: 19500 avg: 29160

2 threads:

average threads used: 1.934264
heap allocations: 19500  peak heap usage: 58320 bytes
res: 0.769ms (19%) threads: 1.794
f1_b: 3.152ms (80%) threads: 1.968 peak: 58320 num: 19500 avg: 29160
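In case it matters, here's roughly how I collect these numbers: I enable Halide's built-in profiler via the Profile target feature and vary the worker count with HL_NUM_THREADS. This is a sketch (my real target comes from the environment, and the binary name is made up):

// Profiling setup sketch: add the Profile feature to the AOT target.
Target target = get_target_from_environment().with_feature(Target::Profile);
result.compile_to_static_library("v_halide",
                                 {input1, input2, r, max_x, max_y, max_z},
                                 "v_halide", target);
// Then run the AOT binary with e.g.: HL_NUM_THREADS=4 ./run_stencil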
When I schedule both f1_bound and f1_unbound, I get better scaling as I increase the number of threads, but I think there's less locality, so the single-threaded run is slower than the version with no parallelization at all:
f1_bound.split(x, x, xi, 32)
        .unroll(y, 2)
        .unroll(xi, 2)
        .vectorize(xi, 16)
        .compute_at(result, y);
f1_unbound.split(x, x, xi, 32)
          .unroll(y, 2)
          .unroll(xi, 2)
          .vectorize(xi, 16)
          .compute_at(result, y)
          .store_at(result, y);
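To make the question concrete, here's the kind of alternative I've been sketching (untested, and the 64x16 tile size is a guess, not a tuned value): parallelize over whole tiles rather than rows, computing the intermediate per tile for locality.

// Candidate schedule sketch: fuse the tile loops into one index and
// parallelize over tiles; f1_unbound is computed per tile, f1_bound stays inlined.
result.tile(x, y, xo, yo, xi, yi, 64, 16)
      .fuse(xo, yo, index)
      .parallel(index)
      .vectorize(xi, 16);
f1_unbound.compute_at(result, index)
          .vectorize(x, 16);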
Any suggestions for a better schedule?