I'm getting started with Halide, and whilst I've grasped the basic tenets of its design, I'm struggling with the particulars (read: magic) required to efficiently schedule computations.
I've posted below a MWE of using Halide to copy an array from one location to another. I had assumed this would compile down to only a handful of instructions and take less than a microsecond to run. Instead, it produces 4000 lines of assembly and takes 40ms to run! Clearly, therefore, I have a significant hole in my understanding.
- What is the canonical way of wrapping an existing array in a Halide::Image?
- How should the function copy be scheduled to perform the copy efficiently?
Minimal working example
#include <Halide.h>
using namespace Halide;

void _copy(uint8_t* in_ptr, uint8_t* out_ptr, const int M, const int N) {
    Image<uint8_t> in(Buffer(UInt(8), N, M, 0, 0, in_ptr));
    Image<uint8_t> out(Buffer(UInt(8), N, M, 0, 0, out_ptr));
    Var x, y;
    Func copy;
    copy(x, y) = in(x, y);
    copy.realize(out);
}

int main(void) {
    uint8_t in[10000], out[10000];
    _copy(in, out, 100, 100);
}
Compilation Flags
clang++ -O3 -march=native -std=c++11 -Iinclude -Lbin -lHalide copy.cpp
Let me start with your second question: _copy takes a long time because it needs to compile the Halide code to x86 machine code. IIRC, Func caches the machine code, but since copy is local to _copy, that cache cannot be reused.

Anyway, scheduling copy is pretty simple because it's a pointwise operation. First, it would probably make sense to vectorize it. Second, it might make sense to parallelize it (depending on how much data there is). For example, a schedule along the lines of copy.vectorize(x, 32).parallel(y); will vectorize along x with a vector size of 32 and parallelize along y. (I am making this up from memory, there might be some confusion about the correct names.) Of course, doing all this might also increase compile times...

There is no recipe for good scheduling. I do it by looking at the output of compile_to_lowered_stmt and profiling the code. I also use the AOT compilation provided by Halide::Generator, which makes sure that I only measure the runtime of the code and not the compile time.

Your other question was how to wrap an existing array in a Halide::Image. I don't do that, mostly because I use AOT compilation. However, internally Halide uses a type called buffer_t for everything image related. There is also a C++ wrapper called Halide::Buffer that makes using buffer_t a little easier; I think it can also be used in Func::realize instead of Halide::Image. The point is: if you understand buffer_t, you can wrap almost anything into something digestible by Halide.
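To make the last two points a bit more concrete, here is a rough sketch in plain C++ (deliberately not Halide code, so it builds without the library). DemoBuffer, wrap, at and copy_scheduled are hypothetical names invented for this illustration. The descriptor's fields only echo what I recall of the legacy buffer_t (a host pointer plus per-dimension extent, stride and min); the real definition lives in Halide's runtime header and should be checked there. The loop nest is a hand-written approximation of the shape that "vectorize x by 32, parallel over y" describes, not actual compile_to_lowered_stmt output.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 2-D image descriptor. The field names are
// modelled on my recollection of Halide's legacy buffer_t; this is NOT the
// real struct.
struct DemoBuffer {
    uint8_t *host;      // first element of memory owned by the caller
    int32_t extent[2];  // size in x and in y
    int32_t stride[2];  // elements to step per unit of x and of y
    int32_t min[2];     // coordinate of the first element in x and in y
};

// Wrap a dense row-major array (height rows of width columns).
// Nothing is copied; the descriptor just points at the existing memory.
DemoBuffer wrap(uint8_t *data, int32_t width, int32_t height) {
    return DemoBuffer{data, {width, height}, {1, width}, {0, 0}};
}

// Address of element (x, y) as described by the descriptor.
uint8_t *at(const DemoBuffer &b, int32_t x, int32_t y) {
    return b.host + (x - b.min[0]) * b.stride[0]
                  + (y - b.min[1]) * b.stride[1];
}

// Hand-written approximation of a pointwise copy whose x loop is split
// into fixed-width chunks and whose rows are independent of each other.
void copy_scheduled(const DemoBuffer &in, const DemoBuffer &out) {
    assert(in.extent[0] == out.extent[0] && in.extent[1] == out.extent[1]);
    const int32_t kVec = 32;  // chunk width, standing in for the vector size
    const int32_t w = out.extent[0], h = out.extent[1];
    // Simplification of mine: rows narrower than kVec use one short chunk.
    const int32_t chunk = std::min(kVec, w);
    // Rows do not depend on each other, so this loop could be spread
    // across threads; it is kept serial here for simplicity.
    for (int32_t y = 0; y < h; ++y) {
        for (int32_t x0 = 0; x0 < w; x0 += chunk) {
            // Shift the last chunk inward so it never runs past the row end;
            // it overlaps the previous chunk, which is harmless for a copy.
            const int32_t x = std::min(x0, w - chunk);
            // Fixed-length inner loop: the part a compiler can vectorize.
            for (int32_t i = 0; i < chunk; ++i) {
                *at(out, out.min[0] + x + i, out.min[1] + y) =
                    *at(in, in.min[0] + x + i, in.min[1] + y);
            }
        }
    }
}
```

With the arrays from the question, this would be called as copy_scheduled(wrap(in, 100, 100), wrap(out, 100, 100)). Since 100 is not a multiple of 32, the last chunk of each row starts at x = 68 and overlaps the one before it.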