I'm getting started with Halide, and whilst I've grasped the basic tenets of its design, I'm struggling with the particulars (read: magic) required to efficiently schedule computations.
I've posted below a MWE of using Halide to copy an array from one location to another. I had assumed this would compile down to only a handful of instructions and take less than a microsecond to run. Instead, it produces 4000 lines of assembly and takes 40ms to run! Clearly, therefore, I have a significant hole in my understanding.
- What is the canonical way of wrapping an existing array in a `Halide::Image`?
- How should the function `copy` be scheduled to perform the copy efficiently?
Minimal working example
#include <Halide.h>
using namespace Halide;

void _copy(uint8_t* in_ptr, uint8_t* out_ptr, const int M, const int N) {
    Image<uint8_t> in(Buffer(UInt(8), N, M, 0, 0, in_ptr));
    Image<uint8_t> out(Buffer(UInt(8), N, M, 0, 0, out_ptr));
    Var x, y;
    Func copy;
    copy(x, y) = in(x, y);
    copy.realize(out);
}

int main(void) {
    uint8_t in[10000], out[10000];
    _copy(in, out, 100, 100);
}
Compilation Flags
clang++ -O3 -march=native -std=c++11 -Iinclude -Lbin -lHalide copy.cpp
Let me start with your second question: `_copy` takes a long time because it needs to compile the Halide code to x86 machine code before it can run it. IIRC, `Func` caches the machine code, but since `copy` is local to `_copy`, that cache cannot be reused.

Anyway, scheduling `copy` is pretty simple because it's a pointwise operation. First, it would probably make sense to vectorize it. Second, it might make sense to parallelize it (depending on how much data there is). For example, `copy.vectorize(x, 32).parallel(y);` will vectorize along `x` with a vector size of 32 and parallelize along `y`. (I am making this up from memory, so there might be some confusion about the correct names.) Of course, doing all this might also increase compile times...

There is no recipe for good scheduling. I do it by looking at the output of `compile_to_lowered_stmt` and profiling the code. I also use the AOT compilation provided by `Halide::Generator`, which makes sure that I only measure the runtime of the code and not the compile time.

Your other question was how to wrap an existing array in a `Halide::Image`. I don't do that, mostly because I use AOT compilation. However, internally Halide uses a type called `buffer_t` for everything image related. There is also a C++ wrapper called `Halide::Buffer` that makes using `buffer_t` a little easier; I think it can also be used in `Func::realize` instead of `Halide::Image`. The point is: if you understand `buffer_t`, you can wrap almost anything into something digestible by Halide.
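
To make the two ideas above a bit more concrete, here is a rough plain C++ sketch that does not use Halide at all. `RawBuffer2D`, `wrap` and `copy_rows` are hypothetical names of mine, not Halide API. The struct only imitates the kind of information a buffer descriptor such as `buffer_t` carries as far as I understand it (a host pointer plus per-dimension extents and strides counted in elements), so check `HalideRuntime.h` for the real layout. The loop nest shows roughly what vectorizing `x` by 32 and parallelizing `y` means for a pointwise copy. The way the last chunk is shifted back inside the row is just one simple way to handle widths that are not a multiple of 32, and I am not claiming that this is exactly what Halide generates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 2-D buffer descriptor (NOT the Halide type).
// It only records where existing memory lives and how to step through it.
struct RawBuffer2D {
    uint8_t *host;  // memory owned by the caller; nothing is copied
    int extent[2];  // extent[0] = width (x), extent[1] = height (y)
    int stride[2];  // step in elements along x and along y

    uint8_t &at(int x, int y) const {
        return host[x * stride[0] + y * stride[1]];
    }
};

// Describe an existing row-major array with M rows and N columns.
RawBuffer2D wrap(uint8_t *ptr, int M, int N) {
    return RawBuffer2D{ptr, {N, M}, {1, N}};
}

// Pointwise copy written in the loop order that "vectorize x by 32,
// parallelize y" suggests. Everything runs sequentially here.
void copy_rows(const RawBuffer2D &in, const RawBuffer2D &out) {
    const int kVec = 32;  // chunk width along x
    const int width = out.extent[0];
    const int height = out.extent[1];
    assert(in.extent[0] == width && in.extent[1] == height);

    for (int y = 0; y < height; ++y) {  // rows are independent of each other
        if (width < kVec) {
            // Row too narrow for a full chunk: plain scalar loop.
            for (int x = 0; x < width; ++x) out.at(x, y) = in.at(x, y);
            continue;
        }
        for (int x0 = 0; x0 < width; x0 += kVec) {
            // Shift the last chunk back so it stays inside the row; a few
            // elements are copied twice, which is harmless for a pure copy.
            const int base = std::min(x0, width - kVec);
            for (int i = 0; i < kVec; ++i) {  // this loop maps to one vector op
                out.at(base + i, y) = in.at(base + i, y);
            }
        }
    }
}
```

With the arrays from the question this would be called as `copy_rows(wrap(in, 100, 100), wrap(out, 100, 100));`. The point of the sketch is only that wrapping is cheap (a pointer and a few integers) and that the schedule changes the loop structure, not the result.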