C++ array to Halide Image (and back)


I'm getting started with Halide, and whilst I've grasped the basic tenets of its design, I'm struggling with the particulars (read: magic) required to efficiently schedule computations.

I've posted below a MWE of using Halide to copy an array from one location to another. I had assumed this would compile down to only a handful of instructions and take less than a microsecond to run. Instead, it produces 4000 lines of assembly and takes 40ms to run! Clearly, therefore, I have a significant hole in my understanding.

  1. What is the canonical way of wrapping an existing array in a Halide::Image?
  2. How should the function copy be scheduled to perform the copy efficiently?

Minimal working example

#include <Halide.h>

using namespace Halide;

void _copy(uint8_t* in_ptr, uint8_t* out_ptr, const int M, const int N) {

    Image<uint8_t> in(Buffer(UInt(8), N, M, 0, 0, in_ptr));
    Image<uint8_t> out(Buffer(UInt(8), N, M, 0, 0, out_ptr));

    Var x,y;
    Func copy;
    copy(x,y) = in(x,y);
    copy.realize(out);
}

int main(void) {
    uint8_t in[10000], out[10000];
    _copy(in, out, 100, 100);
}

Compilation Flags

clang++ -O3 -march=native -std=c++11 -Iinclude -Lbin -lHalide copy.cpp

There are 2 answers

znkr

Let me start with your second question: _copy takes a long time because it has to compile the Halide code to x86 machine code. IIRC, Func caches the machine code, but since copy is local to _copy, that cache cannot be reused. Scheduling copy is fairly simple because it is a pointwise operation. First, it would probably make sense to vectorize it. Second, it might make sense to parallelize it (depending on how much data there is). For example:

copy.vectorize(x, 32).parallel(y);

This vectorizes along x with a vector width of 32 and parallelizes along y. (I am writing this from memory, so the names may not be exactly right.) Of course, doing all this might also increase compile times.
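To make the shape of that schedule concrete, here is a hand-written plain C++ analogue of the loop nest it describes: an outer loop over y whose iterations are independent (the part that could run in parallel) and an inner loop over x that moves 32 elements at a time. This is only a sketch, not what Halide actually generates. The function name chunked_copy and the scalar tail loop are my own assumptions for illustration, and Halide handles leftover elements in its own way.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hand-written analogue of a row-major copy split into chunks along x.
// `chunk` plays the role of the vector width. Rows do not depend on each
// other, which is what makes the outer loop safe to parallelize.
void chunked_copy(const uint8_t* in, uint8_t* out,
                  int width, int height, int chunk) {
    assert(chunk > 0);
    for (int y = 0; y < height; ++y) {            // candidate for parallel(y)
        const uint8_t* src = in + static_cast<std::size_t>(y) * width;
        uint8_t* dst = out + static_cast<std::size_t>(y) * width;
        int x = 0;
        for (; x + chunk <= width; x += chunk) {  // full chunks, one "vector" each
            std::memcpy(dst + x, src + x, static_cast<std::size_t>(chunk));
        }
        for (; x < width; ++x) {                  // scalar tail (illustrative choice)
            dst[x] = src[x];
        }
    }
}
```

Calling chunked_copy(in, out, 100, 100, 32) on the arrays from the question copies each row as three chunks of 32 plus four leftover elements.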

There is no recipe for good scheduling. I do it by looking at the output of compile_to_lowered_stmt and profiling the code. I also use the AOT compilation provided by Halide::Generator, which ensures that I measure only the runtime of the code and not the compile time.

Your other question was how to wrap an existing array in a Halide::Image. I don't do that, mostly because I use AOT compilation. However, internally Halide uses a type called buffer_t for everything image related. There is also a C++ wrapper called Halide::Buffer that makes using buffer_t a little easier. I think it can also be used in Func::realize instead of Halide::Image. The point is: if you understand buffer_t, you can wrap almost anything into something digestible by Halide.
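As a rough picture of what "wrapping" means here, the sketch below describes an existing array with a pointer plus per-dimension extent, stride and min, and computes element addresses from those. RawView2D, wrap_row_major and at are hypothetical names made up for this example. This is not Halide's buffer_t, just a simplified stand-in for the kind of metadata such a descriptor carries, so check the Halide headers for the real fields.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an image descriptor. NOT Halide's struct.
struct RawView2D {
    uint8_t* host;   // pointer to the first element of the existing array
    int extent[2];   // size in x and in y
    int stride[2];   // distance, in elements, between neighbours in x and in y
    int min[2];      // coordinate of the first element in x and in y
};

// Describe a dense row-major array of rows x cols bytes without copying it.
RawView2D wrap_row_major(uint8_t* data, int rows, int cols) {
    return RawView2D{data, {cols, rows}, {1, cols}, {0, 0}};
}

// Element (x, y): offset each coordinate by min, then scale by stride.
uint8_t& at(const RawView2D& v, int x, int y) {
    assert(x >= v.min[0] && x < v.min[0] + v.extent[0]);
    assert(y >= v.min[1] && y < v.min[1] + v.extent[1]);
    return v.host[(x - v.min[0]) * v.stride[0] + (y - v.min[1]) * v.stride[1]];
}
```

Note that x is the fastest-varying dimension, so for an M-row by N-column array the x extent is N and the y extent is M, which matches the argument order used in the question.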

jrk

To emphasize the first thing Florian mentioned, which I think is the key point of misunderstanding here: you appear to be timing the compilation of the copy operation ("pipeline," in common Halide terms), not just its execution. Your code size estimate is presumably also for the whole binary resulting from copy.cpp, not just the code in the Halide-generated copy function (which won't actually even appear in the binary you're compiling with clang, since it is only constructed by JITing at runtime in this program).

You can observe the actual cost of your pipeline here by first calling copy.compile_jit() before realize (realize implicitly calls compile_jit the first time it is run, so it's not necessary, but it's valuable to factor apart the runtime from the compile overhead). You would then put your timer exclusively around realize.
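The measurement pattern described above can be sketched without Halide at all. In the plain C++ below, LazyPipeline is a hypothetical stand-in for a JIT-compiled pipeline (its compile and run members are invented for this example and only mimic the roles of compile_jit and realize), and time_us is an assumed helper that times a callable with a monotonic clock. The point is only the ordering: pay the one-time cost first, then time the run by itself.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>

// Hypothetical stand-in for a JIT pipeline: one-time setup, then cheap runs.
struct LazyPipeline {
    int compile_count = 0;
    bool compiled = false;

    // Plays the role of an explicit compile step; does nothing if already done.
    void compile() {
        if (compiled) return;
        ++compile_count;   // a real JIT would generate machine code here
        compiled = true;
    }

    // The first run compiles implicitly if compile() was never called.
    void run(const uint8_t* in, uint8_t* out, std::size_t n) {
        compile();
        std::memcpy(out, in, n);
    }
};

// Time a callable in microseconds using a monotonic clock.
double time_us(const std::function<void()>& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}
```

With this shape, you would call p.compile() once up front and then wrap only p.run(...) in time_us, ideally averaging over many runs.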

If you actually want to pre-compile this (or any other) pipeline for static linking into your ultimate program, which is what it seems you might be expecting, what you really want to do is use Func::compile_to_file in one program to compile and emit the code (as copy.h and copy.o), and then link and call these in another program. Check out tutorial lesson 10 to see this in more detail:

https://github.com/halide/Halide/blob/master/tutorial/lesson_10_aot_compilation_generate.cpp
https://github.com/halide/Halide/blob/master/tutorial/lesson_10_aot_compilation_run.cpp