I have two operations for manipulating elements of device vectors using CUDA Thrust. How can these two operations be implemented efficiently?
1. Replace parts of a vector in batches with the values from another vector. An example is shown below:

```
arr1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
arr2 = [1, 1, 1, 2, 2, 2]
// After replacing {4, 5, 6} and {10, 11, 12} with batch = 3:
arr1 = [1, 2, 3, 1, 1, 1, 7, 8, 9, 2, 2, 2]
```

In my case, I always have `size(arr1) / size(arr2) == 2`. We replace the values in `arr1` starting at the indices `1 * batch` and `3 * batch`. In most cases I need to replace items in the odd batches, but the general case of arbitrary batch indices is also needed.
2. Merge two vectors, alternating batches.

The same question was asked in "How to merge 2 vectors alternating indexes?", but for the R language.

```
arr1 = [1, 2, 3, 4, 5, 6]
arr2 = [1, 1, 1, 2, 2, 2]
// After merging arr1 and arr2:
arr3 = [1, 2, 3, 1, 1, 1, 4, 5, 6, 2, 2, 2]
```

`replace_copy_if` may work, but I don't know how to combine it with fancy iterators. Additionally, some blogs show that `replace_copy_if` is slower than `copy_if`.
This operation scatters the values of `arr2` into `arr1`, so let's use `thrust::scatter`. The indices to which the values are scattered can be calculated with a `thrust::transform_iterator` on top of a `thrust::counting_iterator` (similar to `std::ranges::views::iota`). For the general case, where the target batch indices are given as another input `thrust::device_vector`, you can use an index map that looks up the target batch, while in the specific case where you just want to scatter to the odd batches, you should rather compute the target index directly. Both variants are sketched below.
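A sketch of both index maps (the names `batch_indices`, `map_general`, and `map_odd` are mine; a complete, compilable listing follows at the end of the answer):

```cpp
// General case: the target batch of every batch of arr2 is given in a
// device vector batch_indices, e.g. {1, 3}; arr2[i] then goes to
// arr1[batch_indices[i / batch] * batch + i % batch].
// bidx is a device pointer, so map_general must only be dereferenced on the
// device.
const int* bidx = thrust::raw_pointer_cast(batch_indices.data());
const auto map_general = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch, bidx] __host__ __device__(int i) {
        return bidx[i / batch] * batch + i % batch;
    });

// Special case: batch b of arr2 goes to the odd batch 2 * b + 1 of arr1.
const auto map_odd = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch] __host__ __device__(int i) {
        return (2 * (i / batch) + 1) * batch + i % batch;
    });
```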
The `scatter` operation itself is then easy:
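A sketch of the call, using `map_odd` from above (`map_general` works the same way):

```cpp
// Scatter every element of arr2 to its computed position in arr1.
thrust::scatter(arr2.begin(), arr2.end(), map_odd, arr1.begin());
```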
For the second operation (batch-wise merging), there are certainly multiple possible ways of doing this. The one that came to mind first for me was merging the two vectors with `thrust::merge_by_key`, where the keys are generated using a similar scheme as above for the scatter indices (first sketch below). This works and is relatively elegant, but it is probably not ideal for performance due to the complexity of the merge algorithm relative to the regularity of the operation. A (probably) more performant ansatz is creating a fancy iterator that takes care of the whole interleaving operation (second sketch below).
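A sketch of the `merge_by_key` variant (the key iterators and `arr3` are my names; `arr1` and `arr2` are the vectors from the second example):

```cpp
// arr1's batch b gets key 2 * b, arr2's batch b gets key 2 * b + 1, so a
// merge by key interleaves whole batches; the merge is stable, so each
// batch stays in order.
const auto keys1 = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch] __host__ __device__(int i) { return 2 * (i / batch); });
const auto keys2 = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch] __host__ __device__(int i) { return 2 * (i / batch) + 1; });

thrust::device_vector<int> arr3(arr1.size() + arr2.size());
thrust::merge_by_key(keys1, keys1 + arr1.size(),          // keys of arr1
                     keys2, keys2 + arr2.size(),          // keys of arr2
                     arr1.begin(), arr2.begin(),          // values
                     thrust::make_discard_iterator(),     // merged keys unused
                     arr3.begin());
```

And a sketch of the fancy-iterator ansatz, here built from a `thrust::transform_iterator` that gathers from either input on the fly:

```cpp
const int* a1 = thrust::raw_pointer_cast(arr1.data());
const int* a2 = thrust::raw_pointer_cast(arr2.data());

// arr3[j] is computed on the fly: even output batches come from arr1, odd
// ones from arr2.
const auto arr3_begin = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch, a1, a2] __host__ __device__(int j) {
        const int b = j / batch; // output batch index
        const int k = j % batch; // offset inside the batch
        return (b % 2 == 0) ? a1[(b / 2) * batch + k]
                            : a2[(b / 2) * batch + k];
    });
```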
This iterator can now be fed to the next algorithm in your pipeline for kernel fusion, or it can be used to initialize a new vector through the constructor taking a `begin` and an `end` iterator. If the interleaved `arr3` is read multiple times later on, you should probably put it into a new vector instead of reusing the iterator, as the iterator does not allow for coalesced global memory access when `batch` is not a multiple of the warp size (32).

Complete source code:
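The following self-contained listing is a sketch combining the pieces above (the file name, the `main` scaffolding, and the printed-output comments are mine):

```cpp
// interleave.cu
// Compile with: nvcc -extended-lambda -rdc=true -o interleave interleave.cu
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/scatter.h>

#include <iostream>
#include <iterator>
#include <vector>

int main() {
    const int batch = 3;

    // 1. Scatter arr2 into the odd batches of arr1.
    const std::vector<int> h1{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    const std::vector<int> h2{1, 1, 1, 2, 2, 2};
    thrust::device_vector<int> arr1(h1.begin(), h1.end());
    thrust::device_vector<int> arr2(h2.begin(), h2.end());

    const auto map_odd = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch] __host__ __device__(int i) {
            return (2 * (i / batch) + 1) * batch + i % batch;
        });
    thrust::scatter(arr2.begin(), arr2.end(), map_odd, arr1.begin());

    thrust::copy(arr1.begin(), arr1.end(),
                 std::ostream_iterator<int>(std::cout, " "));
    std::cout << '\n'; // 1 2 3 1 1 1 7 8 9 2 2 2

    // 2. Interleave arr1 and arr2 batch-wise via a fancy iterator.
    const std::vector<int> h3{1, 2, 3, 4, 5, 6};
    thrust::device_vector<int> arr1b(h3.begin(), h3.end());
    const int* a1 = thrust::raw_pointer_cast(arr1b.data());
    const int* a2 = thrust::raw_pointer_cast(arr2.data());

    const auto arr3_begin = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, a1, a2] __host__ __device__(int j) {
            const int b = j / batch; // output batch index
            const int k = j % batch; // offset inside the batch
            return (b % 2 == 0) ? a1[(b / 2) * batch + k]
                                : a2[(b / 2) * batch + k];
        });

    // Materialize the interleaved sequence into a new vector.
    const int n3 = static_cast<int>(arr1b.size() + arr2.size());
    thrust::device_vector<int> arr3(arr3_begin, arr3_begin + n3);

    thrust::copy(arr3.begin(), arr3.end(),
                 std::ostream_iterator<int>(std::cout, " "));
    std::cout << '\n'; // 1 2 3 1 1 1 4 5 6 2 2 2
}
```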
Due to the use of device lambdas, `nvcc` needs `-extended-lambda`. Since CUDA 12, it also needs `-rdc=true` for some reason.