I'm using intrinsics to optimize a program of mine. Now I would like to sum the four elements of a __m128 vector in order to compare the result to a floating-point value. For instance, let's say I have this 128-bit vector: {a, b, c, d}. How can I compare a+b+c+d to e, where e is of type float?
Does SSE2 or SSE3 provide a simple way to do that, or do you have a code snippet that could help me? Thanks!
The best I can come up with is this:
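As a minimal sketch of one way to do the horizontal sum (not necessarily the exact snippet meant here), assuming only SSE1 is available: two shuffles and two adds leave a+b+c+d in the lowest lane. The helper names `hsum_ps` and `sum_less_than` are hypothetical, and the choice of `<` as the comparison is only an illustration.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Horizontal sum of the four floats in v = {a, b, c, d}, SSE1 only. */
static float hsum_ps(__m128 v)
{
    /* Swap neighbouring lanes: {b, a, d, c}. */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* Pairwise sums: {a+b, a+b, c+d, c+d}. */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Bring the upper pair down so that lane 0 holds c+d. */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Lane 0 becomes (a+b) + (c+d); the other lanes are ignored. */
    sums = _mm_add_ss(sums, shuf);
    /* Extract lane 0 as a scalar float. */
    return _mm_cvtss_f32(sums);
}

/* Hypothetical comparison helper: returns 1 if a+b+c+d < e, else 0. */
static int sum_less_than(__m128 v, float e)
{
    return hsum_ps(v) < e;
}
```

For example, `sum_less_than(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f), 11.0f)` compares 10.0f against 11.0f. With SSE3, two `_mm_hadd_ps` calls would give the same sum, though that is not necessarily faster.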
If A and B absolutely have to be in the low quadword then as far as I can tell you need a shuffle, which is slower on pre-Penryn (and on a Penryn the DPPS solution is available).
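For reference, the DPPS route mentioned above might look like the sketch below: a dot product with a vector of ones yields the sum of all four lanes. This assumes an SSE4.1-capable CPU and a GCC or Clang compiler (the `target` attribute is compiler-specific); the helper name `hsum_dpps` is hypothetical.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Assumption: GCC or Clang. The target attribute enables SSE4.1 for this
   function only, so the rest of the file can be built without -msse4.1. */
__attribute__((target("sse4.1")))
static float hsum_dpps(__m128 v)
{
    /* Mask 0xF1: the high nibble (F) multiplies all four lanes by 1.0f and
       adds them; the low nibble (1) stores the result in lane 0 only. */
    __m128 dp = _mm_dp_ps(v, _mm_set1_ps(1.0f), 0xF1);
    /* Extract lane 0 as a scalar float. */
    return _mm_cvtss_f32(dp);
}
```

The result can then be compared to e with an ordinary scalar comparison, for example `hsum_dpps(v) < e`.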