SSE - Vectorizing conditional code

I've written several articles about how to use SSE in C/C++ code, a beginners' tutorial, an explanation on saturation arithmetic with SSE and SSE applied to image processing. In this article I will introduce a new technique to help converting your scalar code to an SSE-accelerated code.

Vectorization

Vectorization is the task of converting a scalar code to a code using SSE instructions. The benefit is that a single instruction will process several elements at the same time (SIMD: Single Instruction, Multiple Data).

Auto-vectorization

Sometimes the compiler manages to transform the scalar code you've written into a vectorized code using SIMD instructions, this process is called auto-vectorization. For example you can read the documentation of GCC to check the status of its tree vectorizer: http://gcc.gnu.org/projects/tree-ssa/vectorization.html, several examples of vectorizable loops are given.

Manual Vectorization

There are a lot of tips to help auto-vectorization: for example rewriting the code to avoid some constructs, using builtins to help the compiler, etc.
But sometimes, despite all your efforts, the compiler cannot vectorize your code. If you really need a performance speedup then you will have to do vectorization by hand using the _mm_* functions. These functions are really low-level so this process can be quite hard and writing an efficient SSE code is difficult.

Vectorization of conditional code

If you are new to SSE, you might be wondering how conditional code should be vectorized. We will show in this article that this is in fact really easy. You only need to know boolean algebra.

Scalar code

We will attempt to use SSE instructions to speedup the following scalar code:

#include <cmath>
 
void scalar(float* __restrict__ result,
            const float* __restrict__ v,
            unsigned length)
{
  for (unsigned i = 0; i < length; ++i)
  {
    float val = v[i];
    if (val >= 0.f)
      result[i] = std::sqrt(val);
    else
      result[i] = val;
  }
}

This example is simple: we have a (long) array of floats and we want to compute the square root of each value from these array. Given that a negative number has an imaginary square root, we will only compute the square root of positive values from the array.

Auto-vectorization attempt

Before using SSE instructions, let's see if GCC manages to vectorize our code, we compile the code using the following flags: -O3 -msse3 -ffast-math.
We also use the flag -ftree-vectorizer-verbose=10 to get feedback from the vectorizer on each analyzed loop. Here is the output for our function:

sse.cc:12: note: ===== analyze_loop_nest =====
sse.cc:12: note: === vect_analyze_loop_form ===
sse.cc:12: note: not vectorized: control flow in loop.
sse.cc:12: note: bad loop form.
sse.cc:8: note: vectorized 0 loops in function.

Auto-vectorization failed because of the if statement, we will have to vectorize the code by hand.
Note that if we remove the if (we compute the square root on all elements), then auto-vectorization is successful:

sse.cc:12: note: Cost model analysis:
Vector inside of loop cost: 4
Vector outside of loop cost: 20
Scalar iteration cost: 3
Scalar outside cost: 7
prologue iterations: 2
epilogue iterations: 2
Calculated minimum iters for profitability: 5

sse.cc:12: note: Profitability threshold = 4

sse.cc:12: note: Profitability threshold is 4 loop iterations.
sse.cc:12: note: LOOP VECTORIZED.
sse.cc:8: note: vectorized 1 loops in function.

Branchless conditional using a mask

Load and store

Based on what I've shown in my SSE tutorial, we already know the skeleton of our algorithm: first we need to load 4 float values from memory to a SSE register, then we compute our square root and when we are done we store the result in memory. As a reminder, we will be using the _mm_load_ps function to load from memory to a SSE register and the _mm_store_ps to load from a register to memory (we assume memory accesses are 16-bytes aligned). Our skeleton is therefore as follows:

void sse(float* __restrict__ result,
         const float* __restrict__ v,
         unsigned length)
{
  for (unsigned i = 0; i <= length - 4; i += 4)
  {
    __m128 vec = _mm_load_ps(v + i);
 
    // FIXME: compute square root.
 
    _mm_store_ps(result + i, res);
  }
}

Generating a 128 bits mask

SSE offers a whole range of comparison instructions. These instructions perform 4 comparisons at the same time and the result is stored in another register.

For our vectorization attempt we will use the _mm_cmpge_ps instruction (compare greater or equal):

r0 := (a0 >= b0) ? 0xffffffff : 0x0
r1 := (a1 >= b1) ? 0xffffffff : 0x0
r2 := (a2 >= b2) ? 0xffffffff : 0x0
r3 := (a3 >= b3) ? 0xffffffff : 0x0

If the comparison is true for one pair of values from the operands, all corresponding bits of the result will be set, otherwise the bits are cleared.

Using the mask

The result of the comparison instruction can be used as a logical mask because all the 32 bits corresponding to a float value are set. We can use this mask to conditionally select some floats from a SSE register. In order to vectorize our code, we will use the cmpge instruction and compare 4 values with 0:

   __m128 zero = _mm_set1_ps(0.f);
   __m128 mask = _mm_cmpge_ps(vec, zero);

(Note: as we are comparing with zero, we could have just used the sign bit here).

The trick is now simple, we perform the square root operation as if there were no negative values:

   __m128 sqrt = _mm_sqrt_ps(vec);

If there were negative values, our sqrt variable is now filled with some NaN values. We simply extract the non-NaN values using a binary AND with our previous mask:

   _mm_and_ps(mask, sqrt)

But, if there were negative values, some values in the resulting register are just 0. We need to copy the values from the original array. We just negate the mask and perform a binary AND with the register containing the original values (variable vec)

   _mm_andnot_ps(mask, vec)

Now we just need to combine the two registers using a binary OR and that's it!
Here is the final code for this function:

void sse(float* __restrict__ result,
         const float* __restrict__ v,
         unsigned length)
{
  __m128 zero = _mm_set1_ps(0.f);
 
  for (unsigned i = 0; i <= length - 4; i += 4)
  {
    __m128 vec = _mm_load_ps(v + i);
    __m128 mask = _mm_cmpge_ps(vec, zero);
    __m128 sqrt = _mm_sqrt_ps(vec);
    __m128 res = _mm_or_ps(_mm_and_ps(mask, sqrt),
                           _mm_andnot_ps(mask, vec));
    _mm_store_ps(result + i, res);
  }
}

Benchmark

Here is a small benchmark to test if our implementation is indeed faster, we tested both implementations on 3 arrays of different size. The CPU used is AMD Phenom II X4 955

pow(2, 16) pow(2, 20) pow(2, 24)
Scalar 0.428 ms 6.749 ms 131.4 ms
SSE 0.116 ms 2.124 ms 51.8 ms


If you want coherent benchmark results, don't forget to clean your cache between two iterations!

void flush_cache()
{
  static unsigned length = 1 << 26;
  static char* input = new char[length];
  static char* output = new char[length];
 
  memcpy(output, input, length);
}

Conclusion

Thanks to the comparison instructions available with SSE, we can vectorize a scalar code that used a if statement.
Our scalar code was very simple, but if it was at the end of a complicated computation, it would be tempting to perform the square root without SSE on the result vector. We have shown here that this computation can also be done using SSE, thus maximizing parallelism.

Comments are closed.