SSE - Saturation Arithmetic

In a previous article I presented the SSE instruction set and how to use it in C/C++ with simple examples.
In this article I will introduce other instructions that have a really nice property: they allow you to use saturation arithmetic.

Saturation Arithmetic

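With saturation arithmetic, the result of an operation is clamped to a fixed range between a minimum and a maximum value: any result above the maximum becomes the maximum, and any result below the minimum becomes the minimum. For example, with saturation arithmetic in the range [0, 255], we have 200 + 70 = 255 and 20 - 70 = 0.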
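
To make the definition concrete, here is a minimal scalar sketch of saturating addition and subtraction in the range [0, 255]. The helper names saturating_add_u8 and saturating_sub_u8 are hypothetical and not part of any library: each computes the exact result in an int and then clamps it.

```cpp
#include <algorithm>

// Hypothetical helper: 8-bit unsigned addition clamped to [0, 255].
unsigned char saturating_add_u8(unsigned char x, unsigned char y) {
    int sum = int(x) + int(y);  // exact result, computed in int so it cannot wrap
    return static_cast<unsigned char>(std::min(sum, 255));
}

// Hypothetical helper: 8-bit unsigned subtraction clamped to [0, 255].
unsigned char saturating_sub_u8(unsigned char x, unsigned char y) {
    int diff = int(x) - int(y);  // exact result, may be negative
    return static_cast<unsigned char>(std::max(diff, 0));
}
```

With these helpers, saturating_add_u8(200, 70) gives 255 and saturating_sub_u8(20, 70) gives 0, matching the example above.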

This property would be useful in C/C++, because regular integer arithmetic there is modular. For example, with an unsigned char we get 200 + 70 = 14 and 20 - 70 = 206: the exact results 270 and -50 are reduced modulo 256 when they are stored back into the unsigned char. This wraparound is called overflow; at the CPU level it can be detected using the carry flag (for unsigned values) or the overflow flag (for signed values).
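
The modular behaviour can be reproduced in plain C++. This is a small sketch with hypothetical helper names: the operands are promoted to int, and converting the result back to unsigned char keeps it modulo 256. Portable C++ cannot read the carry flag directly, so the sketch uses the usual substitute: an unsigned addition wrapped around exactly when the result is smaller than one of its operands.

```cpp
// Hypothetical helper: regular (modular) unsigned char addition.
unsigned char wrapping_add_u8(unsigned char x, unsigned char y) {
    return static_cast<unsigned char>(x + y);  // 200 + 70 = 270 -> 270 - 256 = 14
}

// Hypothetical helper: regular (modular) unsigned char subtraction.
unsigned char wrapping_sub_u8(unsigned char x, unsigned char y) {
    return static_cast<unsigned char>(x - y);  // 20 - 70 = -50 -> -50 + 256 = 206
}

// Hypothetical helper: true when the unsigned addition wrapped around,
// i.e. when the stored result is smaller than the first operand.
bool add_overflowed_u8(unsigned char x, unsigned char y) {
    return wrapping_add_u8(x, y) < x;
}
```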

Saturation with SSE

As mentioned in the previous article, SSE was initially designed for signal processing and graphics processing. For example, imagine we want to add or subtract two grayscale images: if we had to clip every result by hand after the operation, we would lose a lot of time.
Fortunately, SSE provides special instructions for saturation arithmetic: with a single assembly instruction, you can add several values at the same time and clip all the results.
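
As an illustration of the image use case, here is a sketch of adding two 8-bit grayscale images stored as flat pixel arrays. The function name add_images_saturated and its buffer layout are assumptions made for this example. The main loop lets _mm_adds_epu8 clip 16 pixels per instruction; only the few leftover pixels are clipped by hand. It assumes an x86 CPU with SSE2 and non-overlapping buffers.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (_mm_adds_epu8, _mm_loadu_si128, ...)
#include <cstddef>

// Hypothetical helper: dst[i] = min(a[i] + b[i], 255) for n pixels.
void add_images_saturated(const unsigned char* a, const unsigned char* b,
                          unsigned char* dst, std::size_t n) {
    std::size_t i = 0;
    // 16 pixels per iteration; the instruction itself performs the clipping.
    // Unaligned loads/stores are used so that any buffer address works.
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_adds_epu8(va, vb));
    }
    // Remaining pixels (fewer than 16): clip by hand.
    for (; i < n; ++i) {
        int sum = a[i] + b[i];
        dst[i] = static_cast<unsigned char>(sum > 255 ? 255 : sum);
    }
}
```

Subtracting two images works the same way with _mm_subs_epu8 in the main loop and a clamp at 0 in the tail.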

Example

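In the last article we used the _mm_add_epi8 function to add 16 chars at the same time. To perform the same operation with saturation, we use _mm_adds_epi8 for signed bytes or _mm_adds_epu8 for unsigned bytes (notice the additional 's'; the 'u' in epu8 stands for unsigned). With modular arithmetic, signed and unsigned addition produce the same bits, so one instruction is enough; with saturation, the clamping bounds depend on how the bytes are interpreted, hence the two variants.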
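
The difference between the two variants shows up when the same bytes are fed to both. In this sketch (the helper names adds_signed and adds_unsigned are hypothetical), the byte 0xC8 reads as -56 when signed and 200 when unsigned: adding 70 gives 14 with _mm_adds_epi8 but saturates to 255 with _mm_adds_epu8. Each helper broadcasts its operands to all 16 lanes with _mm_set1_epi8 and returns lane 0.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: signed saturating add, results clamped to [-128, 127].
int adds_signed(signed char x, signed char y) {
    alignas(16) signed char out[16];
    __m128i r = _mm_adds_epi8(_mm_set1_epi8(x), _mm_set1_epi8(y));
    _mm_store_si128(reinterpret_cast<__m128i*>(out), r);
    return out[0];
}

// Hypothetical helper: unsigned saturating add, results clamped to [0, 255].
int adds_unsigned(unsigned char x, unsigned char y) {
    alignas(16) unsigned char out[16];
    __m128i r = _mm_adds_epu8(_mm_set1_epi8(static_cast<char>(x)),
                              _mm_set1_epi8(static_cast<char>(y)));
    _mm_store_si128(reinterpret_cast<__m128i*>(out), r);
    return out[0];
}
```

For instance, 100 + 100 clamps to 127 with the signed variant but gives 200 with the unsigned one.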

In the following example we add two unsigned char values and print the result (but remember you can add 16 values at the same time).

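  #include <emmintrin.h> // SSE2 intrinsics
  #include <iostream>

  int main() {
    // Only a[0] and b[0] are set explicitly; the other 15 bytes are zero.
    unsigned char a[16] __attribute__ ((aligned (16))) = { 200 };
    unsigned char b[16] __attribute__ ((aligned (16))) = { 70 };
    __m128i l = _mm_load_si128((__m128i*)a);
    __m128i r = _mm_load_si128((__m128i*)b);

    // Unsigned saturating add of 16 bytes at once: 200 + 70 clamps to 255.
    _mm_store_si128((__m128i*)a, _mm_adds_epu8(l, r));
    std::cout << (int)a[0] << std::endl;
  }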

This small program should print 255, instead of 14 with modular arithmetic.

We can also subtract unsigned bytes using _mm_subs_epu8 (the signed variant is _mm_subs_epi8):

  #include <emmintrin.h> // SSE2 intrinsics
  #include <iostream>

  int main() {
    unsigned char a[16] __attribute__ ((aligned (16))) = { 20 };
    unsigned char b[16] __attribute__ ((aligned (16))) = { 70 };
    __m128i l = _mm_load_si128((__m128i*)a);
    __m128i r = _mm_load_si128((__m128i*)b);

    // Unsigned saturating subtraction: 20 - 70 clamps to 0.
    _mm_store_si128((__m128i*)a, _mm_subs_epu8(l, r));
    std::cout << (int)a[0] << std::endl;
  }

This program should output 0 (206 with modular arithmetic).
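
Saturating subtraction also enables a handy trick for unsigned data such as pixel values: in each lane, one of the two subtractions a - b and b - a clamps to 0 while the other yields the positive difference, so OR-ing them gives |a - b| without any overflow. The function name abs_diff_epu8 below is hypothetical; this is a sketch of the technique, not a library function.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: per-lane absolute difference |a - b| of 16 unsigned bytes.
__m128i abs_diff_epu8(__m128i a, __m128i b) {
    // In every lane at least one of the two saturating subtractions is 0.
    return _mm_or_si128(_mm_subs_epu8(a, b), _mm_subs_epu8(b, a));
}
```

With the values from the examples above, |20 - 70| gives 50 in lane 0, where the plain modular subtraction would have produced 206.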

Limitations

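With SSE, saturation arithmetic is limited to signed and unsigned 8-bit and 16-bit integer operands (the epi8, epu8, epi16 and epu16 variants); there are no 32-bit or 64-bit saturating variants. The available saturating arithmetic operations are addition (adds) and subtraction (subs).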
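
The 16-bit variants work like the 8-bit ones, with 8 lanes per 128-bit register instead of 16. In this sketch (the helper names adds_i16 and subs_u16 are hypothetical), _mm_adds_epi16 clamps to [-32768, 32767] and _mm_subs_epu16 clamps to [0, 65535]; each helper broadcasts its operands with _mm_set1_epi16 and returns lane 0.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: signed 16-bit saturating add.
std::int16_t adds_i16(std::int16_t x, std::int16_t y) {
    alignas(16) std::int16_t out[8];
    _mm_store_si128(reinterpret_cast<__m128i*>(out),
                    _mm_adds_epi16(_mm_set1_epi16(x), _mm_set1_epi16(y)));
    return out[0];
}

// Hypothetical helper: unsigned 16-bit saturating subtraction.
std::uint16_t subs_u16(std::uint16_t x, std::uint16_t y) {
    alignas(16) std::uint16_t out[8];
    _mm_store_si128(reinterpret_cast<__m128i*>(out),
                    _mm_subs_epu16(_mm_set1_epi16(static_cast<short>(x)),
                                   _mm_set1_epi16(static_cast<short>(y))));
    return out[0];
}
```

For example, 30000 + 10000 clamps to 32767 with adds_i16, and 1000 - 3000 clamps to 0 with subs_u16.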