# Parallel Algorithms of the Standard Template Library


The idea is quite simple. The Standard Template Library has more than 100 algorithms for searching, counting, and manipulating ranges and their elements. With C++17, 69 of them get overloads and a few new ones are added. The overloaded and new algorithms can be invoked with a so-called execution policy. Using the execution policy, you specify whether the algorithm should run sequentially, in parallel, or in parallel with vectorisation.

## A first example

Vectorisation stands for the SIMD (Single Instruction, Multiple Data) extensions of a modern processor's instruction set. SIMD enables your processor to execute one operation in parallel on several pieces of data.

Which overloaded variant of an algorithm is used can be chosen by the policy tag. How does that work?

```
std::vector<int> v = ...

// standard sequential sort
std::sort(v.begin(), v.end());

// sequential execution
std::sort(std::parallel::seq, v.begin(), v.end());

// permitting parallel execution
std::sort(std::parallel::par, v.begin(), v.end());

// permitting parallel and vectorized execution
std::sort(std::parallel::par_unseq, v.begin(), v.end());
```

The example shows that you can still use the classic variant of std::sort (line 4). In contrast, in C++17 you explicitly specify whether the sequential (line 7), parallel (line 10), or parallel and vectorised (line 13) version is used.

You have to keep two points in mind.

## Two special points

On the one hand, an algorithm will not necessarily be executed in a parallel and vectorised way just because you use the execution policy std::parallel::par_unseq. On the other hand, you as the user are responsible for using the algorithm correctly.

### Parallel and vectorised execution

Whether an algorithm runs in a parallel and vectorised way depends on many factors. It depends on whether the CPU and the operating system support SIMD instructions. Additionally, it is a question of the compiler and the optimisation level you used to translate your code.

```
const int SIZE = 8;

int vec[] = {1, 2, 3, 4, 5, 6, 7, 8};
int res[SIZE] = {0,};

int main(){
  for (int i = 0; i < SIZE; ++i) {
    res[i] = vec[i] + 5;
  }
}
```

Line 8 is the key line of the small program. Thanks to the Compiler Explorer (https://godbolt.org/), it is quite easy to generate the assembler instructions for clang 3.6 with and without maximum optimisation (-O3).

#### Without Optimisation

Although my days of fiddling with assembler instructions are long gone, it is quite obvious that everything is done sequentially.

#### With maximum optimisation

By using maximum optimisation, I get instructions that operate in parallel on several pieces of data.

The move operation (movdqa) and the add operation (paddd) use the special registers xmm0 and xmm1. Both are so-called SSE registers and are 128 bits wide. SSE stands for Streaming SIMD Extensions.

### Hazards of data races and deadlocks

The algorithms do not automatically protect you from data races and deadlocks. An example?

```
int numComp = 0;

std::vector<int> vec = {1, 3, 8, 9, 10};

std::sort(std::parallel::par, vec.begin(), vec.end(),
          [&numComp](int fir, int sec){ numComp++; return fir < sec; });
```

The small code snippet has a data race: numComp, which counts the number of comparisons, is modified concurrently in the lambda function. For the code to be well-defined, numComp has to be an atomic variable.

## Static versus dynamic execution policy

Sorry to say, but the dynamic execution policy did not make it into the C++17 standard. We have to wait for C++20.

The creation of a thread is an expensive operation; therefore, it makes no sense to sort a small container in a parallel (std::parallel::par) or parallel and vectorised (std::parallel::par_unseq) fashion: the administrative overhead of dealing with threads outweighs the benefit of parallelisation. It gets even worse if you use a divide-and-conquer algorithm such as quicksort.

```
template <typename ForwardIt>
void quicksort(ForwardIt first, ForwardIt last){
  if(first == last) return;
  auto pivot = *std::next(first, std::distance(first, last)/2);
  ForwardIt middle1 = std::partition(std::parallel::par, first, last,
                      [pivot](const auto& em){ return em < pivot; });
  ForwardIt middle2 = std::partition(std::parallel::par, middle1, last,
                      [pivot](const auto& em){ return !(pivot < em); });
  quicksort(first, middle1);
  quicksort(middle2, last);
}
```

Now the issue is that the number of threads becomes far too large for your system. To solve this issue, the parallelism proposal supports a dynamic execution policy.

```
std::size_t threshold = ...;  // some value

template <typename ForwardIt>
void quicksort(ForwardIt first, ForwardIt last){
  if(first == last) return;
  std::size_t distance = std::distance(first, last);
  auto pivot = *std::next(first, distance/2);

  std::parallel::execution_policy exec_pol = std::parallel::par;
  if ( distance < threshold ) exec_pol = std::parallel::seq;

  ForwardIt middle1 = std::partition(exec_pol, first, last,
                      [pivot](const auto& em){ return em < pivot; });
  ForwardIt middle2 = std::partition(exec_pol, middle1, last,
                      [pivot](const auto& em){ return !(pivot < em); });
  quicksort(first, middle1);
  quicksort(middle2, last);
}
```

In lines 9 and 10, I use the dynamic execution policy. By default, quicksort runs in parallel (line 9). If the length of the range is smaller than a given threshold (line 1), quicksort runs sequentially (line 10).

## All algorithms

69 of the algorithms of the STL support a parallel or a parallel and vectorised execution.

In addition, we get 8 new algorithms.

### New algorithms

The new variation of std::for_each and the new algorithms std::for_each_n, std::exclusive_scan, std::inclusive_scan, std::transform_exclusive_scan, std::transform_inclusive_scan, std::reduce, and std::transform_reduce are in the std namespace.

```std::for_each
std::for_each_n
std::exclusive_scan
std::inclusive_scan
std::transform_exclusive_scan
std::transform_inclusive_scan
std::reduce
std::transform_reduce
```

Let's have a closer look at std::transform_reduce.

### transform becomes map

The function map, well known from Haskell, is called std::transform in C++. If that is not a broad hint: when I substitute transform with map in the name std::transform_reduce, I get std::map_reduce. MapReduce is the well-known parallel framework that, in a first phase, maps each value to a new value and, in a second phase, reduces all values to one result.

The algorithm is directly applicable in C++17. Of course, my algorithm will not run on a big, distributed system, but the strategy is the same: in the map phase, I map each word to its length; in the reduce phase, I reduce the lengths of all words to their sum. The result is the sum of the lengths of all words.

```
std::vector<std::string> str{"Only", "for", "testing", "purpose"};

std::size_t result = std::transform_reduce(std::parallel::par,
                         str.begin(), str.end(),
                         [](std::string s){ return s.length(); },
                         0, [](std::size_t a, std::size_t b){ return a + b; });

std::cout << result << std::endl;      // 21
```

The 0 is the initial element for the reduction.

## What's next?

With the next post, I will go three years further into the future: in C++20 we get atomic smart pointers. In line with their C++11 counterparts, they are called std::atomic_shared_ptr and std::atomic_weak_ptr.

Thanks a lot to my Patreon Supporters: Matt Braun, Roman Postanciuc, Tobias Zindl, Marko, G Prvulovic, Reinhold Dröge, Abernitzke, Frank Grimm, Sakib, Broeserl, António Pina, Sergey Agafyin, Андрей Бурмистров, Jake, GS, Lawton Shoemake, Animus24, Jozo Leko, John Breland, espkk, Wolfgang Gärtner, Louis St-Amour, Stephan Roslen, Venkat Nandam, Jose Francisco, Douglas Tinkham, Kuchlong Kuchlong, Robert Blanch, Truels Wissneth, Kris Kafka, Mario Luoni, Neil Wang, Friedrich Huber, lennonli, Pramod Tikare Muralidhara, Peter Ware, Tobi Heideman, Daniel Hufschläger, Red Trip, Alexander Schwarz, Tornike Porchxidze, Alessandro Pezzato, and Evangelos Denaxas.

Thanks in particular to Jon Hess, Lakshman, Christian Wittenhorst, Sherhy Pyton, Dendi Suhubdy, Sudhakar Belagurusamy, and Richard Sargeant.

## Seminars

I'm happy to give online seminars or face-to-face seminars worldwide. Please contact me if you have any questions.

### Standard Seminars (English/German)

Here is a compilation of my standard seminars. These seminars are only meant to give you a first orientation.


Tags: C++17

