GCC supports my favorite C++17 feature: the parallel algorithms of the Standard Template Library (STL). I noticed this a few days ago, and I'm happy to write a post about it and share my enthusiasm.

The Microsoft compiler has supported the parallel algorithms from the beginning, but sadly, neither GCC nor Clang did. To be precise: GCC 9 allows you to use the parallel algorithms. Before I show you examples with performance numbers in my next post, I want to introduce the parallel algorithms of the STL and give you the necessary background information.
Parallel Algorithms of the Standard Template Library
The Standard Template Library has more than 100 algorithms for searching, counting, and manipulating ranges and their elements. With C++17, 69 of them get new overloads, and new ones are added. The overloaded and new algorithms can be invoked with a so-called execution policy. Using an execution policy, you can specify whether the algorithm should run sequentially, in parallel, or in parallel with vectorization. To use an execution policy, you have to include the header <execution>.
Execution Policy
The C++17 standard defines three execution policies:
std::execution::sequenced_policy
std::execution::parallel_policy
std::execution::parallel_unsequenced_policy
The corresponding policy tag specifies whether a program should run sequentially, in parallel, or in parallel with vectorization.
std::execution::seq: runs the program sequentially
std::execution::par: runs the program in parallel on multiple threads
std::execution::par_unseq: runs the program in parallel on multiple threads and allows the interleaving of individual loops; this permits a vectorized version with SIMD (Single Instruction, Multiple Data).
Using the execution policy std::execution::par or std::execution::par_unseq allows the algorithm to run in parallel or in parallel and vectorized. The policy is a permission, not a requirement.
The following code snippet shows the application of all execution policies.
std::vector<int> v = {1, 2, 3, 4, 5, 6, 7, 8, 9};
// standard sequential sort
std::sort(v.begin(), v.end()); // (1)
// sequential execution
std::sort(std::execution::seq, v.begin(), v.end()); // (2)
// permitting parallel execution
std::sort(std::execution::par, v.begin(), v.end()); // (3)
// permitting parallel and vectorized execution
std::sort(std::execution::par_unseq, v.begin(), v.end()); // (4)
The example shows that you can still use the classic variant of std::sort (1). In addition, C++17 lets you specify explicitly whether the sequential (2), parallel (3), or parallel and vectorized (4) version should be used.
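For completeness, here is a self-contained version of the snippet. The file name and the compile line are only assumptions; with GCC 9, the parallel algorithms are implemented on top of Intel's Threading Building Blocks, so you typically have to link against TBB.
// parallelSort.cpp (hypothetical file name)
// a possible compile line with GCC 9: g++ -std=c++17 -O2 parallelSort.cpp -ltbb

#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<int> v = {9, 8, 7, 6, 5, 4, 3, 2, 1};
    std::sort(std::execution::par, v.begin(), v.end());   // permitting parallel execution
}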
Parallel and Vectorized Execution
Whether an algorithm runs in a parallel and vectorized way depends on many factors. For example, it depends on whether the CPU and the operating system support SIMD instructions. It also depends on the compiler and the optimization level you used to translate your code.
The following example shows a simple loop that fills the array res with the elements of vec, each increased by 5.
const int SIZE = 8;

int vec[] = {1, 2, 3, 4, 5, 6, 7, 8};
int res[] = {0, 0, 0, 0, 0, 0, 0, 0};

int main() {
    for (int i = 0; i < SIZE; ++i) {
        res[i] = vec[i] + 5;
    }
}
The expression res[i] = vec[i] + 5 is the crucial line in this small example. Thanks to Compiler Explorer, we can take a closer look at the assembler instructions generated by clang 3.6.
Without Optimization
Here are the assembler instructions. Each addition is done sequentially.

With Maximum Optimization
Using the highest optimization level, -O3, special registers such as xmm0 are used that can hold 128 bits or four ints. Thanks to these special registers, the addition takes place in parallel on four elements of the vector.

An overload of an algorithm without an execution policy and an overload of an algorithm with the sequential execution policy std::execution::seq differ in one aspect: exceptions.
Exceptions
If an exception occurs during the use of an algorithm with an execution policy, std::terminate is called. std::terminate calls the installed std::terminate_handler. The consequence is that, by default, std::abort is called, which causes abnormal program termination. The handling of exceptions is the one difference between the invocation of an algorithm without an execution policy and an invocation with the sequential execution policy std::execution::seq. The invocation without an execution policy propagates the exception, and, therefore, the exception can be handled.
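The following sketch, assuming a comparison callable that always throws, makes the difference visible: only the first call, the one without an execution policy, lets you catch the exception. The second call ends in std::terminate, so the program terminates abnormally at that point.
#include <algorithm>
#include <execution>
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> v = {3, 1, 2};
    auto throwingComp = [](int, int) -> bool { throw std::runtime_error("comparison failed"); };

    try {
        std::sort(v.begin(), v.end(), throwingComp);                        // exception propagates
    }
    catch (const std::runtime_error& e) {
        std::cout << "caught: " << e.what() << '\n';
    }

    try {
        std::sort(std::execution::seq, v.begin(), v.end(), throwingComp);   // std::terminate is called
    }
    catch (const std::runtime_error&) {
        std::cout << "never reached" << '\n';                               // this handler is not entered
    }
}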
With C++17, 69 of the STL algorithms got new overloads, and new algorithms were added.
Algorithms
Here are the 69 algorithms with parallelized versions.

The New Algorithms
The new algorithms in C++17, which are designed for parallel execution, are in the std namespace and need the header <numeric>.
std::exclusive_scan: Applies a binary callable from the left up to the ith (exclusive) element of the range. The left argument of the callable is the previous result. Stores the intermediate results.
std::inclusive_scan: Applies a binary callable from the left up to the ith (inclusive) element of the range. The left argument of the callable is the previous result. Stores the intermediate results.
std::transform_exclusive_scan: First applies a unary callable to the range and then applies std::exclusive_scan.
std::transform_inclusive_scan: First applies a unary callable to the range and then applies std::inclusive_scan.
std::reduce: Applies a binary callable to the range.
std::transform_reduce: First applies a unary callable to one range or a binary callable to two ranges and then applies std::reduce to the resulting range.
Admittedly, this description is not easy to digest, but if you already know std::accumulate and std::partial_sum, the reduce and scan variations should be pretty familiar. std::reduce is the parallel pendant to std::accumulate, and the scans are the parallel pendants to std::partial_sum. The parallel execution is the reason that std::reduce needs an associative and commutative callable. The corresponding statement holds for the scan variations, in contrast to the partial_sum variations. To get the full details, visit cppreference.com/algorithm.
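To make the descriptions more concrete, here is a minimal sketch with std::exclusive_scan, std::inclusive_scan, and std::transform_reduce. The values in the comments follow directly from the definitions above; the parallel execution policy is, as before, only a permission.
#include <execution>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> v = {1, 2, 3, 4};

    // running sum that excludes the current element; starts with the initial value 0
    std::vector<int> exc(v.size());
    std::exclusive_scan(std::execution::par, v.begin(), v.end(), exc.begin(), 0);   // exc: 0 1 3 6

    // running sum that includes the current element
    std::vector<int> inc(v.size());
    std::inclusive_scan(std::execution::par, v.begin(), v.end(), inc.begin());      // inc: 1 3 6 10

    // doubles each element first and then sums the results: 2 + 4 + 6 + 8
    auto res = std::transform_reduce(std::execution::par, v.begin(), v.end(), 0,
                                     std::plus<int>(), [](int i){ return i * 2; });
    std::cout << res << '\n';                                                        // 20
}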
You may wonder why we need std::reduce for parallel execution when we already have std::accumulate. The reason is that std::accumulate processes its elements in an order that cannot be parallelized.
std::accumulate versus std::reduce
While std::accumulate processes its elements from left to right, std::reduce does it in an arbitrary order. Let me start with a small code snippet using std::accumulate and std::reduce. The callable is the lambda function [](int a, int b){ return a * b; }.
std::vector<int> v{1, 2, 3, 4};

std::accumulate(v.begin(), v.end(), 1, [](int a, int b){ return a * b; });                  // 1 * 1 * 2 * 3 * 4 = 24
std::reduce(std::execution::par, v.begin(), v.end(), 1, [](int a, int b){ return a * b; }); // 24
The two following graphs show the different processing strategies of std::accumulate and std::reduce.
std::accumulate starts at the left and successively applies the binary operator.

On the contrary, std::reduce applies the binary operator in a non-deterministic way.

The associativity of the callable allows the std::reduce algorithm to apply the reduction step to arbitrary adjacent pairs of elements. Thanks to commutativity, the intermediate results can be computed in an arbitrary order.
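A small sketch shows why this matters. With subtraction, a callable that is neither associative nor commutative, std::accumulate performs a well-defined left fold, while the result of std::reduce is non-deterministic because the algorithm may regroup and reorder the reduction steps.
#include <execution>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> v = {1, 2, 3, 4};

    // left fold: (((100 - 1) - 2) - 3) - 4 = 90
    auto acc = std::accumulate(v.begin(), v.end(), 100, std::minus<int>());

    // non-deterministic for a non-associative, non-commutative callable such as subtraction
    auto red = std::reduce(std::execution::par, v.begin(), v.end(), 100, std::minus<int>());

    std::cout << acc << '\n';   // 90
    std::cout << red << '\n';   // may differ from 90
}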
What's next?
As promised, my next post uses the parallel algorithms of the STL and provides performance numbers for the Microsoft compiler and GCC.