Parallel Algorithms of the STL with the GCC Compiler


GCC supports my favorite C++17 feature: the Standard Template Library (STL) parallel algorithms. I recognized this a few days ago, and I'm happy to write a post about it and share my enthusiasm.



The Microsoft compiler has supported parallel algorithms since its beginning but sadly, neither GCC nor Clang. I have to be precise; GCC 9 allows you to use parallel algorithms. Before I show you examples with performance numbers in my next post, I want to write about the parallel algorithms of the STL and give you the necessary information.

Parallel Algorithms of the Standard Template Library

The Standard Template Library has more than 100 algorithms for searching, counting, and manipulating ranges and their elements. With C++17, 69 get new overloads, and new ones are added. The overloaded and new algorithms can be invoked with a so-called execution policy. Using an execution policy, you can specify whether the algorithm should run sequentially, in parallel, or parallel with vectorization. For using the execution policy, you have to include the header <execution>.

Execution Policy

The C++17 standard defines three execution policies:

  • std::execution::sequenced_policy
  • std::execution::parallel_policy
  • std::execution::parallel_unsequenced_policy

The corresponding policy tag specifies whether a program should run sequentially, in parallel, or in parallel with vectorization.

  • std::execution::seq: runs the program sequentially

  • std::execution::par: runs the program in parallel on multiple threads

  • std::execution::par_unseq: runs the program in parallel on multiple threads and allows the interleaving of individual loops; permits a vectorized version with SIMD (Single Instruction Multiple Data).

The usage of the execution policy std::execution::par or std::execution::par_unseq allows the algorithm to run parallel or parallel and vectorized. This policy is a permission and not a requirement.

The following code snippet applies to all execution policies.

std::vector<int> v = {1, 2, 3, 4, 5, 6, 7, 8, 9};

// standard sequential sort                              
std::sort(v.begin(), v.end());                             // (1)

// sequential execution
std::sort(std::execution::seq, v.begin(), v.end());        // (2)

// permitting parallel execution
std::sort(std::execution::par, v.begin(), v.end());        // (3)

// permitting parallel and vectorized execution
std::sort(std::execution::par_unseq, v.begin(), v.end());  // (4)

The example shows that you can still use the classic variant of std::sort (4). Besides, in C++17, you can specify explicitly whether the sequential (2), parallel (3), or parallel and vectorized (4) version should be used.


Rainer D 6 P2 540x540Modernes C++ Mentoring

Be part of my mentoring programs:





Do you want to stay informed about my mentoring programs: Subscribe via E-Mail.

Parallel and Vectorized Execution

Whether an algorithm runs in a parallel and vectorized way depends on many factors. For example, it depends on whether the CPU and the operating system support SIMD instructions. It also depends on the compiler and the optimization level you used to translate your code.

The following example shows a simple loop for filling a vector.
const int SIZE = 8;
int vec[] = {1, 2, 3, 4, 5, 6, 7, 8};
int res[] = {0, 0, 0, 0, 0, 0, 0, 0};
int main() {
    for (int i = 0; i < SIZE; ++i) {
        res[i] = vec[i]+5;


The expression res[i] = vec[i] + 5 is the crucial line in this small example. Thanks to Compiler Explorer, we can look closer at the assembler instructions generated by clang 3.6.

Without Optimization

Here are the assembler instructions. Each addition is done sequentially.


With Maximum Optimization

By using the highest optimization level, -O3, special registers such as xmm0 are used that can hold 128 bits or 4 ints. This special register means that the addition takes place in parallel on four vector elements.


An overload of an algorithm without an execution policy and an overload of an algorithm with a sequential execution policy std::execution::seq differ in one aspect: exceptions.



If an exception occurs during the usage of an algorithm with an execution policy,std::terminate it is called. std::terminate calls the installedstd::terminate_handler. The consequence is that per default std::abort is called, which causes abnormal program termination. The handling of exceptions is the difference between an algorithm’s invocation without an execution policy and an algorithm with a sequential std::execution::seq execution policy. The invocation of the algorithm without an execution policy propagates the exception, and, therefore, the exception can be handled.

With C++17, 69 of the STL algorithms got new overloads, and new algorithms were added.


Here are the 69 algorithms with parallelized versions.


The New Algorithms

The new algorithm in C++17, which are designed for parallel execution, are in the std namespace and need the header <numeric>.

  • std::exclusive_scan: Applies from the left a binary callable up to the range's ith (exclusive) element. The left argument of the callable is the previous result—stores intermediate results.
  • std::inclusive_scan: Applies from the left a binary callable up to the range's ith (inclusive) element. The left argument of the callable is the previous result—stores intermediate results.
  • std::transform_exclusive_scan: First applies a unary callable to the range and then applies std::exclusive_scan.
  • std::transform_inclusive_scan: First applies a unary callable to the range and then applies std::inclusive_scan.
  • std::reduce: Applies a binary callable to the range.
  • std::transform_reduce: Applies first a unary callable to one or a binary callable to two ranges and then std::reduce to the resulting range.

Admittedly this description is not easy to digest, but if you already know std::accumulate and std::partial_sum, the reduce and scan variations should be pretty familiar. std::reduce is the parallel pendant to std::accumulate and scan the parallel pendant to partial_sum. The parallel execution is the reason that std::reduce needs an associative and commutative callable. The corresponding statement holds for the scan variations contrary to the partial_sum variations. To get the full details, visit

You may wonder why we need std::reduce for parallel execution because we already have std::accumulate. The reason is that std::accumulate it processes its elements in an order that cannot be parallelized.

std::accumulate versus std::reduce

While std::accumulate processes its elements from left to right, std::reduce and does it in an arbitrary order. Let me start with a small code snippet using std::accumulate and std::reduce. The callable is the lambda function [](int a, int b){ return a * b; }.


std::vector<int> v{1, 2, 3, 4};

std::accumulate(v.begin(), v.end(), 1, [](int a, int b){ return a * b; });
std::reduce(std::execution::par, v.begin(), v.end(), 1 , [](int a, int b){ return a * b; });


The two following graphs show the different processing strategies of std::accumulate and std::reduce.

  • std::accumulate starts at the left and successively applies the binary operator.


  • On the contrary, std::reduce applies the binary operator in a non-deterministic way.



The associativity of the callable allows the std::reduce algorithm to apply the reduction step on arbitrary adjacents pairs of elements. The intermediate results can be computed in an arbitrary order thanks to commutativity.

What's next?

As promised, my next post uses parallel algorithms of the STL and provides performance numbers for the Microsoft compiler and the GCC.

Five Vouchers for Stephan Roth's Book "Clean C++20" to Win

I am giving away five vouchers for Stephan Roth's book "Clean C++20", sponsored by the book's publisher Apress. Here is how you can get it:


Thanks a lot to my Patreon Supporters: Matt Braun, Roman Postanciuc, Tobias Zindl, G Prvulovic, Reinhold Dröge, Abernitzke, Frank Grimm, Sakib, Broeserl, António Pina, Sergey Agafyin, Андрей Бурмистров, Jake, GS, Lawton Shoemake, Animus24, Jozo Leko, John Breland, Venkat Nandam, Jose Francisco, Douglas Tinkham, Kuchlong Kuchlong, Robert Blanch, Truels Wissneth, Kris Kafka, Mario Luoni, Friedrich Huber, lennonli, Pramod Tikare Muralidhara, Peter Ware, Daniel Hufschläger, Alessandro Pezzato, Bob Perry, Satish Vangipuram, Andi Ireland, Richard Ohnemus, Michael Dunsky, Leo Goodstadt, John Wiederhirn, Yacob Cohen-Arazi, Florian Tischler, Robin Furness, Michael Young, Holger Detering, Bernd Mühlhaus, Matthieu Bolt, Stephen Kelley, Kyle Dean, Tusar Palauri, Dmitry Farberov, Juan Dent, George Liao, Daniel Ceperley, Jon T Hess, Stephen Totten, Wolfgang Fütterer, Matthias Grün, Phillip Diekmann, Ben Atakora, Ann Shatoff, and Rob North.


Thanks, in particular, to Jon Hess, Lakshman, Christian Wittenhorst, Sherhy Pyton, Dendi Suhubdy, Sudhakar Belagurusamy, Richard Sargeant, Rusty Fleming, John Nebel, Mipko, Alicja Kaminska, and Slavko Radman.



My special thanks to Embarcadero CBUIDER STUDIO FINAL ICONS 1024 Small


My special thanks to PVS-Studio PVC Logo


My special thanks to logo


My special thanks to Take Up Code TakeUpCode 450 60



I'm happy to give online seminars or face-to-face seminars worldwide. Please call me if you have any questions.

Bookable (Online)


Standard Seminars (English/German)

Here is a compilation of my standard seminars. These seminars are only meant to give you a first orientation.

  • C++ - The Core Language
  • C++ - The Standard Library
  • C++ - Compact
  • C++11 and C++14
  • Concurrency with Modern C++
  • Design Pattern and Architectural Pattern with C++
  • Embedded Programming with Modern C++
  • Generic Programming (Templates) with C++


  • Clean Code with Modern C++
  • C++20

Contact Me

Modernes C++,




0 #1 Marlon 2021-07-19 08:19
Wow, thanks for sharing, this is probably my favorite feature in C++17!

I wonder why the committee chose to add the ExecutionPolicy&& as the first parameter instead of the last one. Any design motivation behind this?

Stay Informed about my Mentoring



English Books

Course: Modern C++ Concurrency in Practice

Course: C++ Standard Library including C++14 & C++17

Course: Embedded Programming with Modern C++

Course: Generic Programming (Templates)

Course: C++ Fundamentals for Professionals

Course: The All-in-One Guide to C++20

Course: Master Software Design Patterns and Architecture in C++

Subscribe to the newsletter (+ pdf bundle)

All tags

Blog archive

Source Code


Today 1858

Yesterday 4344

Week 38736

Month 18982

All 12097191

Currently are 158 guests and no members online

Kubik-Rubik Joomla! Extensions

Latest comments