Performance of the Parallel STL Algorithms


I presented in my last post, "Parallel Algorithms of the STL with the GCC Compiler", the necessary theory about the C++17 parallel algorithms. Today, I run a performance test using the Microsoft compiler and the GCC compiler to answer a simple question: does the execution policy pay off?


The reason for this short detour from my template posts is the following: I recognized that GCC supports my favorite C++17 feature, the parallel algorithms of the Standard Template Library. In this post, I present the brand-new GCC 11.1, but a GCC 9 should also be fine. To use the parallel STL algorithms with GCC, you have to install a few additional libraries.

Threading Building Blocks

Under the hood, GCC uses Intel's Threading Building Blocks (TBB). The TBB is a C++ template library developed by Intel for parallel programming on multi-core processors.

To be precise, you need TBB version 2018 or higher. When I installed the developer package of the TBB on my Linux desktop (Suse), the package manager also chose the TBB memory allocator. Using the TBB is easy: you only have to link against it with the flag -ltbb.
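Assuming GCC 11.1 and an installed TBB developer package, a typical build line looks like this (the file name refers to the program below; adjust it to your setup):

```shell
g++ -std=c++17 -O3 parallelSTLPerformance.cpp -ltbb -o parallelSTLPerformance
```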

[Image: building with GCC and linking against the TBB]

Now, I'm prepared to take my next steps with the parallel algorithms. Here are the first numbers, using the Microsoft compiler 19.16 and GCC 11.1.

Performance Numbers with the Microsoft Compiler and the GCC Compiler

The following program parallelSTLPerformance.cpp calculates the tangent of each element with the sequential (1), parallel (2), and parallel and vectorized (3) execution policy.

// parallelSTLPerformance.cpp

#include <algorithm>
#include <cmath>
#include <chrono>
#include <execution>
#include <iostream>
#include <random>
#include <string>
#include <vector>

constexpr long long size = 500'000'000;

const double pi = std::acos(-1);

template <typename Func>
void getExecutionTime(const std::string& title, Func func){                    // (4)

  const auto sta = std::chrono::steady_clock::now();
  func();                                                                      // (5)
  const std::chrono::duration<double> dur = std::chrono::steady_clock::now() - sta;
  std::cout << title << ": " << dur.count() << " sec. " << std::endl;

}

int main(){

  std::cout << '\n';

  std::vector<double> randValues;
  randValues.reserve(size);

  std::mt19937 engine;
  std::uniform_real_distribution<> uniformDist(0, pi / 2);
  for (long long i = 0; i < size; ++i) randValues.push_back(uniformDist(engine));

  std::vector<double> workVec(randValues);

  getExecutionTime("std::execution::seq", [workVec]() mutable {                // (6)
    std::transform(std::execution::seq, workVec.begin(), workVec.end(),        // (1)
                   workVec.begin(),
                   [](double arg){ return std::tan(arg); }
                  );
  });

  getExecutionTime("std::execution::par", [workVec]() mutable {                // (7)
    std::transform(std::execution::par, workVec.begin(), workVec.end(),        // (2)
                   workVec.begin(),
                   [](double arg){ return std::tan(arg); }
                  );
  });

  getExecutionTime("std::execution::par_unseq", [workVec]() mutable {          // (8)
    std::transform(std::execution::par_unseq, workVec.begin(), workVec.end(),  // (3)
                   workVec.begin(),
                   [](double arg){ return std::tan(arg); }
                  );
  });

  std::cout << '\n';

}

 

First, the vector randValues is filled with 500 million numbers from the half-open interval [0, pi/2). The function template getExecutionTime (4) gets the name of the execution policy and a lambda function, executes the lambda function (5), and shows the execution time. There is one particular point about the three lambda functions ((6), (7), and (8)) used in this program: they are declared mutable. This is necessary because the lambda functions modify their copy of workVec, which they capture by value. Lambda functions are by default constant. If a lambda function wants to change its captured values, it has to be declared mutable.

Let me start with the Windows performance numbers. But before I do that, I have to make a short disclaimer.

Disclaimer

I explicitly want to emphasize this: I don't want to compare Windows and Linux, because the two computers running Windows and Linux have different capabilities. These performance numbers should only give you a gut feeling. If you want the numbers for your system, you have to repeat the test.

I use maximum optimization on Windows and Linux. This means the flag /O2 on Windows and the flag -O3 on Linux.
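For the Microsoft compiler, a build with maximum optimization might look like this from a developer command prompt (a sketch; the options are MSVC's documented flags for C++17, optimization, and exception handling):

```shell
cl /std:c++17 /O2 /EHsc parallelSTLPerformance.cpp
```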

To make it short: I'm really keen to know if the parallel execution of the STL algorithms pays off, and to what extent. My main focus is the relative performance of the sequential and the parallel execution.

Windows

My Windows laptop has eight logical cores, but the parallel execution is more than ten times faster.

[Image: performance numbers of parallelSTLPerformance.cpp on Windows]

The numbers for the parallel and the parallel and vectorized execution are in the same ballpark. Here is the explanation from the Visual C++ Team Blog post "Using C++17 Parallel Algorithms for Better Performance": "Note that the Visual C++ implementation implements the parallel and parallel unsequenced policies the same way, so you should not expect better performance for using par_unseq on our implementation, but implementations may exist that can use that additional freedom someday."

Linux

My Linux computer has only four cores. Here are the numbers.

[Image: performance numbers of parallelSTLPerformance.cpp on Linux]

The numbers are as expected: I have four cores, and the parallel execution is about four times faster than the sequential execution. The performance numbers of the parallel and vectorized version and the parallel version are in the same ballpark. My assumption is, therefore, that the GCC compiler uses the same strategy as the Windows compiler: when I ask for parallel and vectorized execution with the execution policy std::execution::par_unseq, I get the parallel execution policy (std::execution::par). This behavior conforms to the C++17 standard, because the execution policies are only a hint for the compiler.

To my knowledge, neither the Windows compiler nor the GCC compiler supports the parallel and vectorized execution of the parallel STL algorithms. When you want to see the parallel and vectorized algorithms in action, Nvidia's STL implementation Thrust may be an ideal candidate. For further information, read the corresponding Nvidia post.

What's next?

After this C++17 detour, I return to my original path: templates. In my next post, I dive deeper into templates and write about template instantiation.

 

 



Comments

#1 memory_leak (2021-07-26): Dude, the name is GNU/Linux, not Linux. You could have done both GCC and VC on Windows for a fairer comparison, and there are optimizations behind what both /O2 and -O3 specify.

#2 chris green (2021-08-04): The problem with the MSVC version is that it is unusable for high-core-count systems such as servers or engineering workstations. Both the MSVC std::thread implementation and the parallel STL implementation are limited in how many cores they can use by Windows processor groups. It is possible to work around this for std::thread with some non-portable initialization code, but not for the parallel STL. See these benchmarks comparing GCC, MSVC, and Intel TBB for parallel sort and transform, under Windows and Linux:

https://twitter.com/ChrisGr93091552/status/1422643747922223107

https://twitter.com/ChrisGr93091552/status/1423008413856849922
