C++ Core Guidelines: The Remaining Rules to Performance


Today, I will write about the remaining 10 rules to performance. Ten rules seem to be a lot but only two of them have actual content.



Here are the remaining rules to performance:

Let's have a closer look at rule Per.11 and rule Per.19.

Per.11: Move computation from run time to compile time

The example shows the simple gcd algorithm which calculates the greatest common division at runtime. gcd uses for its calculation the euclidean algorithm.

int gcd(int a, int b){
    while (b != 0){
        auto t= b;
        b= a % b;
        a= t;
    return a;

By declaring gcd as constexpr gcd can quite easily make to a  function which can run at compile time. There are only a few restrictions on constexpr functions in C++14. gcd must not use static or thread_local variables, exception handling, goto statements, and all variables has to be initialised. Additionally, variables have to be of literal type.

Let's try it out.

// gcd.cpp

#include <iostream>

constexpr int gcd(int a, int b){
    while (b != 0){
        auto t= b;
        b= a % b;
        a= t;
    return a;

int main(){

    std::cout << std::endl;

    constexpr auto res = gcd(121, 11);            // (1)
    std::cout << "gcd(121, 11) = " << res << std::endl;

    auto val = 121;                               // (3)
    auto res2 = gcd(val, 11);                     // (2)
    std::cout << "gcd(val, 11) = " << res2 << std::endl;
    std::cout << std::endl;


Declaring gcd to a constexpr function does not mean that it has to run at compile time. It means that gcd has the potential to run at compile time. A constexpr function has to be executed at compile if used in a constant expression. (1) is a constant expression because I ask for the result with a contexpr variable res (2). (3) is not a constant expression because res2 is not a constant expression. When I change res2 to constexpr auto res2, I would get an error: val is not a constant expression. Here is the output of the program.


Once more, here is the key observation. You can use a constexpr function at runtime and at compile time. To use it at compile its arguments have to be constant expressions.

You don't believe me that the line (1) will be executed at compile time? Here is the proof: the assembler instructions to line (1) generated from gcc 7.3 with maximum optimisation level. I created the output with the help of the Compiler Explorer from Matt Godbolt. 


The function call gcd(121, 11) boils down to its result: 11.

Templates are often used to make the decision at compile time. There is a nice example of this idea in the C++ Core Guidelines. A popular technique is to provide a handle for storing small objects on the stack and big objects on the heap. Here is the example:


template<typename T>
struct Scoped {     // store a T in Scoped
        // ...
    T obj;

template<typename T>
struct On_heap {    // store a T on the free store
        // ...
        T* objp;

template<typename T>
using Handle = typename std::conditional<(sizeof(T) <= on_stack_max),  // (1)
                    Scoped<T>,      // first alternative                  (2)
                    On_heap<T>      // second alternative                 (3)

void f()
    Handle<double> v1;                   // the double goes on the stack
    Handle<std::array<double, 200>> v2;  // the array goes on the free store
    // ...

How does it work? std::conditional in line (1) is a ternary operator from the type-traits library. In contrast to the ternary operator at runtime (a ? b : cstd::condition will be executed at compile time. This means if std::conditional<(sizeof(T) <= on_stack_max) is evaluated to true, the first alternative is used at runtime. If not,  the second alternative. 

Per.19: Access memory predictably

What does that mean? If you read an int from memory more than the size of one int is actually read from memory. An entire cache line is read from memory and stored in a cache. On modern architectures, a cache line has typically 64 bytes. If you now request an additional variable from memory and this variable is one previous cache, the read directly uses this cache and the operation is much faster.

A datat structure such as std::vector which stores its data in a continuous memory block, is a cache line aware data structure because each element in the cache line will typically be used. A std::vector has a size and a capacity and can grow only in one direction. The capacity is greater than its size and indicates when it is necessary to allocate memory. This argumentation also applies to std::vector and std::array although std::array has no capacity.


My argumentation to std::vector will not hold for a std::list or a std::forward_list. Both containers consist of nodes which are double or singly linked.


The std::forward_list can only grow in one direction.

forward list

std::deque is something in between because it is kind of a doubly linked list of small arrays.


This was the theory of cache lines. Now I'm curious. Does it make a difference to read and accumulate all elements from std::vector, a std::deque, std::list, and std::forward_list. The small program should give an answer.

// memoryAcess.cpp

#include <forward_list>
#include <chrono>
#include <deque>
#include <iomanip>
#include <iostream>
#include <list>
#include <string>
#include <vector>
#include <numeric>
#include <random>

const int SIZE = 100'000'000; 

template <typename T>
void sumUp(T& t, const std::string& cont){            // (5)
  std::cout << std::fixed << std::setprecision(10);

  auto begin= std::chrono::steady_clock::now();
  std::size_t res = std::accumulate(t.begin(), t.end(), 0LL);
  std::chrono::duration<double> last=  std::chrono::steady_clock::now() - begin;
  std::cout << cont <<  std::endl;
  std::cout << "time: " << last.count() << std::endl;
  std::cout << "res: " << res << std::endl;
  std::cout << std::endl;
  std::cout << std::endl;

int main(){
    std::cout << std::endl;
    std::random_device seed;                            // (1)
    std::mt19937 engine(seed());
    std::uniform_int_distribution<int> dist(0, 100);
    std::vector<int> randNumbers;
    for (int i=0; i < SIZE; ++i){
      std::vector<int> myVec(randNumbers.begin(), randNumbers.end());
      sumUp(myVec,"std::vector<int>");                 // (2)

      std::deque<int>myDec(randNumbers.begin(), randNumbers.end());
      sumUp(myDec,"std::deque<int>");                 // (3)
      std::list<int>myList(randNumbers.begin(), randNumbers.end());
      sumUp(myList,"std::list<int>");                 // (4)
      std::forward_list<int>myForwardList(randNumbers.begin(), randNumbers.end());
      sumUp(myForwardList,"std::forward_list<int>");  // (5)

The program memoryAccess.cpp creates first 100 Million random numbers between 0 and 100 (1). Then it accumulate the elements using a std::vector (2), a std::deque (3), a std::list (4), and a std::forward_list (5). The actual work is done in the function sumUp (6). I assume that Linux and Windows use a quite similar implementation of std::accumulate; therefore the access time of the elements is the dominant factor for the overall performance.

template<class InputIt, class T>
T accumulate(InputIt first, InputIt last, T init)
    for (; first != last; ++first) {
        init = init + *first;
    return init;


I compiled the program with maximum optimisation and executed it on Linux and Windows. I'm not interested in the comparison between Linux and Windows because that would be a comparison between a desktop PC and a Laptop. I'm interested in the relative performance numbers of the four containers. Here are they:memoryAccessWin
















To get the big picture. Here are my observations:

  • std::vector is 10 times faster than std::list or std::forward_list on Linux and std::vector is 30 times faster than std::list or std::forward_list on Windows.
  • std::vector is 1.5 times faster than std::deque on Linux, but  std::vector is 5 times faster than std::deque on Windows. The reason may be that the doubly linked small arrays are bigger on Linux than on Windows. The effect is that the access time is more vectorish.
  • std::deque is 6 times faster than std::list and std::forward_list. This hold for Linux and Windows.
  • std::list and std::forward_list are in the same ballpark on Linux and Windows.

I don't want to overestimate my performance numbers but one key observation is obvious. The more cache line aware the container is, the faster is the access time of the elements: std::vector > std::deque > (std::list, std::forward_list).

What's next?

That was it to performance. With the next post, I will start to write about the rules to concurrency. I'm quite curious.




Thanks a lot to my Patreon Supporters: Eric Pederson, Paul Baxter,  Sai Raghavendra Prasad Poosa, Meeting C++, Matt Braun, and Avi Lachmish.



Get your e-book at leanpub:

The C++ Standard Library


Concurrency With Modern C++


Get Both as one Bundle

cover   ConcurrencyCoverFrame   bundle
With C++11, C++14, and C++17 we got a lot of new C++ libraries. In addition, the existing ones are greatly improved. The key idea of my book is to give you the necessary information to the current C++ libraries in about 200 pages.  

C++11 is the first C++ standard that deals with concurrency. The story goes on with C++17 and will continue with C++20.

I'll give you a detailed insight in the current and the upcoming concurrency in C++. This insight includes the theory and a lot of practice with more the 100 source files.


Get my books "The C++ Standard Library" (including C++17) and "Concurrency with Modern C++" in a bundle.

In sum, you get more than 550 pages full of modern C++ and more than 100 source files presenting concurrency in practice.


Add comment

My Newest E-Books

Latest comments

Subscribe to the newsletter (+ pdf bundle)

Blog archive

Source Code


Today 1754

All 889272

Currently are 235 guests and no members online

Kubik-Rubik Joomla! Extensions