Single Threaded: Summation of a Vector


What is the fastest way to add up the elements of a std::vector? That is a question I will pursue in the next few posts. The single-threaded addition serves as the reference number; in further posts I discuss atomics, locks, and thread-local data.

 

My strategy

My plan is to fill a std::vector with one hundred million arbitrary numbers between 1 and 10, drawn from a uniform distribution. The task is to calculate the sum of all values.

As usual, I use my Linux desktop and my Windows laptop to get the numbers. The Linux PC has four cores, the Windows PC two. You can find the details of my compilers in the post Thread safe initialization of a singleton. I will measure the performance with and without maximum optimization.
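A short remark on the measurement itself: each program below times only the summation, not the filling of the vector. The following helper is a minimal sketch of that idea and is not taken from the measured programs; it assumes a callable work and uses std::chrono::steady_clock, which, unlike std::chrono::system_clock, never jumps when the wall clock is adjusted.

// measureSketch.cpp: a minimal timing sketch (not one of the measured programs)

#include <chrono>
#include <iostream>

// invoke the callable work and return the elapsed time in seconds
template <typename Callable>
double measureSeconds(Callable work){
  auto start = std::chrono::steady_clock::now();
  work();
  std::chrono::duration<double> dur = std::chrono::steady_clock::now() - start;
  return dur.count();
}

int main(){

  unsigned long long sum = 0;
  double sec = measureSeconds([&sum]{
    for (unsigned long long i = 0; i < 100000000; ++i) sum += i;
  });

  std::cout << "Time: " << sec << " seconds" << std::endl;
  std::cout << "Sum: " << sum << std::endl;

}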

A simple loop

The obvious strategy is to add the numbers in a range-based for loop.

// calculateWithLoop.cpp

#include <chrono>
#include <iostream>
#include <random>
#include <vector>

constexpr long long size= 100000000;

int main(){

  std::cout << std::endl;

  std::vector<int> randValues;
  randValues.reserve(size);

  // random values
  std::random_device seed;
  std::mt19937 engine(seed());
  std::uniform_int_distribution<> uniformDist(1,10);
  for ( long long i=0 ; i< size ; ++i) randValues.push_back(uniformDist(engine));
 
  auto start = std::chrono::system_clock::now();
  
  unsigned long long add= {};
  for (auto n: randValues) add+= n;
 
  std::chrono::duration<double> dur= std::chrono::system_clock::now() - start;
  std::cout << "Time for addition " << dur.count() << " seconds" << std::endl;
  std::cout << "Result: " << add << std::endl;

  std::cout << std::endl;

}

 

The calculation takes place in the range-based for loop between the two time measurements. How fast are my computers?

Without optimization

(Screenshot: CalculateWithLoop, Linux and Windows)

Maximum optimization

(Screenshot: CalculateWithLoopOpt, Linux and Windows)

You should not use explicit loops, a rule that holds particularly true for Windows. So I have to look into the Standard Template Library (STL).

The STL with std::accumulate

std::accumulate is the standard way to calculate the sum.

35
// calculateWithStd.cpp

#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

constexpr long long size= 100000000;

int main(){

  std::cout << std::endl;

  std::vector<int> randValues;
  randValues.reserve(size);

  // random values
  std::random_device seed;
  std::mt19937 engine(seed());
  std::uniform_int_distribution<> uniformDist(1,10);
  for ( long long i=0 ; i< size ; ++i) randValues.push_back(uniformDist(engine));
 
  auto start = std::chrono::system_clock::now();
  
  unsigned long long add= std::accumulate(randValues.begin(),randValues.end(),0);
 
  std::chrono::duration<double> dur= std::chrono::system_clock::now() - start;
  std::cout << "Time for addition " << dur.count() << " seconds" << std::endl;
  std::cout << "Result: " << add << std::endl;

  std::cout << std::endl;

}

 

The key line is the call to std::accumulate. Its performance corresponds to that of the range-based for loop, but not on Windows.
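A small detail worth a closer look (see also the first comment below): the third argument of std::accumulate determines the type in which the sum is accumulated. With the literal 0 the addition happens in int; for my one hundred million values between 1 and 10 the sum of roughly half a billion still fits into an int, but for a larger vector or larger values it would overflow. A safer variant of the key line is a sketch with an unsigned long long initial value:

  // accumulate in unsigned long long instead of int (sketch, drop-in replacement for the line above)
  unsigned long long add = std::accumulate(randValues.begin(), randValues.end(), 0ULL);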

Without optimization

(Screenshot: CalculateWithStd, Linux and Windows)

Maximum optimization

(Screenshot: CalculateWithStdOpt, Linux and Windows)

That was all. I have my numbers for comparing the single-threaded with the multithreaded program. Really? Not quite. I'm very curious how expensive it is to protect the summation with a lock or to use an atomic, so that we see the overhead of the protection.

Protection by a lock

If I protect the access to the summation variable with a lock, I get the answers to my two questions.

  1. How expensive is the synchronization with a lock?
  2. How fast can a lock be if no concurrent access to the variable takes place?

Of course, I can rephrase point 2: if more than one thread accesses the shared variable, the access only gets slower.

41
// calculateWithLock.cpp

#include <chrono>
#include <iostream>
#include <mutex>
#include <numeric>
#include <random>
#include <vector>

constexpr long long size= 100000000;

int main(){

  std::cout << std::endl;

  std::vector<int> randValues;
  randValues.reserve(size);

  // random values
  std::random_device seed;
  std::mt19937 engine(seed());
  std::uniform_int_distribution<> uniformDist(1,10);
  for ( long long i=0 ; i< size ; ++i) randValues.push_back(uniformDist(engine));

  std::mutex myMutex;
 
  unsigned long long int add= 0;
  auto start = std::chrono::system_clock::now();
  
  for (auto i: randValues){
      std::lock_guard<std::mutex> myLockGuard(myMutex);
      add+= i;
  }
 
  std::chrono::duration<double> dur= std::chrono::system_clock::now() - start;
  std::cout << "Time for addition " << dur.count() << " seconds" << std::endl;
  std::cout << "Result: " << add << std::endl;
    
  std::cout << std::endl;

}

 

That's the result I expected: the access to the protected variable add is slower.

 

Without optimization

(Screenshot: CalculateWithLock, Linux and Windows)

Maximum optimization

(Screenshot: CalculateWithLockOpt, Linux and Windows)

The non-optimized version with the lock is 4 to 12 times slower than the maximum optimized one, and the maximum optimized one is about 10 to 40 times slower than std::accumulate.
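The price is paid per element: the std::lock_guard locks and unlocks the mutex one hundred million times. The following is only a sketch, not one of the measured programs: it acquires the lock once around the whole loop, which removes almost all of the locking overhead, but of course also gives up the per-element protection that a multithreaded summation needs.

// calculateWithLockOnce.cpp: a sketch, the mutex is acquired only once

#include <chrono>
#include <iostream>
#include <mutex>
#include <random>
#include <vector>

constexpr long long size= 100000000;

int main(){

  std::vector<int> randValues;
  randValues.reserve(size);

  // random values
  std::random_device seed;
  std::mt19937 engine(seed());
  std::uniform_int_distribution<> uniformDist(1,10);
  for ( long long i=0 ; i< size ; ++i) randValues.push_back(uniformDist(engine));

  std::mutex myMutex;
  unsigned long long add= 0;

  auto start = std::chrono::system_clock::now();

  {
    std::lock_guard<std::mutex> myLockGuard(myMutex);  // one lock/unlock instead of one hundred million
    for (auto i: randValues) add+= i;
  }

  std::chrono::duration<double> dur= std::chrono::system_clock::now() - start;
  std::cout << "Time for addition " << dur.count() << " seconds" << std::endl;
  std::cout << "Result: " << add << std::endl;

}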

Finally, atomics.

Protection by an atomic

The same two questions as for locks also hold for atomics.

  1. How expensive is the synchronization with an atomic?
  2. How fast can an atomic be if no concurrent access to the variable takes place?

And there is an additional question: what is the performance difference between an atomic and a lock?

 

// calculateWithAtomic.cpp

#include <atomic>
#include <chrono>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

constexpr long long size= 100000000;

int main(){

  std::cout << std::endl;

  std::vector<int> randValues;
  randValues.reserve(size);

  // random values
  std::random_device seed;
  std::mt19937 engine(seed());
  std::uniform_int_distribution<> uniformDist(1,10);
  for ( long long i=0 ; i< size ; ++i) randValues.push_back(uniformDist(engine));
  
  std::atomic<unsigned long long> add={};
  std::cout << std::boolalpha << "add.is_lock_free(): " << add.is_lock_free() << std::endl;
  std::cout << std::endl;
 
  auto start = std::chrono::system_clock::now();
  
  for (auto i: randValues) add+= i;
 
  std::chrono::duration<double> dur= std::chrono::system_clock::now() - start;
  std::cout << "Time for addition " << dur.count() << " seconds" << std::endl;
  std::cout << "Result: " << add << std::endl;
    
  std::cout << std::endl;
  
  add=0;
  start = std::chrono::system_clock::now();
  
  for (auto i: randValues) add.fetch_add(i,std::memory_order_relaxed);
  
  dur= std::chrono::system_clock::now() - start;
  std::cout << "Time for addition " << dur.count() << " seconds" << std::endl;
  std::cout << "Result: " << add << std::endl;
    
  std::cout << std::endl;

}

 

The program is special in two ways. First, I check with add.is_lock_free() whether the atomic uses a lock under the hood. That is crucial, because otherwise there would be no difference between using a lock and using an atomic. On all mainstream platforms I know of, atomics are lock-free. Second, I calculate the sum in two ways: once with the += operator and once with the method fetch_add. Both variants have comparable performance in the single-threaded case, but with fetch_add I can explicitly specify the memory model. More about that point in the next post.
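For the curious, here is how the two variants relate. On a std::atomic the += operator is equivalent to fetch_add with the default memory ordering std::memory_order_seq_cst; only the explicit fetch_add call lets me relax the ordering. A minimal sketch of the three spellings, not part of the measured program:

// atomicAddVariants.cpp: a sketch of three ways to add to a std::atomic

#include <atomic>
#include <iostream>

int main(){

  std::atomic<unsigned long long> add{0};

  add += 5;                                     // operator +=: sequentially consistent
  add.fetch_add(5);                             // same as +=, the default is std::memory_order_seq_cst
  add.fetch_add(5, std::memory_order_relaxed);  // relaxed ordering: only the atomicity of the addition is guaranteed

  std::cout << "add: " << add << std::endl;     // add: 15

}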

 

But now the numbers.

Without optimization

(Screenshot: CalculateWithAtomic, Linux and Windows)

Maximum optimization

(Screenshot: CalculateWithAtomicOpt, Linux and Windows)

The atomics are about 1.5 times slower than the std::accumulate algorithm on Linux and about 8 times slower on Windows. That changes even more with maximum optimization: now the atomics are about 15 times slower on Linux and about 50 times slower on Windows.

I want to stress two points.

  1. Atomics are 2 to 3 times faster than locks, on Linux as well as on Windows.
  2. Linux is, in particular for atomics, about 2 times faster than Windows.

All numbers at a glance

In case you lost your orientation among all the numbers, here is the overview in seconds.

(Overview table: SingleThreadedAddition, all times in seconds)

What's next?

Single-threaded becomes multithreaded in the next post. In the first step, the summation variable add becomes a shared variable used by four threads; in the second step, add becomes an atomic.


Tags: atomics, lock

Comments   

#1 Alexey Nikitin 2016-09-09 19:43
Using the sum of type int in calculateWithStd.cpp leads to incorrect measurement results. In my opinion it's better to use std::accumulate(randValues.begin(), randValues.end(), 0ULL).
#2 Rainer Grimm 2016-09-10 14:51
Quoting Alexey Nikitin:
"Using the sum of type int in calculateWithStd.cpp leads to incorrect measurement results. In my opinion it's better to use std::accumulate(randValues.begin(), randValues.end(), 0ULL)."

On my platform int is big enough, but in general you are right.

std::numeric_limits<int>::max():                2147483647
std::numeric_limits<unsigned long long>::max(): 18446744073709551615

So my int can hold about 2 billion and my sum is about half a billion.
#3 Jaroslaw 2016-09-19 19:08
Quoting the article:
"The atomics are in the case of Linux 1.5 times faster, in the case of windows 8 times faster than the std::accumulate algorithm. That changes even more in the case of optimization. Now Linux is 15 times, Windows is 50 times faster."

Did you mean "atomics are 1.5 times slower"?
#4 Rainer Grimm 2016-09-20 06:49
Quoting Jaroslaw:
"Did you mean 'atomics are 1.5 times slower'?"

Thanks. Of course you are right. I fixed it.
