I’m happy to present the seven winners in this post including their answers.

Strange Packt Behavior

First, I have to apologize. Packt said they would give me two vouchers for a printed book and five vouchers for a digital book. During the quiz, they decided to drop the two coupons for the printed book and told me that I should rewrite my posts and remove the coupons for the printed books. This was not an option for me. Therefore, I changed the rewards a little and now have seven coupons.

The Changed Reward

The five best answers get a coupon for a digital book. I will send your e-mail addresses to Packt. They will contact you and send you the coupon. If there is an issue, please let me know.
Two participants get a coupon for one of my books. I will send you an e-mail, and you have to tell me which of my LeanPub books you prefer: https://leanpub.com/bookstore?type=all&search=Grimm
I got so many good answers that I decided the following: each participant gets a coupon for my online course “C++ Fundamentals for Professionals”. I will send you an e-mail including the coupon.

And here are the questions and the best answers. I only mention your first name.

What is your favorite tool to measure the performance of your program, and why?

Charles

Big fan of VTune at work, mostly because nothing else seems to work quite as well on our toolchain. I generally use the hotspot detection the most to drill into the functions that are taking up the most CPU time.

Close second would be Tracy (https://github.com/wolfpld/tracy). I work in graphics and most of my timings are on a frame basis, so being able to mark parts of functions and see them in evaluation order is very important. Tracy is great because if you have the source you can add it to anything, and you can instrument only what’s needed through scoping.

David

First, I use the perf tools from the Linux kernel tooling, because there are no compilation flags to set: just `perf record` with a pid, then `perf report` to see what happens.
There is almost no performance penalty for using it, so you can run almost everything without affecting the program’s behavior.

Second, valgrind with callgrind, because together with kcachegrind it displays things well, so you can understand your program in a global way. But valgrind slows the program down considerably, so I use it sparingly now and prefer perf.

For a multi-threaded program on Linux, I like to keep a simple htop command running alongside my program so I can see which thread uses most of a CPU (naming the threads helps a lot).

Sal

My first favorite tool is Compiler Explorer, which allows me to see the generated code. I check whether I can improve my code logic or use better practices, such as using constexpr when it makes sense, or using std::move or universal references to avoid copying heavy containers and classes. I also try to write code with better cache behavior and less branching.

My second favorite tool is vcperf from MSVC, an open-source performance analyzer; C++ Build Insights is a visual explorer for analyzing the data (https://github.com/microsoft/vcperf).

That way I can figure out the bottlenecks and work only on those parts. I believe similar tooling is available for Linux: perf, together with FlameGraph, an open-source visual performance analyzer that uses perf (https://github.com/brendangregg/FlameGraph).

My third tool is simply writing scoped timers so I can manually test certain parts quickly; this especially helps when testing allocator or concurrency issues.
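
A scoped timer of the kind Sal mentions can be sketched in a few lines. This is only an illustration, not Sal’s code: the class name `ScopedTimer`, the output format, and the example function `sumTo` are assumptions made up for this sketch.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical helper: measures the lifetime of the enclosing scope
// and writes the result to std::clog when the scope ends.
class ScopedTimer {
public:
    explicit ScopedTimer(std::string label)
        : label_(std::move(label)),
          start_(std::chrono::steady_clock::now()) {}

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

    ~ScopedTimer() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        const auto us =
            std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
        std::clog << label_ << ": " << us << " us\n";
    }

private:
    std::string label_;
    std::chrono::steady_clock::time_point start_;
};

// Example use: the timer covers exactly the body of this function.
long long sumTo(int n) {
    assert(n >= 0);
    ScopedTimer timer("sumTo");
    long long sum = 0;
    for (int i = 1; i <= n; ++i) sum += i;
    return sum;
}
```

Because the measurement is tied to the object’s lifetime, wrapping a smaller region only requires an extra pair of braces around it.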

John

I think my answer will be different to others you are receiving. Although I’ve been involved in writing software having an explicit goal of “performance” since the 1980’s, I rarely use any kind of external tooling like a profiler.

My current work has the explicit requirements of using little memory and running fast, as it’s a job that had to date been running on mainframe hardware, and now it’s possible to get it to fit on a common Linux server. The earliest proof of concept code wasn’t just to get the right output, but to provide metrics to clarify how much memory the scaled-up job will need, and how long it would take to run.

That is, my “tool” is to include performance metrics as part of the code. These can be reported via logging or telemetry in order to make sure that it continues to perform well.

This will certainly involve noting the size and count of important objects, which is more telling than just knowing the process’s overall memory usage. I use a simple base class template to instrument the classes with a labeled counter and include it in the logging. Besides the initial assessment of how the program’s memory usage will scale with the load, monitoring this will make sure that changes don’t increase the size of the object. For example, adding a bool field to an object surprises a maintenance coder by increasing the size of each instance by 16 bytes. Packing fields efficiently, using smaller types and bit-fields, we can see the result and know that the packing worked as intended.
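
The base class template for counting instances could look roughly like the following sketch. It is not John’s actual code: the names `Counted`, `Record`, and `RecordWithFlag` are hypothetical, the counter is not thread-safe, and the exact sizes depend on the platform’s alignment rules.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>

// Hypothetical CRTP base: keeps one live-instance counter per Derived type.
// Not thread-safe; a std::atomic counter would be needed for concurrent use.
template <typename Derived>
class Counted {
public:
    static std::size_t liveCount() { return live_; }

    // Writes label, live count, and per-object size, e.g. for a log line.
    static void report(std::ostream& out, const char* label) {
        out << label << ": " << live_ << " live, "
            << sizeof(Derived) << " bytes each\n";
    }

protected:
    Counted() { ++live_; }
    Counted(const Counted&) { ++live_; }
    Counted& operator=(const Counted&) = default;
    ~Counted() { --live_; }

private:
    static inline std::size_t live_ = 0;  // C++17 inline variable
};

struct Record : Counted<Record> {
    std::uint64_t id;
    std::uint32_t a;
    std::uint32_t b;
};

struct RecordWithFlag : Counted<RecordWithFlag> {
    std::uint64_t id;
    std::uint32_t a;
    std::uint32_t b;
    bool dirty;  // one byte of data, but alignment padding adds more
};

// The size check is the kind of thing the logged numbers make visible.
static_assert(sizeof(RecordWithFlag) > sizeof(Record),
              "adding a field grows every instance");

std::size_t demoCounts() {
    Record a;
    {
        Record b;
        assert(Counted<Record>::liveCount() == 2);
    }
    Counted<Record>::report(std::clog, "Record");
    return Counted<Record>::liveCount();  // only `a` is alive at this point
}
```

Logging `sizeof` next to the live count makes a change like the added `dirty` flag show up as a jump in the bytes-per-object figure.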

Of course, the running time is important to know. Not just the overall process time, but instrumenting the time spent in each component. On the platform I’m currently working on (CentOS 7 via AWS) there are two different library calls available: one to get wall-clock time with higher resolution and no syscall overhead, and one to get the separate user/system times but at lower resolution and with the overhead of calling into the kernel.
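
John does not name the two calls. One plausible pairing on Linux, shown here purely as an assumption, is `clock_gettime` for wall-clock time and `getrusage` for separate user and system times. On many Linux configurations `clock_gettime` is served without entering the kernel, but that depends on the machine’s clock source. The helper names `wallSeconds`, `cpuTimes`, and `CpuTimes` are hypothetical.

```cpp
#include <cassert>
#include <sys/resource.h>  // getrusage (POSIX)
#include <time.h>          // clock_gettime (POSIX)

// Wall-clock seconds from a monotonic clock with nanosecond fields.
double wallSeconds() {
    timespec ts{};
    const int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return static_cast<double>(ts.tv_sec) + ts.tv_nsec * 1e-9;
}

struct CpuTimes {
    double user;    // seconds spent in user mode
    double system;  // seconds spent in the kernel on behalf of the process
};

// User and system CPU time of the whole process, with microsecond fields.
CpuTimes cpuTimes() {
    rusage ru{};
    const int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    auto toSeconds = [](const timeval& tv) {
        return static_cast<double>(tv.tv_sec) + tv.tv_usec * 1e-6;
    };
    return {toSeconds(ru.ru_utime), toSeconds(ru.ru_stime)};
}
```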

I have a simple class I can use, with a scope wrapper to start and stop a specific timer around a section of code. It also reports the number of times it was started, which can be useful sometimes on its own, but is generally used to subtract out the function overhead with that multiplier. I look at the system time for code that does I/O or other system calls, and use the lower-overhead, higher precision call more generally.
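
A minimal sketch of such a class, under stated assumptions: the names `SectionTimer`, `Scope`, and `totalWithout` are invented for illustration, and the per-call overhead is something the caller has to estimate separately. It is not John’s implementation.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical accumulating timer: sums the elapsed time over many
// start/stop pairs and counts how often it was started.
class SectionTimer {
public:
    using Clock = std::chrono::steady_clock;

    void start() { ++starts_; begin_ = Clock::now(); }
    void stop() { total_ += Clock::now() - begin_; }

    std::uint64_t starts() const { return starts_; }
    Clock::duration total() const { return total_; }

    // Total time minus (starts * estimated per-call overhead), clamped at zero.
    Clock::duration totalWithout(Clock::duration perCallOverhead) const {
        const auto overhead = perCallOverhead * static_cast<Clock::rep>(starts_);
        return total_ > overhead ? total_ - overhead : Clock::duration::zero();
    }

    // Scope wrapper: starts the timer on construction, stops it on destruction.
    class Scope {
    public:
        explicit Scope(SectionTimer& timer) : timer_(timer) { timer_.start(); }
        ~Scope() { timer_.stop(); }
        Scope(const Scope&) = delete;
        Scope& operator=(const Scope&) = delete;

    private:
        SectionTimer& timer_;
    };

private:
    Clock::duration total_{};
    Clock::time_point begin_{};
    std::uint64_t starts_ = 0;
};

std::uint64_t demoSectionTimer() {
    SectionTimer parseTimer;
    for (int i = 0; i < 3; ++i) {
        SectionTimer::Scope scope(parseTimer);
        // ... the section being measured would run here ...
    }
    assert(parseTimer.total() >= SectionTimer::Clock::duration::zero());
    return parseTimer.starts();
}
```

One timer object per architectural component, reported at the end of a run, gives the per-component breakdown described above.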

By arranging these in a logical way around the architecture of the code, it’s easy to see which parts are taking a lot of time. By recording the numbers and running with jobs of different sizes, we can see the algorithmic complexity: does the time vary directly with n, or with n²? For stand-alone unit testing and running using mocks, we can separate out the time of this component from the time consumed by the input parsing, output formatting, and the mocks. When a change is made, we can see if the numbers changed drastically: either an innocent edit made the time worse for some reason, or whether something that was supposed to implement a performance improvement (at the expense of readability or maintainability) actually had an effect that was worth it.
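
The question “n or n²?” can be answered roughly from two recorded runs: if the time behaves like c·n^k, then k = log(t2/t1) / log(n2/n1). The function below is a hypothetical helper for that arithmetic, and the numbers in the note after it are made up.

```cpp
#include <cassert>
#include <cmath>

// Estimates the exponent k in t ~ c * n^k from two (size, time) measurements.
double growthExponent(double n1, double t1, double n2, double t2) {
    assert(n1 > 0 && n2 > 0 && t1 > 0 && t2 > 0 && n1 != n2);
    return std::log(t2 / t1) / std::log(n2 / n1);
}
```

For example, if doubling the job size from 1,000 to 2,000 items raises the time from 2 s to 8 s, the estimate is close to 2 and points to quadratic behavior; a rise from 2 s to 4 s gives an estimate close to 1. Real measurements are noisy, so more than two job sizes are advisable.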

If some innocent change makes the timing seriously worse, we don’t need fancy tools to figure out that the performance issue is in the code that was just committed. It may be a hidden O(n²)-inducing loop, or bad branch prediction, or a coder who hadn’t yet learned that a linked list isn’t what they learned in school, performance-wise. We don’t have to use elaborate means to hunt for it, and we catch it immediately.

The point is that, to keep a handle on these performance-related metrics, we don’t have to make a special run using a special tool. They are built in and automatically reported when running the local unit testing.

Slightly more generally, the idea is to build something that you can easily measure the performance of. For example, a unit testing program will not read one line of input, digest it, call the actual component, and repeat; although that is how you might prefer to do it in “real” usage. Instead, it reads in all the input and stores it in a std::vector, and closes the file. Then, it processes the input by iterating over the in-memory vector. This isolates the timing of the actual component from the system I/O usage. Even the extra memory used by the intermediate storage of the entire input can be noted at the end of that stage, to be subtracted out of the process total. And of course it doesn’t affect the measurements of the object count, since it’s unrelated to that.
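
The two-stage test driver described here might be sketched as follows. The component `processLine` is a stand-in invented for this example (it only adds up line lengths), and an in-memory string stream takes the place of the input file.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Stage 1: read the whole input into memory, so I/O is finished
// before any timing of the component starts.
std::vector<std::string> readAllLines(std::istream& in) {
    std::vector<std::string> lines;
    for (std::string line; std::getline(in, line);) {
        lines.push_back(std::move(line));
    }
    return lines;
}

// Hypothetical component under test: here it just returns the line length.
std::size_t processLine(const std::string& line) { return line.size(); }

struct RunResult {
    std::size_t checksum;
    std::chrono::steady_clock::duration componentTime;
};

// Stage 2: time only the component while iterating over the in-memory vector.
RunResult runComponent(const std::vector<std::string>& lines) {
    const auto begin = std::chrono::steady_clock::now();
    std::size_t checksum = 0;
    for (const auto& line : lines) checksum += processLine(line);
    return {checksum, std::chrono::steady_clock::now() - begin};
}

RunResult demoTwoStage() {
    std::istringstream input("alpha\nbeta\ngamma\n");  // stands in for the file
    const auto lines = readAllLines(input);
    assert(lines.size() == 3);
    return runComponent(lines);
}
```

The checksum is returned so that the compiler cannot discard the loop being timed.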

José

The tool I have used the most to evaluate the performance of my programs is valgrind.

I find it to be the best tool to find memory leaks — simple usage is quite intuitive but it still has a bunch of extra powerful features.

I worked on robotics for many years, and our target systems had much less memory and computational speed than regular computers. This usually means that running valgrind on the target systems slows them down to the point where they become unusable. However, I still got lucky a handful of times: valgrind caught a few issues on target without compromising the robot’s behavior too much, issues that were not reproducible in the debug/simulation environment.

Ivica

My favorite tool is perf to collect the profiling data, and speedscope to visualize it. Perf is really good: it is available on Linux by default, and it can measure runtime as well as all sorts of other events, such as cache misses, mutex contention, mallocs, etc. You don’t need to recompile your code; everything works straight out of the box.

My favorite tool for visualization is speedscope. It takes perf’s output and displays it in a web browser. It supports flame graphs, but it can also track program execution over time or group functions by runtime. Here is how it works in practice:

https://www.speedscope.app/#profileURL=https%3A%2F%2Fraw.githubusercontent.com%2Fibogosavljevic%2Fjohnysswlab%2Fmaster%2F2021-03-speedscope%2Fspeedscope-ffmpeg.txt

Farés

These are the three big names I know of (as a Linux developer); there are plenty of smaller ones that haven’t seen active development in a while.

Valgrind
TAU – Tuning and Analysis Utilities
Google Perftools

I personally use Google Perftools because it
is faster than Valgrind (yet not as fine-grained),
does not need code instrumentation,
has nice graphical output (-> KCachegrind),
does memory profiling, CPU profiling, and leak checking.

IMHO, some go for a debugger (maybe the best method). All you need is an IDE or debugger that lets you halt the program. It nails your performance problems before you even get the profiler installed.

For instance, the profiler in Visual Studio 2008 is very good: fast, user-friendly, clear, and well-integrated in the IDE.

Tell me how your favorite tool helped you find the performance bottleneck.

Charles

In almost every case the performance bottlenecks I face in C++ are never where I think they should be, even after godbolting sections of code and comparing edits. These kinds of very introspective tools allow me to find the actual problem, not the hypothetical thing I think it is, and target it very directly.

David

Not sure I used one for this. Maybe commented-out lines of code are the most useful tool: I can split my code and activate or deactivate some parts to measure how much performance improvement I get. I can do some refactoring and see how my code changes by measuring it (using a simple `time` from outside, or a timer inside the code itself).

Sal

Writing a scope timer and a memory tracker really helped me understand how many allocations the standard vector and string cause. Using a pmr allocator or writing a custom allocator really helped to avoid as many of them as possible.
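
To illustrate the effect Sal describes, here is a small sketch that counts how many allocation requests reach the default heap resource, once for a plain `std::pmr::vector` and once with a `std::pmr::monotonic_buffer_resource` on a stack buffer in front of it. The class `CountingResource`, the two function names, and the 64 KiB buffer size are assumptions for this example; it requires C++17.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <vector>

// Hypothetical tracker: counts allocation requests and forwards them upstream.
class CountingResource : public std::pmr::memory_resource {
public:
    explicit CountingResource(
        std::pmr::memory_resource* upstream = std::pmr::new_delete_resource())
        : upstream_(upstream) {}

    std::size_t allocations() const { return allocations_; }

private:
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        ++allocations_;
        return upstream_->allocate(bytes, alignment);
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
        upstream_->deallocate(p, bytes, alignment);
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }

    std::pmr::memory_resource* upstream_;
    std::size_t allocations_ = 0;
};

// Every regrowth of the vector goes straight to the counting resource.
std::size_t allocationsWithoutBuffer(int n) {
    CountingResource counter;
    {
        std::pmr::vector<int> values(&counter);
        for (int i = 0; i < n; ++i) values.push_back(i);
    }
    return counter.allocations();
}

// The stack buffer absorbs the regrowths; the counting resource is only
// asked when the buffer runs out.
std::size_t allocationsWithBuffer(int n) {
    CountingResource counter;
    std::array<std::byte, 64 * 1024> buffer;
    std::pmr::monotonic_buffer_resource pool(buffer.data(), buffer.size(), &counter);
    {
        std::pmr::vector<int> values(&pool);
        for (int i = 0; i < n; ++i) values.push_back(i);
    }
    return counter.allocations();
}
```

For 1,000 integers the buffered variant should not need the heap at all, because the regrowths fit into the 64 KiB buffer, while the unbuffered variant triggers one heap allocation per regrowth.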

I am still learning those and I know performance analysis is hard since it involves CPU as well and the number of tests that you can run is limited. Therefore having CI tools and using tools such as Google Benchmark and Quickbench will help. I will be learning those as well.

John

Well, in the current project, I gave typical examples above: see which architectural parts are slowest, and see if an edit had a change in the overall timing or memory usage.

But I would say a “favorite” would be when I could almost instantly say “The problem is not in my code.” Maybe (just making up some numbers here) the client (internal customer using this component in a larger context) finds that a call is taking 25 seconds. He can only separate out the wall-clock time of the call and the overhead of the remote network call, as that’s built into the framework (see: built-in measurements are handy!) But I can look at the logging and say that my component only took 8 seconds, and that his problem is that he’s spending all his time formatting the output into JSON for the return over the network.

Let me add that this technique is available and very useful even on platforms where we don’t have other tools like profilers, or a GUI running on the same system. This can be a locked-down restricted corporate server with nothing installed other than what came in the OS image, or an embedded system with no room for anything other than the actual code and none of the needed support for such tools to even exist.

José

I distinctly remember two instances of the tool helping me. One as a direct result of the output valgrind provides, the other as an indirect result.

Well, the former was a very small leak growing unbounded from object creation and destruction; the robot would behave properly for 3-4h and then become unresponsive — we noticed RAM usage going through the roof. No issues happened in the simulation environment, however, running valgrind in the target for a few minutes was enough to catch the small leak — in a servo driver which explained why it was not happening in the simulation environment.
The latter proved to be a memory leak caused by a race condition in our interactions with external systems. When running in a simulation environment, valgrind detected leaks in distinct portions of the code, and we were unable to pinpoint where. But in real deployment, valgrind detected no leaks, which really puzzled us at first: could they be happening in the emulation layer? We also took into consideration that the system runs much slower when valgrind is monitoring memory usage, so maybe the leak was caused by some faster sequence of actions. That hint helped us find the culprit: it was in a communication protocol layer responsible for exchanging data with the GUI.

Ivica

Perf+speedscope helps me every day. I record what the program is executing with `perf record`, then I visualize it with `perf script | speedscope -`. This fires up the web browser and displays the profile.

Farés

I googled the info and stumbled upon this article, which I still need to understand. JFYI: http://15418.courses.cs.cmu.edu/spring2013/article/19

