What is this thing called Profiling?

If you’re reading this blog (I’m originally posting this on hirt.se), you probably already know who I am and my background. As some of you may know, one of my current responsibilities at Datadog is the Continuous Profiler product. After some discussions with profiling team members, I found it interesting that there are many subtly different ideas about what profiling is – often influenced by what a particular ecosystem is calling profiling.

So, here are my (unsolicited) thoughts around what software profiling is. 😉

What’s in a Word

Profiling literally means trying to understand behaviour. When asked to define Profiling, Google will say this:

“the recording and analysis of a person’s psychological and behavioural characteristics, so as to assess or predict their capabilities in a certain sphere or to assist in identifying categories of people.”

This is analogous to what we typically mean when we talk about the profiling of software, which I would simply state as:

“The recording and analysis of a program’s runtime behaviour.”

We’re simply trying to understand how it behaves (recording data), and why it behaves that way (analysis of the data), so that we can improve some aspect of the program. There are other techniques we can use to understand the semantic behaviour of a program, such as debuggers. In profiling, though, we’re trying to understand the runtime behaviour – how the program (your code) behaves in terms of its utilization of constrained or costly resources, such as CPU, memory, locking primitives and other potentially thread-latency-inducing operations.

As to why we want to understand the runtime behaviour, there are a lot of different reasons these days:

  • To support performance engineering, for example:

    • Reducing the cost of running a program.

    • Making the software run faster (throughput).

    • Making the software run smoother (e.g. fewer latency outliers, fewer STW interruptions, less variance).

    • Understanding performance deltas between versions of a program.

    • Optimizing resource utilization.

  • To aid in diagnostics, for example:

    • Helping discover and understand unknown unknowns.

    • Helping explain deviations from normal behaviour, e.g. why there was suddenly a 40-second delay in the execution of a particular program at a particular point in time.

    • Helping provide a link back to the source code in the context of something else the user cares about, making whatever occurred more actionable.

I would argue that there are a few additional constraints for a profiler to be truly usable these days:

  • The observer effect tells us that we can’t observe a system without affecting it. That said, a profiler that materially changes the runtime behaviour of the software it is profiling is not very useful – it will have us barking up the wrong tree.

  • Because of this, profilers must make trade-offs. A profiler that, in its quest not to affect the runtime behaviour of the software it is profiling, ends up misrepresenting that behaviour too much is also not very useful.

  • Also, since it is notoriously difficult to build a test that perfectly mirrors how the system behaves in production at 8:00 p.m. on Black Friday, these days you typically want a profiler with overhead low enough (it will be too costly otherwise), and stability good enough, that you can use it continuously in production. A continuous profiler is a very powerful tool for finding the unknown unknowns, especially when you’re in a tough spot.

  • With the uptake of continuous integration and continuous delivery practices (CI/CD), a new version of a program can be published every few hours, or even more often than that. You will want, at least, to have production data around for every version you publish, and probably from multiple different time periods during the process lifecycle. (Of course, with continuous profiling, this point is moot – you have data for all time periods, should something happen.)

Sampling Profilers

Today, most profilers are sampling profilers. A sampling profiler collects data about a program’s execution by periodically sampling the program’s state at specific intervals. In contrast to profilers that capture every function call, or that record data at specific runtime events, sampling profilers gather information by intermittently observing the program’s execution state.

This means that the correlation with a certain runtime characteristic will depend on when the sample is taken. To do CPU profiling, wait for a thread to use up a certain amount of CPU time, then signal the thread and take a sample. To do allocation profiling, wait until a certain amount of memory has been allocated, then take a sample (in the allocation path of the runtime). To do lock profiling, wait until a monitor has been waited on for a certain amount of time, then take the sample (in the appropriate monitor handling path of the runtime). The reason one must sample is that tracing every method/function invocation would cause too much overhead, quite possibly affecting the runtime behaviour of the application.
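
To make the mechanics concrete, here is a deliberately naive sketch in Java (all names are made up, and this is a wall-clock sampler rather than the CPU-time-triggered, signal-based approach described above): on a timer, it grabs the stack traces of all live threads together with a timestamp. A production profiler would be far more careful about overhead and bias.

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.*;

// A (very) naive wall-clock sampler: every 10 ms, capture the stack traces of all
// live threads together with a timestamp. Real CPU profilers instead trigger on
// consumed CPU time (often via per-thread timers and signals) to avoid biasing
// towards idle threads, and are much cheaper per sample.
public class NaiveSampler {
    record Sample(Instant timestamp, Thread thread, StackTraceElement[] stack) {}

    private final ConcurrentLinkedQueue<Sample> samples = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "naive-sampler");
                t.setDaemon(true);
                return t;
            });

    public void start() {
        scheduler.scheduleAtFixedRate(this::sampleAllThreads, 0, 10, TimeUnit.MILLISECONDS);
    }

    private void sampleAllThreads() {
        Instant now = Instant.now();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            samples.add(new Sample(now, e.getKey(), e.getValue()));
        }
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```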

A sampling profiler will try to sample uniformly over some quantity, for example every 9 ms of CPU-time consumed. This gives some rather nice statistical properties. It is easy to aggregate the samples and relate them to that quantity – “this method is on average using 456.32 ms of CPU time / s”, “that method is responsible for an allocation rate of 845 MiB / s (which in turn is why your garbage collector is running hot)”.
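
As a back-of-the-envelope illustration of that aggregation (the numbers below are made up): if one sample is taken per 9 ms of consumed CPU time, a method that is on top of the stack in 3000 samples over a 60 second window accounts for roughly 450 ms of CPU time per second.

```java
public class SampleMath {
    public static void main(String[] args) {
        // Hypothetical numbers: over a 60 second window, a given method was on top of
        // the stack in 3000 samples, with one sample taken per 9 ms of CPU time.
        long samplesWithMethodOnTop = 3000;
        double sampleIntervalMillis = 9.0;
        double windowSeconds = 60.0;

        double cpuMillisPerSecond = samplesWithMethodOnTop * sampleIntervalMillis / windowSeconds;
        System.out.printf("Estimated CPU usage: %.1f ms of CPU time per second%n", cpuMillisPerSecond);
        // 3000 * 9 ms / 60 s = 450 ms of CPU time per second for that method.
    }
}
```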

Note that sampling profilers do not need to pre-aggregate data to be proper sampling profilers. With the advent and adoption of the pprof format this is sometimes assumed, but there are plenty of sampling profilers that also capture the time at which each sample was taken, which makes the samples much more useful for diagnostics. One recent example was a Go service where it wasn’t discovered, until the timestamps were preserved, that the service had bursts of activity lasting a few tens of milliseconds every 10 seconds – something that stood out very clearly in a heat map once per-sample timestamps were included. Collecting timestamps per sample (and adding context) helps immensely with diagnostics, but more on this later.

We don’t necessarily need stack traces for this to be profiling. We capture whatever is needed to understand how the program came to exhibit the observed behaviour. That said, having no execution context at all, for example just a simple performance metric, will usually not be enough to meaningfully help with the analysis part. It’s usually understood that the stack trace will be one of the primary pieces of data included in a sample, since it is indeed very useful in many cases and languages.

Also note that the data production rate can still be hard to understand, even with a sampling profiler. For CPU it’s relatively easy – the upper limit will be #cpus * average sample size / sample interval. For allocation sampling, it was hard enough that we (Datadog) introduced a new rate-limited allocation profiler in OpenJDK, conceptually using a discrete PID controller to decide how to subsample (we can’t use reservoir sampling, since we don’t want to do the work up front and only then decide which samples to keep). Each sample also records the amount of memory allocated since the last sample, so that the data can be normalized.
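
To put the CPU upper bound into numbers (a sketch with purely hypothetical values for CPU count and sample size):

```java
public class SampleDataRate {
    public static void main(String[] args) {
        // Hypothetical numbers, just to illustrate the upper bound for CPU sampling:
        int cpus = 16;                      // hardware threads, all fully busy (worst case)
        double avgSampleBytes = 2 * 1024;   // assumed average encoded sample size
        double sampleIntervalSec = 0.009;   // one sample per 9 ms of CPU time per CPU

        double maxBytesPerSec = cpus * avgSampleBytes / sampleIntervalSec;
        System.out.printf("Worst-case data rate: %.1f MiB/s%n", maxBytesPerSec / (1024 * 1024));
        // Prints roughly 3.5 MiB/s with these numbers. Allocation sampling has no such
        // natural cap, which is why rate limiting / subsampling is needed there.
    }
}
```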

Execution Tracers and Event Recorders

Another kind of profiler is the so-called execution tracer. An execution tracer instruments certain operations in the runtime and typically provides events around them, often containing useful diagnostic information. For example, the monitor event in JFR will contain information about the monitor class, the thread holding on to the monitor and blocking our hero thread, the address of the monitor (so that we can see if there are multiple monitor instances at play), and more. Note that sampling profilers (especially runtime-specific ones) can capture such information as well, so the difference is mostly in how the sample is taken.

Since emitting data for every invocation of a pathway in the runtime can be prohibitively expensive, tools like JFR provide configuration options to subsample the data in different ways. This can be useful, for example, for outlier profiling. The thread latency events in JFR, such as the monitor enter event, are a good example: you can configure them to only record events with a duration longer than a specified threshold.
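
As a small sketch of what that looks like in practice with JFR’s event streaming API (the event name and fields are the standard jdk.JavaMonitorEnter ones; the 10 ms threshold and the printout are just illustrative):

```java
import java.time.Duration;
import jdk.jfr.consumer.RecordingStream;

public class MonitorOutliers {
    public static void main(String[] args) {
        try (RecordingStream rs = new RecordingStream()) {
            // Only record monitor-enter events where the thread was blocked for longer
            // than 10 ms, i.e. outlier profiling rather than capturing every contention.
            rs.enable("jdk.JavaMonitorEnter")
              .withThreshold(Duration.ofMillis(10))
              .withStackTrace();

            rs.onEvent("jdk.JavaMonitorEnter", event -> {
                System.out.printf("Blocked %d ms on %s (held by %s)%n",
                        event.getDuration().toMillis(),
                        event.getClass("monitorClass").getName(),
                        event.getThread("previousOwner") == null
                                ? "<unknown>"
                                : event.getThread("previousOwner").getJavaName());
            });

            rs.start(); // blocks; use startAsync() to run in the background
        }
    }
}
```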

Serious drawbacks of execution tracers, even the ones that only capture specific events, are that:

  • The amount of data emitted can be very hard to reason about unless there is a subsampling strategy in place (and even then it can be tricky).

  • Depending on the subsampling strategy, some aggregations will be harder to reason about. For example, if you’re looking at latency outliers, you can paint a picture of where the latency outliers are, but you will not be able to tell what the average latency is.

Performance Engineering

Performance engineering is the black art of optimizing software to do better on some runtime characteristic. To do that, profiling data is analyzed, for example to learn where CPU time is spent, so that the program can be optimized to use fewer resources (e.g. CPU), which in turn makes it less costly to run (it needs less hardware). Supporting performance engineering is what people most commonly associate with profiling, and for that use case it’s most commonly CPU profiling that springs to mind. Many programming languages/runtimes have the concept of a stack, and ultimately the underlying hardware has hardware threads and stacks to execute. Therefore, to understand how we came to execute a particular method or function, profilers will often capture a stack trace. Depending on the programming language and environment (such as the frameworks used), this can be more or less useful, as practitioners of reactive programming and async frameworks know all too well. (Project Loom is in a way a response to this problem. Make Stack Traces Great Again! 😉 )

Both sampling profilers and execution tracers can be put to good use to understand the runtime profile of a program. For example, if you demand a little bit more of your distributed tracer and its integrations, and keep tabs on when threads are doing work in the context of a certain trace / span / operationName, you can start aggregating profiling information by endpoint, for example showing the amount of CPU time spent per endpoint.
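
A hand-wavy sketch of that aggregation step (the types and names here are hypothetical, not any particular tracer’s API): if every CPU sample carries the endpoint the sampled thread was serving, rolling the samples up per endpoint is just a group-and-sum.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EndpointAggregation {
    // Hypothetical sample shape: which endpoint the thread was serving when sampled,
    // and how much CPU time the sample represents.
    record CpuSample(String endpoint, long cpuNanos) {}

    // Roll up CPU time per endpoint, e.g. to show "CPU time by endpoint" in a UI.
    static Map<String, Long> cpuByEndpoint(List<CpuSample> samples) {
        Map<String, Long> totals = new HashMap<>();
        for (CpuSample s : samples) {
            totals.merge(s.endpoint(), s.cpuNanos(), Long::sum);
        }
        return totals;
    }
}
```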

[Figure: profiling data aggregated by endpoint]

Using Profiling for Diagnostics

Of course, when you have samples that contain context and time information, you can also go look at that information when something goes spectacularly wrong. For example, looking at a timeline view of the threads involved in processing part of a distributed operation that was painfully slow can reveal a lot. When something goes wrong, and it has not been a priori covered by the tracer, logging or some other instrumentation, profiling data is often the last resort for explaining what went wrong. In other words, it can help understand the unknown unknowns.

Here are some screenshots that hopefully will give you some idea of the capability:

[Screenshots: profiling timeline views of the threads involved in a distributed trace span]

For some examples using this feature (in Go), see Felix Geisendörfer’s YouTube video.

Note that these screenshots were from profiling timelines for specific spans in a distributed trace. It’s also possible to look at a timeline for all the threads in the runtime.

Some time ago we had a 21-second span that remained unexplained until the profiling data showed that it was a safepointing VM operation related to dumping the heap – someone had ssh’d into the machine and used jcmd to request a full heap dump. Had it not been for recording outliers for safepointing VM operations, this could have been hard to explain. Profilers written purely in Java, using exceptions to force stack walks or using the Java stack walking APIs, would never know better. For them, it would simply have looked as if the world had stopped, and the only visible effect would have been that the clock suddenly skipped ahead 21 seconds.

TL;DR

  • Profiling is the recording and analysis of the runtime behaviour of a program.

  • Profiling is not only useful for performance engineering; it can also be a very powerful diagnostic tool.

  • Profiling samples often contain stack traces, but in some paradigms, stack traces will not be the most helpful thing to explain why something ended up being called.

Many thanks to my colleagues at Datadog for all the awesome work they do, and for the feedback on this post.
