OpenJDK and the Future of Production Profiling

Some thoughts on the future of continuous production profiling on the OpenJDK platform.

A long time ago, the JRockit Runtime Analyzer (JRA) was introduced into the JRockit JVM as a means of being able to figure out what was going on in the JVM. It was mainly there to find out how customers were using the JVM, so that the JVM could be optimized for actual real-world production work-loads. The JRA output the data as XML, since customers insisted on the data being human readable so that they could see exactly what they would be sending us. Later LAT (LATency Analyzer) was introduced, since after the introduction of the JRockit low latency garbage collector (a.k.a the DetGC), some customers complained about the GC not keeping its latency promises. More often than not, it turned out it was other kinds of thread stalls causing the latencies, so LAT was introduced so that the JRockit team could figure out where the problematic code was. Since there could be a considerable amount of data in LAT, a binary format was introduced for the events recorded.

Eventually the production profiling (JRA + LAT) had a model overhaul and became the JRockit Flight Recorder. The data format was binary, self describing, extensible and efficient. After Oracle acquired Sun Microsystems, and the Hotspot and JRockit JVM teams merged, it became the Java Flight Recorder (JFR), and in 2018 it was open sourced as the JDK Flight Recorder (still JFR), since calling anything related to Java something with “Java” in the name can be complicated.

The JDK Flight Recorder design philosophy is to be the one-stop-shop production profiler for OpenJDK. JFR needs to be able to do various kinds of profiling, all at the same time, at a low overhead. It also needs to be able to run continuously for as long as someone is interested in the data. Potentially always.

Now, with changes in the Java (and the computing) ecosystem, JFR has some loom-ing challenges to remain relevant for the future.

JFR has a lot of nice properties:

  • It does multiple kinds of profiling, normally at a low overhead (space and CPU).
    (Profiling types include allocation profiling, almost CPU profiling, latency outlier profiling and much more.)
  • It doesn’t suffer from the same constraints as Java-only based profilers.
    (Weaknesses usually include things like safe point bias, safepointing VM operations, allocation pressure, undoing scalarization optimizations, lack of STW/safepoint visibility.)
  • It provides context helpful to solving problems.
    (For example, a monitor enter event has the monitor class, the monitor address, the thread holding the monitor, the thread and stack trace for the blocking call and more.)
  • It is low overhead, (mostly) designed for predictable data rates and overhead.
  • It is extensible – you can add your own profilers and data.
    (Datadog, for example, has its own rather useful exception profiler which is publishing data into JFR.)
  • As long as all your data is produced in JFR, all the common constants in the runtime will be recorded into the same constant pool.
    (For example, if you record some data using JFR and some data with a custom agent, common class names, method names etc will need to be repeated in both of the serialization formats.)

There are however a few ways in which JFR can, and probably has to, improve, to remain the ultimate production profiling platform for Java:

  • Data rate (and thereby overhead) needs to be more deterministic.
  • CPU profiling needs to be improved.
  • Wall clock profiling needs to be possible.
  • It needs to be possible to easily and cheaply associate context with events.

Let’s take these in reverse order.

Recording Context

A very common problem is to try to associate some contextual information with the data produced by JFR. For example, in the good old Oracle days, there were events produced by WebLogic Server (WLS) containing what was called the Execution Context IDs (ECIDs). Since JFR events are thread local, it was possible to use them to find out what other, lower level, information was captured during that time, for example where the allocation pressure was, or why a call to a logger leading to a blocking call trying to enter a contended monitor happened. The instrumented code, and thereby the events, were usually emitted where something took a bit of time, such as database IO and similar, so the overhead was not bad.

Fast-forward to today. We now have distributed tracers doing pretty much what WLS was doing back in the day. We now have a trace, which is a directed acyclic graph of spans, which we track. The difference being that with the microservices and async frameworks that are commonly in use these days, these spans can now last a very short time. The tracers do keep track of when the context is propagated to other threads; in other words, the tracer can know where work is carried out on specific threads for a specific trace/span. In the end, the tracer doesn’t really care though – it only keeps track of this to know when a span can be closed. No matter how many related little work items were scheduled on various threads, the tracer will simply note when the span was started, and when the span was closed. For the profiler, however, the information about which thread did work related to a specific span is crucial to be able to figure out what traces were captured in the context of what spans/traces. Since a thread can switch between doing work for different spans at an incredible rate, the overhead of using additional events to keep track of them can be quite costly. Especially if the switches happen way more often than a sample is being captured.

This will only be compounded once Loom becomes part of the Java platform, and we can have millions of Virtual Threads (fibers, p-threads, MxN threads, green threads, whatever you want to call them) running.

A long time ago, the JFR team was discussing something we called thread coloring – the ability to tell a thread what the current context is. This is exactly what Go does today with goroutines. In Go it is called profiling labels. Go pre-aggregates the CPU profiling data, but allows the settings of labels to group the profiling data into buckets. Sadly, in Go it is currently limited to CPU-profiling only.

One solution to the context problem could be to allow setting the context for a thread – conceptually a Map<String, long>, and to provide some new settings for an event, for example record-context, to decide if the context should be recorded into the event or if we don’t care, and record-only-on-context, to decide to record an event only if there is some context present.

Having such a capability would not only make it possible to associate events with a context even where such a context changes rapidly. It could also be used to only record certain events when they occur in a context for which we care, possibly bringing down the overhead of capturing data and stack-traces, and saving memory in the buffers for the things we find to be the most relevant. A concrete example would be to only record certain information if a thread is currently associated with a trace which will be sampled.

See https://bugs.openjdk.java.net/browse/JDK-8264516.

Wall-Clock Profiling Must Be Possible

JFR has an interesting mix of sampling profiling, thresholded execution tracing and metrics. For example, it has a sampling execution profiler, a rate-limited sampling allocation profiler etc. It also has events to locate thread latency outliers. These events are not sampled, but give you exact information about a thread halt lasting longer than a configurable threshold.

Of course, one problem with thresholding is that you can theoretically have a lot of thread latencies just below the threshold. It is easy to construct an artificial benchmark where the data indicates that you have no thread latencies whilst having nothing but thread latencies.

One way to get unbiased information, at a reasonable cost (i.e. not setting the threshold to 0 for the latency outlier events), is to introduce a wall clock profiler. A wall clock profiler basically periodically dumps the thread stack for a thread, no matter what it is up to. A limitation is that the wall clock profiler has no idea of what the thread was actually up to – we can’t get all the juicy, custom, actionable, information at that point – we usually only get the thread dump.

Also, given that you might have millions of (virtual) threads, you probably don’t want to sample them all at any given time, and you certainly don’t want to stop the world whilst doing the sampling. As a matter of fact, for most applications, looking at wall clock profiling information from all threads will be a very boring exercise – most of them will be halted waiting for something to do, for example in thread pools, or waiting in some I/O. For example, for your average Java recording, you will have plenty of threads in thread pools, being parked. To see that you have thousands of threads waiting on a park call in a thread pool will not help you resolve a great many problems. If you, on the other hand, have a really long park causing a trace to take a really long time to complete, then you’ll likely be interested in what is going on.

This means that different profilers probably want these wall-clock samples to be picked differently. One way to accomplish this would be to allow users of JFR to commit events on a separate thread.

Event#commit(Thread)

This would allow someone to build a profiler by simply periodically committing events on the threads it currently cares about. It would also allow building other kinds of profilers and sampling behaviours.

Another way of accomplishing this would be to add an annotation to override default stuff captured by JFR in an event.

class MyEvent extends Event {

    @overridejfr
    Thread overrideThread;
}

An advantage with this variant is that you could potentially add more override behaviours that are handled by JFR over time.

A third variant might be to introduce a configurable global wall clock event, perhaps one that picks one thread at a time in a round robin fashion, at a certain frequency. Together with the context feature mentioned above, it would be possible to only emit events where there is useful context available.

Yet another variant may be to add a native function (extern “C”) to emit an event. Something similar to AsyncGetCallTrace (probably implemented using AsyncGetCallTrace), but which emits an event. If we have the context feature described above, it could be used to decide where events are emitted. If different kinds of profiling will be implemented, you can simply provide context information about the kind of profiling being done to avoid the profilers interfering with each other.

There is JDK-8237206 for this, but there isn’t much information recorded in the issue.

CPU Profiling Improved

The current way of doing CPU profiling with JFR is to use the Execution Sample Event in JFR. It gives you a good idea of where the JVM is spending the most CPU time executing Java code. It is cheap, has low overhead, and satisfies the needs of many users.

That said, there are some problems. One is that it does not cover the full Java process. If there is a native library running native threads, the CPU time spent there won’t be accounted for. Also, some other native and intrinsified code, for example the JVM native threads, will not be covered. It is easy to write a benchmark where JFR will capture almost no samples at all.

To get a better understanding of where the CPU is being spent, we need to build a profiler actually sampling on CPU time, and that can, at least optionally, capture the full thread stacks including native frames. One approach to this is to do what async profiler does – use kernel profiling data and line it up with information from AsyncGetCallTrace.

At Datadog we’ve built a pretty awesome CPU profiler for Linux, based around perf_event_open – one that does really well, and which handles symbol resolution in containerized environments better than most. Sadly, we don’t have it for Windows and Mac, so, if we contribute it to OpenJDK, we would need help implementing it on other platforms.

There is JDK-8234854, but I haven’t added much context yet.

Deterministic Overhead and Memory Use

One problem that JFR used to have, before JDK 16, was that allocation profiling could get very expensive. Especially for allocation heavy applications, running on plenty of CPUs. It could get very expensive both in terms of overhead but also in terms of size. For example, running a well parallelized, allocation heavy, application on a 96 core machine could produce millions of events per minute, resulting in recording sizes of hundreds of megabytes per minute.

In JDK 16 Datadog submitted a new kind of allocation profiler, which was rate limited. It drew inspiration from PID controllers (yes, I still love robotics, even though there is literally no time for it these days) to let the user specify a budget of the maximum number of events per second, and then the controller will attempt to keep the number of events emitted to that budget, subsampling if necessary.

We had already used the same technique successfully for our exception profiler, since it turns out that some people use not only exceptions, but even subclasses of Error (!), for flow control.

Now, it may be useful to use the same kind, or some similar kind, of safety valve to limit both the overhead and the data production rate for other events that are prone to the same edge cases. For example, the thread latency events today are currently thresholded. This means that you perhaps, given the number of threads, might be able to estimate the maximum number of events you could produce. But it’s not very helpful, since you might be running very few, or, in the future, with Loom, millions of them.

Instead you might want to subsample them, at a max rate that you can set yourself, per event type. You could even allow for thresholding and rate limited sampling at the same time, accepting a certain bias, e.g. if you want to focus on longer lasting thread halts.

Summary

TL;DR, JFR has some interesting challenges ahead. Some of them will be compounded by the introduction of Virtual Threads. For JFR to remain the premium, best in class, production profiling platform, some investment will be needed. Datadog will try to help, but some of these problems will require updates to the JFR file format and updating the API. The OpenJDK community will need to be involved.

A Closer Look at JFR Streaming

By Marcus Hirt and JP Bempel

Since JDK 14 there is a new JFR kid on the block – JFR streaming. 🙂 This blog post will discuss some of the things that you can do with JFR streaming, as well as some of the things you may want to avoid.

An Introduction to JFR Streaming

In the most recent version of the JDK a new JFR-related feature was introduced – JFR streaming. It is a feature allowing a developer to subscribe to select JFR data and to decide what to do with that data in the host process. JFR events can also be consumed from a separate process by pointing to the file repo of a separate JVM process – the mechanism is the same.

The JFR streaming works by allowing the reading from the JFR file whilst it is being written, the emissions to disk happening more frequently (usually every seconds, or when the in memory buffers are full) than during your normal flight recordings, where the data is emitted when the in-memory buffers are full. It does not support streaming directly from in-memory buffers, and the events are not delivered synchronously when they are occurring in the JVM.

The new functionality mostly resides in jdk.jfr.consumer. This is how you would open an event stream and start consuming the CPU load with 1 second intervals and the monitor class when blocked to enter a monitor for 10 ms:

try (var rs = new RecordingStream()) {
  rs.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(1));
  rs.enable("jdk.JavaMonitorEnter").withThreshold(Duration.ofMillis(10));
  rs.onEvent("jdk.CPULoad", event -> {
    System.out.println(event.getFloat("machineTotal"));
  });
  rs.onEvent("jdk.JavaMonitorEnter", event -> {
    System.out.println(event.getClass("monitorClass"));
  });
  rs.start();
}

The RecordingStream is what you would use to control what is gathered from within the Java process, effectively also controlling the recorder.

Here is another example using the default recording template, and printing out the information for garbage collection events, cpu load and the JVM information:

Configuration c = Configuration.getConfiguration("default");
 try (var rs = new RecordingStream(c)) {
     rs.onEvent("jdk.GarbageCollection", System.out::println);
     rs.onEvent("jdk.CPULoad", System.out::println);
     rs.onEvent("jdk.JVMInformation", System.out::println);
     rs.start();
   }
 }

The EventStream class can be used together with the standard flight recorder mechanisms to gather information from ongoing recordings, even ones being done in separate processes or an already recorded file. Here is an example using the EventStream to get some other attributes of the CPU load and information from garbage collections from within the Java process (needs an ongoing recording):

try (var es = EventStream.openRepository()) {
   es.onEvent("jdk.CPULoad", event -> {
     System.out.println("CPU Load " + event.getEndTime());
     System.out.println(" Machine total: " + 100 * event.getFloat("machineTotal") + "%");
     System.out.println(" JVM User: " + 100 * event.getFloat("jvmUser") + "%");
     System.out.println(" JVM System: " + 100 * event.getFloat("jvmSystem") + "%");
     System.out.println();
   });
   es.onEvent("jdk.GarbageCollection", event -> {
     System.out.println("Garbage collection: " + event.getLong("gcId"));
     System.out.println(" Cause: " + event.getString("cause"));
     System.out.println(" Total pause: " + event.getDuration("sumOfPauses"));
     System.out.println(" Longest pause: " + event.getDuration("longestPause"));
     System.out.println();
   });
   es.start();
 }

This is the EventStream interface used to consume and filter an event stream:

public interface EventStream extends AutoCloseable {
  public static EventStream openRepository();
  public static EventStream openRepository(Path directory);
  public static EventStream openFile(Path file);

  void setStartTime(Instant startTime);
  void setEndTime(Instant endTime);
  void setOrdered(boolean ordered);
  void setReuse(boolean reuse);

  void onEvent(Consumer handler);
  void onEvent(String eventName, Consumer<RecordedEvent> handler);
  void onClose(Runnable handler);
  void onError(Runnable handler);
  void remove(Object handler);
  void start();
  void startAsync();
  void awaitTermination();
  void awaitTermination(Duration duration);
  void close();
}

The open* methods allow you to open a specific file or a specific file repository (for example from a different process). The set* methods allow you to filter on time and to select if you want to enforce that the events are delivered in time order. You can also allow the reuse of the event object that gets delivered, to get the memory pressure down a bit.

The onEvent* allows you to register a consumer for handling the events, either all of the events or by event name (type). The start method kicks off the processing in the current thread, startAsync is a convenience method for kicking off the process in a single separate thread.

Where to use streaming

There are several advantages to JFR event streaming. It is a great way to access JFR data for monitoring purposes. You get access to detailed information that was previously unavailable to you, even from different processes, should you want to.

Here are some examples:

  • Directly send monitoring data to your favourite monitoring service
    For example streaming select metrics over to Datadog. 😉 Not that you would need to – we already derive interesting performance metrics from the (full) flight recordings we capture. We even track complex metrics like top hottest methods, or top allocations sites, over time, using what we internally call high cardinality metrics.
  • Pre-aggregating data before sending it off
    For example, you could get the CPU load every second, and then every five minutes roll it up to an average, median, min, max and a standard deviation, not having to send every single entry.
  • Act on profiling data in-process
    You could, for example, make decisions for controlling the normal flight recordings given some statistics you track, like enabling certain events when it looks like it could be interesting.
  • Expose JFR data through other management APIs
    For example, adding an MBean exposing select JFR data over JMX.
    That said, there might be an API to directly connect to an MBeanServerConnection directly in the future[1]:

    MBeanServerConnection conn = connect(host, port);
    try (EventStream es = new RemoteRecordingStream(conn)) {
      es.onEvent("jdk.GarbageCollection", e -> ... );
      es.onEvent("jdk.ExceptionThrown". e -> ...);
      es.onEvent("jdk.JavaMonitorBlocked", e-> ...);
      es.start();
    }
    

It also allows you to skip the metadata part of a normal flight recording. The metadata in JFR contains the information about what was recorded, so that you can parse and view data that you may not even know about beforehand. In the case of monitoring a few well known data points, this is redundant information to keep sending over and over again.

Erik Gahlin has a neat example for producing health reports using JFR streaming, displaying pre-aggregation of the top frames for execution samples and allocation, as well as doing stats for some common data points, like GC metrics and CPU load.

Where Not to Use Streaming

This is from JEP-349[2], the JEP where JFR streaming was introduced:

To consume the data today, a user must start a recording, stop it, dump the contents to disk and then parse the recording file. This works well for application profiling, where typically at least a minute of data is being recorded at a time, but not for monitoring purposes.

Let’s explore why the JEP differentiates between monitoring and profiling. Some events in JFR are simple data points in time. Some are more complex, containing plenty of constants. For example stack traces. JFR takes great care to record these complex data structures in a binary format that doesn’t take a lot of processing time to produce, and which is still compact.

Some of the JFR events occur quite frequently – for example, a typical one minute recording of data can contain hundreds of thousands of events. The file size for such a recording is typically only a couple of MB large. There is a mix of techniques used to keep the size down, such as using constant pools to ensure that information like method names are not repeated, LEB128 encoding of integers etc.

For profiling you typically want quite a few of these events enabled. JFR was built to emit this data at a very low overhead, and the data is eminently useful to get detailed information about things like why your thread is halting. For example, the stack trace to a place where your code had to wait to enter a monitor, complete with the class of the monitor waited on, the exact duration of the wait, which thread was holding on to the monitor (making you not able to enter), the monitor address and more. Not only that, there may have been other events providing context about what was going on in that thread at the time of the monitor enter, shining further light on what was going on. Events that you may not even know about.

If using JFR streaming for profiling, you would spend a lot of effort either naively sending constant information over and over again in an inefficient way (say, serializing all of it to JSon), or trying to spend a lot of effort reproducing the JFR format (introducing your own constant pools etc).

For example, the RecordedEvent class contains a method to get the RecordedStackTrace, which is a List of RecordedFrame. For each event you would walk through, the in-memory object model would be created.

You can externalize some of that cost, i.e. how the process you are monitoring is affected, by using another process to read the data as described above. That would, for example, lessen the allocation pressure in the process you’re monitoring. That is great, for example if you have a very latency sensitive process. That said, you have now created another Java process and put the costs over there (including the cpu overhead of dealing with the memory pressure as well as the memory overhead of running another JVM), typically on the same host. If you can afford to dedicate the memory and pin the event stream reader process to its own processor (cpu affinity), this can be a good solution though. Note that the same can be done for normal flight recordings, i.e. you can stream the recorded data directly from the file repository from a separate process.

Benchmarks of Using JFR Streaming Wrongly

For laughs and giggles, here are some benchmarks using just standard JFR to get all the data in the profiling template, compared to getting the equivalent information and serializing it to a JSon-like format using JFR Streaming. In other words, abusing JFR Streaming as a streaming replacement for getting the full JFR dataset. This is of course not what you should be using JFR Streaming for, but exemplifies how wrong you can land if you use the technology in a way it was never intended. We’ll look at the latency of http requests, the CPU time spent and the allocation pressure. We’ll also look at the size of the payload of information extracted. The benchmark is admittedly being a bit extra mean as well, to explore edge cases.

Note that this is a simplified example – we’re not even bothering with extracting the full stacktrace information and re-encoding it for streaming, whilst the JFRs in comparison already contain the full stack traces for all events efficiently encoded in constant pools. We could trade (even more) CPU for trying to bring the streamed data back to a JFR style format with constant pools again before storing/sending it. For serialization we’re simply doing toString() on the event objects, which gives us the events in a JSon-like format with only the top five frames of the stack trace. Including the full stack traces would add quite a bit more overhead to the streaming example.

The benchmark is available in this GitHub repository. It is based on the standard PetClinic application with some modifications to make it relevant to measure overhead in general.

Note: We are using an early access version of OpenJDK 15 in the benchmarks, since we discovered a bug whilst building this benchmark. See JDK-8245120.

The first chart shows the impact on http request latencies:

Note: Y-axis is in log scale to magnify the difference.

Next chart shows the CPU consumption. We measure in ticks from /proc/<pid>/stat at the end of the benchmark. This way we have a unique number easy to compare with other runs. It represents the CPU consumed when threads were scheduled on the CPU.

 

The following chart is the total amount of heap allocated during the benchmark. Numbers were extracted from GC logs.

The final chart represents the size of extracted information, as described above:

Note that the JFR file contains the full stacktraces, and that the jfr-streaming one is limited to the top 5 frames.

So what is the conclusion of these benchmarks? Not much, except for: “don’t use technology for things it was never intended for”. 😉

Summary

  • JFR streaming is a great new way to expose JFR data for monitoring purposes.
  • JFR streaming is available from JDK 14 and above.
  • JFR streaming is complementary to the already existing JFR APIs, which remains the go-to way to use JFR for low-overhead detailed information (i.e. profiling / execution tracing).
  • Knowing where and how to use JFR streaming is key to avoiding sad pandas.

[1]:https://www.reddit.com/r/java/comments/e97vos/jfr_event_streaming_with_jdk_14_in_outprocess/faiapm8/

[2]:https://openjdk.java.net/jeps/349

Fetching and Building Mission Control 8+

As described in a previous post, Mission Control is now on GitHub. Since this alters how to fetch and build OpenJDK Mission Control, this is an updated version of my old post on how to fetch and build JMC from version 8 and up.

Getting Git

First step is to get Git, the SCM used for OpenJDK Mission Control. Installing Git is different for different platforms, but here is a link to how to get started:

https://git-scm.com/book/en/v2/Getting-Started-Installing-Git

Installing the Skara Tooling (Optional)

This is an optional step, making it easier if you want to contribute to Mission Control:

https://hirt.se/blog/?p=1186

Cloning the Source

Once Git is installed properly, getting the source is as easy as cloning the jmc repo. First change into the directory where you want to check out jmc. Then run:

git clone https://github.com/openjdk/jmc.git

Getting Maven

Since you probably have some Java experience, you probably already have Maven installed on your system. If you do not, you now need to install it. Simply follow the instructions here:

https://maven.apache.org/install.html

Building Mission Control

First we need to ensure that Java 8 is on our path. Some of the build components still use JDK 8, so this is important.

java –version

This will show the Java version in use. If this is not a Java 8 JDK, change your path. Once done, we are now ready to build Mission Control. Open up two terminals. Yep, two!

In the first one, go to where your cloned JMC resides and type in and execute the following commands (for Windows, replace the dash (/) with a backslash (\)):

cd releng/third-party
mvn p2:site
mvn jetty:run

Now, leave that terminal open with the jetty running. Do not touch.

In the second terminal, go to your cloned jmc directory. First we will need to build and install the core libraries:

cd core
mvn install

Next run maven in the jmc root:

mvn clean package

JMC should now be building. The first time you build Maven will download all of the third party dependencies. This will take some time. Subsequent builds will be less painful. On my system, the first build took 6:01 min. The subsequent clean package build took 2:38.

Running Mission Control

To start your recently built Mission Control, run:

Windows

target\products\org.openjdk.jmc\win32\win32\x86_64\jmc.exe -vm %JAVA_HOME%\bin

Mac OS X

target/products/org.openjdk.jmc/macosx/cocoa/x86_64/JDK\ Mission\ Control.app/Contents/MacOS/jmc -vm $JAVA_HOME/bin

Contributing to JDK Mission Control

To contribute to JDK Mission Control, you need to have signed an Oracle Contributor Agreement. More information can be found here:

http://openjdk.java.net/contribute/

Don’t forget to join the dev list:

http://mail.openjdk.java.net/mailman/listinfo/jmc-dev

We also have a Slack (for contributors), which you can join here:

https://join.slack.com/t/jdkmissioncontrol/signup

More Info

For more information on how to run tests, use APIs etc, there is a README.md file in the root of the repo. Let me know in the comments section if there is something you think I should add to this blog post and/or the README!

Mission Control is Now Officially on GitHub!

Since this morning, the JDK Mission Control (JMC) project has gone full Skara! mc_512x512This means that the next version (JMC 8.0) will be developed over at GitHub.

To contribute to JDK Mission Control, you (or the company you work for) need to have signed an OCA, like for any other OpenJDK-project. If you already have an OpenJDK username, you can associate your GitHub account with it.

Just after we open sourced JMC, I created a temporary mirror on GitHub to experiment with working with JMC at GitHub. That mirror is now closed for business. Please use the official OpenJDK one from now on:

https://github.com/openjdk/jmc

If you forked or stared the old repo, please feel free to fork and/or star the new one!

Compressing Flight Recordings

Flight recordings are nifty binary recordings of what is going on in the runtime and the application running on it. A flight recording contains a wide variety of information, such as various kinds of profiling information, threat stall information and a whole host of other information. All adhering to a common event model and with the ability to dynamically add new event types.

In the versions of JFR since JDK 9, some care was taken to reduce the memory footprint by LEB 128 encoding integers, noting that many things, like constant pool indices, usually occupy relatively low numbers. The memory footprint was cut in about half, compared to previous versions of JFR.

Now, sometimes you may want to compress the JFR data even further. The question then is – how much can you save if you compress the recordings further, and what algorithms would be best suited for doing the compression? What if you want the compression activity to use as little CPU as possible?

My friend and colleague at Datadog, Jaroslav Bachorik, set out to answer that question for some typical recording shapes that we see at Datadog, using a set of compression algorithms from Apache Commons Compress (bzip2, LZMA, LZ4), the built in GZip, a dedicated LZ4 library, XZ, and Snappy.

Below is a table of his findings for “small” (~1.5 MiB) and “large” (~5 MiB) recordings from one of our services. The benchmark was run on a MacBook Pro 2019. Now, you’d have to test on your own recordings to truly know, but I suspect that these results will hold up pretty well with other kinds of loads as well.

Algorithm Recording Size Throughput Compression Ratio Utility
Gzip small 24.299 3.28 79.647
Gzip large 5.762 3.54 20.436
BZip2 small 6.616 3.51 23.198
BZip2 large 1.518 3.84 5.826
LZ4 small 133.115 2.40 319.394
LZ4 large 38.901 2.57 100.009
LZ4 (Apache) small 0.055 2.74 0.152
LZ4 (Apache) large 0.013 3.00 0.039
LZMA small 1.828 4.31 7.882
LZMA large 0.351 4.37 1.533
Snappy small 134.598 2.27 305.494
Snappy large 35.986 2.49 89.692
XZ small 1.847 4.31 7.964
XZ large 0.349 4.37 1.523

Throughput is recordings/s. Utility is throughput * compression ratio, and meant to capture the combination of compression strength and performance. Note that the numbers are not normalized – only compare numbers in the same size category.

Summary / TL;DR

  • The built-in GZip is doing a fairly good/balanced job of compressing flight recordings
  • You can get the best utility out of LZ4, closely followed by Snappy, but you sacrifice some compression
  • If you’re prepared to pay for it, LZMA and XZ give a good compression ratio
  • All credz to Jaroslav for his JMH-benchmark and all the data

JFR is Coming to OpenJDK 8!

I recently realized that this isn’t common knowledge, so I thought I’d take the opportunity to talk about the JDK Flight Recorder coming to OpenJDK 8! The backport is a collaboration between Red Hat, Alibaba, Azul and Datadog. These are exciting times for production time profiling nerds like me. Smile

The repository for the backport is available here:

http://hg.openjdk.java.net/jdk8u/jdk8u-jfr-incubator/

The proposed CSR is available here:

https://bugs.openjdk.java.net/browse/JDK-8230764

The backport is keeping the same interfaces and pretty much the same implementation as is available in OpenJDK 11, and is fully compatible. There were a few security fixes, due to there not being any module system to rely upon for isolation of the internals, also, some events will not be available (e.g. the Module related events) but other than that the API and tools work exactly the same.

JDK Mission Control will, of course, be updated to work flawlessly with the OpenJDK 8 version of JFR as well. The changes will be minute and are only necessary since Mission Control has some built-in assumptions that no longer hold true.

You can already build and try out OpenJDK 8 with JFR simply by building the JDK available in the repository mentioned above. Also, Aleksey Shipilev provide binaries – see here for details.

Have fun! Smile

Flight Recorder & Mission Control at Code One 2019

Code One is rapidly approaching (September 16-19). For fans of JDK Flight Recorder and JDK Mission Control, there will be a lot of relevant activities at Code One. This is an attempt to list them. If I missed something, please let me know!

Sessions

Here are the regular sessions:

Session Name

Presenters Day Time

Location

JDK Mission Control: Where We Are, Where We Are Going [DEV4284]

David Buck Monday 9:00 Moscone South
Room 301

Introduction to JDK Mission Control and JDK Flight Recorder [DEV2316]

Marcus Hirt
Klara Ward
Monday 16:00 Moscone South
Room 202
Improving Observability in Your Application with JFR and JMC [DEV3460] Marcus Hirt
Mario Torre
Tuesday 11:30 Moscone South
Room 201
Java Flight Recorder: Black Box of Java Applications[DEV3957] Poonam Parhar Wednesday 12:30

Moscone South
Room 203

Robotics on JDK 11? With Modules? Are You… [DEV2329] Marcus Hirt
Miro Wengner
Robert Savage
Wednesday 16:00

Moscone South
Room 313

Four Productive Ways to Use Open Source JFR and JMC Revisited [DEV3118] Sven Reimers
Martin Klähn
Thursday 11:15 Moscone South
Room 304
Enhanced Java Flight Recorder at Alibaba [DEV3667] Sanhong Li
Fangxi Yin
Guangyu Zhu
Thursday 12:15 Moscone South
Room 203

Performance Monitoring with Java Flight Recorder on OpenJDK [DEV2406]

Hirofumi Iwasaki
Hiroaki Nakada
Thursday 13:15 Moscone South
Room 201

Again, if I’ve missed one, please let me know!

Other Activities

  • There is going to be a hackergarten session around JMC and JFR, Wednesday at 14:30-16:00, inside of the Groundbreakers booth in the Exhibition Area.
  • On Friday a few JMC project members are planning to meet up for some coding between 10:00 and 12:00, and then have lunch together at 12:00. Ping me (Marcus) for an invite.
  • On Wednesday at 18:00 a few JMC project members are planning to go for dinner. Ping me (Marcus) for an invite.

Summary

  • Lots to do at Code One 2019 for fans of JFR and JMC.
  • Helpful links above. Winking smile

Deep Distributed Tracing with OpenTracing and the JDK Flight Recorder

Recently I had a talk at Code One about using OpenTracing together with the JDK Flight Recorder to do deep tracing. Since the session wasn’t recorded, I though I’d do a blog about it instead. Here we go…

Distributed tracing has been of interest for a very long time. Multiple companies have sprung up around the idea over the years, and most APM (Application Performance Management) solutions are built around the idea. Google released a paper around their large scale distributed systems tracing infrastructure in 2010 – Dapper, and there are now several open source alternatives for distributed tracing available inspired by the paper, such as Jaeger and Zipkin.

In Java land, pretty much all of the APMs are doing pretty much the same thing: they use BCI (byte code instrumentation) for getting the data, and then they present that data to the end-user in various ways, oftentimes using some kind of analysis to recognize common problems and suggesting solutions to the end users of the APM. The real differentiation is knowing what data to get, and what to do with the data once captured.

Since there was no standard, one problem was for vendors to inject helpful, vendor specific, information into the distributed traces. The vendor of a software component may have a quite good idea about what information would be helpful to solve problems. Some vendors support APM specific APIs for contributing the data, but more often than not the instrumentation is done using BCI by scores of developers working for the various APM companies. The same is true for maintainers of open source components – either skip the problem entirely and let the APM vendors come up with good instrumentations points (if your component is popular enough), or pick a popular APM and integrate with it. That is, until OpenTracing came along…

Introduction to OpenTracing

OpenTracing is an open source, vendor neutral, distributed tracing API. In other words, library developers can interact with one API to support multiple APM/Tracer vendors. Also, customers can add contextual information to distributed traces without worrying about vendor lock-in. Contributors to OpenTracing include LightStep, Jaeger, Skywalking and Datadog, and the specification is available on GitHub:

https://github.com/opentracing

The core API concepts in OpenTracing are (from the slides of my talk, DEV5435):


Trace


– A distributed operation, potentially spanning multiple processes


– Implicitly defined by the individual Spans in the trace (more soon)


– Can be thought of as a directed acyclic graph (DAG) of Spans


– The span in the root of the DAG is called the root Span


– The edges between the Spans are called References


Span


– Has an operation name


– Has a start timestamp


– Has a finish timestamp


– Has a SpanContext


• Has Baggage Items (key/value pairs which cross process boundaries)


• Implementation specific state used to identify the span across process boundaries)


– Zero or more key/value Span Tags


– Zero or more Span Logs (key/value + timestamp)


Reference


– Defines a direct casual relationship between two spans


– ChildOf


• Parent depends on the child in some way


• Note that it is legal for a finish timestamp of a child to be after that of any parent


– FollowFrom


• Parent does not depend on the result of the child in any way


• Note that it is legal for a FollowsFrom child to be started after the end of any ancestor

Also worth noting is that a Scope is a thread local activation of a span.

The Example

As an example, we’ll be using a simple application consisting of three microservices. It is part of the back-end of a fictional robot store. Robots can be ordered at the Orders service, and they will be produced in a Factory. There is also a Customers service keeping track of the customers. Finally there is a load generator that can be used to exercise the services.

image 

The code is available under https://github.com/thegreystone/problematic-microservices. (Yes, as the name indicates, the services come pre-packaged with built-in problems. :))

The services, as well as the load generator, have built-in tracing support, so for a full systems run with the load generator, you would get a trace (a DAG of spans), looking something along the lines of:

image

Or, in Jaeger, where you have time on the X axis:

tracing

In this case I have scrolled down a bit to focus on the factory. As can be seen, there is great variability in the time it takes to create a chassis and/or paint a robot. We have multiple production lanes, and we’d expect times across the factory lanes to be more even, not to mention much faster. So what gives?

Well, we can expand the operation to see if there was some additional information:

tracing_details

Now, sometimes the tags may include crucial pieces information that may help you solve the problem without needing any additional information. In this particular case, though, knowing that we were building a pink BB-8 isn’t really doing the trick.

What would be the next step? All too often the next step would be to look at the code around the instrumentation point, trying to figure out what was going on at the time simply from analyzing the code. Sometimes that may be quite hard. The problem may be in third party code not expected to behave badly. There may even be some other piece of code not directly in the code path causing the problems, perhaps an agent misbehaving and causing long lasting safe points in the JVM.

So, we’re screwed then? Nah. What if you had a magic tool that could record what was going on in the JVM and the application at the time of the incident? Something providing not only method profiling information, but a deeper view, including information about vm operations, memory allocation profiling, events for the usual application caused thread halts and much, much more. Something that could be always on, with very low overhead. And let’s say you ran with a tracer that added some contextual information, such as information that could be used to identify traces, spans and thread local span activations in the recorded data, and which allowed you to use your favourite tracer too? Then things would get interesting indeed…

Running with the JFR Tracer

For Code One I wrote a little delegating JFR tracer, which allows you to record contextual information into the flight recorder. It was meant as an example on how to do deep distributed tracing. Deep enough to solve entire classes of problems that are hard to solve without more detailed knowledge.

The tracer works with Oracle JDK 7+ and OpenJDK 11+ (it is a multi-release jar, a.k.a. mrjar), and the source is available on GitHub here:

https://github.com/thegreystone/jfr-tracer

The bundle is available from Maven Central, and here is the dependency you need to add:

<dependency>
<groupId>se.hirt.jmc</groupId>
<artifactId>jfr-tracer</artifactId>
<version>0.0.3</version>
</dependency>

Next you need to initiate your tracer and pass it to the constructor of the DelegatingJfrTracer, like so:

GlobalTracer.register(new DelegatingJfrTracer(yourFavTracer));

That’s it. When the tracer is running you will get contextual information recorded into the flight recorder.

Looking at the Recording

Dumping the flight recorder for the factory, and looking at the dump in the Threads view, might look something like this:

image

We can see that we have these long lasting monitor enter (Java Blocking) events, and looking at the stack traces directly by selecting individual events, or at the Lock Instances page, it is fairly obvious where the contention is:

image

We can, of course, create a custom OpenTracing view to make it easier to directly finding and homing in on long lasting traces (I’ll create a repo for a ready made one with some more flair at some point). Simply go the the Event Browser, and right click on the Open Tracing folder. Select “Create a new page using the selected event types”. You will now have a new page in the Outline. You can right click on the title on the page to rename it and switch icon.

Next select an arbitrary event, and right click on it. Select Group-By->Trace Id. In the new Group By table that appeared, select Visible Columns to enable (at least) the attribute showing the longest duration (the total duration of (wall clock) time the trace spent in the process that the recording came from). Next sort on the Longest Duration column.

In this case I’ve ran a few more (press enter in the single step load generator a few times, or let it just continuously add load):

image

You can, of course add additional tables with groupings that can be useful, for example, per thread. To quickly home in the entire user interface on a trace id of interest, just select a trace and choose “Store and Set as Focused Selection”:

SNAGHTML38aee12

Now you can go back to, for example, the Threads view, and click the Time Range: Set button in the upper right corner. Voila, you are in exactly the right place. You may also want to view concurrently occurring events in the same threads (see check boxes on top), and enable additional thread lanes:

SNAGHTML38f17fd

Summary

  • Distributed tracing is great, especially in today’s world of (very µ and plenty) µ-services.
  • For the Java platform, injecting trace/span-identifying information as contextual information into the JDK Flight Recorder is dynamite.
  • A simple example on how to do this automagically is available on my GitHub as a delegating Tracer, in an mrjar, supporting Oracle JDK 7+ and OpenJDK 11+:
        https://github.com/thegreystone/jfr-tracer
  • The slides for my Code One presentations can be found here:
        https://oracle.rainfocus.com/widget/oracle/oow18/catalogcodeone18?search=hirt
    (The relevant session for this blog is DEV5435.)
  • The JDK Flight Recorder r0xx0rz.
  • JDK Mission Control r0xx0rz.

Note that since the article was written, I have donated the tracer to OpenTracing.
See https://github.com/opentracing-contrib/java-jfr-tracer.

JMC Open Sourced!

This is going to be a short blog post, because it is past bedtime and I’m frankly pretty beat. Just wanted to say thank you to everyone who helped open source Java Mission Control in the relatively short period of time it was done in.

So a huge thank you to everyone on the Stockholm JMC team: Klara Ward, Ola Westin, Henrik Dafgård, Erik Greijus and Per Kroon. And to Dalibor Topic, who had the thankless job to go through all the source and built artifacts checking on, among other things, licenses and copyrights. Also thank you to Donald Smith for handling the internal license approval process, and Mark Reinhold for advice. Also, a big thank you to Guru, who is prepping to start building Oracle binary builds of the OpenJDK project. And to Iris who published the repos and set up the project on OpenJDK. And to everyone who I’ve forgotten to mention, since my brain is no longer really functioning at 2 a.m. You guys rock!

The repositories are available here:

      http://hg.openjdk.java.net/jmc

  • The jmc repo contains the source, which is buildable using Maven.
    Everything you need to know about building JMC is in the README.txt
    Please read it first, thoroughly, before asking questions.
  • The jmc-graphics repo contains artwork and graphic files, like splash screens and icons.

We also have a slack at https://jdkmissioncontrol.slack.com/.

If you have any questions, don’t hesitate to ask!

Kind regards,
Marcus

JMC & JFR in JDK 9 Labs

So JavaOne 2017 has now come and gone. This year it was a pretty great one! The Java keynote was IMHO the best one in years, with plenty of great announcements and interesting content. The sessions I managed to attend were also very high quality. Not to mention the dinners and the parties. Winking smile

One of the sessions I did this year was a hands-on-lab on Java Mission Control 6 and Java Flight Recorder in JDK 9.

IMG_6670

Since it was full, and since people who could not attend have asked for it, I promised to put the HoL on my blog. Well, here it is!

There are three files:

  1. The lab instructions.
  2. The full lab content for Windows x64 (easiest, but largest download, 856MiB).
  3. The project files only (all other platforms, 39MiB).

Hope this helps!

Please let me know if you find any errors in the labs, or if you get stuck!

/M