A Closer Look at JFR Streaming

By Marcus Hirt and JP Bempel

Since JDK 14 there is a new JFR kid on the block – JFR streaming. 🙂 This blog post will discuss some of the things that you can do with JFR streaming, as well as some of the things you may want to avoid.

An Introduction to JFR Streaming

In the most recent version of the JDK a new JFR-related feature was introduced – JFR streaming. It is a feature allowing a developer to subscribe to select JFR data and to decide what to do with that data in the host process. JFR events can also be consumed from a separate process by pointing to the file repo of a separate JVM process – the mechanism is the same.

JFR streaming works by allowing reads from the JFR file whilst it is being written; the emissions to disk happen more frequently (usually every second, or when the in-memory buffers are full) than during your normal flight recordings, where the data is emitted only when the in-memory buffers are full. It does not support streaming directly from the in-memory buffers, and events are not delivered synchronously as they occur in the JVM.

The new functionality mostly resides in jdk.jfr.consumer. This is how you would open an event stream and start consuming the CPU load at one second intervals, as well as the monitor class whenever a thread is blocked for more than 10 ms trying to enter a monitor:

try (var rs = new RecordingStream()) {
  rs.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(1));
  rs.enable("jdk.JavaMonitorEnter").withThreshold(Duration.ofMillis(10));
  rs.onEvent("jdk.CPULoad", event -> {
    System.out.println(event.getFloat("machineTotal"));
  });
  rs.onEvent("jdk.JavaMonitorEnter", event -> {
    System.out.println(event.getClass("monitorClass"));
  });
  rs.start();
}

The RecordingStream is what you would use to control what is gathered from within the Java process, effectively also controlling the recorder.

Here is another example using the default recording template, and printing out the information for garbage collection events, CPU load and the JVM information:

Configuration c = Configuration.getConfiguration("default");
try (var rs = new RecordingStream(c)) {
  rs.onEvent("jdk.GarbageCollection", System.out::println);
  rs.onEvent("jdk.CPULoad", System.out::println);
  rs.onEvent("jdk.JVMInformation", System.out::println);
  rs.start();
}

The EventStream class can be used together with the standard flight recorder mechanisms to gather information from ongoing recordings, even ones being made by a separate process, or from an already recorded file. Here is an example using the EventStream to get some other attributes of the CPU load and information from garbage collections from within the Java process (needs an ongoing recording):

try (var es = EventStream.openRepository()) {
   es.onEvent("jdk.CPULoad", event -> {
     System.out.println("CPU Load " + event.getEndTime());
     System.out.println(" Machine total: " + 100 * event.getFloat("machineTotal") + "%");
     System.out.println(" JVM User: " + 100 * event.getFloat("jvmUser") + "%");
     System.out.println(" JVM System: " + 100 * event.getFloat("jvmSystem") + "%");
     System.out.println();
   });
   es.onEvent("jdk.GarbageCollection", event -> {
     System.out.println("Garbage collection: " + event.getLong("gcId"));
     System.out.println(" Cause: " + event.getString("cause"));
     System.out.println(" Total pause: " + event.getDuration("sumOfPauses"));
     System.out.println(" Longest pause: " + event.getDuration("longestPause"));
     System.out.println();
   });
   es.start();
 }

This is the EventStream interface used to consume and filter an event stream:

public interface EventStream extends AutoCloseable {
  public static EventStream openRepository();
  public static EventStream openRepository(Path directory);
  public static EventStream openFile(Path file);

  void setStartTime(Instant startTime);
  void setEndTime(Instant endTime);
  void setOrdered(boolean ordered);
  void setReuse(boolean reuse);

  void onEvent(Consumer<RecordedEvent> handler);
  void onEvent(String eventName, Consumer<RecordedEvent> handler);
  void onClose(Runnable handler);
  void onError(Runnable handler);
  void remove(Object handler);
  void start();
  void startAsync();
  void awaitTermination();
  void awaitTermination(Duration duration);
  void close();
}

The open* methods allow you to open a specific file or a specific file repository (for example from a different process). The set* methods allow you to filter on time and to select if you want to enforce that the events are delivered in time order. You can also allow the reuse of the event object that gets delivered, to get the memory pressure down a bit.

The onEvent* methods allow you to register a consumer for handling the events, either all of the events or by event name (type). The start method kicks off the processing in the current thread; startAsync is a convenience method for kicking off the processing in a single separate thread.
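
For illustration, here is a self-contained variant of the first example above (our sketch, not from the original post), using startAsync to process the stream on a separate thread, and waiting at most a minute before shutting down:

import java.time.Duration;

import jdk.jfr.consumer.RecordingStream;

public class AsyncCpuLoad {
  public static void main(String[] args) throws InterruptedException {
    try (var rs = new RecordingStream()) {
      rs.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(1));
      rs.setOrdered(true); // deliver events in time order
      rs.setReuse(true);   // reuse the event object to lower allocation pressure
      rs.onEvent("jdk.CPULoad", e -> System.out.println(e.getFloat("machineTotal")));
      rs.startAsync();     // process the stream in a separate thread
      rs.awaitTermination(Duration.ofMinutes(1)); // give up after a minute
    }
  }
}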

Where to Use Streaming

There are several advantages to JFR event streaming. It is a great way to access JFR data for monitoring purposes. You get access to detailed information that was previously unavailable to you, even from different processes, should you want to.

Here are some examples:

  • Directly send monitoring data to your favourite monitoring service
    For example streaming select metrics over to Datadog. 😉 Not that you would need to – we already derive interesting performance metrics from the (full) flight recordings we capture. We even track complex metrics like top hottest methods, or top allocations sites, over time, using what we internally call high cardinality metrics.
  • Pre-aggregating data before sending it off
    For example, you could get the CPU load every second, and then every five minutes roll it up to an average, median, min, max and a standard deviation, not having to send every single entry. (See the sketch after this list.)
  • Act on profiling data in-process
    You could, for example, make decisions for controlling the normal flight recordings given some statistics you track, like enabling certain events when it looks like it could be interesting.
  • Expose JFR data through other management APIs
    For example, adding an MBean exposing select JFR data over JMX.
    That said, there might be an API for connecting directly to an MBeanServerConnection in the future[1]:

    MBeanServerConnection conn = connect(host, port);
    try (EventStream es = new RemoteRecordingStream(conn)) {
      es.onEvent("jdk.GarbageCollection", e -> ... );
      es.onEvent("jdk.ExceptionThrown". e -> ...);
      es.onEvent("jdk.JavaMonitorBlocked", e-> ...);
      es.start();
    }
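
Returning to the pre-aggregation bullet above, here is a minimal sketch. The class name and the roll-up policy are ours, not from the original post, and a median or standard deviation would need a richer accumulator than DoubleSummaryStatistics:

import java.time.Duration;
import java.util.DoubleSummaryStatistics;
import java.util.concurrent.atomic.AtomicReference;

import jdk.jfr.consumer.RecordingStream;

public class CpuLoadRollup {
  public static void main(String[] args) throws InterruptedException {
    var window = new AtomicReference<>(new DoubleSummaryStatistics());
    try (var rs = new RecordingStream()) {
      rs.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(1));
      rs.onEvent("jdk.CPULoad", e -> {
        DoubleSummaryStatistics stats = window.get();
        stats.accept(e.getFloat("machineTotal"));
        if (stats.getCount() >= 300) { // roughly five minutes of one-second samples
          System.out.printf("machineTotal min=%.2f max=%.2f avg=%.2f (n=%d)%n",
              stats.getMin(), stats.getMax(), stats.getAverage(), stats.getCount());
          window.set(new DoubleSummaryStatistics()); // start a new window
        }
      });
      rs.start(); // block and process in the current thread
    }
  }
}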
    

It also allows you to skip the metadata part of a normal flight recording. The metadata in JFR contains the information about what was recorded, so that you can parse and view data that you may not even know about beforehand. In the case of monitoring a few well known data points, this is redundant information to keep sending over and over again.

Erik Gahlin has a neat example for producing health reports using JFR streaming, displaying pre-aggregation of the top frames for execution samples and allocation, as well as doing stats for some common data points, like GC metrics and CPU load.

Where Not to Use Streaming

This is from JEP-349[2], the JEP where JFR streaming was introduced:

To consume the data today, a user must start a recording, stop it, dump the contents to disk and then parse the recording file. This works well for application profiling, where typically at least a minute of data is being recorded at a time, but not for monitoring purposes.

Let’s explore why the JEP differentiates between monitoring and profiling. Some events in JFR are simple data points in time. Some are more complex, containing plenty of constants. Stack traces, for example. JFR takes great care to record these complex data structures in a binary format that doesn’t take a lot of processing time to produce, and which is still compact.

Some of the JFR events occur quite frequently – for example, a typical one minute recording of data can contain hundreds of thousands of events. The file size for such a recording is still typically only a couple of MB. A mix of techniques is used to keep the size down, such as constant pools ensuring that information like method names is not repeated, LEB128 encoding of integers, etc.
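
To illustrate the integer encoding, here is our sketch of the idea behind LEB128-style unsigned encoding (an illustration only, not JFR's actual writer):

import java.io.ByteArrayOutputStream;

// Each byte carries 7 payload bits; the high bit marks that more bytes
// follow, so small values (like most constant pool indices) need one byte.
public class Leb128 {
  public static byte[] encodeUnsigned(long value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    do {
      int b = (int) (value & 0x7F); // low 7 bits
      value >>>= 7;
      if (value != 0) {
        b |= 0x80; // continuation bit: more bytes follow
      }
      out.write(b);
    } while (value != 0);
    return out.toByteArray(); // e.g. 5 -> [0x05], 300 -> [0xAC, 0x02]
  }
}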

For profiling you typically want quite a few of these events enabled. JFR was built to emit this data at a very low overhead, and the data is eminently useful to get detailed information about things like why your thread is halting. For example, the stack trace to a place where your code had to wait to enter a monitor, complete with the class of the monitor waited on, the exact duration of the wait, which thread was holding on to the monitor (making you not able to enter), the monitor address and more. Not only that, there may have been other events providing context about what was going on in that thread at the time of the monitor enter, shining further light on what was going on. Events that you may not even know about.

If using JFR streaming for profiling, you would either naively send constant information over and over again in an inefficient way (say, serializing all of it to JSON), or spend a lot of effort reproducing the JFR format (introducing your own constant pools etc.).

For example, the RecordedEvent class contains a method to get the RecordedStackTrace, which is a List of RecordedFrame. For each event you would walk through, the in-memory object model would be created.
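
A sketch of what traversing that object model looks like (our example, not from the original post):

import java.util.List;

import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordedFrame;
import jdk.jfr.consumer.RecordedStackTrace;

// Every event inspected this way materializes RecordedFrame objects
// for its stack trace.
public class StackTraces {
  static void printTopFrames(RecordedEvent event, int maxFrames) {
    RecordedStackTrace stackTrace = event.getStackTrace(); // null if not captured
    if (stackTrace == null) {
      return;
    }
    List<RecordedFrame> frames = stackTrace.getFrames();
    for (RecordedFrame frame : frames.subList(0, Math.min(maxFrames, frames.size()))) {
      System.out.println("  at " + frame.getMethod().getType().getName()
          + "." + frame.getMethod().getName());
    }
  }
}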

You can externalize some of that cost, i.e. how the process you are monitoring is affected, by using another process to read the data, as described above. That would, for example, lessen the allocation pressure in the process you’re monitoring, which is great if you have a very latency sensitive process. That said, you have now created another Java process and moved the costs over there (including the CPU overhead of dealing with the memory pressure, as well as the memory overhead of running another JVM), typically on the same host. If you can afford to dedicate the memory and pin the event stream reader process to its own processor (CPU affinity), this can be a good solution though. Note that the same can be done for normal flight recordings, i.e. you can stream the recorded data directly from the file repository from a separate process.
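
A sketch of such out-of-process consumption: point the stream at the file repository of the JVM being monitored. The repository path argument here is our assumption; on the target JVM the location can be set with -XX:FlightRecorderOptions=repository=<path>.

import java.nio.file.Path;

import jdk.jfr.consumer.EventStream;

public class ExternalMonitor {
  public static void main(String[] args) throws Exception {
    // args[0] is the file repository of the monitored JVM
    try (EventStream es = EventStream.openRepository(Path.of(args[0]))) {
      es.onEvent("jdk.CPULoad", e -> System.out.println(e.getFloat("machineTotal")));
      es.start();
    }
  }
}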

Benchmarks of Using JFR Streaming Wrongly

For laughs and giggles, here are some benchmarks using just standard JFR to get all the data in the profiling template, compared to getting the equivalent information and serializing it to a JSON-like format using JFR streaming. In other words, abusing JFR streaming as a streaming replacement for getting the full JFR dataset. This is of course not what you should be using JFR streaming for, but it exemplifies how wrong you can land if you use the technology in a way it was never intended. We’ll look at the latency of HTTP requests, the CPU time spent and the allocation pressure. We’ll also look at the size of the payload of information extracted. The benchmark is admittedly a bit extra mean as well, to explore edge cases.

Note that this is a simplified example – we’re not even bothering with extracting the full stacktrace information and re-encoding it for streaming, whilst the JFR recordings in the comparison already contain the full stack traces for all events, efficiently encoded in constant pools. We could trade (even more) CPU for trying to bring the streamed data back to a JFR style format with constant pools again before storing/sending it. For serialization we’re simply doing toString() on the event objects, which gives us the events in a JSON-like format with only the top five frames of the stack trace. Including the full stack traces would add quite a bit more overhead to the streaming example.

The benchmark is available in this GitHub repository. It is based on the standard PetClinic application with some modifications to make it relevant to measure overhead in general.

Note: We are using an early access version of OpenJDK 15 in the benchmarks, since we discovered a bug whilst building this benchmark. See JDK-8245120.

The first chart shows the impact on HTTP request latencies:

Note: Y-axis is in log scale to magnify the difference.

The next chart shows the CPU consumption. We measure in ticks from /proc/<pid>/stat at the end of the benchmark. This gives us a single number that is easy to compare between runs; it represents the CPU consumed while threads were scheduled on the CPU.

 

The following chart is the total amount of heap allocated during the benchmark. Numbers were extracted from GC logs.

The final chart represents the size of extracted information, as described above:

Note that the JFR file contains the full stacktraces, and that the jfr-streaming one is limited to the top 5 frames.

So what is the conclusion of these benchmarks? Not much, except for: “don’t use technology for things it was never intended for”. 😉

Summary

  • JFR streaming is a great new way to expose JFR data for monitoring purposes.
  • JFR streaming is available from JDK 14 and above.
  • JFR streaming is complementary to the already existing JFR APIs, which remain the go-to way to use JFR for low-overhead detailed information (i.e. profiling / execution tracing).
  • Knowing where and how to use JFR streaming is key to avoiding sad pandas.

[1]:https://www.reddit.com/r/java/comments/e97vos/jfr_event_streaming_with_jdk_14_in_outprocess/faiapm8/

[2]:https://openjdk.java.net/jeps/349

Compressing Flight Recordings

Flight recordings are nifty binary recordings of what is going on in the runtime and the application running on it. A flight recording contains a wide variety of information, such as various kinds of profiling information, thread stall information and a whole host of other information, all adhering to a common event model and with the ability to dynamically add new event types.

In the versions of JFR since JDK 9, some care was taken to reduce the memory footprint by LEB128 encoding integers, noting that many things, like constant pool indices, usually occupy relatively low numbers. The memory footprint was cut roughly in half, compared to previous versions of JFR.

Now, sometimes you may want to compress the JFR data even further. The question then is – how much can you save if you compress the recordings further, and what algorithms would be best suited for doing the compression? What if you want the compression activity to use as little CPU as possible?

My friend and colleague at Datadog, Jaroslav Bachorik, set out to answer that question for some typical recording shapes that we see at Datadog, using a set of compression algorithms from Apache Commons Compress (bzip2, LZMA, LZ4), the built in GZip, a dedicated LZ4 library, XZ, and Snappy.

Below is a table of his findings for “small” (~1.5 MiB) and “large” (~5 MiB) recordings from one of our services. The benchmark was run on a MacBook Pro 2019. Now, you’d have to test on your own recordings to truly know, but I suspect that these results will hold up pretty well with other kinds of loads as well.

Algorithm     Recording Size  Throughput  Compression Ratio  Utility
Gzip          small           24.299      3.28               79.647
Gzip          large           5.762       3.54               20.436
BZip2         small           6.616       3.51               23.198
BZip2         large           1.518       3.84               5.826
LZ4           small           133.115     2.40               319.394
LZ4           large           38.901      2.57               100.009
LZ4 (Apache)  small           0.055       2.74               0.152
LZ4 (Apache)  large           0.013       3.00               0.039
LZMA          small           1.828       4.31               7.882
LZMA          large           0.351       4.37               1.533
Snappy        small           134.598     2.27               305.494
Snappy        large           35.986      2.49               89.692
XZ            small           1.847       4.31               7.964
XZ            large           0.349       4.37               1.523

Throughput is recordings/s. Utility is throughput * compression ratio, and meant to capture the combination of compression strength and performance. Note that the numbers are not normalized – only compare numbers in the same size category.
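
As a sanity check of the utility column: for the Gzip/small row, 24.299 recordings/s × 3.28 ≈ 79.7, which matches the listed 79.647 up to rounding of the displayed ratio. And if you want to try the built-in GZip on your own recordings, here is a minimal sketch (the file names are placeholders):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

// Compress a finished flight recording with the JDK's built-in GZip,
// printing the size before and after.
public class CompressRecording {
  public static void main(String[] args) throws IOException {
    Path in = Path.of(args[0]); // e.g. recording.jfr
    Path out = Path.of(args[0] + ".gz");
    try (var gz = new GZIPOutputStream(Files.newOutputStream(out))) {
      Files.copy(in, gz); // stream the recording through the compressor
    }
    System.out.printf("%d -> %d bytes%n", Files.size(in), Files.size(out));
  }
}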

Summary / TL;DR

  • The built-in GZip is doing a fairly good/balanced job of compressing flight recordings
  • You can get the best utility out of LZ4, closely followed by Snappy, but you sacrifice some compression
  • If you’re prepared to pay for it, LZMA and XZ give a good compression ratio
  • All credz to Jaroslav for his JMH-benchmark and all the data

JMC 6 Automated Analysis Headless

This post will be a little bit wider than the title implies; it will be about using the JMC core JFR APIs in general. That said, one of the things we will be using it for, towards the end, will be to run the JMC 6 automated analysis headless.

This article will also cheat by talking about JMC 6.1, so that anyone reading this article will have relevant information for JMC 6.1 as well. 😉

The JAR-files Needed

To run these examples you will need to have the following Jar-files from the JDK_HOME/lib/missioncontrol/plugins folder on your classpath:

Prefix                                        Version needed  Comment
com.oracle.jmc.common                         >= 6.0.0        Classes common to JMC, such as stacktrace definitions, the content type system (quantities/units of measurement) etc.
com.oracle.jmc.flightrecorder                 >= 6.0.0        The Java Flight Recorder parser, classes for extracting information from a recording.
com.oracle.jmc.flightrecorder.rules           >= 6.0.0        The core definitions and classes for automated analysis of recordings.
com.oracle.jmc.flightrecorder.rules.jdk       >= 6.0.0        Contains the rules for the JDK (such as rules for synchronization trouble, long lasting VM operations and much more).
com.oracle.jmc.flightrecorder.ext.wls.parser  >= 6.0.0        (Optional, only relevant for users wanting WebLogic Server extensions)
com.oracle.jmc.flightrecorder.ext.wls.rules   >= 6.0.0        (Optional, only relevant for users wanting WebLogic Server extensions)

Loading a Recording

Loading a recording is done through a call to JfrLoaderToolkit#loadEvents(). It takes a file argument and returns an IItemCollection. The IItemCollection can be thought of as a collection of events.

The IItemCollection supports operations like filter() and getAggregate(). Used correctly, these mean you should rarely need to rely on external iteration.

Here is an example which loads a recording, and calculates the standard deviation for the java monitor enter events in a recording:

import java.io.File;

import com.oracle.example.jmc6jfr.rules.util.DemoToolkit;
import com.oracle.jmc.common.IDisplayable;
import com.oracle.jmc.common.item.Aggregators;
import com.oracle.jmc.common.item.IItemCollection;
import com.oracle.jmc.common.item.ItemFilters;
import com.oracle.jmc.common.unit.IQuantity;
import com.oracle.jmc.flightrecorder.JfrAttributes;
import com.oracle.jmc.flightrecorder.JfrLoaderToolkit;
import com.oracle.jmc.flightrecorder.jdk.JdkTypeIDs;

/**
 * Finds out the standard deviation for the java monitor enter events.
 */
public class LoadRecording {
	public static void main(String[] args) throws Exception {
		DemoToolkit.verifyFirstFileArgument(LoadRecording.class, args);
		
		IItemCollection events = JfrLoaderToolkit.loadEvents(new File(args[0]));
		IQuantity aggregate = events.apply(ItemFilters.type(JdkTypeIDs.MONITOR_ENTER))
				.getAggregate(Aggregators.stddev(JfrAttributes.DURATION));
		
		System.out.println("The standard deviation for the Java monitor enter events was "
				+ aggregate.displayUsing(IDisplayable.AUTO));
	}
}

Note the stream-like syntax. The JMC libraries work well together with streams, but are compiled on JDK 7, and can run on a JDK 7 compliant runtime. Also note that some class names start with Jfr whilst other class names start with Jdk. The difference is that some concepts, like DURATION, are intrinsic to JFR, whilst others are defined in terms of the Java JDK classes or runtime, like MONITOR_ENTER.

The JMC core libraries provide common statistical aggregators, and accessors for common attributes. Should you feel something is missing, it is easy to add to the built-in operations.

Also note the IQuantity returned by the aggregator. The built-in system for handling quantities and units of measurement makes it easy to calculate with, and display, quantities in different units of measurement.
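
For instance, a hedged sketch against the JMC 6 core API (assuming UnitLookup in com.oracle.jmc.common.unit provides the built-in units, as in later JMC versions):

import com.oracle.jmc.common.IDisplayable;
import com.oracle.jmc.common.unit.IQuantity;
import com.oracle.jmc.common.unit.UnitLookup;

public class Quantities {
	// Read a duration quantity out in a unit of our choosing, or let the
	// quantity pick a suitable unit for display itself.
	static void show(IQuantity duration) throws Exception {
		double millis = duration.doubleValueIn(UnitLookup.MILLISECOND);
		System.out.println(millis + " ms, displayed as " + duration.displayUsing(IDisplayable.AUTO));
	}
}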

Performing an Automated Analysis Headless

It is also easy to run the JMC rules headless using these core libraries. Simply evaluate the rules against the IItemCollection. Here is a simple example iterating through the rules and evaluating them one by one:

import java.io.File;
import java.util.concurrent.RunnableFuture;

import com.oracle.example.jmc6jfr.rules.util.DemoToolkit;
import com.oracle.jmc.common.item.IItemCollection;
import com.oracle.jmc.common.util.IPreferenceValueProvider;
import com.oracle.jmc.flightrecorder.JfrLoaderToolkit;
import com.oracle.jmc.flightrecorder.rules.IRule;
import com.oracle.jmc.flightrecorder.rules.Result;
import com.oracle.jmc.flightrecorder.rules.RuleRegistry;

public class RunRulesOnFileSimple {
	public static void main(String[] args) throws Exception {
		File recording = DemoToolkit.verifyRecordingArgument(RunRulesOnFileSimple.class, args);
		IItemCollection events = JfrLoaderToolkit.loadEvents(recording);
		
		for (IRule rule : RuleRegistry.getRules()) {
			RunnableFuture<Result> future = rule.evaluate(events, IPreferenceValueProvider.DEFAULT_VALUES);
			future.run();
			Result result = future.get();
			if (result.getScore() > 50) {
				System.out.println(String.format("[Score: %3.0f] Rule ID: %s, Rule name: %s, Short description: %s",
						result.getScore(), result.getRule().getId(), result.getRule().getName(),
						result.getShortDescription()));
			}
		}
	}
}

That said, if you are not constrained to have to run on JDK 7, you can always run the rules in parallel, for example by employing parallel streams:

import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.RunnableFuture;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import com.oracle.jmc.common.item.IItemCollection;
import com.oracle.jmc.common.util.IPreferenceValueProvider;
import com.oracle.jmc.flightrecorder.JfrLoaderToolkit;
import com.oracle.jmc.flightrecorder.rules.IRule;
import com.oracle.jmc.flightrecorder.rules.Result;
import com.oracle.jmc.flightrecorder.rules.RuleRegistry;

/**
 * Runs the rules on the events in the specified file in parallel, then prints
 * them in order of descending score.
 */
public class RunRulesOnFile {
	private final static Executor EXECUTOR = Executors
			.newFixedThreadPool(Math.max(1, Runtime.getRuntime().availableProcessors() - 1));
	private static int limit;

	public static void main(String[] args) throws Exception {
		if (args.length == 0) {
			System.out.println(
					"Usage: RunRulesOnFile <recording file> [<limit>]\n\tThe recording file must be a flight recording from JDK 7 or above. The limit, if set, will only report rules triggered with a score higher or equal than the limit.");
			System.exit(2);
		}
		IItemCollection events = JfrLoaderToolkit.loadEvents(new File(args[0]));
		if (args.length > 1) {
			limit = Integer.parseInt(args[1]);
		}
		Stream<RunnableFuture<Result>> resultFutures = RuleRegistry.getRules().stream()
				.map((IRule r) -> evaluate(r, events));
		List<Result> results = resultFutures.parallel().map((RunnableFuture<Result> runnable) -> get(runnable))
				.collect(Collectors.toList());
		results.sort((Result r1, Result r2) -> Double.compare(r2.getScore(), r1.getScore()));
		results.stream().forEach(RunRulesOnFile::printResult);
	}

	public static RunnableFuture<Result> evaluate(IRule rule, IItemCollection events) {
		RunnableFuture<Result> evaluation = rule.evaluate(events, IPreferenceValueProvider.DEFAULT_VALUES);
		EXECUTOR.execute(evaluation);
		return evaluation;
	}

	public static Result get(RunnableFuture<Result> resultFuture) {
		try {
			return resultFuture.get();
		} catch (InterruptedException | ExecutionException e) {
			e.printStackTrace();
		}
		return null;
	}

	private static void printResult(Result result) {
		if (result.getScore() >= limit) {
			System.out.printf("(%.0f) [%s]: %s\nDetails:\n%s\n============<End of Result>============\n",
					result.getScore(), result.getRule().getId(), result.getShortDescription(),
					result.getLongDescription() == null ? "<no description>" : result.getLongDescription());
		}
	}
}

If you do not need that kind of control, there is a class available for performing automated analysis included with the JMC core library. To run an automatic analysis, simply run the class:

com.oracle.jmc.flightrecorder.rules.report.JfrRulesReport <files> -format <format> -min <severity>

…where <files> is one or more recordings, <format> is the output format, and <min> is the minimum severity to report [ok | info | warning]. In JMC 6.1.0 there is an additional class for generating a report, which produces HTML in a format and style very similar to the JMC 6.0.0 Automated Analysis page:

com.oracle.jmc.flightrecorder.rules.report.html.JfrHtmlRulesReport <file> [<outputfile>]

…where <file> is the (single) recording to analyze, and where the optional <outputfile> is where the resulting HTML should be written. If no output file is specified, the result of the analysis will be written to stdout.
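
For example, an illustrative invocation (the classpath wildcard is our shorthand; point it at the jar files listed earlier):

java -cp "JDK_HOME/lib/missioncontrol/plugins/*" com.oracle.jmc.flightrecorder.rules.report.html.JfrHtmlRulesReport my_recording.jfr report.html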

Adding Custom Rules

It’s easy to add your own heuristics. Do you find yourself with thousands of recordings, and would like to add a Bayesian network to do some machine learning? The rules are all pure Java, so you can pretty much do whatever you want in a rule.

The easiest way to get started writing your own custom rules is to get yourself an Eclipse Oxygen or later, and install JMC and the experimental PDE plug-in into your Eclipse.

  1. Install Eclipse Oxygen or later.
  2. Go to the Mission Control homepage, and find the Eclipse Update site.
  3. Follow the instructions to install the Mission Control plug-ins.
  4. Go to the Mission Control homepage, and find the Experimental Update site.
  5. Install the PDE plug-in.

Here is a cheat sheet for the eclipse update sites for the JMC 6.0.0 release:

http://download.oracle.com/technology/products/missioncontrol/updatesites/base/6.0.0/eclipse/

http://download.oracle.com/technology/products/missioncontrol/updatesites/supported/6.0.0/eclipse/

http://download.oracle.com/technology/products/missioncontrol/updatesites/experimental/6.0.0/eclipse/

(JMC 6.1.0 will be released according to the same URL pattern.)

To start building your rule, press ctrl-n (or click the File | New | Other… menu) to bring up the New wizard.


Select Plug-in Project and hit Next. Name your rule project something exciting.


Uncheck the option saying that this plug-in will make contributions to the UI, and hit Next.


Next select the Simple JFR Rule Wizard and click Finish (or Next, if you really wish to do some further customizations).


You will now have a new project in your workspace, containing an example rule. If you have compilation errors, you need to set JMC to be your target platform (see the next section). You can either just export your rule, put it on the class path with the other core libraries, or you can try out your rule by running JMC from within Eclipse with your new rule.

Starting JMC from within Eclipse with Workspace Plug-ins

Running JMC from within Eclipse with any plug-ins you are currently developing is a simple matter of setting the plug-in development target platform to your JMC installation, and launching that platform with your workspace plug-ins.

First set JMC to be your target platform.

  1. Go to Preferences in the main menu (Window | Preferences on Windows).
  2. Find Target Platform by typing Tar in the filter box.
  3. Click Add… and type JMC 6 as name.
  4. Press Add… and select Installation. Hit Next.
  5. Browse to the JMC 6 installation directory (JDK_9_HOME/lib/missioncontrol), and hit Ok.
  6. Hit Finish, and Finish again. You should now see your new platform.
  7. Select the new JMC 6 platform, and hit Apply and Close to activate it.

Everything should now compile cleanly. Next step is to run JMC with your new rule.

  1. Context click on your project and select Run as Eclipse Application.
  2. This should normally be it. If you for some reason run into trouble, go to Run Configurations, and make sure that your launcher is using the com.oracle.jmc.rcp.application.product.

You should now be able to see your rule in action. If you have not changed the rule code, try setting the environment variable in the launcher to various values between 0 and 100 and see what happens when you run the rule in JMC:


Exporting a Plug-in

You can export your rule by context-clicking on the project and selecting Export. In the Export wizard, select Deployable plug-ins and Fragments.


Click Next, select a folder to export the plug-in to and Finish. The resulting jar can either be put on the class path to be included in the headless analysis, or put in the dropins folder of any Java Mission Control installation where you would want the rule to be available.

Summary

This blog described how to:

  • Get started using the JMC core libraries to read Java Flight Recordings (JDK 7, 8 and 9)
  • Get started doing headless analysis of Java Flight Recordings (JDK 7, 8 and 9)
  • Get started creating custom rules for analyzing Java Flight Recordings

Hope it helps someone!

Java Flight Recorder in JDK 9

As you probably already know, JFR is on the road to being open sourced. The plans are being drawn up, and if everything goes well, this could happen as early as in JDK 10 (JDK 18.3). No promises, and please do not give anyone a hard time if this doesn’t work out. Know that we tried. Anyways, what we did recently do was to release JDK 9, and there is a LOT of exciting stuff happening for JFR already in JDK 9. 🙂

Let’s dive into it…

Supported APIs

Yep. The JFR APIs in the JDK are supported. This means that unlike all my other posts on JFR related APIs, this post does not contain any disclaimer for using these APIs. They are supported! By Oracle! Woho!

On the sad side, this also means that the APIs have changed packages and signatures. On the plus side, they have been immensely enhanced for readability, performance and ease of use.

I will group the API as follows:

  • API for Custom Events
  • API for Controlling the Flight Recorder
  • API for Consuming Recordings

Now, there is an advanced API for consuming recordings included with JMC. That API sports internal iteration, aggregation in various forms, JDK 7, 8 and 9 JFR file compatibility, JDK 7 and later execution compatibility, POJO (no weird dependencies), rules evaluation, and more. This API will also be open sourced, and I will discuss it in a later post.

Creating Custom Events

The API for creating and recording your own events is very simple. All you need to do is to subclass jdk.jfr.Event and annotate the fields you wish to record with jdk.jfr.Label.

import jdk.jfr.Event;
import jdk.jfr.Label;

public class Hello {
	@Label("Hello World")
	static class HelloWorldEvent extends Event {
		@Label("Message")
		String message;
	}
	public static void main(String... args) {
		HelloWorldEvent event = new HelloWorldEvent();
		event.message = "Hello World event message!";
		event.commit();
	}
}

It is good form to provide a description for your event class and for non-trivial attributes. Here is an example of an event from a JUnit plug-in for JUnit 5 (that I have not released for JDK 9 yet):

import jdk.jfr.Category;
import jdk.jfr.Description;
import jdk.jfr.Event;
import jdk.jfr.Label;

/**
 * Event for a test that ran and completed normally (without exception).
 * 
 * @author Marcus Hirt
 */
@Label("Test Run")
@Description("JUnit test executed and completed normally without exception")
@Category("JUnit")
public class TestEvent extends Event {
    @Label("Display Name")
    @Description("The JUnit display name for the test")
    private String displayName;

    public void setDisplayName(String displayName) {
        this.displayName = displayName;
    }

    public String getDisplayName() {
        return displayName;
    }
}

Other important annotations are, for example, the ones providing information about the content type. These annotations are put on fields to explain to tools how to interpret the data. For example, Java Mission Control would be able to render timestamps and durations properly, if the appropriate @Timestamp and @Timespan annotations were used. I have provided a quick overview of some of the most commonly used ones below:

Annotation Comment
@Address The value is a memory address
@BooleanFlag The value is a boolean flag. Strictly speaking perhaps not the most useful one.
@Frequency The value is a frequency, in Hz
@MemoryAddress The value is a memory address.
@MemoryAmount The value is an amount of memory, in bytes.
@NetworkAddress The value is a network address. Value is a long below 2^32 or an unsigned integer and clients will likely display it in the usual octet format (e.g. 127.0.0.1).
@Percentage The value is a fraction, typically between 0 and 1, and is best represented as a percentage
@Timespan The value is a duration. Defaults to nanoseconds, but can be changed by setting value to one of the static fields defined in Timespan: MICROSECONDS, MILLISECONDS, NANOSECONDS, SECONDS, TICKS. It is always nice to explicitly set value (even for NANOSECONDS), so that it can be readily seen what was intended.
@Timestamp The value is a timestamp. Defaults to MILLISECONDS_SINCE_EPOCH, but can be changed by setting value to one of the static fields defined in Timestamp: MILLISECONDS_SINCE_EPOCH, TICKS. It is always nice to explicitly set value (even for MILLISECONDS_SINCE_EPOCH), so that it can be readily seen what was intended.
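
As an illustration, here is what the content-type annotations look like in use (CacheRefreshEvent is our made-up example, not from the original post):

import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Timespan;
import jdk.jfr.Timestamp;

@Label("Cache Refresh")
class CacheRefreshEvent extends Event {
	@Label("Refresh Time")
	@Timespan(Timespan.MILLISECONDS) // explicit, so the intent is readily seen
	long refreshTime;

	@Label("Scheduled At")
	@Timestamp(Timestamp.MILLISECONDS_SINCE_EPOCH)
	long scheduledAt;
}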

There are other annotations worth knowing about such as these:

Annotation Comment
@Enabled Determines if the event should be enabled by default. Remember the bad old days when you added an event, and then did not see it recorded, since the template did not include it? No more. Events are enabled by default, so be careful to explicitly set this to false if the event is expensive.
@Experimental The element is experimental, and may change in future versions without prior notice.
@Label The human readable label for the element. Use headline-style capitalization, capitalize first and last words, and all nouns, pronouns, verbs and adverbs. Do not include punctuation.
@Period The default period for a periodic event.
@Registered Events are automatically registered by default when they are first used. This annotation can override that behaviour. If you want to make sure that an event which is not likely to be encountered is registered, perhaps to make the metadata available to tools, you can use FlightRecorder.register(class) to register the event programmatically.
@Threshold Defines the default threshold for the event. When an event has a duration lower than the threshold, it will not be recorded. The user can override this threshold in the template settings.
@Unsigned The field value should be treated as an unsigned value.


This is by no means an exhaustive list of all the annotations, and as soon as I find out where the javadocs will end up, I will add a link to them.

Note that the programming model is simpler, compared to previous releases. There is no Producer with which you need to register the events. The life cycle of an event simply follows the life cycle of the event class. Also, new event types are enabled by default. As a matter of fact, all the defaults for an event type can be configured using annotations. Also, compared to previous releases, you do not need to explicitly reuse event objects and keep event tokens around; care has been taken to ensure that events on the hot path should be easy to scalarize.
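
For example, a sketch of configuring those defaults with annotations (ExpensiveOperationEvent is a made-up example):

import jdk.jfr.Enabled;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Threshold;

@Label("Expensive Operation")
@Enabled(false)     // off by default, since the event is costly to produce
@Threshold("10 ms") // when enabled, skip occurrences shorter than 10 ms
class ExpensiveOperationEvent extends Event {
}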

API for Controlling the FlightRecorder

In JDK 9 there are actually two related APIs for controlling the FlightRecorder. Just like before, there is a JMX API. This API provides similar capabilities to the JDK 8 JMX API, but it has been cleaned up and moved from the com.oracle.jrockit domain to the jdk.management.jfr domain.


I will not spend much time on the JMX API. A good example on how to use the different versions of it will be provided in the JMC source, once it is open sourced.

Another addition is an easy to use local API for controlling the Flight Recorder. Here is an example which starts a recording, records the calculation of the first 50 Fibonacci numbers and writes the recording to disk:

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.oracle.example.jdk9jfr.fib.Fibonacci;
import com.oracle.example.jdk9jfr.fib.FibonacciEvent;

import jdk.jfr.Recording;

/**
 * Example program both controlling the recorder and parsing the resulting data.
 * 
 * The program will calculate the 50 first Fibonacci numbers, then print them
 * out.
 */
public class RecordOnly {
	public static void main(String[] args) throws IOException {
		if (args.length == 0) {
			System.out.println("Need to specify a location for the recording!");
			System.exit(2);
		}
		
		Path path = Paths.get(args[0]);

		try (Recording recording = new Recording()) {
			recording.setName("Fibonacci Recording");
			recording.start();
			recording.enable(FibonacciEvent.class);
			for (int n = 0; n < 50; n++) {
				System.out.println("Calculating fib " + n);
				FibonacciEvent event = new FibonacciEvent();
				event.number = n;
				event.begin();
				event.value = Fibonacci.fibonacciIterative(n);
				event.commit();
			}
			recording.stop();
			recording.dump(path);
		}
	}
}

Consuming Recordings

There is a very easy to use API for consuming the recordings included in the JDK as well. It is only supported for reading recordings produced by the same JDK version as the parser, and the parser is only supported when running on the JDK with which it is provided. So, in short: JFR recordings produced by JDK 9, and the parser running on JDK 9.

As I mentioned in the beginning, there is another parser included in JMC, which supports reading JDK 7, 8 and 9 recordings, and which supports internal iteration, aggregation in various forms, JDK 7 and later execution compatibility, and rules evaluation. It is still POJO, so it can be included into anything. Which parser to use depends on what you are trying to accomplish.

Here is an addition to the previous example which also reads the produced recording using the parser built into the JDK, and which then writes out information from the captured events on stdout:

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.oracle.example.jdk9jfr.fib.Fibonacci;
import com.oracle.example.jdk9jfr.fib.FibonacciEvent;

import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

/**
 * Example program both controlling the recorder and parsing the resulting data.
 * 
 * The program will calculate the 50 first Fibonacci numbers, then print them
 * out.
 */
public class RecordAndConsume {
	public static void main(String[] args) throws IOException {
		if (args.length == 0) {
			System.out.println("Need to specify a location for the recording!");
			System.exit(2);
		}
		
		Path path = Paths.get(args[0]);

		try (Recording recording = new Recording()) {
			recording.setName("Fibonacci Recording");
			recording.start();
			recording.enable(FibonacciEvent.class);
			for (int n = 0; n < 50; n++) {
				System.out.println("Calculating fib " + n);
				FibonacciEvent event = new FibonacciEvent();
				event.number = n;
				event.begin();
				event.value = Fibonacci.fibonacciIterative(n);
				event.commit();
			}
			recording.stop();
			recording.dump(path);
			for (RecordedEvent event : RecordingFile.readAllEvents(path)) {
				int number = event.getValue("number");
				long value = event.getValue("value");
				System.out.printf("fibonacci(%d) = %d (time: %dns)\n", number, value, event.getDuration().toNanos());
			}
		}
	}
}
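
If you would rather not materialize every event up front with readAllEvents(path), RecordingFile can also be read one event at a time. A small sketch:

import java.io.IOException;
import java.nio.file.Paths;

import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class IterateRecording {
	public static void main(String[] args) throws IOException {
		// Read one event at a time instead of loading the whole list.
		try (RecordingFile recordingFile = new RecordingFile(Paths.get(args[0]))) {
			while (recordingFile.hasMoreEvents()) {
				RecordedEvent event = recordingFile.readEvent();
				System.out.println(event.getEventType().getName());
			}
		}
	}
}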

Other Improvements to JFR in JDK 9

Here is a list of some other improvements to JFR in JDK 9:

  • Modularized APIs (if you do not need the JMX API, don’t include it).
  • Extensible APIs (can provide custom settings for event types, can provide custom metadata using annotations)
  • Performance improvements (more on this later)
  • Can emit data even in bad situations (Out of Memory for example)
  • New events
    • More detailed safepoint information
    • More detailed code cache information
    • New PLAB (promotion) events
    • New compiler events for detailed inlining information
    • New G1 specific information for better visualization of region states
    • Module events (loaded modules, full transitive closure)
    • NativeLibrary (load, a periodic event, by default emitted each chunk)
    • …and more. Just make a test recording and include them all. 😉

Performance Enhancements

Here is a graphic (provided by my excellent friend Erik Gahlin at the Oracle JPG Serviceability Team), which shows some of the areas where JFR has improved in JDK 9:


Let’s take them in order:

  1. The time spent in safepoints due to chunk rotation to disk has been reduced from a bit over 7ms to almost nothing.
  2. If you have code with a disabled Java event, the overhead of having the event related code in your method will be zero (if that code is on the hot path). After optimization, any trace of the event related code will be optimized away.
  3. The overhead of writing an event (without capturing stack traces) has gone from 8ns to 6ns.
  4. The file size (and buffer memory usage) has been roughly cut in half, due to the introduction of compressed integers.
  5. The time required to capture a stack trace has been almost cut in half.
  6. Even the startup has been slightly improved.

So, not only have the APIs become easier to use and more comprehensive; JFR has also become significantly faster. That is a bit of a feat, considering that it wasn’t very slow to begin with.

Summary

The new Java release, JDK 9, provides plenty of improvements to JFR:

  1. Supported API
  2. Simplified programming model
  3. Higher performance

This blog provided a quick introduction to JFR in JDK 9, with some simple examples.

Here is a link to the full source for the examples: source_jdk9_jfr_examples.zip