This guide is here to help you get integrated with Opsian and start improving your application's performance as quickly as possible.
We recommend following the step-by-step installation wizard inside the Opsian app. The wizard links to the agent download, pre-populates command-line parameters for your JVM and verifies that your agents are connecting back correctly.
Once you have an Opsian agent connected you can use the Opsian UI to view several reports about your application. This section breaks them down and explains when each is appropriate.
This report shows you the methods in your application that are taking up the most time. It does this by summarising profiling data based on the method being executed at the moment Opsian samples your application, then sorting methods by the time they consume. For a given Hotspot, the report can also break performance down by line and show the common callers of that method.
Flame Graphs give you a high-level overview of your application's performance by visualising aggregated profiling data in a way that makes hot code paths easy to identify. In Opsian Flame Graphs, methods in boxes higher up call methods in the boxes below them. The width of a box represents the proportion of samples in which the method was present. The difference in width between a parent and its children indicates the time spent executing the method itself. The number of samples in a report where a method is present anywhere in the call stack is called "total samples". The number of samples where a method is at the bottom of the call stack, i.e. it is currently executing, is referred to as "self samples". A method with high total samples but very low self samples usually does little work itself and mostly calls other methods.
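The total/self distinction can be made concrete with a small sketch. This is not Opsian's implementation, just an illustration under the assumption that each sample is a call stack ordered from outermost caller to the currently executing method:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch: deriving "total" and "self" sample counts from
 * sampled call stacks. Each stack is ordered from the outermost caller
 * to the currently executing method (the last element).
 */
public class SampleCounts {

    /** Samples in which a method appears anywhere in the stack. */
    public static Map<String, Integer> totalSamples(List<List<String>> stacks) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (List<String> stack : stacks) {
            // Count each method at most once per sample, even if recursive.
            for (String method : new LinkedHashSet<>(stack)) {
                totals.merge(method, 1, Integer::sum);
            }
        }
        return totals;
    }

    /** Samples in which a method was the one actually executing. */
    public static Map<String, Integer> selfSamples(List<List<String>> stacks) {
        Map<String, Integer> selfs = new LinkedHashMap<>();
        for (List<String> stack : stacks) {
            selfs.merge(stack.get(stack.size() - 1), 1, Integer::sum);
        }
        return selfs;
    }
}
```

With three samples whose stacks are `run → handle → parse`, `run → handle → render` and `run → handle`, the method `handle` has a total count of 3 but a self count of 1: it is present everywhere yet rarely the method executing, the signature of a delegating method.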
Tree views give you a top-down view of your profiling information and let you expand down through the call hierarchy. In Java, you would see the top-most method for each thread, java.lang.Thread.run, and then be able to expand it to see the methods it calls and the number of samples in which they appeared.
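The shape of such a tree can be sketched as follows. This is an illustration of the data structure, not Opsian's code; it assumes stacks ordered from outermost caller inward:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative node in a top-down call tree: method, sample count, children. */
public class CallTreeNode {
    public final String method;
    public int samples;
    public final Map<String, CallTreeNode> children = new LinkedHashMap<>();

    CallTreeNode(String method) { this.method = method; }

    /** Builds a tree under a synthetic root from outermost-first stacks. */
    public static CallTreeNode build(List<List<String>> stacks) {
        CallTreeNode root = new CallTreeNode("<root>");
        for (List<String> stack : stacks) {
            CallTreeNode node = root;
            node.samples++;
            for (String method : stack) {
                node = node.children.computeIfAbsent(method, CallTreeNode::new);
                node.samples++;
            }
        }
        return root;
    }
}
```

Expanding a node in the UI corresponds to descending into its children map, with each child annotated by the number of samples it appeared in.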
This report gives a Hotspots-based comparison between two selected versions of your application. It can highlight significant changes in performance and identify, at line level, the cause of regressions. To use the version comparison report you need to add the agentVersion parameter, described below.
Date and Time
Timezone, Start and End let you specify the date and time period over which the report will run. The timezone specified should be your local timezone or the timezone used by other data you wish to correlate with (e.g. an incident at 3am in Europe/London).
This is the sampling timing type. There are two options: CPU and WallClock. With CPU timing, Opsian reports how much time was spent executing on the CPU. WallClock timing reports how much real time has elapsed, including both time on the CPU and time spent blocked or waiting.
CPU timing will help you find performance hotspots that consume significant amounts of CPU, whereas WallClock timing will find issues where your application is blocked from progressing (e.g. waiting on external IO or contending for locks between threads).
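You can observe the difference between the two clocks directly from a standard JVM. This sketch (unrelated to the Opsian agent itself) uses the platform's ThreadMXBean to compare CPU time with wall-clock time for a thread that mostly sleeps:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Contrasts CPU time with wall-clock time for a thread that mostly sleeps. */
public class TimingDemo {

    /** Returns {cpuNanos, wallNanos} for a task that sleeps ~200 ms. */
    public static long[] measure() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long cpuStart = bean.getCurrentThreadCpuTime();
        long wallStart = System.nanoTime();
        try {
            // Blocked in sleep: wall-clock time passes, CPU time barely moves.
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long cpu = bean.getCurrentThreadCpuTime() - cpuStart;
        long wall = System.nanoTime() - wallStart;
        return new long[] { cpu, wall };
    }
}
```

A CPU-timing profiler would attribute almost nothing to this code, while a WallClock profiler would show the full 200 ms, which is exactly why WallClock reports surface blocking and waiting that CPU reports hide.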
Hostnames and Agent IDs
Hostname refers to the host name available to the Opsian agent in the environment in which it's running. To enable you to differentiate between multiple JVMs running in the same environment (and thus sharing the same hostname), you can also pass an optional Agent ID parameter to the Opsian agent. These two reporting options enable you to restrict the report to only certain groups of Hostnames or Agent IDs.
This is useful if you are trying to investigate a performance issue present on only a subset of hosts running your Opsian agent.
This option enables you to restrict the report to only specific JVM versions, which is useful when introducing a JVM update to production.
Application version is an argument provided to the Opsian JVM agent that we associate with all profiling samples coming from that agent. This option enables you to restrict reports to a specific version or release of your application. This is especially useful for investigating performance issues in environments that may have multiple versions of the application running concurrently (e.g. canary deploys).
Application versions in this option are ordered by most recently seen.
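Pulling the options above together, agent parameters are supplied on the JVM command line. The sketch below is illustrative only: the library path, the `agentId` parameter name and the key=value syntax are assumptions for the purpose of the example, and the installation wizard pre-populates the exact values for your account:

```shell
# Illustrative only — the installation wizard supplies the exact path and parameters.
java -agentpath:/path/to/libopsian.so=agentVersion=2.4.1,agentId=checkout-1 \
     -jar my-application.jar
```

Setting agentVersion on each release enables the version comparison report, and a distinct Agent ID per JVM lets you tell apart multiple JVMs sharing one hostname.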
Recommended reports for getting started
We recommend starting with a Hotspot report using the CPU timing type. You can restrict a Hotspot report to individual thread groups using the drop-down at the top right of the report; this is useful if you're trying to track down a particular CPU performance issue on threads that service requests.
Clicking on an individual Hotspot in a Hotspots report will give you a breakdown of the hottest lines (i.e. those taking up the most CPU time) in the method, along with the most common callers and stack traces in which that Hotspot is present. This lets you rapidly identify the contexts from which the hot method is being called.
It is worth exploring the most common Hotspots and checking whether they fit your mental model of your application's production performance; often a few stand out as not fitting, and these can present an opportunity for optimisation.
If your application's servers don't show high CPU utilisation but still aren't performing satisfactorily, then a WallClock timing report will be useful. For these reports, identifying and restricting to the thread groups actually processing your requests is crucial to getting insight into your application's behaviour. You will likely find that the top-most Hotspot in a WallClock report is the idle or processing queue for your thread pool; this is usual (and if it is not, there may be opportunities for tuning your pool).
A Flame Graph with the CPU timing type is the next recommended report: it gives you a high-level overview of your application's performance while also letting you drill down into specific parts of your codebase. Clicking on any method in the Flame Graph zooms the report in to only the samples that contain that method, which is useful for understanding performance issues with specific operations. For example, if one particular HTTP endpoint is having problems, zooming in on that Servlet or Resource class will let you identify bottlenecks related to only that endpoint.
Methods whose boxes are very wide but whose children occupy a much smaller width indicate significant self-time and are usually a good candidate for exploring further.
As with Hotspots from earlier, if CPU utilisation is not high but performance is still a problem then it may make sense to look at a WallClock timing report to identify issues with blocking I/O and lock contention.
Stuck on any particular report or unsure where to go beyond the original reports? We're happy to help, just use the Chat Widget on the app itself or drop us an email at firstname.lastname@example.org.