How TransferWise fixed Elusive Performance Problems with Opsian

Wed 09 December 2020 Sadiq Jaffer & Richard Warburton

When TransferWise first started using Opsian’s Continuous Profiling platform they had a service with a peculiar and elusive performance problem, one that had stumped existing performance monitoring tools. This post covers the symptoms of the mysterious problem and how TransferWise used Opsian’s tools to track down and solve it.

The Problem

One of TransferWise’s engineering teams had been receiving reports of “weird” multi-second pauses for simple requests going to endpoints on a particular user-facing service. The initial suspicion for this kind of behaviour was a GC-related pause but the service was only under light load and GC stats didn’t show pauses anywhere near the magnitude of those reported.

What made things even more mysterious was that both data from manually instrumented metrics and a well known Application Performance Management system claimed response times were consistently low.

Integrating Opsian

Since manual inspection of the code and metrics hadn’t resulted in any likely causes, one of TransferWise’s platform engineering teams added Opsian’s JVM agent to a production instance of the microservice and began looking at the profiling reports.

Opsian’s main reports consist of Hot Spots, Flame Graphs and Tree Views. Hot Spots aggregate and summarise the top methods where the system spends it’s CPU time or threads are blocked. Each of the top methods also includes information on it’s callers and callees, as well as which source lines in the method consumed the most time.

The first Hot Spot report that the engineering team looked at indicated that a significant amount (>60%) of time was being spent enumerating the entries of Zip files, this was curious and unexpected. Jumping through the callers, it quickly became clear this was related to Java’s class loading machinery. Initially the team assumed this report included time from the service’s startup and so used the reporting options to isolate a time period after the service had started.

A deeper dive into the Tomcat source and the culprit

This new report still showed the majority of time was being spent in class loading. Digging deeper using the Opsian reports, the source lines in Tomcat that were causing the class loading were determined. Class loading was being initiated by the resource caching system which had a default expiry time of 5 seconds in the version of Tomcat being used for the service.

Invalidating the resource classes every 5 seconds is not ideal when they are static but should not have resulted in the kind of CPU usage the team were seeing in Opsian’s reports or the intermittent multi-second pauses. Further investigation of the sources of high CPU usage from the Hot Spots report identified a component of the Swagger framework, swagger-ui, as being the ultimate culprit. On a class reload, swagger-ui would initiate an expensive scanning of the class path that would block requests until finished.

Why was this invisible to existing tooling?

Manual instrumentation and existing Application Performance Management tooling are designed to measure and spot known common issues. This performance issue is a good example of that as all monitoring and the existing APM tool were using Servlet filters, which wrap the request and response process, to record the time spent processing a request. Unfortunately in this instance, the filters themselves were blocked until swagger-ui’s work was finished and so showed no increased latency - despite this being user-visible.

Opsian’s tooling is based on profiling rather than instrumentation. Instead of placing probes around the sites of known common issues, Opsian snapshots the execution state of the application many times a second with very low overhead - less than a 1% impact on performance. It centralises this data, indexes it and then makes it available for reports and visualisations. This means teams can use Opsian to find performance issues that are invisible to metrics and instrumentation-based tools. TransferWise now use Opsian across many of their production services, enabling their developers to safely and securely access performance data from even sensitive environments.

The Problem

Integrating Opsian

A deeper dive into the Tomcat source and the culprit

Why was this invisible to existing tooling?

Related articles

Profiling can be about time.. but which time? - Profiling In Production Series #2

What is profiling and how does it work? - Profiling In Production Series #1

Announcing Continuous Allocation profiling for the Java Virtual Machine

Why Java's TLABs are so important and why write contention is a performance killer in multicore environments