A new metric to quantify the latency overhead of our Viaduct framework
Viaduct, a GraphQL-based data-oriented service mesh, is Airbnb’s paved road solution for fetching internal data and serving public-facing API requests. As a unified data access layer, the Viaduct framework handles high throughput and is capable of dynamically routing to hundreds of downstream destinations when executing arbitrary GraphQL queries.
Performance Challenges in Viaduct
Viaduct’s role as a data access layer puts it in the critical path of most activity on Airbnb. This makes runtime performance of utmost importance as overhead in the framework will apply universally and can have a multiplicative effect. At the same time, Viaduct accepts arbitrary queries against the unified data graph. In practice, this amounts to many thousands of heterogeneous queries in production, each of which is capable of making an arbitrary number of downstream and often concurrent calls during the course of query execution.
This presented a challenge. Runtime overhead in Viaduct is crucial to monitor and improve, but we did not have a good measure for it. Metrics on end-to-end query latencies are confounded by the performance of downstream services, making it difficult to accurately judge the effect of a performance intervention in Viaduct. We needed a metric that isolates the performance impact of Viaduct changes from that of downstream services.
Defining Own Time
To do this, we created a metric called “own time”. Own time measures the portion of a request’s wall-clock time that occurs when there are zero downstream requests in flight. The following is pseudocode to compute own time given a root request time span and a set of downstream fetch time spans:
def calculateOwnTime(rootSpan, fetchSpans):
    ownTime = 0
    maxEndTimeSoFar = rootSpan.startTime
    sortedFetchSpans = fetchSpans sorted by increasing start time
    for fetchSpan in sortedFetchSpans:
        if maxEndTimeSoFar < fetchSpan.startTime:
            ownTime += (fetchSpan.startTime - maxEndTimeSoFar)
        maxEndTimeSoFar = max(maxEndTimeSoFar, fetchSpan.endTime)
    ownTime += (rootSpan.endTime - maxEndTimeSoFar)
    return ownTime
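To make the algorithm concrete, here is a minimal runnable sketch in Python with a worked example (the `Span` type and the timestamps are illustrative, not Viaduct's actual instrumentation types):

```python
from collections import namedtuple

# Illustrative span type: start/end timestamps in milliseconds.
Span = namedtuple("Span", ["startTime", "endTime"])

def calculateOwnTime(rootSpan, fetchSpans):
    ownTime = 0
    maxEndTimeSoFar = rootSpan.startTime
    for fetchSpan in sorted(fetchSpans, key=lambda s: s.startTime):
        # Any gap before this fetch starts had no downstream call in flight.
        if maxEndTimeSoFar < fetchSpan.startTime:
            ownTime += fetchSpan.startTime - maxEndTimeSoFar
        maxEndTimeSoFar = max(maxEndTimeSoFar, fetchSpan.endTime)
    # Time after the last downstream fetch completes is also own time.
    ownTime += rootSpan.endTime - maxEndTimeSoFar
    return ownTime

# Root span 0-100ms with two overlapping fetches, 10-40ms and 30-70ms:
# own time = (10 - 0) + (100 - 70) = 40ms.
print(calculateOwnTime(Span(0, 100), [Span(10, 40), Span(30, 70)]))  # 40
```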
The own time metric allows us to focus on aspects of Viaduct’s overhead that are clearly unrelated to downstream service dependencies. While it does not capture all aspects of Viaduct’s overhead, we’ve found it captures enough to be a valuable indicator of overhead costs.
Examples
In the trivial case where all downstream calls are made serially, own time is simply the duration of the root operation span minus the sum of the downstream span durations.
When there are multiple downstream calls, they may be made fully or partially in parallel.
In this example, the downstream calls happen partially in parallel, and the resulting own time value doesn’t include the time that any downstream request is in flight, parallel or not.
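Equivalently, own time is the root span's duration minus the length of the union of all downstream intervals. A quick sketch of that view, using hypothetical timestamps:

```python
def unionLength(intervals):
    # Total length covered by a set of (start, end) intervals,
    # merging overlaps so parallel time is only counted once.
    total = 0
    currentStart = currentEnd = None
    for start, end in sorted(intervals):
        if currentEnd is None or start > currentEnd:
            if currentEnd is not None:
                total += currentEnd - currentStart
            currentStart, currentEnd = start, end
        else:
            currentEnd = max(currentEnd, end)
    if currentEnd is not None:
        total += currentEnd - currentStart
    return total

# Root span 0-100ms; fetches 20-60ms and 40-80ms overlap during 40-60ms.
# Downstream-in-flight time is the union [20, 80] = 60ms, not 40 + 40 = 80ms.
print(100 - unionLength([(20, 60), (40, 80)]))  # own time = 40
```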
Identifying and Reducing Runtime Latency
Measuring the influence of CPU vs. I/O on request latency
Normalizing operation own time by overall operation latency gives us an estimate of how CPU-bound vs. I/O-bound an operation is. We call this the “own time ratio” of a query. For example, the Viaduct operation graphed below had an own time ratio of 20%, indicating that 20% of the request runtime in Viaduct was spent with no downstream request in flight. After deploying an internal Viaduct performance improvement, this operation’s own time ratio dropped to 17%, since Viaduct overhead improved while downstream performance remained constant.
A low own time ratio for an operation indicates that the biggest overall latency gains will likely be found by optimizing downstream services, not Viaduct. A high own time ratio indicates that meaningful latency gains can come from optimizing internal Viaduct runtime for the operation. When making such optimizations for the sake of one operation, we can also use own time ratios across all operations, and especially low-ratio ones, to ensure we aren't introducing a regression more broadly.
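The ratio itself is a simple quotient; a trivial sketch, with illustrative numbers rather than real production figures:

```python
def ownTimeRatio(ownTimeMs, totalLatencyMs):
    # Fraction of request wall-clock time with no downstream call in flight.
    return ownTimeMs / totalLatencyMs

# 40ms of own time in a 200ms request: the operation is mostly
# I/O-bound on downstream services.
print(ownTimeRatio(40, 200))  # 0.2
```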
Quantifying the impact of query size on runtime overhead
Viaduct users reported that large queries were running slower than expected, attributing the slow execution to Viaduct overhead. Before own time, we had no metrics to assess such reports. After introducing own time, we had a starting point, but we needed to refine the metric further for this use case.
One would naturally expect own time to increase with the number of fields an operation returns, so raw own time alone cannot tell us whether a large query's overhead is excessive. We found that normalizing own time by the count of fields returned by an operation yields a metric that more usefully indicates, across a heterogeneous set of operations, when own time is excessive. We defined field count to include both object fields and individual array elements.
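Counting object fields plus individual array elements can be sketched as a recursive walk over a GraphQL response (an illustrative implementation, not Viaduct's actual code; exactly how scalar leaves are counted is a definitional choice):

```python
def countFields(value):
    # Each object field counts once, plus whatever its value contains;
    # each array element likewise counts once.
    if isinstance(value, dict):
        return sum(1 + countFields(v) for v in value.values())
    if isinstance(value, list):
        return sum(1 + countFields(v) for v in value)
    return 0  # scalar leaf: already counted by its field or element

response = {"listings": [{"id": 1, "name": "Cabin"}, {"id": 2, "name": "Loft"}]}
# 1 "listings" field + 2 array elements + 4 scalar fields = 7.
print(countFields(response))  # 7
```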
The following graph shows that there is indeed an overall relationship between own time and field count across our set of operations, as well as some outliers that have unusually high own-time-to-field-count ratios.
This relationship between field count and own time encouraged us to focus on framework logic that runs on every field for all operations, rather than other parts of the codebase. Through some CPU profiling, we were able to quickly identify bottlenecks. One resulting improvement was a change to our multithreading model for field execution, which decreased own time for all operations by 25%.
Quantifying the impact of internal caching on runtime overhead
We encountered another performance issue in Viaduct: for some operations, latency varied an unusual amount, even between identical requests. Here again, we used own time to guide our investigation into root causes.
Viaduct relies on a number of internal caches to ensure that execution is fast, such as a cache for parsed and validated GraphQL documents. Own time metrics indicated that Viaduct runtime overhead, not downstream service dependencies, was causing the variance in latencies. We theorized that cache misses were the culprit. To test this theory, we instrumented our caches to report whether any lookup miss occurred during an operation execution and attached this hit/miss status to our own time metric output. This allowed us to report on own time by cache hit/miss status on a per-cache, per-operation basis.
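The instrumentation can be as simple as a cache wrapper that records whether any lookup missed during an operation. A hypothetical sketch (Viaduct's actual caches are internal, and a real implementation would scope the miss flag per operation execution):

```python
class InstrumentedCache:
    def __init__(self):
        self._store = {}
        self.missOccurred = False  # attached to own time metric output

    def get(self, key, computeFn):
        if key not in self._store:
            self.missOccurred = True  # flag the cold lookup
            self._store[key] = computeFn(key)
        return self._store[key]

cache = InstrumentedCache()
cache.get("query-hash", lambda key: "validated-document")
print(cache.missOccurred)  # True: the first lookup was a miss
```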
Adding this information to own time allowed us to both confirm our theory and quantify the potential benefit of implementing a solution, such as additional cache warming or moving in-memory caches to distributed caches, prior to committing actual engineering resources. Migrating the in-memory cache that stores the validation state of GraphQL documents to a distributed cache reduced miss rates. This had a significant impact on tail latencies, especially for low QPS operations that were more likely to encounter cold cache states.
Setting Runtime Overhead Goals
Establishing the own time metric normalized by field count ended up being a great way to account for changes in query patterns. Thus, we now use this metric, aggregated across all operations, to set framework-level performance targets that are isolated from changes in client query patterns. In particular, after measuring the base rate of normalized own time at the beginning of a quarter, we set a goal to improve normalized own time by a specific percentage quarter-over-quarter.
We also use this metric, aggregated on a per-operation basis, to let operation owners know how their operation overhead compares to the rest of the system.
Integrating Own Time Into The Release Cycle
To quantify the runtime performance impact of a change, we can set up experiments in which staged control and treatment applications receive identical replayed production traffic. We can then graph the difference in own time between the two, quantifying the impact of a framework intervention on runtime overhead and measuring it against our performance goals.
While replay experiments help us assess the potential runtime improvements of a change on a limited set of use cases, narrowly targeted optimizations can still accidentally cause broader performance regressions. To guard against such regressions, we leverage an automated canary analysis process before deployment. A canary instance and a baseline instance receive identical replayed production traffic for a period of time, and large discrepancies between them can automatically stop the deployment process. By inspecting the own time difference between the canary and baseline instances, we can identify unexpected performance regressions before they reach production.
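The canary gate can be sketched as a threshold on the relative difference in aggregate own time between canary and baseline (the helper name, threshold, and use of a simple mean are illustrative; a production check would rely on percentiles and statistical tests):

```python
def shouldBlockDeploy(baselineOwnTimes, canaryOwnTimes, maxRegression=0.10):
    # Block the deploy if the canary's mean own time regresses by more
    # than maxRegression relative to the baseline.
    baselineMean = sum(baselineOwnTimes) / len(baselineOwnTimes)
    canaryMean = sum(canaryOwnTimes) / len(canaryOwnTimes)
    return (canaryMean - baselineMean) / baselineMean > maxRegression

print(shouldBlockDeploy([10, 12, 11], [14, 15, 14]))  # True: ~30% regression
print(shouldBlockDeploy([10, 12, 11], [10, 11, 12]))  # False: no regression
```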
In addition to automated canary analysis, graphing day-over-day, week-over-week and month-over-month own time in production shows us long-term isolated performance trends and allows us to bisect any regressions that make it to production.
Limitations and Future Work
By ignoring what Viaduct does during all downstream calls, own time does not account for possible optimizations to a call pattern of the downstream requests themselves. For example, a request execution may be sped up by increasing concurrency of downstream calls or removing some calls altogether.
Although own time gives a measure of wall-clock runtime service overhead, it does not say what is causing the overhead or how to best improve it, which will vary across operations in a GraphQL server. However, tracking downstream request spans in memory provides baseline data that can be enriched with other metadata and further filtered to measure the contribution of application-specific activity to own time.
Tracking down the root cause of unexpected own time changes or understanding why an operation is an own time outlier requires manual inspection and sometimes additional one-off measurements, which take valuable engineering time. We can automate the first steps in these investigations by measuring the contribution of various parts of the application to own time. This would speed up root cause analysis and limit time spent manually profiling CPU usage.
Conclusion
Own time has allowed us to isolate the runtime performance characteristics of Viaduct, our GraphQL-based data-oriented service mesh. Using own time, we can precisely measure the production runtime performance effects of application changes, set downstream-independent performance goals, and measure our long-term progress against those goals for an arbitrary underlying application. Enriching own time with application-specific data, such as fetched field counts and cache hit/miss states in Viaduct, gives us an overarching view of the relationship between an application’s state and its runtime performance characteristics.
Acknowledgements
Thanks to everyone who made this work possible by supporting the Viaduct framework, brainstorming ideas and providing feedback on this post, including Aileen Chen, Yuchun Chen, Zoran Dimitrijevic, Adam Miskiewicz, Parth Shah, Raymie Stata, and Kim Strauch.