Visualizing Meltdown on AWS

Posted by

Mike Heffner

Introduction

On January 3, 2018, the Meltdown and Spectre CPU architecture flaws were announced to the world. Due to early leaks, the announcement was made roughly a week earlier than planned. These bugs are easily the largest vulnerabilities announced in the last decade and require a complete reassessment of microprocessor architectures and of how software and hardware interact.

Public cloud companies such as Amazon Web Services (AWS®) were informed of the vulnerabilities prior to the release, and worked to prepare system patches that would prevent information disclosure on multi-tenant cloud infrastructure. There is no universal fix for Spectre at the moment, so most mitigations to date have largely been targeting the easier-to-reproduce Meltdown vulnerability.

We won’t go into specific details about the bugs here because those have been widely covered in other forums. In this post, I’ll highlight the impacts we at SolarWinds Cloud® noticed across our AWS infrastructure during the last several weeks due to Meltdown. We, along with many SaaS companies, were impacted by these changes and suffered partial downtime due to AWS efforts to mitigate Meltdown. We’ll be posting a full postmortem for our incident in the coming days. Although this is the impact we saw in our environment, we realize this may not be the same for other environments.

Paravirtualized Instances

The first sign of what was to come was the reboot maintenance emails we received in mid-December, informing us that any Paravirtualized (PV) hosts had to be restarted by January 5. While we largely operate on HVM instances, we do have a handful of PV instances that support our legacy platforms. We planned to reboot these instances ahead of the holidays to avoid any potential downtime while fewer staff were onsite.

As you can see from the following chart, taken from a Python® worker service tier, when we rebooted our PV instances on December 20, ahead of the maintenance date, we saw CPU jumps of roughly 25%.

There were reports of similar issues related to performance or stability of instances after they were rebooted for this maintenance event.

HVM Instance Patching

AWS was able to live patch HVM instances with the Meltdown mitigation patches without requiring instance reboots. From what we observed, these patches started rolling out around January 4 00:00 UTC in us-east-1 and completed around 20:00 UTC for EC2 HVM instances in us-east-1. This was an incredibly fast rollout across all availability zones (AZs) for AWS, but they were clearly working against the clock given the accelerated timetable. We saw as little as six hours between patching in different AZs within us-east-1, a short window for anyone running large multi-AZ workloads.

CPU bumps like this were noticeable across several service tiers:

During this same time period, we saw additional CPU increases on our PV instances that had been previously upgraded. This seems to imply that some level of HVM patching was occurring on these PV instances around the same time that all pure-HVM instances were patched:

Per-tier observations

The patch rollout impacted pretty much every tier in our platform, including our EC2 infrastructure and AWS managed services (RDS, ElastiCache, VPN Gateway). Here are a few snapshots of how it impacted services across our platform.

Kafka

We rely heavily on Kafka for stream processing across SolarWinds Cloud for logs, metrics, and traces. While we noticed moderate bumps in CPU during the patch deploy window, roughly 4%, as shown in the first chart below, the impressive change was the drop in packet rates sent from our Kafka brokers.

On the same Kafka cluster as above, we saw the packet rate drop by up to 40% when the patches were deployed. Normally, a drop this significant would be a sign that the workload pattern had shifted, but we weren’t able to detect any change in total throughput. The Kafka consumer API has several configuration settings that allow batch size and polling frequency to be tuned independently. Our working assumption is that individual consumer poll times increased at the broker, so the time between poll calls increased, which in turn led to larger batch sizes. Larger batches meant fewer small packets sent over the wire.
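The reasoning above can be illustrated with a toy model. This is only a sketch of the arithmetic, not a measurement of our cluster: the message rate and the poll intervals are invented numbers, and `poll_stats` is a hypothetical helper. It assumes a consumer that polls at a fixed interval against a constant message rate.

```python
def poll_stats(msgs_per_sec, poll_interval_ms, duration_s=60.0):
    """Return (number_of_polls, messages_per_poll) for a fixed poll interval.

    Assumes a constant message rate and that each poll drains everything
    that accumulated since the previous poll.
    """
    polls = duration_s * 1000.0 / poll_interval_ms
    msgs_per_poll = msgs_per_sec * poll_interval_ms / 1000.0
    return polls, msgs_per_poll


# Illustrative values only: 10,000 msgs/s, with each poll cycle
# slowing from 10 ms to 16 ms after the patches.
before_polls, before_batch = poll_stats(10_000, poll_interval_ms=10)
after_polls, after_batch = poll_stats(10_000, poll_interval_ms=16)

# Total throughput is unchanged; only the shape of the traffic differs.
total_before = before_polls * before_batch
total_after = after_polls * after_batch
poll_drop_pct = (1 - after_polls / before_polls) * 100

print(f"polls: {before_polls:.0f} -> {after_polls:.0f} ({poll_drop_pct:.1f}% fewer)")
print(f"batch size: {before_batch:.0f} -> {after_batch:.0f} messages")
print(f"total messages: {total_before:.0f} vs {total_after:.0f}")
```

In this model, slower poll cycles produce fewer, larger fetches while the total number of messages delivered stays the same, which is the pattern we would expect if the hypothesis holds.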

This theory aligns with the dropoff in read call counts we saw with our Kafka consumers:

Cassandra

The Cassandra tiers that we use for TSDB storage were also impacted across the board. We saw CPU spikes of roughly 25% on m4.2xlarge instances and similar spikes on other instance types. For now, we have enough extra capacity to absorb these spikes, but we are actively looking at how we can reduce the increased load.

Internal Service Tier

The impacts on performance were also seen in our own service tiers. One internal tier, which interacts with Cassandra, saw a 45% jump in p99 latency committing records to Cassandra:

ElastiCache

We also detected spikes in latency on AWS managed services, like AWS ElastiCache Memcached. This snapshot shows a bump of roughly 8 percentage points in CPU on one Memcached node, which was almost a 100% increase relative to its previous level:
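The difference between the absolute and the relative change is easy to lose, so here is a small worked example. The baseline and patched CPU values are assumptions chosen to be consistent with the description above, not readings taken from the chart, and `cpu_change` is a hypothetical helper.

```python
def cpu_change(baseline_pct, patched_pct):
    """Return (change in percentage points, relative change in percent)."""
    points = patched_pct - baseline_pct
    relative = points / baseline_pct * 100.0
    return points, relative


# Assumed values for illustration: CPU going from 8.5% to 16.5%.
points, relative = cpu_change(8.5, 16.5)
print(f"{points:.1f} percentage points, {relative:.1f}% relative increase")
```

A node that looks nearly idle in absolute terms can still have close to doubled its CPU cost per request, which matters once traffic grows.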

This led to increased tail latencies and cache timeouts, providing us with a pretty unique graph:

Possible Mitigation

As you can tell from the performance charts, the new baseline for system performance has been significantly altered by the mitigation patches for Meltdown. The kernel page-table isolation (KPTI) patches are necessary to maintain memory isolation to help prevent exploitation on multi-tenant, public cloud infrastructure. Software engineers will need to identify how they can reduce the increased cost of the user-to-kernel switches that KPTI adds.

Applications that make frequent system calls to read/write data either over network sockets or from disk systems will need to be better tuned for batching. Incurring small I/O operations is now more costly, and engineers will need to optimize their code to reduce the frequency of such calls. Finding the sweet spot between larger batch sizes and latency is difficult and will require software that adapts for multiple variables simultaneously. It was promising to see that the Kafka consumer libraries were able to optimize for this dynamically as network call latency increased.
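As a rough illustration of the batching idea, the sketch below coalesces many small records into fewer, larger writes. `CountingSink` and `BatchingWriter` are hypothetical names invented for this example; the sink stands in for a socket or file, where each `write` would normally cost a system call. The 4096-byte threshold is an arbitrary assumption, not a recommendation.

```python
class CountingSink:
    """Stand-in for a socket or file: counts the write calls it receives."""

    def __init__(self):
        self.calls = 0
        self.data = bytearray()

    def write(self, chunk):
        self.calls += 1
        self.data += chunk
        return len(chunk)


class BatchingWriter:
    """Buffers records and forwards them to the sink in larger chunks."""

    def __init__(self, sink, max_batch_bytes=4096):
        self.sink = sink
        self.max_batch_bytes = max_batch_bytes
        self._buf = bytearray()

    def write(self, record):
        self._buf += record
        # Flush only once enough data has accumulated.
        if len(self._buf) >= self.max_batch_bytes:
            self.flush()

    def flush(self):
        if self._buf:
            self.sink.write(bytes(self._buf))
            self._buf.clear()


records = [b"x" * 100 for _ in range(1000)]

# One write call per record.
unbatched = CountingSink()
for record in records:
    unbatched.write(record)

# Same records, coalesced into batches of at least 4096 bytes.
batched = CountingSink()
writer = BatchingWriter(batched, max_batch_bytes=4096)
for record in records:
    writer.write(record)
writer.flush()  # push out the final partial batch

print(unbatched.calls, batched.calls)
```

The same bytes reach the sink either way, but with far fewer calls in the batched case. The cost is latency: a record can sit in the buffer until the threshold is reached, so a real implementation would likely also flush on a timer.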

To support these efforts, engineers will need access to real-time, high-fidelity metrics that expose the rate of system calls generated by their applications or databases. Moving forward, all observability platforms must include the capability to actively monitor system calls.
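One possible starting point on Linux is the per-process counters in `/proc/<pid>/io`, which include `syscr` and `syscw` (counts of read-type and write-type system calls). The sketch below is an assumption about how such a metric could be derived, not a description of how any particular monitoring agent works; the function names are hypothetical, and these counters cover only read/write-style calls rather than every system call.

```python
import time


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        if value.isdigit():
            counters[key.strip()] = int(value)
    return counters


def syscall_rates(prev, curr, elapsed_s):
    """Convert two counter samples into read/write syscalls per second."""
    return {
        name: (curr[name] - prev[name]) / elapsed_s
        for name in ("syscr", "syscw")
    }


def read_counters(pid="self"):
    """Read the current counters for a process (Linux only)."""
    with open(f"/proc/{pid}/io") as f:
        return parse_proc_io(f.read())


def measure(pid="self", interval_s=1.0):
    """Sample the counters twice and return syscalls per second."""
    start = time.monotonic()
    prev = read_counters(pid)
    time.sleep(interval_s)
    curr = read_counters(pid)
    return syscall_rates(prev, curr, time.monotonic() - start)
```

Calling `measure(pid)` on a Linux host returns a dictionary of rates that could be emitted as a gauge alongside CPU and latency metrics. Reading another process's `/proc/<pid>/io` typically requires matching ownership or elevated privileges.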

Next

It’s uncertain where the future of Meltdown and Spectre patches will land, but they are likely to continue to impact performance significantly for any business running infrastructure at scale. We need to adapt our software engineering disciplines to accommodate changed assumptions in system performance as we continue to build distributed systems.

It remains to be seen how additional patching in guest kernels will impact performance when run on top of patched cloud hypervisor nodes. We are continuing to explore this and the impacts it may cause.

Observability

The graphs you see above are from AppOptics, the SaaS-based application and infrastructure performance monitoring product. System metrics like CPU and packet rates are provided by our native AWS® CloudWatch integration as well as our system host agent. If you would like to gain similar visibility into your system performance, sign up for a free trial at: https://my.appoptics.com/sign_up.

Update – Jan 12, 2018

As of 10:00 UTC this morning, we are noticing a steep reduction in CPU usage across our instances. It is unclear if there are additional patches being rolled out, but CPU levels appear to be returning to pre-HVM patch levels.

DISCLAIMER: The data set forth herein is based upon our observation of our environment for a set period of time. The data presented may not reflect other environments nor the causality behind the data.

© 2018 SolarWinds Worldwide, LLC. All rights reserved.