Introduction

On January 3, 2018, the Meltdown and Spectre CPU architecture flaws were announced to the world. Due to early leaks, the announcement was made roughly a week earlier than planned. These bugs are easily the largest vulnerabilities announced in the last decade and require a complete reassessment of microprocessor architectures, and how software and hardware interact.

Public cloud companies such as Amazon Web Services (AWS®) were informed of the vulnerabilities prior to the release, and worked to prepare system patches that would prevent information disclosure on multi-tenant cloud infrastructure. There is no universal fix for Spectre at the moment, so most mitigations to date have largely been targeting the easier-to-reproduce Meltdown vulnerability.

We won’t go into specific details about the bugs here because those have been widely covered in other forums. In this post, I’ll highlight the impacts we at SolarWinds Cloud® noticed across our AWS infrastructure during the last several weeks due to Meltdown. We, along with many SaaS companies, were impacted by these changes and suffered partial downtime due to AWS efforts to mitigate Meltdown. We’ll be posting a full postmortem for our incident in the coming days. Although this is the impact we saw in our environment, we realize this may not be the same for other environments.

Paravirtualized Instances

The first sign of what was to come was the reboot maintenance emails we received in mid-December, informing us that any paravirtualized (PV) hosts had to be restarted by January 5. While we largely operate on hardware virtual machine (HVM) instances, we do have a handful of PV instances that support our legacy platforms. We planned to reboot these instances before the holidays, rather than during them when fewer staff are onsite, to avoid any potential downtime.

As you can see from the following chart, taken from a Python® worker service tier, when we rebooted our PV instances on December 20, ahead of the maintenance date, we saw CPU jumps of roughly 25%.

There were reports of similar issues related to performance or stability of instances after they were rebooted for this maintenance event.

HVM Instance Patching

AWS was able to live patch HVM instances with the Meltdown mitigation patches without requiring instance reboots. From what we observed, these patches started rolling out around January 4 00:00 UTC in us-east-1 and completed around 20:00 UTC for EC2 HVM instances in us-east-1. This was an incredibly fast rollout across all availability zones (AZs) for AWS, but they were clearly working against the clock given the accelerated timetable. We saw a window as short as six hours between different AZs within us-east-1, a relatively small gap for anyone running large multi-AZ workloads.

CPU bumps like this were noticeable across several service tiers:

During this same time period, we saw additional CPU increases on our PV instances that had been previously upgraded. This seems to imply that some level of HVM patching was occurring on these PV instances around the same time that all pure-HVM instances were patched:

Per-tier observations

The patch rollout impacted nearly every tier in our platform, including our EC2 infrastructure and AWS managed services (RDS, ElastiCache, VPN Gateway). Here are a few snapshots of how it impacted services across our platform.

Kafka

We rely heavily on Kafka for stream processing across SolarWinds Cloud for logs, metrics, and traces. While we noticed moderate bumps in CPU during the patch deploy window, roughly 4%, as shown in the first chart below, the impressive change was the drop in packet rates sent from our Kafka brokers.

On the same Kafka cluster as above, we saw the packet rate drop by up to 40% when the patches were deployed. Normally, a drop this significant would be a sign that the workload pattern had shifted, but we weren’t able to detect any change in total throughput. The Kafka consumer API has several configuration settings that allow batch size and polling frequency to be tuned independently. Our working assumption is that individual consumer poll times increased at the broker, so the time between poll calls increased, and more data accumulated between polls. That led to larger, more efficient batches, and larger batches meant fewer small packets sent over the wire.
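To make the working assumption concrete, here is a minimal toy model, not a model of the real Kafka client. It assumes a single consumer that polls back to back, a constant message arrival rate, and a fixed cost per poll. The function name `simulate_consumer` and all of the numbers are hypothetical, chosen only for illustration; they are not measurements from our cluster.

```python
def simulate_consumer(arrival_rate_per_s: int, poll_cost_us: int, duration_us: int):
    """Toy model: a consumer polls back to back, each poll taking poll_cost_us.

    Every poll returns all messages that arrived since the previous poll.
    Integer microseconds are used to avoid floating-point drift in the loop.
    Returns (number_of_polls, mean_messages_per_poll).
    """
    polls = 0
    delivered = 0
    clock_us = 0
    while clock_us + poll_cost_us <= duration_us:
        clock_us += poll_cost_us
        # Messages that have arrived so far, minus those already handed out.
        arrived = arrival_rate_per_s * clock_us // 1_000_000
        delivered += arrived - delivered
        polls += 1
    return polls, delivered / polls


# Hypothetical numbers: 50,000 msgs/s over 10 seconds.
before = simulate_consumer(50_000, poll_cost_us=10_000, duration_us=10_000_000)
after = simulate_consumer(50_000, poll_cost_us=12_500, duration_us=10_000_000)
print("before patch (polls, mean batch):", before)
print("after patch  (polls, mean batch):", after)
```

In this model the same 500,000 messages are delivered in both runs, but the slower poll produces fewer, larger batches. That is the shape of the effect we believe we observed: unchanged throughput with fewer round trips.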

This theory aligns with the dropoff in read call counts we saw with our Kafka consumers:

Cassandra

The Cassandra tiers that we use for time series database (TSDB) storage were also impacted across the board. We saw CPU spikes of roughly 25% on m4.2xlarge instances and similar spikes on other instance types. For now, we have enough extra capacity to absorb these spikes, but we are actively looking at how we can reduce the increased load.

Internal Service Tier

The impacts on performance were also seen in our own service tiers. One internal tier, which interacts with Cassandra, saw a 45% jump in p99 latency committing records to Cassandra:

ElastiCache

We also detected spikes in latency on AWS managed services, like AWS ElastiCache Memcached. This snapshot shows CPU on a given Memcached node rising by about 8 percentage points, which relative to its previous level was almost a 100% increase:

This led to increased tail latencies and cache timeouts, providing us with a pretty unique graph:
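As a side note on the CPU figure above, an absolute bump and a relative increase are easy to mix up. The sketch below shows the arithmetic. The helper name `relative_increase` and the before and after values are hypothetical, picked only to show how a bump of 8 points can amount to a doubling; they are not read off our chart.

```python
def relative_increase(before_pct: float, after_pct: float) -> float:
    """Relative change, in percent, between two CPU utilization readings."""
    return (after_pct - before_pct) / before_pct * 100.0


# Hypothetical readings: CPU utilization goes from 8% to 16%.
before_pct, after_pct = 8.0, 16.0
print("absolute bump (points):", after_pct - before_pct)
print("relative increase (%): ", relative_increase(before_pct, after_pct))
```

A lightly loaded cache node can therefore look fine in absolute terms while its per-request CPU cost has roughly doubled, which is consistent with the higher tail latencies and timeouts.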

Possible Mitigation

As you can tell from the performance charts, the new baseline for system performance has been significantly altered by the mitigation patches for Meltdown. The kernel page-table isolation (KPTI) patches are necessary to maintain memory isolation and help prevent exploitation on multi-tenant, public cloud infrastructure. Software engineers will need to identify how they can reduce the increased cost of user-to-kernel switches that KPTI adds.

Applications that make frequent system calls to read or write data, either over network sockets or from disk, will need to be better tuned for batching. Small I/O operations are now more costly, and engineers will need to optimize their code to reduce the frequency of such calls. Finding the sweet spot between larger batch sizes and latency is difficult and will require software that adapts to multiple variables simultaneously. It was promising to see that the Kafka consumer libraries were able to optimize for this dynamically as network call latency increased.
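As a rough illustration of the batching idea, here is a minimal sketch of a writer that buffers small records and passes them to the underlying sink in larger chunks. `BatchingWriter` and `CountingSink` are hypothetical names invented for this example, and the sink stands in for whatever ultimately issues the system call (a socket send or a file write). It is a sketch under those assumptions, not code from our services.

```python
class BatchingWriter:
    """Buffer small records and hand them to the sink in larger chunks."""

    def __init__(self, sink_write, max_bytes=64 * 1024):
        self._sink_write = sink_write  # callable taking bytes, e.g. a socket's sendall
        self._max_bytes = max_bytes    # flush once this many bytes are buffered
        self._chunks = []
        self._size = 0

    def write(self, record: bytes) -> None:
        self._chunks.append(record)
        self._size += len(record)
        if self._size >= self._max_bytes:
            self.flush()

    def flush(self) -> None:
        if self._chunks:
            # One call to the sink instead of one call per record.
            self._sink_write(b"".join(self._chunks))
            self._chunks.clear()
            self._size = 0


class CountingSink:
    """Stand-in for a socket or file that counts how often it is called."""

    def __init__(self):
        self.calls = 0
        self.bytes_seen = 0

    def write(self, data: bytes) -> None:
        self.calls += 1
        self.bytes_seen += len(data)


def demo():
    records = [b"x" * 100] * 10_000  # 10,000 small records of 100 bytes each

    unbatched = CountingSink()
    for record in records:
        unbatched.write(record)

    batched = CountingSink()
    writer = BatchingWriter(batched.write, max_bytes=64 * 1024)
    for record in records:
        writer.write(record)
    writer.flush()  # push out whatever is left in the buffer

    return unbatched.calls, batched.calls, batched.bytes_seen


print(demo())
```

The sketch flushes on size only. A production version would also need a time-based flush so that a slow trickle of records is not held back indefinitely, which is exactly the batch size versus latency trade-off described above.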

To support these efforts, engineers will need access to real-time, high-fidelity metrics that expose the rate of system calls generated by their applications or databases. Moving forward, all observability platforms must include the capability to actively monitor system calls.
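One low-cost starting point on Linux is the per-process counters in `/proc/<pid>/io`, where the `syscr` and `syscw` fields count read-style and write-style system calls. The sketch below samples those counters and turns two samples into a rate. The function names are hypothetical, the canned sample text is made up for illustration, and these counters cover only read and write calls rather than every system call, so treat this as a partial view and not a complete monitor.

```python
def parse_proc_io(text: str) -> dict:
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def sample(pid="self") -> dict:
    """Read the current I/O counters for a process (Linux only)."""
    with open(f"/proc/{pid}/io") as f:
        return parse_proc_io(f.read())


def syscall_rate(before: dict, after: dict, interval_s: float) -> dict:
    """Read and write system calls per second between two samples."""
    return {
        "read_syscalls_per_s": (after["syscr"] - before["syscr"]) / interval_s,
        "write_syscalls_per_s": (after["syscw"] - before["syscw"]) / interval_s,
    }


# Canned, made-up samples taken 10 seconds apart, so the example runs anywhere.
first = parse_proc_io("rchar: 1000\nwchar: 500\nsyscr: 40\nsyscw: 10\n")
second = parse_proc_io("rchar: 9000\nwchar: 2500\nsyscr: 240\nsyscw: 60\n")
print(syscall_rate(first, second, interval_s=10.0))
```

On a live Linux host, calling `sample()` twice with a sleep in between gives the same kind of rate for a running process. Dividing the byte counters (`rchar`, `wchar`) by the call counters gives an average I/O size, which is a useful hint about whether batching is working.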

Next

It’s uncertain where the Meltdown and Spectre patches will land, but they are likely to continue to impact performance significantly for any business running infrastructure at scale. We need to adapt our software engineering disciplines to accommodate changed assumptions in system performance as we continue to build distributed systems.

It remains to be seen how additional patching in guest kernels will impact performance when run on top of patched cloud hypervisor nodes. We are continuing to explore this and the impacts it may cause.

Observability

The graphs you see above are from AppOptics, the SaaS-based application and infrastructure performance monitoring product. System metrics like CPU and packet rates are provided by our native AWS® CloudWatch integration as well as our system host agent. If you would like to gain similar visibility into your system performance, sign up for a free trial at: https://my.appoptics.com/sign_up.

Update – Jan 12, 2018

As of 10:00 UTC this morning, we are noticing a steep reduction in CPU usage across our instances. It is unclear if there are additional patches being rolled out, but CPU levels appear to be returning to pre-HVM patch levels.

DISCLAIMER: The data set forth herein is based upon our observation of our environment for a set period of time. The data presented may not reflect other environments nor the causality behind the data.

© 2021 SolarWinds Worldwide, LLC. All rights reserved.