Note: This article was originally published on Librato, which has since been merged with SolarWinds® AppOptics. Explore how you can set up APM alerts using AppOptics.

The world may run on coffee, but it’s the alarm clock that gets us out of bed. It operates on a simple threshold. You set the time that’s important to you and receive an alert when that time arrives.

Like your alarm clock, today’s tooling for web service alerting often operates on simple thresholds, but unlike with your clock, there is a wide variety of metrics and it’s not as clear which should trigger an alert. Until we have something better than thresholds, engineers have to carefully weigh which metrics are actionable, how they are being measured, and what thresholds correspond to real-world problems.
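To make the idea of a threshold concrete, here is a minimal sketch in Python. The class name, the metric name `api.latency_ms`, the 500 ms limit, and the "three consecutive samples" rule are all illustrative assumptions, not values from any particular tool. Requiring several consecutive breaches is one common way to keep a single noisy sample from firing an alert.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ThresholdAlert:
    """Hypothetical threshold rule; all field values are illustrative."""
    metric: str
    threshold: float
    min_breaches: int  # consecutive samples above threshold (assumed >= 1)

    def should_fire(self, samples: Sequence[float]) -> bool:
        # Fire only when the most recent `min_breaches` samples all exceed
        # the threshold, so one spike on its own does not trigger the alert.
        if len(samples) < self.min_breaches:
            return False
        recent = samples[-self.min_breaches:]
        return all(value > self.threshold for value in recent)


alert = ThresholdAlert(metric="api.latency_ms", threshold=500.0, min_breaches=3)
print(alert.should_fire([120, 640, 130, 150]))  # False: an isolated spike
print(alert.should_fire([120, 510, 640, 720]))  # True: a sustained breach
```

Even in a toy like this, the hard part is not the comparison but choosing the metric, the limit, and the duration that correspond to a real problem.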

Measure the Thing You Care About

In practice, this arguably simple process of reasoning about what you are monitoring, and how you are monitoring it, is rarely undertaken. More often, our metric choices and threshold values are guided by our preexisting tools. Hence, if our tools cannot measure latency, we do not alert on latency. Letting our tools dictate the content of our telemetry is an anti-pattern that results in unreliable problem detection and alerting.

Effective alerting requires metrics that are reliably actionable. You have to start by reasoning about the application and/or infrastructure you want to monitor. Only then can you choose and implement collection tools that get you the metrics you’re actually interested in, like queue size, DB roundtrip times, and inter-service latency.
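As a rough illustration of measuring the thing you care about directly, the sketch below times a database roundtrip from inside the application and records a queue size alongside it. The `record` function and the in-memory `measurements` dictionary are hypothetical stand-ins for whatever collector you actually use, and the `time.sleep` call stands in for a real query.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory sink. A real collector would forward these
# measurements to your metrics backend instead of keeping them here.
measurements = defaultdict(list)


def record(name, value):
    """Store one measurement under a metric name."""
    measurements[name].append(value)


@contextmanager
def timed(name):
    """Record the wall-clock duration of the enclosed block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        record(name, elapsed_ms)


def fetch_user(user_id):
    # The metric wraps the exact operation we care about.
    with timed("db.roundtrip_ms"):
        time.sleep(0.01)  # stand-in for a real database query
        return {"id": user_id}


fetch_user(42)
record("worker.queue_size", 17)  # illustrative value
```

The point of the sketch is the order of operations: decide that DB roundtrip time matters, then instrument it, rather than alerting on whatever an existing tool happens to expose.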

One Reliable Signal

Effective alerting requires a single, reliable telemetry signal to which every collector can contribute. Developing and maintaining a reliable signal can be difficult, but it is orders of magnitude simpler than building out multiple disparate monitoring systems and trying to make them agree with each other. Many shops do exactly that, alerting from one system such as Nagios and troubleshooting from another such as Ganglia.

It’s arguably impossible to make multiple, fallible systems agree with each other in every case. They may usually agree, but every false positive or false negative undermines the credibility of both systems. Further, multiple systems rarely improve because it’s usually impossible to know which system was at fault when they disagree. Did the alerting system send a bogus alert or is there a problem with the data in the visualization system? If false positives arise from a single telemetry system, you simply iterate and improve that system.
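One way to picture a single signal is a shared store that every collector writes to and that both the alerting check and the charts read from. The sketch below assumes a toy in-memory `MetricStore`; the class, function names, source names, and threshold are all hypothetical and exist only to show the shape of the idea.

```python
from collections import defaultdict
from statistics import mean


class MetricStore:
    """Hypothetical single source of truth for all telemetry."""

    def __init__(self):
        self._series = defaultdict(list)

    def submit(self, name, value, source):
        # Every collector, whatever its origin, contributes to the same series.
        self._series[name].append((source, value))

    def values(self, name):
        return [value for _, value in self._series[name]]


def alert_is_firing(store, name, threshold):
    """Alerting reads from the shared store."""
    values = store.values(name)
    return bool(values) and mean(values) > threshold


def chart_points(store, name):
    """Visualization reads exactly the same data the alert evaluated."""
    return store.values(name)


store = MetricStore()
store.submit("api.latency_ms", 480, source="app-instrumentation")
store.submit("api.latency_ms", 560, source="load-balancer-logs")
store.submit("api.latency_ms", 610, source="app-instrumentation")

print(alert_is_firing(store, "api.latency_ms", threshold=500))  # True
print(chart_points(store, "api.latency_ms"))  # [480, 560, 610]
```

Because the alert and the chart share one data path, a bogus alert can only mean the data or the rule is wrong, and either can be fixed in one place.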

Alert Recipient = Alert Creator

Crafting effective alerts involves knowing how your systems work. Each alert should trigger in the mind of its recipient an actionable cognitive model that describes how the production environment is being threatened. How does the individual piece of infrastructure that fired this alert affect the application? Why is this alert a problem?

Only engineers who understand the systems and applications we care about have the requisite knowledge to craft alerts that describe actionable threats to those systems and applications. Therefore, effective alerting requires that the recipients of alerts be able to craft those alerts.

Push Notifications as a Last Resort

Emergencies force context switches. They interrupt workflow and destroy productivity. Many alerts are necessary, but very few of them should be considered emergencies. At AppOptics, most of our alerts are delivered to group chat. We find that this is a timely notification medium that doesn’t interrupt productivity. Further, group chat allows everyone to react to the alert together, in a group context, rather than individually from an email inbox or pager. This helps us avoid redundant troubleshooting effort and keeps everyone synchronized on problem resolution.

Effective alerting requires an escalation system that can communicate problems in a way that is not interrupt-driven. Other industries, such as healthcare and security systems, offer myriad examples of what happens when every alert is interrupt-driven: human beings quickly begin to ignore the alerts. Push notifications should be a last resort.
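A routing policy along these lines can be sketched in a few lines of Python. This is one possible policy under stated assumptions, not a description of how AppOptics routes alerts: the `Alert` fields, the channel names, and the five-minute delay are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    CHAT = "chat"
    PAGER = "pager"


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; field names are illustrative."""
    name: str
    is_emergency: bool
    minutes_unacknowledged: int = 0


PAGE_AFTER_MINUTES = 5  # assumed escalation delay, tune for your team


def channels_for(alert):
    """Return the channels an alert should be delivered to right now."""
    # Every alert goes to group chat, where the whole team can see it.
    channels = [Channel.CHAT]
    # Interrupt a human only for an emergency nobody has picked up yet.
    if alert.is_emergency and alert.minutes_unacknowledged >= PAGE_AFTER_MINUTES:
        channels.append(Channel.PAGER)
    return channels


print(channels_for(Alert("disk.usage_high", is_emergency=False, minutes_unacknowledged=30)))
print(channels_for(Alert("api.down", is_emergency=True, minutes_unacknowledged=6)))
```

In this sketch the first alert is delivered to chat only, however long it sits unacknowledged, while the second reaches the pager because it is both an emergency and unacknowledged past the delay.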

Alerting is Hard

Effective alerting is a deceptively hard problem, which represents one of the biggest challenges facing modern operations engineers. A careful balance needs to be struck between the needs of the systems and the needs of the humans tending those systems. To learn more about alerting best practices, check out our other blog posts on the topic, and sign up for a free trial when you’re ready to take control of your alerting.

© 2021 SolarWinds Worldwide, LLC. All rights reserved.