Detecting abnormal system behaviour visually
For DevOps engineers and SREs, detecting abnormal system behaviour visually is often the fastest way to understand what’s going wrong, why, and how urgently you need to act. Done well, visual detection turns noisy metrics and logs into intuitive signals you can interpret in seconds.
This post walks through practical techniques, dashboard patterns, and code snippets you can use today to make abnormal behaviour “pop out” visually in your observability stack.
Why detecting abnormal system behaviour visually matters
Most production incidents have early warning signs: a slow drift in latency, a subtle increase in 5xxs, or a resource saturation trend that starts hours before the outage. Metrics and logs capture these signs, but humans react fastest when they can recognize them visually on a dashboard.[1][2]
By designing for visual anomaly detection, you:
- Reduce MTTD (Mean Time To Detect) by making issues obvious at a glance.[2]
- Cut alert noise by letting visuals show context before you page someone.[1]
- Improve incident communication: everyone can see the same graphs and agree “this is wrong.”
Core visual patterns for detecting abnormal system behaviour
1. Baseline vs current: show “normal” directly on the chart
The simplest way of detecting abnormal system behaviour visually is to plot current behaviour against a historical baseline (same time last week, last N days, or a trained model output).
Example: Prometheus + Grafana, HTTP latency vs 7‑day baseline.
# Current 95th percentile latency over 5m windows
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Baseline: 7-day rolling average of the same series (5m-resolution subquery)
histogram_quantile(
  0.95,
  avg_over_time(
    sum by (le) (rate(http_request_duration_seconds_bucket[5m]))[7d:5m]
  )
)
In your panel:
- Series A: current latency (bold, bright color).
- Series B: baseline (thin, muted color or dashed line).
Any divergence between the current line and baseline line becomes a visual anomaly, even before alerts fire.
2. Color-coded thresholds and bands
Static SLO/SLA thresholds are still powerful visual cues. Use bands and color changes to highlight when metrics cross “danger zones.”[1][3]
Common examples:
- CPU > 80% for > 5 minutes.
- Error rate > 1% of all requests.
- Queue length > 75% of max depth.
In Grafana, configure a threshold band on a time series panel (e.g., yellow from 70–85, red > 85). The chart itself becomes a heatmap of risk, ideal for visually detecting abnormal system behaviour.
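As a sketch of what that looks like in recent Grafana versions (the exact schema varies by version, and the colors and breakpoints here are illustrative), the threshold steps live under the panel's fieldConfig, with thresholdsStyle set to "area" to render them as background bands:

```json
{
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "custom": {
        "thresholdsStyle": { "mode": "area" }
      },
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 70 },
          { "color": "red", "value": 85 }
        ]
      }
    }
  }
}
```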
3. Anomaly detection overlays
Many modern backends (VictoriaMetrics, CloudWatch, DevOps Guru, Datadog, AppDynamics, etc.) can compute anomalies and expose them as metrics.[1][3][4]
Pattern:
- Series A: raw metric (latency, CPU, throughput).
- Series B: anomaly score or boolean (0 = normal, 1 = anomaly), plotted as:
# Example: flag when the 5m request rate is more than 3 standard
# deviations above its 30m rolling mean. Note the use of rate():
# http_requests_total is a counter, so z-scoring the raw series
# would only measure its monotonic growth.
(
  rate(http_requests_total[5m])
  - avg_over_time(rate(http_requests_total[5m])[30m:])
)
/
stddev_over_time(rate(http_requests_total[5m])[30m:]) > bool 3
Then show Series B as points or annotations. Spikes in the anomaly series visually mark abnormal behaviour without hiding the raw signal.
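Before wiring a z-score expression into your metrics backend, it can help to prototype the technique offline. Here is a minimal Python sketch (window size and sigma are illustrative) that flags points more than 3 standard deviations above a trailing mean:

```python
from statistics import mean, pstdev

def zscore_anomalies(values, window=30, sigma=3.0):
    """Return indices whose value exceeds the trailing-window mean
    by more than `sigma` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sd = pstdev(baseline) or 1.0  # guard against a flat baseline
        if (values[i] - mu) / sd > sigma:
            anomalies.append(i)
    return anomalies

# A flat request-rate series with one spike at index 40
series = [100.0] * 60
series[40] = 500.0
print(zscore_anomalies(series))  # → [40]
```

Plotting the returned indices as markers over the raw series gives the same overlay effect as the PromQL version.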
4. Correlated layouts: side-by-side cause and effect
DevOps and SRE workflows depend on spotting relationships: “CPU jumped, then latency spiked, then errors increased.”[2]
Make this correlation visual by arranging panels vertically:
- Top: request rate (RPS/QPS).
- Next: latency (P50/P95/P99).
- Next: error rate (4xx/5xx).
- Bottom: infrastructure (CPU, memory, I/O, DB connections).
When all lines move together, it’s clearly a system-wide issue. When only one layer moves, you localize the abnormal system behaviour visually to that layer.
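The "CPU jumped, then latency spiked" ordering you spot visually can also be quantified: sliding one series against the other and checking correlation at each lag reveals how far apart cause and effect are. A minimal Python sketch (the sample series and lag range are illustrative):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy)

def best_lag(cause, effect, max_lag=10):
    """Lag (in samples) at which `cause` correlates best with `effect`."""
    scores = {}
    for lag in range(max_lag + 1):
        n = len(cause) - lag
        scores[lag] = pearson(cause[:n], effect[lag:lag + n])
    return max(scores, key=scores.get)

# CPU spikes at t=20; latency spikes 3 samples later
cpu = [10.0] * 50
cpu[20] = 90.0
latency = [100.0] * 50
latency[23] = 400.0
print(best_lag(cpu, latency))  # → 3
```

A lag of 3 samples here matches what the stacked panels would show: the CPU spike leading the latency spike by three scrape intervals.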
Practical example: Visual anomaly dashboard for a web service
Metrics we’ll use
- http_requests_total (labels: status, service).
- http_request_duration_seconds_bucket (histogram).
- process_cpu_seconds_total, container_memory_working_set_bytes.
1. Error rate panel with visual anomalies
sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="checkout"}[5m]))
Visual tips:
- Plot as a percentage (multiply the ratio by 100 or set the panel unit to percent); a 0–5% axis keeps small changes visible.
- Threshold bands: green 0–0.5%, yellow 0.5–1%, red > 1%.
- Enable background fill when > 1% to make abnormal behaviour unmissable.
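If you also want to page once the red band is breached, the same expression drops straight into a Prometheus alerting rule; a minimal sketch (the rule, label, and severity names are illustrative):

```yaml
groups:
  - name: checkout-error-rate
    rules:
      - alert: CheckoutHighErrorRate
        expr: |
          sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{service="checkout"}[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Checkout 5xx rate above 1% for 5 minutes"
```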
2. Latency vs baseline panel
# Current P95
histogram_quantile(
  0.95,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{service="checkout"}[5m])
  )
)

# Baseline P95 over the last 7 days
histogram_quantile(
  0.95,
  avg_over_time(
    sum by (le) (
      rate(http_request_duration_seconds_bucket{service="checkout"}[5m])
    )[7d:5m]
  )
)
Visual tips:
- Use contrasting line styles (solid vs dashed).
- Visually, any spike in current that exceeds baseline by > X ms signals abnormal behaviour.
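The divergence itself can also be plotted as a single series by subtracting the two queries above, so the panel sits at zero when behaviour is normal; a sketch:

```
# Deviation of current P95 from the 7-day rolling baseline
histogram_quantile(
  0.95,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{service="checkout"}[5m])
  )
)
-
histogram_quantile(
  0.95,
  avg_over_time(
    sum by (le) (
      rate(http_request_duration_seconds_bucket{service="checkout"}[5m])
    )[7d:5m]
  )
)
```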
3. CPU and memory saturation panel
# CPU usage %
rate(process_cpu_seconds_total{service="checkout"}[5m]) * 100
# Memory usage MB
container_memory_working_set_bytes{service="checkout"} / 1024 / 1024
Visual tips:
- Use unit-aware axes (left: %, right: MB) if combined in one chart.
- Highlight CPU > 80% and rapid memory growth as abnormal system behaviour patterns.
Log-based visual anomaly detection
Metrics are not enough; many anomalies first appear in logs. The same visual detection patterns apply to log data.
Log frequency over time
In tools like Kibana, Loki, or Splunk, aggregate log counts by level:
- Total logs per minute.
- ERROR and WARN logs per minute.
- Specific patterns (e.g., “timeout”, “database unavailable”).
Then show stacked area charts by level. Sudden bursts of WARN/ERROR are instantly visible as color changes and spikes.[1]
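With Loki, for example, the per-level counts can be produced in LogQL; a sketch assuming logfmt-formatted logs that carry a level field (the {service="checkout"} selector is illustrative):

```
sum by (level) (
  count_over_time(
    {service="checkout"} | logfmt | level =~ "warn|error" [1m]
  )
)
```

Feeding this into a stacked area panel, grouped by the level label, gives the burst-of-color effect described above.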
Simple Python log anomaly detection example
This script uses a rolling baseline to flag abnormal log volume; you can export results as a time series for visualization.
import time
import random
from collections import deque
from statistics import mean, pstdev

window = deque(maxlen=60)  # per-second log counts for the last 60 seconds
threshold_sigma = 3

def is_anomalous(count, baseline):
    if len(baseline) < 10:  # not enough history for a stable baseline
        return False
    mu = mean(baseline)
    sigma = pstdev(baseline) or 1  # guard against a perfectly flat baseline
    return (count - mu) / sigma > threshold_sigma

while True:
    # Replace this with a real count of log lines seen in the last second
    count = random.randint(90, 110)
    if is_anomalous(count, window):
        print(f"ANOMALY: {count} logs/sec")  # or emit this as a metric
    window.append(count)
    time.sleep(1)