Detecting abnormal system behaviour visually

For DevOps engineers and SREs, detecting abnormal system behaviour visually is often the fastest way to understand what’s going wrong, why, and how urgently you need to act. Done well, visual detection turns noisy metrics and logs into intuitive signals you can interpret in seconds.

This post walks through practical techniques, dashboard patterns, and code snippets you can use today to make abnormal behaviour “pop out” visually in your observability stack.

Why detecting abnormal system behaviour visually matters

Most production incidents have early warning signs: a slow drift in latency, a subtle increase in 5xx responses, or a resource-saturation trend that starts hours before the outage. Metrics and logs capture these signals, but humans react quickly only when they can recognize them visually on a dashboard.[1][2]

By designing for visual anomaly detection, you:

  • Reduce MTTD (Mean Time To Detect) by making issues obvious at a glance.[2]
  • Cut alert noise by letting visuals show context before you page someone.[1]
  • Improve incident communication: everyone can see the same graphs and agree “this is wrong.”

Core visual patterns for detecting abnormal system behaviour

1. Baseline vs current: show “normal” directly on the chart

The simplest way of detecting abnormal system behaviour visually is to plot current behaviour against a historical baseline (same time last week, last N days, or a trained model output).

Example: Prometheus + Grafana, HTTP latency vs 7‑day baseline.


# Current 95th percentile latency over 5m windows
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# 7-day rolling-average baseline (for a "same time last week"
# comparison, use `offset 7d` instead of a subquery average)
histogram_quantile(
  0.95,
  avg_over_time(
    sum by (le) (rate(http_request_duration_seconds_bucket[5m]))[7d:5m]
  )
)

In your panel:

  • Series A: current latency (bold, bright color).
  • Series B: baseline (thin, muted color or dashed line).

Any divergence between the current line and baseline line becomes a visual anomaly, even before alerts fire.

2. Color-coded thresholds and bands

Static SLO/SLA thresholds are still powerful visual cues. Use bands and color changes to highlight when metrics cross “danger zones.”[1][3]

Common examples:

  • CPU > 80% for > 5 minutes.
  • Error rate > 1% of all requests.
  • Queue length > 75% of max depth.

In Grafana, configure threshold bands on a time series panel (e.g., yellow from 70–85, red above 85). The chart then encodes risk levels directly, so abnormal system behaviour is visible at a glance.
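In Grafana's panel JSON model, those bands live under the panel's field configuration. A sketch (values and colors are illustrative; adjust to your own SLOs and Grafana version):

```json
{
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 70 },
          { "color": "red", "value": 85 }
        ]
      }
    }
  }
}
```

The first step's `value` is `null` by convention, covering everything below the first explicit threshold.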

3. Anomaly detection overlays

Many modern backends (VictoriaMetrics, CloudWatch, DevOps Guru, Datadog, AppDynamics, etc.) can compute anomalies and expose them as metrics.[1][3][4]

Pattern:

  • Series A: raw metric (latency, CPU, throughput).
  • Series B: anomaly score or boolean (0 = normal, 1 = anomaly), plotted as:

# Example: flag samples more than 3 standard deviations above the
# 30-minute mean. Note: use rate() here, not the raw counter --
# http_requests_total is monotonically increasing, so a z-score on
# the raw series would just measure growth.
(
  rate(http_requests_total[5m])
  - avg_over_time(rate(http_requests_total[5m])[30m:1m])
)
/
stddev_over_time(rate(http_requests_total[5m])[30m:1m]) > 3

Then show Series B as points or annotations. Spikes in the anomaly series visually mark abnormal behaviour without hiding the raw signal.

4. Correlated layouts: side-by-side cause and effect

DevOps and SRE workflows depend on spotting relationships: “CPU jumped, then latency spiked, then errors increased.”[2]

Make this correlation visual by arranging panels vertically:

  1. Top: request rate (RPS/QPS).
  2. Next: latency (P50/P95/P99).
  3. Next: error rate (4xx/5xx).
  4. Bottom: infrastructure (CPU, memory, I/O, DB connections).

When all lines move together, it’s clearly a system-wide issue. When only one layer moves, you can localize the problem to that layer at a glance.
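The “move together” judgment you make by eye can also be quantified. A minimal sketch with synthetic per-minute samples (the series values are made up for illustration):

```python
from statistics import mean

def pearson(xs, ys):
  # Pearson correlation coefficient between two equal-length series.
  mx, my = mean(xs), mean(ys)
  cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
  varx = sum((x - mx) ** 2 for x in xs)
  vary = sum((y - my) ** 2 for y in ys)
  return cov / (varx * vary) ** 0.5

# Synthetic 10-minute window: CPU % and P95 latency (ms), sampled per minute.
# Both jump at minute 4, as in a saturation-driven latency spike.
cpu = [40, 42, 41, 43, 80, 85, 88, 90, 87, 86]
latency = [120, 118, 121, 119, 300, 340, 360, 380, 370, 365]

print(f"correlation: {pearson(cpu, latency):.2f}")  # close to 1.0
```

A coefficient near 1.0 supports the “layers moving together” reading; a low value suggests the anomaly is confined to one layer.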

Practical example: Visual anomaly dashboard for a web service

Metrics we’ll use

  • http_requests_total (labels: status, service).
  • http_request_duration_seconds_bucket (histogram).
  • process_cpu_seconds_total, container_memory_working_set_bytes.

1. Error rate panel with visual anomalies


sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="checkout"}[5m]))

Visual tips:

  • Plot as percentage (0–5%).
  • Threshold bands: green 0–0.5%, yellow 0.5–1%, red > 1%.
  • Enable background fill when > 1% to make abnormal behaviour unmissable.

2. Latency vs baseline panel


# Current P95
histogram_quantile(
  0.95,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{service="checkout"}[5m])
  )
)

# Baseline P95 over last 7 days
histogram_quantile(
  0.95,
  avg_over_time(
    sum by (le) (
      rate(http_request_duration_seconds_bucket{service="checkout"}[5m])
    )[7d:5m]
  )
)

Visual tips:

  • Use contrasting line styles (solid vs dashed).
  • Visually, any spike in current that exceeds baseline by > X ms signals abnormal behaviour.

3. CPU and memory saturation panel


# CPU usage (% of one core)
rate(process_cpu_seconds_total{service="checkout"}[5m]) * 100

# Memory usage MB
container_memory_working_set_bytes{service="checkout"} / 1024 / 1024

Visual tips:

  • Use unit-aware axes (left: %, right: MB) if combined in one chart.
  • Highlight CPU > 80% and rapid memory growth as abnormal system behaviour patterns.

Log-based visual anomaly detection

Metrics are not enough; many anomalies first appear in logs. The same visual anomaly detection patterns apply to log data.

Log frequency over time

In tools like Kibana, Loki, or Splunk, aggregate log counts by level:

  • Total logs per minute.
  • ERROR and WARN logs per minute.
  • Specific patterns (e.g., “timeout”, “database unavailable”).

Then show stacked area charts by level. Sudden bursts of WARN/ERROR are instantly visible as color changes and spikes.[1]
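With Loki, for example, the per-level counts can come from a LogQL query (a sketch assuming logfmt-formatted lines that carry a level field; adjust the selector and parser to your setup):

```logql
sum by (level) (
  count_over_time({service="checkout"} | logfmt [1m])
)
```

One series per log level, stacked in the panel, gives you the burst-of-red effect described above.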

Simple Python log anomaly detection example

This script uses a rolling baseline to flag abnormal log volume; you can export results as a time series for visualization.


import time
from collections import deque
from statistics import mean, pstdev

window = deque(maxlen=60)  # rolling baseline: last 60 one-second counts
threshold_sigma = 3

def is_anomalous(count, baseline):
  if len(baseline) < 10:
    return False  # not enough history for a stable baseline
  mu = mean(baseline)
  sigma = pstdev(baseline) or 1  # guard against zero variance
  return (count - mu) / sigma > threshold_sigma

while True:
  # Replace this with your real per-second log count, e.g. by
  # tailing a file or querying your log backend.
  count = get_log_count_last_second()  # hypothetical collector
  if is_anomalous(count, window):
    print(f"{int(time.time())} ANOMALY log_count={count}")
  window.append(count)
  time.sleep(1)
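To sanity-check the detector offline before wiring it to real logs, you can replay synthetic counts with an injected burst. A self-contained sketch using the same 3-sigma rule (traffic numbers are made up):

```python
from collections import deque
from statistics import mean, pstdev

def is_anomalous(count, baseline, threshold_sigma=3):
  # Same z-score check as the streaming detector above.
  if len(baseline) < 10:
    return False
  mu = mean(baseline)
  sigma = pstdev(baseline) or 1
  return (count - mu) / sigma > threshold_sigma

window = deque(maxlen=60)
# 15 seconds of steady traffic (~100 logs/s), then a 10x burst.
counts = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101,
          99, 100, 102, 98, 100] + [1000]

flagged = []
for t, c in enumerate(counts):
  if is_anomalous(c, window):
    flagged.append(t)
  window.append(c)

print(flagged)  # only the burst second is flagged
```

Exporting `flagged` (or the raw z-scores) as a time series gives you the anomaly overlay described in the metrics section.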