Telemetry Sampling Strategies for Cost Control: A Practical Guide for DevOps Engineers and SREs
Published: May 12, 2026
In modern microservices architectures, observability telemetry—metrics, traces, and logs—powers incident response, performance optimization, and business decisions. But as systems scale, telemetry costs explode. A single high-traffic service can generate 1.1 million traces per minute, costing $237,600/month at 100% sampling (AWS X-Ray pricing, eu-west-1, Jan 2024). Drop to 0.1% sampling, and you're at $237/month—a 99.9% reduction.
This guide delivers telemetry sampling strategies for cost control that SREs and DevOps engineers can implement today. We'll cover head vs. tail sampling, cardinality governance, OpenTelemetry configurations, and FinOps integration with real-world examples and code snippets.
Why Telemetry Sampling is Your #1 Cost Control Lever
Observability costs break down into ingestion (60-70%), storage (20-25%), and query compute (10-15%). Traces dominate due to their verbosity in distributed systems, followed by high-cardinality logs and metrics.
- Traces: Exponential growth in microservices (one request = 50+ spans)
- Logs: Verbose, unstructured, high-cardinality fields
- Metrics: Explode with unbounded labels (service{team="alpha", env="prod", version="1.2.3"} is fine, but add user_id or request_id and a single metric becomes millions of series)
Sampling reduces volume while preserving signal. The key: retain high-value signals (errors, latency outliers, critical paths) while dropping noise.
Head Sampling vs. Tail Sampling: Choose Your Strategy
Head Sampling: Simple, Early Decision-Making
Head sampling makes the keep/drop decision when the trace starts, keeping a fixed percentage of requests. It is fast and cheap, but blind to outcomes, so rare failures are mostly dropped.
# OpenTelemetry Collector - head sampling config
processors:
  probabilistic_sampler:
    sampling_percentage: 0.1  # 0.1% sampling rate
    hash_seed: 42             # consistent hashing seed across collectors

service:
  pipelines:
    traces:
      processors: [probabilistic_sampler]
Pros: Zero added latency, predictable volume and spend
Cons: At 0.1% sampling, 99.9% of error traces are dropped along with everything else
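If you prefer to make the head-sampling decision in the application SDK rather than the collector, the standard OpenTelemetry environment variables support the same ratio-based approach. The fragment below is a sketch of a Kubernetes container spec; the container name and image are placeholders.
# Sketch: SDK-side head sampling via standard OTel env vars (names and image are illustrative)
containers:
  - name: checkout-api
    image: registry.example.com/checkout-api:1.4.2
    env:
      - name: OTEL_TRACES_SAMPLER
        value: "parentbased_traceidratio"  # honor the parent's decision; otherwise sample by ratio
      - name: OTEL_TRACES_SAMPLER_ARG
        value: "0.001"                     # sample 0.1% of root traces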
Tail Sampling: Intelligent, Post-Completion Decisions
Tail sampling buffers complete traces, then applies policies based on error status, latency, or span attributes. That makes it ideal for incident analysis.
# OpenTelemetry Collector - tail sampling (requires the tail_sampling processor)
processors:
  tail_sampling:
    decision_wait: 10s   # buffer time before a sampling decision
    num_traces: 10000    # max traces held in memory
    policies:
      # Keep 100% of error traces
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep slow traces (>500ms)
      - name: latency
        type: latency
        latency:
          threshold_ms: 500
      # Keep a 20% probabilistic baseline of all other traces
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 20
Real cost math: 1.1M traces/min → $237K/month (100%) vs. $370/month (tail sampling + t2.xlarge sampler) = 99.8% savings.
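For transparency, here is the arithmetic behind those headline figures, assuming AWS X-Ray's list price of roughly $5 per 1 million traces recorded and a 30-day month:
1.1M traces/min × 60 × 24 × 30 ≈ 47,520M traces/month
47,520M traces × $5 / 1M ≈ $237,600/month at 100% sampling
47,520M × 0.1% × $5 / 1M ≈ $237/month at 0.1% head sampling
The tail-sampling figure additionally depends on how many traces actually match your error and latency policies, plus the cost of the sampler instance itself.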
Practical Telemetry Sampling Strategies for Cost Control
1. Service-Tiered Sampling Policies
Apply different rates by criticality; a collector sketch for the top tier follows the table:
| Service Tier | Head Sample Rate | Tail Policies | Expected Reduction |
|---|---|---|---|
| Customer-facing (payments, auth) | 1% | 100% errors + P99.9 latency | 95% |
| Internal APIs | 0.1% | 50% errors + P99 latency | 99% |
| Background jobs | 0.01% | Errors only | 99.9% |
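One way to express the customer-facing tier in collector configuration is to combine a service-name match with the error policy using the tail_sampling processor's and policy. The sketch below keeps every error trace from the named services; the service names are illustrative, and the latency and baseline policies from the earlier example would sit alongside it.
# Sketch: tier-aware tail sampling (service names are illustrative)
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Keep 100% of error traces from tier-1 (customer-facing) services
      - name: tier1-errors
        type: and
        and:
          and_sub_policy:
            - name: tier1-services
              type: string_attribute
              string_attribute:
                key: service.name
                values: [payments, auth]
            - name: tier1-status
              type: status_code
              status_code:
                status_codes: [ERROR]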
2. Cardinality Budgets for Metrics
High-cardinality labels blow up metric spend. Enforce cardinality budgets:
# Prometheus cardinality guard - alert when a metric's series count explodes
groups:
  - name: cardinality_alerts
    rules:
      - alert: HighCardinalityMetrics
        # Fire when more than 10 (job, instance, team) groupings of this metric
        # each carry over 1000 series
        expr: |
          count by (__name__) (
            count by (__name__, job, instance, team) (
              myapp_http_requests_total{team=~"alpha|beta"}
            ) > 1000
          ) > 10
        for: 5m
Approved label taxonomy:
- ✅ {service, env, team} (low cardinality)
- ❌ {user_id, session_id, request_id} (unbounded)
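One way to enforce this taxonomy before storage is to drop unbounded labels at scrape time. The snippet below is a sketch of a Prometheus scrape job using metric_relabel_configs; the job name and target are placeholders.
# Sketch: strip unbounded labels at scrape time (job name and target are placeholders)
scrape_configs:
  - job_name: myapp
    static_configs:
      - targets: ["myapp:8080"]
    metric_relabel_configs:
      # Drop high-cardinality labels before the samples are ingested
      - action: labeldrop
        regex: "user_id|session_id|request_id"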
3. Log Sampling with Pattern Deduplication
Sample logs by level and deduplicate patterns:
# Vector.dev log sampling config
[sources.logs]
type = "kubernetes_logs"

[transforms.sample]
type = "sample"
inputs = ["logs"]
rate = 20                       # keep 1 in 20 events (~5%)

[transforms.filter_debug]
type = "filter"
inputs = ["sample"]
condition = '.level != "debug"'

[transforms.dedupe]
type = "dedupe"
inputs = ["filter_debug"]
cache.num_events = 100_000      # dedupe window size
Implementing Tail Sampling at Scale: Architecture Patterns
Pattern 1: OpenTelemetry Collector Deployment
- Deploy the collector as a DaemonSet (one agent per node), not a per-pod sidecar
- Configure the tail_sampling processor on the collectors that make the sampling decision
- Load balance samplers over gRPC/OTLP (Envoy L7 or the collector's loadbalancing exporter), routing by trace ID so all spans of a trace reach the same sampler; see the routing sketch after the manifest
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          # Collector config is mounted at /conf from a ConfigMap (volume omitted for brevity)
          args: ["--config=/conf/otel-collector-config.yaml"]
          resources:
            limits:
              cpu: "1"
              memory: "2Gi"
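Tail sampling only works if every span of a trace reaches the same sampler, so the agent layer should route by trace ID. A common way to do this is the contrib collector's loadbalancing exporter pointed at a headless Service in front of the sampler tier; the hostname below is a placeholder.
# Sketch: trace-ID-aware routing from node agents to the tail-sampling tier
exporters:
  loadbalancing:
    routing_key: traceID            # keep all spans of a trace on one sampler
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-sampler-headless.observability.svc.cluster.local

service:
  pipelines:
    traces:
      exporters: [loadbalancing]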
Pattern 2: Kubernetes Service Mesh Integration
Linkerd or Istio can inject sampling at the proxy layer:
# Linkerd ServiceProfile - route-level trace sampling
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: critical-service.default.svc.cluster.local
spec:
  routes:
    - name: POST /api/payment
      responseClasses:
        - isRetryable: true
          condition:
            response:
              status:
                range: 5xx
      sampleProbability: 1.0  # 100% error sampling
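On the Istio side, a baseline sampling rate can be set declaratively with the Telemetry API rather than per-route profiles. The sketch below assumes a recent Istio release and applies a 1% random sampling rate to one namespace; the namespace name is a placeholder.
# Sketch: Istio Telemetry API - 1% baseline trace sampling for a namespace
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: default
  namespace: payments
spec:
  tracing:
    - randomSamplingPercentage: 1.0   # sample 1% of requests at the sidecar proxy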
FinOps Integration: Cost Attribution and Guardrails
Showback drives behavior change. Tag telemetry by team/service:
# OpenTelemetry Collector resource processor - tag telemetry for cost attribution
processors:
  resource:
    attributes:
      - key: service.team
        value: "alpha"            # one owning team per service/deployment
        action: upsert
      - key: service.cost_center
        value: "observability"
        action: upsert
Guardrails checklist:
- Automated alerts when ingestion exceeds 110% of budget (alert sketch below)
- Policy-as-code: sampling-rate changes go through PR review
- Weekly cost reviews with each team plus FinOps
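As a concrete version of the first guardrail, you can alert on the collector's own self-telemetry when span ingestion runs ahead of budget. The rule below is a sketch: the 50000 spans/s budget is a made-up number, and the otelcol_receiver_accepted_spans metric name can vary between collector versions.
# Sketch: alert when span ingestion exceeds 110% of a budgeted rate
# (50000 spans/s is a hypothetical budget; metric naming varies by collector version)
groups:
  - name: telemetry_budget
    rules:
      - alert: SpanIngestionOverBudget
        expr: |
          sum(rate(otelcol_receiver_accepted_spans[5m])) > 1.10 * 50000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Span ingestion is above 110% of the agreed telemetry budget"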