Telemetry Sampling Strategies for Cost Control: A Practical Guide for DevOps Engineers and SREs
Published: May 12, 2026
In modern microservices architectures, observability telemetry—metrics, traces, and logs—powers incident response, performance optimization, and business decisions. But as systems scale, telemetry costs explode. A single high-traffic service can generate 1.1 million traces per minute, costing $237,600/month at 100% sampling (AWS X-Ray pricing, eu-west-1, Jan 2024). Drop to 0.1% sampling, and you're at $237/month—a 99.9% reduction.
This guide delivers telemetry sampling strategies for cost control that SREs and DevOps engineers can implement today. We'll cover head vs. tail sampling, cardinality governance, OpenTelemetry configurations, and FinOps integration with real-world examples and code snippets.
Why Telemetry Sampling is Your #1 Cost Control Lever
Observability costs break down into ingestion (60-70%), storage (20-25%), and query compute (10-15%). Traces dominate due to their verbosity in distributed systems, followed by high-cardinality logs and metrics.
- Traces: Exponential growth in microservices (one request = 50+ spans)
- Logs: Verbose, unstructured, high-cardinality fields
- Metrics: Explode with unbounded labels (service{team="alpha", env="prod", version="1.2.3"} is fine, but add user_id or request_id and a single metric becomes millions of series)
Sampling reduces volume while preserving signal. The key: retain high-value signals (errors, latency outliers, critical paths) while dropping noise.
Head Sampling vs. Tail Sampling: Choose Your Strategy
Head Sampling: Simple, Early Decision-Making
Head sampling makes the keep/drop decision when the trace starts, keeping a fixed percentage of requests. It is fast and cheap, but blind to outcomes, so rare failures are mostly dropped.
# OpenTelemetry Collector - head sampling config
processors:
  probabilistic_sampler:
    sampling_percentage: 0.1  # 0.1% sampling rate
    hash_seed: 42             # consistent hashing seed across collectors

service:
  pipelines:
    traces:
      processors: [probabilistic_sampler]
Pros: Zero added latency, predictable volume and spend
Cons: At 0.1% sampling, 99.9% of error traces are dropped along with everything else
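If you prefer to make the head-sampling decision in the application SDK rather than the collector, the standard OpenTelemetry environment variables support the same ratio-based approach. The fragment below is a sketch of a Kubernetes container spec; the container name and image are placeholders.
# Sketch: SDK-side head sampling via standard OTel env vars (names and image are illustrative)
containers:
  - name: checkout-api
    image: registry.example.com/checkout-api:1.4.2
    env:
      - name: OTEL_TRACES_SAMPLER
        value: "parentbased_traceidratio"  # honor the parent's decision; otherwise sample by ratio
      - name: OTEL_TRACES_SAMPLER_ARG
        value: "0.001"                     # sample 0.1% of root traces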
Tail Sampling: Intelligent, Post-Completion Decisions
Tail sampling buffers complete traces, then applies policies based on error status, latency, or span attributes. That makes it ideal for incident analysis.
# OpenTelemetry Collector - tail sampling (requires the tail_sampling processor)
processors:
  tail_sampling:
    decision_wait: 10s   # buffer time before a sampling decision
    num_traces: 10000    # max traces held in memory
    policies:
      # Keep 100% of error traces
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep slow traces (>500ms)
      - name: latency
        type: latency
        latency:
          threshold_ms: 500
      # Keep a 20% probabilistic baseline of all other traces
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 20
Real cost math: 1.1M traces/min → $237K/month (100%) vs. $370/month (tail sampling + t2.xlarge sampler) = 99.8% savings.
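For transparency, here is the arithmetic behind those headline figures, assuming AWS X-Ray's list price of roughly $5 per 1 million traces recorded and a 30-day month:
1.1M traces/min × 60 × 24 × 30 ≈ 47,520M traces/month
47,520M traces × $5 / 1M ≈ $237,600/month at 100% sampling
47,520M × 0.1% × $5 / 1M ≈ $237/month at 0.1% head sampling
The tail-sampling figure additionally depends on how many traces actually match your error and latency policies, plus the cost of the sampler instance itself.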
Practical Telemetry Sampling Strategies for Cost Control
1. Service-Tiered Sampling Policies
Apply different rates by criticality; a collector sketch for the top tier follows the table:
| Service Tier | Head Sample Rate | Tail Policies | Expected Reduction |
|---|---|---|---|
| Customer-facing (payments, auth) | 1% | 100% errors + P99.9 latency | 95% |
| Internal APIs | 0.1% | 50% errors + P99 latency | 99% |
| Background jobs | 0.01% | Errors only | 99.9% |
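One way to express the customer-facing tier in collector configuration is to combine a service-name match with the error policy using the tail_sampling processor's and policy. The sketch below keeps every error trace from the named services; the service names are illustrative, and the latency and baseline policies from the earlier example would sit alongside it.
# Sketch: tier-aware tail sampling (service names are illustrative)
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Keep 100% of error traces from tier-1 (customer-facing) services
      - name: tier1-errors
        type: and
        and:
          and_sub_policy:
            - name: tier1-services
              type: string_attribute
              string_attribute:
                key: service.name
                values: [payments, auth]
            - name: tier1-status
              type: status_code
              status_code:
                status_codes: [ERROR]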
2. Cardinality Budgets for Metrics
High-cardinality labels blow up metric spend. Enforce cardinality budgets:
# Prometheus cardinality guard - alert when a metric's series count explodes
groups:
  - name: cardinality_alerts
    rules:
      - alert: HighCardinalityMetrics
        # Fire when more than 10 (job, instance, team) groupings of this metric
        # each carry over 1000 series
        expr: |
          count by (__name__) (
            count by (__name__, job, instance, team) (
              myapp_http_requests_total{team=~"alpha|beta"}
            ) > 1000
          ) > 10
        for: 5m
Approved label taxonomy:
- ✅ {service, env, team} (low cardinality)
- ❌ {user_id, session_id, request_id} (unbounded)
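One way to enforce this taxonomy before storage is to drop unbounded labels at scrape time. The snippet below is a sketch of a Prometheus scrape job using metric_relabel_configs; the job name and target are placeholders.
# Sketch: strip unbounded labels at scrape time (job name and target are placeholders)
scrape_configs:
  - job_name: myapp
    static_configs:
      - targets: ["myapp:8080"]
    metric_relabel_configs:
      # Drop high-cardinality labels before the samples are ingested
      - action: labeldrop
        regex: "user_id|session_id|request_id"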
3. Log Sampling with Pattern Deduplication
Sample logs by level and deduplicate patterns:
# Vector.dev log sampling config
[sources.logs]
type = "kubernetes_logs"

[transforms.sample]
type = "sample"
inputs = ["logs"]
rate = 20                       # keep 1 in 20 events (~5%)

[transforms.filter_debug]
type = "filter"
inputs = ["sample"]
condition = '.level != "debug"'

[transforms.dedupe]
type = "dedupe"
inputs = ["filter_debug"]
cache.num_events = 100_000      # dedupe window size
Implementing Tail Sampling at Scale: Architecture Patterns
Pattern 1: OpenTelemetry Collector Deployment
- Deploy the collector as a DaemonSet (one agent per node), not a per-pod sidecar
- Configure the tail_sampling processor on the collectors that make the sampling decision
- Load balance samplers over gRPC/OTLP (Envoy L7 or the collector's loadbalancing exporter), routing by trace ID so all spans of a trace reach the same sampler; see the routing sketch after the manifest
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          # Collector config is mounted at /conf from a ConfigMap (volume omitted for brevity)
          args: ["--config=/conf/otel-collector-config.yaml"]
          resources:
            limits:
              cpu: "1"
              memory: "2Gi"
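Tail sampling only works if every span of a trace reaches the same sampler, so the agent layer should route by trace ID. A common way to do this is the contrib collector's loadbalancing exporter pointed at a headless Service in front of the sampler tier; the hostname below is a placeholder.
# Sketch: trace-ID-aware routing from node agents to the tail-sampling tier
exporters:
  loadbalancing:
    routing_key: traceID            # keep all spans of a trace on one sampler
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-sampler-headless.observability.svc.cluster.local

service:
  pipelines:
    traces:
      exporters: [loadbalancing]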
Pattern 2: Kubernetes Service Mesh Integration
Linkerd or Istio can inject sampling at the proxy layer:
# Linkerd ServiceProfile - route-level trace sampling
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: critical-service.default.svc.cluster.local
spec:
  routes:
    - name: POST /api/payment
      responseClasses:
        - isRetryable: true
          condition:
            response:
              status:
                range: 5xx
      sampleProbability: 1.0  # 100% error sampling
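On the Istio side, a baseline sampling rate can be set declaratively with the Telemetry API rather than per-route profiles. The sketch below assumes a recent Istio release and applies a 1% random sampling rate to one namespace; the namespace name is a placeholder.
# Sketch: Istio Telemetry API - 1% baseline trace sampling for a namespace
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: default
  namespace: payments
spec:
  tracing:
    - randomSamplingPercentage: 1.0   # sample 1% of requests at the sidecar proxy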
FinOps Integration: Cost Attribution and Guardrails
Showback drives behavior change. Tag telemetry by team/service:
# OpenTelemetry Collector resource processor - tag telemetry for cost attribution
processors:
  resource:
    attributes:
      - key: service.team
        value: "alpha"            # one owning team per service/deployment
        action: upsert
      - key: service.cost_center
        value: "observability"
        action: upsert
Guardrails checklist:
- Automated alerts when ingestion exceeds 110% of budget (alert sketch below)
- Policy-as-code: sampling-rate changes go through PR review
- Weekly cost reviews with each team plus FinOps
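As a concrete version of the first guardrail, you can alert on the collector's own self-telemetry when span ingestion runs ahead of budget. The rule below is a sketch: the 50000 spans/s budget is a made-up number, and the otelcol_receiver_accepted_spans metric name can vary between collector versions.
# Sketch: alert when span ingestion exceeds 110% of a budgeted rate
# (50000 spans/s is a hypothetical budget; metric naming varies by collector version)
groups:
  - name: telemetry_budget
    rules:
      - alert: SpanIngestionOverBudget
        expr: |
          sum(rate(otelcol_receiver_accepted_spans[5m])) > 1.10 * 50000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Span ingestion is above 110% of the agreed telemetry budget"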