Unified Monitoring for Multi-Cloud Ecosystems
As a South African SRE working with teams spread across Johannesburg, Cape Town, London, and Dublin, I have learned that Unified Monitoring for Multi-Cloud Ecosystems is not a luxury — it is survival. When your workloads span AWS in eu-west-1, Azure in westeurope, and GCP in europe-west4, the last thing you want at 3 a.m. is five tabs of CloudWatch, Azure Monitor, and Cloud Logging competing for your attention.
In this post, I will walk through how to build Unified Monitoring for Multi-Cloud Ecosystems using Grafana as the single pane of glass, with practical patterns and snippets you can adapt for your own DevOps and SRE workflows.
Why Unified Monitoring for Multi-Cloud Ecosystems Matters
Unified Monitoring for Multi-Cloud Ecosystems is about building a single, consistent observability layer across AWS, Azure, GCP, and on‑prem so that you can detect, debug, and remediate incidents without stitching together multiple dashboards and CLIs.[2] Multi‑cloud observability best practices emphasize a unified view of performance, health, and security across providers, aggregating logs, metrics, and traces into one platform.[2][4] This single pane of glass drastically reduces your mean time to detect (MTTD) and mean time to resolve (MTTR).[2][5]
From an SRE perspective, especially in cost‑sensitive South African environments where bandwidth and latency to EU or US regions matter, unified monitoring helps you:
- Avoid blind spots between providers.
- Standardize SLOs and error budgets across regions.
- Control observability costs by centralizing telemetry.
- Reduce cognitive load for on‑call engineers.
Architecture: Grafana as the Single Pane of Glass
Grafana is an open‑source analytics and monitoring platform that visualizes data across multiple sources in unified dashboards.[1] It integrates natively with AWS CloudWatch, Azure Monitor, and Google Cloud Monitoring, enabling you to track performance and detect anomalies across all environments from one console.[1][3]
A typical reference architecture for Unified Monitoring for Multi-Cloud Ecosystems with Grafana looks like this:[2][5]
- Deploy a metrics/logs agent (Prometheus, OpenTelemetry Collector, Fluent Bit) into each cloud.[2][5]
- Standardize labels and resource identifiers (e.g., cloud, region, env, service) across all telemetry.[2]
- Forward telemetry to a central observability backend: a Grafana stack, Grafana Cloud, or a self‑hosted time‑series database plus log store.[2][5]
- Connect cloud‑native monitoring (CloudWatch, Azure Monitor, GCP Monitoring) to Grafana as additional data sources.[1][3]
- Build cross‑cloud dashboards, SLOs, and alerts that slice by provider and region instead of logging into each provider’s portal.[2][4]
Connecting Cloud Providers to Grafana
Here are the key steps to connect each cloud to Grafana for unified dashboards:[1]
- AWS: Create an IAM identity with read-only CloudWatch permissions, then configure the CloudWatch data source in Grafana with either access keys or an assumed role.[1]
- Azure: Register an Azure AD (now Microsoft Entra ID) application, grant it the Monitoring Reader role, and configure the Azure Monitor data source.[1]
- GCP: Create a service account with Monitoring Viewer role and upload its JSON key to the Google Cloud Monitoring (Stackdriver) data source.[1][5]
To keep things reproducible across our SA and EU environments, we use Grafana provisioning via YAML to automate these data sources.[1]
Provisioning Multi-Cloud Data Sources in Grafana
Grafana supports data source provisioning through simple YAML files. Below is a practical example that configures three data sources: AWS CloudWatch, Azure Monitor, and GCP Monitoring.
apiVersion: 1
datasources:
  - name: AWS-CloudWatch
    type: cloudwatch
    access: proxy
    isDefault: false
    jsonData:
      authType: keys
      defaultRegion: eu-west-1
    secureJsonData:
      accessKey: $__file{/etc/grafana/secrets/aws_access_key}
      secretKey: $__file{/etc/grafana/secrets/aws_secret_key}
  - name: Azure-Monitor
    type: grafana-azure-monitor-datasource
    access: proxy
    jsonData:
      tenantId: "<azure-tenant-id>"
      clientId: "<azure-app-id>"
      cloudName: azuremonitor
    secureJsonData:
      clientSecret: "<azure-client-secret>"
  - name: GCP-Monitoring
    type: stackdriver
    access: proxy
    jsonData:
      defaultProject: "my-gcp-project"
      authenticationType: jwt
      tokenUri: "https://oauth2.googleapis.com/token"
      clientEmail: "<service-account-email>"
    secureJsonData:
      # JWT auth takes the service account's private key (plus clientEmail above),
      # not the whole JSON key file. The path below is a placeholder.
      privateKey: $__file{/etc/grafana/secrets/gcp_private_key}
In my setups, these YAML files live in a Git repository, pushed through CI/CD to a Grafana instance running either on‑prem in Johannesburg or in a neutral cloud region. This approach gives us GitOps for observability — any changes to the multi-cloud configuration go through code review.
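Because these files go through code review, it can help to have CI reject obvious mistakes before a human looks at them. The following is a minimal sketch of such a check, not part of Grafana itself. It assumes the YAML has already been parsed into a Python dict (for example with PyYAML's `yaml.safe_load`), and the function name `lint_datasources` and its rules are my own illustrative choices. It treats any secret value that does not start with `$` as an inline secret, which is a simplification.

```python
from typing import Any

REQUIRED_KEYS = ("name", "type")


def lint_datasources(doc: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in a parsed provisioning document."""
    problems: list[str] = []
    seen_names: set[str] = set()
    for index, ds in enumerate(doc.get("datasources", [])):
        label = ds.get("name", f"datasource #{index}")
        for key in REQUIRED_KEYS:
            if key not in ds:
                problems.append(f"{label}: missing required key '{key}'")
        name = ds.get("name")
        if name is not None:
            if name in seen_names:
                problems.append(f"{label}: duplicate data source name")
            seen_names.add(name)
        # Assumption: secrets committed to Git should be references such as
        # $__file{...} or $ENV_VAR, so anything not starting with "$" is flagged.
        for field, value in ds.get("secureJsonData", {}).items():
            if not str(value).startswith("$"):
                problems.append(f"{label}: secureJsonData.{field} is an inline value")
    return problems


example = {
    "apiVersion": 1,
    "datasources": [
        {
            "name": "AWS-CloudWatch",
            "type": "cloudwatch",
            "secureJsonData": {
                "accessKey": "$__file{/etc/grafana/secrets/aws_access_key}",
            },
        },
        {
            "name": "Azure-Monitor",
            "type": "grafana-azure-monitor-datasource",
            "secureJsonData": {"clientSecret": "<azure-client-secret>"},
        },
    ],
}

for problem in lint_datasources(example):
    print(problem)
```

In a pipeline, a non-empty result would fail the build, so an inline client secret never reaches the main branch.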
Standardizing Metrics Across Clouds
One of the biggest challenges in Unified Monitoring for Multi-Cloud Ecosystems is that each provider names and labels things differently.[2][5] To make a single dashboard meaningful, you need a common vocabulary.
We enforce a consistent labeling scheme on all metrics and logs via Prometheus and OpenTelemetry Collector:[2][5]
- cloud: aws, azure, gcp, onprem
- region: eu-west-1, westeurope, europe-west4, za-jhb-1
- env: prod, staging, dev
- service: logical service name, not provider-specific (e.g., payments-api)
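To make the scheme concrete, here is a small sketch of how such a vocabulary could be enforced in a test or an ingestion hook. It is an illustration only: the alias names in `ALIASES` (such as `cloud_provider` and `location`) are hypothetical examples of provider-specific keys, and `normalize_labels` is not a Prometheus or OpenTelemetry API.

```python
ALLOWED_CLOUDS = {"aws", "azure", "gcp", "onprem"}
ALLOWED_ENVS = {"prod", "staging", "dev"}
REQUIRED_LABELS = ("cloud", "region", "env", "service")

# Hypothetical provider-specific label keys mapped to the standard ones.
ALIASES = {
    "cloud_provider": "cloud",
    "location": "region",
    "environment": "env",
    "app": "service",
}


def normalize_labels(raw: dict[str, str]) -> dict[str, str]:
    """Rename aliased keys, lowercase the standard labels, and validate them."""
    labels = {ALIASES.get(key, key): value for key, value in raw.items()}
    missing = [name for name in REQUIRED_LABELS if name not in labels]
    if missing:
        raise ValueError(f"missing standard labels: {', '.join(missing)}")
    # Only the four standard labels are normalized; other labels pass through.
    for name in REQUIRED_LABELS:
        labels[name] = labels[name].strip().lower()
    if labels["cloud"] not in ALLOWED_CLOUDS:
        raise ValueError(f"unknown cloud: {labels['cloud']}")
    if labels["env"] not in ALLOWED_ENVS:
        raise ValueError(f"unknown env: {labels['env']}")
    return labels


print(normalize_labels({
    "cloud_provider": "Azure",
    "location": "westeurope",
    "environment": "prod",
    "app": "payments-api",
}))
```

Rejecting unknown values early keeps dashboards from silently splitting one service across two differently spelled label values.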
Here is a minimal Prometheus scrape config snippet for Kubernetes clusters in multiple clouds:
scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__meta_kubernetes_node_label_topology_kubernetes_io_region]
        target_label: region
      - source_labels: [__meta_kubernetes_node_label_topology_kubernetes_io_zone]
        target_label: zone
      - source_labels: [__meta_kubernetes_node_label_cloud_google_com_gke_nodepool]
        target_label: nodepool
      - target_label: cloud
        replacement: 'aws' # override per cluster
      - target_label: env
        replacement: 'prod'
For Azure or GCP clusters, we change the cloud replacement value and, where needed, map provider-specific labels to these standard ones.
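Rather than hand-editing the relabel block for each cluster, the per-cloud differences can be generated. The sketch below is one way to do that under some assumptions: the node pool label names in `NODEPOOL_LABELS` are what I understand EKS, AKS, and GKE to set on nodes, so verify them against your own clusters, and the helper names are mine. The `meta_label` function mirrors how Prometheus exposes node labels, replacing unsupported characters with underscores.

```python
import json
import re

# Assumed node pool label per provider; check `kubectl get nodes --show-labels`.
NODEPOOL_LABELS = {
    "aws": "eks.amazonaws.com/nodegroup",
    "azure": "kubernetes.azure.com/agentpool",
    "gcp": "cloud.google.com/gke-nodepool",
}


def meta_label(node_label: str) -> str:
    """Build the Prometheus service-discovery meta label for a node label."""
    return "__meta_kubernetes_node_label_" + re.sub(r"[^a-zA-Z0-9_]", "_", node_label)


def relabel_rules(cloud: str, env: str) -> list[dict]:
    """Return the cluster-specific relabel rules as plain dicts."""
    rules: list[dict] = []
    if cloud in NODEPOOL_LABELS:
        rules.append({
            "source_labels": [meta_label(NODEPOOL_LABELS[cloud])],
            "target_label": "nodepool",
        })
    # Static labels: no source_labels, so the replacement is always applied.
    rules.append({"target_label": "cloud", "replacement": cloud})
    rules.append({"target_label": "env", "replacement": env})
    return rules


print(json.dumps(relabel_rules("azure", "prod"), indent=2))
```

The resulting list can be serialized into the `relabel_configs` section by whatever templating step your pipeline already uses.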
Building Cross-Cloud Dashboards in Grafana
Once data is normalized, Grafana makes it straightforward to build dashboards that cut across providers.[1][2] For example, we maintain a “Multi‑Cloud API Health” dashboard that lets our on‑call engineer quickly see error rates and latencies per cloud.
A typical PromQL query we use:
sum by (cloud, region, service) (
  rate(http_requests_total{
    env="prod",
    service="payments-api",
    status=~"5.."
  }[5m])
)
And for latency (95th percentile):
histogram_quantile(
  0.95,
  sum by (le, cloud, region, service) (
    rate(http_request_duration_seconds_bucket{
      env="prod",
      service="payments-api"
    }[5m])
  )
)
In Grafana, I add template variables for $cloud, $region, and $env so that the S