Unified Monitoring for Multi-Cloud Ecosystems

As a South African SRE working with teams spread across Johannesburg, Cape Town, London, and Dublin, I have learned that Unified Monitoring for Multi-Cloud Ecosystems is not a luxury — it is survival. When your workloads span AWS in eu-west-1, Azure in westeurope, and GCP in europe-west4, the last thing you want at 3 a.m. is five tabs of CloudWatch, Azure Monitor, and Cloud Logging competing for your attention.

In this post, I will walk through how to build Unified Monitoring for Multi-Cloud Ecosystems using Grafana as the single pane of glass, with practical patterns and snippets you can adapt for your own DevOps and SRE workflows.

Why Unified Monitoring for Multi-Cloud Ecosystems Matters

Unified Monitoring for Multi-Cloud Ecosystems is about building a single, consistent observability layer across AWS, Azure, GCP, and on‑prem so that you can detect, debug, and remediate incidents without stitching together multiple dashboards and CLIs.[2] Multi‑cloud observability best practices emphasize a unified view of performance, health, and security across providers, aggregating logs, metrics, and traces into one platform.[2][4] This single pane of glass drastically reduces your mean time to detect (MTTD) and mean time to resolve (MTTR).[2][5]

From an SRE perspective, especially in cost‑sensitive South African environments where bandwidth and latency to EU or US regions matter, unified monitoring helps you:

  • Avoid blind spots between providers.
  • Standardize SLOs and error budgets across regions.
  • Control observability costs by centralizing telemetry.
  • Reduce cognitive load for on‑call engineers.

Architecture: Grafana as the Single Pane of Glass

Grafana is an open‑source analytics and monitoring platform that visualizes data across multiple sources in unified dashboards.[1] It integrates natively with AWS CloudWatch, Azure Monitor, and Google Cloud Monitoring, enabling you to track performance and detect anomalies across all environments from one console.[1][3]

A typical reference architecture for Unified Monitoring for Multi-Cloud Ecosystems with Grafana looks like this:[2][5]

  1. Deploy a metrics/logs agent (Prometheus, OpenTelemetry Collector, Fluent Bit) into each cloud.[2][5]
  2. Standardize labels and resource identifiers (e.g., cloud, region, env, service) across all telemetry.[2]
  3. Forward telemetry to a central observability backend — a Grafana stack, Grafana Cloud, or a self‑hosted time‑series database plus log store.[2][5]
  4. Connect cloud‑native monitoring (CloudWatch, Azure Monitor, GCP Monitoring) to Grafana as additional data sources.[1][3]
  5. Build cross‑cloud dashboards, SLOs, and alerts that slice by provider and region instead of logging into each provider’s portal.[2][4]

Connecting Cloud Providers to Grafana

Here are the key steps to connect each cloud to Grafana for unified dashboards:[1]

  • AWS: Create an IAM user or role with CloudWatch read permissions and configure the CloudWatch data source in Grafana using either access keys or an assumed role.[1]
  • Azure: Register an Azure AD application, grant it the Monitoring Reader role, and configure the Azure Monitor data source.[1]
  • GCP: Create a service account with the Monitoring Viewer role and upload its JSON key to the Google Cloud Monitoring (Stackdriver) data source.[1][5]

To keep things reproducible across our SA and EU environments, we use Grafana provisioning via YAML to automate these data sources.[1]

Provisioning Multi-Cloud Data Sources in Grafana

Grafana supports data source provisioning through simple YAML files. Below is a practical example that configures three data sources: AWS CloudWatch, Azure Monitor, and GCP Monitoring.

apiVersion: 1

datasources:
  - name: AWS-CloudWatch
    type: cloudwatch
    access: proxy
    isDefault: false
    jsonData:
      authType: keys
      defaultRegion: eu-west-1
    secureJsonData:
      accessKey: $__file{/etc/grafana/secrets/aws_access_key}
      secretKey: $__file{/etc/grafana/secrets/aws_secret_key}

  - name: Azure-Monitor
    type: grafana-azure-monitor-datasource
    access: proxy
    jsonData:
      tenantId: "<azure-tenant-id>"
      clientId: "<azure-app-id>"
      cloudName: azuremonitor
    secureJsonData:
      clientSecret: "<azure-client-secret>"

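  - name: GCP-Monitoring
    type: stackdriver
    access: proxy
    jsonData:
      defaultProject: "my-gcp-project"
      authenticationType: jwt
      tokenUri: "https://oauth2.googleapis.com/token"
      clientEmail: "<service-account-email>"
    secureJsonData:
      # Provisioning expects the PEM private key taken from the service
      # account JSON key, not the whole JSON file. The path below is an
      # assumed location for the extracted key.
      privateKey: $__file{/etc/grafana/secrets/gcp_private_key.pem}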

In my setups, these YAML files live in a Git repository, pushed through CI/CD to a Grafana instance running either on‑prem in Johannesburg or in a neutral cloud region. This approach gives us GitOps for observability — any changes to the multi-cloud configuration go through code review.
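
Because these files go through code review, a CI check can catch secrets pasted straight into the YAML. The following is a minimal sketch under stated assumptions: the provisioning file has already been parsed into a dictionary (for example with a YAML library), the function name `find_inline_secrets` is hypothetical, and the set of accepted interpolation forms is an assumption that you should verify against your Grafana version.

```python
import re

# Assumption: a secret counts as externalised if it uses one of these
# interpolation forms: $__file{...}, $__env{...}, ${VAR} or $VAR.
_INTERPOLATED = re.compile(r"^\$(?:__(?:file|env)\{[^}]+\}|\{\w+\}|\w+)$")


def find_inline_secrets(provisioning: dict) -> list[str]:
    """Return 'datasource.field' names whose secret value is written inline."""
    problems = []
    for ds in provisioning.get("datasources", []):
        secure = ds.get("secureJsonData") or {}
        for field, value in secure.items():
            if not _INTERPOLATED.match(str(value).strip()):
                problems.append(f"{ds.get('name', '<unnamed>')}.{field}")
    return problems


# Hypothetical parsed content that mirrors two of the data sources above.
example = {
    "apiVersion": 1,
    "datasources": [
        {
            "name": "AWS-CloudWatch",
            "type": "cloudwatch",
            "secureJsonData": {
                "accessKey": "$__file{/etc/grafana/secrets/aws_access_key}",
                "secretKey": "$__file{/etc/grafana/secrets/aws_secret_key}",
            },
        },
        {
            "name": "Azure-Monitor",
            "type": "grafana-azure-monitor-datasource",
            "secureJsonData": {"clientSecret": "<azure-client-secret>"},
        },
    ],
}

print(find_inline_secrets(example))
```

With the example input, only the Azure client secret is flagged, because it is the one value not read from a file or environment variable. A pipeline could fail the build whenever the returned list is non-empty.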

Standardizing Metrics Across Clouds

One of the biggest challenges in Unified Monitoring for Multi-Cloud Ecosystems is that each provider names and labels things differently.[2][5] To make a single dashboard meaningful, you need a common vocabulary.

We enforce a consistent labeling scheme on all metrics and logs via Prometheus and OpenTelemetry Collector:[2][5]

  • cloud: aws, azure, gcp, onprem
  • region: eu-west-1, westeurope, europe-west4, za-jhb-1
  • env: prod, staging, dev
  • service: logical service name, not provider-specific (e.g., payments-api)
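
A labeling scheme like this is easier to keep honest when it is checked in code. Here is a minimal sketch of such a check, assuming labels arrive as a plain dictionary. The function name `normalize_labels` and the alias table are hypothetical and only illustrate the idea; the allowed values come from the list above.

```python
REQUIRED = ("cloud", "region", "env", "service")

# Allowed values taken from the labeling scheme in this post.
ALLOWED = {
    "cloud": {"aws", "azure", "gcp", "onprem"},
    "env": {"prod", "staging", "dev"},
}

# Hypothetical aliases that map provider-flavoured spellings to the shared names.
CLOUD_ALIASES = {"amazon": "aws", "google": "gcp", "on-prem": "onprem"}


def normalize_labels(raw: dict[str, str]) -> dict[str, str]:
    """Lower-case the standard labels, apply aliases, and validate them."""
    labels = dict(raw)
    for key in REQUIRED:
        if key in labels:
            labels[key] = labels[key].strip().lower()
    if "cloud" in labels:
        labels["cloud"] = CLOUD_ALIASES.get(labels["cloud"], labels["cloud"])

    missing = [key for key in REQUIRED if not labels.get(key)]
    if missing:
        raise ValueError(f"missing required labels: {missing}")
    for key, allowed in ALLOWED.items():
        if labels[key] not in allowed:
            raise ValueError(f"{key}={labels[key]!r} is not one of {sorted(allowed)}")
    return labels


print(normalize_labels(
    {"cloud": "Amazon", "region": "eu-west-1", "env": "PROD", "service": "payments-api"}
))
```

Rejecting unknown values early, rather than silently passing them through, keeps dashboards from splitting one service across several slightly different label values.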

Here is a minimal Prometheus scrape config snippet for Kubernetes clusters in multiple clouds:

scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__meta_kubernetes_node_label_topology_kubernetes_io_region]
        target_label: region
      - source_labels: [__meta_kubernetes_node_label_topology_kubernetes_io_zone]
        target_label: zone
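      # GKE-specific node label; on other providers, swap in that provider's node pool label.
      - source_labels: [__meta_kubernetes_node_label_cloud_google_com_gke_nodepool]
        target_label: nodepool
      # Static labels: with no source_labels, the replacement is always applied.
      - target_label: cloud
        replacement: 'aws'   # override per cluster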
      - target_label: env
        replacement: 'prod'

For Azure or GCP clusters, we change the cloud replacement value and, where needed, map provider-specific labels to these standard ones.
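
Hand-editing the replacement value in every cluster's config is easy to get wrong, so the per-cluster part can be generated instead. This is a rough sketch, not a definitive tool: the function name `static_label_rules` and the cluster inventory are hypothetical, and it emits JSON on the assumption that your templating step accepts JSON-style flow sequences inside the YAML config.

```python
import json

ALLOWED_CLOUDS = {"aws", "azure", "gcp", "onprem"}


def static_label_rules(cloud: str, env: str) -> list[dict]:
    """Build the two static relabel rules that differ from cluster to cluster."""
    if cloud not in ALLOWED_CLOUDS:
        raise ValueError(f"unknown cloud {cloud!r}")
    return [
        {"target_label": "cloud", "replacement": cloud},
        {"target_label": "env", "replacement": env},
    ]


# Hypothetical cluster inventory; the names are illustrative only.
clusters = {
    "eks-prod": ("aws", "prod"),
    "aks-prod": ("azure", "prod"),
    "gke-staging": ("gcp", "staging"),
}

for name, (cloud, env) in clusters.items():
    print(name, json.dumps(static_label_rules(cloud, env)))
```

The shared rules (region, zone) stay in one template, and only these two generated entries vary per cluster.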

Building Cross-Cloud Dashboards in Grafana

Once data is normalized, Grafana makes it straightforward to build dashboards that cut across providers.[1][2] For example, we maintain a “Multi‑Cloud API Health” dashboard that lets our on‑call engineer quickly see error rates and latencies per cloud.

A typical PromQL query we use for the rate of 5xx responses:

sum by (cloud, region, service) (
  rate(http_requests_total{
    env="prod",
    service="payments-api",
    status=~"5.."
  }[5m])
)
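
To see why the standard labels matter for this query, it helps to look at what `sum by (...)` does with the series it receives. The sketch below imitates that aggregation in plain Python over made-up sample values. The function name `sum_by` and the sample data are hypothetical; this is an illustration of the grouping, not Prometheus code.

```python
from collections import defaultdict


def sum_by(samples, keys):
    """Group (labels, value) pairs by the given label names and sum the values."""
    totals = defaultdict(float)
    for labels, value in samples:
        # A missing label is treated as an empty string, so it forms its own group.
        group = tuple(labels.get(key, "") for key in keys)
        totals[group] += value
    return dict(totals)


# Made-up per-pod 5xx rates (requests per second) for illustration.
samples = [
    ({"cloud": "aws", "region": "eu-west-1", "service": "payments-api", "pod": "a"}, 0.5),
    ({"cloud": "aws", "region": "eu-west-1", "service": "payments-api", "pod": "b"}, 0.25),
    ({"cloud": "gcp", "region": "europe-west4", "service": "payments-api", "pod": "c"}, 0.125),
]

for group, total in sum_by(samples, ("cloud", "region", "service")).items():
    print(group, total)
```

The two AWS pods collapse into one series with a value of 0.75, and the GCP pod stays separate at 0.125. If one cluster had labeled its series `cloud="amazon"`, it would appear as a third group, which is exactly the fragmentation the shared vocabulary prevents.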

And for latency (95th percentile):

histogram_quantile(
  0.95,
  sum by (le, cloud, region, service) (
    rate(http_request_duration_seconds_bucket{
      env="prod",
      service="payments-api"
    }[5m])
  )
)
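
The `le` label has to survive the aggregation because the quantile is estimated from cumulative buckets. As a simplified sketch of that estimate, assuming classic cumulative buckets and linear interpolation inside the bucket that contains the target rank, the Python function below mirrors the idea. It is an illustration with made-up numbers and skips edge cases that Prometheus itself handles.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (le, cumulative_count) pairs, including +Inf."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least two buckets, one of them +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total

    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == math.inf:
                # The quantile falls in the overflow bucket: report the
                # highest finite bound, since no upper edge is known.
                return buckets[-2][0]
            in_bucket = count - prev_count
            return prev_le + (le - prev_le) * (rank - prev_count) / in_bucket
        prev_le, prev_count = le, count
    return math.nan


# Made-up cumulative counts per latency bound, in seconds.
example = [(0.1, 60), (0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

With these numbers the 95th request falls in the 0.25 to 0.5 second bucket, so the estimate lands at roughly 0.41 seconds. The result can never be more precise than the bucket boundaries, which is worth remembering when choosing buckets around your SLO threshold.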

In Grafana, I add template variables for $cloud, $region, and $env so that the S