Observability

Kupe Cloud provides a fully managed observability stack for metrics, logs, and alerting. Every cluster ships with core telemetry and monitoring capabilities out of the box, so you don’t need to install or operate observability infrastructure yourself.

| Component | Role | What you interact with |
| --- | --- | --- |
| Grafana Alloy | Collection agent (DaemonSet on every node) | You don’t — it works automatically |
| Mimir | Long-term metrics storage and PromQL query engine | Query metrics in Grafana or define alert rules |
| Loki | Log aggregation and search | Query logs in Grafana by namespace, pod, or label |
| Alertmanager | Alert routing and notification delivery | Configure receivers and routing |
| Grafana | Dashboards, exploration, and visualization | Build dashboards, explore metrics and logs |

The pipeline works as follows:
  1. Alloy scrapes Prometheus metrics from kubelet, cAdvisor, and any pods with scrape annotations. It also collects container logs from every node.
  2. Metrics are pushed to Mimir. Logs are pushed to Loki.
  3. PrometheusRule resources in your clusters are synced to the Mimir ruler, which evaluates your PromQL alert expressions.
  4. Firing alerts are sent to Alertmanager, which groups, deduplicates, and routes them to your configured receivers (Slack, PagerDuty, email, Teams, or webhooks).
  5. Grafana queries Mimir and Loki as data sources for dashboards and ad-hoc exploration.
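Step 1 depends on pods opting into scraping via annotations. A minimal sketch of such a pod spec, assuming Alloy honors the widely used `prometheus.io/*` annotation convention (the exact keys, pod name, and image here are illustrative; confirm the supported annotations on the Metrics page):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                      # hypothetical example pod
  annotations:
    prometheus.io/scrape: "true"    # opt this pod into metric scraping
    prometheus.io/port: "8080"      # port serving the metrics endpoint
    prometheus.io/path: "/metrics"  # path of the metrics endpoint
spec:
  containers:
    - name: app
      image: my-app:latest          # placeholder image
      ports:
        - containerPort: 8080
```

With annotations like these in place, Alloy discovers the pod automatically and its metrics become queryable in Mimir without any further agent configuration.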

Every cluster includes these without any configuration:

  • Container CPU and memory metrics from cAdvisor (per pod, per container).
  • Pod restart, scheduling, and lifecycle events.
  • Container logs indexed by namespace, pod, and container name.
  • Pre-built platform dashboards for cluster health and resource utilization.

On top of the defaults, you can add:

  • Custom dashboards deployed as ConfigMaps (see Grafana Dashboards).
  • Application metrics exposed via Prometheus scrape annotations (see Metrics).
  • Alert rules defined as PrometheusRule resources (see Alerting).
  • Notification receivers configured in the console (see Alerting).
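As a sketch of what an alert rule looks like, here is a hypothetical PrometheusRule that the platform would sync to the Mimir ruler. The CRD schema (`monitoring.coreos.com/v1`) comes from the Prometheus Operator; the metric name, labels, and threshold are illustrative assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts          # hypothetical name
  namespace: my-namespace      # hypothetical namespace
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          # Fire when more than 5% of requests return 5xx for 10 minutes.
          expr: |
            sum(rate(http_requests_total{job="my-app", status=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="my-app"}[5m])) > 0.05
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "my-app 5xx error rate above 5% for 10 minutes"
```

Once applied to the cluster, the expression is evaluated by the ruler, and firing alerts flow through Alertmanager to whichever receivers you have configured.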

When something goes wrong, follow this sequence:

  1. The alert tells you that something needs attention and which service is affected.
  2. The dashboard shows the scope: when it started, how severe it is, and what is impacted.
  3. Metrics isolate the signal: error rates, latency spikes, resource saturation.
  4. Logs confirm the root cause: stack traces, error messages, request details.
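Steps 3 and 4 translate into ad-hoc queries in Grafana's Explore view. For example, assuming a service labelled `my-app` in namespace `my-namespace` that exports the standard `http_requests_total` counter (all names here are hypothetical), the metric and log queries might look like:

```
# PromQL (Mimir): per-status 5xx rate for the affected service
sum by (status) (rate(http_requests_total{job="my-app", status=~"5.."}[5m]))

# LogQL (Loki): recent error lines from the same pods
{namespace="my-namespace", pod=~"my-app-.*"} |= "error"
```

Running the metric query first narrows the blast radius to specific status codes or pods; the log query then pulls the matching stack traces and error messages.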