# Observability
Kupe Cloud provides a fully managed observability stack for metrics, logs, and alerting. Every cluster ships with core telemetry and monitoring capabilities out of the box, so you don’t need to install or operate observability infrastructure yourself.
## The stack

| Component | Role | What you interact with |
|---|---|---|
| Grafana Alloy | Collection agent (DaemonSet on every node) | You don’t — it works automatically |
| Mimir | Long-term metrics storage and PromQL query engine | Query metrics in Grafana or define alert rules |
| Loki | Log aggregation and search | Query logs in Grafana by namespace, pod, or label |
| Alertmanager | Alert routing and notification delivery | Configure receivers and routing |
| Grafana | Dashboards, exploration, and visualization | Build dashboards, explore metrics and logs |
## How data flows

1. Alloy scrapes Prometheus metrics from the kubelet, cAdvisor, and any pods with scrape annotations. It also collects container logs from every node.
2. Metrics are pushed to Mimir; logs are pushed to Loki.
3. PrometheusRule resources in your clusters are synced to the Mimir ruler, which evaluates your PromQL alert expressions.
4. Firing alerts are sent to Alertmanager, which groups, deduplicates, and routes them to your configured receivers (Slack, PagerDuty, email, Teams, or webhooks).
5. Grafana queries Mimir and Loki as data sources for dashboards and ad-hoc exploration.
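The scrape-annotation discovery in the first step is conventionally declared on the pod itself. A minimal sketch, assuming the cluster's Alloy configuration honors the common `prometheus.io/*` annotation keys (the exact keys and the image name are assumptions, not confirmed by this page):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  annotations:
    # Conventional Prometheus discovery annotations; confirm the exact
    # keys your cluster's Alloy pipeline honors.
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: app
      image: example/app:latest  # hypothetical image
      ports:
        - containerPort: 8080
```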
## What you get automatically

Every cluster includes these without any configuration:
- Container CPU and memory metrics from cAdvisor (per pod, per container).
- Pod restart, scheduling, and lifecycle events.
- Container logs indexed by namespace, pod, and container name.
- Pre-built platform dashboards for cluster health and resource utilization.
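The cAdvisor metrics above can be queried directly in Grafana against Mimir. A sketch using the standard cAdvisor metric names, with `my-app` as a placeholder namespace:

```promql
# Per-pod CPU usage (in cores) over the last 5 minutes;
# container!="" drops the pod-level aggregate series.
sum by (pod) (
  rate(container_cpu_usage_seconds_total{namespace="my-app", container!=""}[5m])
)

# Current working-set memory per container
container_memory_working_set_bytes{namespace="my-app", container!=""}
```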
## What you configure

- Custom dashboards deployed as ConfigMaps (see Grafana Dashboards).
- Application metrics exposed via Prometheus scrape annotations (see Metrics).
- Alert rules defined as PrometheusRule resources (see Alerting).
- Notification receivers configured in the console (see Alerting).
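An alert rule is an ordinary PrometheusRule resource that the platform syncs to the Mimir ruler. A sketch assuming a hypothetical application counter `http_requests_total` with a `code` label; the `apiVersion` follows the standard Prometheus Operator CRD:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts   # hypothetical name
  namespace: my-app     # hypothetical namespace
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: HighErrorRate
          # Fire when more than 5% of requests return a 5xx status
          # for 10 minutes straight.
          expr: |
            sum(rate(http_requests_total{job="my-app", code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="my-app"}[5m])) > 0.05
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "More than 5% of requests to my-app are failing"
```

The `severity` label is a common convention for steering Alertmanager routing; match it against your configured receivers.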
## Recommended triage path

When something goes wrong, follow this sequence:

1. An alert tells you something needs attention and which service is affected.
2. A dashboard shows the scope: when it started, how severe it is, and what's impacted.
3. Metrics isolate the signal: error rates, latency spikes, resource saturation.
4. Logs confirm the root cause: stack traces, error messages, request details.
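The last step usually happens in Grafana's Explore view against Loki. A hedged LogQL sketch, with the namespace and pod pattern as placeholders:

```logql
# Error lines from the affected workload over the incident window
{namespace="my-app", pod=~"my-app-.*"} |= "error"
```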