Alerting

Kupe Cloud uses Prometheus-style alert rules with Alertmanager for delivery. Rules are evaluated centrally against your tenant metrics, and notifications are routed to receivers that your team configures in the console.

  1. Alert rules come from either the managed rule catalog or your own PrometheusRule resources.
  2. Kupe syncs those rules into the managed evaluation path.
  3. The ruler evaluates the PromQL expressions on schedule.
  4. When a condition holds for the rule's configured for: duration, the alert begins firing.
  5. Alertmanager routes the alert to the receivers and routes you configured.
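Conceptually, the routing step in the last two points behaves like an Alertmanager routing tree. The sketch below is illustrative only: in Kupe Cloud you configure receivers and routes in the console rather than editing this file, and all receiver names here are placeholders.

```yaml
# Illustrative Alertmanager routing tree. In Kupe Cloud the equivalent
# configuration is managed through the console; names are placeholders.
route:
  receiver: default-receiver          # fallback when no child route matches
  group_by: [alertname, namespace]    # batch related alerts into one notification
  routes:
    - matchers:
        - team = backend              # alerts labeled team=backend ...
      receiver: backend-slack         # ... go to the backend team's channel
    - matchers:
        - severity = critical
      receiver: oncall-pager          # critical alerts page the on-call rotation
receivers:
  - name: default-receiver
  - name: backend-slack
  - name: oncall-pager
```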

Kupe includes a baseline catalog of managed rules for common platform and workload issues. In the console, open Alerting > Rules > Managed Rules to review the current state for a cluster and adjust the rules that support overrides.

Typical managed rules cover:

  • workload health, such as pod crash looping, pod not ready, rollout problems, and failed jobs
  • resource pressure, such as CPU throttling and quota pressure
  • storage health, such as persistent volumes filling up or entering a bad state

The console is the source of truth for whether a managed rule is currently enabled for a given cluster and what overrides are in effect.

Create your own PrometheusRule resources when you need application-specific alerting. These appear under Alerting > Rules > Custom Rules.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-namespace
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighRequestLatency
          expr: |
            histogram_quantile(0.99,
              sum by (le, service) (
                rate(http_request_duration_seconds_bucket{service="api"}[5m])
              )
            ) > 2.0
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "High p99 latency on {{ $labels.service }}"
            description: >-
              p99 latency is above threshold for the last 10 minutes.
            runbook_url: "https://runbooks.example.com/HighRequestLatency"
Recommended labels and annotations:

  • severity: use info, warning, or critical
  • team: use a stable team label if you route alerts by ownership
  • summary: short notification text
  • description: enough context for a responder to understand the problem
  • runbook_url: a link to the expected response steps

A few guidelines for writing good rules:

  • alert on symptoms that need action, not on every low-level signal
  • use for: to avoid flapping
  • keep severities consistent across services
  • include a runbook link for anything that might page someone
  • prefer a small number of high-signal alerts over a large number of noisy ones
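A concrete instance of these guidelines is alerting on the user-visible error ratio rather than on individual restarts or low-level signals. This is a sketch: the metric name, service label, and 5% threshold are assumptions to adapt for your workload.

```yaml
# Sketch of a symptom-based rule: fires when more than 5% of requests
# fail, sustained for 5 minutes. Metric names and threshold are examples.
- alert: HighErrorRatio
  expr: |
    sum(rate(http_requests_total{service="api", code=~"5.."}[5m]))
      /
    sum(rate(http_requests_total{service="api"}[5m])) > 0.05
  for: 5m                     # require the condition to hold, avoiding flapping
  labels:
    severity: critical        # consistent severity: this is user impact
  annotations:
    summary: "More than 5% of api requests are failing"
    runbook_url: "https://runbooks.example.com/HighErrorRatio"
```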

Every receiver in the console has a Test action. Use it to verify that the delivery path works before you depend on it in production.

To test the full rule-to-notification path, deploy a short-lived rule that always fires:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alert
  namespace: default
spec:
  groups:
    - name: test
      rules:
        - alert: TestAlert
          expr: vector(1) > 0
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Test alert"

Apply it, wait for the rule to evaluate and notify, then remove it:

kubectl apply -f test-alert.yaml
kubectl delete prometheusrule test-alert -n default

Run promtool to validate rule syntax before you commit a larger rule set:

promtool check rules rules.yaml
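Beyond syntax checks, promtool can unit-test rule behavior against synthetic series with promtool test rules. The file below sketches a test for the TestAlert rule shown earlier; the file names and timings are assumptions you would adjust for your own rules.

```yaml
# tests.yaml -- run with: promtool test rules tests.yaml
# Assumes the TestAlert rule above is saved in rules.yaml.
rule_files:
  - rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series: []          # TestAlert needs no input; vector(1) always matches
    alert_rule_test:
      - eval_time: 2m         # evaluate after the 1m for: period has elapsed
        alertname: TestAlert
        exp_alerts:
          - exp_labels:
              severity: warning
            exp_annotations:
              summary: "Test alert"
```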