Alerting
Kupe Cloud uses Prometheus-style alert rules with Alertmanager for delivery. Rules are evaluated centrally against your tenant metrics, and notifications are routed to receivers that your team configures in the console.
How alerting works
- Alert rules come from either the managed rule catalog or your own PrometheusRule resources.
- Kupe syncs those rules into the managed evaluation path.
- The ruler evaluates the PromQL expressions on schedule.
- When a condition holds for the configured `for` duration, the alert begins firing (see the sketch after this list).
- Alertmanager routes the alert to the receivers and routes you configured.
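For a concrete sense of that lifecycle, here is a minimal sketch; the alert name, metric, and threshold are illustrative and not part of any Kupe catalog:

```yaml
# Illustrative only: while the expression returns a result the alert is
# "pending"; once it has held for the full `for` duration (5m here) it
# becomes "firing" and is handed to Alertmanager for routing.
- alert: TargetDown
  expr: up{job="my-app"} == 0
  for: 5m
  labels:
    severity: warning
```

If the target recovers before the five minutes elapse, the pending alert resolves without notifying anyone.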
Managed rules
Kupe includes a baseline catalog of managed rules for common platform and workload issues. In the console, open Alerting > Rules > Managed Rules to review the current state for a cluster and adjust the rules that support overrides.
Typical managed rules cover:
- workload health, such as pod crash looping, pod not ready, rollout problems, and failed jobs (see the sketch after this list)
- resource pressure, such as CPU throttling and quota pressure
- storage health, such as persistent volumes filling up or entering a bad state
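For shape only, a crash-loop rule in this family might look like the sketch below. The actual managed rule names, expressions, and thresholds are defined by the Kupe catalog and may differ; `kube_pod_container_status_restarts_total` is the standard kube-state-metrics restart counter:

```yaml
# Hypothetical managed-style rule, not the real catalog definition:
# more than a few restarts in 15 minutes suggests a crash loop.
- alert: PodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
  for: 5m
  labels:
    severity: warning
```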
The console is the source of truth for whether a managed rule is currently enabled for a given cluster and what overrides are in effect.
Custom rules
Create your own PrometheusRule resources when you need application-specific alerting. These appear under Alerting > Rules > Custom Rules.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-namespace
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighRequestLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="api"}[5m])) by (le, service)
            ) > 2.0
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "High p99 latency on {{ $labels.service }}"
            description: >-
              p99 latency is above threshold for the last 10 minutes.
            runbook_url: "https://runbooks.example.com/HighRequestLatency"
```
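Apply it like any other namespaced resource (the filename here is assumed):

```sh
kubectl apply -f my-app-alerts.yaml
```

Once Kupe syncs the rule, it shows up under Custom Rules in the console.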
Recommended labels and annotations
- `severity`: use `info`, `warning`, or `critical`
- `team`: use a stable team label if you route alerts by ownership (see the routing sketch after this list)
- `summary`: short notification text
- `description`: enough context for a responder to understand the problem
- `runbook_url`: a link to the expected response steps
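As an illustration of routing by ownership, a conventional Alertmanager route keyed on the team label looks like the sketch below. In Kupe Cloud the equivalent routing is configured through receivers in the console, so treat this as background rather than a config you deploy:

```yaml
# Hypothetical Alertmanager-style route: anything labeled team=backend
# goes to the backend on-call receiver; everything else falls through
# to the default receiver.
route:
  receiver: default
  routes:
    - matchers:
        - team = backend
      receiver: backend-oncall
```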
Alert design guidance
Section titled “Alert design guidance”- alert on symptoms that need action, not on every low-level signal
- use
forto avoid flapping - keep severities consistent across services
- include a runbook link for anything that might page someone
- prefer a small number of high-signal alerts over a large number of noisy ones
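Put together, a symptom-oriented rule that follows this guidance might look like the following sketch; the metric name and the 5% threshold are assumptions for illustration:

```yaml
# Hypothetical symptom-based alert: page on the user-visible error ratio
# rather than on individual node or container signals.
- alert: HighErrorRatio
  expr: |
    sum(rate(http_requests_total{service="api", code=~"5.."}[5m]))
      /
    sum(rate(http_requests_total{service="api"}[5m])) > 0.05
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "API 5xx ratio above 5%"
    runbook_url: "https://runbooks.example.com/HighErrorRatio"
```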
Test the pipeline
Section titled “Test the pipeline”Test a receiver
Every receiver in the console has a Test action. Use it to verify that the delivery path works before you depend on it in production.
Test a rule end to end
To test the full rule-to-notification path, deploy a short-lived rule that always fires:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alert
  namespace: default
spec:
  groups:
    - name: test
      rules:
        - alert: TestAlert
          expr: vector(1) > 0
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Test alert"
```

Apply it, wait for the rule to evaluate and notify, then remove it:
```sh
kubectl apply -f test-alert.yaml
kubectl delete prometheusrule test-alert -n default
```

Validate rules locally
Use promtool before you commit a larger rule set:
```sh
promtool check rules rules.yaml
```
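Beyond syntax checks, promtool can also unit-test rule behavior. Here is a sketch for the HighRequestLatency rule above, assuming it is saved as rules.yaml; the synthetic bucket series and timings are illustrative:

```yaml
# tests.yaml: feed synthetic histogram buckets so the computed p99 lands
# near 3.97s (above the 2.0s threshold) and assert that the alert fires.
rule_files:
  - rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_request_duration_seconds_bucket{service="api", le="1"}'
        values: '0x20'        # nothing completes under 1s
      - series: 'http_request_duration_seconds_bucket{service="api", le="4"}'
        values: '0+10x20'     # all requests land between 1s and 4s
      - series: 'http_request_duration_seconds_bucket{service="api", le="+Inf"}'
        values: '0+10x20'
    alert_rule_test:
      - eval_time: 15m        # condition true from ~1m, so `for: 10m` has elapsed
        alertname: HighRequestLatency
        exp_alerts:
          - exp_labels:
              severity: warning
              team: backend
              service: api
            exp_annotations:
              summary: "High p99 latency on api"
              description: p99 latency is above threshold for the last 10 minutes.
              runbook_url: "https://runbooks.example.com/HighRequestLatency"
```

Run it with:

```sh
promtool test rules tests.yaml
```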
Next steps
- Notifications: configure receivers and routing
- Metrics: test the underlying queries in Grafana