
Alerting

Kupe Cloud uses PrometheusRule resources for alerting and Alertmanager for notification delivery. Alert rules are defined as Kubernetes resources, evaluated by the Mimir ruler, and routed to receivers you configure in the console.

  1. A PrometheusRule resource is created in your cluster (either a platform default or your own custom rule).
  2. Alloy discovers the resource and syncs it to the Mimir ruler.
  3. The ruler evaluates your PromQL expressions on a schedule.
  4. When a condition holds for the configured `for` duration, the alert transitions from pending to firing.
  5. Firing alerts are sent to Alertmanager, where you define grouping, deduplication, routing logic, and receivers (Slack, PagerDuty, email, Teams, webhook) in the console.

Every cluster comes with a set of baseline alert rules that cover common failure modes. These are managed by the platform and appear under the Managed Rules tab in the console at Alerting > Rules.

All managed rules start disabled. Enable them individually from the console — no YAML required.

| Alert | Description | Default `for` | Default severity |
| --- | --- | --- | --- |
| Pod Crash Looping | Pod is restarting frequently (CrashLoopBackOff) | 15m | warning |
| Pod Not Ready | Pod has been in a non-ready state for too long | 15m | warning |
| Container Waiting | Container stuck in a waiting state | 1h | warning |
| Deployment Replicas Mismatch | Deployment hasn’t reached desired replica count | 15m | warning |
| Deployment Rollout Stuck | Deployment rollout is not progressing | 15m | warning |
| StatefulSet Replicas Mismatch | StatefulSet hasn’t reached desired replica count | 15m | warning |
| Job Failed | Job failed to complete | 15m | warning |

| Alert | Description | Default threshold | Default severity |
| --- | --- | --- | --- |
| CPU Throttling High | Processes experiencing elevated CPU throttling | 25% | warning |
| Quota Almost Full | Namespace resource quota approaching its limit | 90% | warning |
| Quota Exceeded | Namespace resource quota exceeded | n/a | warning |

| Alert | Description | Default threshold | Default severity |
| --- | --- | --- | --- |
| Persistent Volume Filling Up | PV running low on available space | Critical: 3% remaining, Warning: 15% remaining | critical / warning |
| Persistent Volume Errors | PV in Failed or Pending state | n/a | critical |
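
For intuition, a rule of the CPU Throttling High shape — not the platform's exact expression, just a common way to write it against cAdvisor metrics — looks roughly like:

```yaml
# Illustrative only — the managed rule's real expression may differ.
- alert: CPUThrottlingHigh
  expr: |
    sum(increase(container_cpu_cfs_throttled_periods_total[5m])) by (container, pod, namespace)
      /
    sum(increase(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace)
      > 0.25
  for: 15m
  labels:
    severity: warning
```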

In the console under Alerting > Rules > Managed Rules:

  1. Toggle any rule on or off with the enable switch.
  2. Edit a rule to override its defaults:
    • Pending duration (for) — how long the condition must hold before firing.
    • Severity — info, warning, or critical.
    • Threshold — for rules with configurable thresholds (e.g., CPU throttling percentage, quota ratio).

Changes are applied through the platform operator — you don’t need to edit YAML or manage PrometheusRule resources directly for default rules.


For application-specific alerts, create your own PrometheusRule resources. These appear automatically under the Custom Rules tab in the console.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-namespace
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighRequestLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="api"}[5m]))
              by (le)
            ) > 2.0
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "High p99 latency on {{ $labels.service }}"
            description: >-
              p99 latency is {{ $value | humanizeDuration }}
              over the last 10 minutes.
            runbook_url: "https://runbooks.example.com/HighRequestLatency"
```

| Field | Purpose |
| --- | --- |
| `alert` | Alert name. Becomes the `alertname` label. |
| `expr` | PromQL expression. The alert fires when this returns results. |
| `for` | How long the condition must hold before firing. Prevents flapping. |
| `labels` | Added to the alert. Used for routing and grouping in Alertmanager. |
| `annotations` | Informational fields included in notifications (summary, description, runbook). |

Use standard labels so Alertmanager can route correctly:

  • severity: critical, warning, or info
  • team: owning team name (use this in routing rule matchers)

And use standard annotations so notifications carry useful context:

  • summary: one-line description, supports {{ $labels.name }} templating
  • description: detailed context with {{ $value }} for the current metric value
  • runbook_url: link to the response procedure for this alert

A few guidelines for writing good alerts:

  • Alert on symptoms (error rate, latency, availability), not causes.
  • Every alert should have a clear action. If there’s nothing to do, it shouldn’t be an alert.
  • Include a runbook_url so responders know what to do.
  • Use for on all rules — 5–15 minutes for warnings, 2–5 minutes for critical.

Recording rules precompute expensive queries and store results as new time series. Use them to speed up dashboards and simplify alert expressions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-recording-rules
  namespace: my-namespace
spec:
  groups:
    - name: my-app-recording
      interval: 1m
      rules:
        - record: namespace:http_request_rate:sum
          expr: |
            sum by (namespace) (
              rate(http_requests_total[5m])
            )
```

Use the naming convention `level:metric:operations` for recording rule names (as in `namespace:http_request_rate:sum` above).
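
One payoff is that alert expressions can reference the precomputed series directly. A sketch using the recording rule above (the alert name and threshold are hypothetical):

```yaml
- alert: HighNamespaceRequestRate            # hypothetical alert
  expr: namespace:http_request_rate:sum > 100   # recorded series from above
  for: 10m
  labels:
    severity: warning
```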


Receivers define where alert notifications are delivered. Configure them in the console at Alerting > Notifications > Receivers.

| Type | What you need |
| --- | --- |
| Slack | Incoming webhook URL |
| Microsoft Teams | Incoming webhook URL from a Teams channel connector |
| PagerDuty | Events API v2 integration key from your PagerDuty service |
| Email | SMTP server, sender address, recipient address |
| Webhook | Any HTTP endpoint that accepts POST requests |

All receiver types support Send resolved — enable this to get notified when alerts recover.
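
For the Webhook type, it helps to know the shape of the body Alertmanager POSTs to your endpoint. An abridged sketch (all field values are illustrative):

```json
{
  "version": "4",
  "status": "firing",
  "receiver": "my-webhook",
  "groupLabels": { "alertname": "HighRequestLatency" },
  "commonLabels": { "severity": "warning", "team": "backend" },
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "HighRequestLatency", "severity": "warning" },
      "annotations": { "summary": "High p99 latency on api" },
      "startsAt": "2025-01-01T00:00:00Z",
      "generatorURL": "https://example.com/graph"
    }
  ]
}
```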

Every receiver ships with rich notification templates out of the box. You don’t need to configure templates unless you want to customize the format.

Slack defaults include:

  • Title with alert status, count, and group labels
  • Per-alert detail with summary, description, runbook link, and labels
  • Severity-aware color coding (red for critical, yellow for warning, blue for info, green for resolved)

PagerDuty defaults include:

  • Severity mapped from alert labels (falls back to critical)
  • Description with summary and detail

Microsoft Teams defaults include:

  • Title with status and count
  • Firing and resolved sections with full alert detail

All text fields in Slack, Teams, and PagerDuty receivers accept Go template syntax. You can customize any field by expanding Show optional settings when creating or editing a receiver.

Templates receive the Alertmanager notification data structure:

| Variable | Description |
| --- | --- |
| `{{ .Status }}` | `"firing"` or `"resolved"` |
| `{{ .Alerts }}` | List of all alerts in the group |
| `{{ .Alerts.Firing }}` | Only firing alerts |
| `{{ .Alerts.Resolved }}` | Only resolved alerts |
| `{{ .GroupLabels }}` | Labels used to group the alerts |
| `{{ .CommonLabels }}` | Labels shared by all alerts in the group |
| `{{ .CommonAnnotations }}` | Annotations shared by all alerts |

Each alert in the list has:

| Variable | Description |
| --- | --- |
| `{{ .Labels.alertname }}` | Alert name |
| `{{ .Labels.severity }}` | Severity label |
| `{{ .Annotations.summary }}` | Summary annotation |
| `{{ .Annotations.description }}` | Description annotation |
| `{{ .Annotations.runbook_url }}` | Runbook link |
| `{{ .GeneratorURL }}` | Link back to the query source |

For example, a per-alert detail block for a message body:

```
{{ range .Alerts -}}
*{{ .Labels.alertname }}* ({{ .Labels.severity }})
{{ .Annotations.summary }}
Namespace: {{ .Labels.namespace }}
{{ end }}
```

| Function | Example | Output |
| --- | --- | --- |
| `toUpper` | `{{ .Status \| toUpper }}` | `FIRING` |
| `join` | `{{ .GroupLabels.SortedPairs.Values \| join " " }}` | `MyAlert warning` |
| `humanize` | `{{ $value \| humanize }}` | `1.234k` |
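
These functions compose. For example, a hypothetical custom title that combines several of them:

```
[{{ .Status | toUpper }}] {{ .GroupLabels.SortedPairs.Values | join " " }} ({{ .Alerts | len }} alerts)
```

This might render as something like `[FIRING] HighRequestLatency (2 alerts)`.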

Routing rules control which alerts go to which receiver. Configure them in the console at Alerting > Notifications > Routing.

Without any routing rules, all alerts go to the default receiver (the first one you created).

  1. Click Add Routing Rule.
  2. Select the Receiver for matching alerts.
  3. Add Matchers — label conditions that determine which alerts match:
    • severity = critical — route all critical alerts
    • team = backend — route alerts tagged for a specific team
    • namespace = my-app — route alerts from a specific namespace
  4. Click Save.

Matchers use Alertmanager’s label matching:

| Syntax | Meaning |
| --- | --- |
| `severity = critical` | Exact match |
| `severity != info` | Not equal |
| `namespace =~ "prod\|staging"` | Regex match |
| `team !~ "test.*"` | Negative regex |

Enable Continue on a route to send matching alerts to this receiver and keep evaluating subsequent routes. This is useful for sending critical alerts to both Slack and PagerDuty.
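
Under the hood, these settings correspond to Alertmanager's route tree. A hypothetical equivalent in raw Alertmanager configuration (receiver names are made up), showing continue in action:

```yaml
route:
  receiver: default-slack          # fallback when nothing else matches
  routes:
    - receiver: pagerduty-oncall
      matchers:
        - severity = critical
      continue: true               # keep evaluating: critical alerts also reach later routes
    - receiver: backend-slack
      matchers:
        - team = backend
```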

Each routing rule can override the default grouping and timing:

| Setting | Default | Purpose |
| --- | --- | --- |
| Group by | `alertname` | Labels used to batch alerts into a single notification |
| Group wait | 30s | Delay before the first notification for a new group |
| Group interval | 5m | Minimum wait between updates to a group |
| Repeat interval | 4h | Resend interval when nothing has changed |

Click Show default overrides in the routing rule form to configure these.
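
In raw Alertmanager terms, those four settings map to per-route fields. A sketch with the default values from the table (the receiver and matcher are hypothetical):

```yaml
- receiver: backend-slack        # hypothetical receiver
  matchers:
    - team = backend
  group_by: [alertname]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
```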


Each receiver has a Test button in the console. Click it to send a one-off test alert to verify the receiver is working. The test alert auto-resolves in a few seconds and does not persist.

To test the full pipeline (rule → ruler → Alertmanager → receiver), deploy a rule that fires immediately:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alert
  namespace: default
spec:
  groups:
    - name: test
      rules:
        - alert: TestAlert
          expr: vector(1) > 0
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Test alert — safe to ignore"
```

Apply it, wait 2–3 minutes for the full sync and evaluation cycle, then delete it:

```sh
kubectl apply -f test-alert.yaml
# Wait for notification...
kubectl delete prometheusrule test-alert
```

Test your PromQL expressions before deploying. Note that promtool expects the plain Prometheus rule-file format — the contents of `spec.groups` — rather than the full PrometheusRule resource:

```sh
promtool check rules rules.yaml
```
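
You can go further and unit-test rules offline with `promtool test rules`. A minimal sketch, assuming the TestAlert rule above has been saved in plain rule-file format as rules.yaml (file names are hypothetical):

```yaml
# tests.yaml — promtool unit test (hypothetical file names)
rule_files:
  - rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series: []        # TestAlert fires unconditionally, no input series needed
    alert_rule_test:
      - eval_time: 2m       # past the 1m "for" duration
        alertname: TestAlert
        exp_alerts:
          - exp_labels:
              severity: warning
            exp_annotations:
              summary: "Test alert — safe to ignore"
```

Run it with `promtool test rules tests.yaml`.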