Alerting
Kupe Cloud uses PrometheusRule resources for alerting and Alertmanager for notification delivery. Alert rules are defined as Kubernetes resources, evaluated by the Mimir ruler, and routed to receivers you configure in the console.
How alerting works
- A `PrometheusRule` resource is created in your cluster (either a platform default or your own custom rule).
- Alloy discovers the resource and syncs it to the Mimir ruler.
- The ruler evaluates your PromQL expressions on a schedule.
- When a condition holds for the configured `for` duration, the alert transitions from pending to firing.
- Firing alerts are sent to Alertmanager, where you define grouping, deduplication, routing logic, and receivers (Slack, PagerDuty, email, Teams, webhook) in the console.
Default alert rules
Every cluster comes with a set of baseline alert rules that cover common failure modes. These are managed by the platform and appear under the Managed Rules tab in the console at Alerting > Rules.
All default rules are disabled by default. Enable them individually from the console — no YAML required.
Available default rules
Workload alerts
| Alert | Description | Default `for` duration | Default severity |
|---|---|---|---|
| Pod Crash Looping | Pod is restarting frequently (CrashLoopBackOff) | 15m | warning |
| Pod Not Ready | Pod has been in a non-ready state for too long | 15m | warning |
| Container Waiting | Container stuck in a waiting state | 1h | warning |
| Deployment Replicas Mismatch | Deployment hasn’t reached desired replica count | 15m | warning |
| Deployment Rollout Stuck | Deployment rollout is not progressing | 15m | warning |
| StatefulSet Replicas Mismatch | StatefulSet hasn’t reached desired replica count | 15m | warning |
| Job Failed | Job failed to complete | 15m | warning |
Resource alerts
| Alert | Description | Default threshold | Default severity |
|---|---|---|---|
| CPU Throttling High | Processes experiencing elevated CPU throttling | 25% | warning |
| Quota Almost Full | Namespace resource quota approaching its limit | 90% | warning |
| Quota Exceeded | Namespace resource quota exceeded | — | warning |
Storage alerts
| Alert | Description | Default threshold | Default severity |
|---|---|---|---|
| Persistent Volume Filling Up | PV running low on available space | Critical: 3% remaining, Warning: 15% remaining | critical / warning |
| Persistent Volume Errors | PV in Failed or Pending state | — | critical |
Enabling and overriding defaults
In the console under Alerting > Rules > Managed Rules:
- Toggle any rule on or off with the enable switch.
- Edit a rule to override its defaults:
  - Pending duration (`for`) — how long the condition must hold before firing.
  - Severity — `info`, `warning`, or `critical`.
  - Threshold — for rules with configurable thresholds (e.g., CPU throttling percentage, quota ratio).
Changes are applied through the platform operator — you don’t need to edit YAML or manage PrometheusRule resources directly for default rules.
Custom alert rules
For application-specific alerts, create your own PrometheusRule resources. These appear automatically under the Custom Rules tab in the console.
Creating a PrometheusRule
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-namespace
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighRequestLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="api"}[5m])) by (le)
            ) > 2.0
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "High p99 latency on {{ $labels.service }}"
            description: >-
              p99 latency is {{ $value | humanizeDuration }} over the last 10 minutes.
            runbook_url: "https://runbooks.example.com/HighRequestLatency"
```
Key fields
| Field | Purpose |
|---|---|
| `alert` | Alert name. Becomes the `alertname` label. |
| `expr` | PromQL expression. Alert fires when this returns results. |
| `for` | How long the condition must hold before firing. Prevents flapping. |
| `labels` | Added to the alert. Used for routing and grouping in Alertmanager. |
| `annotations` | Informational fields included in notifications (summary, description, runbook). |
Label conventions
Use standard labels so Alertmanager can route correctly:
- `severity`: `critical`, `warning`, or `info`
- `team`: owning team name (use this in routing rule matchers)
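For example, the labels block of a rule that follows these conventions (the team name here is illustrative):

```yaml
# Inside a PrometheusRule rule definition
labels:
  severity: critical   # one of: critical, warning, info
  team: backend        # matched by routing rules, e.g. team = backend
```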
Annotation conventions
- `summary`: one-line description, supports `{{ $labels.name }}` templating
- `description`: detailed context with `{{ $value }}` for the current metric value
- `runbook_url`: link to the response procedure for this alert
Alert design principles
- Alert on symptoms (error rate, latency, availability), not causes.
- Every alert should have a clear action. If there’s nothing to do, it shouldn’t be an alert.
- Include a `runbook_url` so responders know what to do.
- Use `for` on all rules — 5–15 minutes for warnings, 2–5 minutes for critical.
Recording rules
Recording rules precompute expensive queries and store results as new time series. Use them to speed up dashboards and simplify alert expressions.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-recording-rules
  namespace: my-namespace
spec:
  groups:
    - name: my-app-recording
      interval: 1m
      rules:
        - record: namespace:http_request_rate:sum
          expr: |
            sum by (namespace) (
              rate(http_requests_total[5m])
            )
```
Use the naming convention `level:metric:operations` for recording rule names.
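Alert rules can then reference the precomputed series instead of repeating the raw query. A sketch building on the recording rule above, with an illustrative alert name and threshold:

```yaml
- alert: HighNamespaceRequestRate
  expr: namespace:http_request_rate:sum > 100   # illustrative threshold (req/s)
  for: 10m
  labels:
    severity: warning
```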
Notification receivers
Receivers define where alert notifications are delivered. Configure them in the console at Alerting > Notifications > Receivers.
Supported receiver types
| Type | What you need |
|---|---|
| Slack | Incoming webhook URL from your Slack workspace |
| Microsoft Teams | Incoming webhook URL from a Teams channel connector |
| PagerDuty | Events API v2 integration key from your PagerDuty service |
| Email | SMTP server, sender address, recipient address |
| Webhook | Any HTTP endpoint that accepts POST requests |
All receiver types support Send resolved — enable this to get notified when alerts recover.
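For webhook receivers, the endpoint receives Alertmanager's standard JSON payload via HTTP POST. An abbreviated example (all values are illustrative):

```json
{
  "version": "4",
  "status": "firing",
  "receiver": "my-webhook",
  "groupLabels": { "alertname": "HighRequestLatency" },
  "commonLabels": { "alertname": "HighRequestLatency", "severity": "warning" },
  "commonAnnotations": { "summary": "High p99 latency on api" },
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "HighRequestLatency", "severity": "warning" },
      "annotations": { "summary": "High p99 latency on api" },
      "startsAt": "2024-05-01T12:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}
```

Resolved notifications arrive with `"status": "resolved"` and a populated `endsAt`.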
Default notification templates
Every receiver ships with rich notification templates out of the box. You don’t need to configure templates unless you want to customize the format.
Slack defaults include:
- Title with alert status, count, and group labels
- Per-alert detail with summary, description, runbook link, and labels
- Severity-aware color coding (red for critical, yellow for warning, blue for info, green for resolved)
PagerDuty defaults include:
- Severity mapped from alert labels (falls back to critical)
- Description with summary and detail
Microsoft Teams defaults include:
- Title with status and count
- Firing and resolved sections with full alert detail
Customizing templates with Go templating
All text fields in Slack, Teams, and PagerDuty receivers accept Go template syntax. You can customize any field by expanding Show optional settings when creating or editing a receiver.
Available template data
Templates receive the Alertmanager notification data structure:
| Variable | Description |
|---|---|
| `{{ .Status }}` | "firing" or "resolved" |
| `{{ .Alerts }}` | List of all alerts in the group |
| `{{ .Alerts.Firing }}` | Only firing alerts |
| `{{ .Alerts.Resolved }}` | Only resolved alerts |
| `{{ .GroupLabels }}` | Labels used to group the alerts |
| `{{ .CommonLabels }}` | Labels shared by all alerts in the group |
| `{{ .CommonAnnotations }}` | Annotations shared by all alerts |
Each alert in the list has:
| Variable | Description |
|---|---|
| `{{ .Labels.alertname }}` | Alert name |
| `{{ .Labels.severity }}` | Severity label |
| `{{ .Annotations.summary }}` | Summary annotation |
| `{{ .Annotations.description }}` | Description annotation |
| `{{ .Annotations.runbook_url }}` | Runbook link |
| `{{ .GeneratorURL }}` | Link back to the query source |
Example: custom Slack text
```
{{ range .Alerts -}}
*{{ .Labels.alertname }}* ({{ .Labels.severity }})
{{ .Annotations.summary }}
Namespace: {{ .Labels.namespace }}
{{ end }}
```
Built-in functions
| Function | Example | Output |
|---|---|---|
| `toUpper` | `{{ .Status \| toUpper }}` | `FIRING` |
| `join` | `{{ .GroupLabels.SortedPairs.Values \| join " " }}` | `MyAlert warning` |
| `humanize` | `{{ $value \| humanize }}` | `1.234k` |
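Functions can be chained with the template data to build compact custom fields. A sketch of a title line combining status, firing count, and group labels:

```
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }}
```

For a group of two firing alerts this renders something like `[FIRING:2] MyAlert warning`.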
Routing rules
Routing rules control which alerts go to which receiver. Configure them in the console at Alerting > Notifications > Routing.
Without any routing rules, all alerts go to the default receiver (the first one you created).
Creating a routing rule
- Click Add Routing Rule.
- Select the Receiver for matching alerts.
- Add Matchers — label conditions that determine which alerts match:
  - `severity = critical` — route all critical alerts
  - `team = backend` — route alerts tagged for a specific team
  - `namespace = my-app` — route alerts from a specific namespace
- Click Save.
Matcher syntax
Matchers use Alertmanager’s label matching:
| Syntax | Meaning |
|---|---|
| `severity = critical` | Exact match |
| `severity != info` | Not equal |
| `namespace =~ "prod\|staging"` | Regex match |
| `team !~ "test.*"` | Negative regex |
Continue matching
Enable Continue on a route to send matching alerts to this receiver and keep evaluating subsequent routes. This is useful for sending critical alerts to both Slack and PagerDuty.
Grouping and timing overrides
Each routing rule can override the default grouping and timing:
| Setting | Default | Purpose |
|---|---|---|
| Group by | alertname | Labels used to batch alerts into a single notification |
| Group wait | 30s | Delay before the first notification for a new group |
| Group interval | 5m | Minimum wait between updates to a group |
| Repeat interval | 4h | Resend interval when nothing has changed |
Click Show default overrides in the routing rule form to configure these.
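These settings map to standard Alertmanager route options. For reference only (you configure them in the console, not in YAML), the defaults above correspond roughly to:

```yaml
route:
  group_by: ["alertname"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
```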
Testing alerts
Test button
Each receiver has a Test button in the console. Click it to send a one-off test alert to verify the receiver is working. The test alert auto-resolves in a few seconds and does not persist.
Test PrometheusRule
To test the full pipeline (rule → ruler → Alertmanager → receiver), deploy a rule that fires immediately:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alert
  namespace: default
spec:
  groups:
    - name: test
      rules:
        - alert: TestAlert
          expr: vector(1) > 0
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Test alert — safe to ignore"
```
Apply it, wait 2–3 minutes for the full sync and evaluation cycle, then delete it:
```sh
kubectl apply -f test-alert.yaml
# Wait for notification...
kubectl delete prometheusrule test-alert
```
Validating rules locally
Test your PromQL expressions before deploying:
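Beyond a basic syntax check, `promtool test rules` can assert that a rule fires given synthetic input series. A sketch of a test file (the file, rule, and series names are illustrative); note that `promtool` expects plain Prometheus rule files (the contents of `spec.groups`), not the full PrometheusRule manifest:

```yaml
# tests.yaml: run with `promtool test rules tests.yaml`
rule_files:
  - rules.yaml              # plain rule groups, not the CRD wrapper
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="api"}'
        values: '0x20'      # target down for 20 samples
    alert_rule_test:
      - eval_time: 20m
        alertname: TargetDown   # illustrative rule alerting on up == 0
        exp_alerts:
          - exp_labels:
              severity: warning
              job: api
```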
```sh
promtool check rules rules.yaml
```
Further reading
- Tutorial: Slack Notifications — step-by-step first-time setup walkthrough
- Prometheus alerting rules
- Alertmanager notification template reference
- PrometheusRule CRD reference