
Alerting

Kupe Cloud uses PrometheusRule resources for alerting and Alertmanager for notification delivery. Alert rules are defined as Kubernetes resources, evaluated by the Mimir ruler, and routed to receivers you configure in the console.

  1. A PrometheusRule resource is created in your cluster (either a platform default or your own custom rule).
  2. Alloy discovers the resource and syncs it to the Mimir ruler.
  3. The ruler evaluates your PromQL expressions on a schedule.
  4. When a condition holds for the configured `for` duration, the alert transitions from pending to firing.
  5. Firing alerts are sent to Alertmanager, where you define grouping, deduplication, routing logic, and receivers (Slack, PagerDuty, email, Teams, webhook) in the console.

Every cluster comes with a set of baseline alert rules that cover common failure modes. These are managed by the platform and appear under the Managed Rules tab in the console at Alerting > Rules.

All managed rules start disabled. Enable them individually from the console — no YAML required.

| Alert | Description | Default `for` | Default severity |
| --- | --- | --- | --- |
| Pod Crash Looping | Pod is restarting frequently (CrashLoopBackOff) | 15m | warning |
| Pod Not Ready | Pod has been in a non-ready state for too long | 15m | warning |
| Container Waiting | Container stuck in a waiting state | 1h | warning |
| Deployment Replicas Mismatch | Deployment hasn’t reached desired replica count | 15m | warning |
| Deployment Rollout Stuck | Deployment rollout is not progressing | 15m | warning |
| StatefulSet Replicas Mismatch | StatefulSet hasn’t reached desired replica count | 15m | warning |
| Job Failed | Job failed to complete | 15m | warning |

| Alert | Description | Default threshold | Default severity |
| --- | --- | --- | --- |
| CPU Throttling High | Processes experiencing elevated CPU throttling | 25% | warning |
| Quota Almost Full | Namespace resource quota approaching its limit | 90% | warning |
| Quota Exceeded | Namespace resource quota exceeded | n/a | warning |

| Alert | Description | Default threshold | Default severity |
| --- | --- | --- | --- |
| Persistent Volume Filling Up | PV running low on available space | Critical: 3% remaining, Warning: 15% remaining | critical / warning |
| Persistent Volume Errors | PV in Failed or Pending state | n/a | critical |
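
For intuition, a rule of the CPU Throttling High shape — not the platform's exact expression, just a common way to write it against cAdvisor metrics — looks roughly like:

```yaml
# Illustrative only — the managed rule's real expression may differ.
- alert: CPUThrottlingHigh
  expr: |
    sum(increase(container_cpu_cfs_throttled_periods_total[5m])) by (container, pod, namespace)
      /
    sum(increase(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace)
      > 0.25
  for: 15m
  labels:
    severity: warning
```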

In the console under Alerting > Rules > Managed Rules:

  1. Toggle any rule on or off with the enable switch.
  2. Edit a rule to override its defaults:
    • Pending duration (for) — how long the condition must hold before firing.
    • Severity — info, warning, or critical.
    • Threshold — for rules with configurable thresholds (e.g., CPU throttling percentage, quota ratio).

Changes are applied through the platform operator — you don’t need to edit YAML or manage PrometheusRule resources directly for default rules.


For application-specific alerts, create your own PrometheusRule resources. These appear automatically under the Custom Rules tab in the console.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-namespace
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighRequestLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="api"}[5m]))
              by (le)
            ) > 2.0
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "High p99 latency on {{ $labels.service }}"
            description: >-
              p99 latency is {{ $value | humanizeDuration }}
              over the last 10 minutes.
            runbook_url: "https://runbooks.example.com/HighRequestLatency"
```

| Field | Purpose |
| --- | --- |
| `alert` | Alert name. Becomes the `alertname` label. |
| `expr` | PromQL expression. The alert fires when this returns results. |
| `for` | How long the condition must hold before firing. Prevents flapping. |
| `labels` | Added to the alert. Used for routing and grouping in Alertmanager. |
| `annotations` | Informational fields included in notifications (summary, description, runbook). |

Use standard labels so Alertmanager can route correctly:

  • severity: critical, warning, or info
  • team: owning team name (use this in routing rule matchers)

And use standard annotations so notifications carry useful context:

  • summary: one-line description, supports {{ $labels.name }} templating
  • description: detailed context with {{ $value }} for the current metric value
  • runbook_url: link to the response procedure for this alert

A few guidelines for writing good alerts:

  • Alert on symptoms (error rate, latency, availability), not causes.
  • Every alert should have a clear action. If there’s nothing to do, it shouldn’t be an alert.
  • Include a runbook_url so responders know what to do.
  • Use for on all rules — 5–15 minutes for warnings, 2–5 minutes for critical.

Recording rules precompute expensive queries and store results as new time series. Use them to speed up dashboards and simplify alert expressions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-recording-rules
  namespace: my-namespace
spec:
  groups:
    - name: my-app-recording
      interval: 1m
      rules:
        - record: namespace:http_request_rate:sum
          expr: |
            sum by (namespace) (
              rate(http_requests_total[5m])
            )
```

Use the naming convention `level:metric:operations` for recording rule names (as in `namespace:http_request_rate:sum` above).
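
One payoff is that alert expressions can reference the precomputed series directly. A sketch using the recording rule above (the alert name and threshold are hypothetical):

```yaml
- alert: HighNamespaceRequestRate            # hypothetical alert
  expr: namespace:http_request_rate:sum > 100   # recorded series from above
  for: 10m
  labels:
    severity: warning
```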


Receivers define where alert notifications are delivered. Configure them in the console at Alerting > Notifications > Receivers.

| Type | What you need |
| --- | --- |
| Slack | Incoming webhook URL |
| Microsoft Teams | Incoming webhook URL from a Teams channel connector |
| PagerDuty | Events API v2 integration key from your PagerDuty service |
| Email | SMTP server, sender address, recipient address |
| Webhook | Any HTTP endpoint that accepts POST requests |

All receiver types support Send resolved — enable this to get notified when alerts recover.
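
For the Webhook type, it helps to know the shape of the body Alertmanager POSTs to your endpoint. An abridged sketch (all field values are illustrative):

```json
{
  "version": "4",
  "status": "firing",
  "receiver": "my-webhook",
  "groupLabels": { "alertname": "HighRequestLatency" },
  "commonLabels": { "severity": "warning", "team": "backend" },
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "HighRequestLatency", "severity": "warning" },
      "annotations": { "summary": "High p99 latency on api" },
      "startsAt": "2025-01-01T00:00:00Z",
      "generatorURL": "https://example.com/graph"
    }
  ]
}
```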

Every receiver ships with rich notification templates out of the box. You don’t need to configure templates unless you want to customize the format.

Slack defaults include:

  • Title with alert status, count, and group labels
  • Per-alert detail with summary, description, runbook link, and labels
  • Severity-aware color coding (red for critical, yellow for warning, blue for info, green for resolved)

PagerDuty defaults include:

  • Severity mapped from alert labels (falls back to critical)
  • Description with summary and detail

Microsoft Teams defaults include:

  • Title with status and count
  • Firing and resolved sections with full alert detail

All text fields in Slack, Teams, and PagerDuty receivers accept Go template syntax. You can customize any field by expanding Show optional settings when creating or editing a receiver.

Templates receive the Alertmanager notification data structure:

| Variable | Description |
| --- | --- |
| `{{ .Status }}` | `"firing"` or `"resolved"` |
| `{{ .Alerts }}` | List of all alerts in the group |
| `{{ .Alerts.Firing }}` | Only firing alerts |
| `{{ .Alerts.Resolved }}` | Only resolved alerts |
| `{{ .GroupLabels }}` | Labels used to group the alerts |
| `{{ .CommonLabels }}` | Labels shared by all alerts in the group |
| `{{ .CommonAnnotations }}` | Annotations shared by all alerts |

Each alert in the list has:

| Variable | Description |
| --- | --- |
| `{{ .Labels.alertname }}` | Alert name |
| `{{ .Labels.severity }}` | Severity label |
| `{{ .Annotations.summary }}` | Summary annotation |
| `{{ .Annotations.description }}` | Description annotation |
| `{{ .Annotations.runbook_url }}` | Runbook link |
| `{{ .GeneratorURL }}` | Link back to the query source |

For example, a per-alert detail block for a message body:

```
{{ range .Alerts -}}
*{{ .Labels.alertname }}* ({{ .Labels.severity }})
{{ .Annotations.summary }}
Namespace: {{ .Labels.namespace }}
{{ end }}
```

| Function | Example | Output |
| --- | --- | --- |
| `toUpper` | `{{ .Status \| toUpper }}` | `FIRING` |
| `join` | `{{ .GroupLabels.SortedPairs.Values \| join " " }}` | `MyAlert warning` |
| `humanize` | `{{ $value \| humanize }}` | `1.234k` |
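
These functions compose. For example, a hypothetical custom title that combines several of them:

```
[{{ .Status | toUpper }}] {{ .GroupLabels.SortedPairs.Values | join " " }} ({{ .Alerts | len }} alerts)
```

This might render as something like `[FIRING] HighRequestLatency (2 alerts)`.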

Routing rules control which alerts go to which receiver. Configure them in the console at Alerting > Notifications > Routing.

Without any routing rules, all alerts go to the default receiver (the first one you created).

  1. Click Add Routing Rule.
  2. Select the Receiver for matching alerts.
  3. Add Matchers — label conditions that determine which alerts match:
    • severity = critical — route all critical alerts
    • team = backend — route alerts tagged for a specific team
    • namespace = my-app — route alerts from a specific namespace
  4. Click Save.

Matchers use Alertmanager’s label matching:

| Syntax | Meaning |
| --- | --- |
| `severity = critical` | Exact match |
| `severity != info` | Not equal |
| `namespace =~ "prod\|staging"` | Regex match |
| `team !~ "test.*"` | Negative regex |

Enable Continue on a route to send matching alerts to this receiver and keep evaluating subsequent routes. This is useful for sending critical alerts to both Slack and PagerDuty.
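
Under the hood, these settings correspond to Alertmanager's route tree. A hypothetical equivalent in raw Alertmanager configuration (receiver names are made up), showing continue in action:

```yaml
route:
  receiver: default-slack          # fallback when nothing else matches
  routes:
    - receiver: pagerduty-oncall
      matchers:
        - severity = critical
      continue: true               # keep evaluating: critical alerts also reach later routes
    - receiver: backend-slack
      matchers:
        - team = backend
```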

Each routing rule can override the default grouping and timing:

| Setting | Default | Purpose |
| --- | --- | --- |
| Group by | `alertname` | Labels used to batch alerts into a single notification |
| Group wait | 30s | Delay before the first notification for a new group |
| Group interval | 5m | Minimum wait between updates to a group |
| Repeat interval | 4h | Resend interval when nothing has changed |

Click Show default overrides in the routing rule form to configure these.
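
In raw Alertmanager terms, those four settings map to per-route fields. A sketch with the default values from the table (the receiver and matcher are hypothetical):

```yaml
- receiver: backend-slack        # hypothetical receiver
  matchers:
    - team = backend
  group_by: [alertname]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
```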


Each receiver has a Test button in the console. Click it to send a one-off test alert to verify the receiver is working. The test alert auto-resolves in a few seconds and does not persist.

To test the full pipeline (rule → ruler → Alertmanager → receiver), deploy a rule that fires immediately:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alert
  namespace: default
spec:
  groups:
    - name: test
      rules:
        - alert: TestAlert
          expr: vector(1) > 0
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Test alert — safe to ignore"
```

Apply it, wait 2–3 minutes for the full sync and evaluation cycle, then delete it:

```sh
kubectl apply -f test-alert.yaml
# Wait for notification...
kubectl delete prometheusrule test-alert
```

Test your PromQL expressions before deploying. Note that promtool expects the plain Prometheus rule-file format — the contents of `spec.groups` — rather than the full PrometheusRule resource:

```sh
promtool check rules rules.yaml
```
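
You can go further and unit-test rules offline with `promtool test rules`. A minimal sketch, assuming the TestAlert rule above has been saved in plain rule-file format as rules.yaml (file names are hypothetical):

```yaml
# tests.yaml — promtool unit test (hypothetical file names)
rule_files:
  - rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series: []        # TestAlert fires unconditionally, no input series needed
    alert_rule_test:
      - eval_time: 2m       # past the 1m "for" duration
        alertname: TestAlert
        exp_alerts:
          - exp_labels:
              severity: warning
            exp_annotations:
              summary: "Test alert — safe to ignore"
```

Run it with `promtool test rules tests.yaml`.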