
Prometheus & Grafana

Kymaros exposes five Prometheus metrics that cover test outcomes, timing, and backup age. This guide covers the metric definitions, useful PromQL queries, AlertManager rules for production alerting, and how to configure scraping and Grafana.


Metrics reference

All metrics are registered by the operator and exposed on its metrics endpoint (default: port 8080, path /metrics).
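If you do not run the Prometheus Operator (covered in the ServiceMonitor section), a plain static scrape configuration works too. This is a minimal sketch: the job name and target address are placeholders that assume an in-cluster Service named kymaros-operator in the kymaros-system namespace, reachable on port 8080.

```yaml
# Sketch of a static scrape config. The target address is an assumption;
# adjust the Service name, namespace, and port to your deployment.
scrape_configs:
  - job_name: kymaros-operator
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets:
          - kymaros-operator.kymaros-system.svc:8080
```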

kymaros_tests_total

Type: Counter
Labels: test, result

Total number of restore tests executed since the operator started. The result label takes one of three values: passed, failed, or error (infrastructure error, not a health check failure).

kymaros_tests_total{test="webapp-nightly", result="passed"} 42
kymaros_tests_total{test="webapp-nightly", result="failed"} 3

kymaros_score

Type: Gauge
Labels: test

The score from the most recent completed test run, on a scale of 0 to 100. A score of 100 means all checks passed. A score of 0 means the test failed entirely or could not run.

kymaros_score{test="webapp-nightly"} 87

kymaros_rto_seconds

Type: Gauge
Labels: test

The measured Recovery Time Objective in seconds for the most recent test run. This is the elapsed time from restore initiation to the moment all health checks passed.

kymaros_rto_seconds{test="webapp-nightly"} 342

kymaros_test_duration_seconds

Type: Histogram
Labels: test
Buckets: exponential, starting at 60 seconds and doubling up to 7680 seconds (60, 120, 240, 480, 960, 1920, 3840, 7680)

The total duration of each test run from start to finish, including backup restore time, pod scheduling, and all health check execution. Stored as a histogram to allow percentile queries across test runs.

kymaros_test_duration_seconds_bucket{test="webapp-nightly", le="240"} 38
kymaros_test_duration_seconds_bucket{test="webapp-nightly", le="480"} 44
kymaros_test_duration_seconds_sum{test="webapp-nightly"} 9840
kymaros_test_duration_seconds_count{test="webapp-nightly"} 44
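The bucket boundaries are a simple doubling series, and the _sum/_count pair gives the mean duration directly. A quick sketch using the sample values shown above:

```python
# Exponential buckets: start at 60 s and double 8 times.
buckets = [60 * 2 ** i for i in range(8)]
print(buckets)  # [60, 120, 240, 480, 960, 1920, 3840, 7680]

# Mean test duration from the sample _sum and _count values above:
# 9840 s over 44 runs.
mean_seconds = 9840 / 44
print(round(mean_seconds, 1))  # 223.6
```

The mean is useful for a quick sanity check, but percentile queries against the buckets (shown in the PromQL section) are more robust against outliers.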

kymaros_backup_age_seconds

Type: Gauge
Labels: test

The age of the backup that was most recently tested, measured in seconds from the backup creation timestamp to the test run start. This metric is useful for detecting stale backups that have not been updated.

kymaros_backup_age_seconds{test="webapp-nightly"} 82800

A value of 82800 seconds indicates the backup was approximately 23 hours old when tested — consistent with a nightly backup schedule.


PromQL queries

Pass rate over the last 30 days

sum(increase(kymaros_tests_total{result="passed"}[30d]))
/
sum(increase(kymaros_tests_total[30d]))

Current score for all tests

kymaros_score

Tests with score below threshold (e.g., below 80)

kymaros_score < 80

RTO trend over the last 7 days (per test)

kymaros_rto_seconds

Plot this over a time range to visualize trends. A rising trend may indicate data volume growth or infrastructure degradation.
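Because the gauge only holds the latest measurement, a smoothed view can make slow drift easier to spot. One option, using standard PromQL (not specific to Kymaros):

```promql
avg_over_time(kymaros_rto_seconds[1d])
```

Graphed over 7 days, this averages out run-to-run noise while preserving the trend.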

95th percentile test duration over the last 30 days

histogram_quantile(0.95, sum(rate(kymaros_test_duration_seconds_bucket[30d])) by (le, test))

Tests that have not run in 25 hours

changes(kymaros_tests_total[25h]) == 0

Returns series for any test where the counter has not incremented in 25 hours. This is the same expression used in the RestoreTestNotRun alert.

Backup age at last test exceeding 26 hours

kymaros_backup_age_seconds > 93600

This metric records the age of the backup that was used in the most recent test run, not how long ago the test ran. A value above 93600 seconds (26 hours) indicates the backup itself was stale when tested — the backup pipeline may have missed a run. A 26-hour threshold gives a 1-hour buffer above a 25-hour max-drift daily schedule.

Failure rate by test

rate(kymaros_tests_total{result="failed"}[7d])
/
rate(kymaros_tests_total[7d])

AlertManager rules

The following three alerting rules cover the most important failure conditions for production use. Add them to your Prometheus rule files or, if you use the Prometheus Operator, to a PrometheusRule resource; AlertManager then handles routing and notification.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kymaros-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: kymaros.rules
      interval: 1m
      rules:
        # Alert when a restore test fails.
        - alert: RestoreTestFailed
          expr: |
            increase(kymaros_tests_total{result="failed"}[1h]) > 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Restore test failed: {{ $labels.test }}"
            description: >
              RestoreTest {{ $labels.test }} failed in the last hour.
              Check the RestoreReport resource and operator logs for the
              failing health check details.
            runbook_url: "https://docs.kymaros.io/docs/guides/health-checks"

        # Alert when the measured RTO exceeds the SLA threshold.
        # Adjust 900 (15 minutes) to match your RTO SLA.
        - alert: RTOExceedsSLA
          expr: |
            kymaros_rto_seconds > 900
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "RTO exceeds SLA for {{ $labels.test }}"
            description: >
              RestoreTest {{ $labels.test }} measured an RTO of
              {{ $value | humanizeDuration }}, which exceeds the configured
              SLA of 15 minutes. Review infrastructure performance and
              backup size growth.
            runbook_url: "https://docs.kymaros.io/docs/guides/stateful-apps"

        # Alert when a test has not run in over 25 hours.
        # Catches schedule failures, operator crashes, or backup tool outages.
        - alert: RestoreTestNotRun
          expr: |
            changes(kymaros_tests_total[25h]) == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Restore test not run recently: {{ $labels.test }}"
            description: >
              No restore test has completed for {{ $labels.test }} in the
              last 25 hours. Verify the operator is running and the backup
              schedule is producing new backups.
            runbook_url: "https://docs.kymaros.io/docs/operations/troubleshooting"

Alert severity guide

  • RestoreTestFailed (critical): A failed test means your current backup cannot be verified as restorable. This is a data recovery risk.
  • RTOExceedsSLA (warning): Recovery is still possible but slower than committed. Investigate before the next test.
  • RestoreTestNotRun (warning): The test machinery itself may be broken. Not an immediate risk but masks potential failures.
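The severity label is what AlertManager routes on. A sketch of a matching route block, assuming receivers named pagerduty and slack already exist in your AlertManager configuration:

```yaml
# Sketch only -- the receiver names are placeholders.
route:
  receiver: slack
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty
    - matchers:
        - severity="warning"
      receiver: slack
```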

ServiceMonitor

If you use the Prometheus Operator, create a ServiceMonitor to scrape the Kymaros operator metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kymaros-operator
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
spec:
  namespaceSelector:
    matchNames:
      - kymaros-system
  selector:
    matchLabels:
      app: kymaros-operator
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http

This assumes the Kymaros operator Service has a port named metrics that points to container port 8080. Verify the port name with:

kubectl get svc -n kymaros-system -l app=kymaros-operator -o jsonpath='{.items[0].spec.ports}'
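After Prometheus picks up the ServiceMonitor (allow at least one scrape interval), you can confirm the target is healthy. The namespace label below assumes the labels the Prometheus Operator typically attaches to discovered targets; check the Targets page in the Prometheus UI if the series is absent:

```promql
up{namespace="kymaros-system"} == 1
```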

Grafana dashboard

Kymaros does not ship a pre-built Grafana dashboard at this time. The following panels cover the most useful views:

  • Score over time, per test (time series): kymaros_score
  • Pass/fail ratio (pie chart): sum by (result)(increase(kymaros_tests_total[24h]))
  • RTO trend (time series): kymaros_rto_seconds
  • Backup age (gauge, in hours): kymaros_backup_age_seconds / 3600
  • P95 test duration (time series): histogram_quantile(0.95, sum(rate(kymaros_test_duration_seconds_bucket[7d])) by (le, test))
  • Tests with score below 80 (table): kymaros_score < 80

Importing a dashboard

If your team has created a Kymaros dashboard JSON, import it via:

  1. Open Grafana and navigate to Dashboards > Import.
  2. Paste the JSON or upload the file.
  3. Select your Prometheus datasource when prompted.
  4. Click Import.

The dashboard will appear under the folder you selected during import. Grafana Dashboard ID lookup is not available for Kymaros at this time — use a local JSON file.

Variables

For dashboards that cover multiple tests, add a Grafana variable:

  • Variable name: test
  • Type: Query
  • Query: label_values(kymaros_score, test)
  • Refresh: On dashboard load

Then replace the literal test name in queries with $test:

kymaros_score{test="$test"}
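If you enable the Multi-value or Include All option on the variable, the equality matcher above no longer works, because Grafana expands the selection into a regex alternation. Use a regex matcher instead:

```promql
kymaros_score{test=~"$test"}
```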