# Prometheus & Grafana

Kymaros exposes five Prometheus metrics covering test outcomes, timing, and backup age. This guide covers the metric definitions, useful PromQL queries, alerting rules for production use, and how to configure scraping and Grafana.
## Metrics reference
All metrics are registered by the operator and exposed on its metrics endpoint (default: port 8080, path /metrics).
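A quick way to sanity-check the endpoint is to scrape it and filter out the Kymaros series. A minimal Python sketch using only the standard library — the sample payload below is illustrative, and the commented-out live scrape assumes a `kubectl port-forward` to port 8080 is running:

```python
from urllib.request import urlopen  # only needed for a live scrape

def kymaros_series(payload: str) -> list[str]:
    """Return the non-comment metric lines whose name starts with kymaros_."""
    return [line for line in payload.splitlines() if line.startswith("kymaros_")]

# Live scrape (assumes `kubectl port-forward -n kymaros-system svc/... 8080` is active):
#   payload = urlopen("http://localhost:8080/metrics").read().decode()
payload = """\
# HELP kymaros_score Score of the most recent test run.
# TYPE kymaros_score gauge
kymaros_score{test="webapp-nightly"} 87
go_goroutines 12
"""
print(kymaros_series(payload))  # only the kymaros_score sample survives the filter
```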
### `kymaros_tests_total`

- **Type:** Counter
- **Labels:** `test`, `result`

Total number of restore tests executed since the operator started. The `result` label takes one of three values: `passed`, `failed`, or `error` (infrastructure error, not a health check failure).

```
kymaros_tests_total{test="webapp-nightly", result="passed"} 42
kymaros_tests_total{test="webapp-nightly", result="failed"} 3
```
### `kymaros_score`

- **Type:** Gauge
- **Labels:** `test`

The score from the most recent completed test run, on a scale of 0 to 100. A score of 100 means all checks passed; a score of 0 means the test failed entirely or could not run.

```
kymaros_score{test="webapp-nightly"} 87
```
### `kymaros_rto_seconds`

- **Type:** Gauge
- **Labels:** `test`

The measured recovery time (RTO) in seconds for the most recent test run: the elapsed time from restore initiation to the moment all health checks passed.

```
kymaros_rto_seconds{test="webapp-nightly"} 342
```
### `kymaros_test_duration_seconds`

- **Type:** Histogram
- **Labels:** `test`
- **Buckets:** Exponential, starting at 60 seconds and doubling up to 7680 seconds (60, 120, 240, 480, 960, 1920, 3840, 7680)

The total duration of each test run from start to finish, including backup restore time, pod scheduling, and all health check execution. Stored as a histogram so that percentile queries can be run across test runs.

```
kymaros_test_duration_seconds_bucket{test="webapp-nightly", le="240"} 38
kymaros_test_duration_seconds_bucket{test="webapp-nightly", le="480"} 44
kymaros_test_duration_seconds_sum{test="webapp-nightly"} 9840
kymaros_test_duration_seconds_count{test="webapp-nightly"} 44
```
### `kymaros_backup_age_seconds`

- **Type:** Gauge
- **Labels:** `test`

The age of the backup that was most recently tested, measured in seconds from the backup creation timestamp to the test run start. This metric is useful for detecting stale backups that have not been updated.

```
kymaros_backup_age_seconds{test="webapp-nightly"} 82800
```

A value of 82800 seconds indicates the backup was approximately 23 hours old when tested — consistent with a nightly backup schedule.
## PromQL queries
### Pass rate over the last 30 days

```promql
sum(increase(kymaros_tests_total{result="passed"}[30d]))
/
sum(increase(kymaros_tests_total[30d]))
```

Multiply the result by 100 to express it as a percentage.
### Current score for all tests

```promql
kymaros_score
```
### Tests with score below threshold (e.g., below 80)

```promql
kymaros_score < 80
```
### RTO trend over the last 7 days (per test)

```promql
kymaros_rto_seconds
```

Plot this over a time range to visualize the trend. A rising trend may indicate data volume growth or infrastructure degradation.
### 95th percentile test duration over the last 30 days

```promql
histogram_quantile(0.95, sum(rate(kymaros_test_duration_seconds_bucket[30d])) by (le, test))
```
### Tests that have not run in 25 hours

```promql
changes(kymaros_tests_total[25h]) == 0
```

Returns a series for any test whose counter has not incremented in 25 hours. This is the same expression used in the `RestoreTestNotRun` alert.
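For intuition, `changes()` counts how many times the sample value changed between consecutive scrapes inside the window. A toy model in Python — the sample values are illustrative:

```python
def changes(samples: list[float]) -> int:
    """Toy model of PromQL changes(): count value changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a != b)

# Counter samples scraped over a window; each completed test run increments the counter.
print(changes([42, 42, 43, 43, 44]))  # 2: two runs completed in the window
print(changes([42, 42, 42]))          # 0: no runs, so the alert condition matches
```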
### Backup age at last test exceeding 26 hours

```promql
kymaros_backup_age_seconds > 93600
```

This metric records the age of the backup used in the most recent test run, not how long ago the test ran. A value above 93600 seconds (26 hours) indicates the backup itself was stale when tested — the backup pipeline may have missed a run. A 26-hour threshold gives a 1-hour buffer above a daily schedule with up to 25 hours of drift.
### Failure rate by test

```promql
sum by (test) (rate(kymaros_tests_total{result="failed"}[7d]))
/
sum by (test) (rate(kymaros_tests_total[7d]))
```

The `sum by (test)` aggregations drop the `result` label so the two sides match on `test`; without them, the division would pair each `failed` series only with itself.
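If dashboards evaluate long-range ratios like these frequently, they can be precomputed with a Prometheus recording rule. A sketch — the rule name `kymaros:pass_rate_30d` is an assumption for illustration, not something Kymaros ships:

```yaml
groups:
  - name: kymaros.recording
    rules:
      # Precompute the 30-day pass rate so dashboards query a single series.
      - record: kymaros:pass_rate_30d
        expr: |
          sum(increase(kymaros_tests_total{result="passed"}[30d]))
          /
          sum(increase(kymaros_tests_total[30d]))
```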
## Alerting rules

The following three Prometheus alerting rules cover the most important failure conditions for production use. Add them to a Prometheus rule file, or to a `PrometheusRule` resource if you run the Prometheus Operator; Alertmanager then handles routing and notification.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kymaros-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: kymaros.rules
      interval: 1m
      rules:
        # Alert when a restore test fails.
        - alert: RestoreTestFailed
          expr: |
            increase(kymaros_tests_total{result="failed"}[1h]) > 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Restore test failed: {{ $labels.test }}"
            description: >
              RestoreTest {{ $labels.test }} failed in the last hour.
              Check the RestoreReport resource and operator logs for the
              failing health check details.
            runbook_url: "https://docs.kymaros.io/docs/guides/health-checks"

        # Alert when the measured RTO exceeds the SLA threshold.
        # Adjust 900 (15 minutes) to match your RTO SLA.
        - alert: RTOExceedsSLA
          expr: |
            kymaros_rto_seconds > 900
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "RTO exceeds SLA for {{ $labels.test }}"
            description: >
              RestoreTest {{ $labels.test }} measured an RTO of
              {{ $value | humanizeDuration }}, which exceeds the configured
              SLA of 15 minutes. Review infrastructure performance and
              backup size growth.
            runbook_url: "https://docs.kymaros.io/docs/guides/stateful-apps"

        # Alert when a test has not run in over 25 hours.
        # Catches schedule failures, operator crashes, or backup tool outages.
        - alert: RestoreTestNotRun
          expr: |
            changes(kymaros_tests_total[25h]) == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Restore test not run recently: {{ $labels.test }}"
            description: >
              No restore test has completed for {{ $labels.test }} in the
              last 25 hours. Verify the operator is running and the backup
              schedule is producing new backups.
            runbook_url: "https://docs.kymaros.io/docs/operations/troubleshooting"
```
### Alert severity guide

| Alert | Severity | Reason |
|---|---|---|
| RestoreTestFailed | critical | A failed test means your current backup cannot be verified as restorable. This is a data recovery risk. |
| RTOExceedsSLA | warning | Recovery is still possible but slower than committed. Investigate before the next test. |
| RestoreTestNotRun | warning | The test machinery itself may be broken. Not an immediate risk, but it masks potential failures. |
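On the Alertmanager side, these severities map naturally onto separate receivers. A minimal routing sketch; the receiver names `pagerduty-oncall` and `slack-backups` are assumptions for illustration, and each receiver still needs its real notification config:

```yaml
route:
  receiver: slack-backups           # default: warnings go to the team channel
  routes:
    - matchers:
        - severity="critical"       # RestoreTestFailed pages on-call
      receiver: pagerduty-oncall
receivers:
  - name: slack-backups
  - name: pagerduty-oncall
```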
## ServiceMonitor
If you use the Prometheus Operator, create a ServiceMonitor to scrape the Kymaros operator metrics:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kymaros-operator
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
spec:
  namespaceSelector:
    matchNames:
      - kymaros-system
  selector:
    matchLabels:
      app: kymaros-operator
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
```
This assumes the Kymaros operator Service has a port named `metrics` pointing to port 8080. Verify the port name with:

```bash
kubectl get svc -n kymaros-system -l app=kymaros-operator -o jsonpath='{.items[0].spec.ports}'
```
## Grafana dashboard

Kymaros does not ship a pre-built Grafana dashboard at this time. The following panels cover the most useful views.
### Recommended panels

| Panel | Query | Visualization |
|---|---|---|
| Score over time (per test) | `kymaros_score` | Time series |
| Pass/fail ratio | `sum by (result)(increase(kymaros_tests_total[24h]))` | Pie chart |
| RTO trend | `kymaros_rto_seconds` | Time series |
| Backup age | `kymaros_backup_age_seconds / 3600` | Gauge (hours) |
| P95 test duration | `histogram_quantile(0.95, sum(rate(kymaros_test_duration_seconds_bucket[7d])) by (le, test))` | Time series |
| Tests with score below 80 | `kymaros_score < 80` | Table |
### Importing a dashboard

If your team has created a Kymaros dashboard JSON, import it as follows:

1. Open Grafana and navigate to Dashboards > Import.
2. Paste the JSON or upload the file.
3. Select your Prometheus datasource when prompted.
4. Click Import.

The dashboard appears under the folder you selected during import. A Grafana.com dashboard ID is not available for Kymaros at this time, so import from a local JSON file.
### Variables

For dashboards that cover multiple tests, add a Grafana variable:

- Variable name: `test`
- Type: Query
- Query: `label_values(kymaros_score, test)`
- Refresh: On dashboard load

Then replace the literal test name in queries with `$test`:

```promql
kymaros_score{test="$test"}
```

If the variable allows multiple values, use a regex matcher instead: `kymaros_score{test=~"$test"}`.