Prometheus Metrics
The Kymaros API server exposes Prometheus metrics at /metrics on port 8080. All metrics are prefixed with kymaros_.
Metrics reference
| Metric | Type | Labels | Description |
|---|---|---|---|
| kymaros_tests_total | Counter | test, result | Total number of completed test runs, partitioned by test name and outcome. |
| kymaros_score | Gauge | test | Current confidence score (0–100) for the most recent run of each test. |
| kymaros_rto_seconds | Gauge | test | Measured restore duration in seconds for the most recent run of each test. |
| kymaros_test_duration_seconds | Histogram | test | Full test execution duration in seconds. Exponential buckets from 60s to 7680s. |
| kymaros_backup_age_seconds | Gauge | test | Age of the backup that was restored, in seconds, as of the most recent run. |
Label values
| Label | Description |
|---|---|
| test | Name of the RestoreTest resource (e.g., my-app-nightly). |
| result | Outcome of the run: pass, fail, or partial. Present only on kymaros_tests_total. |
Metric details
kymaros_tests_total
A monotonically increasing counter. Each completed test run increments the counter for the corresponding (test, result) pair.
kymaros_tests_total{test="my-app-nightly", result="pass"} 42
kymaros_tests_total{test="my-app-nightly", result="fail"} 3
kymaros_tests_total{test="orders-db-validation", result="pass"} 38
kymaros_tests_total{test="orders-db-validation", result="partial"} 2
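As an illustration of how the (test, result) pairs combine, the following Python sketch computes a per-test pass ratio from the sample lines above. The helper name pass_ratio_by_test and the regex-based parsing are assumptions made for this example, not part of Kymaros. The ratio uses lifetime counter values, so it is not equivalent to a windowed increase() query.

```python
import re
from collections import defaultdict

# Matches one kymaros_tests_total sample line and captures its labels and value.
SAMPLE_RE = re.compile(r'^kymaros_tests_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def pass_ratio_by_test(exposition_text):
    """Return {test: passes / all completed runs} from kymaros_tests_total samples."""
    totals = defaultdict(float)
    passes = defaultdict(float)
    for line in exposition_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None:
            continue  # skip comments, blank lines, and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        test = labels.get("test")
        if test is None:
            continue  # a sample without a test label cannot be attributed
        value = float(match.group("value"))
        totals[test] += value
        if labels.get("result") == "pass":
            passes[test] += value
    return {test: passes[test] / total for test, total in totals.items() if total > 0}


# The sample lines from this section.
SAMPLE = """\
kymaros_tests_total{test="my-app-nightly", result="pass"} 42
kymaros_tests_total{test="my-app-nightly", result="fail"} 3
kymaros_tests_total{test="orders-db-validation", result="pass"} 38
kymaros_tests_total{test="orders-db-validation", result="partial"} 2
"""

ratios = pass_ratio_by_test(SAMPLE)
print(ratios)
```

With the sample values, my-app-nightly has 42 passes out of 45 runs and orders-db-validation has 38 out of 40.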
kymaros_score
Updated after each run. Holds the score from the most recent completed run, not a running average.
kymaros_score{test="my-app-nightly"} 96
kymaros_score{test="orders-db-validation"} 42
kymaros_rto_seconds
Updated after each run. Holds the measured RTO, in seconds, from the most recent run.
kymaros_rto_seconds{test="my-app-nightly"} 695
kymaros_rto_seconds{test="orders-db-validation"} 1105
kymaros_test_duration_seconds
Histogram tracking the full wall-clock time of each test execution (from restore start to final validation step). Bucket boundaries (seconds): 60, 120, 240, 480, 960, 1920, 3840, 7680.
kymaros_test_duration_seconds_bucket{test="my-app-nightly", le="480"} 38
kymaros_test_duration_seconds_bucket{test="my-app-nightly", le="960"} 42
kymaros_test_duration_seconds_bucket{test="my-app-nightly", le="+Inf"} 42
kymaros_test_duration_seconds_sum{test="my-app-nightly"} 28140
kymaros_test_duration_seconds_count{test="my-app-nightly"} 42
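The listed boundaries double at each step, which matches an exponential series starting at 60 with a factor of 2 and 8 buckets. The Python sketch below reproduces that series and derives the mean duration from the _sum and _count samples above. The start, factor, and count parameters are inferred from the boundary list, not taken from the Kymaros source.

```python
def exponential_buckets(start, factor, count):
    """Return `count` upper bounds, starting at `start` and multiplying by `factor`."""
    return [start * factor**i for i in range(count)]


# Parameters inferred from the documented boundaries (an assumption).
bounds = exponential_buckets(60, 2, 8)
print(bounds)  # [60, 120, 240, 480, 960, 1920, 3840, 7680]

# Mean duration = _sum / _count, using the sample values above.
mean_seconds = 28140 / 42
print(mean_seconds)  # 670.0
```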
kymaros_backup_age_seconds
Age of the restored backup at the time of restore, in seconds. Useful for tracking backup recency.
kymaros_backup_age_seconds{test="my-app-nightly"} 14412
kymaros_backup_age_seconds{test="orders-db-validation"} 18008
PromQL examples
Current score for a specific test
kymaros_score{test="my-app-nightly"}
Tests with a score below 70 (failures)
kymaros_score < 70
Overall pass rate across all tests (last 7 days)
sum(increase(kymaros_tests_total{result="pass"}[7d]))
/
sum(increase(kymaros_tests_total[7d]))
Failure rate per test (last 24 hours)
sum by (test) (increase(kymaros_tests_total{result="fail"}[24h]))
/
sum by (test) (increase(kymaros_tests_total[24h]))
Tests where measured RTO exceeds a threshold (e.g., 900 seconds / 15 minutes)
kymaros_rto_seconds > 900
95th percentile test duration across all tests
histogram_quantile(0.95, sum by (le) (rate(kymaros_test_duration_seconds_bucket[24h])))
95th percentile test duration per test
histogram_quantile(
  0.95,
  sum by (test, le) (rate(kymaros_test_duration_seconds_bucket[24h]))
)
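To make the percentile queries less opaque, here is a minimal Python sketch of the linear interpolation that histogram_quantile applies within a bucket. It runs on the raw cumulative counts from the kymaros_test_duration_seconds sample in the metric details, whereas the queries above operate on per-second rates over a 24-hour window. The function name estimate_quantile is hypothetical, and the sketch covers only the common cases, not every edge case Prometheus handles.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with a math.inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # The first bucket is assumed to start at 0, as histogram_quantile does.
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = cumulative - lower_count
            # Interpolate linearly between the bucket's lower and upper bounds.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


# Cumulative counts from the my-app-nightly sample (lower buckets omitted there too).
SAMPLE_BUCKETS = [(480, 38), (960, 42), (math.inf, 42)]

p95 = estimate_quantile(0.95, SAMPLE_BUCKETS)
print(p95)
```

For the sample, the 95th percentile rank is 0.95 × 42 = 39.9, which falls in the (480, 960] bucket, so the estimate is roughly 480 + 480 × 1.9 / 4 = 708 seconds. The estimate can never be more precise than the bucket boundaries allow.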
Average backup age across all tests (hours)
avg(kymaros_backup_age_seconds) / 3600
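Any of the queries above can also be run programmatically through the Prometheus HTTP API instant-query endpoint (/api/v1/query). The Python sketch below lists tests scoring below a threshold. The server address http://localhost:9090 and the helper names are assumptions for illustration; adjust them to your environment.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: address of your Prometheus server


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(payload):
    """Map each series' `test` label to its float value from a query response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        series["metric"].get("test", ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


def low_scoring_tests(base_url=PROMETHEUS_URL, threshold=70):
    """Return {test: score} for tests currently scoring below `threshold`."""
    url = build_query_url(base_url, f"kymaros_score < {threshold}")
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_instant_vector(json.load(response))
```

Calling low_scoring_tests() against a reachable Prometheus server returns a dictionary keyed by test name. Parsing is kept separate from the network call so it can be exercised against a saved response.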
Alerts: score below 70 and RTO above 900 seconds
Suitable as Prometheus alerting rules:
groups:
  - name: kymaros
    rules:
      - alert: KymarosTestFailed
        expr: kymaros_score < 70
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Restore test failed: {{ $labels.test }}"
          description: "Test {{ $labels.test }} scored {{ $value }}/100 (threshold: 70)"
      - alert: KymarosRTOExceeded
        expr: kymaros_rto_seconds > 900
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "RTO exceeded for test: {{ $labels.test }}"
          description: "Test {{ $labels.test }} measured {{ $value }}s RTO"
Scrape configuration
Prometheus scrape_configs
scrape_configs:
  - job_name: kymaros
    static_configs:
      - targets:
          - kymaros-api.kymaros-system.svc.cluster.local:8080
    metrics_path: /metrics
    scrape_interval: 30s
Prometheus Operator ServiceMonitor
If you use the Prometheus Operator (e.g., via kube-prometheus-stack), create a ServiceMonitor in the kymaros-system namespace:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kymaros
  namespace: kymaros-system
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kymaros-api
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
  namespaceSelector:
    matchNames:
      - kymaros-system
Adjust the release label to match your Prometheus Operator's serviceMonitorSelector.
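Before wiring up either scrape method, it can help to confirm that the endpoint actually serves the five metric families listed in the reference table. The following Python sketch does that check. The helper names are hypothetical, and the URL assumes the in-cluster service address used in the scrape configuration above is reachable from where the script runs (for example, from a pod or through a port-forward with the host adjusted).

```python
import re
import urllib.request

METRICS_URL = "http://kymaros-api.kymaros-system.svc.cluster.local:8080/metrics"

EXPECTED = {
    "kymaros_tests_total",
    "kymaros_score",
    "kymaros_rto_seconds",
    "kymaros_test_duration_seconds",
    "kymaros_backup_age_seconds",
}
NAME_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")
HISTOGRAM_SUFFIXES = ("_bucket", "_sum", "_count")


def missing_metrics(exposition_text):
    """Return the expected metric families that have no sample in the text."""
    found = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = NAME_RE.match(line)
        if match is None:
            continue
        name = match.group(1)
        found.add(name)
        # Histogram samples carry a suffix; also record the family name.
        for suffix in HISTOGRAM_SUFFIXES:
            if name.endswith(suffix):
                found.add(name[: -len(suffix)])
                break
    return EXPECTED - found


def check_endpoint(url=METRICS_URL):
    """Fetch the metrics endpoint and report any missing families."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return missing_metrics(response.read().decode("utf-8"))
```

An empty set from check_endpoint() means every documented family has at least one sample. Families may legitimately be absent until the first test run completes, so a non-empty result on a fresh install is not necessarily an error.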