
Prometheus Metrics

The Kymaros API server exposes Prometheus metrics at /metrics on port 8080. All metrics are prefixed with kymaros_.
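For a quick look at the endpoint outside Prometheus, the following Python sketch parses text in the Prometheus exposition format and keeps only the kymaros_ samples. The function name, the regular expressions, and the sample payload are illustrative assumptions; the regex covers simple label sets only and is not a complete parser.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value".
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_kymaros_samples(text):
    """Return (name, labels, value) tuples for every kymaros_ sample in text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None or not match.group(1).startswith("kymaros_"):
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload, shaped like the samples shown on this page.
EXAMPLE = """\
# TYPE kymaros_score gauge
kymaros_score{test="my-app-nightly"} 96
kymaros_score{test="orders-db-validation"} 42
go_goroutines 12
"""

if __name__ == "__main__":
    for name, labels, value in parse_kymaros_samples(EXAMPLE):
        print(name, labels, value)
```

To run this against a live server, read the body of /metrics (for example with urllib.request.urlopen) and pass the decoded text to parse_kymaros_samples. The host name to use depends on how the API service is exposed in your cluster.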


Metrics reference

| Metric | Type | Labels | Description |
|---|---|---|---|
| kymaros_tests_total | Counter | test, result | Total number of completed test runs, partitioned by test name and outcome. |
| kymaros_score | Gauge | test | Current confidence score (0–100) for the most recent run of each test. |
| kymaros_rto_seconds | Gauge | test | Measured restore duration in seconds for the most recent run of each test. |
| kymaros_test_duration_seconds | Histogram | test | Full test execution duration in seconds. Exponential buckets from 60s to 7680s. |
| kymaros_backup_age_seconds | Gauge | test | Age of the backup that was restored, in seconds, as of the most recent run. |

Label values

| Label | Description |
|---|---|
| test | Name of the RestoreTest resource (e.g., my-app-nightly). |
| result | Result value on kymaros_tests_total: pass, fail, or partial. |

Metric details

kymaros_tests_total

A monotonically increasing counter. Each completed test run increments the counter for the corresponding (test, result) pair.

kymaros_tests_total{test="my-app-nightly", result="pass"} 42
kymaros_tests_total{test="my-app-nightly", result="fail"} 3
kymaros_tests_total{test="orders-db-validation", result="pass"} 38
kymaros_tests_total{test="orders-db-validation", result="partial"} 2
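As a worked example of reading these counters, the Python sketch below computes each test's share of passing runs from the sample values above. The helper name is illustrative, and the result describes the counter's whole lifetime, assuming the counters have not reset in the meantime.

```python
from collections import defaultdict

# Counter values copied from the example output above: (test, result) -> count.
COUNTERS = {
    ("my-app-nightly", "pass"): 42,
    ("my-app-nightly", "fail"): 3,
    ("orders-db-validation", "pass"): 38,
    ("orders-db-validation", "partial"): 2,
}


def pass_ratio_by_test(counters):
    """Return test -> share of completed runs whose result was "pass"."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for (test, result), count in counters.items():
        totals[test] += count
        if result == "pass":
            passes[test] += count
    # Skip tests with no completed runs to avoid dividing by zero.
    return {test: passes[test] / totals[test] for test in totals if totals[test] > 0}


if __name__ == "__main__":
    for test, ratio in pass_ratio_by_test(COUNTERS).items():
        print(f"{test}: {ratio:.1%}")
```

With these values, my-app-nightly passes 42 of 45 runs and orders-db-validation passes 38 of 40.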

kymaros_score

Updated after each run. Holds the score from the most recent completed run, not a running average.

kymaros_score{test="my-app-nightly"} 96
kymaros_score{test="orders-db-validation"} 42

kymaros_rto_seconds

Updated after each run. Holds the measured RTO from the most recent run in seconds.

kymaros_rto_seconds{test="my-app-nightly"} 695
kymaros_rto_seconds{test="orders-db-validation"} 1105

kymaros_test_duration_seconds

Histogram tracking the full wall-clock time of each test execution (from restore start to final validation step). Bucket boundaries (seconds): 60, 120, 240, 480, 960, 1920, 3840, 7680.

kymaros_test_duration_seconds_bucket{test="my-app-nightly", le="480"} 38
kymaros_test_duration_seconds_bucket{test="my-app-nightly", le="960"} 42
kymaros_test_duration_seconds_bucket{test="my-app-nightly", le="+Inf"} 42
kymaros_test_duration_seconds_sum{test="my-app-nightly"} 17640
kymaros_test_duration_seconds_count{test="my-app-nightly"} 42
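To show how a quantile is estimated from cumulative buckets like these, here is a minimal Python sketch of linear interpolation within a bucket, which is the approach PromQL's histogram_quantile takes for classic histograms. The function name is illustrative, and the input uses only the three buckets shown above; a real scrape would include every bucket boundary.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for i, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan  # unreachable when the +Inf bucket holds the total count


# Cumulative bucket counts copied from the example output above.
BUCKETS = [(480.0, 38), (960.0, 42), (math.inf, 42)]

if __name__ == "__main__":
    print(bucket_quantile(0.95, BUCKETS))
```

For the 95th percentile, the target rank is 0.95 × 42 = 39.9, which falls in the (480, 960] bucket holding runs 39 through 42, so the estimate is 480 + 480 × 1.9 / 4 = 708 seconds. The estimate can never be more precise than the bucket boundaries allow.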

kymaros_backup_age_seconds

Age of the restored backup at the time of restore, in seconds. Useful for tracking backup recency.

kymaros_backup_age_seconds{test="my-app-nightly"} 14412
kymaros_backup_age_seconds{test="orders-db-validation"} 18008

PromQL examples

Current score for a specific test

kymaros_score{test="my-app-nightly"}

Tests with a score below 70 (failures)

kymaros_score < 70

Overall pass rate across all tests (last 7 days)

sum(increase(kymaros_tests_total{result="pass"}[7d]))
/
sum(increase(kymaros_tests_total[7d]))
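To illustrate what increase() contributes to this ratio, the Python sketch below sums a counter's growth across raw samples and treats any drop in value as a reset. It is a simplification: Prometheus also extrapolates to the window boundaries, which is omitted here, and the sample values and function names are hypothetical.

```python
def counter_increase(samples):
    """Sum a counter's growth over time-ordered samples, handling resets."""
    increase = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The value dropped, so the counter restarted from zero.
            increase += current
    return increase


def pass_rate(series_by_result):
    """Mirror sum(increase(pass)) / sum(increase(all)) for one window."""
    total = sum(counter_increase(s) for s in series_by_result.values())
    if total == 0:
        return float("nan")
    return counter_increase(series_by_result.get("pass", [])) / total


# Hypothetical samples within one window; the drop marks a server restart.
WINDOW = {
    "pass": [30, 35, 2, 6],
    "fail": [3, 4, 0, 1],
}

if __name__ == "__main__":
    print(f"{pass_rate(WINDOW):.3f}")
```

Here the pass counter grows by 5 + 2 + 4 = 11 and the fail counter by 1 + 0 + 1 = 2, giving a pass rate of 11/13. Reading the raw counter values instead would give a misleading result after the restart.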

Failure rate per test (last 24 hours)

sum by (test) (increase(kymaros_tests_total{result="fail"}[24h]))
/
sum by (test) (increase(kymaros_tests_total[24h]))

Tests where measured RTO exceeds a threshold (e.g., 900 seconds / 15 minutes)

kymaros_rto_seconds > 900

95th percentile test duration across all tests

histogram_quantile(0.95, sum by (le) (rate(kymaros_test_duration_seconds_bucket[24h])))

95th percentile test duration per test

histogram_quantile(
  0.95,
  sum by (test, le) (rate(kymaros_test_duration_seconds_bucket[24h]))
)

Average backup age across all tests (hours)

avg(kymaros_backup_age_seconds) / 3600

Alert: test score dropped below 70

Suitable as a Prometheus alerting rule:

groups:
  - name: kymaros
    rules:
      - alert: KymarosTestFailed
        expr: kymaros_score < 70
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Restore test failed: {{ $labels.test }}"
          description: "Test {{ $labels.test }} scored {{ $value }}/100 (threshold: 70)"

      - alert: KymarosRTOExceeded
        expr: kymaros_rto_seconds > 900
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "RTO exceeded for test: {{ $labels.test }}"
          description: "Test {{ $labels.test }} measured {{ $value }}s RTO"
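As a worked example of these two thresholds, the Python sketch below applies them to the gauge values shown earlier on this page. It only mimics the comparison in each expr; the function and constant names are illustrative and are not part of Kymaros or Prometheus.

```python
# Gauge values copied from the metric examples earlier on this page.
SCORES = {"my-app-nightly": 96, "orders-db-validation": 42}
RTO_SECONDS = {"my-app-nightly": 695, "orders-db-validation": 1105}

SCORE_THRESHOLD = 70          # matches expr: kymaros_score < 70
RTO_THRESHOLD_SECONDS = 900   # matches expr: kymaros_rto_seconds > 900


def firing_alerts(scores, rto_seconds):
    """Return (alert_name, test, value) for every series that breaches a threshold."""
    alerts = []
    for test, score in sorted(scores.items()):
        if score < SCORE_THRESHOLD:
            alerts.append(("KymarosTestFailed", test, score))
    for test, rto in sorted(rto_seconds.items()):
        if rto > RTO_THRESHOLD_SECONDS:
            alerts.append(("KymarosRTOExceeded", test, rto))
    return alerts


if __name__ == "__main__":
    for name, test, value in firing_alerts(SCORES, RTO_SECONDS):
        print(f"{name}: {test} ({value})")
```

With the sample values, both alerts fire for orders-db-validation and neither fires for my-app-nightly. Because the rules use for: 0m, an alert fires on the first evaluation in which its expression matches.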

Scrape configuration

Prometheus scrape_configs

scrape_configs:
  - job_name: kymaros
    static_configs:
      - targets:
          - kymaros-api.kymaros-system.svc.cluster.local:8080
    metrics_path: /metrics
    scrape_interval: 30s

Prometheus Operator ServiceMonitor

If you use the Prometheus Operator (e.g., via kube-prometheus-stack), create a ServiceMonitor in the kymaros-system namespace:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kymaros
  namespace: kymaros-system
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kymaros-api
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
  namespaceSelector:
    matchNames:
      - kymaros-system

Adjust the release label to match your Prometheus Operator's serviceMonitorSelector.