Prometheus Metrics

The Kymaros API server exposes Prometheus metrics at /metrics on port 8080. All metrics are prefixed with kymaros_.
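To inspect the endpoint outside of Prometheus, a minimal sketch along the following lines may help. It uses only the Python standard library. The URL assumes plain HTTP and the in-cluster service address used in the scrape configuration on this page; both `fetch_metrics` and `kymaros_samples` are hypothetical helper names, not part of Kymaros.

```python
from urllib.request import urlopen

# Assumption: plain HTTP and the in-cluster service name; adjust for your deployment.
METRICS_URL = "http://kymaros-api.kymaros-system.svc.cluster.local:8080/metrics"


def fetch_metrics(url: str = METRICS_URL, timeout: float = 5.0) -> str:
    """Return the raw exposition text served at the metrics endpoint."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def kymaros_samples(text: str) -> list[str]:
    """Keep only sample lines whose metric name starts with kymaros_.

    Comment lines (# HELP, # TYPE) and metrics from other exporters are dropped.
    """
    return [line for line in text.splitlines() if line.startswith("kymaros_")]
```

Calling `kymaros_samples(fetch_metrics())` from a pod inside the cluster (or against a port-forwarded address) should list only the Kymaros series.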

Metrics reference

Metric	Type	Labels	Description
`kymaros_tests_total`	Counter	`test`, `result`	Total number of completed test runs, partitioned by test name and outcome.
`kymaros_score`	Gauge	`test`	Current confidence score (0–100) for the most recent run of each test.
`kymaros_rto_seconds`	Gauge	`test`	Measured restore duration in seconds for the most recent run of each test.
`kymaros_test_duration_seconds`	Histogram	`test`	Full test execution duration in seconds. Exponential buckets from 60s to 7680s.
`kymaros_backup_age_seconds`	Gauge	`test`	Age of the backup that was restored, in seconds, as of the most recent run.

Label values

Label	Description
`test`	Name of the `RestoreTest` resource (e.g., `my-app-nightly`).
`result`	Outcome of the run, set only on `kymaros_tests_total`. One of `pass`, `fail`, or `partial`.

Metric details

kymaros_tests_total

A monotonically increasing counter. Each completed test run increments the counter for the corresponding (test, result) pair.

kymaros_tests_total{test="my-app-nightly", result="pass"} 42
kymaros_tests_total{test="my-app-nightly", result="fail"} 3
kymaros_tests_total{test="orders-db-validation", result="pass"} 38
kymaros_tests_total{test="orders-db-validation", result="partial"} 2
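To illustrate how the (test, result) pairs combine, the sketch below parses the sample lines above and derives a pass ratio per test. The parser is deliberately simplified (it assumes label values contain no escaped quotes or braces), and the function names are hypothetical. The ratio covers the counter's whole lifetime, not a time window.

```python
import re
from collections import defaultdict

# Matches lines such as: kymaros_tests_total{test="a", result="pass"} 42
SAMPLE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

EXAMPLE_LINES = [
    'kymaros_tests_total{test="my-app-nightly", result="pass"} 42',
    'kymaros_tests_total{test="my-app-nightly", result="fail"} 3',
    'kymaros_tests_total{test="orders-db-validation", result="pass"} 38',
    'kymaros_tests_total{test="orders-db-validation", result="partial"} 2',
]


def parse_sample(line):
    """Split one labelled sample line into (name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labelled sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def pass_ratio_by_test(lines):
    """Return {test: passed / total} using the raw counter values."""
    passed = defaultdict(float)
    total = defaultdict(float)
    for line in lines:
        name, labels, value = parse_sample(line)
        if name != "kymaros_tests_total":
            continue
        total[labels["test"]] += value
        if labels["result"] == "pass":
            passed[labels["test"]] += value
    return {test: passed[test] / total[test] for test in total if total[test] > 0}


ratios = pass_ratio_by_test(EXAMPLE_LINES)
for test, ratio in sorted(ratios.items()):
    print(f"{test}: {ratio:.1%}")
```

For windowed ratios, prefer `increase()` in PromQL as shown in the PromQL examples, since raw counter values reset when the server restarts.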

kymaros_score

Updated after each run. Holds the score from the most recent completed run, not a running average.

kymaros_score{test="my-app-nightly"} 96
kymaros_score{test="orders-db-validation"} 42

kymaros_rto_seconds

Updated after each run. Holds the restore duration (RTO) measured during the most recent run, in seconds.

kymaros_rto_seconds{test="my-app-nightly"} 695
kymaros_rto_seconds{test="orders-db-validation"} 1105

kymaros_test_duration_seconds

Histogram tracking the full wall-clock time of each test execution (from restore start to final validation step). Bucket boundaries (seconds): 60, 120, 240, 480, 960, 1920, 3840, 7680.

kymaros_test_duration_seconds_bucket{test="my-app-nightly", le="480"} 38
kymaros_test_duration_seconds_bucket{test="my-app-nightly", le="960"} 42
kymaros_test_duration_seconds_bucket{test="my-app-nightly", le="+Inf"} 42
kymaros_test_duration_seconds_sum{test="my-app-nightly"} 28140
kymaros_test_duration_seconds_count{test="my-app-nightly"} 42
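The cumulative bucket counts can be turned into an estimated quantile by linear interpolation inside the bucket that contains the target rank, which is how PromQL's `histogram_quantile` behaves as far as I understand it. The sketch below is an approximation of that logic, not the Prometheus implementation. It uses only the three buckets shown in the sample above, and the function name and data layout are illustrative.

```python
import math

# Exponential boundaries: start 60, factor 2, 8 buckets (60 ... 7680).
BOUNDS = [60 * 2 ** i for i in range(8)]

# (upper bound, cumulative count) pairs taken from the sample output above.
# Only the buckets shown there are listed; lower buckets are omitted.
SAMPLE_BUCKETS = [(480.0, 38.0), (960.0, 42.0), (math.inf, 42.0)]


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if not math.isinf(buckets[-1][0]) or total == 0:
        return math.nan  # a +Inf bucket and at least one observation are required
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # Rank falls into the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


p95 = estimate_quantile(0.95, SAMPLE_BUCKETS)
mean = 28140 / 42  # _sum divided by _count
print(f"p95 ~ {p95:.0f}s, mean = {mean:.0f}s")
```

With the sample data, the 95th percentile rank (39.9 of 42) falls in the 480 to 960 second bucket, so the estimate is interpolated between those two bounds. The accuracy of any such estimate is limited by the bucket width.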

kymaros_backup_age_seconds

Age of the restored backup at the time of restore, in seconds. Useful for tracking backup recency.

kymaros_backup_age_seconds{test="my-app-nightly"} 14412
kymaros_backup_age_seconds{test="orders-db-validation"} 18008
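Raw second counts are hard to read at a glance. As a small illustration, the following sketch formats the sample values above as durations and computes the same average in hours as the corresponding PromQL example. The helper names are hypothetical.

```python
from datetime import timedelta

# Sample values copied from the output above.
BACKUP_AGE_SECONDS = {
    "my-app-nightly": 14412,
    "orders-db-validation": 18008,
}


def format_age(seconds):
    """Render a second count as H:MM:SS (timedelta adds days when needed)."""
    return str(timedelta(seconds=int(seconds)))


def average_age_hours(ages):
    """Mean backup age across all tests, in hours."""
    return sum(ages.values()) / len(ages) / 3600


for test, age in BACKUP_AGE_SECONDS.items():
    print(f"{test}: {format_age(age)}")
print(f"average: {average_age_hours(BACKUP_AGE_SECONDS):.2f} h")
```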

PromQL examples

Current score for a specific test

kymaros_score{test="my-app-nightly"}

Tests with a score below 70 (failures)

kymaros_score < 70

Overall pass rate across all tests (last 7 days)

sum(increase(kymaros_tests_total{result="pass"}[7d]))
/
sum(increase(kymaros_tests_total[7d]))

Failure rate per test (last 24 hours)

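Both sides are aggregated by test so that their label sets match. Dividing the raw series would pair each result="fail" series only with itself and always yield 1.

sum by (test) (increase(kymaros_tests_total{result="fail"}[24h]))
/
sum by (test) (increase(kymaros_tests_total[24h]))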

Tests where measured RTO exceeds a threshold (e.g., 900 seconds / 15 minutes)

kymaros_rto_seconds > 900

95th percentile test duration across all tests

histogram_quantile(0.95, sum by (le) (rate(kymaros_test_duration_seconds_bucket[24h])))

95th percentile test duration per test

histogram_quantile(
  0.95,
  sum by (test, le) (rate(kymaros_test_duration_seconds_bucket[24h]))
)

Average backup age across all tests (hours)

avg(kymaros_backup_age_seconds) / 3600

Alert: test score dropped below 70

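The following rule group can be loaded as Prometheus alerting rules. The first rule fires when a score drops below 70; the second fires when the measured RTO exceeds 900 seconds: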

groups:
  - name: kymaros
    rules:
      - alert: KymarosTestFailed
        expr: kymaros_score < 70
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Restore test failed: {{ $labels.test }}"
          description: "Test {{ $labels.test }} scored {{ $value }}/100 (threshold: 70)"

      - alert: KymarosRTOExceeded
        expr: kymaros_rto_seconds > 900
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "RTO exceeded for test: {{ $labels.test }}"
          description: "Test {{ $labels.test }} measured {{ $value }}s RTO"
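To reason about which alerts would fire for a given set of values, the two rule expressions can be mirrored in a few lines of Python. This is only an illustration of the threshold logic with `for: 0m` (fire on the first evaluation that matches). It is not how Prometheus evaluates rules internally, and the function name is hypothetical. The inputs are the sample gauge values shown earlier on this page.

```python
SCORE_THRESHOLD = 70          # mirrors: kymaros_score < 70
RTO_THRESHOLD_SECONDS = 900   # mirrors: kymaros_rto_seconds > 900

# Sample gauge values from the metric details section.
SCORES = {"my-app-nightly": 96, "orders-db-validation": 42}
RTO_SECONDS = {"my-app-nightly": 695, "orders-db-validation": 1105}


def firing_alerts(scores, rto_seconds):
    """Return (alert name, test, value) tuples for every breached threshold."""
    alerts = []
    for test, score in sorted(scores.items()):
        if score < SCORE_THRESHOLD:
            alerts.append(("KymarosTestFailed", test, score))
    for test, rto in sorted(rto_seconds.items()):
        if rto > RTO_THRESHOLD_SECONDS:
            alerts.append(("KymarosRTOExceeded", test, rto))
    return alerts


for name, test, value in firing_alerts(SCORES, RTO_SECONDS):
    print(f"{name}: {test} = {value}")
```

With the sample values, only `orders-db-validation` breaches a threshold, and it breaches both.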

Scrape configuration

Prometheus `scrape_configs`

scrape_configs:
  - job_name: kymaros
    static_configs:
      - targets:
          - kymaros-api.kymaros-system.svc.cluster.local:8080
    metrics_path: /metrics
    scrape_interval: 30s

Prometheus Operator `ServiceMonitor`

If you use the Prometheus Operator (e.g., via kube-prometheus-stack), create a ServiceMonitor in the kymaros-system namespace:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kymaros
  namespace: kymaros-system
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kymaros-api
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
  namespaceSelector:
    matchNames:
      - kymaros-system

Adjust the release label to match your Prometheus Operator's serviceMonitorSelector.
