
Prometheus & Grafana

Kymaros exposes five Prometheus metrics that cover test outcomes, timing, and backup age. This guide covers the metric definitions, useful PromQL queries, AlertManager rules for production alerting, and how to configure scraping and Grafana.


Metrics reference

All metrics are registered by the operator and exposed on its metrics endpoint (default: port 8080, path /metrics).
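If you do not run the Prometheus Operator (covered in the ServiceMonitor section), a plain static scrape configuration works too. This is a minimal sketch: the job name and target address are placeholders that assume an in-cluster Service named kymaros-operator in the kymaros-system namespace, reachable on port 8080.

```yaml
# Sketch of a static scrape config. The target address is an assumption;
# adjust the Service name, namespace, and port to your deployment.
scrape_configs:
  - job_name: kymaros-operator
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets:
          - kymaros-operator.kymaros-system.svc:8080
```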

kymaros_tests_total

Type: Counter
Labels: test, result

Total number of restore tests executed since the operator started. The result label takes one of three values: passed, failed, or error (infrastructure error, not a health check failure).

kymaros_tests_total{test="webapp-nightly", result="passed"} 42
kymaros_tests_total{test="webapp-nightly", result="failed"} 3

kymaros_score

Type: Gauge
Labels: test

The score from the most recent completed test run, on a scale of 0 to 100. A score of 100 means all checks passed. A score of 0 means the test failed entirely or could not run.

kymaros_score{test="webapp-nightly"} 87

kymaros_rto_seconds

Type: Gauge
Labels: test

The measured Recovery Time Objective in seconds for the most recent test run. This is the elapsed time from restore initiation to the moment all health checks passed.

kymaros_rto_seconds{test="webapp-nightly"} 342

kymaros_test_duration_seconds

Type: Histogram
Labels: test
Buckets: exponential, starting at 60 seconds and doubling up to 7680 seconds (60, 120, 240, 480, 960, 1920, 3840, 7680)

The total duration of each test run from start to finish, including backup restore time, pod scheduling, and all health check execution. Stored as a histogram to allow percentile queries across test runs.

kymaros_test_duration_seconds_bucket{test="webapp-nightly", le="240"} 38
kymaros_test_duration_seconds_bucket{test="webapp-nightly", le="480"} 44
kymaros_test_duration_seconds_sum{test="webapp-nightly"} 9840
kymaros_test_duration_seconds_count{test="webapp-nightly"} 44
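The bucket boundaries are a simple doubling series, and the _sum/_count pair gives the mean duration directly. A quick sketch using the sample values shown above:

```python
# Exponential buckets: start at 60 s and double 8 times.
buckets = [60 * 2 ** i for i in range(8)]
print(buckets)  # [60, 120, 240, 480, 960, 1920, 3840, 7680]

# Mean test duration from the sample _sum and _count values above:
# 9840 s over 44 runs.
mean_seconds = 9840 / 44
print(round(mean_seconds, 1))  # 223.6
```

The mean is useful for a quick sanity check, but percentile queries against the buckets (shown in the PromQL section) are more robust against outliers.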

kymaros_backup_age_seconds

Type: Gauge
Labels: test

The age of the backup that was most recently tested, measured in seconds from the backup creation timestamp to the test run start. This metric is useful for detecting stale backups that have not been updated.

kymaros_backup_age_seconds{test="webapp-nightly"} 82800

A value of 82800 seconds indicates the backup was approximately 23 hours old when tested — consistent with a nightly backup schedule.


PromQL queries

Pass rate over the last 30 days

sum(increase(kymaros_tests_total{result="passed"}[30d]))
/
sum(increase(kymaros_tests_total[30d]))

Current score for all tests

kymaros_score

Tests with score below threshold (e.g., below 80)

kymaros_score < 80

RTO trend over the last 7 days (per test)

kymaros_rto_seconds

Plot this over a time range to visualize trends. A rising trend may indicate data volume growth or infrastructure degradation.
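Because the gauge only holds the latest measurement, a smoothed view can make slow drift easier to spot. One option, using standard PromQL (not specific to Kymaros):

```promql
avg_over_time(kymaros_rto_seconds[1d])
```

Graphed over 7 days, this averages out run-to-run noise while preserving the trend.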

95th percentile test duration over the last 30 days

histogram_quantile(0.95, sum(rate(kymaros_test_duration_seconds_bucket[30d])) by (le, test))

Tests that have not run in 25 hours

changes(kymaros_tests_total[25h]) == 0

Returns series for any test where the counter has not incremented in 25 hours. This is the same expression used in the RestoreTestNotRun alert.

Backup age at last test exceeding 26 hours

kymaros_backup_age_seconds > 93600

This metric records the age of the backup that was used in the most recent test run, not how long ago the test ran. A value above 93600 seconds (26 hours) indicates the backup itself was stale when tested — the backup pipeline may have missed a run. A 26-hour threshold gives a 1-hour buffer above a 25-hour max-drift daily schedule.

Failure rate by test

rate(kymaros_tests_total{result="failed"}[7d])
/
rate(kymaros_tests_total[7d])

AlertManager rules

The following three alerting rules cover the most important failure conditions for production use. Add them to your Prometheus rule files or, if you use the Prometheus Operator, to a PrometheusRule resource; AlertManager then handles routing and notification.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kymaros-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: kymaros.rules
      interval: 1m
      rules:
        # Alert when a restore test fails.
        - alert: RestoreTestFailed
          expr: |
            increase(kymaros_tests_total{result="failed"}[1h]) > 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Restore test failed: {{ $labels.test }}"
            description: >
              RestoreTest {{ $labels.test }} failed in the last hour.
              Check the RestoreReport resource and operator logs for the
              failing health check details.
            runbook_url: "https://docs.kymaros.io/docs/guides/health-checks"

        # Alert when the measured RTO exceeds the SLA threshold.
        # Adjust 900 (15 minutes) to match your RTO SLA.
        - alert: RTOExceedsSLA
          expr: |
            kymaros_rto_seconds > 900
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "RTO exceeds SLA for {{ $labels.test }}"
            description: >
              RestoreTest {{ $labels.test }} measured an RTO of
              {{ $value | humanizeDuration }}, which exceeds the configured
              SLA of 15 minutes. Review infrastructure performance and
              backup size growth.
            runbook_url: "https://docs.kymaros.io/docs/guides/stateful-apps"

        # Alert when a test has not run in over 25 hours.
        # Catches schedule failures, operator crashes, or backup tool outages.
        - alert: RestoreTestNotRun
          expr: |
            changes(kymaros_tests_total[25h]) == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Restore test not run recently: {{ $labels.test }}"
            description: >
              No restore test has completed for {{ $labels.test }} in the
              last 25 hours. Verify the operator is running and the backup
              schedule is producing new backups.
            runbook_url: "https://docs.kymaros.io/docs/operations/troubleshooting"

Alert severity guide

  • RestoreTestFailed (critical): A failed test means your current backup cannot be verified as restorable. This is a data recovery risk.
  • RTOExceedsSLA (warning): Recovery is still possible but slower than committed. Investigate before the next test.
  • RestoreTestNotRun (warning): The test machinery itself may be broken. Not an immediate risk but masks potential failures.
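The severity label is what AlertManager routes on. A sketch of a matching route block, assuming receivers named pagerduty and slack already exist in your AlertManager configuration:

```yaml
# Sketch only -- the receiver names are placeholders.
route:
  receiver: slack
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty
    - matchers:
        - severity="warning"
      receiver: slack
```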

ServiceMonitor

If you use the Prometheus Operator, create a ServiceMonitor to scrape the Kymaros operator metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kymaros-operator
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
spec:
  namespaceSelector:
    matchNames:
      - kymaros-system
  selector:
    matchLabels:
      app: kymaros-operator
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http

This assumes the Kymaros operator Service has a port named metrics that points to container port 8080. Verify the port name with:

kubectl get svc -n kymaros-system -l app=kymaros-operator -o jsonpath='{.items[0].spec.ports}'
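After Prometheus picks up the ServiceMonitor (allow at least one scrape interval), you can confirm the target is healthy. The namespace label below assumes the labels the Prometheus Operator typically attaches to discovered targets; check the Targets page in the Prometheus UI if the series is absent:

```promql
up{namespace="kymaros-system"} == 1
```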

Grafana dashboard

Kymaros does not ship a pre-built Grafana dashboard at this time. The following panels cover the most useful views:

  • Score over time, per test (time series): kymaros_score
  • Pass/fail ratio (pie chart): sum by (result)(increase(kymaros_tests_total[24h]))
  • RTO trend (time series): kymaros_rto_seconds
  • Backup age (gauge, in hours): kymaros_backup_age_seconds / 3600
  • P95 test duration (time series): histogram_quantile(0.95, sum(rate(kymaros_test_duration_seconds_bucket[7d])) by (le, test))
  • Tests with score below 80 (table): kymaros_score < 80

Importing a dashboard

If your team has created a Kymaros dashboard JSON, import it via:

  1. Open Grafana and navigate to Dashboards > Import.
  2. Paste the JSON or upload the file.
  3. Select your Prometheus datasource when prompted.
  4. Click Import.

The dashboard will appear under the folder you selected during import. Grafana Dashboard ID lookup is not available for Kymaros at this time — use a local JSON file.

Variables

For dashboards that cover multiple tests, add a Grafana variable:

  • Variable name: test
  • Type: Query
  • Query: label_values(kymaros_score, test)
  • Refresh: On dashboard load

Then replace the literal test name in queries with $test:

kymaros_score{test="$test"}
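If you enable the Multi-value or Include All option on the variable, the equality matcher above no longer works, because Grafana expands the selection into a regex alternation. Use a regex matcher instead:

```promql
kymaros_score{test=~"$test"}
```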