Introduction

Kymaros is an open-source Kubernetes Operator that continuously validates backup restores. It restores your backups into isolated sandbox namespaces, runs configurable health checks, measures restore duration against your SLA, and produces a scored report — so you know your backups actually work before a disaster forces you to find out.

License: Apache 2.0
API group: restore.kymaros.io/v1alpha1
Project: github.com/kymorahq/kymora


The Problem

Backup tools report job status, not restorability. A backup marked Completed tells you the data was written somewhere. It does not tell you whether that data restores into a working application, whether pods come up healthy, whether dependencies resolve, or how long the restore actually takes.

Teams discover this gap at the worst possible moment: during an incident, under pressure, with an untested restore procedure and no confident RTO estimate.

Kymaros closes that gap by treating restore validation as a continuous, automated process — not a manual exercise performed once at audit time.


How It Works

Kymaros operates in four stages:

1. Define

You create a RestoreTest resource that declares which backup to test, how often, and what conditions a successful restore must satisfy:

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: my-app-nightly
  namespace: kymaros-system
spec:
  backupSource:
    provider: velero
    backupName: latest
    namespaces:
      - name: my-app
  schedule:
    cron: "0 3 * * *"
  sandbox:
    namespacePrefix: rp-test
    ttl: 30m
    networkIsolation: strict
  sla:
    maxRTO: "15m"

2. Sandbox

At each scheduled run, the controller creates an isolated namespace (prefixed by sandbox.namespacePrefix). The backup is restored into that namespace using Velero. Network isolation prevents the sandbox from interfering with production workloads. The sandbox is automatically deleted after the TTL expires.
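The TTL behavior can be illustrated with a small sketch. It is not the controller's code: the helper names `parse_duration` and `sandbox_expired` are hypothetical, and it assumes that `ttl` values use simple `s`/`m`/`h` suffixes such as `30m`, and that the TTL is counted from sandbox creation.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical helpers for illustration; Kymaros' own duration handling is not shown here.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text: str) -> timedelta:
    """Parse a simple duration such as '30m', '15m' or '1h30m'."""
    parts = re.findall(r"(\d+)([smh])", text)
    # Reject empty input and input with characters the pattern did not consume.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    total = timedelta()
    for number, unit in parts:
        total += timedelta(**{_UNITS[unit]: int(number)})
    return total


def sandbox_expired(created_at: datetime, ttl: str, now: datetime) -> bool:
    """Return True once the sandbox has outlived its TTL (assumed to start at creation)."""
    return now >= created_at + parse_duration(ttl)


created = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
print(sandbox_expired(created, "30m", created + timedelta(minutes=29)))  # False
print(sandbox_expired(created, "30m", created + timedelta(minutes=30)))  # True
```

With the `ttl: 30m` value from the example manifest, a sandbox created at 03:00 would become eligible for deletion at 03:30 under these assumptions.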

3. Validate

Once the restore completes, Kymaros runs a validation sequence:

  • Restore integrity — confirms the restore operation itself succeeded
  • Completeness — checks that the expected resource counts (Deployments, Services, PVCs, etc.) match the source namespace
  • Pod startup — waits for all pods to reach Running and Ready state
  • Health checks — executes HTTP probes, exec commands, or custom checks defined in a HealthCheckPolicy
  • Cross-namespace dependencies — verifies that services expected outside the sandbox are reachable
  • RTO compliance — records total restore duration and compares it against sla.maxRTO
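Two of the checks above, completeness and RTO compliance, reduce to simple comparisons. The sketch below illustrates the idea only: the function names are hypothetical, resource counts are assumed to arrive as plain dictionaries keyed by kind, and a restore that takes exactly `sla.maxRTO` is assumed to count as compliant.

```python
from datetime import timedelta


def completeness_gaps(source: dict[str, int], restored: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {kind: (expected, actual)} for every kind whose count differs from the source."""
    return {
        kind: (expected, restored.get(kind, 0))
        for kind, expected in source.items()
        if restored.get(kind, 0) != expected
    }


def rto_compliant(restore_duration: timedelta, max_rto: timedelta) -> bool:
    """Assumption: a restore that takes exactly max_rto still counts as compliant."""
    return restore_duration <= max_rto


# Illustrative counts, not taken from a real cluster.
source = {"Deployment": 3, "Service": 4, "PersistentVolumeClaim": 2}
restored = {"Deployment": 3, "Service": 4, "PersistentVolumeClaim": 1}

print(completeness_gaps(source, restored))  # {'PersistentVolumeClaim': (2, 1)}
print(rto_compliant(timedelta(minutes=12), timedelta(minutes=15)))  # True
```

An empty result from `completeness_gaps` would mean every expected kind was restored in the expected quantity.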

4. Report

The controller writes a RestoreReport resource with a confidence score (0–100) and the full breakdown of each validation step. A score of 90 or above means the restore passed, a score from 70 to 89 indicates a partial pass with degraded confidence, and a score below 70 is a failure.
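The score bands map onto a small classification function. This is a sketch of the thresholds stated above, assuming integer scores; the function name and the labels `"pass"`, `"partial"`, and `"fail"` are hypothetical and may not match the values a RestoreReport actually uses.

```python
def classify_score(score: int) -> str:
    """Map a 0-100 confidence score to a band, using the thresholds described above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 90:
        return "pass"      # 90-100: the restore passed
    if score >= 70:
        return "partial"   # 70-89: partial pass, degraded confidence
    return "fail"          # 0-69: failure


for value in (95, 90, 89, 70, 69):
    print(value, classify_score(value))
```

Such a function could, for example, be used to decide when a scraped score should trigger an alert.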

Reports can be read with kubectl, viewed in the built-in dashboard, or scraped via the Prometheus metrics endpoint.


Resource                 Path
Installation             Getting Started: Installation
First test in 5 minutes  Getting Started: Quick Start
Reading reports          Getting Started: Your First Report
Dashboard access         Getting Started: Dashboard Access
CRD reference            RestoreTest
GitHub                   kymorahq/kymora