How It Works
Kymaros runs as a Kubernetes Operator in the kymaros-system namespace. A single custom resource, RestoreTest, drives the entire lifecycle. The controller watches that resource and executes a scheduled validation loop on each reconcile.
The Four Stages
Each validation cycle progresses through four sequential stages.
Stage 1 — Define
You declare intent in a RestoreTest resource. The spec carries:
- Which backup provider to target (
velero,kasten, ortrilio) - A cron schedule for when validations run
- Which namespaces to include
- Network isolation mode and resource quotas for the sandbox
- Health check references and RTO targets
The controller parses the cron expression, computes the next scheduled run time, and requeues itself with the appropriate delay. The first run is triggered immediately when a new RestoreTest is created.
Stage 2 — Sandbox
Before any restore data touches the cluster, the controller provisions isolated namespaces through the Sandbox Manager. Each namespace is named with a deterministic prefix, the test name, and a six-character random suffix:
<prefix>-<testName>-<6char-random>
A set of labels is applied to every managed namespace so resources can be discovered and cleaned up safely:
| Label | Purpose |
|---|---|
kymaros.io/managed-by | Marks the namespace as owned by Kymaros |
kymaros.io/test | Links the namespace to its RestoreTest |
kymaros.io/test-namespace | Records the source namespace being restored |
kymaros.io/group | Enables group-scoped network policies |
The sandbox lifecycle ends with namespace deletion, which cascades to all contained resources. See Sandbox Isolation for the full security model.
Stage 3 — Validate
With sandboxes in place, the controller triggers restores through the backup adapter for the configured provider. The Velero adapter creates a Restore CR with namespace mappings so backup data lands inside the sandbox rather than overwriting production namespaces.
Once restore objects are created, the controller polls their status every 30 seconds. When the restore reaches a terminal state, validation begins:
- Wait for pod readiness — up to 2 minutes by default.
- Completeness check — counts Deployments, Services, Secrets, ConfigMaps, and PVCs found in the sandbox versus what was expected.
- Health check runner — executes any
HealthCheckPolicyprobes attached to the test. - Scoring — computes the six-level score and determines pass/partial/fail.
Stage 4 — Report
The controller creates a RestoreReport resource capturing the full result: score breakdown, phase durations, resource counts, and health check outcomes. It also checks for regressions by comparing the new score against the previous report. Sandbox namespaces are deleted before the status is written back to the RestoreTest, ensuring cleanup happens even if the status write fails.
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ RestoreTest Controller │
│ │
│ ┌─────────┐ ┌──────────────┐ ┌───────────────────┐ │
│ │ Parse │───>│ Phase State │───>│ scoreAndReport() │ │
│ │ Cron │ │ Machine │ │ │ │
│ └─────────┘ └──────┬───────┘ └────────┬──────────┘ │
│ │ │ │
└────────────────────────┼─────────────────────┼─────────────┘
│ │
┌──────────────┼──────┐ │
│ │ │ │
▼ ▼ ▼ ▼
┌───────────┐ ┌────────────┐ ┌───────────────────┐
│ Backup │ │ Sandbox │ │ Health Check │
│ Adapter │ │ Manager │ │ Runner │
│ (Velero) │ │ │ │ │
└───────────┘ └────────────┘ └────────┬──────────┘
│
▼
┌─────────────────┐
│ RestoreReport │
│ + Score │
└─────────────────┘
Phase State Machine
The controller reconciles a RestoreTest through three phases. The status.phase field reflects the current state.
┌─────────────────────────────────────┐
│ │
▼ │
┌────────┐ schedule fires ┌─────────┐ │
│ Idle │──────────────────>│ Running │ │
└────────┘ └────┬─────┘ │
▲ │ │
│ ┌────────┴──────┐ │
│ │ │ │
│ ┌──────▼──────┐ ┌─────▼──────┐
│ │ Completed │ │ Failed │
└─────────────┴─────────────┴──┴────────────┘
(next schedule computed, requeue)
PhaseIdle
The controller is waiting for the next scheduled run. It parses the cron expression from spec.schedule, computes the next trigger time, and requeues itself with a delay equal to the remaining wait. The first reconcile after resource creation skips the wait and transitions immediately to Running.
PhaseRunning
Active validation. The controller:
- Creates sandbox namespaces via the Sandbox Manager.
- Calls the backup adapter to list available backups and trigger restores.
- Polls restore status every 30 seconds via
ctrl.Result{RequeueAfter: 30s}. - Once restores reach a terminal phase, calls
scoreAndReport.
PhaseCompleted / PhaseFailed
Terminal for the current cycle. scoreAndReport cleans up sandboxes, creates the RestoreReport, detects regressions, and writes the result to status. The controller then computes the next schedule and requeues into PhaseIdle.
A finalizer (kymaros.io/sandbox-cleanup) is registered on the RestoreTest resource to ensure sandbox namespaces are deleted even if the resource itself is deleted mid-run.