RestoreTest

API group: restore.kymaros.io/v1alpha1
Kind: RestoreTest
Short name: rt
Scope: Namespaced (typically kymaros-system)

A RestoreTest defines a scheduled restore validation job. On each scheduled run, the controller restores the specified backup into an isolated sandbox namespace, runs health checks, measures the recovery time (RTO), and produces a RestoreReport.


Spec

Top-level fields

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| backupSource | BackupSource | Yes | | Identifies the backup to restore and from which provider. |
| schedule | ScheduleConfig | Yes | | Controls when restore tests run. |
| sandbox | SandboxConfig | Yes | | Configures the isolated sandbox namespace used during testing. |
| healthChecks | HealthCheckRef | No | | References a HealthCheckPolicy to run after restore. |
| sla | SLAConfig | No | | Defines the RTO target and alert behavior. |
| notifications | NotificationConfig | No | | Configures where pass/fail notifications are sent. |
| timeout | Duration | No | | Global timeout for the entire test run. |
| historyLimit | int32 | No | 10 | Number of RestoreReport objects to retain. Minimum: 1. |
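
As an illustration of the Required column, the following sketch sets only the three required top-level fields. The resource name minimal-example and the source namespace my-app are placeholders. Whether the controller accepts a sandbox block that relies entirely on defaults is an assumption, so one sandbox field is set explicitly here.

```yaml
apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: minimal-example        # hypothetical name
  namespace: kymaros-system
spec:
  backupSource:
    provider: velero
    backupName: latest
    namespaces:
      - name: my-app           # placeholder source namespace
  schedule:
    cron: "0 3 * * *"          # timezone omitted, so it defaults to "UTC"
  sandbox:
    networkIsolation: strict   # set explicitly, although "strict" is already the default
  # historyLimit omitted: defaults to 10
```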

BackupSource

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| provider | string | Yes | | Backup provider. Accepted values: velero, kasten, trilio. |
| backupName | string | Yes | | Name of the backup to restore. Use "latest" to always select the most recent backup. |
| namespaces | []NamespaceMapping | Yes | | Source namespaces to restore. At least one entry required. |
| labelSelector | map[string]string | No | | Kubernetes label selector applied when identifying the backup. |

NamespaceMapping

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | Yes | | Source namespace name in the backup. |
| sandboxName | string | No | | Override the generated sandbox namespace name for this source namespace. When omitted, the name is derived from sandbox.namespacePrefix and the source namespace name. |
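
A short sketch of the two forms a namespaces entry can take. The namespace names billing and billing-db are placeholders. The exact format of the derived name is not specified in this reference, so the comments do not state one.

```yaml
backupSource:
  provider: velero
  backupName: latest
  namespaces:
    # No sandboxName: the sandbox name is derived from
    # sandbox.namespacePrefix and the source namespace name.
    - name: billing
    # Explicit override: the sandbox namespace uses this name instead.
    - name: billing-db
      sandboxName: rp-billing-db
```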

ScheduleConfig

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| cron | string | Yes | | Standard five-field cron expression (e.g., "0 3 * * *" for 03:00 daily). |
| timezone | string | No | "UTC" | IANA timezone name for cron evaluation (e.g., "Europe/Paris"). |
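
For illustration, a schedule that overrides the UTC default. The expression uses the standard five-field order (minute, hour, day of month, month, day of week); the specific time is an arbitrary example.

```yaml
schedule:
  cron: "30 6 * * 1-5"       # 06:30, Monday through Friday
  timezone: "Europe/Paris"   # evaluated in Paris local time rather than UTC
```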

SandboxConfig

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| namespacePrefix | string | No | "rp-test" | Prefix for generated sandbox namespace names. |
| ttl | Duration | No | "30m" | How long the sandbox namespace lives after the test completes before automatic deletion. |
| resourceQuota | ResourceQuotaConfig | No | | Resource limits applied to the sandbox namespace. |
| networkIsolation | string | No | "strict" | Network policy mode. strict: sandbox has no external egress. group: sandbox pods can reach other sandboxes in the same group but not production. |

ResourceQuotaConfig

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| cpu | string | No | | CPU limit for the sandbox namespace (e.g., "4"). |
| memory | string | No | | Memory limit for the sandbox namespace (e.g., "8Gi"). |
| storage | string | No | | Total storage request limit for the sandbox namespace (e.g., "50Gi"). |

HealthCheckRef

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| policyRef | string | No | | Name of a HealthCheckPolicy resource in the same namespace. |
| timeout | Duration | No | | Maximum time to wait for all health checks defined in the referenced policy to pass. |

SLAConfig

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| maxRTO | Duration | Yes (if sla set) | | Maximum acceptable restore time. The measured duration is compared against this value and recorded in the report as rto.withinSLA. |
| alertOnExceed | bool | No | | When true, a notification is sent through the configured channels when measured RTO exceeds maxRTO. |

NotificationConfig

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| onFailure | []NotificationChannel | No | | Channels to notify when the test result is fail. |
| onSuccess | []NotificationChannel | No | | Channels to notify when the test result is pass. |

NotificationChannel

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| type | string | Yes | | Notification backend. Accepted values: slack, pagerduty, webhook. |
| channel | string | No | | Slack channel name or ID (e.g., "#alerts"). Only applicable when type is slack. |
| webhookSecretRef | string | No | | Name of a Secret in the same namespace that contains the webhook URL or API token. The secret must have a key named url for webhook type or token for pagerduty. |
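
The Secrets referenced by webhookSecretRef might look like the sketch below. The Secret names match those used in the examples on this page; the URL and token values are placeholders, and the use of stringData is a convenience choice rather than a requirement stated here. The key expected for the slack type is not given in this reference, so no Slack Secret is shown.

```yaml
# Secret for a channel with type: webhook (key must be "url")
apiVersion: v1
kind: Secret
metadata:
  name: incident-webhook
  namespace: kymaros-system      # same namespace as the RestoreTest
type: Opaque
stringData:
  url: "https://hooks.example.com/restore-tests"   # placeholder URL
---
# Secret for a channel with type: pagerduty (key must be "token")
apiVersion: v1
kind: Secret
metadata:
  name: pagerduty-kymaros-token
  namespace: kymaros-system
type: Opaque
stringData:
  token: "REPLACE_WITH_TOKEN"    # placeholder value
```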

Status

| Field | Type | Description |
| --- | --- | --- |
| phase | string | Current lifecycle phase: Idle, Running, Completed, or Failed. |
| lastRunAt | Time | Timestamp of the most recent completed test run. |
| lastScore | int | Confidence score (0–100) from the most recent run. |
| lastResult | string | Outcome of the most recent run: pass, fail, or partial. |
| lastReportRef | string | Name of the RestoreReport object created by the most recent run. |
| nextRunAt | Time | Scheduled time for the next test run, derived from schedule.cron. |
| sandboxNamespace | string | Name of the active sandbox namespace (populated during Running phase). |
| restoreID | string | Provider-specific restore operation identifier (populated during Running phase). |
| conditions | []Condition | Standard Kubernetes condition array reflecting reconciliation state. |

Print columns exposed by kubectl get rt: Phase, Score, Result, Last Run, Age.


Examples

Simple stateless application

Runs nightly at 02:00 UTC, restores the latest Velero backup of the my-app namespace, applies a 30-minute RTO target, and notifies Slack on failure.

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: my-app-nightly
  namespace: kymaros-system
spec:
  backupSource:
    provider: velero
    backupName: latest
    namespaces:
      - name: my-app
  schedule:
    cron: "0 2 * * *"
    timezone: UTC
  sandbox:
    namespacePrefix: rp-test
    ttl: 30m
    networkIsolation: strict
  sla:
    maxRTO: "30m"
    alertOnExceed: true
  notifications:
    onFailure:
      - type: slack
        channel: "#ops-alerts"
        webhookSecretRef: slack-kymaros-webhook
  historyLimit: 10

Stateful application with database health checks

Runs daily at 03:00 Europe/Paris time, restores the orders namespace from the orders-daily Velero backup, runs a custom HealthCheckPolicy that verifies the database pod and HTTP readiness endpoint, enforces a 15-minute RTO, and sends both success and failure notifications.

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: orders-db-validation
  namespace: kymaros-system
spec:
  backupSource:
    provider: velero
    backupName: orders-daily
    namespaces:
      - name: orders
  schedule:
    cron: "0 3 * * *"
    timezone: "Europe/Paris"
  sandbox:
    namespacePrefix: rp-test
    ttl: 45m
    resourceQuota:
      cpu: "4"
      memory: 8Gi
      storage: 50Gi
    networkIsolation: strict
  healthChecks:
    policyRef: orders-health-policy
    timeout: 10m
  sla:
    maxRTO: "15m"
    alertOnExceed: true
  notifications:
    onFailure:
      - type: pagerduty
        webhookSecretRef: pagerduty-kymaros-token
      - type: slack
        channel: "#platform-alerts"
        webhookSecretRef: slack-kymaros-webhook
    onSuccess:
      - type: slack
        channel: "#platform-ops"
        webhookSecretRef: slack-kymaros-webhook
  timeout: 1h
  historyLimit: 30

Multi-namespace application

Restores three namespaces that together form a single application (frontend, backend API, and shared infrastructure). Each namespace maps to a distinct sandbox name to avoid collisions.

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: platform-full-stack
  namespace: kymaros-system
spec:
  backupSource:
    provider: velero
    backupName: latest
    namespaces:
      - name: platform-frontend
        sandboxName: rp-frontend
      - name: platform-api
        sandboxName: rp-api
      - name: platform-infra
        sandboxName: rp-infra
    labelSelector:
      app.kubernetes.io/part-of: platform
  schedule:
    cron: "0 4 * * 0"
    timezone: "America/New_York"
  sandbox:
    namespacePrefix: rp-test
    ttl: 1h
    resourceQuota:
      cpu: "8"
      memory: 16Gi
      storage: 100Gi
    networkIsolation: group
  healthChecks:
    policyRef: platform-health-policy
    timeout: 15m
  sla:
    maxRTO: "45m"
    alertOnExceed: true
  notifications:
    onFailure:
      - type: webhook
        webhookSecretRef: incident-webhook
  timeout: 2h
  historyLimit: 20

kubectl quick reference

# List all RestoreTest resources
kubectl get rt -n kymaros-system

# Describe a specific test
kubectl describe rt my-app-nightly -n kymaros-system

# Watch status in real time
kubectl get rt -n kymaros-system -w

# Edit a test
kubectl edit rt my-app-nightly -n kymaros-system