
Validation Levels

Kymaros applies six validation levels to each restore. The levels are ordered by dependency: later levels only produce meaningful results if earlier ones pass. Each level contributes a fixed maximum number of points to the overall score, which sums to a maximum of 100.

Level  Name                     Max Points  Status
1      Restore Integrity        25          Implemented
2      Completeness             20          Implemented
3      Pod Startup              20          Implemented
4      Health Checks            20          Implemented
5      Cross-NS Dependencies    10          Planned (scores 0)
6      RTO Compliance           5           Implemented
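Taken together, the six contributions reduce to a single score out of 100. The following sketch shows how the weights combine; the struct and function names are illustrative, not the controller's actual types:

```go
package main

import "fmt"

// LevelResults holds hypothetical per-level inputs to the scorer.
// Field names are illustrative, not the controller's real API.
type LevelResults struct {
	RestoreSucceeded      bool    // Level 1
	CompletenessRatio     float64 // Level 2
	PodsReadyRatio        float64 // Level 3
	HealthChecksPassRatio float64 // Level 4
	DepsCoverageRatio     float64 // Level 5 (currently always 0)
	RTOWithinSLA          bool    // Level 6
}

// OverallScore applies the per-level weights from the table above.
func OverallScore(r LevelResults) int {
	score := 0
	if r.RestoreSucceeded {
		score += 25
	}
	score += int(r.CompletenessRatio * 20)
	score += int(r.PodsReadyRatio * 20)
	score += int(r.HealthChecksPassRatio * 20)
	score += int(r.DepsCoverageRatio * 10)
	if r.RTOWithinSLA {
		score += 5
	}
	return score
}

func main() {
	// A fully passing restore today tops out at 90 because Level 5
	// is unimplemented and always contributes 0.
	perfect := LevelResults{true, 1.0, 1.0, 1.0, 0, true}
	fmt.Println(OverallScore(perfect)) // 90
}
```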

Level 1 — Restore Integrity

Maximum points: 25

What it checks

Whether the backup provider reported the restore as successful. This is a binary gate: if the restore did not succeed, the controller still runs the remaining levels but they will generally score 0 because no resources landed in the sandbox.

How it is scored

if RestoreSucceeded {
    score += 25
} else {
    score += 0
}

The RestoreSucceeded flag is set by the backup adapter from the terminal phase of the restore object. For Velero, a Restore in phase Completed sets it to true. A Restore in phase PartiallyFailed is treated as a partial success; the adapter decides whether to set RestoreSucceeded in that case (see Backup Adapters for details).
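An adapter's mapping from terminal phase to flag might look like the sketch below. The partialOK parameter is a stand-in for the adapter's own PartiallyFailed policy, which the source leaves to the adapter:

```go
package main

import "fmt"

// restoreSucceeded maps a Velero Restore terminal phase to the
// RestoreSucceeded flag. Whether PartiallyFailed counts as success is an
// adapter-specific decision, modeled here as the partialOK parameter.
func restoreSucceeded(phase string, partialOK bool) bool {
	switch phase {
	case "Completed":
		return true
	case "PartiallyFailed":
		return partialOK
	default: // Failed and any other terminal phase
		return false
	}
}

func main() {
	fmt.Println(restoreSucceeded("Completed", false))       // true
	fmt.Println(restoreSucceeded("PartiallyFailed", true))  // true
	fmt.Println(restoreSucceeded("Failed", true))           // false
}
```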

Pass example

A Velero Restore transitions to Completed. The adapter reports RestoreSucceeded = true. Level 1 contributes 25 points.

Fail example

A Velero Restore transitions to Failed (network error, missing S3 credentials, corrupted backup). The adapter reports RestoreSucceeded = false. Level 1 contributes 0 points.

How to improve

  • Check your backup provider's logs for restore failures.
  • Validate that the backup exists and is in Completed or PartiallyFailed phase before Kymaros selects it.
  • Ensure the sandbox has permissions to pull images and access any secrets required by the workload.

Level 2 — Completeness

Maximum points: 20

What it checks

Whether the expected Kubernetes resources are present in the sandbox after the restore. The controller counts five resource types:

  • Deployments
  • Services
  • Secrets
  • ConfigMaps
  • PersistentVolumeClaims (PVCs)

It computes the ratio of found resources to expected resources: CompletenessRatio = found / expected.

How it is scored

score += int(CompletenessRatio * 20)

int() truncates (floors) the result. A ratio of 1.0 yields 20 points; 0.5 yields 10 points; 0.0 yields 0 points.
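The truncation matters at the margins, as a minimal sketch shows (the function name is illustrative):

```go
package main

import "fmt"

// completenessScore computes int(CompletenessRatio * 20). The float-to-int
// conversion in Go drops the fractional part, so partial credit always
// rounds down.
func completenessScore(found, expected int) int {
	if expected == 0 {
		return 0
	}
	ratio := float64(found) / float64(expected)
	return int(ratio * 20)
}

func main() {
	fmt.Println(completenessScore(16, 16)) // 20
	fmt.Println(completenessScore(10, 16)) // 12, not 13: int(12.5) truncates
	fmt.Println(completenessScore(0, 16))  // 0
}
```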

Pass example

Backup contained 3 Deployments, 4 Services, 2 Secrets, 6 ConfigMaps, and 1 PVC (16 resources total). All 16 are found in the sandbox. CompletenessRatio = 16/16 = 1.0. Level 2 contributes 20 points.

Fail example

Same backup, but 4 Secrets and 2 ConfigMaps are missing after restore (10 found, 16 expected). CompletenessRatio = 10/16 ≈ 0.625. Level 2 contributes int(0.625 * 20) = int(12.5) = 12 points.

How to improve

  • Inspect the RestoreReport for which resource types are underrepresented.
  • Check whether your backup excludes certain namespaces or resource types by label selector.
  • For Velero, ensure includeClusterResources is set if cluster-scoped resources are required.

Level 3 — Pod Startup

Maximum points: 20

What it checks

Whether the pods associated with restored Deployments reach the Ready state within the wait window (2 minutes by default). The controller waits for pod readiness before scoring this level.

The ratio is: PodsReadyRatio = readyPods / totalPods.

How it is scored

score += int(PodsReadyRatio * 20)

Pass example

A restore brings up 8 pods. All 8 reach Ready within 2 minutes. PodsReadyRatio = 8/8 = 1.0. Level 3 contributes 20 points.

Fail example

8 pods are expected. 5 reach Ready, 3 are stuck in CrashLoopBackOff due to a missing database connection string. PodsReadyRatio = 5/8 = 0.625. Level 3 contributes int(0.625 * 20) = 12 points.
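Counting readiness can be sketched with a reduced pod model. The real controller reads the Ready condition from the Kubernetes API; podStatus below is an illustrative stand-in:

```go
package main

import "fmt"

// podStatus is an illustrative stand-in for the Ready condition the
// controller would read from each restored pod.
type podStatus struct {
	Name  string
	Ready bool
}

// podStartupScore computes int(PodsReadyRatio * 20).
func podStartupScore(pods []podStatus) int {
	if len(pods) == 0 {
		return 0
	}
	ready := 0
	for _, p := range pods {
		if p.Ready {
			ready++
		}
	}
	return int(float64(ready) / float64(len(pods)) * 20)
}

func main() {
	// 5 of 8 pods Ready, matching the fail example above.
	pods := []podStatus{
		{"api-0", true}, {"api-1", true}, {"web-0", true},
		{"web-1", true}, {"worker-0", true},
		{"db-0", false}, {"db-1", false}, {"cache-0", false},
	}
	fmt.Println(podStartupScore(pods)) // int(0.625 * 20) = 12
}
```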

How to improve

  • Look at pod events and logs inside the sandbox namespace for crash reasons.
  • Common causes: missing Secrets not included in the backup, image pull failures, init container failures, incorrect environment variable references.
  • Increase the readiness wait window in the RestoreTest spec if your application has a long startup time.

Level 4 — Health Checks

Maximum points: 20

What it checks

User-defined probes declared in a HealthCheckPolicy resource and referenced by the RestoreTest. Health checks can include HTTP probes, TCP probes, or custom exec commands run against the restored pods.

The ratio is: HealthChecksPassRatio = passingChecks / totalChecks.

How it is scored

score += int(HealthChecksPassRatio * 20)

Pass example

A HealthCheckPolicy defines 5 HTTP probes (one per microservice). All 5 return 2xx responses. HealthChecksPassRatio = 5/5 = 1.0. Level 4 contributes 20 points.

Fail example

5 HTTP probes defined. 3 pass, 2 fail because the payment service depends on an external gateway not reachable from the sandbox. HealthChecksPassRatio = 3/5 = 0.6. Level 4 contributes int(0.6 * 20) = 12 points.

How to improve

  • Review failing probe details in the RestoreReport.
  • For probes that target external dependencies, consider adding mock services to the sandbox via the RestoreTest spec or adjusting probes to exclude known-unreachable endpoints.
  • Attach a HealthCheckPolicy that covers only the internal API surface of the restored application.

Level 5 — Cross-Namespace Dependencies

Maximum points: 10 (currently scores 0)

What it checks

Whether services that the restored application depends on in other namespaces are reachable and functional. This covers cross-namespace database connections, shared service meshes, and external operator dependencies.

The planned ratio is: DepsCoverageRatio = coveredDeps / totalDeps.

Current status

This level is in the roadmap but not yet implemented. The controller always sets this contribution to 0:

score += int(DepsCoverageRatio * 10)  // DepsCoverageRatio hardcoded to 0

No configuration is required and no action is needed. Once implemented, this level will participate automatically based on RestoreTest spec fields that declare inter-namespace dependency probes.

Why it matters

A restore can be complete and all pods healthy, yet the application fails in production because a shared cache or message broker in a different namespace was not validated. Level 5 is intended to close this gap.


Level 6 — RTO Compliance

Maximum points: 5

What it checks

Whether the total restore duration fell within the Recovery Time Objective declared in the RestoreTest spec. The controller measures elapsed time from restore trigger to restore completion and compares it to the configured SLA threshold.

How it is scored

if RTOWithinSLA {
    score += 5
} else {
    score += 0
}

Pass example

spec.rtoSLA is set to 10m. The Velero restore completes in 7 minutes. RTOWithinSLA = true. Level 6 contributes 5 points.

Fail example

spec.rtoSLA is set to 10m. A large backup takes 14 minutes to restore due to slow object storage throughput. RTOWithinSLA = false. Level 6 contributes 0 points.

How to improve

  • Review restore duration trends in RestoreReport history to distinguish one-off slowness from a systematic drift.
  • Optimize your backup storage backend (object store proximity, network bandwidth, snapshot mechanism).
  • Consider breaking large namespaces into smaller, independently restorable units.
  • Adjust spec.rtoSLA to reflect a realistic target if the initial value was set too aggressively.