Sandbox Isolation

Every restore validation runs in a purpose-built, isolated Kubernetes namespace called a sandbox. Sandboxes are created before any backup data is restored and deleted after scoring completes. Production namespaces are never touched.

Namespace Naming

Sandbox namespaces follow a deterministic naming scheme:

<prefix>-<testName>-<6char-random>

For example, a RestoreTest named payments-nightly with the default prefix produces namespaces like:

kymaros-payments-nightly-a3f8k2

The six-character random suffix prevents collisions when multiple test cycles overlap or when the same test is re-run rapidly.

Labels

Every sandbox namespace carries a standard label set applied by the Sandbox Manager:

Label                      Value                 Purpose
kymaros.io/managed-by      kymaros               Identifies Kymaros-owned namespaces for cluster-wide queries
kymaros.io/test            <restoretest-name>    Links the namespace to its controlling RestoreTest
kymaros.io/test-namespace  <source-namespace>    Records the production namespace being restored
kymaros.io/group           <group-name>          Enables group-scoped network policies (see Group Mode)

These labels are used for lifecycle management: during cleanup, the controller lists all namespaces with kymaros.io/test=<name> and deletes them all. No resource is left orphaned.
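Putting the pieces together, a sandbox namespace for the payments-nightly example might look like the following sketch. The label values shown here (the source namespace and group name in particular) are illustrative, and the group label is only present when the test runs in group mode:

```yaml
# Illustrative sandbox Namespace as created by the Sandbox Manager.
# Label values follow the payments-nightly example; "payments" and
# "payments-group" are assumed names, not taken from a real test.
apiVersion: v1
kind: Namespace
metadata:
  name: kymaros-payments-nightly-a3f8k2
  labels:
    kymaros.io/managed-by: kymaros
    kymaros.io/test: payments-nightly
    kymaros.io/test-namespace: payments        # assumed source namespace
    kymaros.io/group: payments-group           # group mode only
```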

Security Controls

NetworkPolicy

Every sandbox namespace receives a NetworkPolicy that controls ingress and egress based on the spec.networkPolicy field in the RestoreTest.

strict mode (default)

A deny-all policy is applied: the policy selects all pods, lists both Ingress and Egress in policyTypes, and defines no rules, so no traffic can enter or leave the sandbox namespace.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kymaros-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress: []
  egress: []

This ensures a restored workload cannot accidentally affect production databases, message brokers, or external APIs during validation.

group mode

Traffic between namespaces in the same group is permitted. The policy uses namespaceSelector matching kymaros.io/group: <group-name>. This is used for multi-namespace restores where services in one sandbox need to communicate with services in another sandbox within the same test cycle.
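A group-mode policy might look like the following sketch. The group name "payments-group" and the policy name are assumptions; the Sandbox Manager generates the actual values per test cycle:

```yaml
# Sketch of a group-mode policy: traffic is allowed only to and from
# namespaces carrying the same kymaros.io/group label. Names are
# illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kymaros-group-allow
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kymaros.io/group: payments-group
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kymaros.io/group: payments-group
```

Because no other ingress or egress rules exist, traffic that does not match the group selector is still denied, preserving isolation from the rest of the cluster.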

ResourceQuota

The Sandbox Manager applies a ResourceQuota to each namespace from the spec.resources section of the RestoreTest. The field mapping is:

spec.resources field   Quota resource(s) applied
cpu                    requests.cpu and limits.cpu
memory                 requests.memory and limits.memory
storage                requests.storage

This prevents a runaway restore from exhausting cluster compute or storage capacity.
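As an example of the mapping, a hypothetical spec.resources of cpu: "2", memory: 4Gi, storage: 20Gi would produce a quota along these lines (the quota name and values are illustrative):

```yaml
# ResourceQuota generated from an assumed spec.resources of
# cpu: "2", memory: 4Gi, storage: 20Gi.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: kymaros-quota
spec:
  hard:
    requests.cpu: "2"
    limits.cpu: "2"
    requests.memory: 4Gi
    limits.memory: 4Gi
    requests.storage: 20Gi
```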

LimitRange

A LimitRange is always applied to every sandbox namespace, regardless of what spec.resources specifies. It sets conservative defaults so pods without explicit requests or limits are still bounded:

Resource   Default Limit   Default Request
CPU        500m            100m
Memory     512Mi           128Mi

The LimitRange applies at the container level. Pods that do not specify their own requests and limits will inherit these values automatically.
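Expressed as a manifest, the defaults in the table above correspond to a LimitRange like this (the object name is illustrative):

```yaml
# Container-level defaults applied to every sandbox namespace.
# Pods without explicit requests/limits inherit these values.
apiVersion: v1
kind: LimitRange
metadata:
  name: kymaros-limits
spec:
  limits:
  - type: Container
    default:            # default limits
      cpu: 500m
      memory: 512Mi
    defaultRequest:     # default requests
      cpu: 100m
      memory: 128Mi
```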

Lifecycle

RestoreTest reconciles
         │
         ▼
┌───────────────────┐
│ Create sandbox    │  Namespace + labels + NetworkPolicy
│ namespace(s)      │  + ResourceQuota + LimitRange
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ Trigger restore   │  Backup adapter maps backup data
│ into sandbox      │  into sandbox namespace
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ Run validation    │  Completeness, pod readiness,
│ stages            │  health checks, scoring
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ Delete sandbox    │  Namespace deletion cascades to
│ namespace(s)      │  all contained resources
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ Write             │  RestoreReport created,
│ RestoreReport     │  RestoreTest status updated
└───────────────────┘

Sandbox cleanup happens before the RestoreReport is written. This ordering is intentional: it ensures that even if the status write fails (for example, due to a controller restart), the sandbox has already been removed. The result is that cleanup is never conditional on a successful status update.

TTL Failsafe

If the controller crashes between sandbox creation and cleanup, the finalizer kymaros.io/sandbox-cleanup on the RestoreTest resource ensures cleanup is attempted when the controller restarts. Any sandbox namespace bearing the kymaros.io/managed-by label can also be discovered and cleaned up manually using:

kubectl delete namespace -l kymaros.io/managed-by=kymaros

Group Mode for Multi-Namespace Restores

When an application spans multiple Kubernetes namespaces (for example, a frontend namespace and a backend namespace), the RestoreTest can restore both simultaneously. Each namespace gets its own sandbox, and the kymaros.io/group label ties them together.

In group mode, the NetworkPolicy allows inter-sandbox traffic within the same group while still blocking all traffic to and from outside the group. This lets the validation runner test cross-service calls (frontend calling backend API) without exposing the sandbox to the rest of the cluster.
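A multi-namespace RestoreTest might be declared along these lines. This is a hedged sketch only: apart from spec.networkPolicy, which is documented above, the apiVersion and field names here are assumptions, not the actual CRD schema:

```yaml
# Hypothetical RestoreTest covering two namespaces in one group.
# Field names other than spec.networkPolicy are illustrative.
apiVersion: kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: shop-nightly
spec:
  networkPolicy: group     # allow traffic between sandboxes in the group
  namespaces:              # assumed field: source namespaces to restore
  - frontend
  - backend
```

Each listed namespace would get its own sandbox, with both sandboxes sharing a kymaros.io/group label so the frontend sandbox can call the backend sandbox during validation.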

Zero Impact on Production

The isolation model makes it structurally impossible for a sandbox to affect production:

  • Restore data is mapped into the sandbox namespace via namespace remapping in the backup adapter, not copied into the source namespace.
  • NetworkPolicy deny-all prevents outbound connections to production databases or APIs.
  • ResourceQuota and LimitRange cap the compute and storage footprint.
  • Sandbox namespaces are deleted after every cycle, leaving no persistent state.
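The namespace remapping in the first point can be illustrated with Velero, assuming it is the backup adapter in use (other adapters would express the same idea differently). A Velero Restore supports a namespaceMapping that redirects restored resources into a different namespace:

```yaml
# Illustrative only: if the backup adapter were Velero, the remapping
# corresponds to a Restore with spec.namespaceMapping. Backup and
# namespace names are assumed.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: kymaros-payments-nightly-a3f8k2
  namespace: velero
spec:
  backupName: payments-backup                    # assumed backup name
  namespaceMapping:
    payments: kymaros-payments-nightly-a3f8k2    # source -> sandbox
```

The source namespace "payments" is read from the backup, never written to; all restored objects land in the sandbox.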

The only cluster-level side effect of running Kymaros is the transient compute and storage consumption of the sandbox during the validation window.