Troubleshooting

General Debug Commands

Before diving into specific issues, collect baseline diagnostic information:

# Controller logs (last 100 lines)
kubectl logs -n kymaros-system deploy/kymaros-controller --tail=100

# API server logs
kubectl logs -n kymaros-system deploy/kymaros-api --tail=100

# All events in kymaros-system
kubectl get events -n kymaros-system --sort-by='.lastTimestamp'

# Status of all Kymaros resources
kubectl get restoretest,restorereport,healthcheckpolicy -n kymaros-system

1. RestoreTest Stays in Idle

Symptom: A RestoreTest resource is created but no run is triggered. The status phase remains Idle indefinitely.

Causes and fixes:

The schedule has not fired yet. Kymaros uses a standard cron expression. With schedule.cron: "0 3 * * *", the test fires daily at 03:00, so the first run occurs at the next 03:00 after the resource is created — potentially almost a full day later. To trigger a run immediately instead of waiting:

kubectl annotate restoretest <name> -n kymaros-system \
  kymaros.io/trigger-now=true --overwrite
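For reference, a minimal RestoreTest with a daily schedule might look like the following sketch. Only schedule.cron is documented above; the apiVersion suffix (v1alpha1) and the surrounding metadata are assumptions:

```yaml
apiVersion: restore.kymaros.io/v1alpha1  # API version is an assumption
kind: RestoreTest
metadata:
  name: nightly-restore-test
  namespace: kymaros-system
spec:
  schedule:
    cron: "0 3 * * *"   # fires daily at 03:00
```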

The controller is not running. Verify the controller pod is Running:

kubectl get pods -n kymaros-system -l app.kubernetes.io/component=controller
kubectl describe pod -n kymaros-system -l app.kubernetes.io/component=controller

No replica has won leader election. If --leader-elect is enabled and multiple controller replicas are running, confirm that exactly one replica holds the lease:

kubectl get lease -n kymaros-system

2. Restore Fails with "Permission Denied"

Symptom: A RestoreReport shows a failed restore step. Controller logs contain forbidden or permission denied errors referencing a Kubernetes API resource.

Cause: The Kymaros ClusterRole is missing a required permission.

Diagnosis:

kubectl describe clusterrole kymaros-controller-role
kubectl auth can-i create pods/exec --as=system:serviceaccount:kymaros-system:kymaros-controller -n <sandbox-namespace>

Fix: Verify that the ClusterRole grants all required permissions. The complete permission set is:

| API Group          | Resources                                              | Verbs                                |
|--------------------|--------------------------------------------------------|--------------------------------------|
| "" (core)          | namespaces                                             | create, delete, get, list, watch     |
| "" (core)          | pods                                                   | get, list, watch                     |
| "" (core)          | pods/exec                                              | create                               |
| "" (core)          | configmaps, secrets, services, persistentvolumeclaims  | get, list, watch                     |
| "" (core)          | limitranges, resourcequotas                            | create, delete                       |
| apps               | deployments, statefulsets                              | get, list, update, watch             |
| networking.k8s.io  | networkpolicies                                        | create, delete                       |
| restore.kymaros.io | *                                                      | * (full CRUD + status + finalizers)  |
| velero.io          | backups, schedules                                     | get, list, watch                     |
| velero.io          | restores                                               | create, delete, get, list, watch     |
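As a sketch, a few rows of the permission table translate into ClusterRole rules like these. The rule grouping is illustrative; the ClusterRole rendered by the Helm chart may group them differently:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kymaros-controller-role
rules:
  - apiGroups: [""]            # core API group
    resources: ["namespaces"]
    verbs: ["create", "delete", "get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec"]   # required for exec-based health checks
    verbs: ["create"]
  - apiGroups: ["velero.io"]
    resources: ["restores"]
    verbs: ["create", "delete", "get", "list", "watch"]
```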

If you deployed via Helm, re-apply the chart to restore the ClusterRole:

helm upgrade kymaros kymaros/kymaros -n kymaros-system --reuse-values

3. Sandbox Namespace Not Cleaned Up

Symptom: After a test completes (or fails), the sandbox namespace (prefix rp-test- or the configured namespacePrefix) remains. Repeated runs accumulate stale namespaces.

Cause: Either the controller crashed before cleanup, or the finalizer kymaros.io/sandbox-cleanup is preventing deletion.

Diagnosis:

# List sandbox namespaces
kubectl get namespace | grep -E 'rp-test|<your-prefix>'

# Check finalizers on the namespace
kubectl get namespace <sandbox-ns> -o jsonpath='{.metadata.finalizers}'

Fix: The TTL failsafe runs independently of the main reconcile loop. If a sandbox persists beyond its TTL (sandbox.ttl), the failsafe will delete it. If the namespace is stuck with a finalizer:

# Remove the finalizer manually (use only after confirming the sandbox is truly orphaned)
kubectl patch namespace <sandbox-ns> \
  -p '{"metadata":{"finalizers":[]}}' \
  --type=merge

After removing the finalizer, the namespace deletion will proceed. Investigate controller logs to understand why the cleanup did not fire normally.
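To reduce reliance on manual cleanup, make sure the TTL failsafe is configured. A sketch of the sandbox stanza on a RestoreTest — sandbox.ttl, sandbox.networkIsolation, and namespacePrefix appear on this page, while the exact duration format is an assumption:

```yaml
spec:
  sandbox:
    namespacePrefix: rp-test-   # default prefix used for sandbox namespaces
    ttl: 2h                     # failsafe deletes the sandbox after this duration
    networkIsolation: strict
```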


4. Score Is 0

Symptom: A RestoreReport shows score: 0.

Cause: A score of 0 indicates the restore itself failed at Level 1 (restore integrity). The Velero restore operation did not complete successfully. No further validation steps were executed.

Diagnosis:

# Get the RestoreReport
kubectl describe restorereport <name> -n kymaros-system

# Check the Velero restore object created by Kymaros
kubectl get restore -n velero | grep kymaros

# Inspect the Velero restore
kubectl describe restore <velero-restore-name> -n velero

Fix: The issue is in the Velero restore, not in Kymaros. Common causes include missing backup storage location credentials, a corrupted backup, or a namespace conflict. Resolve the underlying Velero restore failure, then re-trigger the RestoreTest.


5. Health Check Timeout

Symptom: A RestoreReport shows health checks timing out. Pods are running in the sandbox but HTTP probes or exec probes return no response.

Cause (expected behavior in strict mode): When sandbox.networkIsolation: strict, a NetworkPolicy with a default-deny-all rule is applied to the sandbox namespace. HTTP health checks that probe endpoints outside the sandbox will time out by design. Only intra-sandbox traffic is permitted.

Fix (strict mode): Adjust your HealthCheckPolicy to probe only intra-sandbox endpoints. If you need egress to external dependencies (for example, a shared database), either switch to networkIsolation: permissive for that test or add a specific egress rule in your HealthCheckPolicy.
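An egress rule on a HealthCheckPolicy might look like the sketch below. The egress field name, its sub-fields, and the API version are assumptions based on the description above, not a documented schema:

```yaml
apiVersion: restore.kymaros.io/v1alpha1  # API version is an assumption
kind: HealthCheckPolicy
metadata:
  name: allow-shared-db
spec:
  egress:                                # field name is an assumption
    - host: shared-db.databases.svc.cluster.local
      port: 5432
```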

Cause (misconfiguration): The health check timeout is too short for the sandbox pod startup time. The sandbox environment is cold — images must be pulled, volumes attached, and the application must initialize from scratch.

Fix: Increase the timeout in your HealthCheckPolicy:

spec:
  checks:
    - type: http
      endpoint: "http://my-service:8080/health"
      timeout: 120s
      retryInterval: 10s
      maxRetries: 10

6. CRDs Not Found

Symptom: kubectl get restoretest returns error: the server doesn't have a resource type "restoretest".

Cause: CRDs were not installed. This happens when:

  • The Helm install failed partway through
  • The cluster was upgraded and CRDs were inadvertently removed
  • You deployed from source but skipped make install

Fix:

Via Helm (reinstall will register CRDs):

helm install kymaros kymaros/kymaros -n kymaros-system --create-namespace

Via source:

make install

Manually from release assets:

kubectl apply --server-side -f https://github.com/kymorahq/kymora/releases/download/v<VERSION>/crds.yaml

Verify after applying:

kubectl get crd | grep restore.kymaros.io

7. Dashboard Shows No Data

Symptom: The Kymaros web dashboard loads but shows empty charts or "No data" placeholders. RestoreReport resources exist in the cluster.

Cause: The API server pod (kymaros-api) is not running or is not reachable from the frontend.

Diagnosis:

kubectl get pods -n kymaros-system -l app.kubernetes.io/component=api
kubectl logs -n kymaros-system deploy/kymaros-api --tail=50

Check that the API server is listening on the expected address. The health probe binds to :8081 by default (--health-probe-bind-address):

kubectl exec -n kymaros-system deploy/kymaros-api -- wget -qO- http://localhost:8081/healthz

Fix: If the pod is crash-looping, check logs for startup errors (database connection, missing Secret, license validation failure). If the pod is running but the frontend cannot reach it, check the Service definition:

kubectl get svc -n kymaros-system
kubectl describe svc kymaros-api -n kymaros-system

8. Velero Restore Stuck in InProgress

Symptom: A Velero Restore object created by Kymaros stays in InProgress for an extended time. The RestoreTest run does not advance.

Cause: This is a Velero-side issue. Kymaros waits for the Velero restore to complete before proceeding to validation. It does not manage the Velero restore internals.

Diagnosis:

kubectl describe restore <velero-restore-name> -n velero
kubectl logs -n velero deploy/velero --tail=100

Fix: Resolve the Velero restore issue (storage backend connectivity, plugin crash, PVC provisioning timeout). Once the Velero restore transitions to Completed or Failed, Kymaros will react accordingly. If you need to cancel a stuck run, delete the Velero restore object — Kymaros will mark the RestoreReport as failed and clean up the sandbox.


9. Pods Crash-Loop in the Sandbox

Symptom: The RestoreReport shows low pod startup scores. Pods in the sandbox are in CrashLoopBackOff or Error state.

Cause (expected): This is often expected behavior in a network-isolated sandbox. Applications with hard dependencies on external services (shared databases, external APIs, message brokers in other namespaces) will fail to start because those dependencies are not present in the sandbox. The sandbox only contains what Velero backed up from the source namespace.

Diagnosis:

kubectl logs <pod-name> -n <sandbox-namespace>
kubectl describe pod <pod-name> -n <sandbox-namespace>

Fix (correct approach): Review your backup scope in the RestoreTest spec.backupSource.namespaces field. If the application depends on a service in another namespace, include that namespace in the backup. Alternatively, use a HealthCheckPolicy that accounts for the absence of cross-namespace dependencies.
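For example, to include a dependency namespace in the backup scope (namespace names are illustrative):

```yaml
spec:
  backupSource:
    namespaces:
      - my-app            # the primary application namespace
      - shared-services   # cross-namespace dependency needed at startup
```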

Fix (suppress the alert): If crash-looping is acceptable for your use case and you want to focus validation on other aspects, adjust your HealthCheckPolicy scoring weights to de-prioritize pod startup.
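A weight adjustment might look like the following sketch. The scoring stanza and its field names are assumptions; consult your HealthCheckPolicy schema for the actual keys:

```yaml
spec:
  scoring:               # field names in this stanza are assumptions
    weights:
      podStartup: 0.1    # de-prioritize pod startup
      healthChecks: 0.5
      completeness: 0.4
```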


10. Score Dropped Suddenly

Symptom: A RestoreTest that previously produced scores of 90+ now produces a significantly lower score with no intentional change.

Cause: Kymaros detected a regression. Common triggers include: a new dependency added to the application that is not backed up, a change to the application's startup sequence that increases RTO beyond the sla.maxRTO, or a Velero plugin update that changed restore behavior.

Diagnosis:

# Compare the latest two RestoreReports
kubectl get restorereport -n kymaros-system --sort-by='.metadata.creationTimestamp'
kubectl describe restorereport <latest-report-name> -n kymaros-system

The RestoreReport status contains a per-step breakdown. Identify which validation level regressed (completeness, pod startup, health checks, RTO) and investigate that area specifically.

On Team and Enterprise tiers, regression alerts are sent automatically when a score drop exceeds the configured threshold.


Controller Startup Flags Reference

| Flag                          | Default       | Description                                            |
|-------------------------------|---------------|--------------------------------------------------------|
| --metrics-bind-address        | 0 (disabled)  | Address for the Prometheus metrics endpoint            |
| --metrics-secure              | true          | Serve metrics over HTTPS                               |
| --health-probe-bind-address   | :8081         | Address for the liveness and readiness probe endpoints |
| --leader-elect                | false         | Enable leader election for multi-replica deployments   |