HealthCheckPolicy

API group: restore.kymaros.io/v1alpha1
Kind: HealthCheckPolicy
Short name: hcp
Scope: Namespaced (typically kymaros-system)

A HealthCheckPolicy defines a named, reusable set of probes that Kymaros runs inside a sandbox after a restore completes. A RestoreTest references a policy by name via spec.healthChecks.policyRef. Multiple tests can share the same policy.
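
As a rough illustration of the reference described above, a RestoreTest might point at a policy as follows. Only spec.healthChecks.policyRef and the HealthCheckRef timeout field are mentioned on this page. The apiVersion of RestoreTest, the resource name, and the plain-string form of policyRef are assumptions, and any other fields a RestoreTest requires are omitted, so consult the RestoreTest reference for the actual schema.

```yaml
# Hypothetical sketch: everything except healthChecks.policyRef and
# healthChecks.timeout is assumed, and other RestoreTest fields are omitted.
apiVersion: restore.kymaros.io/v1alpha1   # assumed to match HealthCheckPolicy
kind: RestoreTest
metadata:
  name: web-api-restore-test              # hypothetical name
  namespace: kymaros-system
spec:
  healthChecks:
    policyRef: web-api-health-policy      # policy name (assumed to be a plain string)
    timeout: 10m                          # HealthCheckRef.timeout, used when a podStatus check sets no timeout
```

Because the policy is referenced by name, several RestoreTest resources can carry the same policyRef value and share one set of checks.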


Spec

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| checks | []HealthCheck | Yes | Ordered list of health checks to execute. All checks run unless a preceding check causes an abort. |

HealthCheck

Each entry in checks defines a single probe. The type field selects the probe type and determines which sub-fields are relevant.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Unique identifier for this check within the policy. Referenced in RestoreReport.status.checks[*].name. |
| type | string | Yes | Probe type. Accepted values: podStatus, httpGet, exec, tcpSocket, resourceExists. |
| podStatus | PodStatusCheck | Conditional | Required when type is podStatus. |
| httpGet | HTTPGetCheck | Conditional | Required when type is httpGet. |
| exec | ExecCheck | Conditional | Required when type is exec. |
| tcpSocket | TCPSocketCheck | Conditional | Required when type is tcpSocket. |
| resourceExists | ResourceExistsCheck | Conditional | Required when type is resourceExists. |
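
The pairing between type and its sub-field can be sketched as below: each entry sets the one block named by its type. The check names, service name, and values are illustrative only, and this page does not state how the API treats an entry that sets a mismatched or additional block.

```yaml
checks:
  - name: cache-reachable          # hypothetical check name
    type: tcpSocket                # selects the tcpSocket block below
    tcpSocket:
      service: redis
      port: 6379
  - name: app-config-present       # hypothetical check name
    type: resourceExists           # selects the resourceExists block below
    resourceExists:
      resources:
        - kind: ConfigMap
          name: app-config
```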

Check type reference

podStatus

Verifies that pods matching a label selector have reached Running and Ready state. This is the most common first check after a restore.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| labelSelector | map[string]string | Yes | | Kubernetes label selector used to identify the target pods in the sandbox namespace. |
| minReady | int | Yes | | Minimum number of pods that must be in Ready state for the check to pass. |
| timeout | Duration | No | | Maximum time to wait for pods to reach Ready state. When omitted, the HealthCheckRef.timeout on the RestoreTest governs. |
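
A minimal sketch of a podStatus check that omits timeout, so that the wait is bounded by HealthCheckRef.timeout on the referencing RestoreTest, as described in the table. The check name and label value are placeholders.

```yaml
- name: frontend-pods-ready      # hypothetical check name
  type: podStatus
  podStatus:
    labelSelector:
      app: frontend              # placeholder label
    minReady: 1
    # timeout omitted: the RestoreTest's HealthCheckRef.timeout applies
```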

httpGet

Sends an HTTP GET request to a service within the sandbox and validates the response status code.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| service | string | Yes | | Name of the Kubernetes Service in the sandbox namespace to probe. |
| port | int | Yes | | Port number on the service. |
| path | string | Yes | | HTTP path to request (e.g., "/healthz"). |
| expectedStatus | int | Yes | | HTTP response status code expected for the check to pass (e.g., 200). |
| timeout | Duration | No | | Per-request timeout. |
| retries | int | No | | Number of retry attempts before marking the check as failed. |

exec

Executes a command inside a container running in the sandbox and validates the exit code.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| podSelector | map[string]string | Yes | | Label selector for the target pod. The command runs in the first matching pod. |
| container | string | Yes | | Name of the container within the selected pod in which to run the command. |
| command | []string | Yes | | Command and arguments to execute (exec form, not shell). Example: ["pg_isready", "-U", "app"]. |
| successExitCode | int | Yes | | Exit code that indicates success. Typically 0. |
| timeout | Duration | No | | Maximum time to wait for the command to complete. |
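
Because command is given in exec form, shell syntax such as pipes, redirects, or && is passed along as literal arguments rather than interpreted. The sketch below contrasts a direct invocation with one that wraps the command in an explicit shell. It assumes the container image ships /bin/sh and redis-cli; the check names and labels are illustrative.

```yaml
# Direct invocation: each list item is one argument, no shell involved.
- name: cache-responds             # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: redis
    container: redis
    command: ["redis-cli", "ping"]
    successExitCode: 0

# A pipe needs an explicit shell; the exit code is that of grep.
- name: cache-replies-pong         # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: redis
    container: redis
    command:
      - /bin/sh
      - -c
      - "redis-cli ping | grep -q PONG"
    successExitCode: 0
```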

tcpSocket

Attempts a TCP connection to a port on a service within the sandbox. Passes if the connection is accepted.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| service | string | Yes | | Name of the Kubernetes Service in the sandbox namespace to probe. |
| port | int | Yes | | TCP port to connect to. |
| timeout | Duration | No | | Connection timeout. |

resourceExists

Verifies that specific named Kubernetes resources are present in the sandbox after the restore. Useful for confirming that custom resources and other objects were included in the backup and restored correctly.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| resources | []ResourceRef | Yes | List of resources to verify. |

ResourceRef

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| kind | string | Yes | Kubernetes resource kind (e.g., "Deployment", "ConfigMap", "MyCustomResource"). |
| name | string | Yes | Name of the resource to look for in the sandbox namespace. |

Examples

Web API application

Checks that the API pods are ready, that the /healthz and /readyz endpoints return HTTP 200, that the in-cluster Redis cache service accepts TCP connections, and that the expected ConfigMap and Secret were restored.

apiVersion: restore.kymaros.io/v1alpha1
kind: HealthCheckPolicy
metadata:
  name: web-api-health-policy
  namespace: kymaros-system
spec:
  checks:
    - name: api-pods-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: api
          component: server
        minReady: 2
        timeout: 5m

    - name: api-http-healthz
      type: httpGet
      httpGet:
        service: api-service
        port: 8080
        path: /healthz
        expectedStatus: 200
        timeout: 10s
        retries: 3

    - name: api-http-readyz
      type: httpGet
      httpGet:
        service: api-service
        port: 8080
        path: /readyz
        expectedStatus: 200
        timeout: 10s
        retries: 3

    - name: redis-tcp-reachable
      type: tcpSocket
      tcpSocket:
        service: redis
        port: 6379
        timeout: 5s

    - name: configmap-exists
      type: resourceExists
      resourceExists:
        resources:
          - kind: ConfigMap
            name: api-config
          - kind: Secret
            name: api-tls-cert

Database application

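Verifies that the PostgreSQL StatefulSet pod is ready, runs pg_isready inside the container to confirm the database process is accepting connections, checks that the service port is reachable over TCP, queries the schema_migrations table, and confirms that the data PersistentVolumeClaim was restored.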

apiVersion: restore.kymaros.io/v1alpha1
kind: HealthCheckPolicy
metadata:
  name: postgres-health-policy
  namespace: kymaros-system
spec:
  checks:
    - name: postgres-pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: postgres
          role: primary
        minReady: 1
        timeout: 8m

    - name: postgres-accepting-connections
      type: exec
      exec:
        podSelector:
          app: postgres
          role: primary
        container: postgres
        command:
          - pg_isready
          - -U
          - app
          - -d
          - orders
        successExitCode: 0
        timeout: 30s

    - name: postgres-tcp-port
      type: tcpSocket
      tcpSocket:
        service: postgres-service
        port: 5432
        timeout: 5s

    - name: postgres-schema-migrated
      type: exec
      exec:
        podSelector:
          app: postgres
          role: primary
        container: postgres
        command:
          - psql
          - -U
          - app
          - -d
          - orders
          - -c
          - "SELECT COUNT(*) FROM schema_migrations;"
        successExitCode: 0
        timeout: 1m

    - name: pvc-exists
      type: resourceExists
      resourceExists:
        resources:
          - kind: PersistentVolumeClaim
            name: postgres-data-postgres-0

Background worker application

Verifies that worker pods are ready and that a custom WorkerQueue resource and the worker ConfigMap were restored, then confirms that the RabbitMQ job queue accepts TCP connections and that the worker process is running inside its container.

apiVersion: restore.kymaros.io/v1alpha1
kind: HealthCheckPolicy
metadata:
  name: worker-health-policy
  namespace: kymaros-system
spec:
  checks:
    - name: worker-pods-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: worker
        minReady: 1
        timeout: 6m

    - name: worker-queue-resource-exists
      type: resourceExists
      resourceExists:
        resources:
          - kind: WorkerQueue
            name: default-queue
          - kind: ConfigMap
            name: worker-config

    - name: rabbitmq-tcp-reachable
      type: tcpSocket
      tcpSocket:
        service: rabbitmq
        port: 5672
        timeout: 10s

    - name: worker-process-alive
      type: exec
      exec:
        podSelector:
          app: worker
        container: worker
        command:
          - /bin/sh
          - -c
          - "ps aux | grep -q '[w]orker' && exit 0 || exit 1"
        successExitCode: 0
        timeout: 10s

kubectl quick reference

# List all HealthCheckPolicy resources
kubectl get hcp -n kymaros-system

# Describe a policy to see all checks
kubectl describe hcp web-api-health-policy -n kymaros-system

# List check names and types for a specific policy
kubectl get hcp web-api-health-policy -n kymaros-system \
  -o jsonpath='{range .spec.checks[*]}{.name}{"\t"}{.type}{"\n"}{end}'