Health Checks
Kymaros validates a restored workload through a configurable sequence of health checks. Each check is a discrete probe with its own fields, timeout, and failure semantics. The results of all checks contribute to the RestoreTest score (0–100).
Health checks are defined either inline in a RestoreTest resource or in a standalone HealthCheckPolicy that a RestoreTest references.
Overview of check types
| Type | Transport | Primary use |
|---|---|---|
| podStatus | Kubernetes API | Wait for pods to reach Ready state |
| httpGet | HTTP over ClusterIP | Validate an HTTP endpoint after restore |
| tcpSocket | TCP dial | Confirm a port is accepting connections |
| exec | kubectl exec (SPDY) | Run a command inside a container |
| resourceExists | Kubernetes API | Confirm Secrets, ConfigMaps, Services, and PVCs exist |
podStatus
Queries the Kubernetes API for pods matching a label selector and counts how many have PodReady condition = True. The check passes when the ready count is at or above minReady.
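The pass rule above can be sketched in a few lines of Go. This is a minimal illustration, not the operator's code: the Pod struct, matchesSelector, countReady, and podStatusPasses are hypothetical names, and the sketch evaluates a single snapshot of pods rather than polling until timeout.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for a Kubernetes pod.
type Pod struct {
	Name   string
	Labels map[string]string
	Ready  bool // true when the PodReady condition is True
}

// matchesSelector reports whether labels contain every key/value pair in selector.
func matchesSelector(labels, selector map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// countReady returns how many pods match the selector and are Ready.
func countReady(pods []Pod, selector map[string]string) int {
	n := 0
	for _, p := range pods {
		if p.Ready && matchesSelector(p.Labels, selector) {
			n++
		}
	}
	return n
}

// podStatusPasses applies the rule: the ready count must be at or above minReady.
func podStatusPasses(pods []Pod, selector map[string]string, minReady int) bool {
	return countReady(pods, selector) >= minReady
}

func main() {
	pods := []Pod{
		{Name: "api-0", Labels: map[string]string{"app": "api-server", "tier": "backend"}, Ready: true},
		{Name: "api-1", Labels: map[string]string{"app": "api-server", "tier": "backend"}, Ready: false},
		{Name: "web-0", Labels: map[string]string{"app": "frontend"}, Ready: true},
	}
	selector := map[string]string{"app": "api-server", "tier": "backend"}

	fmt.Println(countReady(pods, selector))         // 1
	fmt.Println(podStatusPasses(pods, selector, 2)) // false
}
```

Only api-0 is both selected and Ready, so a minReady of 2 fails while a minReady of 1 would pass.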
Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| labelSelector | map[string]string | Yes | — | Key/value pairs used to filter pods |
| minReady | int | Yes | — | Minimum number of Ready pods required |
| timeout | Duration | No | — | How long to wait before failing |
When to use it
Use podStatus as the first check in any sequence. It ensures the restore actually brought the workload up before any network or exec check attempts a connection. Without this gate, HTTP checks against a still-starting pod produce misleading failures.
Example
- type: podStatus
  podStatus:
    labelSelector:
      app: api-server
      tier: backend
    minReady: 2
    timeout: 3m
Common pitfalls
- Setting minReady higher than the replica count captured in the backup snapshot guarantees a failure. If your backup was taken during a scale-down event, minReady: 1 is safer.
- A pod that is Running but not Ready (for example, failing its readiness probe) does not count toward minReady. This is intentional: it catches misconfigured applications after restore.
httpGet
Resolves the ClusterIP of the named Service, then performs an HTTP GET request. The check passes when the response status matches expectedStatus.
Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| service | string | Yes | — | Name of the Kubernetes Service to resolve |
| port | int | Yes | — | Port to connect to |
| path | string | Yes | — | URL path (include leading /) |
| expectedStatus | int | Yes | — | HTTP status code that signals success |
| timeout | Duration | No | 10s | Per-request timeout |
| retries | int | No | 1 | Number of retry attempts |
The retry delay between attempts is fixed at 1 second.
When to use it
Use httpGet for any workload that exposes a readiness or health endpoint over HTTP — REST APIs, web applications, Prometheus exporters. Prefer it over tcpSocket when the application-level response matters, not just port reachability.
Example
- type: httpGet
  httpGet:
    service: api-server-svc
    port: 8080
    path: /healthz
    expectedStatus: 200
    timeout: 15s
    retries: 3
Common pitfalls
- The check resolves the Service ClusterIP, not a DNS hostname. If the Service does not exist in the restored namespace, the check fails while resolving the Service, before any HTTP request is sent.
- expectedStatus: 200 is the common case, but some APIs return 204 for health endpoints. Verify the actual status your application returns.
- TLS is not supported. For HTTPS services, either use tcpSocket for port reachability or add a sidecar endpoint that serves HTTP.
tcpSocket
Resolves the ClusterIP of the named Service and opens a TCP connection using net.DialTimeout. The check passes when the connection is established.
Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| service | string | Yes | — | Name of the Kubernetes Service to resolve |
| port | int | Yes | — | Port to dial |
| timeout | Duration | No | 10s | Connection timeout |
When to use it
Use tcpSocket for protocols that are not HTTP: databases (PostgreSQL 5432, MySQL 3306, Redis 6379, MongoDB 27017), message brokers, gRPC services. It confirms the process is listening without requiring protocol knowledge.
Example
- type: tcpSocket
  tcpSocket:
    service: postgres-svc
    port: 5432
    timeout: 10s
Common pitfalls
- A successful TCP dial only confirms the port is open. The database process may still be in recovery mode (for example, PostgreSQL replaying WAL). Follow tcpSocket with an exec check that runs a real query to confirm readiness.
- If the Service exists but no pod has started, the TCP dial succeeds at the Service level and then immediately closes. Some databases send a banner on connect that would distinguish this, but tcpSocket does not inspect the response.
exec
Finds a pod matching podSelector, opens an exec session via the SPDY protocol, and runs command inside the specified container. The check passes when the command exits with successExitCode.
Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| podSelector | map[string]string | Yes | — | Label selector to find the target pod |
| container | string | No | First container | Container name to exec into |
| command | []string | Yes | — | Command and arguments as a list |
| successExitCode | int | Yes | 0 | Exit code that signals success |
| timeout | Duration | No | — | Maximum time for the command to run |
The operator requires a valid restConfig to open SPDY connections. This is available automatically when the operator runs inside the cluster.
When to use it
Use exec when you need to validate application-level state that is only visible from inside the container: running a database query, checking a file, invoking a CLI health command. It is the most powerful check type and the most expensive — SPDY session setup has latency overhead.
Example: PostgreSQL pg_isready
- type: exec
  exec:
    podSelector:
      app: postgres
    container: postgres
    command:
      - pg_isready
      - -U
      - postgres
      - -d
      - myapp
    successExitCode: 0
    timeout: 30s
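pg_isready exits 0 when the server is accepting connections, 1 when it is rejecting them (for example, during startup), 2 when there is no response, and 3 when no attempt was made (for example, because of invalid parameters). This is more reliable than a TCP dial because it confirms the server has completed startup and is ready for queries.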
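To make the exit-code comparison concrete, here is a small Go sketch. The function names pgIsReadyStatus and execPasses are hypothetical; the meanings for codes 0 to 2 come from the text above, and code 3 (no attempt made, for example because of invalid parameters) is taken from the pg_isready documentation.

```go
package main

import "fmt"

// pgIsReadyStatus maps a pg_isready exit code to its documented meaning.
func pgIsReadyStatus(code int) string {
	switch code {
	case 0:
		return "accepting connections"
	case 1:
		return "rejecting connections"
	case 2:
		return "no response"
	case 3:
		return "no attempt made"
	default:
		return "unknown exit code"
	}
}

// execPasses applies the exec rule: the check passes only when the command's
// exit code equals successExitCode.
func execPasses(exitCode, successExitCode int) bool {
	return exitCode == successExitCode
}

func main() {
	for code := 0; code <= 3; code++ {
		fmt.Printf("exit %d: %s, passes=%t\n", code, pgIsReadyStatus(code), execPasses(code, 0))
	}
}
```

With successExitCode set to 0, only the first line reports passes=true; a server that is still starting (exit 1) fails the check even though its port may already be open.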
Example: Redis PING
- type: exec
  exec:
    podSelector:
      app: redis
      role: master
    container: redis
    command:
      - redis-cli
      - PING
    successExitCode: 0
    timeout: 10s
Common pitfalls
- podSelector must match exactly one pod. If it matches zero or more than one, the check fails. For deployments with multiple replicas, add enough labels to target a single pod, or test a specific replica.
- The container field is optional, but should be set explicitly in pods with sidecars (service meshes, logging agents). An exec into the wrong container produces confusing error output.
- Commands that produce output but exit non-zero still fail the check. If you are testing a MySQL query with mysql -e "SELECT 1", ensure the command exits 0 on success; it does by default, but --skip-column-names and error handling can change that.
resourceExists
Checks whether a list of Kubernetes resources exist in the target namespace. Supported kinds are: Secret, ConfigMap, Service, and PVC. The check fails immediately if any resource in the list is missing.
Fields
| Field | Type | Required | Description |
|---|---|---|---|
| resources | []ResourceRef | Yes | List of resources to verify |
ResourceRef fields:
| Field | Type | Description |
|---|---|---|
| kind | string | One of: Secret, ConfigMap, Service, PVC |
| name | string | Name of the resource |
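The fail-on-first-missing behaviour can be sketched in Go with an in-memory set standing in for the namespace contents. ResourceRef mirrors the two fields in the table above; checkResources and the existing map are hypothetical, and rejecting unsupported kinds with an error is an assumption of this sketch.

```go
package main

import "fmt"

// ResourceRef mirrors the kind/name pair described in the table above.
type ResourceRef struct {
	Kind string
	Name string
}

// supportedKinds lists the kinds this check type accepts.
var supportedKinds = map[string]bool{
	"Secret": true, "ConfigMap": true, "Service": true, "PVC": true,
}

// checkResources walks refs in order and returns an error for the first
// unsupported kind or missing resource. existing is a stand-in for lookups
// against the Kubernetes API in the target namespace.
func checkResources(refs []ResourceRef, existing map[ResourceRef]bool) error {
	for _, ref := range refs {
		if !supportedKinds[ref.Kind] {
			return fmt.Errorf("unsupported kind %q", ref.Kind)
		}
		if !existing[ref] {
			return fmt.Errorf("%s %q not found", ref.Kind, ref.Name)
		}
	}
	return nil
}

func main() {
	existing := map[ResourceRef]bool{
		{Kind: "Secret", Name: "db-credentials"}: true,
		{Kind: "ConfigMap", Name: "app-config"}:  true,
	}
	refs := []ResourceRef{
		{Kind: "Secret", Name: "db-credentials"},
		{Kind: "Secret", Name: "tls-cert"},
		{Kind: "ConfigMap", Name: "app-config"},
	}
	fmt.Println(checkResources(refs, existing)) // Secret "tls-cert" not found
}
```

The loop stops at tls-cert and never looks up app-config, matching the immediate-failure behaviour described for this check.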
When to use it
Use resourceExists to confirm that the backup included supporting resources, not just the main workload. Applications commonly break after restore because a Secret containing database credentials or a TLS certificate was excluded from the backup scope.
Example
- type: resourceExists
  resourceExists:
    resources:
      - kind: Secret
        name: db-credentials
      - kind: Secret
        name: tls-cert
      - kind: ConfigMap
        name: app-config
      - kind: PVC
        name: uploads-pvc
Common pitfalls
- Kubernetes Secrets created by operators (for example, kubernetes.io/service-account-token secrets) may not be present in a cross-cluster restore. Only check for Secrets that your application explicitly depends on.
- PVC existence does not guarantee the volume has data. Follow resourceExists with an exec check that reads from the mounted path if data integrity matters.
Complete HealthCheckPolicy example
The following HealthCheckPolicy combines multiple check types in a sequenced validation for a three-tier web application that uses PostgreSQL.
apiVersion: restore.kymaros.io/v1alpha1
kind: HealthCheckPolicy
metadata:
  name: webapp-full-check
  namespace: kymaros-system
spec:
  checks:
    # 1. Confirm supporting resources exist before probing live services.
    - name: required-resources
      type: resourceExists
      resourceExists:
        resources:
          - kind: Secret
            name: db-credentials
          - kind: ConfigMap
            name: app-config
          - kind: PVC
            name: uploads-pvc
    # 2. Wait for the database pod to be ready before running queries.
    - name: postgres-pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: postgres
        minReady: 1
        timeout: 5m
    # 3. Confirm PostgreSQL is accepting connections at the application level.
    - name: postgres-accepting-queries
      type: exec
      exec:
        podSelector:
          app: postgres
        container: postgres
        command:
          - pg_isready
          - -U
          - postgres
          - -d
          - webapp
        successExitCode: 0
        timeout: 30s
    # 4. Confirm the application pods are running.
    - name: api-pods-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: api-server
        minReady: 2
        timeout: 3m
    # 5. Validate the HTTP health endpoint.
    - name: api-health-endpoint
      type: httpGet
      httpGet:
        service: api-server-svc
        port: 8080
        path: /healthz
        expectedStatus: 200
        timeout: 15s
        retries: 3
    # 6. Confirm the frontend service TCP port is open.
    - name: frontend-port
      type: tcpSocket
      tcpSocket:
        service: frontend-svc
        port: 3000
        timeout: 10s
This policy can be referenced by a RestoreTest using .spec.healthCheckPolicyRef:
apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: webapp-nightly
  namespace: kymaros-system
spec:
  schedule: "0 2 * * *"
  backupSource:
    name: webapp-backup
    namespace: webapp-prod
  healthCheckPolicyRef:
    name: webapp-full-check
Checks run in the order they are declared. A check failure stops the sequence and records the first failing check in the RestoreReport.