
Health Checks

Kymaros validates a restored workload through a configurable sequence of health checks. Each check is a discrete probe with its own fields, timeout, and failure semantics. The results of all checks contribute to the RestoreTest score (0–100).

Health checks are defined either inline in a RestoreTest resource or in a standalone HealthCheckPolicy that a RestoreTest references.

Overview of check types

| Type | Transport | Primary use |
|---|---|---|
| podStatus | Kubernetes API | Wait for pods to reach Ready state |
| httpGet | HTTP over ClusterIP | Validate an HTTP endpoint after restore |
| tcpSocket | TCP dial | Confirm a port is accepting connections |
| exec | kubectl exec (SPDY) | Run a command inside a container |
| resourceExists | Kubernetes API | Confirm Secrets, ConfigMaps, PVCs exist |

podStatus

Queries the Kubernetes API for pods matching a label selector and counts how many have PodReady condition = True. The check passes when the ready count is at or above minReady.
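
The pass condition can be sketched in a few lines of Python. This is only an illustration of the counting rule, not the Kymaros implementation: pods are modeled as plain dicts shaped like Kubernetes Pod JSON, and the helper names `is_ready`, `matches`, and `pod_status_passes` are hypothetical.

```python
from typing import Any, Dict, List

Pod = Dict[str, Any]


def is_ready(pod: Pod) -> bool:
    """Return True when the pod has a condition of type Ready with status "True"."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def matches(pod: Pod, selector: Dict[str, str]) -> bool:
    """Return True when every selector key/value pair appears in the pod labels."""
    labels = pod.get("metadata", {}).get("labels", {})
    return all(labels.get(key) == value for key, value in selector.items())


def pod_status_passes(pods: List[Pod], selector: Dict[str, str], min_ready: int) -> bool:
    """Count matching pods that are Ready and compare the count with min_ready."""
    ready = sum(1 for pod in pods if matches(pod, selector) and is_ready(pod))
    return ready >= min_ready
```

A pod whose phase is Running but whose Ready condition is "False" adds nothing to the count, so it cannot satisfy minReady on its own.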

Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| labelSelector | map[string]string | Yes | | Key/value pairs used to filter pods |
| minReady | int | Yes | | Minimum number of Ready pods required |
| timeout | Duration | No | | How long to wait before failing |

When to use it

Use podStatus as the first check in any sequence. It ensures the restore actually brought the workload up before any network or exec check attempts a connection. Without this gate, HTTP checks against a still-starting pod produce misleading failures.

Example

- type: podStatus
  podStatus:
    labelSelector:
      app: api-server
      tier: backend
    minReady: 2
    timeout: 3m

Common pitfalls

  • Setting minReady higher than the replica count captured in the backup snapshot guarantees a failure. If your backup may have been taken during a scale-down event, minReady: 1 is safer.
  • A pod that is Running but not Ready (for example, because its readiness probe is failing) does not count toward minReady. This is intentional: it catches misconfigured applications after restore.

httpGet

Resolves the ClusterIP of the named Service, then performs an HTTP GET request. The check passes when the response status matches expectedStatus.

Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| service | string | Yes | | Name of the Kubernetes Service to resolve |
| port | int | Yes | | Port to connect to |
| path | string | Yes | | URL path (include leading /) |
| expectedStatus | int | Yes | | HTTP status code that signals success |
| timeout | Duration | No | 10s | Per-request timeout |
| retries | int | No | 1 | Number of retry attempts |

The retry delay between attempts is fixed at 1 second.
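
The retry behavior can be sketched as a small loop. This is a hedged illustration rather than the operator's code. It assumes that `retries` is the total number of attempts, since the source does not say whether the first request counts. The function name `http_get_passes` and the injected `fetch_status` and `sleep` callables are hypothetical, chosen so the logic can be exercised without a network.

```python
import time
from typing import Callable


def http_get_passes(
    fetch_status: Callable[[], int],
    expected_status: int,
    retries: int = 1,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call fetch_status up to `retries` times with a fixed pause between attempts.

    Assumption: `retries` is the total attempt count, not the number of
    additional attempts after the first one.
    """
    attempts = max(1, retries)
    for attempt in range(attempts):
        try:
            if fetch_status() == expected_status:
                return True
        except OSError:
            # Refused connections and socket timeouts count as a failed attempt.
            pass
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

In a real probe, `fetch_status` could wrap `http.client.HTTPConnection(host, port, timeout=...)`, issue a GET for the configured path, and return the `status` of the response. With retries: 3 and a 15s timeout, the worst case under this assumption is roughly three timeouts plus two one-second pauses.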

When to use it

Use httpGet for any workload that exposes a readiness or health endpoint over HTTP — REST APIs, web applications, Prometheus exporters. Prefer it over tcpSocket when the application-level response matters, not just port reachability.

Example

- type: httpGet
  httpGet:
    service: api-server-svc
    port: 8080
    path: /healthz
    expectedStatus: 200
    timeout: 15s
    retries: 3

Common pitfalls

  • The check resolves the Service's ClusterIP rather than a DNS hostname. If the Service does not exist in the restored namespace, the check fails while resolving the Service, before any HTTP request is sent.
  • expectedStatus: 200 is the common case but some APIs return 204 for health endpoints. Verify the actual status your application returns.
  • TLS is not supported. For HTTPS services, either use tcpSocket for port reachability or add a sidecar endpoint that serves HTTP.

tcpSocket

Resolves the ClusterIP of the named Service and opens a TCP connection using net.DialTimeout. The check passes when the connection is established.

Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| service | string | Yes | | Name of the Kubernetes Service to resolve |
| port | int | Yes | | Port to dial |
| timeout | Duration | No | 10s | Connection timeout |
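
For readers who want to reproduce the probe by hand, the following Python sketch mirrors the dial-with-timeout behavior. It is an approximation, not the operator's Go code. The function name `tcp_socket_passes` is hypothetical, and `host` stands in for the ClusterIP that the check resolves from the Service name.

```python
import socket


def tcp_socket_passes(host: str, port: int, timeout_seconds: float = 10.0) -> bool:
    """Return True when a TCP connection to host:port is established in time."""
    try:
        # create_connection performs the TCP handshake and applies the timeout to it.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

The function closes the connection as soon as the handshake succeeds and never reads from the socket, which is why a successful result says nothing about the state of the protocol behind the port.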

When to use it

Use tcpSocket for protocols that are not HTTP: databases (PostgreSQL 5432, MySQL 3306, Redis 6379, MongoDB 27017), message brokers, gRPC services. It confirms the process is listening without requiring protocol knowledge.

Example

- type: tcpSocket
  tcpSocket:
    service: postgres-svc
    port: 5432
    timeout: 10s

Common pitfalls

  • A successful TCP dial only confirms the port is open. The database process may still be in recovery mode (for example, PostgreSQL replaying WAL). Follow tcpSocket with an exec check that runs a real query to confirm readiness.
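  • If the Service exists but no pod backs it yet, the outcome of the dial depends on the cluster's Service proxy: the connection may be refused, or it may be accepted and then closed immediately. Some databases send a banner on connect that would distinguish a live server, but tcpSocket does not inspect the response, so place a podStatus check ahead of it.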

exec

Finds a pod matching podSelector, opens an exec session via the SPDY protocol, and runs command inside the specified container. The check passes when the command exits with successExitCode.

Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| podSelector | map[string]string | Yes | | Label selector to find the target pod |
| container | string | No | First container | Container name to exec into |
| command | []string | Yes | | Command and arguments as a list |
| successExitCode | int | Yes | 0 | Exit code that signals success |
| timeout | Duration | No | | Maximum time for the command to run |

The operator requires a valid restConfig to open SPDY connections. This is available automatically when the operator runs inside the cluster.
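
The exit-code comparison at the heart of this check can be tried locally. The sketch below is a stand-in under stated assumptions: it runs the command as a local subprocess instead of opening an SPDY exec session into a container, and the function name `exec_check_passes` is hypothetical.

```python
import subprocess
import sys
from typing import List


def exec_check_passes(
    command: List[str],
    success_exit_code: int = 0,
    timeout_seconds: float = 30.0,
) -> bool:
    """Run command locally and compare its exit code with success_exit_code."""
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,  # output is ignored; only the exit code counts
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing binary or an overrun timeout is treated as a failed check.
        return False
    return completed.returncode == success_exit_code


if __name__ == "__main__":
    # A child Python process that exits with status 2 stands in for a probe command.
    demo = [sys.executable, "-c", "raise SystemExit(2)"]
    print(exec_check_passes(demo))
    print(exec_check_passes(demo, success_exit_code=2))
```

The command is passed as a list, matching the command field, so no shell is involved and arguments are not re-split on whitespace.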

When to use it

Use exec when you need to validate application-level state that is only visible from inside the container: running a database query, checking a file, invoking a CLI health command. It is the most powerful check type and the most expensive — SPDY session setup has latency overhead.

Example: PostgreSQL pg_isready

- type: exec
  exec:
    podSelector:
      app: postgres
    container: postgres
    command:
      - pg_isready
      - -U
      - postgres
      - -d
      - myapp
    successExitCode: 0
    timeout: 30s

pg_isready exits 0 when the server is accepting connections, 1 when refusing, and 2 when no response. This is more reliable than a TCP dial because it confirms the server has completed startup and is ready for queries.

Example: Redis PING

- type: exec
  exec:
    podSelector:
      app: redis
      role: master
    container: redis
    command:
      - redis-cli
      - PING
    successExitCode: 0
    timeout: 10s

Common pitfalls

  • podSelector must match exactly one pod. If it matches no pods or more than one pod, the check fails. For Deployments with multiple replicas, add enough labels to target a single pod, or test a specific replica.
  • The container field is optional, but should be set explicitly in pods with sidecars (service meshes, logging agents). An exec into the wrong container produces confusing error output.
  • Commands that produce output but exit non-zero still fail the check, because only the exit code is evaluated. If you are testing a MySQL query with mysql -e "SELECT 1", confirm the command exits 0 on success. It does by default; options such as --skip-column-names change only the output, but wrapper scripts and shell error handling can change the exit code.
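
The exactly-one-pod rule for podSelector can be illustrated with a short Python sketch. It is a simplified model, not the operator's lookup code: pods are represented as a mapping from pod name to labels, and the function name `select_exec_target` is hypothetical.

```python
from typing import Dict


def select_exec_target(
    pod_labels: Dict[str, Dict[str, str]],
    selector: Dict[str, str],
) -> str:
    """Return the name of the single pod whose labels contain every selector pair.

    Raises ValueError when no pod or more than one pod matches.
    """
    matched = sorted(
        name
        for name, labels in pod_labels.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )
    if len(matched) != 1:
        raise ValueError(f"selector {selector} matched {len(matched)} pods: {matched}")
    return matched[0]
```

For StatefulSet workloads, the statefulset.kubernetes.io/pod-name label that Kubernetes sets on each replica is one way to narrow a selector to a single pod.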

resourceExists

Checks whether a list of Kubernetes resources exist in the target namespace. Supported kinds are: Secret, ConfigMap, Service, and PVC. The check fails immediately if any resource in the list is missing.

Fields

| Field | Type | Required | Description |
|---|---|---|---|
| resources | []ResourceRef | Yes | List of resources to verify |

ResourceRef fields:

| Field | Type | Description |
|---|---|---|
| kind | string | One of: Secret, ConfigMap, Service, PVC |
| name | string | Name of the resource |
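
The fail-on-first-missing rule can be sketched as follows. This is an illustrative model only: the set of existing resources is passed in as (kind, name) tuples instead of being fetched from the Kubernetes API, and the names `first_missing` and `SUPPORTED_KINDS` are hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple

ResourceRef = Tuple[str, str]  # (kind, name)

# Kinds listed as supported in the section above.
SUPPORTED_KINDS = {"Secret", "ConfigMap", "Service", "PVC"}


def first_missing(
    required: Iterable[ResourceRef],
    existing: Set[ResourceRef],
) -> Optional[ResourceRef]:
    """Return the first required resource that is absent, or None if all exist."""
    for kind, name in required:
        if kind not in SUPPORTED_KINDS:
            raise ValueError(f"unsupported kind: {kind}")
        if (kind, name) not in existing:
            # Stop at the first gap, mirroring the fail-immediately behavior.
            return (kind, name)
    return None
```

The check passes only when the function returns None. Any other result names the resource that caused the failure.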

When to use it

Use resourceExists to confirm that the backup included supporting resources, not just the main workload. Applications commonly break after restore because a Secret containing database credentials or a TLS certificate was excluded from the backup scope.

Example

- type: resourceExists
  resourceExists:
    resources:
      - kind: Secret
        name: db-credentials
      - kind: Secret
        name: tls-cert
      - kind: ConfigMap
        name: app-config
      - kind: PVC
        name: uploads-pvc

Common pitfalls

  • Secrets generated automatically by controllers or operators (for example, kubernetes.io/service-account-token Secrets) may not be present in a cross-cluster restore. Only check for Secrets that your application explicitly depends on.
  • PVC existence does not guarantee the volume has data. Follow resourceExists with an exec check that reads from the mounted path if data integrity matters.

Complete HealthCheckPolicy example

The following HealthCheckPolicy combines multiple check types in a sequenced validation for a three-tier web application that uses PostgreSQL.

apiVersion: restore.kymaros.io/v1alpha1
kind: HealthCheckPolicy
metadata:
  name: webapp-full-check
  namespace: kymaros-system
spec:
  checks:
    # 1. Confirm supporting resources exist before probing live services.
    - name: required-resources
      type: resourceExists
      resourceExists:
        resources:
          - kind: Secret
            name: db-credentials
          - kind: ConfigMap
            name: app-config
          - kind: PVC
            name: uploads-pvc

    # 2. Wait for the database pod to be ready before running queries.
    - name: postgres-pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: postgres
        minReady: 1
        timeout: 5m

    # 3. Confirm PostgreSQL is accepting connections at the application level.
    - name: postgres-accepting-queries
      type: exec
      exec:
        podSelector:
          app: postgres
        container: postgres
        command:
          - pg_isready
          - -U
          - postgres
          - -d
          - webapp
        successExitCode: 0
        timeout: 30s

    # 4. Confirm the application pods are running.
    - name: api-pods-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: api-server
        minReady: 2
        timeout: 3m

    # 5. Validate the HTTP health endpoint.
    - name: api-health-endpoint
      type: httpGet
      httpGet:
        service: api-server-svc
        port: 8080
        path: /healthz
        expectedStatus: 200
        timeout: 15s
        retries: 3

    # 6. Confirm the frontend service TCP port is open.
    - name: frontend-port
      type: tcpSocket
      tcpSocket:
        service: frontend-svc
        port: 3000
        timeout: 10s

This policy can be referenced by a RestoreTest using .spec.healthCheckPolicyRef:

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: webapp-nightly
  namespace: kymaros-system
spec:
  schedule: "0 2 * * *"
  backupSource:
    name: webapp-backup
    namespace: webapp-prod
  healthCheckPolicyRef:
    name: webapp-full-check

Checks run in the order they are declared. A check failure stops the sequence and records the first failing check in the RestoreReport.
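
The ordering and stop-on-failure rule can be sketched as a simple loop. This is a minimal model under assumptions, not the operator's scheduler: each check is a (name, probe) pair where the probe is a zero-argument callable returning True on success, and the function name `run_sequence` is hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]  # (name, probe)


def run_sequence(checks: List[Check]) -> Tuple[List[str], Optional[str]]:
    """Run checks in declaration order and stop at the first failure.

    Returns the names of the checks that passed and the name of the first
    failing check, or None when every check passed.
    """
    passed: List[str] = []
    for name, probe in checks:
        if not probe():
            # Later checks are never started once one check fails.
            return passed, name
        passed.append(name)
    return passed, None
```

Because later probes never run after a failure, cheap gating checks such as resourceExists and podStatus are best declared before the slower exec and httpGet checks, as in the policy above.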