Health Checks

Kymaros validates a restored workload through a configurable sequence of health checks. Each check is a discrete probe with its own fields, timeout, and failure semantics. The results of all checks contribute to the RestoreTest score (0–100).

Health checks are defined either inline in a RestoreTest resource or in a standalone HealthCheckPolicy that a RestoreTest references.

Overview of check types

Type	Transport	Primary use
`podStatus`	Kubernetes API	Wait for pods to reach Ready state
`httpGet`	HTTP over ClusterIP	Validate an HTTP endpoint after restore
`tcpSocket`	TCP dial	Confirm a port is accepting connections
`exec`	Kubernetes exec API (SPDY)	Run a command inside a container
`resourceExists`	Kubernetes API	Confirm Secrets, ConfigMaps, Services, and PVCs exist

podStatus

Queries the Kubernetes API for pods matching a label selector and counts how many have PodReady condition = True. The check passes when the ready count is at or above minReady.

Fields

Field	Type	Required	Default	Description
`labelSelector`	`map[string]string`	Yes	—	Key/value pairs used to filter pods
`minReady`	`int`	Yes	—	Minimum number of Ready pods required
`timeout`	`Duration`	No	—	How long to wait before failing

When to use it

Use podStatus ahead of any network or exec check in a sequence. It ensures the restore actually brought the workload up before those checks attempt a connection. Without this gate, HTTP checks against a still-starting pod produce misleading failures.

Example

- type: podStatus
  podStatus:
    labelSelector:
      app: api-server
      tier: backend
    minReady: 2
    timeout: 3m

Common pitfalls

Setting minReady higher than the replicas in the backup snapshot causes a guaranteed failure. If your backup was taken during a scale-down event, minReady: 1 is safer.
A pod that is Running but not Ready (for example, because its readiness probe is failing) does not count toward minReady. This is intentional: it catches misconfigured applications after restore.
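
The gating described above can be sketched as a two-step sequence. This is an illustrative fragment rather than a recommended configuration: the check names, the app: web label, and the web-svc Service are hypothetical, and only fields documented on this page are used. minReady is kept at 1 so the gate still passes if the backup captured fewer replicas than usual.

```yaml
# Hypothetical sequence: gate on pod readiness, then probe HTTP.
- name: web-pods-ready
  type: podStatus
  podStatus:
    labelSelector:
      app: web            # assumed label on the restored pods
    minReady: 1           # conservative: tolerates a backup taken mid scale-down
    timeout: 3m
- name: web-health
  type: httpGet
  httpGet:
    service: web-svc      # assumed Service name
    port: 8080
    path: /healthz
    expectedStatus: 200
```

Because a failed check stops the sequence, the httpGet probe is only attempted once at least one pod reports Ready.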

httpGet

Resolves the ClusterIP of the named Service, then performs an HTTP GET request. The check passes when the response status matches expectedStatus.

Fields

Field	Type	Required	Default	Description
`service`	`string`	Yes	—	Name of the Kubernetes Service to resolve
`port`	`int`	Yes	—	Port to connect to
`path`	`string`	Yes	—	URL path (include leading `/`)
`expectedStatus`	`int`	Yes	—	HTTP status code that signals success
`timeout`	`Duration`	No	`10s`	Per-request timeout
`retries`	`int`	No	`1`	Number of retry attempts

The retry delay between attempts is fixed at 1 second.

When to use it

Use httpGet for any workload that exposes a readiness or health endpoint over HTTP — REST APIs, web applications, Prometheus exporters. Prefer it over tcpSocket when the application-level response matters, not just port reachability.

Example

- type: httpGet
  httpGet:
    service: api-server-svc
    port: 8080
    path: /healthz
    expectedStatus: 200
    timeout: 15s
    retries: 3

Common pitfalls

The check resolves the Service's ClusterIP rather than a DNS hostname. If the Service does not exist in the restored namespace, the check fails while resolving the Service, before any HTTP request is sent.
expectedStatus: 200 is the common case, but some APIs return 204 for health endpoints. Verify the actual status your application returns.
TLS is not supported. For HTTPS services, either use tcpSocket for port reachability or add a sidecar endpoint that serves HTTP.
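
As a sketch of the tcpSocket fallback mentioned above for an HTTPS-only workload, the fragment below dials the TLS port without attempting a handshake. The Service name and port are assumptions for illustration.

```yaml
# Port-reachability fallback for a TLS-only Service.
- type: tcpSocket
  tcpSocket:
    service: web-tls-svc   # hypothetical Service name
    port: 443              # assumed HTTPS port
    timeout: 10s
```

This confirms only that the port accepts connections. It does not validate the certificate or the application's response.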

tcpSocket

Resolves the ClusterIP of the named Service and opens a TCP connection using net.DialTimeout. The check passes when the connection is established.

Fields

Field	Type	Required	Default	Description
`service`	`string`	Yes	—	Name of the Kubernetes Service to resolve
`port`	`int`	Yes	—	Port to dial
`timeout`	`Duration`	No	`10s`	Connection timeout

When to use it

Use tcpSocket for protocols that are not HTTP: databases (PostgreSQL 5432, MySQL 3306, Redis 6379, MongoDB 27017), message brokers, gRPC services. It confirms the process is listening without requiring protocol knowledge.

Example

- type: tcpSocket
  tcpSocket:
    service: postgres-svc
    port: 5432
    timeout: 10s

Common pitfalls

A successful TCP dial only confirms the port is open. The database process may still be in recovery mode (for example, PostgreSQL replaying WAL). Follow tcpSocket with an exec check that runs a real query to confirm readiness.
If the Service exists but no pod backs it yet, the result depends on the cluster's networking: the dial is usually refused, but behind some proxies or service meshes it can succeed and then close immediately. Some databases send a banner on connect that would distinguish this, but tcpSocket does not inspect the response.
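
The tcpSocket-then-exec pairing recommended above might look like the following sketch. It assumes that psql is available in the postgres container, that the myapp database exists, and that local connections as the postgres user do not prompt for a password. The check names are illustrative.

```yaml
# Step 1: confirm the port is open.
- name: postgres-port-open
  type: tcpSocket
  tcpSocket:
    service: postgres-svc
    port: 5432
    timeout: 10s
# Step 2: confirm the server actually answers a query.
- name: postgres-select-1
  type: exec
  exec:
    podSelector:
      app: postgres
    container: postgres
    command:
      - psql
      - -U
      - postgres
      - -d
      - myapp          # assumed database name
      - -c
      - SELECT 1
    successExitCode: 0
    timeout: 30s
```

psql exits non-zero when it cannot connect or the statement fails, so a server that is still replaying WAL fails the second check even though the first one passes.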

exec

Finds a pod matching podSelector, opens an exec session via the SPDY protocol, and runs command inside the specified container. The check passes when the command exits with successExitCode.

Fields

Field	Type	Required	Default	Description
`podSelector`	`map[string]string`	Yes	—	Label selector to find the target pod
`container`	`string`	No	First container	Container name to exec into
`command`	`[]string`	Yes	—	Command and arguments as a list
`successExitCode`	`int`	Yes	`0`	Exit code that signals success
`timeout`	`Duration`	No	—	Maximum time for the command to run

The operator requires a valid restConfig to open SPDY connections. This is available automatically when the operator runs inside the cluster.

When to use it

Use exec when you need to validate application-level state that is only visible from inside the container: running a database query, checking a file, invoking a CLI health command. It is the most powerful check type and the most expensive — SPDY session setup has latency overhead.

Example: PostgreSQL pg_isready

- type: exec
  exec:
    podSelector:
      app: postgres
    container: postgres
    command:
      - pg_isready
      - -U
      - postgres
      - -d
      - myapp
    successExitCode: 0
    timeout: 30s

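pg_isready exits 0 when the server is accepting connections, 1 when it is rejecting them (for example, during startup), 2 when there is no response, and 3 when no attempt was made (for example, because of invalid parameters). This is more reliable than a TCP dial because it confirms the server has completed startup and is ready for queries.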

Example: Redis PING

- type: exec
  exec:
    podSelector:
      app: redis
      role: master
    container: redis
    command:
      - redis-cli
      - PING
    successExitCode: 0
    timeout: 10s

Common pitfalls

podSelector must match exactly one pod. If it matches zero or more than one, the check fails. For deployments with multiple replicas, add enough labels to target a single pod, or test a specific replica.
The container field is optional, but should be set explicitly in pods with sidecars (service meshes, logging agents). An exec into the wrong container produces confusing error output.
Commands that produce output but exit non-zero still fail the check. If you are testing a MySQL query with mysql -e "SELECT 1", ensure the command exits 0 on success. The mysql client does so by default, but wrapping it in a shell pipeline or script changes which exit status is reported: a pipeline returns the status of its last command.
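
One way to satisfy the exactly-one-pod requirement, assuming the workload is a StatefulSet, is to select on the pod-name label that the StatefulSet controller adds to each pod it creates. The pod name postgres-0 below is an assumption (a StatefulSet named postgres, first replica).

```yaml
- type: exec
  exec:
    podSelector:
      # Set by the StatefulSet controller; unique per pod.
      statefulset.kubernetes.io/pod-name: postgres-0   # assumed pod name
    container: postgres    # set explicitly in case sidecars are injected
    command:
      - pg_isready
    successExitCode: 0
    timeout: 30s
```

For Deployments, whose pod names are generated, this label is not available, so a distinguishing label would have to be added to the pod template instead.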

resourceExists

Checks whether a list of Kubernetes resources exist in the target namespace. Supported kinds are: Secret, ConfigMap, Service, and PVC. The check fails immediately if any resource in the list is missing.

Fields

Field	Type	Required	Description
`resources`	`[]ResourceRef`	Yes	List of resources to verify

ResourceRef fields:

Field	Type	Description
`kind`	`string`	One of: `Secret`, `ConfigMap`, `Service`, `PVC`
`name`	`string`	Name of the resource

When to use it

Use resourceExists to confirm that the backup included supporting resources, not just the main workload. Applications commonly break after restore because a Secret containing database credentials or a TLS certificate was excluded from the backup scope.

Example

- type: resourceExists
  resourceExists:
    resources:
      - kind: Secret
        name: db-credentials
      - kind: Secret
        name: tls-cert
      - kind: ConfigMap
        name: app-config
      - kind: PVC
        name: uploads-pvc

Common pitfalls

Kubernetes Secrets created by controllers or operators (for example, kubernetes.io/service-account-token Secrets) may not be present in a cross-cluster restore. Only check for Secrets that your application explicitly depends on.
PVC existence does not guarantee the volume has data. Follow resourceExists with an exec check that reads from the mounted path if data integrity matters.
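
The resourceExists-then-exec pairing suggested above could be sketched as follows. The mount path /data/uploads, the container name, and the assumption that the selector matches exactly one pod are all illustrative. The container is also assumed to ship sh and ls.

```yaml
# Step 1: the claim exists.
- name: uploads-pvc-present
  type: resourceExists
  resourceExists:
    resources:
      - kind: PVC
        name: uploads-pvc
# Step 2: the mounted volume is not empty.
- name: uploads-not-empty
  type: exec
  exec:
    podSelector:
      app: api-server          # assumed to match exactly one pod
    container: api-server      # assumed container name
    command:
      - sh
      - -c
      # Exits 0 only if the directory has at least one entry.
      # /data/uploads is an assumed mount path.
      - test -n "$(ls -A /data/uploads)"
    successExitCode: 0
    timeout: 30s
```

A non-empty directory is only a coarse signal. Where data integrity matters more, the command could compare a checksum or count rows instead.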

Complete HealthCheckPolicy example

The following HealthCheckPolicy combines multiple check types in a sequenced validation for a three-tier web application that uses PostgreSQL.

apiVersion: restore.kymaros.io/v1alpha1
kind: HealthCheckPolicy
metadata:
  name: webapp-full-check
  namespace: kymaros-system
spec:
  checks:
    # 1. Confirm supporting resources exist before probing live services.
    - name: required-resources
      type: resourceExists
      resourceExists:
        resources:
          - kind: Secret
            name: db-credentials
          - kind: ConfigMap
            name: app-config
          - kind: PVC
            name: uploads-pvc

    # 2. Wait for the database pod to be ready before running queries.
    - name: postgres-pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: postgres
        minReady: 1
        timeout: 5m

    # 3. Confirm PostgreSQL is accepting connections at the application level.
    - name: postgres-accepting-queries
      type: exec
      exec:
        podSelector:
          app: postgres
        container: postgres
        command:
          - pg_isready
          - -U
          - postgres
          - -d
          - webapp
        successExitCode: 0
        timeout: 30s

    # 4. Confirm the application pods are Ready.
    - name: api-pods-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: api-server
        minReady: 2
        timeout: 3m

    # 5. Validate the HTTP health endpoint.
    - name: api-health-endpoint
      type: httpGet
      httpGet:
        service: api-server-svc
        port: 8080
        path: /healthz
        expectedStatus: 200
        timeout: 15s
        retries: 3

    # 6. Confirm the frontend service TCP port is open.
    - name: frontend-port
      type: tcpSocket
      tcpSocket:
        service: frontend-svc
        port: 3000
        timeout: 10s

This policy can be referenced by a RestoreTest using .spec.healthCheckPolicyRef:

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: webapp-nightly
  namespace: kymaros-system
spec:
  schedule: "0 2 * * *"
  backupSource:
    name: webapp-backup
    namespace: webapp-prod
  healthCheckPolicyRef:
    name: webapp-full-check

Checks run in the order they are declared. A check failure stops the sequence and records the first failing check in the RestoreReport.
