Stateful Applications

Stateful workloads have failure modes that ephemeral services do not. A restore may succeed at the Kubernetes level — pods running, Services reachable — while the database is internally inconsistent: corrupted pages, incomplete WAL replay, or a replica pointing at a primary that no longer exists. This guide covers how to structure RestoreTest resources that catch those failures.

General principles

Before looking at individual databases:

Always start with resourceExists to confirm PVCs and credential Secrets were included in the backup.
Follow with podStatus to gate all subsequent checks behind pod readiness.
Use exec with the database's own health command rather than a generic tcpSocket — the former detects recovery mode, the latter does not.
Set timeouts generously. A 50 GB PostgreSQL database may need 10–15 minutes of WAL replay before accepting queries.

PostgreSQL

Which checks to use

Stage	Check type	Purpose
1	`resourceExists`	Confirm PVC and password Secret exist
2	`podStatus`	Wait for the StatefulSet pod to be Ready
3	`exec` (`pg_isready`)	Confirm PostgreSQL is accepting connections
4	`exec` (`psql`)	Run a test query to validate data integrity

Example RestoreTest

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: postgres-nightly
  namespace: kymaros-system
spec:
  schedule: "0 1 * * *"
  backupSource:
    name: postgres-backup
    namespace: db-prod
  checks:
    - name: pvc-and-secret
      type: resourceExists
      resourceExists:
        resources:
          - kind: PVC
            name: postgres-data
          - kind: Secret
            name: postgres-credentials

    - name: pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: postgres
          statefulset.kubernetes.io/pod-name: postgres-0
        minReady: 1
        timeout: 15m

    - name: accepting-connections
      type: exec
      exec:
        podSelector:
          app: postgres
          statefulset.kubernetes.io/pod-name: postgres-0
        container: postgres
        command:
          - pg_isready
          - -U
          - postgres
          - -d
          - myapp_db
        successExitCode: 0
        timeout: 30s

    - name: data-query
      type: exec
      exec:
        podSelector:
          app: postgres
          statefulset.kubernetes.io/pod-name: postgres-0
        container: postgres
        command:
          - psql
          - -U
          - postgres
          - -d
          - myapp_db
          - -c
          - SELECT COUNT(*) FROM users WHERE created_at > NOW() - INTERVAL '30 days';
        successExitCode: 0
        timeout: 60s
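
The data-query check above passes whenever the query runs, even if it returns zero rows. A stricter variant is sketched below. It assumes the container image provides /bin/sh and that the users table is expected to hold recent rows. The check name is hypothetical; the fields mirror those used in the example above.

```yaml
- name: recent-users-present   # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: postgres
      statefulset.kubernetes.io/pod-name: postgres-0
    container: postgres
    command:
      - /bin/sh
      - -c
      # -A (unaligned) and -t (tuples only) make psql print just the number,
      # so the shell can compare it against zero.
      - test "$(psql -U postgres -d myapp_db -At -c "SELECT COUNT(*) FROM users WHERE created_at > NOW() - INTERVAL '30 days'")" -gt 0
    successExitCode: 0
    timeout: 60s
```

The shell wrapper is needed because the comparison relies on command substitution, which a directly executed command cannot do.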

PVC verification

PostgreSQL stores data in the volume mounted at /var/lib/postgresql/data. To confirm the volume has data (not just that it exists), add an exec check that reads the data directory:

- name: data-directory-present
  type: exec
  exec:
    podSelector:
      app: postgres
      statefulset.kubernetes.io/pod-name: postgres-0
    container: postgres
    command:
      - test
      - -f
      - /var/lib/postgresql/data/PG_VERSION
    successExitCode: 0
    timeout: 10s

Pitfalls

WAL replay: After a snapshot restore, PostgreSQL may enter crash or archive recovery and replay WAL segments before accepting connections. During this phase pg_isready exits with code 1 ("rejecting connections"). Set timeout on the podStatus check to at least 15 minutes for databases larger than 20 GB, and set timeout on the pg_isready exec check to 60–120 seconds.
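
To confirm that replay has finished rather than inferring it from readiness, a check can query pg_is_in_recovery(). This is a sketch, assuming the restored instance is meant to come up as a primary (a standby legitimately stays in recovery) and that /bin/sh exists in the container. The check name is hypothetical.

```yaml
- name: recovery-finished   # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: postgres
      statefulset.kubernetes.io/pod-name: postgres-0
    container: postgres
    command:
      - /bin/sh
      - -c
      # pg_is_in_recovery() prints "f" once the server has left recovery.
      - test "$(psql -U postgres -d myapp_db -At -c 'SELECT pg_is_in_recovery()')" = "f"
    successExitCode: 0
    timeout: 120s
```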

Snapshots during writes: A volume snapshot taken while PostgreSQL is actively writing leaves the data directory in a state that requires crash recovery on startup. PostgreSQL handles this automatically, but the recovery time depends on how much WAL has accumulated since the last checkpoint. If possible, run the SQL CHECKPOINT command immediately before the snapshot, or enable continuous archiving.

Replication slots: If the restored instance has replication slots that reference WAL positions in the past, it may hold WAL segments indefinitely. After a restore test, confirm that slots are dropped or that WAL accumulation is acceptable.
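
One way to fold that confirmation into the test is a check that fails when any slot survives the restore. This is a sketch, assuming no slots are expected in the restored environment; relax the comparison if some are legitimate. The check name is hypothetical.

```yaml
- name: no-leftover-replication-slots   # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: postgres
      statefulset.kubernetes.io/pod-name: postgres-0
    container: postgres
    command:
      - /bin/sh
      - -c
      # pg_replication_slots lists every slot known to this instance.
      - test "$(psql -U postgres -d myapp_db -At -c 'SELECT count(*) FROM pg_replication_slots')" -eq 0
    successExitCode: 0
    timeout: 30s
```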

MySQL

Which checks to use

Stage	Check type	Purpose
1	`resourceExists`	Confirm PVC and Secret exist
2	`podStatus`	Wait for pod readiness
3	`exec` (`mysqladmin ping`)	Confirm server is responsive
4	`exec` (`mysql -e`)	Run a query to validate data

Example RestoreTest

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: mysql-nightly
  namespace: kymaros-system
spec:
  schedule: "0 1 * * *"
  backupSource:
    name: mysql-backup
    namespace: db-prod
  checks:
    - name: pvc-and-secret
      type: resourceExists
      resourceExists:
        resources:
          - kind: PVC
            name: mysql-data
          - kind: Secret
            name: mysql-credentials

    - name: pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: mysql
        minReady: 1
        timeout: 10m

    - name: server-ping
      type: exec
      exec:
        podSelector:
          app: mysql
        container: mysql
        command:
          - /bin/sh
          - -c
          - mysqladmin ping -h 127.0.0.1 -u root --password=$MYSQL_ROOT_PASSWORD
        successExitCode: 0
        timeout: 30s

    - name: data-query
      type: exec
      exec:
        podSelector:
          app: mysql
        container: mysql
        command:
          - /bin/sh
          - -c
          - mysql -u root --password=$MYSQL_ROOT_PASSWORD -e "SELECT COUNT(*) FROM myapp_db.orders;"
        successExitCode: 0
        timeout: 60s
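
As written, the data-query check succeeds even when the orders table is empty, because mysql exits with 0 for any query that runs. A stricter variant is sketched below, assuming the table is expected to contain rows. The check name is hypothetical.

```yaml
- name: orders-not-empty   # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: mysql
    container: mysql
    command:
      - /bin/sh
      - -c
      # -N drops the column header and -B selects batch output,
      # leaving only the number on stdout.
      - test "$(mysql -u root --password="$MYSQL_ROOT_PASSWORD" -N -B -e 'SELECT COUNT(*) FROM myapp_db.orders')" -gt 0
    successExitCode: 0
    timeout: 60s
```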

PVC verification

MySQL stores data in /var/lib/mysql. To confirm InnoDB files are present:

- name: innodb-files-present
  type: exec
  exec:
    podSelector:
      app: mysql
    container: mysql
    command:
      - test
      - -f
      - /var/lib/mysql/ibdata1
    successExitCode: 0
    timeout: 10s

Pitfalls

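InnoDB crash recovery: Like PostgreSQL, MySQL may enter InnoDB crash recovery after a snapshot restore. The pod may show as Running before recovery completes. mysqladmin ping reports "mysqld is alive" only after recovery finishes, making it a reliable gate. Note that mysqladmin ping exits with 0 even when the server rejects the credentials, so it proves liveness only; the query check that follows is what validates authentication and data access.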

Binary log positions: If you restore a replica, its binary log position will be ahead of the backup point. Do not attempt to reconnect a restored replica to its primary without resetting the replication coordinates — this will cause the replica to skip transactions.

Environment variable passwords: The exec check runs commands directly without a shell, so $VARIABLE expansion does not work when passing arguments as separate list entries. The examples above use /bin/sh -c as the command entry point to enable shell expansion of $MYSQL_ROOT_PASSWORD. This is required any time your command uses environment variables, pipes, or redirection.
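
The difference can be illustrated with two command fragments. This is a sketch of the behavior described above, not output from a real run.

```yaml
# Does not work: no shell is involved, so mysqladmin receives the
# literal string "--password=$MYSQL_ROOT_PASSWORD".
command:
  - mysqladmin
  - ping
  - --password=$MYSQL_ROOT_PASSWORD
---
# Works: /bin/sh expands the variable before mysqladmin runs.
command:
  - /bin/sh
  - -c
  - mysqladmin ping --password="$MYSQL_ROOT_PASSWORD"
```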

Redis

Which checks to use

Stage	Check type	Purpose
1	`podStatus`	Wait for the Redis pod
2	`tcpSocket`	Confirm port 6379 is open
3	`exec` (`redis-cli PING`)	Confirm server responds to commands
4	`exec` (`redis-cli DBSIZE`)	Verify keyspace is not empty

Redis does not use PVCs in all deployments. If persistence is enabled via an RDB or AOF file, add a resourceExists check for the PVC and an exec check that tests the file path.

Example RestoreTest

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: redis-nightly
  namespace: kymaros-system
spec:
  schedule: "0 1 * * *"
  backupSource:
    name: redis-backup
    namespace: cache-prod
  checks:
    - name: pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: redis
          role: master
        minReady: 1
        timeout: 5m

    - name: port-open
      type: tcpSocket
      tcpSocket:
        service: redis-svc
        port: 6379
        timeout: 10s

    - name: ping
      type: exec
      exec:
        podSelector:
          app: redis
          role: master
        container: redis
        command:
          - redis-cli
          - PING
        successExitCode: 0
        timeout: 10s

    - name: keyspace-not-empty
      type: exec
      exec:
        podSelector:
          app: redis
          role: master
        container: redis
        command:
          - /bin/sh
          - -c
          - test $(redis-cli DBSIZE) -gt 0
        successExitCode: 0
        timeout: 10s

PVC verification

For Redis with AOF or RDB persistence:

- name: persistence-file
  type: resourceExists
  resourceExists:
    resources:
      - kind: PVC
        name: redis-data

Then follow with an exec check:

- name: rdb-file-exists
  type: exec
  exec:
    podSelector:
      app: redis
      role: master
    container: redis
    command:
      - test
      - -f
      - /data/dump.rdb
    successExitCode: 0
    timeout: 10s

Pitfalls

RDB load time: Redis loads the RDB file into memory at startup, and large files (several GB) make this take a while. During this window the server accepts connections but answers most commands, including PING, with a LOADING error, so command-level checks are not meaningful yet. Ensure podStatus has a sufficient timeout before the exec checks run.
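
Load completion can also be gated explicitly by reading the loading field that Redis reports in INFO persistence. This is a sketch, assuming /bin/sh and grep are present in the container image. The check name is hypothetical.

```yaml
- name: rdb-load-finished   # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: redis
      role: master
    container: redis
    command:
      - /bin/sh
      - -c
      # INFO persistence reports loading:1 while the dataset is being
      # read from disk and loading:0 once the load has completed.
      - redis-cli INFO persistence | grep -q '^loading:0'
    successExitCode: 0
    timeout: 10s
```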

Replica promotion: If the backup captured a replica and not the primary, the keyspace may be complete but the restored instance may have stale keys from before the last replication sync. Verify the restored instance was the primary in the backup, or test for specific critical keys rather than DBSIZE > 0.
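
A check for a specific key could look like the sketch below. The key name app:config is a placeholder; substitute a key your application always writes.

```yaml
- name: critical-key-present   # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: redis
      role: master
    container: redis
    command:
      - /bin/sh
      - -c
      # EXISTS prints 1 when the key is present and 0 when it is not.
      # "app:config" is a placeholder key name.
      - test "$(redis-cli EXISTS app:config)" -eq 1
    successExitCode: 0
    timeout: 10s
```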

MongoDB

Which checks to use

Stage	Check type	Purpose
1	`resourceExists`	Confirm PVC and Secret exist
2	`podStatus`	Wait for pod readiness
3	`exec` (`mongosh --eval`)	Confirm server responds
4	`exec` (`mongosh` count)	Validate a collection is not empty

Example RestoreTest

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: mongodb-nightly
  namespace: kymaros-system
spec:
  schedule: "0 1 * * *"
  backupSource:
    name: mongodb-backup
    namespace: db-prod
  checks:
    - name: pvc-and-secret
      type: resourceExists
      resourceExists:
        resources:
          - kind: PVC
            name: mongodb-data
          - kind: Secret
            name: mongodb-credentials

    - name: pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: mongodb
        minReady: 1
        timeout: 10m

    - name: server-status
      type: exec
      exec:
        podSelector:
          app: mongodb
        container: mongodb
        command:
          - mongosh
          - --quiet
          - --eval
          - "db.adminCommand({ ping: 1 }).ok"
        successExitCode: 0
        timeout: 30s

    - name: collection-count
      type: exec
      exec:
        podSelector:
          app: mongodb
        container: mongodb
        command:
          - /bin/sh
          - -c
          - test $(mongosh --quiet myapp_db --eval "db.users.countDocuments()") -gt 0
        successExitCode: 0
        timeout: 60s

PVC verification

MongoDB stores data in /data/db. To confirm the data directory is not empty after restore:

- name: data-directory-not-empty
  type: exec
  exec:
    podSelector:
      app: mongodb
    container: mongodb
    command:
      - /bin/sh
      - -c
      - test $(ls /data/db | wc -l) -gt 0
    successExitCode: 0
    timeout: 10s

Pitfalls

Replica set reconfiguration: A restored MongoDB replica set member may refuse to start if its replica set configuration references hostnames or node IDs that do not exist in the restored environment. This causes the pod to restart repeatedly. Restore tests should target a standalone instance or a replica set where all members are restored together.

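Snapshot during oplog writes: Volume snapshots taken during active write workloads can produce a data directory that MongoDB must repair at startup. MongoDB replays the WiredTiger journal to recover, which works only if the journal sits on the same volume as the data files and was captured in the same snapshot. Keep journaling enabled (MongoDB 6.1 and later always journal, and the --journal flag no longer exists in those versions). WiredTiger checkpoints every 60 seconds by default, which bounds the amount of journal to replay.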

mongosh vs mongo: MongoDB 6.0 removed the legacy mongo shell, so images for 6.0 and later ship only mongosh. Images older than 5.0 typically ship only the mongo binary, and 5.x images may include both. Adjust the command field accordingly.
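
If the same RestoreTest has to work across image versions, the command can pick whichever shell is installed. This is a sketch, assuming /bin/sh is available in the container. The check name and the SHELL_BIN variable are illustrative.

```yaml
- name: server-status-any-shell   # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: mongodb
    container: mongodb
    command:
      - /bin/sh
      - -c
      # Prefer mongosh and fall back to the legacy mongo shell.
      - |
        if command -v mongosh >/dev/null 2>&1; then SHELL_BIN=mongosh; else SHELL_BIN=mongo; fi
        "$SHELL_BIN" --quiet --eval 'db.adminCommand({ ping: 1 }).ok'
    successExitCode: 0
    timeout: 30s
```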
