Stateful Applications

Stateful workloads have failure modes that ephemeral services do not. A restore may succeed at the Kubernetes level — pods running, Services reachable — while the database is internally inconsistent: corrupted pages, incomplete WAL replay, or a replica pointing at a primary that no longer exists. This guide covers how to structure RestoreTest resources that catch those failures.

General principles

Before looking at individual databases:

  • Always start with resourceExists to confirm PVCs and credential Secrets were included in the backup.
  • Follow with podStatus to gate all subsequent checks behind pod readiness.
  • Use exec with the database's own health command rather than a generic tcpSocket — the former detects recovery mode, the latter does not.
  • Set timeouts conservatively. A 50 GB PostgreSQL database may need 10–15 minutes of WAL replay before accepting queries.

PostgreSQL

Which checks to use

Stage | Check type        | Purpose
1     | resourceExists    | Confirm PVC and password Secret exist
2     | podStatus         | Wait for the StatefulSet pod to be Ready
3     | exec (pg_isready) | Confirm PostgreSQL is accepting connections
4     | exec (psql)       | Run a test query to validate data integrity

Example RestoreTest

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: postgres-nightly
  namespace: kymaros-system
spec:
  schedule: "0 1 * * *"
  backupSource:
    name: postgres-backup
    namespace: db-prod
  checks:
    - name: pvc-and-secret
      type: resourceExists
      resourceExists:
        resources:
          - kind: PVC
            name: postgres-data
          - kind: Secret
            name: postgres-credentials

    - name: pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: postgres
          statefulset.kubernetes.io/pod-name: postgres-0
        minReady: 1
        timeout: 15m

    - name: accepting-connections
      type: exec
      exec:
        podSelector:
          app: postgres
          statefulset.kubernetes.io/pod-name: postgres-0
        container: postgres
        command:
          - pg_isready
          - -U
          - postgres
          - -d
          - myapp_db
        successExitCode: 0
        timeout: 30s

    - name: data-query
      type: exec
      exec:
        podSelector:
          app: postgres
          statefulset.kubernetes.io/pod-name: postgres-0
        container: postgres
        command:
          - psql
          - -U
          - postgres
          - -d
          - myapp_db
          - -c
          - SELECT COUNT(*) FROM users WHERE created_at > NOW() - INTERVAL '30 days';
        successExitCode: 0
        timeout: 60s
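
The data-query check exits 0 whenever the query runs, even if the count it returns is zero. A minimal sketch of a stricter variant follows, assuming /bin/sh is available in the container; the check name is hypothetical and the users table is the same placeholder used in the example. It fails when the table is empty:

```yaml
- name: data-query-nonzero   # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: postgres
      statefulset.kubernetes.io/pod-name: postgres-0
    container: postgres
    command:
      - /bin/sh
      - -c
      # -t prints tuples only and -A disables alignment, so psql emits just the number
      - test "$(psql -U postgres -d myapp_db -tAc 'SELECT COUNT(*) FROM users')" -gt 0
    successExitCode: 0
    timeout: 60s
```

If psql itself fails, the substitution is empty and test exits non-zero, so connection errors also fail the check.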

PVC verification

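PostgreSQL stores data in the volume mounted at /var/lib/postgresql/data. To confirm the volume holds an initialized data directory (not just that the PVC exists), add an exec check that tests for the PG_VERSION file. Because initializing an empty volume also creates this file, treat the check as a complement to the data query, not a replacement: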

- name: data-directory-present
  type: exec
  exec:
    podSelector:
      app: postgres
      statefulset.kubernetes.io/pod-name: postgres-0
    container: postgres
    command:
      - test
      - -f
      - /var/lib/postgresql/data/PG_VERSION
    successExitCode: 0
    timeout: 10s

Pitfalls

WAL replay: After a snapshot restore, PostgreSQL may enter archive recovery mode and replay WAL segments before accepting connections. During this phase pg_isready exits with code 1 ("rejecting connections"), or with code 2 if the server does not respond at all. Set timeout on the podStatus check to at least 15 minutes for databases larger than 20 GB, and set timeout on the exec check to 60–120 seconds.
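
pg_isready only reports whether the server accepts connections. One way to also confirm that recovery has finished is to query pg_is_in_recovery(), which returns f once the server runs as a normal primary. The sketch below assumes the restored instance is meant to come up as a primary (a standby stays in recovery by design) and that /bin/sh exists in the container; the check name is hypothetical.

```yaml
- name: recovery-finished   # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: postgres
      statefulset.kubernetes.io/pod-name: postgres-0
    container: postgres
    command:
      - /bin/sh
      - -c
      # psql -tA prints a boolean as a bare "t" or "f"
      - test "$(psql -U postgres -d myapp_db -tAc 'SELECT pg_is_in_recovery()')" = "f"
    successExitCode: 0
    timeout: 120s
```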

Snapshots during writes: A volume snapshot taken while PostgreSQL is actively writing can leave the data directory in a state that requires crash recovery. PostgreSQL handles this automatically, but the recovery window depends on checkpoint frequency. If possible, issue a CHECKPOINT immediately before the snapshot to shorten recovery, or enable continuous archiving.

Replication slots: If the restored instance has replication slots that reference WAL positions in the past, it may hold WAL segments indefinitely. After a restore test, confirm that slots are dropped or that WAL accumulation is acceptable.
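
A slot check can be expressed with the same exec pattern. This is a sketch under the assumption that the restore-test instance should carry no replication slots at all; if some slots are expected, adjust the comparison. The check name is hypothetical.

```yaml
- name: no-replication-slots   # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: postgres
      statefulset.kubernetes.io/pod-name: postgres-0
    container: postgres
    command:
      - /bin/sh
      - -c
      # pg_replication_slots lists every slot defined on the instance
      - test "$(psql -U postgres -d myapp_db -tAc 'SELECT count(*) FROM pg_replication_slots')" -eq 0
    successExitCode: 0
    timeout: 30s
```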


MySQL

Which checks to use

Stage | Check type             | Purpose
1     | resourceExists         | Confirm PVC and Secret exist
2     | podStatus              | Wait for pod readiness
3     | exec (mysqladmin ping) | Confirm server is responsive
4     | exec (mysql -e)        | Run a query to validate data

Example RestoreTest

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: mysql-nightly
  namespace: kymaros-system
spec:
  schedule: "0 1 * * *"
  backupSource:
    name: mysql-backup
    namespace: db-prod
  checks:
    - name: pvc-and-secret
      type: resourceExists
      resourceExists:
        resources:
          - kind: PVC
            name: mysql-data
          - kind: Secret
            name: mysql-credentials

    - name: pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: mysql
        minReady: 1
        timeout: 10m

    - name: server-ping
      type: exec
      exec:
        podSelector:
          app: mysql
        container: mysql
        command:
          - /bin/sh
          - -c
          # The variable is quoted so passwords containing spaces or shell metacharacters survive expansion
          - mysqladmin ping -h 127.0.0.1 -u root --password="$MYSQL_ROOT_PASSWORD"
        successExitCode: 0
        timeout: 30s

    - name: data-query
      type: exec
      exec:
        podSelector:
          app: mysql
        container: mysql
        command:
          - /bin/sh
          - -c
          - mysql -u root --password="$MYSQL_ROOT_PASSWORD" -e "SELECT COUNT(*) FROM myapp_db.orders;"
        successExitCode: 0
        timeout: 60s
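
As written, data-query succeeds whenever the statement executes, including when the table is empty. A possible stricter variant, sketched here with a hypothetical check name and the same placeholder orders table, compares the returned count against zero:

```yaml
- name: data-query-nonzero   # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: mysql
    container: mysql
    command:
      - /bin/sh
      - -c
      # -N suppresses column names and -B selects batch output, leaving only the number
      - test "$(mysql -u root --password="$MYSQL_ROOT_PASSWORD" -N -B -e 'SELECT COUNT(*) FROM myapp_db.orders')" -gt 0
    successExitCode: 0
    timeout: 60s
```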

PVC verification

MySQL stores data in /var/lib/mysql. To confirm InnoDB files are present:

- name: innodb-files-present
  type: exec
  exec:
    podSelector:
      app: mysql
    container: mysql
    command:
      - test
      - -f
      - /var/lib/mysql/ibdata1
    successExitCode: 0
    timeout: 10s

Pitfalls

InnoDB crash recovery: As with PostgreSQL, a snapshot restore can force crash recovery at startup; for MySQL this is InnoDB crash recovery. The pod may show as Running before recovery completes. mysqladmin ping reports "mysqld is alive" only after recovery finishes, which makes it a reliable gate.

Binary log positions: If you restore a replica, the replication coordinates captured in the backup are stale: the primary's binary log has moved on since the backup point, and the saved position may not match the data on the restored volume. Do not reconnect a restored replica to its primary without resetting the replication coordinates; otherwise the replica can skip transactions.

Environment variable passwords: The exec check runs commands directly without a shell, so $VARIABLE expansion does not work when passing arguments as separate list entries. The examples above use /bin/sh -c as the command entry point to enable shell expansion of $MYSQL_ROOT_PASSWORD. This is required any time your command uses environment variables, pipes, or redirection.
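
The difference between the two forms can be illustrated side by side. This is a sketch of the command field only, shown as two separate YAML documents; the surrounding check fields are omitted.

```yaml
# Not expanded: each list entry is passed to mysqladmin verbatim,
# so the password argument is the literal text "$MYSQL_ROOT_PASSWORD".
command:
  - mysqladmin
  - ping
  - --password=$MYSQL_ROOT_PASSWORD
---
# Expanded: /bin/sh -c receives a single string and substitutes the variable
# from the container's environment before running mysqladmin.
command:
  - /bin/sh
  - -c
  - mysqladmin ping --password="$MYSQL_ROOT_PASSWORD"
```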


Redis

Which checks to use

Stage | Check type              | Purpose
1     | podStatus               | Wait for the Redis pod
2     | tcpSocket               | Confirm port 6379 is open
3     | exec (redis-cli PING)   | Confirm server responds to commands
4     | exec (redis-cli DBSIZE) | Verify keyspace is not empty

Redis does not use PVCs in all deployments. If persistence is enabled via an RDB or AOF file, add a resourceExists check for the PVC and an exec check that tests the file path.

Example RestoreTest

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: redis-nightly
  namespace: kymaros-system
spec:
  schedule: "0 1 * * *"
  backupSource:
    name: redis-backup
    namespace: cache-prod
  checks:
    - name: pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: redis
          role: master
        minReady: 1
        timeout: 5m

    - name: port-open
      type: tcpSocket
      tcpSocket:
        service: redis-svc
        port: 6379
        timeout: 10s

    - name: ping
      type: exec
      exec:
        podSelector:
          app: redis
          role: master
        container: redis
        command:
          - redis-cli
          - PING
        successExitCode: 0
        timeout: 10s

    - name: keyspace-not-empty
      type: exec
      exec:
        podSelector:
          app: redis
          role: master
        container: redis
        command:
          - /bin/sh
          - -c
          # Quoting keeps an error reply from being split into multiple test arguments
          - test "$(redis-cli DBSIZE)" -gt 0
        successExitCode: 0
        timeout: 10s

PVC verification

For Redis with AOF or RDB persistence:

- name: persistence-file
  type: resourceExists
  resourceExists:
    resources:
      - kind: PVC
        name: redis-data

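Then follow with an exec check for the persistence file. The example tests an RDB file at /data/dump.rdb; for AOF, test the append-only file path from your configuration instead: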

- name: rdb-file-exists
  type: exec
  exec:
    podSelector:
      app: redis
      role: master
    container: redis
    command:
      - test
      - -f
      - /data/dump.rdb
    successExitCode: 0
    timeout: 10s

Pitfalls

RDB load time: Redis loads the RDB file into memory at startup, and large RDB files (several GB) make this take noticeably longer. While loading, the server accepts TCP connections but answers most commands, including PING, with a LOADING error, so the tcpSocket check can pass before the data is available. Ensure podStatus has a sufficient timeout before the exec checks run.
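
To detect this window explicitly, a check can read the loading field reported by INFO persistence, which is 0 once the dataset is fully in memory. The sketch below assumes /bin/sh and grep are available in the Redis container; the check name is hypothetical.

```yaml
- name: dataset-loaded   # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: redis
      role: master
    container: redis
    command:
      - /bin/sh
      - -c
      # grep -q exits 0 only if a line starting with "loading:0" is present
      - redis-cli INFO persistence | grep -q '^loading:0'
    successExitCode: 0
    timeout: 10s
```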

Replica promotion: If the backup captured a replica instead of the primary, the keyspace may look complete yet hold stale keys from before the last replication sync. Verify that the backup captured the primary, or test for specific critical keys instead of relying on DBSIZE > 0.
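
A check for a specific key might look like the following sketch. The key name app:config:version is a hypothetical placeholder; substitute a key your application always writes. EXISTS returns 1 when the key is present and 0 otherwise.

```yaml
- name: critical-key-present   # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: redis
      role: master
    container: redis
    command:
      - /bin/sh
      - -c
      # app:config:version is a placeholder key, not something Redis defines
      - test "$(redis-cli EXISTS app:config:version)" -eq 1
    successExitCode: 0
    timeout: 10s
```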


MongoDB

Which checks to use

Stage | Check type            | Purpose
1     | resourceExists        | Confirm PVC and Secret exist
2     | podStatus             | Wait for pod readiness
3     | exec (mongosh --eval) | Confirm server responds
4     | exec (mongosh count)  | Validate a collection is not empty

Example RestoreTest

apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: mongodb-nightly
  namespace: kymaros-system
spec:
  schedule: "0 1 * * *"
  backupSource:
    name: mongodb-backup
    namespace: db-prod
  checks:
    - name: pvc-and-secret
      type: resourceExists
      resourceExists:
        resources:
          - kind: PVC
            name: mongodb-data
          - kind: Secret
            name: mongodb-credentials

    - name: pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: mongodb
        minReady: 1
        timeout: 10m

    - name: server-status
      type: exec
      exec:
        podSelector:
          app: mongodb
        container: mongodb
        command:
          - mongosh
          - --quiet
          - --eval
          # Quoted so YAML treats "ping: 1" as part of the string, not as a mapping
          - "db.adminCommand({ ping: 1 }).ok"
        successExitCode: 0
        timeout: 30s

    - name: collection-count
      type: exec
      exec:
        podSelector:
          app: mongodb
        container: mongodb
        command:
          - /bin/sh
          - -c
          - test "$(mongosh --quiet myapp_db --eval "db.users.countDocuments()")" -gt 0
        successExitCode: 0
        timeout: 60s

PVC verification

MongoDB stores data in /data/db. To confirm the data directory is not empty after restore:

- name: data-directory-not-empty
  type: exec
  exec:
    podSelector:
      app: mongodb
    container: mongodb
    command:
      - /bin/sh
      - -c
      - test $(ls /data/db | wc -l) -gt 0
    successExitCode: 0
    timeout: 10s

Pitfalls

Replica set reconfiguration: A restored MongoDB replica set member may refuse to start if its replica set configuration references hostnames or node IDs that do not exist in the restored environment. This causes the pod to restart repeatedly. Restore tests should target a standalone instance or a replica set where all members are restored together.
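
A member that starts but cannot rejoin its replica set will not become a writable primary. One way to surface that is to read isWritablePrimary from the hello command, as in this sketch. It assumes mongosh is present, that the server version supports hello, and that the restored instance is expected to be a standalone or a primary; the check name is hypothetical.

```yaml
- name: writable-primary   # hypothetical check name
  type: exec
  exec:
    podSelector:
      app: mongodb
    container: mongodb
    command:
      - /bin/sh
      - -c
      # mongosh --quiet --eval prints the evaluated value, here "true" or "false"
      - test "$(mongosh --quiet --eval 'db.hello().isWritablePrimary')" = "true"
    successExitCode: 0
    timeout: 30s
```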

Snapshot during oplog writes: Volume snapshots taken during active write workloads can produce a data directory that MongoDB considers corrupted. MongoDB attempts journal recovery at startup, but success is not guaranteed. Keep journaling enabled (it is always on from MongoDB 6.1, which removed the --journal option), keep the journal on the same volume as the data files, and ensure checkpoints are frequent (the default checkpoint interval is 60 seconds).

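mongosh vs mongo: MongoDB 6.0 removed the legacy mongo shell, so images based on 6.0+ provide only mongosh. Images older than 5.0 use the mongo binary, and 5.x images may include either or both. Check which binary your image provides and adjust the command field accordingly.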