Stateful Applications
Stateful workloads have failure modes that ephemeral services do not. A restore may succeed at the Kubernetes level — pods running, Services reachable — while the database is internally inconsistent: corrupted pages, incomplete WAL replay, or a replica pointing at a primary that no longer exists. This guide covers how to structure RestoreTest resources that catch those failures.
General principles
Before looking at individual databases:
- Always start with resourceExists to confirm PVCs and credential Secrets were included in the backup.
- Follow with podStatus to gate all subsequent checks behind pod readiness.
- Use exec with the database's own health command rather than a generic tcpSocket: the former detects recovery mode, the latter does not.
- Set timeouts conservatively. A 50 GB PostgreSQL database may need 10–15 minutes of WAL replay before accepting queries.
PostgreSQL
Which checks to use
| Stage | Check type | Purpose |
|---|---|---|
| 1 | resourceExists | Confirm PVC and password Secret exist |
| 2 | podStatus | Wait for the StatefulSet pod to be Ready |
| 3 | exec (pg_isready) | Confirm PostgreSQL is accepting connections |
| 4 | exec (psql) | Run a test query to validate data integrity |
Example RestoreTest
apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: postgres-nightly
  namespace: kymaros-system
spec:
  schedule: "0 1 * * *"
  backupSource:
    name: postgres-backup
    namespace: db-prod
  checks:
    - name: pvc-and-secret
      type: resourceExists
      resourceExists:
        resources:
          - kind: PVC
            name: postgres-data
          - kind: Secret
            name: postgres-credentials
    - name: pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: postgres
          statefulset.kubernetes.io/pod-name: postgres-0
        minReady: 1
        timeout: 15m
    - name: accepting-connections
      type: exec
      exec:
        podSelector:
          app: postgres
          statefulset.kubernetes.io/pod-name: postgres-0
        container: postgres
        command:
          - pg_isready
          - -U
          - postgres
          - -d
          - myapp_db
        successExitCode: 0
        timeout: 30s
    - name: data-query
      type: exec
      exec:
        podSelector:
          app: postgres
          statefulset.kubernetes.io/pod-name: postgres-0
        container: postgres
        command:
          - psql
          - -U
          - postgres
          - -d
          - myapp_db
          - -c
          - SELECT COUNT(*) FROM users WHERE created_at > NOW() - INTERVAL '30 days';
        successExitCode: 0
        timeout: 60s
PVC verification
PostgreSQL stores data in the volume mounted at /var/lib/postgresql/data. To confirm the volume has data (not just that it exists), add an exec check that reads the data directory:
- name: data-directory-present
  type: exec
  exec:
    podSelector:
      app: postgres
      statefulset.kubernetes.io/pod-name: postgres-0
    container: postgres
    command:
      - test
      - -f
      - /var/lib/postgresql/data/PG_VERSION
    successExitCode: 0
    timeout: 10s
Pitfalls
WAL replay: After a snapshot restore, PostgreSQL may enter archive recovery mode and replay WAL segments before accepting connections. During this phase pg_isready exits with code 1 ("rejecting connections"). Set timeout on the podStatus check to at least 15 minutes for databases larger than 20 GB, and set timeout on the exec check to 60–120 seconds.
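Pod readiness and pg_isready show that the server is reachable, not that recovery is over. As a rough sketch, an additional exec check could query pg_is_in_recovery(), which psql prints as f once the instance has left recovery. The check name recovery-finished is made up for illustration, the fields simply mirror the exec checks in the example above, and the check assumes the restored instance is meant to run as a primary (a standby stays in recovery by design).

```yaml
# Hypothetical check: passes only when PostgreSQL reports it is no longer in recovery.
- name: recovery-finished
  type: exec
  exec:
    podSelector:
      app: postgres
      statefulset.kubernetes.io/pod-name: postgres-0
    container: postgres
    command:
      - /bin/sh
      - -c
      # -t suppresses headers, -A disables alignment, so the output is just "t" or "f"
      - test "$(psql -U postgres -d myapp_db -tAc 'SELECT pg_is_in_recovery()')" = "f"
    successExitCode: 0
    timeout: 120s
```

The /bin/sh -c wrapper is needed here because the comparison relies on command substitution.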
Snapshots during writes: A volume snapshot taken while PostgreSQL is actively writing can leave the data directory in a state that requires crash recovery. PostgreSQL handles this automatically, but the recovery window depends on checkpoint frequency. If possible, run the SQL command CHECKPOINT immediately before taking the snapshot, or enable continuous archiving.
Replication slots: If the restored instance has replication slots that reference WAL positions in the past, it may hold WAL segments indefinitely. After a restore test, confirm that slots are dropped or that WAL accumulation is acceptable.
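One way to surface leftover slots during the test itself is to count inactive entries in the pg_replication_slots system view. The following is a sketch under assumptions: the check name is hypothetical, the fields follow the exec checks used elsewhere in this guide, and a threshold of zero inactive slots may be too strict for setups that intentionally keep slots.

```yaml
# Hypothetical check: fails if the restored instance still carries inactive replication slots.
- name: no-inactive-replication-slots
  type: exec
  exec:
    podSelector:
      app: postgres
      statefulset.kubernetes.io/pod-name: postgres-0
    container: postgres
    command:
      - /bin/sh
      - -c
      - test "$(psql -U postgres -d myapp_db -tAc 'SELECT count(*) FROM pg_replication_slots WHERE NOT active')" -eq 0
    successExitCode: 0
    timeout: 30s
```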
MySQL
Which checks to use
| Stage | Check type | Purpose |
|---|---|---|
| 1 | resourceExists | Confirm PVC and Secret exist |
| 2 | podStatus | Wait for pod readiness |
| 3 | exec (mysqladmin ping) | Confirm server is responsive |
| 4 | exec (mysql -e) | Run a query to validate data |
Example RestoreTest
apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: mysql-nightly
  namespace: kymaros-system
spec:
  schedule: "0 1 * * *"
  backupSource:
    name: mysql-backup
    namespace: db-prod
  checks:
    - name: pvc-and-secret
      type: resourceExists
      resourceExists:
        resources:
          - kind: PVC
            name: mysql-data
          - kind: Secret
            name: mysql-credentials
    - name: pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: mysql
        minReady: 1
        timeout: 10m
    - name: server-ping
      type: exec
      exec:
        podSelector:
          app: mysql
        container: mysql
        command:
          - /bin/sh
          - -c
          - mysqladmin ping -h 127.0.0.1 -u root --password=$MYSQL_ROOT_PASSWORD
        successExitCode: 0
        timeout: 30s
    - name: data-query
      type: exec
      exec:
        podSelector:
          app: mysql
        container: mysql
        command:
          - /bin/sh
          - -c
          - mysql -u root --password=$MYSQL_ROOT_PASSWORD -e "SELECT COUNT(*) FROM myapp_db.orders;"
        successExitCode: 0
        timeout: 60s
PVC verification
MySQL stores data in /var/lib/mysql. To confirm InnoDB files are present:
- name: innodb-files-present
  type: exec
  exec:
    podSelector:
      app: mysql
    container: mysql
    command:
      - test
      - -f
      - /var/lib/mysql/ibdata1
    successExitCode: 0
    timeout: 10s
Pitfalls
InnoDB crash recovery: Like PostgreSQL, MySQL may enter InnoDB crash recovery after a snapshot restore. The pod may show as Running before recovery completes. mysqladmin ping returns mysqld is alive only after recovery finishes, making it a reliable gate.
Binary log positions: If you restore a replica, its binary log position will be ahead of the backup point. Do not attempt to reconnect a restored replica to its primary without resetting the replication coordinates — this will cause the replica to skip transactions.
Environment variable passwords: The exec check runs commands directly without a shell, so $VARIABLE expansion does not work when passing arguments as separate list entries. The examples above use /bin/sh -c as the command entry point to enable shell expansion of $MYSQL_ROOT_PASSWORD. This is required any time your command uses environment variables, pipes, or redirection.
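To make the difference concrete, the sketch below contrasts the two forms. Both check names are hypothetical. The first entry is expected to fail with an access-denied error because the client receives the literal text $MYSQL_ROOT_PASSWORD. The second uses the shell wrapper and, as a side benefit, can assert that the row count is greater than zero instead of relying on the exit code of mysql alone.

```yaml
# Expected NOT to work: no shell is involved, so the variable is never expanded.
- name: data-query-no-shell
  type: exec
  exec:
    podSelector:
      app: mysql
    container: mysql
    command:
      - mysql
      - -u
      - root
      - --password=$MYSQL_ROOT_PASSWORD
      - -e
      - SELECT COUNT(*) FROM myapp_db.orders;
    successExitCode: 0
    timeout: 60s
# Works: /bin/sh expands the variable and evaluates the comparison.
- name: data-query-with-shell
  type: exec
  exec:
    podSelector:
      app: mysql
    container: mysql
    command:
      - /bin/sh
      - -c
      # -N drops the column header and -B uses batch output, leaving only the number
      - test "$(mysql -u root --password="$MYSQL_ROOT_PASSWORD" -N -B -e 'SELECT COUNT(*) FROM myapp_db.orders')" -gt 0
    successExitCode: 0
    timeout: 60s
```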
Redis
Which checks to use
| Stage | Check type | Purpose |
|---|---|---|
| 1 | podStatus | Wait for the Redis pod |
| 2 | tcpSocket | Confirm port 6379 is open |
| 3 | exec (redis-cli PING) | Confirm server responds to commands |
| 4 | exec (redis-cli DBSIZE) | Verify keyspace is not empty |
Redis does not use PVCs in all deployments. If persistence is enabled via an RDB or AOF file, add a resourceExists check for the PVC and an exec check that tests the file path.
Example RestoreTest
apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: redis-nightly
  namespace: kymaros-system
spec:
  schedule: "0 1 * * *"
  backupSource:
    name: redis-backup
    namespace: cache-prod
  checks:
    - name: pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: redis
          role: master
        minReady: 1
        timeout: 5m
    - name: port-open
      type: tcpSocket
      tcpSocket:
        service: redis-svc
        port: 6379
        timeout: 10s
    - name: ping
      type: exec
      exec:
        podSelector:
          app: redis
          role: master
        container: redis
        command:
          - redis-cli
          - PING
        successExitCode: 0
        timeout: 10s
    - name: keyspace-not-empty
      type: exec
      exec:
        podSelector:
          app: redis
          role: master
        container: redis
        command:
          - /bin/sh
          - -c
          - test $(redis-cli DBSIZE) -gt 0
        successExitCode: 0
        timeout: 10s
PVC verification
For Redis with AOF or RDB persistence:
- name: persistence-file
  type: resourceExists
  resourceExists:
    resources:
      - kind: PVC
        name: redis-data
Then follow with an exec check:
- name: rdb-file-exists
  type: exec
  exec:
    podSelector:
      app: redis
      role: master
    container: redis
    command:
      - test
      - -f
      - /data/dump.rdb
    successExitCode: 0
    timeout: 10s
Pitfalls
RDB load time: Redis loads the RDB file into memory at startup, and large RDB files (several GB) make this take a while. During this window Redis accepts connections but answers most commands, including PING, with a LOADING error instead of PONG. Ensure podStatus has a sufficient timeout before the exec check runs.
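If the pod's readiness probe does not already account for the load phase, an explicit gate can help. The sketch below relies on the loading field that INFO persistence reports, which is 0 once the dataset is in memory. The check name dataset-loaded is hypothetical and the fields mirror the exec checks in the example above.

```yaml
# Hypothetical check: passes once Redis reports that dataset loading has finished.
- name: dataset-loaded
  type: exec
  exec:
    podSelector:
      app: redis
      role: master
    container: redis
    command:
      - /bin/sh
      - -c
      - redis-cli INFO persistence | grep -q '^loading:0'
    successExitCode: 0
    timeout: 10s
```

Placing this before the PING and DBSIZE checks keeps a slow load from being reported as an empty keyspace.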
Replica promotion: If the backup captured a replica and not the primary, the keyspace may be complete but the restored instance may have stale keys from before the last replication sync. Verify the restored instance was the primary in the backup, or test for specific critical keys rather than DBSIZE > 0.
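Both suggestions can be expressed as exec checks. This is an illustrative sketch: the check names and the key config:schema_version are placeholders to be replaced with a key your application always writes, and the role test assumes the restored instance keeps the replication role it had when the backup was taken.

```yaml
# Hypothetical check: the restored instance reports itself as a primary.
- name: restored-as-primary
  type: exec
  exec:
    podSelector:
      app: redis
      role: master
    container: redis
    command:
      - /bin/sh
      - -c
      - redis-cli INFO replication | grep -q '^role:master'
    successExitCode: 0
    timeout: 10s
# Hypothetical check: a specific critical key exists (EXISTS prints 1 when it does).
- name: critical-key-present
  type: exec
  exec:
    podSelector:
      app: redis
      role: master
    container: redis
    command:
      - /bin/sh
      - -c
      - test "$(redis-cli EXISTS config:schema_version)" -eq 1
    successExitCode: 0
    timeout: 10s
```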
MongoDB
Which checks to use
| Stage | Check type | Purpose |
|---|---|---|
| 1 | resourceExists | Confirm PVC and Secret exist |
| 2 | podStatus | Wait for pod readiness |
| 3 | exec (mongosh --eval) | Confirm server responds |
| 4 | exec (mongosh count) | Validate a collection is not empty |
Example RestoreTest
apiVersion: restore.kymaros.io/v1alpha1
kind: RestoreTest
metadata:
  name: mongodb-nightly
  namespace: kymaros-system
spec:
  schedule: "0 1 * * *"
  backupSource:
    name: mongodb-backup
    namespace: db-prod
  checks:
    - name: pvc-and-secret
      type: resourceExists
      resourceExists:
        resources:
          - kind: PVC
            name: mongodb-data
          - kind: Secret
            name: mongodb-credentials
    - name: pod-ready
      type: podStatus
      podStatus:
        labelSelector:
          app: mongodb
        minReady: 1
        timeout: 10m
    - name: server-status
      type: exec
      exec:
        podSelector:
          app: mongodb
        container: mongodb
        command:
          - mongosh
          - --quiet
          - --eval
          # Quoted because an unquoted ": " would be parsed as a YAML mapping
          - "db.adminCommand({ ping: 1 }).ok"
        successExitCode: 0
        timeout: 30s
    - name: collection-count
      type: exec
      exec:
        podSelector:
          app: mongodb
        container: mongodb
        command:
          - /bin/sh
          - -c
          - test $(mongosh --quiet myapp_db --eval "db.users.countDocuments()") -gt 0
        successExitCode: 0
        timeout: 60s
PVC verification
MongoDB stores data in /data/db. To confirm the data directory is not empty after restore:
- name: data-directory-not-empty
  type: exec
  exec:
    podSelector:
      app: mongodb
    container: mongodb
    command:
      - /bin/sh
      - -c
      - test $(ls /data/db | wc -l) -gt 0
    successExitCode: 0
    timeout: 10s
Pitfalls
Replica set reconfiguration: A restored MongoDB replica set member may refuse to start if its replica set configuration references hostnames or node IDs that do not exist in the restored environment. This causes the pod to restart repeatedly. Restore tests should target a standalone instance or a replica set where all members are restored together.
Snapshot during oplog writes: Volume snapshots taken during active write workloads can produce a data directory that MongoDB considers corrupted. MongoDB will attempt journal recovery at startup, but success is not guaranteed. Keep journaling enabled (it is always on from MongoDB 6.1, where the --journal option was removed) and ensure checkpoints are frequent (the default checkpoint interval is 60 seconds).
mongosh vs mongo: The legacy mongo shell was removed in MongoDB 6.0, so images based on 6.0+ ship only mongosh. Older images provide the mongo binary, which was deprecated in 5.0. Adjust the command field accordingly.
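When one RestoreTest has to cover images of different ages, the shell can pick whichever binary is present. This is a minimal sketch, not a recommended default: the check name and the MONGO_BIN variable are made up for illustration, and it assumes the image contains /bin/sh and that the server accepts unauthenticated local connections, as in the example above.

```yaml
# Hypothetical check: uses mongosh when available, otherwise the legacy mongo shell.
- name: server-status-any-shell
  type: exec
  exec:
    podSelector:
      app: mongodb
    container: mongodb
    command:
      - /bin/sh
      - -c
      - |
        # Fail the check if neither shell binary is on the PATH.
        MONGO_BIN=$(command -v mongosh || command -v mongo) || exit 1
        "$MONGO_BIN" --quiet --eval 'db.adminCommand({ ping: 1 }).ok'
    successExitCode: 0
    timeout: 30s
```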