Backup Best Practices

Kymaros validates that a backup restores into a working application. The quality of a restore test depends directly on the quality of the backup. A backup that omits critical resources, captures data mid-write, or covers only part of the application will produce misleading validation results — either false passes or uninformative failures.

The following practices are relevant whether you are creating new backup schedules or auditing existing ones.


Include All Application Resources

A restorable backup must include every resource the application needs to run, not just the workload objects.

Always include:

  • Secrets — credentials, TLS certificates, API keys
  • ConfigMaps — application configuration
  • PersistentVolumeClaims and their associated volume snapshots
  • Services and Endpoints if they carry manual configuration
  • ServiceAccounts if the application uses a non-default service account
  • NetworkPolicies if the application defines its own egress/ingress rules

Velero example (whole namespace, no resource filter):

velero backup create my-app-full \
  --include-namespaces my-app \
  --include-cluster-resources=true \
  --wait

Avoid using --exclude-resources unless you have an explicit reason. Each excluded resource type is a potential restore failure. If a resource type is excluded, Kymaros may report a completeness gap (Level 2 validation) because the restored namespace has fewer resources than the source.
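One way to confirm that a finished backup contains the resource types listed above is to scan the output of velero backup describe --details. The sketch below is illustrative only: missing_from_backup is a hypothetical helper, and it assumes the Resource List section prints one header per type ending in /Kind: (for example v1/Secret:), a layout that may differ between Velero versions.

```shell
# missing_from_backup BACKUP KIND...
# Prints every KIND that has no "<group>/<version>/KIND:" header in the
# Resource List of the backup description. Assumes the velero CLI is on PATH.
missing_from_backup() {
  backup=$1
  shift
  details=$(velero backup describe "$backup" --details)
  for kind in "$@"; do
    # A header line ends with "/<Kind>:"; report the kind if none is found.
    printf '%s\n' "$details" | grep "/${kind}:\$" >/dev/null || printf '%s\n' "$kind"
  done
}

# Example call (backup name taken from the command above):
#   missing_from_backup my-app-full Secret ConfigMap PersistentVolumeClaim NetworkPolicy
```

An empty result means every requested kind appears at least once in the backup; it does not prove that every individual object was captured.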


Use Consistent Snapshots

Velero can capture volume data either with file-system backup (copying files from the mounted volume) or with volume snapshots taken by the storage provider. A backup taken while the application is writing data produces an inconsistent copy: it restores correctly at the volume level but can fail at the application level (corrupted databases, partial writes).

Recommended approaches:

  • Use --default-volumes-to-fs-backup with Velero's Restic/Kopia integration and configure a pre-backup hook to quiesce the application:
    # Velero pre-backup hook on the application pod
    # Example command only; substitute the quiesce step your application supports
    metadata:
      annotations:
        pre.hook.backup.velero.io/command: '["/bin/sh", "-c", "kill -SIGTERM 1 || true"]'
        pre.hook.backup.velero.io/on-error: Fail
  • For databases, use a dedicated backup hook that flushes the write buffer (for example, FLUSH TABLES WITH READ LOCK for MySQL) before the snapshot and releases the lock afterward.

  • For stateless applications (no PVCs), consistency is not a concern — any point-in-time capture is valid.
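As a sketch of a paired quiesce and release step, the helper below sets both a pre-backup and a post-backup hook with kubectl annotate, following the fsfreeze pattern used in Velero's own hook examples. The function name and its arguments are hypothetical, and it assumes the named container ships /sbin/fsfreeze and has the privileges to run it.

```shell
# annotate_fsfreeze_hooks NAMESPACE LABEL_SELECTOR CONTAINER MOUNT_PATH
# Adds Velero hook annotations that freeze MOUNT_PATH before the backup
# and unfreeze it afterward. Assumes kubectl is configured for the cluster.
annotate_fsfreeze_hooks() {
  kubectl annotate pod -n "$1" -l "$2" --overwrite \
    "pre.hook.backup.velero.io/container=$3" \
    "pre.hook.backup.velero.io/command=[\"/sbin/fsfreeze\", \"--freeze\", \"$4\"]" \
    "pre.hook.backup.velero.io/on-error=Fail" \
    "post.hook.backup.velero.io/container=$3" \
    "post.hook.backup.velero.io/command=[\"/sbin/fsfreeze\", \"--unfreeze\", \"$4\"]"
}

# Example call with placeholder values:
#   annotate_fsfreeze_hooks my-app app.kubernetes.io/name=my-app fsfreeze /var/lib/data
```

Annotations applied to running pods are lost when the pods are recreated, so for a permanent setup the same keys belong in the pod template of the Deployment or StatefulSet.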


Test Regularly with Short Intervals

A backup validated six months ago tells you nothing about today. Application dependencies change, schemas drift, and storage backends degrade.

Schedule RestoreTest runs at a frequency that matches how tight your recovery time objective (RTO) is:

Recovery scenario    Recommended test frequency
RTO < 1 hour         Daily
RTO 1–4 hours        Weekly
RTO > 4 hours        Monthly minimum

For critical workloads, a nightly test at low-traffic hours is a reasonable default. The resource cost of a sandbox run is bounded by the ResourceQuota Kymaros applies — it does not consume production-level capacity.
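The frequency guidance above can be written as a small lookup, which is handy when test schedules are generated per workload. This is a minimal sketch: test_frequency_for_rto is a hypothetical helper, it takes the RTO in minutes, and it treats exactly 1 hour and exactly 4 hours as part of the weekly band.

```shell
# test_frequency_for_rto RTO_MINUTES
# Prints the recommended restore-test frequency for the given RTO.
test_frequency_for_rto() {
  rto_minutes=$1
  if [ "$rto_minutes" -lt 60 ]; then
    echo daily        # RTO under 1 hour
  elif [ "$rto_minutes" -le 240 ]; then
    echo weekly       # RTO from 1 to 4 hours
  else
    echo monthly      # RTO above 4 hours (monthly is the minimum)
  fi
}

# Example call: test_frequency_for_rto 30
```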


Use Labels for Filtering

Velero supports label selectors for backup scope. Using consistent labels across your application resources makes it easy to define precise backup scope and to verify completeness during restore validation.

Apply a standard label set to all application resources:

metadata:
  labels:
    app.kubernetes.io/name: my-app
    app.kubernetes.io/part-of: my-platform
    backup.kymaros.io/include: "true"

Reference the selector in the Velero schedule:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: my-app-nightly
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - my-app
    labelSelector:
      matchLabels:
        backup.kymaros.io/include: "true"

The label selector approach also simplifies multi-component applications where only a subset of resources should be included in a given backup.
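A label-scoped backup silently skips anything that was never labeled, so it can help to list the stragglers before relying on the selector. The sketch below uses a standard kubectl inequality selector, which also matches objects that do not carry the label at all. The helper name is hypothetical, and the resource types are the ones called out earlier in this guide.

```shell
# unlabeled_resources NAMESPACE
# Lists objects in NAMESPACE whose backup.kymaros.io/include label is
# missing or set to something other than "true".
unlabeled_resources() {
  kubectl get secrets,configmaps,persistentvolumeclaims,services,serviceaccounts,networkpolicies \
    -n "$1" \
    -l 'backup.kymaros.io/include!=true' \
    -o name
}

# Example call: unlabeled_resources my-app
```

Expect some auto-generated objects in the output, such as the default ServiceAccount and the kube-root-ca.crt ConfigMap; those normally do not need the label.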


Set Appropriate TTLs on Backups

Velero backup TTLs control how long backup data is retained in storage. Setting TTLs too short can cause Kymaros to attempt validation against a backup that has already expired.

Rule of thumb: The backup TTL should be at least two times the RestoreTest schedule interval. For a nightly test, the backup TTL should be at least 48 hours.

spec:
  template:
    ttl: 168h # 7 days

Also consider your compliance retention requirements. If your organization requires 90 days of audit history, the backup TTL must match or exceed that window.
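Both rules combine into one calculation: take twice the test interval, then raise the result to the compliance retention window if that is longer. A minimal sketch follows, with a hypothetical helper name and whole hours and days as inputs.

```shell
# min_backup_ttl TEST_INTERVAL_HOURS [RETENTION_DAYS]
# Prints the smallest TTL, in Velero's "<n>h" form, that satisfies both
# the 2x-interval rule of thumb and an optional retention requirement.
min_backup_ttl() {
  interval_hours=$1
  retention_days=${2:-0}
  ttl=$((interval_hours * 2))
  retention_hours=$((retention_days * 24))
  if [ "$retention_hours" -gt "$ttl" ]; then
    ttl=$retention_hours
  fi
  printf '%sh\n' "$ttl"
}

min_backup_ttl 24      # nightly test, no retention rule: prints 48h
min_backup_ttl 24 90   # nightly test, 90-day retention: prints 2160h
```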


Use Multiple BackupStorageLocations

A single BackupStorageLocation is a single point of failure. If that location is unavailable, Kymaros cannot trigger restores, and a gap opens silently in your validation coverage.

Recommended configuration:

  • Primary BSL: in-region object storage (fast access for routine restores and testing)
  • Secondary BSL: cross-region or cross-provider object storage (disaster recovery)

Configure a RestoreTest for each BSL to validate that both restore paths work:

# Test primary BSL
spec:
  backupSource:
    provider: velero
    storageLocation: primary-bsl
    backupName: latest

---

# Test secondary BSL
spec:
  backupSource:
    provider: velero
    storageLocation: secondary-bsl
    backupName: latest

This catches drift between storage locations early and gives you confidence that the DR path is as valid as the primary path.
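Velero records the health of each BackupStorageLocation in its status.phase field (Available or Unavailable), so a quick pre-check can flag a location before a restore test depends on it. The helper below is a hypothetical sketch that assumes Velero runs in the velero namespace unless another one is passed.

```shell
# unavailable_bsls [VELERO_NAMESPACE]
# Prints the name of every BackupStorageLocation whose phase is not
# "Available" (including locations that have not been validated yet).
unavailable_bsls() {
  kubectl get backupstoragelocations.velero.io -n "${1:-velero}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' |
    awk -F'\t' '$2 != "Available" { print $1 }'
}

# Example call: unavailable_bsls
```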


Align Backup Scope with the RestoreTest Namespace List

The RestoreTest spec.backupSource.namespaces field filters which namespaces are restored into the sandbox. This list must match the namespaces captured in the backup. Mismatches produce completeness failures.

spec:
  backupSource:
    provider: velero
    backupName: latest
    namespaces:
      - name: my-app
      - name: my-app-config # only if this namespace is in the backup

If you routinely include namespaces in a backup that you do not want to test (for example, a shared logging namespace), do not add them to spec.backupSource.namespaces. Kymaros will only restore the namespaces you list, keeping the sandbox scope focused.
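A mismatch can be caught before a test run by comparing the namespaces you plan to list against the spec.includedNamespaces field of the Velero Backup object. The sketch below is a hypothetical helper: it needs a concrete backup name, assumes Velero runs in the velero namespace, treats an empty list or * as "all namespaces", and ignores excludedNamespaces.

```shell
# namespaces_missing_from_backup BACKUP NAMESPACE...
# Prints each NAMESPACE that the named Velero backup does not include.
namespaces_missing_from_backup() {
  backup=$1
  shift
  included=$(kubectl get backups.velero.io "$backup" -n velero \
    -o jsonpath='{.spec.includedNamespaces[*]}')
  for ns in "$@"; do
    case " $included " in
      *" * "* | "  ") ;;         # wildcard or empty list: everything is in scope
      *" $ns "*) ;;              # namespace is explicitly included
      *) printf '%s\n' "$ns" ;;  # namespace is not covered by the backup
    esac
  done
}

# Example call with a placeholder backup name:
#   namespaces_missing_from_backup my-app-full my-app my-app-config
```

Any namespace printed here should either be added to the backup or removed from spec.backupSource.namespaces.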