Backup and Recovery
Overview
This document outlines the backup and recovery procedures for the Gustaffo Reservations Application. It covers database backups, application state, disaster recovery, and business continuity planning.
Backup Strategy
Database Backups
PostgreSQL Database
Backup Type | Frequency | Retention | Storage Location |
---|---|---|---|
Full Backup | Daily | 30 days | Cloud Storage |
WAL Archiving | Continuous | 7 days | Cloud Storage |
Logical Dump | Weekly | 90 days | Cloud Storage + Offsite |
Backup Process:
- Full Backups: Automated daily with pg_dump. A pg_dump archive is a logical snapshot that cannot be rolled forward with WAL, so point-in-time recovery additionally depends on a physical base backup (for example, one taken with pg_basebackup) plus the archived WAL.
    pg_dump -Fc -v -h ${DB_HOST} -U ${DB_USER} -d ${DB_NAME} > ${BACKUP_DIR}/full_${TIMESTAMP}.dump
- WAL Archiving: Continuous archiving of Write-Ahead Logs
    # In postgresql.conf
    archive_mode = on
    archive_command = 'aws s3 cp %p s3://gustaffo-backups/wal/%f'
- Logical Dumps: Weekly dumps of global objects (roles and tablespaces) plus schema and data
    pg_dumpall -g -h ${DB_HOST} -U ${DB_USER} > ${BACKUP_DIR}/globals_${TIMESTAMP}.sql
    pg_dump -Fc -v -h ${DB_HOST} -U ${DB_USER} -d ${DB_NAME} > ${BACKUP_DIR}/logical_${TIMESTAMP}.dump
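The dump commands above can be wrapped so that a failed run never leaves a truncated archive behind. The following is a minimal sketch rather than the project's actual tooling: the function names `backup_name` and `run_full_backup` are hypothetical, and it assumes `DB_HOST`, `DB_USER`, `DB_NAME`, and `BACKUP_DIR` are exported and that credentials come from `~/.pgpass` or `PGPASSWORD`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a backup file name such as full_20230715T010000Z.dump.
# $1 = prefix (full, logical, globals), $2 = extension,
# $3 = optional timestamp (defaults to the current UTC time).
backup_name() {
  local prefix="$1" ext="$2"
  local ts="${3:-$(date -u +%Y%m%dT%H%M%SZ)}"
  printf '%s_%s.%s\n' "$prefix" "$ts" "$ext"
}

# Dump to a temporary name and rename only after pg_dump succeeds,
# so an incomplete archive never appears under the final name.
# Assumption: DB_HOST, DB_USER, DB_NAME and BACKUP_DIR are set by the caller.
run_full_backup() {
  local target
  target="${BACKUP_DIR}/$(backup_name full dump)"
  pg_dump -Fc -h "${DB_HOST}" -U "${DB_USER}" -d "${DB_NAME}" -f "${target}.partial"
  mv "${target}.partial" "${target}"
  printf '%s\n' "${target}"
}
```

A scheduler entry would then call `run_full_backup` and treat a non-zero exit status as a failed backup.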
Application State
Configuration Backups
Component | Backup Method | Frequency | Retention |
---|---|---|---|
Kubernetes Manifests | Git Repository | Every change | Permanent |
Helm Charts | Git Repository | Every change | Permanent |
ConfigMaps & Secrets | Export Script | Daily | 30 days |
Backup Process:
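- Kubernetes Resources: The export contains Secret values in base64-encoded (not encrypted) form, so the resulting file should be stored with restricted access.
    kubectl get -n gustaffo-reservations configmap,secret -o yaml > ${BACKUP_DIR}/k8s_config_${TIMESTAMP}.yaml
- Helm Releases: This records release names and chart versions only; the values used for each release can be captured separately with helm get values.
    helm list -n gustaffo-reservations -o yaml > ${BACKUP_DIR}/helm_releases_${TIMESTAMP}.yaml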
File Storage
Data Type | Backup Method | Frequency | Retention |
---|---|---|---|
Uploaded Files | Snapshot | Daily | 30 days |
Generated PDFs | Snapshot | Daily | 30 days |
Logs | Archive | Weekly | 90 days |
Backup Process:
- Storage Snapshots:
    # Example for AWS EBS volumes
    aws ec2 create-snapshot --volume-id ${VOLUME_ID} --description "Daily backup ${TIMESTAMP}"
- Log Archiving:
    # Compress and archive logs older than 7 days. The file list goes to a single GNU tar
    # process, so a long list cannot be split across runs that overwrite the same archive.
    find ${LOG_DIR} -type f -name "*.log" -mtime +7 -print0 | tar -czvf ${BACKUP_DIR}/logs_${TIMESTAMP}.tar.gz --null -T -
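The retention periods in the tables above imply that expired files are removed at some point. For copies kept in a local `${BACKUP_DIR}`, pruning could look like the sketch below. The function name `prune_backups` is hypothetical, the sketch assumes GNU or BSD `find`, and objects in cloud storage would more likely be expired by the provider's lifecycle rules than by a script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Delete regular files directly inside a directory that are older than the
# retention period, printing each removed path.
# $1 = directory, $2 = retention in days.
# Note: find rounds file age down to whole days, so "-mtime +30" matches
# files that are at least 31 full days old.
prune_backups() {
  local dir="$1" days="$2"
  find "$dir" -maxdepth 1 -type f -mtime +"$days" -print -delete
}
```

For example, `prune_backups "${BACKUP_DIR}" 30` would apply the 30-day retention used for daily backups.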
Backup Verification
Automated Testing
- Restore Testing: Weekly automated restore tests to validate backup integrity
- Data Validation: Automated checks for database consistency after restore
- Application Testing: Basic functionality tests with restored data
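A cheap integrity check can run before any restore test: record a checksum when the backup is written and compare it before the archive is used. This is a sketch under assumptions, not a description of the existing automation; `record_checksum` and `verify_checksum` are hypothetical names, and `sha256sum` from GNU coreutils is assumed to be available.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a SHA-256 checksum file (<name>.sha256) next to a backup archive.
record_checksum() {
  local file="$1"
  ( cd "$(dirname "$file")" &&
    sha256sum "$(basename "$file")" > "$(basename "$file").sha256" )
}

# Return 0 if the archive still matches its recorded checksum,
# non-zero if it was modified or the checksum file is missing.
verify_checksum() {
  local file="$1"
  ( cd "$(dirname "$file")" &&
    sha256sum --check --status "$(basename "$file").sha256" )
}
```

A matching checksum only shows that the file is unchanged since it was written. It does not replace the restore tests listed above, which are what prove the backup is usable.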
Manual Verification
- Quarterly Exercises: Complete restore exercises performed quarterly
- Validation Checklist: Detailed verification of restored systems
- Documentation Review: Update procedures based on exercise findings
Recovery Procedures
Database Recovery
Full Database Restore
- Stop Applications:
    kubectl scale deployment -n gustaffo-reservations --replicas=0 -l app.kubernetes.io/part-of=gustaffo-reservations
- Prepare Environment:
    # Create empty database if needed (connect to the postgres maintenance database)
    psql -h ${DB_HOST} -U ${DB_USER} -d postgres -c "CREATE DATABASE ${DB_NAME};"
- Restore from Backup:
    # Full restore
    pg_restore -h ${DB_HOST} -U ${DB_USER} -d ${DB_NAME} -v ${BACKUP_FILE}
- Verify Restore:
    # Run basic validation queries
    psql -h ${DB_HOST} -U ${DB_USER} -d ${DB_NAME} -c "SELECT count(*) FROM reservations;"
- Restart Applications:
    kubectl scale deployment -n gustaffo-reservations --replicas=${REPLICA_COUNT} -l app.kubernetes.io/part-of=gustaffo-reservations
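During an incident it helps to rehearse the restore sequence before running it for real. The sketch below chains the steps from this procedure and adds a dry-run switch. The helper `run`, the function `restore_database`, and the `DRY_RUN` variable are hypothetical. The database is assumed to exist already, and `DB_HOST`, `DB_USER`, `DB_NAME`, `BACKUP_FILE`, and `REPLICA_COUNT` are assumed to be set by the operator.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Execute the restore steps in order. The steps are chained with && so a
# failed step stops the sequence before the applications are scaled back up.
restore_database() {
  local ns="gustaffo-reservations"
  local selector="app.kubernetes.io/part-of=gustaffo-reservations"
  run kubectl scale deployment -n "$ns" --replicas=0 -l "$selector" &&
  run pg_restore -h "${DB_HOST}" -U "${DB_USER}" -d "${DB_NAME}" -v "${BACKUP_FILE}" &&
  run psql -h "${DB_HOST}" -U "${DB_USER}" -d "${DB_NAME}" -c "SELECT count(*) FROM reservations;" &&
  run kubectl scale deployment -n "$ns" --replicas="${REPLICA_COUNT}" -l "$selector"
}
```

Running `DRY_RUN=1 restore_database` prints the four commands without touching the cluster or the database, which gives the incident commander something to review before the real run.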
Point-in-Time Recovery
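- Determine Recovery Target Time:
    # Set the target time for recovery. A value without a time zone offset is
    # interpreted in the server's configured time zone.
    RECOVERY_TARGET_TIME="2023-07-15 14:30:00"
- Restore Base Backup: Stop PostgreSQL and unpack a physical base backup taken before the target time into an empty data directory. A pg_dump archive restored with pg_restore cannot be used for this step, because WAL can only be replayed onto a physical copy of the cluster.
    # Assumes ${BASE_BACKUP} is a tar-format archive from pg_basebackup and ${PGDATA} is the empty data directory
    tar -xf ${BASE_BACKUP} -C ${PGDATA}
- Configure Recovery Settings: On PostgreSQL 12 and later, add the settings below to postgresql.conf and create an empty recovery.signal file in the data directory. PostgreSQL 11 and earlier read the same settings from recovery.conf. Substitute the actual timestamp, since the configuration file does not expand shell variables.
    restore_command = 'aws s3 cp s3://gustaffo-backups/wal/%f %p'
    recovery_target_time = '${RECOVERY_TARGET_TIME}'
- Apply WAL Files: Start PostgreSQL. It replays the archived WAL files automatically up to the target time.
- Verify Recovery:
    # Verify data at recovery point
    psql -h ${DB_HOST} -U ${DB_USER} -d ${DB_NAME} -c "SELECT max(created_at) FROM reservations;"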
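On PostgreSQL 12 and later, `recovery.conf` is no longer read; the recovery parameters live in the regular configuration files and an empty `recovery.signal` file in the data directory triggers recovery. The sketch below writes both. The function name `write_recovery_settings` is hypothetical, appending to `postgresql.auto.conf` is one option among several, and `recovery_target_action = 'promote'` is an assumption added here so that the server opens for writes once the target is reached instead of pausing.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Append point-in-time recovery settings to postgresql.auto.conf and create
# the empty recovery.signal file that tells PostgreSQL 12+ to start recovery.
# $1 = data directory, $2 = recovery target time.
write_recovery_settings() {
  local pgdata="$1" target_time="$2"
  {
    # %%f and %%p are printf escapes for the literal %f and %p placeholders.
    printf "restore_command = 'aws s3 cp s3://gustaffo-backups/wal/%%f %%p'\n"
    printf "recovery_target_time = '%s'\n" "$target_time"
    # Assumption: promote at the target rather than pause (the default).
    printf "recovery_target_action = 'promote'\n"
  } >> "${pgdata}/postgresql.auto.conf"
  : > "${pgdata}/recovery.signal"
}
```

It would be called after the base backup has been unpacked and before the server is started, for example `write_recovery_settings "${PGDATA}" "${RECOVERY_TARGET_TIME}"`.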
Application Recovery
Configuration Restore
- Apply Kubernetes Resources:
    kubectl apply -f ${BACKUP_DIR}/k8s_config_${TIMESTAMP}.yaml
- Reinstall Helm Releases if needed:
    helm upgrade --install -n gustaffo-reservations -f ${VALUES_FILE} ${RELEASE_NAME} ${CHART}
File Storage Recovery
- Restore from Snapshot:
    # Example for AWS EBS
    aws ec2 create-volume --availability-zone ${AZ} --snapshot-id ${SNAPSHOT_ID}
- Attach to Instances:
    aws ec2 attach-volume --volume-id ${VOLUME_ID} --instance-id ${INSTANCE_ID} --device ${DEVICE_NAME}
Disaster Recovery
Disaster Scenarios
- Database Corruption/Failure
  - Impact: Data loss or unavailability
  - Recovery: Restore from the latest backup, then replay archived WAL
  - RTO: 1 hour
  - RPO: 15 minutes (maximum data loss)
- Application Infrastructure Failure
  - Impact: Service unavailability
  - Recovery: Redeploy to secondary region
  - RTO: 30 minutes
  - RPO: 5 minutes
- Primary Region Outage
  - Impact: Complete service unavailability
  - Recovery: Activate secondary region
  - RTO: 1 hour
  - RPO: 15 minutes
- Data Center Destruction
  - Impact: Complete loss of infrastructure
  - Recovery: Rebuild in alternative region
  - RTO: 4 hours
  - RPO: 15 minutes
Recovery Time Objectives (RTO)
Service Component | RTO | Description |
---|---|---|
Database Services | 1 hour | Time to restore database and verify integrity |
API Services | 30 minutes | Time to redeploy and validate API functionality |
Frontend Applications | 15 minutes | Time to redeploy and validate frontend |
Complete System | 4 hours | Time to recover all components after catastrophic failure |
Recovery Point Objectives (RPO)
Data Type | RPO | Description |
---|---|---|
Reservation Data | 15 minutes | Maximum acceptable data loss for reservations |
Payment Data | 5 minutes | Maximum acceptable data loss for payment records |
Configuration Data | 24 hours | Maximum acceptable data loss for configuration |
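An RPO is only met if the newest recoverable data is never older than the stated limit. With WAL archiving, a segment is shipped when it fills up or when `archive_timeout` expires, so on a quiet system the newest archived segment can lag behind unless `archive_timeout` is set. The sketch below shows the comparison a monitoring job might make. The function name `within_rpo` is hypothetical, and how the timestamp of the newest archive is obtained is left to the caller.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Succeed if the newest archived WAL segment is no older than the RPO.
# $1 = epoch seconds of the newest archive, $2 = current epoch seconds,
# $3 = RPO in minutes.
within_rpo() {
  local last="$1" now="$2" rpo_minutes="$3"
  local age=$(( now - last ))
  [ "$age" -le $(( rpo_minutes * 60 )) ]
}
```

For the 15-minute reservation RPO, a check such as `within_rpo "$last_archive_epoch" "$(date +%s)" 15 || echo "RPO at risk"` would flag an archive gap, where `last_archive_epoch` is a placeholder for a value the monitoring job supplies.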
Business Continuity Plan
Service Continuity
Multi-Region Deployment
The application is deployed across multiple regions to ensure continuity:
- Primary Region: EU-West (Ireland)
- Secondary Region: EU-Central (Frankfurt)
- DR Region: US-East (Virginia)
Failover Process
- Automated Detection: Monitoring systems detect regional failure
- Alert Notification: Operations team notified of potential failover need
- Decision Point: Operations team evaluates need for failover
- DNS Failover: Update DNS records to point to secondary region
- Database Promotion: Promote database replica in secondary region
- Verification: Validate application functionality in secondary region
- Notification: Inform stakeholders of failover completion
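The detection step at the start of this process usually tolerates a few failed probes before raising an alert, so that a single dropped request does not start a failover discussion. The following is a sketch of that idea, not the monitoring system actually in use; `region_healthy` is a hypothetical name and the health-check command is supplied by the caller.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Run a health-check command up to $1 times, waiting $2 seconds between tries.
# Returns 0 as soon as one check passes, 1 if every attempt fails.
# Remaining arguments are the command to run.
region_healthy() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Do not sleep after the final attempt.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}
```

A probe could be invoked as `region_healthy 3 10 curl -fsS https://primary.example.com/health || echo "primary region unhealthy"`, where the URL is a placeholder. The result feeds the alert only; the decision to fail over stays with the operations team as described above.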
Continuity Testing
- Scheduled Tests: Quarterly failover exercises
- Unscheduled Drills: Random continuity drills (no advance notice to team)
- Documentation Updates: Procedures updated based on test results
Responsibilities
Backup Management
- DevOps Team: Responsible for backup configuration and monitoring
- Database Team: Responsible for database backup validation
- Security Team: Responsible for backup security and access controls
Recovery Operations
- Incident Commander: Coordinates recovery operations
- Database Team: Executes database recovery procedures
- Application Team: Validates application functionality after recovery
- Network Team: Ensures network connectivity during recovery
Documentation and Training
Documentation
- Backup Procedures: Detailed step-by-step backup instructions
- Recovery Playbooks: Scenario-based recovery instructions
- Contact Information: Emergency contacts and escalation paths
Training
- Regular Drills: Quarterly backup and recovery exercises
- New Team Members: Onboarding includes backup/recovery training
- Refresher Training: Annual review of procedures for all team members
Compliance and Auditing
Backup Compliance
- PCI DSS Requirements: Secure backup of cardholder data
- GDPR Requirements: Proper handling of personal data in backups
- Internal Policies: Adherence to company data protection policies
Audit Trail
- Backup Logs: Detailed logs of all backup operations
- Restore Logs: Documentation of all restore operations
- Access Logs: Records of all access to backup systems
Appendix
Backup Schedule
Backup Type | Schedule | Start Time | Expected Duration |
---|---|---|---|
Database Full Backup | Daily | 01:00 UTC | 30 minutes |
Database Incremental | Hourly | HH:15 UTC | 5 minutes |
Configuration Backup | Daily | 02:00 UTC | 10 minutes |
File Storage Snapshot | Daily | 03:00 UTC | 20 minutes |
Recovery Testing Schedule
Test Type | Frequency | Last Performed | Next Scheduled |
---|---|---|---|
Database Restore | Monthly | 2023-06-15 | 2023-07-15 |
Full System Recovery | Quarterly | 2023-04-10 | 2023-07-10 |
Regional Failover | Quarterly | 2023-05-22 | 2023-08-22 |