
Backup and Recovery

Overview

This document outlines the backup and recovery procedures for the Gustaffo Reservations Application. It covers database backups, application state, disaster recovery, and business continuity planning.

Backup Strategy

Database Backups

PostgreSQL Database

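| Backup Type   | Frequency  | Retention | Storage Location        |
|---------------|------------|-----------|-------------------------|
| Full Backup   | Daily      | 30 days   | Cloud Storage           |
| WAL Archiving | Continuous | 7 days    | Cloud Storage           |
| Logical Dump  | Weekly     | 90 days   | Cloud Storage + Offsite |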

Backup Process:

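  1. Full Backups: Automated daily using pg_dump in custom format. A pg_dump archive is a logical backup and cannot serve as the base for point-in-time recovery; PITR additionally requires a physical base backup (for example, one taken with pg_basebackup) combined with the archived WAL described in step 2.

    pg_dump -Fc -v -h ${DB_HOST} -U ${DB_USER} -d ${DB_NAME} > ${BACKUP_DIR}/full_${TIMESTAMP}.dump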
    

  2. WAL Archiving: Continuous archiving of Write-Ahead Logs

    # In postgresql.conf
    archive_mode = on
    archive_command = 'aws s3 cp %p s3://gustaffo-backups/wal/%f'
    

  3. Logical Dumps: Weekly schema and data dumps

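    # Roles and tablespaces (cluster-wide objects that pg_dump does not include)
    pg_dumpall -g -h ${DB_HOST} -U ${DB_USER} > ${BACKUP_DIR}/globals_${TIMESTAMP}.sql
    pg_dump -Fc -v -h ${DB_HOST} -U ${DB_USER} -d ${DB_NAME} > ${BACKUP_DIR}/logical_${TIMESTAMP}.dump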
    
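The retention periods in the table above (30 days for full backups, 90 days for logical dumps) have to be enforced somewhere. For copies staged in ${BACKUP_DIR}, a pruning step could look like the sketch below. The function name prune_backups is hypothetical, the file patterns follow the naming used in the commands above, and GNU find is assumed. Objects already uploaded to cloud storage would more likely be expired by the storage provider's lifecycle rules.

```shell
# prune_backups DIR PATTERN DAYS  (hypothetical helper, not part of the original tooling)
# Deletes regular files directly inside DIR that match PATTERN and were last
# modified more than DAYS full days ago. Prints each deleted path.
prune_backups() {
  local dir="$1" pattern="$2" days="$3"
  find "$dir" -maxdepth 1 -type f -name "$pattern" -mtime +"$days" -print -delete
}
```

It might be called once per backup type, for example prune_backups "${BACKUP_DIR}" 'full_*.dump' 30 and prune_backups "${BACKUP_DIR}" 'logical_*.dump' 90.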

Application State

Configuration Backups

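| Component            | Backup Method  | Frequency    | Retention |
|----------------------|----------------|--------------|-----------|
| Kubernetes Manifests | Git Repository | Every change | Permanent |
| Helm Charts          | Git Repository | Every change | Permanent |
| ConfigMaps & Secrets | Export Script  | Daily        | 30 days   |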

Backup Process:

  1. Kubernetes Resources:

    kubectl get -n gustaffo-reservations configmap,secret -o yaml > ${BACKUP_DIR}/k8s_config_${TIMESTAMP}.yaml
    

  2. Helm Releases:

    helm list -n gustaffo-reservations -o yaml > ${BACKUP_DIR}/helm_releases_${TIMESTAMP}.yaml
    
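Secrets exported with -o yaml are only base64-encoded, not encrypted, so the export file deserves at least restrictive permissions. The sketch below wraps any export command so that its output lands in a file readable by the owner only. The helper name secure_export is hypothetical, and encryption at rest is a separate concern that this does not address.

```shell
# secure_export OUT_FILE CMD [ARGS...]  (hypothetical helper)
# Runs CMD and writes its stdout to OUT_FILE with mode 600.
# Returns CMD's exit status.
secure_export() {
  local out_file="$1"
  shift
  (
    umask 077                # files created in this subshell are owner-only
    rm -f -- "$out_file"     # drop any older copy that had looser permissions
    "$@" > "$out_file"
  )
}
```

Applied to the export above, this could read: secure_export "${BACKUP_DIR}/k8s_config_${TIMESTAMP}.yaml" kubectl get -n gustaffo-reservations configmap,secret -o yaml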

File Storage

| Data Type      | Backup Method | Frequency | Retention |
|----------------|---------------|-----------|-----------|
| Uploaded Files | Snapshot      | Daily     | 30 days   |
| Generated PDFs | Snapshot      | Daily     | 30 days   |
| Logs           | Archive       | Weekly    | 90 days   |

Backup Process:

  1. Storage Snapshots:

    # Example for AWS EBS volumes
    aws ec2 create-snapshot --volume-id ${VOLUME_ID} --description "Daily backup ${TIMESTAMP}"
    

  2. Log Archiving:

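    # Compress and archive logs older than 7 days
    # (NUL-delimited list fed to a single GNU tar call, so large file sets and odd filenames end up in one archive)
    find ${LOG_DIR} -type f -name "*.log" -mtime +7 -print0 | tar -czvf ${BACKUP_DIR}/logs_${TIMESTAMP}.tar.gz --null -T -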
    

Backup Verification

Automated Testing

  • Restore Testing: Weekly automated restore tests to validate backup integrity
  • Data Validation: Automated checks for database consistency after restore
  • Application Testing: Basic functionality tests with restored data
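
The document does not say how integrity is checked before a restore test begins. One inexpensive pre-check, sketched here under the assumption that GNU coreutils is available, is to store a SHA-256 digest next to each backup file when it is written and to verify it before any restore. The function names are hypothetical, and a matching digest only shows the file is unchanged since it was written; it does not replace the restore test itself.

```shell
# write_checksum FILE  (hypothetical helper)
# Records FILE's SHA-256 digest in FILE.sha256, using a relative name so the
# pair can be moved to another directory together.
write_checksum() {
  local file="$1"
  ( cd "$(dirname "$file")" &&
    sha256sum "$(basename "$file")" > "$(basename "$file").sha256" )
}

# verify_checksum FILE  (hypothetical helper)
# Succeeds only if FILE still matches the digest stored in FILE.sha256.
verify_checksum() {
  local file="$1"
  ( cd "$(dirname "$file")" &&
    sha256sum -c --status "$(basename "$file").sha256" )
}
```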

Manual Verification

  • Quarterly Exercises: Complete restore exercises performed quarterly
  • Validation Checklist: Detailed verification of restored systems
  • Documentation Review: Update procedures based on exercise findings

Recovery Procedures

Database Recovery

Full Database Restore

  1. Stop Applications:

    kubectl scale deployment -n gustaffo-reservations --replicas=0 -l app.kubernetes.io/part-of=gustaffo-reservations
    

  2. Prepare Environment:

    # Create empty database if needed (connect to the postgres maintenance database to issue CREATE DATABASE)
    psql -h ${DB_HOST} -U ${DB_USER} -d postgres -c "CREATE DATABASE ${DB_NAME};"
    

  3. Restore from Backup:

    # Full restore
    pg_restore -h ${DB_HOST} -U ${DB_USER} -d ${DB_NAME} -v ${BACKUP_FILE}
    

  4. Verify Restore:

    # Run basic validation queries
    psql -h ${DB_HOST} -U ${DB_USER} -d ${DB_NAME} -c "SELECT count(*) FROM reservations;"
    

  5. Restart Applications:

    kubectl scale deployment -n gustaffo-reservations --replicas=${REPLICA_COUNT} -l app.kubernetes.io/part-of=gustaffo-reservations
    
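The restart step needs ${REPLICA_COUNT}, but once the deployments are scaled to zero the previous counts can no longer be read from the live objects, and individual deployments may run different counts. A possible approach, sketched with hypothetical function names, is to record the counts before stopping the applications and replay them afterwards.

```shell
# save_replica_counts NAMESPACE SELECTOR OUT_FILE  (hypothetical helper)
# Writes one "name replicas" line per matching deployment to OUT_FILE.
save_replica_counts() {
  local namespace="$1" selector="$2" out_file="$3"
  kubectl get deployment -n "$namespace" -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.replicas}{"\n"}{end}' \
    > "$out_file"
}

# restore_replica_counts NAMESPACE IN_FILE  (hypothetical helper)
# Scales each recorded deployment back to its saved replica count.
restore_replica_counts() {
  local namespace="$1" in_file="$2" name replicas
  while read -r name replicas; do
    [ -n "$name" ] || continue
    # stdin is redirected so the command cannot consume the rest of IN_FILE
    kubectl scale deployment "$name" -n "$namespace" --replicas="$replicas" < /dev/null
  done < "$in_file"
}
```

save_replica_counts would run before the stop step and restore_replica_counts in place of the single scale command in the restart step.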

Point-in-Time Recovery

  1. Determine Recovery Target Time:

    # Set the target time for recovery
    RECOVERY_TARGET_TIME="2023-07-15 14:30:00"
    

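  2. Restore Base Backup: Stop PostgreSQL and unpack a physical base backup into an empty data directory. A pg_dump archive restored with pg_restore cannot be rolled forward with WAL, so the base backup must be a file-level copy (for example, one taken with pg_basebackup).

    # Assumes a gzip-compressed tar base backup and ${PGDATA} pointing at the empty data directory
    tar -xzf ${BASE_BACKUP} -C ${PGDATA}
    

  3. Configure Recovery: On PostgreSQL 12 and later, add the settings below to postgresql.conf and create an empty recovery.signal file in the data directory. On PostgreSQL 11 and earlier, place them in recovery.conf instead.

    restore_command = 'aws s3 cp s3://gustaffo-backups/wal/%f %p'
    recovery_target_time = '${RECOVERY_TARGET_TIME}'
    

  4. Apply WAL Files: Start PostgreSQL. The server fetches archived WAL through restore_command and replays it up to the target time.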

  5. Verify Recovery:

    # Verify data at recovery point
    psql -h ${DB_HOST} -U ${DB_USER} -d ${DB_NAME} -c "SELECT max(created_at) FROM reservations;"
    
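For PostgreSQL 12 and later, where recovery settings live in postgresql.conf and targeted recovery is requested through a recovery.signal file, the configuration step might be scripted as in the sketch below. The function name configure_pitr is hypothetical, the restore_command is the one shown in this section, and the function expects the data directory to already hold the restored base backup with the server stopped.

```shell
# configure_pitr PGDATA TARGET_TIME  (hypothetical helper, PostgreSQL 12+)
# Appends the recovery settings to PGDATA/postgresql.conf and creates an
# empty PGDATA/recovery.signal so the next server start enters recovery.
configure_pitr() {
  local pgdata="$1" target_time="$2"
  cat >> "$pgdata/postgresql.conf" <<EOF
restore_command = 'aws s3 cp s3://gustaffo-backups/wal/%f %p'
recovery_target_time = '$target_time'
EOF
  : > "$pgdata/recovery.signal"
}
```

A call could look like configure_pitr "${PGDATA}" "2023-07-15 14:30:00+00". Including an explicit UTC offset in the target time avoids depending on the server's time zone setting.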

Application Recovery

Configuration Restore

  1. Apply Kubernetes Resources:

    kubectl apply -f ${BACKUP_DIR}/k8s_config_${TIMESTAMP}.yaml
    

  2. Reinstall Helm Releases if needed:

    helm upgrade --install -n gustaffo-reservations -f ${VALUES_FILE} ${RELEASE_NAME} ${CHART}
    

File Storage Recovery

  1. Restore from Snapshot:

    # Example for AWS EBS
    aws ec2 create-volume --availability-zone ${AZ} --snapshot-id ${SNAPSHOT_ID}
    

  2. Attach to Instances:

    aws ec2 attach-volume --volume-id ${VOLUME_ID} --instance-id ${INSTANCE_ID} --device ${DEVICE_NAME}
    
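A volume created from a snapshot cannot be attached until it reaches the available state, so the two steps above need a wait between them when scripted. The sketch below chains them with the AWS CLI waiter; the function name restore_volume is hypothetical, and credentials and region are assumed to be configured already.

```shell
# restore_volume AZ SNAPSHOT_ID INSTANCE_ID DEVICE  (hypothetical helper)
# Creates a volume from the snapshot, waits until it is available, attaches
# it to the instance, and prints the new volume ID.
restore_volume() {
  local az="$1" snapshot_id="$2" instance_id="$3" device="$4"
  local volume_id
  volume_id="$(aws ec2 create-volume \
    --availability-zone "$az" --snapshot-id "$snapshot_id" \
    --query 'VolumeId' --output text)" || return 1
  aws ec2 wait volume-available --volume-ids "$volume_id" || return 1
  aws ec2 attach-volume --volume-id "$volume_id" \
    --instance-id "$instance_id" --device "$device" > /dev/null || return 1
  printf '%s\n' "$volume_id"
}
```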

Disaster Recovery

Disaster Scenarios

  1. Database Corruption/Failure

    • Impact: Data loss or unavailability
    • Recovery: Restore from latest backup, apply WAL logs
    • RTO: 1 hour
    • RPO: 15 minutes (maximum data loss)
  2. Application Infrastructure Failure

    • Impact: Service unavailability
    • Recovery: Redeploy to secondary region
    • RTO: 30 minutes
    • RPO: 5 minutes
  3. Primary Region Outage

    • Impact: Complete service unavailability
    • Recovery: Activate secondary region
    • RTO: 1 hour
    • RPO: 15 minutes
  4. Data Center Destruction

    • Impact: Complete loss of infrastructure
    • Recovery: Rebuild in alternative region
    • RTO: 4 hours
    • RPO: 15 minutes

Recovery Time Objectives (RTO)

| Service Component     | RTO        | Description                                                |
|-----------------------|------------|------------------------------------------------------------|
| Database Services     | 1 hour     | Time to restore database and verify integrity              |
| API Services          | 30 minutes | Time to redeploy and validate API functionality            |
| Frontend Applications | 15 minutes | Time to redeploy and validate frontend                     |
| Complete System       | 4 hours    | Time to recover all components after catastrophic failure  |

Recovery Point Objectives (RPO)

| Data Type          | RPO        | Description                                       |
|--------------------|------------|---------------------------------------------------|
| Reservation Data   | 15 minutes | Maximum acceptable data loss for reservations     |
| Payment Data       | 5 minutes  | Maximum acceptable data loss for payment records  |
| Configuration Data | 24 hours   | Maximum acceptable data loss for configuration    |
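
An RPO only holds if the newest recoverable copy is never older than the stated limit, which can be monitored with simple arithmetic. The sketch below compares the age of the most recent archive against an RPO in minutes. The function name within_rpo is hypothetical, and how the timestamp of the latest archived WAL segment or backup is obtained is left as an assumption.

```shell
# within_rpo LAST_EPOCH NOW_EPOCH RPO_MINUTES  (hypothetical helper)
# Succeeds if the newest archive (LAST_EPOCH, seconds since the epoch) is no
# older than RPO_MINUTES at time NOW_EPOCH.
within_rpo() {
  local last="$1" now="$2" rpo_minutes="$3"
  local age_seconds=$(( now - last ))
  [ "$age_seconds" -le $(( rpo_minutes * 60 )) ]
}
```

For the 15-minute reservation RPO, a monitoring job might run within_rpo "${LAST_WAL_EPOCH}" "$(date +%s)" 15 and alert on failure, where LAST_WAL_EPOCH is a placeholder. Because a WAL segment is archived only once it is complete, a quiet database may also need archive_timeout set in postgresql.conf to keep the archive this fresh.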

Business Continuity Plan

Service Continuity

Multi-Region Deployment

The application is deployed across multiple regions to ensure continuity:

  • Primary Region: EU-West (Ireland)
  • Secondary Region: EU-Central (Frankfurt)
  • DR Region: US-East (Virginia)

Failover Process

  1. Automated Detection: Monitoring systems detect regional failure
  2. Alert Notification: Operations team notified of potential failover need
  3. Decision Point: Operations team evaluates need for failover
  4. DNS Failover: Update DNS records to point to secondary region
  5. Database Promotion: Promote database replica in secondary region
  6. Verification: Validate application functionality in secondary region
  7. Notification: Inform stakeholders of failover completion
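
The automated detection in step 1 is not specified further. A common pattern, sketched here as an assumption, is to treat a region as failed only after several consecutive health probes fail, which keeps a single dropped request from paging the operations team. The function name region_down and the PROBE_INTERVAL variable are hypothetical, and the sketch covers detection only; the failover decision stays with the operations team as described above.

```shell
# region_down ATTEMPTS PROBE_CMD [ARGS...]  (hypothetical helper)
# Succeeds (exit 0) only if PROBE_CMD fails ATTEMPTS times in a row.
# Waits PROBE_INTERVAL seconds (default 0) after each failed probe.
region_down() {
  local attempts="$1" i=0
  shift
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 1               # one healthy probe is enough to call off the alert
    fi
    i=$(( i + 1 ))
    sleep "${PROBE_INTERVAL:-0}"
  done
  return 0
}
```

With curl as the probe, a check could read: region_down 3 curl -fsS --max-time 5 "${HEALTH_URL}", where HEALTH_URL stands for whatever health endpoint the primary region exposes.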

Continuity Testing

  • Scheduled Tests: Quarterly failover exercises
  • Unscheduled Drills: Random continuity drills (no advance notice to team)
  • Documentation Updates: Procedures updated based on test results

Responsibilities

Backup Management

  • DevOps Team: Responsible for backup configuration and monitoring
  • Database Team: Responsible for database backup validation
  • Security Team: Responsible for backup security and access controls

Recovery Operations

  • Incident Commander: Coordinates recovery operations
  • Database Team: Executes database recovery procedures
  • Application Team: Validates application functionality after recovery
  • Network Team: Ensures network connectivity during recovery

Documentation and Training

Documentation

  • Backup Procedures: Detailed step-by-step backup instructions
  • Recovery Playbooks: Scenario-based recovery instructions
  • Contact Information: Emergency contacts and escalation paths

Training

  • Regular Drills: Quarterly backup and recovery exercises
  • New Team Members: Onboarding includes backup/recovery training
  • Refresher Training: Annual review of procedures for all team members

Compliance and Auditing

Backup Compliance

  • PCI DSS Requirements: Secure backup of cardholder data
  • GDPR Requirements: Proper handling of personal data in backups
  • Internal Policies: Adherence to company data protection policies

Audit Trail

  • Backup Logs: Detailed logs of all backup operations
  • Restore Logs: Documentation of all restore operations
  • Access Logs: Records of all access to backup systems

Appendix

Backup Schedule

| Backup Type           | Schedule | Start Time | Expected Duration |
|-----------------------|----------|------------|-------------------|
| Database Full Backup  | Daily    | 01:00 UTC  | 30 minutes        |
| Database Incremental  | Hourly   | HH:15 UTC  | 5 minutes         |
| Configuration Backup  | Daily    | 02:00 UTC  | 10 minutes        |
| File Storage Snapshot | Daily    | 03:00 UTC  | 20 minutes        |

Recovery Testing Schedule

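| Test Type            | Frequency | Last Performed | Next Scheduled |
|----------------------|-----------|----------------|----------------|
| Database Restore     | Monthly   | 2023-06-15     | 2023-07-15     |
| Full System Recovery | Quarterly | 2023-04-10     | 2023-07-10     |
| Regional Failover    | Quarterly | 2023-05-22     | 2023-08-22     |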