Explain the importance of resilience and recovery in security architecture
Systems will fail. Breaches will happen. Resilience is about continuing to operate during a disruption; recovery is about getting back to normal afterward. Both must be designed into the architecture, not bolted on after an incident.
High Availability
Load Balancing
Distributing workloads across multiple servers so no single server is a bottleneck or single point of failure.
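- Active-active: All servers handle traffic simultaneously. Best performance and redundancy, provided the remaining servers can absorb the load when one fails.
- Active-passive: Standby servers take over only when the primary fails. Simpler, but the standby capacity sits idle until failover.
- Layer 4 (transport) vs. Layer 7 (application) load balancing: Layer 4 routes on IP address and port without inspecting the payload; Layer 7 routes on application content such as HTTP headers, URLs, or cookies.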
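The difference between the two modes can be sketched in a few lines. This is only an illustration, not how any particular load balancer is implemented: the function names, the server labels, and the idea of passing in a precomputed set of healthy servers are all assumptions made for the example.

```python
def pick_active_active(servers, healthy, request_number):
    """Round-robin across every server that is currently healthy."""
    pool = [s for s in servers if s in healthy]
    if not pool:
        raise RuntimeError("no healthy servers")
    return pool[request_number % len(pool)]


def pick_active_passive(servers, healthy):
    """Send all traffic to the first healthy server in priority order.

    servers[0] is treated as the primary; later entries are standbys.
    """
    for server in servers:
        if server in healthy:
            return server
    raise RuntimeError("no healthy servers")


servers = ["app1", "app2", "app3"]  # hypothetical server names

# All servers healthy: active-active spreads requests, active-passive uses app1 only.
all_up = {"app1", "app2", "app3"}
spread = [pick_active_active(servers, all_up, n) for n in range(4)]
primary = pick_active_passive(servers, all_up)

# app1 fails: active-active shrinks the pool, active-passive fails over to app2.
degraded = {"app2", "app3"}
failover = pick_active_passive(servers, degraded)
```

In both modes the health check is what turns redundancy into availability: a balancer that keeps sending traffic to a dead server provides no resilience.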
Clustering
Multiple servers operating as a single logical system. If one node fails, others continue serving.
- Database clusters, application server clusters
- Requires shared or replicated state between nodes
Redundancy
Eliminating single points of failure across every layer:
- Server: Multiple servers behind load balancers
- Network: Dual ISPs, redundant switches/routers, diverse routing paths
- Power: Dual power supplies, UPS (Uninterruptible Power Supply), generators
- Storage: RAID, replicated databases, distributed storage
- Site: Geographic redundancy (hot/warm/cold sites)
RAID (Redundant Array of Independent Disks)
| Level | Description | Fault Tolerance | Min Disks |
|---|---|---|---|
| RAID 0 | Striping. Performance, no redundancy. | None: if one disk fails, all data is lost | 2 |
| RAID 1 | Mirroring. Exact copy on two disks. | Survives one disk failure | 2 |
| RAID 5 | Striping with distributed parity. | Survives one disk failure | 3 |
| RAID 6 | Striping with double parity. | Survives two disk failures | 4 |
| RAID 10 | Mirror + stripe (1+0). | Survives one failure per mirror pair | 4 |
Exam tip: RAID is not a backup. It protects against disk failure, not against data corruption, ransomware, or accidental deletion. You still need backups.
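To make the "distributed parity" entry for RAID 5 in the table above concrete, the sketch below shows the XOR idea behind single-parity recovery. It is a simplified model: real RAID 5 rotates parity across disks and works on stripes, and the block contents and the `xor_blocks` helper are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per disk (illustrative contents).
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data)

# Simulate losing disk 1: XOR the surviving data blocks with the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

The same arithmetic explains the exam tip: parity faithfully reproduces whatever was written, so a corrupted or encrypted block is rebuilt just as reliably as a good one.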
Backup Strategies
Types
- Full: Complete copy of all data. Slowest to create, fastest to restore.
- Incremental: Only data changed since the last backup (any type). Fast to create, slower to restore (needs the last full backup plus every incremental since).
- Differential: All data changed since the last full backup. Middle ground: each differential is larger than an incremental, but a restore needs only the last full backup plus the latest differential.
- Snapshot: Point-in-time copy of a volume or VM state. Typically fast to create because of copy-on-write. Used for quick recovery but not a replacement for offsite backups, since snapshots usually live on the same storage as the data they protect.
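The restore-time trade-off between incremental and differential backups comes down to how many backup sets a restore has to read. The sketch below works that out from the definitions above; the `restore_chain` function and the day labels are hypothetical, and real backup software tracks this in its own catalog.

```python
def restore_chain(backups):
    """Return the labels of the backups needed to restore to the latest state.

    backups is a chronological list of (label, kind) tuples, where kind is
    "full", "incremental", or "differential".
    """
    full_positions = [i for i, (_, kind) in enumerate(backups) if kind == "full"]
    if not full_positions:
        raise ValueError("a restore needs at least one full backup")

    start = full_positions[-1]
    chain = [backups[start]]
    later = backups[start + 1:]

    # The newest differential already contains every change since the full,
    # so anything taken before it can be skipped.
    diff_positions = [i for i, (_, kind) in enumerate(later) if kind == "differential"]
    if diff_positions:
        newest_diff = diff_positions[-1]
        chain.append(later[newest_diff])
        later = later[newest_diff + 1:]

    # Every remaining incremental is needed, in order.
    chain.extend(b for b in later if b[1] == "incremental")
    return [label for label, _ in chain]


incremental_week = [("sun", "full"), ("mon", "incremental"),
                    ("tue", "incremental"), ("wed", "incremental")]
differential_week = [("sun", "full"), ("mon", "differential"),
                     ("tue", "differential"), ("wed", "differential")]

print(restore_chain(incremental_week))
print(restore_chain(differential_week))
```

A longer incremental chain also means more places for a single damaged backup set to break the restore, which is one reason schedules usually include periodic full backups.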
3-2-1 Rule
- 3 copies of data (primary + 2 backups)
- 2 different media types (disk + tape, disk + cloud)
- 1 offsite copy (geographically separate)
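The rule is simple enough to check mechanically against an inventory of data copies. The sketch below is one way to express it; the `DataCopy` record, its fields, and the example inventory are assumptions for illustration rather than the schema of any real backup tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str      # hypothetical label for the copy
    media: str     # e.g. "disk", "tape", "cloud"
    offsite: bool  # True if stored at a geographically separate location


def meets_3_2_1(copies):
    """Check the 3-2-1 rule: 3+ copies, 2+ media types, 1+ offsite copy."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


inventory = [
    DataCopy("production volume", "disk", offsite=False),
    DataCopy("local backup server", "disk", offsite=False),
    DataCopy("cloud object storage", "cloud", offsite=True),
]

print(meets_3_2_1(inventory))
```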
Testing
Backups that aren’t tested are assumptions, not backups. Regular restore tests verify:
- Data integrity (restored data matches original)
- Recovery time (meets RTO)
- Process documentation accuracy
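The data integrity check can be automated by comparing cryptographic hashes of the original and restored files. The following is a minimal sketch that assumes both directory trees are reachable at test time; the function names are hypothetical, and in practice the comparison is often made against a checksum manifest recorded when the backup was taken.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original_dir, restored_dir):
    """Return relative paths that are missing or differ in the restored tree."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file() or sha256_of(source) != sha256_of(restored):
            problems.append(str(relative))
    return problems
```

An empty list from `verify_restore` covers only the first bullet. Timing the restore itself is what confirms the RTO, and having someone other than the author follow the runbook is what tests the documentation.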
Recovery Sites
Hot Site
Fully equipped, real-time data replication, ready to take over immediately.
- RTO: Minutes to hours
- Most expensive. Maintained continuously with live data sync.
Warm Site
Equipment installed but not fully configured. Data must be restored from backups.
- RTO: Hours to days
- Balance between cost and recovery speed
Cold Site
Empty facility with power, cooling, and network connectivity. No equipment pre-installed.
- RTO: Days to weeks
- Cheapest. Equipment must be procured and configured after activation.
Cloud-Based Recovery
- DRaaS (Disaster Recovery as a Service): Cloud provider hosts your recovery environment
- Scales on demand: pay for minimal standby capacity, then spin up the full environment during a disaster
- Eliminates physical site management
Recovery Metrics
RTO (Recovery Time Objective)
Maximum acceptable time to restore operations after a disruption.
- “We must be back online within 4 hours.”
- Drives decisions about recovery site type, backup strategy, and automation investment.
RPO (Recovery Point Objective)
Maximum acceptable amount of data loss measured in time.
- “We can afford to lose no more than 1 hour of data.”
- RPO of 1 hour → backups/replication must happen at least hourly.
- RPO of 0 → requires real-time synchronous replication.
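The link between RPO and backup frequency is that the worst-case data loss equals the longest gap between recovery points. A small sketch of that check follows; the function names and the timestamps are invented for the example, and it ignores the time a backup takes to complete.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times):
    """Return the largest gap between consecutive backups.

    A failure just before the next backup loses everything written
    since the previous one, so the largest gap is the worst case.
    """
    if len(backup_times) < 2:
        raise ValueError("need at least two backups to measure a gap")
    ordered = sorted(backup_times)
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


def meets_rpo(backup_times, rpo):
    """True if no gap between backups exceeds the RPO."""
    return worst_case_data_loss(backup_times) <= rpo


start = datetime(2024, 1, 1, 0, 0)  # arbitrary example date
hourly = [start + timedelta(hours=h) for h in range(6)]
with_missed_run = [t for t in hourly if t.hour != 3]  # the 03:00 job failed

print(meets_rpo(hourly, timedelta(hours=1)))
print(meets_rpo(with_missed_run, timedelta(hours=1)))
```

The second case shows why backup job failures need alerting: a single missed run silently doubles the exposure.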
MTBF (Mean Time Between Failures)
Average operating time between failures of a repairable system. Measure of reliability.
- Higher MTBF = more reliable system
- Used to predict failure frequency and plan maintenance
MTTR (Mean Time to Repair)
Average time to restore a system after failure. Measure of maintainability.
- Lower MTTR = faster recovery
- Improved by automation, documentation, spare parts availability
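MTBF and MTTR are commonly combined into a steady-state availability estimate, availability = MTBF / (MTBF + MTTR). The sketch below applies that relationship; the figures are illustrative only, and the formula assumes failures and repairs behave like their long-run averages.

```python
def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is expected to be up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours, mttr_hours, hours_per_year=8760):
    """Expected hours of downtime per year at the given MTBF and MTTR."""
    return (1 - availability(mtbf_hours, mttr_hours)) * hours_per_year


# Illustrative figures: a failure roughly every 999 hours, 1 hour to repair.
baseline = availability(999, 1)

# Halving MTTR improves availability without making failures any rarer.
faster_repair = availability(999, 0.5)

print(f"{baseline:.4%} available, {annual_downtime_hours(999, 1):.2f} h/year down")
print(f"{faster_repair:.4%} available, {annual_downtime_hours(999, 0.5):.2f} h/year down")
```

This is why the MTTR improvements listed above matter: reducing repair time is often cheaper than buying hardware with a higher MTBF.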
Power Resilience
UPS (Uninterruptible Power Supply)
Battery backup that provides immediate power during outages.
- Bridges the gap between power loss and generator startup (generators typically take 10-30 seconds to come online)
- Also provides power conditioning (surge protection, voltage regulation)
Generator
Diesel or natural gas backup power for extended outages.
- Requires fuel supply and regular testing
- Automatic transfer switch (ATS) manages failover from utility to generator
PDU (Power Distribution Unit)
Distributes power to rack-mounted equipment. Managed PDUs allow remote power cycling and monitoring.
Capacity Planning
Ensuring infrastructure can handle current and future demand:
- People: Sufficient staff for security operations, incident response, recovery
- Technology: Processing capacity, storage, network bandwidth
- Infrastructure: Data center space, power, cooling
Underprovisioning creates availability risk. Overprovisioning wastes resources. Capacity planning balances both against business requirements and growth projections.
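A growth projection turns this balance into a date by which more capacity must be in place. The sketch below assumes compound monthly growth and an 80% utilization ceiling; both are assumptions chosen for the example rather than recommended values, and the function name is hypothetical.

```python
import math


def months_until_full(current_usage, capacity, monthly_growth_rate, headroom=0.8):
    """Months until usage reaches headroom * capacity under compound growth.

    Returns 0 if the ceiling is already reached and None if usage is not
    growing. current_usage must be greater than zero.
    """
    limit = capacity * headroom
    if current_usage >= limit:
        return 0
    if monthly_growth_rate <= 0:
        return None
    months = math.log(limit / current_usage) / math.log(1 + monthly_growth_rate)
    return math.ceil(months)


# Example: 50 TB used of 100 TB, growing 5% per month, upgrade trigger at 80%.
print(months_until_full(50, 100, 0.05))
```

Comparing the result with procurement and installation lead times shows whether the upgrade needs to be ordered now or can wait.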
Offensive Context
Resilience is what prevents a successful attack from becoming a catastrophe. An attacker who detonates ransomware against an organization with tested backups, a warm site, and a 4-hour RTO has caused an inconvenience. The same attack against an organization with untested backups and no recovery site is an existential threat. Attackers increasingly target backup systems specifically (deleting shadow copies, encrypting backup servers, compromising cloud backup credentials) because they know destroying recovery capability maximizes leverage. Your backup architecture must assume the attacker will try to destroy it.