3.4: Explain the importance of resilience and recovery in security architecture // Wolf//Sec

Systems will fail. Breaches will happen. Resilience is about continuing to operate during disruption; recovery is about getting back to normal after. Both must be designed into the architecture, not bolted on after an incident.

High Availability

Load Balancing

Distributing workloads across multiple servers so no single server is a bottleneck or single point of failure.

Active-active: All servers handle traffic simultaneously. Best performance and redundancy.
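Active-passive: Standby servers take over only when the primary fails. Simpler, but standby capacity sits idle until a failover.
Layer 4 (transport) vs. Layer 7 (application) load balancing: Layer 4 routes on IP address and port without inspecting content; Layer 7 routes on application data such as URLs, headers, or cookies.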
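
The two modes above can be sketched in a few lines. This is a minimal illustration, not how any particular load balancer is implemented; the server names and the health dictionary are hypothetical.

```python
from itertools import cycle

def active_active(servers, healthy):
    """Round-robin over every healthy server: all nodes carry traffic."""
    pool = [s for s in servers if healthy[s]]
    if not pool:
        raise RuntimeError("no healthy servers")
    return cycle(pool)

def active_passive(servers, healthy):
    """Return the first healthy server in priority order: the primary
    while it is up, otherwise the next standby."""
    for s in servers:
        if healthy[s]:
            return s
    raise RuntimeError("no healthy servers")

servers = ["app1", "app2", "app3"]          # hypothetical node names
health = {"app1": True, "app2": True, "app3": True}

# Active-active: requests rotate across all three nodes.
rotation = active_active(servers, health)
first_four = [next(rotation) for _ in range(4)]

# Active-passive: simulate a primary failure and pick the standby.
health["app1"] = False
failover_target = active_passive(servers, health)
```

In the active-active case every node must be sized so the survivors can absorb the load of a failed peer; in the active-passive case the standby does nothing until the primary's health check fails.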

Clustering

Multiple servers operating as a single logical system. If one node fails, others continue serving.

Database clusters, application server clusters
Requires shared or replicated state between nodes

Redundancy

Eliminating single points of failure across every layer:

Server: Multiple servers behind load balancers
Network: Dual ISPs, redundant switches/routers, diverse routing paths
Power: Dual power supplies, UPS (Uninterruptible Power Supply), generators
Storage: RAID, replicated databases, distributed storage
Site: Geographic redundancy (hot/warm/cold sites)

RAID (Redundant Array of Independent Disks)

Level	Description	Fault Tolerance	Min Disks
RAID 0	Striping. Performance, no redundancy.	None — one disk fails, all data lost	2
RAID 1	Mirroring. Exact copy on two disks.	Survives one disk failure	2
RAID 5	Striping with distributed parity.	Survives one disk failure	3
RAID 6	Striping with double parity.	Survives two disk failures	4
RAID 10	Mirror + stripe (1+0).	Survives one failure per mirror pair	4

Exam tip: RAID is not a backup. It protects against disk failure, not against data corruption, ransomware, or accidental deletion. You still need backups.
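
To see why single parity survives exactly one disk failure, the sketch below models parity as a byte-wise XOR of the data blocks in one stripe, which is the usual way RAID 5 parity is described. The block contents are made up for illustration, and usable_disks is a hypothetical helper that assumes identical disks.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR across equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe spread over three data disks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]          # illustrative contents
parity = xor_blocks(data)

# Disk 1 fails: XOR the surviving data blocks with parity to rebuild it.
rebuilt = xor_blocks([data[0], data[2], parity])

MIN_DISKS = {0: 2, 1: 2, 5: 3, 6: 4, 10: 4}  # from the table above

def usable_disks(level, disks):
    """Disks' worth of usable capacity, assuming identical disks."""
    if disks < MIN_DISKS[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DISKS[level]} disks")
    return {0: disks, 1: 1, 5: disks - 1, 6: disks - 2, 10: disks // 2}[level]
```

Parity is recomputed on every write, so a corrupted or encrypted block gets faithfully protected too. That is the mechanical reason RAID cannot stand in for a backup.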

Backup Strategies

Types

Full: Complete copy of all data. Slowest to create, fastest to restore.
Incremental: Only data changed since the last backup (any type). Fast to create, slower to restore (need the last full plus every incremental since, applied in order).
Differential: All data changed since the last full backup. Middle ground: larger than an incremental, but faster to restore (need only the last full plus the latest differential).
Snapshot: Point-in-time copy of a volume or VM state. Fast creation through copy-on-write. Used for quick recovery but not a replacement for offsite backups.
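
The restore-time difference between incremental and differential backups comes down to how many backup sets must be replayed. The following sketch computes that chain from a backup history; restore_chain and the history lists are hypothetical names used only for illustration.

```python
def restore_chain(history):
    """Return the indices of the backups needed to restore to the latest
    point, in the order they must be applied.

    history: backup kinds, oldest first, each one of
    "full", "incremental", or "differential".
    Raises ValueError if the history contains no full backup.
    """
    last_full = max(i for i, kind in enumerate(history) if kind == "full")
    chain = [last_full]
    start = last_full + 1

    # The latest differential already contains every change since the full,
    # so anything between the full and that differential can be skipped.
    diffs = [i for i in range(start, len(history)) if history[i] == "differential"]
    if diffs:
        chain.append(diffs[-1])
        start = diffs[-1] + 1

    # Incrementals taken after that point must all be replayed, in order.
    chain += [i for i in range(start, len(history)) if history[i] == "incremental"]
    return chain

incremental_week = ["full", "incremental", "incremental", "incremental"]
differential_week = ["full", "differential", "differential", "differential"]

incremental_sets = restore_chain(incremental_week)    # full + every incremental
differential_sets = restore_chain(differential_week)  # full + latest differential
```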

3-2-1 Rule

3 copies of data (primary + 2 backups)
2 different media types (disk + tape, disk + cloud)
1 offsite copy (geographically separate)
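
The rule is simple enough to check mechanically. A minimal sketch, assuming a hypothetical inventory of copies where each record carries a media type and an offsite flag:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataCopy:
    name: str       # hypothetical label for the copy
    media: str      # e.g. "disk", "tape", "cloud"
    offsite: bool   # stored at a geographically separate location

def meets_3_2_1(copies):
    """True if there are 3+ copies, on 2+ media types, with 1+ offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite

compliant = [
    DataCopy("production volume", "disk", offsite=False),
    DataCopy("local backup server", "disk", offsite=False),
    DataCopy("cloud backup", "cloud", offsite=True),
]
# Three copies, but all on disk in the same building.
non_compliant = [
    DataCopy("production volume", "disk", offsite=False),
    DataCopy("local backup server", "disk", offsite=False),
    DataCopy("second local backup", "disk", offsite=False),
]
```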

Testing

Backups that aren’t tested are assumptions, not backups. Regular restore tests verify:

Data integrity (restored data matches original)
Recovery time (meets RTO)
Process documentation accuracy
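
A restore test can automate the first two checks. The sketch below compares SHA-256 hashes of the original and restored files and compares the measured restore time against the RTO. The file copy stands in for a real restore job, and the 60-second RTO is an arbitrary illustrative value.

```python
import hashlib
import shutil
import tempfile
import time
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(original, restored, restore_seconds, rto_seconds):
    """Report whether the restored file matches and whether the RTO was met."""
    return {
        "integrity_ok": sha256_file(original) == sha256_file(restored),
        "rto_ok": restore_seconds <= rto_seconds,
    }

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original.db"
    restored = Path(tmp) / "restored.db"
    corrupted = Path(tmp) / "corrupted.db"
    original.write_bytes(b"customer records" * 1000)
    corrupted.write_bytes(b"customer rec0rds" * 1000)

    started = time.perf_counter()
    shutil.copyfile(original, restored)      # stand-in for the real restore job
    elapsed = time.perf_counter() - started

    good_result = verify_restore(original, restored, elapsed, rto_seconds=60)
    bad_result = verify_restore(original, corrupted, elapsed, rto_seconds=60)
```

In practice the reference hashes would be recorded at backup time and stored separately, since the original may no longer exist when a restore is needed.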

Recovery Sites

Hot Site

Fully equipped, real-time data replication, ready to take over immediately.

RTO: Minutes to hours
Most expensive. Maintained continuously with live data sync.

Warm Site

Equipment installed but not fully configured. Data must be restored from backups.

RTO: Hours to days
Balance between cost and recovery speed

Cold Site

Empty facility with power, cooling, and network connectivity. No equipment pre-installed.

RTO: Days to weeks
Cheapest. Equipment must be procured and configured after activation.

Cloud-Based Recovery

DRaaS (Disaster Recovery as a Service): Cloud provider hosts your recovery environment
Scales on demand: pay for minimal standby capacity, then spin up the full environment during a disaster
Eliminates physical site management

Recovery Metrics

RTO (Recovery Time Objective)

Maximum acceptable time to restore operations after a disruption.

“We must be back online within 4 hours.”
Drives decisions about recovery site type, backup strategy, and automation investment.

RPO (Recovery Point Objective)

Maximum acceptable amount of data loss measured in time.

“We can afford to lose no more than 1 hour of data.”
RPO of 1 hour → backups/replication must happen at least hourly.
RPO of 0 → requires real-time synchronous replication.
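
Whether a backup schedule actually meets an RPO can be checked from the backup timestamps: the worst-case data loss is the longest gap between consecutive recovery points, including the gap between the most recent backup and now. The function names and timestamps below are hypothetical.

```python
from datetime import datetime, timedelta

def worst_case_data_loss(backup_times, now):
    """Longest window during which new data had no recovery point.

    Requires at least one backup time; raises ValueError otherwise.
    """
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))

def meets_rpo(backup_times, now, rpo):
    return worst_case_data_loss(backup_times, now) <= rpo

backups = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 1, 0),
    datetime(2024, 1, 1, 2, 30),   # this run started 30 minutes late
]
now = datetime(2024, 1, 1, 3, 0)

gap = worst_case_data_loss(backups, now)               # 1 hour 30 minutes
hourly_ok = meets_rpo(backups, now, timedelta(hours=1))
two_hour_ok = meets_rpo(backups, now, timedelta(hours=2))
```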

MTBF (Mean Time Between Failures)

Average time a repairable system operates between one failure and the next. Measure of reliability.

Higher MTBF = more reliable system
Used to predict failure frequency and plan maintenance

MTTR (Mean Time to Repair)

Average time to restore a system after failure. Measure of maintainability.

Lower MTTR = faster recovery
Improved by automation, documentation, spare parts availability
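
MTBF and MTTR combine into a single availability figure through the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The formula is not stated in these notes but is the conventional way to relate the two; the hour values below are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365

def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def annual_downtime_hours(mtbf_hours, mttr_hours):
    """Expected hours of downtime per year at that availability."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR

# Same reliability (MTBF), different repair speed (MTTR).
baseline = availability(mtbf_hours=1000, mttr_hours=4)
improved = availability(mtbf_hours=1000, mttr_hours=1)

# With a 1-hour MTTR this works out to roughly 8.75 hours of downtime a year.
downtime = annual_downtime_hours(mtbf_hours=1000, mttr_hours=1)
```

Cutting MTTR from 4 hours to 1 hour improves availability without touching the hardware's failure rate, which is why automation and documentation show up as availability investments.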

Power Resilience

UPS (Uninterruptible Power Supply)

Battery backup that provides immediate power during outages.

Bridges the gap between power loss and generator startup (typically 10-30 seconds)
Also provides power conditioning (surge protection, voltage regulation)
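
A rough sizing check for that bridging role is sketched below. It assumes a simple linear model (usable energy divided by load), a 90% inverter efficiency, and a 2x safety margin; all three are illustrative assumptions, and real battery runtime is nonlinear, so vendor runtime charts should be used for actual sizing.

```python
def ups_runtime_seconds(battery_wh, load_watts, efficiency=0.9):
    """Estimated runtime under a simple linear model (an approximation)."""
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    return battery_wh * efficiency / load_watts * 3600

def bridges_generator_start(battery_wh, load_watts, generator_start_seconds,
                            safety_factor=2.0):
    """True if the estimated runtime covers generator startup with margin."""
    required = generator_start_seconds * safety_factor
    return ups_runtime_seconds(battery_wh, load_watts) >= required

# Hypothetical rack: 1000 Wh of battery feeding a 3000 W load.
runtime = ups_runtime_seconds(battery_wh=1000, load_watts=3000)
covers_startup = bridges_generator_start(1000, 3000, generator_start_seconds=30)
undersized = bridges_generator_start(10, 3000, generator_start_seconds=30)
```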

Generator

Diesel or natural gas backup power for extended outages.

Requires fuel supply and regular testing
Automatic transfer switch (ATS) manages failover from utility to generator

PDU (Power Distribution Unit)

Distributes power to rack-mounted equipment. Managed PDUs allow remote power cycling and monitoring.

Capacity Planning

Ensuring infrastructure can handle current and future demand:

People: Sufficient staff for security operations, incident response, recovery
Technology: Processing capacity, storage, network bandwidth
Infrastructure: Data center space, power, cooling

Underprovisioning creates availability risk. Overprovisioning wastes resources. Capacity planning balances both against business requirements and growth projections.
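
Growth projections can be turned into a lead-time estimate. The sketch below assumes steady compound monthly growth, which is a simplification; the function name and figures are hypothetical.

```python
import math

def months_until_full(current_usage, capacity, monthly_growth):
    """Months until usage reaches capacity under compound monthly growth.

    Returns 0 if already at or over capacity, and None if usage is not
    growing (capacity is never reached under this model).
    """
    if current_usage <= 0 or capacity <= 0:
        raise ValueError("usage and capacity must be positive")
    if current_usage >= capacity:
        return 0
    if monthly_growth <= 0:
        return None
    months = math.log(capacity / current_usage) / math.log(1 + monthly_growth)
    return math.ceil(months)

# 60 TB used of 100 TB, growing 5% per month.
runway = months_until_full(current_usage=60, capacity=100, monthly_growth=0.05)
```

If procurement and installation take longer than the runway, the expansion has to be ordered before utilization looks alarming.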

Offensive Context

Resilience is what prevents a successful attack from becoming a catastrophe. An attacker who detonates ransomware against an org with tested backups, a warm site, and a 4-hour RTO has caused an inconvenience. The same attack against an org with untested backups and no recovery site is an existential threat. Attackers increasingly target backup systems specifically (deleting shadow copies, encrypting backup servers, compromising cloud backup credentials) because they know destroying recovery capability maximizes leverage. Your backup architecture must assume the attacker will try to destroy it.