Resilience and Recovery in Security Architecture
High availability, hot/warm/cold sites, multi-cloud, backups, power, and the metrics — RTO, RPO, MTTR, MTBF — that turn “stay up” into engineering decisions.
Resilience is the engineering answer to “what happens when something fails”; recovery is the operational answer to “how do we get back up.” Most Domain 3.4 questions reduce to two numbers: RTO (how much downtime you can tolerate) and RPO (how much data you can afford to lose). Those numbers drive the other choices: hot vs. warm vs. cold site, replication vs. snapshots, active/active vs. active/passive.
Three mental hooks carry most questions: (1) HA is not DR. High availability addresses component-level failure (a server, a rack, or, in cloud terms, a single availability zone); disaster recovery addresses loss of the whole site or region, and the exam punishes conflation. (2) RTO and RPO are the forcing functions. Given a stated tolerance, the site type and backup strategy almost pick themselves. (3) A backup you have not restored is not a backup. Untested recovery is the most common and most testable failure in this domain.
High availability (HA). Redundant components and automatic failover so that a single failure does not take the service down. Two common patterns: load balancing distributes independent requests across a pool of servers (usually active/active — all nodes serve traffic), while clustering tightly couples nodes that share state or storage and present as a single logical system. Active/active uses every node and scales capacity; active/passive keeps a hot standby that takes over on failure — simpler but wastes capacity.
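The difference between the two patterns can be sketched in a few lines. This is a minimal illustration, not a real load balancer: the `Node` class, the node names, and both routing functions are hypothetical, and health is a simple flag rather than a real health check.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


def active_active_route(nodes, request_id):
    """Spread requests across every healthy node (round-robin by request id)."""
    pool = [n for n in nodes if n.healthy]
    if not pool:
        raise RuntimeError("no healthy nodes: service is down")
    return pool[request_id % len(pool)].name


def active_passive_route(nodes):
    """Send all traffic to the first healthy node; later nodes are standbys."""
    for node in nodes:
        if node.healthy:
            return node.name
    raise RuntimeError("no healthy nodes: service is down")


nodes = [Node("a"), Node("b")]
before = [active_active_route(nodes, i) for i in range(4)]  # alternates a, b
nodes[0].healthy = False                                    # simulate a failure
after = [active_active_route(nodes, i) for i in range(4)]   # all traffic on b
standby = active_passive_route(nodes)                       # standby b takes over
```

In both patterns the single failure is absorbed; what differs is whether node `b` was doing useful work before the failure.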
Site considerations. When a whole site is lost, an alternate site absorbs the workload. A hot site is a fully operational duplicate with data replicated continuously: near-zero RTO, highest cost. A cold site is an empty facility with power, cooling, and connectivity but no hardware or data: days to weeks to activate, lowest cost. A warm site sits between the two: hardware and network are in place, but data and apps still need staging, so activation takes hours to days. Geographic dispersion places sites in different regions so a regional disaster cannot take both down; this trades against latency, replication cost, and data sovereignty.
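As a rough sketch of how a stated RTO narrows the site choice, the function below returns the cheapest tier whose activation time fits. The activation times in `SITE_TIERS` are illustrative assumptions, not figures from the exam objectives; real values depend on the provider and on how well the runbook has been rehearsed.

```python
from datetime import timedelta

# Assumed time-to-operate per tier, listed cheapest first (illustrative only).
SITE_TIERS = [
    ("cold", timedelta(days=14)),
    ("warm", timedelta(hours=4)),
    ("hot", timedelta(minutes=15)),
]


def cheapest_site_for(rto: timedelta) -> str:
    """Return the cheapest tier whose assumed activation time fits the RTO."""
    for tier, time_to_operate in SITE_TIERS:
        if time_to_operate <= rto:
            return tier
    raise ValueError("no tier meets this RTO; revisit the objective or the design")


four_hour = cheapest_site_for(timedelta(hours=4))      # warm is enough
half_hour = cheapest_site_for(timedelta(minutes=30))   # only hot fits
one_month = cheapest_site_for(timedelta(days=30))      # cold is sufficient
```

Walking the list from cheapest to most expensive encodes the rule that the right tier is the least costly one that still meets the objective.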
Platform diversity and multi-cloud. Running on multiple operating systems, databases, or cloud providers reduces the blast radius of a single-vendor defect or outage. Multi-cloud spreads workloads across two or more cloud providers — advantages include resilience, lock-in avoidance, and negotiating leverage; disadvantages include inconsistent tooling, data-egress costs, and the operational overhead of maintaining two ways to do everything.
Continuity of operations. The Business Continuity Plan (BCP) is the business-wide plan for keeping the organization operating during and after a disruption — people, processes, facilities, vendors. The Disaster Recovery Plan (DRP) is the IT subset: how to restore technology services after a disaster. BCP is the umbrella; DRP sits inside it.
Capacity planning. Resilience requires headroom in three dimensions: people (on-call staffing, trained responders, surge support), technology (compute, storage, network headroom; scale-up vs. scale-out choices), and infrastructure (power, cooling, physical space). A DR plan that assumes unlimited hands at 3am is a plan that fails at 3am.
Testing. Resilience claims are hypotheses until tested. Tabletop exercises are discussion-based walkthroughs: cheapest, easiest to run often, lowest confidence. Simulations rehearse a specific failure scenario hands-on, either in a controlled environment or, in the chaos-engineering style, by injecting faults into real systems. Failover tests cut production over to the DR site: highest confidence and highest risk. Parallel processing runs production and DR concurrently to validate that DR output matches production without cutting over.
Backups. Backups are distinguished by where they live, how often they run, and how they protect the data. Onsite backups recover fast but share fate with the primary site; offsite backups survive site loss but recover more slowly. Frequency is set by RPO: one hour of allowable loss means backups at least hourly. Backups should be encrypted; unencrypted backup media is an easy exfiltration target. Snapshots capture point-in-time storage state. They are fast and space-efficient, but they usually live on the same storage array as the data, so they are not a substitute for real backups. Replication copies data continuously or near-continuously to a secondary location; RPO approaches zero, but bad writes are replicated too. Journaling stores transaction logs that can be replayed to a point in time (database-style recovery). Recovery testing is the control that proves the rest: a backup you have not restored is not a backup.
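Journaling is the least intuitive of these methods, so here is a minimal sketch of point-in-time recovery: start from the last full backup and replay journal entries up to a chosen moment. The `JournalEntry` type, the integer timestamps, and the sample data are all hypothetical; real databases implement this with their own log formats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JournalEntry:
    timestamp: int          # illustrative clock value, e.g. seconds
    key: str
    value: Optional[str]    # None records a deletion


def restore_to(backup: dict, journal: list, point_in_time: int) -> dict:
    """Start from the last full backup and replay entries up to point_in_time."""
    state = dict(backup)                       # never mutate the backup itself
    for entry in sorted(journal, key=lambda e: e.timestamp):
        if entry.timestamp > point_in_time:
            break                              # stop before later (bad) writes
        if entry.value is None:
            state.pop(entry.key, None)
        else:
            state[entry.key] = entry.value
    return state


backup = {"order-1": "shipped"}
journal = [
    JournalEntry(100, "order-2", "packed"),
    JournalEntry(200, "order-1", None),          # legitimate deletion
    JournalEntry(300, "order-2", "ENCRYPTED"),   # bad write we want to skip
]
recovered = restore_to(backup, journal, point_in_time=250)
```

Replaying to time 250 keeps the legitimate changes and stops before the bad write at time 300, which is what plain replication cannot do.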
Power. Resilience dies first at the wall socket. Uninterruptible Power Supply (UPS) bridges the seconds-to-minutes gap between grid failure and generator start. Generators sustain operations for hours or days, limited by fuel. Dual power feeds (two utility feeds or feed plus generator) protect against a single-feed failure. Power is the easiest single point of failure to overlook.
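The power chain can be checked with simple arithmetic. The sketch below assumes the outage outlasts the UPS, so the generator is needed; the function name, parameters, and sample figures are hypothetical.

```python
def power_gaps(ups_runtime_min, generator_start_min, fuel_hours, outage_hours):
    """Return the weak links in the power chain for a given grid outage."""
    problems = []
    if ups_runtime_min < generator_start_min:
        problems.append("UPS drains before the generator is carrying load")
    if fuel_hours < outage_hours:
        problems.append("generator runs out of fuel before the grid returns")
    return problems


# 10 minutes of UPS covers a 2-minute generator start; 48 h of fuel covers 24 h.
sound = power_gaps(ups_runtime_min=10, generator_start_min=2,
                   fuel_hours=48, outage_hours=24)

# 1 minute of UPS does not cover the start, and 8 h of fuel does not cover 24 h.
weak = power_gaps(ups_runtime_min=1, generator_start_min=2,
                  fuel_hours=8, outage_hours=24)
```

An empty list means each stage hands off to the next without a gap; any entry marks a single point of failure in the chain.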
Metrics. Recovery Time Objective (RTO) is how long you can be down before the business is materially harmed. Recovery Point Objective (RPO) is how much data loss is tolerable — measured in time (15 minutes, 1 hour, 24 hours). Mean Time To Recover (MTTR) is the average time to restore service after a failure. Mean Time Between Failures (MTBF) is the average time a system runs between failures. RTO and RPO are the targets; MTTR and MTBF describe the observed behavior.
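The observed metrics can be computed from an incident log, and steady-state availability follows from the standard relation availability = MTBF / (MTBF + MTTR). The sketch below is an illustration with made-up numbers; `observed_metrics` and its inputs are hypothetical names.

```python
def observed_metrics(period_hours: float, outage_hours: list) -> dict:
    """Compute MTTR, MTBF, and availability from the outages in one period."""
    failures = len(outage_hours)
    if failures == 0:
        raise ValueError("no failures observed; MTTR and MTBF are undefined")
    downtime = sum(outage_hours)
    mttr = downtime / failures                      # average time to recover
    mtbf = (period_hours - downtime) / failures     # average uptime per failure
    return {"mttr": mttr, "mtbf": mtbf, "availability": mtbf / (mtbf + mttr)}


outages = [2, 4, 6]                                 # hours, illustrative
metrics = observed_metrics(period_hours=1000, outage_hours=outages)

# RTO is a per-incident target, so compare each outage, not just the average.
rto_hours = 4
every_outage_met_rto = all(hours <= rto_hours for hours in outages)
```

In this sample the MTTR is exactly 4 hours, yet one outage ran to 6 hours, so a 4-hour RTO was missed even though the average looks compliant.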
| Site type | What’s there | Time to operate | Cost |
|---|---|---|---|
| Hot | Hardware, current data, running systems | Near-zero (minutes) | Highest |
| Warm | Hardware + connectivity; data/apps need staging | Hours to days | Medium |
| Cold | Power, HVAC, floor space; nothing else | Days to weeks | Lowest |

| Copy method | Protects against | Does NOT protect against |
|---|---|---|
| Snapshot | Accidental deletion, rollback of a bad change | Loss of the storage array (same failure domain) |
| Backup (onsite) | Data corruption, file deletion | Loss of the site |
| Backup (offsite) | Site loss, region-scoped disaster | Data written since the last backup; recovery is also slower and media handling adds risk |
| Replication | Hardware failure, site loss with low RPO | Bad writes — they replicate too |
| Journaling | Point-in-time database recovery | Storage loss unless combined with backup |

| Metric | Question it answers | Exam cue |
|---|---|---|
| RTO | How long until we’re back up? | “Service must be restored within 4 hours” |
| RPO | How much data loss is acceptable? | “Lose no more than 15 minutes of transactions” |
| MTTR | How fast do we actually recover? | “Historical average recovery time” |
| MTBF | How often does this fail? | “Expected lifetime between failures” |
A regional logistics company was hit by ransomware that encrypted most of its file servers and the primary database. The IT director was confident: “We take hourly snapshots on our storage array and nightly backups to a second appliance in the same rack.” The BCP lead asked harder questions before agreeing that the recovery plan was sound.
Recovery options review
You run a mid-size SaaS business with a stated RPO of 1 hour and an RTO of 4 hours. Revenue impact of a full-day outage is meaningful but not existential. The board wants strong continuity but is pushing back on cost. You must choose between a hot secondary region with continuous replication and a warm secondary with hourly snapshots shipped out of region.
Option A: Hot site, active/passive across regions
Full capacity running in a second region, continuous replication, automatic failover. Near-zero RTO/RPO. Highest cost by a wide margin.
Option B: Warm site, hardware ready, hourly snapshots shipped out of region
Standby infrastructure is provisioned but not running production load. Hourly snapshots match the 1-hour RPO; restore + promotion fits within the 4-hour RTO. Cost materially lower than hot.
Option B matches the stated RTO/RPO with meaningful cost savings.
The objectives tell you exactly what to buy: 1-hour RPO means hourly snapshots minimum — you do not need continuous replication to meet the target. A 4-hour RTO comfortably accommodates a warm-site promote/restore sequence. Paying hot-site prices for a 4-hour RTO is overbuying.
When Option A is right: hot/active-passive is the answer when RTO is minutes (financial trading, life-safety systems) or when the revenue cost of even an hour of downtime outweighs the delta in site cost. At 4-hour RTO and meaningful-but-not-existential revenue risk, it is not this scenario.
Both options meet the technical requirement; the exam tests whether you can match the site tier to the numbers. Don’t overbuy: the cheapest option that meets RTO/RPO is the right engineering answer.
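The selection rule used in this scenario can be written down directly: filter the options that meet both objectives, then take the cheapest. The `Option` type and the recovery and cost figures below are assumptions for illustration; only the 4-hour RTO and 1-hour RPO targets come from the scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    rto_hours: float        # expected time to restore service (assumed)
    rpo_hours: float        # expected worst-case data loss (assumed)
    relative_cost: float    # unitless, for comparison only (assumed)


def pick(options, rto_target, rpo_target):
    """Return the cheapest option that meets both objectives, or None."""
    viable = [o for o in options
              if o.rto_hours <= rto_target and o.rpo_hours <= rpo_target]
    return min(viable, key=lambda o: o.relative_cost, default=None)


options = [
    Option("A: hot, continuous replication", 0.1, 0.0, relative_cost=3.0),
    Option("B: warm, hourly snapshots", 4.0, 1.0, relative_cost=1.0),
]
choice = pick(options, rto_target=4, rpo_target=1)         # scenario targets
strict = pick(options, rto_target=0.25, rpo_target=0.25)   # minutes-level targets
```

With the scenario's targets both options are viable and cost decides in favor of B; tighten the targets to minutes and only A remains.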
- A RTO = 15 minutes; RPO = 1 hour
- B RPO = 15 minutes; RTO = 1 hour
- C MTTR = 15 minutes; MTBF = 1 hour
- D Both values are MTTR targets
Correct: B. RPO is how much data loss is tolerable (15 minutes here); RTO is how long the outage may last (1 hour here). The exam tests this swap relentlessly.
A wrong: the labels are reversed. C wrong: MTTR and MTBF describe observed behavior, not stated objectives. D wrong: at least one value is a data-loss target, which is RPO.
Source: CompTIA SY0-701 Objectives v5.0 — 3.4 Resilience and Recovery
- A Hot site with active/active replication
- B Warm site with hourly snapshot shipping
- C Cold site with nightly offsite backups
- D Multi-cloud active/active across three providers
Correct: C. A 48-hour RTO and minimal regulatory pressure mean a cold site with nightly offsite backups is the cheapest tier that meets the requirement. Spending more would be overengineering.
A and D wrong: both are hot/active strategies that cost far more than the 48-hour RTO justifies. B wrong: warm is still overbuilt for this tolerance.
- A Additional hourly snapshots on the same storage array
- B Faster replication between production servers
- C Offsite, immutable (or air-gapped) backups stored in a separate failure domain
- D Larger UPS capacity in the primary datacenter
Correct: C. The snapshots and the onsite backups shared the production failure domain, so the same event reached all of them. An offsite copy that is immutable or air-gapped sits in a separate failure domain that the ransomware cannot modify, which is the property this scenario tests.
A wrong: more snapshots in the same failure domain still fall to the same event. B wrong: replication copies the encrypted state just as fast. D wrong: power is a separate concern and would not have stopped a ransomware encryption event.
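The failure-domain reasoning can be sketched as a simple check: a copy survives ransomware if it is immutable or if it lives outside the compromised domain. The `Copy` type, the domain names, and the rule itself are a simplified model for illustration, not a complete threat analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    failure_domain: str       # where the copy lives (hypothetical labels)
    immutable: bool = False   # True if the copy cannot be altered once written


def survives_ransomware(copy: Copy, compromised_domain: str) -> bool:
    """A copy survives if it cannot be rewritten or is out of the attacker's reach."""
    return copy.immutable or copy.failure_domain != compromised_domain


copies = [
    Copy("hourly snapshot", "primary-rack"),
    Copy("nightly backup appliance", "primary-rack"),
    Copy("offsite immutable backup", "offsite-vault", immutable=True),
]
usable = [c.name for c in copies if survives_ransomware(c, "primary-rack")]
```

Adding more snapshots only adds more entries with the same `failure_domain`, so the list of usable copies does not grow. Note that an immutable copy in the same rack would pass this check but would still be lost with the rack, which is why the separate location matters as well.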