3.4 Domain 3 · Security Architecture

Resilience and Recovery in Security Architecture

High availability, hot/warm/cold sites, multi-cloud, backups, power, and the metrics — RTO, RPO, MTTR, MTBF — that turn “stay up” into engineering decisions.

The Concept

Resilience is the engineering answer to “what happens when something fails”; recovery is the operational answer to “how do we get back up.” Nearly every Domain 3.4 question reduces to two numbers: RTO (how much downtime you can tolerate) and RPO (how much data you can afford to lose). Those numbers drive every other choice: hot vs. warm vs. cold site, replication vs. snapshots, active/active vs. active/passive.

Three mental hooks carry most questions: (1) HA is not DR. High availability addresses failure within a datacenter (server, rack, AZ); disaster recovery addresses loss of the datacenter itself — and the exam punishes conflation. (2) RTO and RPO are the forcing functions. Given a stated tolerance, the site type and backup strategy almost pick themselves. (3) A backup you have not restored is not a backup. Untested recovery is the most common and most testable failure in this domain.

High availability (HA). Redundant components and automatic failover so that a single failure does not take the service down. Two common patterns: load balancing distributes independent requests across a pool of servers (usually active/active — all nodes serve traffic), while clustering tightly couples nodes that share state or storage and present as a single logical system. Active/active uses every node and scales capacity; active/passive keeps a hot standby that takes over on failure — simpler but wastes capacity.
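
The capacity trade-off between the two patterns can be sketched numerically. This is a minimal illustration rather than a sizing tool: the function names and load figures are assumptions made for this example, and load is expressed in units of one node's capacity.

```python
def per_node_load_after_failure(total_load: float, serving_nodes: int) -> float:
    """Load on each surviving node after one serving node fails.

    total_load is in units of a single node's capacity (1.0 = one full node).
    """
    if serving_nodes < 2:
        raise ValueError("need at least two serving nodes to absorb a failure")
    return total_load / (serving_nodes - 1)


def survives_one_failure(total_load: float, serving_nodes: int) -> bool:
    """True if the survivors stay at or below 100% of their capacity."""
    return per_node_load_after_failure(total_load, serving_nodes) <= 1.0


# Active/active: four nodes share 2.25 nodes' worth of traffic.
# After one node is lost, three survivors split the same load.
print(per_node_load_after_failure(2.25, 4))
print(survives_one_failure(2.25, 4))  # enough headroom was reserved
print(survives_one_failure(3.6, 4))   # pool was run too hot to lose a node
```

Active/passive is the limiting case of the same idea: the standby holds a full node of headroom in reserve, which is why it is simple to reason about and expensive per unit of served traffic.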

Site considerations. When a whole site is lost, an alternate site absorbs the workload. Hot site is a fully operational duplicate with data replicated continuously — near-zero RTO, highest cost. Cold site is an empty facility with power, cooling, and connectivity but no hardware or data — weeks to activate, lowest cost. Warm site sits between the two: hardware and network are in place, but data and apps still need staging — hours to days. Geographic dispersion places sites in different regions so a regional disaster cannot take both down; this trades against latency, replication cost, and sovereignty.

Platform diversity and multi-cloud. Running on multiple operating systems, databases, or cloud providers reduces the blast radius of a single-vendor defect or outage. Multi-cloud spreads workloads across two or more cloud providers — advantages include resilience, lock-in avoidance, and negotiating leverage; disadvantages include inconsistent tooling, data-egress costs, and the operational overhead of maintaining two ways to do everything.

Continuity of operations. The Business Continuity Plan (BCP) is the business-wide plan for keeping the organization operating during and after a disruption — people, processes, facilities, vendors. The Disaster Recovery Plan (DRP) is the IT subset: how to restore technology services after a disaster. BCP is the umbrella; DRP sits inside it.

Capacity planning. Resilience requires headroom in three dimensions: people (on-call staffing, trained responders, surge support), technology (compute, storage, network headroom; scale-up vs. scale-out choices), and infrastructure (power, cooling, physical space). A DR plan that assumes unlimited hands at 3am is a plan that fails at 3am.

Testing. Resilience claims are hypotheses until tested. Tabletop exercises are discussion-based walkthroughs: cheapest, highest frequency, lowest confidence. Simulations rehearse a realistic failure scenario hands-on, usually in a test environment; chaos engineering is the variant that injects faults into running systems. Failover tests cut production over to the DR site: highest confidence and highest risk. Parallel processing runs production and DR concurrently to validate that DR output matches production without cutting over.

Backups. Backups are distinguished by where they live, how often they run, and how they protect the data. Onsite backups recover fast but share fate with the primary site; offsite backups survive site loss but recover slower. Frequency is set by RPO — one hour of allowable loss means hourly backups minimum. Encryption of backups is mandatory; unencrypted backup media is an attacker’s favorite exfiltration target. Snapshots are point-in-time storage state — fast and space-efficient, but they live on the same storage array and are not a substitute for real backups. Replication copies data continuously or near-continuously to a secondary location; RPO approaches zero but bad writes are replicated too. Journaling stores transaction logs that can be replayed to a point in time (database-style recovery). Recovery testing is the control no one skips in an audit: a backup you have not restored is not a backup.
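
The link between backup frequency and RPO can be checked with simple arithmetic. The sketch below assumes a worst case in which the failure happens just before the next copy becomes usable; the `copy_lag` parameter (time for a copy to land at its destination) is an assumption added for illustration and does not appear in the text above.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         copy_lag: timedelta = timedelta(0)) -> timedelta:
    """Age of the newest usable copy in the worst case.

    If a failure hits just before the next copy becomes usable, everything
    written since the previous backup started is lost: one full interval,
    plus however long a copy takes to become usable (copy_lag).
    """
    return backup_interval + copy_lag


def meets_rpo(backup_interval: timedelta, rpo: timedelta,
              copy_lag: timedelta = timedelta(0)) -> bool:
    """True if the worst-case loss stays within the stated RPO."""
    return worst_case_data_loss(backup_interval, copy_lag) <= rpo


one_hour = timedelta(hours=1)
print(meets_rpo(timedelta(hours=1), rpo=one_hour))    # hourly backups, no lag
print(meets_rpo(timedelta(hours=24), rpo=one_hour))   # nightly backups
print(meets_rpo(timedelta(hours=1), rpo=one_hour,
                copy_lag=timedelta(minutes=10)))      # hourly, slow offsite copy
```

With `copy_lag` at zero this reproduces the rule of thumb that a one-hour RPO needs at least hourly backups; a nonzero lag suggests leaving some margin below the RPO when copies travel offsite.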

Power. Resilience dies first at the wall socket. Uninterruptible Power Supply (UPS) bridges the seconds-to-minutes gap between grid failure and generator start. Generators sustain operations for hours or days, limited by fuel. Dual power feeds (two utility feeds or feed plus generator) protect against a single-feed failure. Power is the easiest single point of failure to overlook.
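
The UPS-to-generator handoff described above can be sanity-checked with a rough estimate. This sketch assumes a simple linear model (usable battery energy divided by load); real UPS runtime curves are nonlinear, and the efficiency and safety-factor defaults are hypothetical planning values, not vendor figures.

```python
def ups_runtime_minutes(battery_wh: float, load_w: float,
                        efficiency: float = 0.9) -> float:
    """Rough UPS runtime: usable energy (Wh) divided by load (W), in minutes."""
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    return battery_wh * efficiency / load_w * 60


def bridges_generator_start(battery_wh: float, load_w: float,
                            generator_start_minutes: float,
                            safety_factor: float = 2.0) -> bool:
    """True if estimated UPS runtime covers generator start with margin."""
    needed = generator_start_minutes * safety_factor
    return ups_runtime_minutes(battery_wh, load_w) >= needed


# A 5 kWh battery carrying a 10 kW load, generator assumed online in 2 minutes.
print(round(ups_runtime_minutes(5000, 10000), 1))
print(bridges_generator_start(5000, 10000, generator_start_minutes=2))
print(bridges_generator_start(500, 10000, generator_start_minutes=2))
```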

Metrics. Recovery Time Objective (RTO) is how long you can be down before the business is materially harmed. Recovery Point Objective (RPO) is how much data loss is tolerable — measured in time (15 minutes, 1 hour, 24 hours). Mean Time To Recover (MTTR) is the average time to restore service after a failure. Mean Time Between Failures (MTBF) is the average time a system runs between failures. RTO and RPO are the targets; MTTR and MTBF describe the observed behavior.
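
To show how the observed metrics relate to the targets, here is a small sketch that derives MTBF, MTTR, and steady-state availability (MTBF divided by MTBF plus MTTR) from an incident log. The `Incident` record and the sample figures are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    uptime_before_hours: float  # hours the service ran before this failure
    downtime_hours: float       # hours it took to restore service


def mtbf(incidents: List[Incident]) -> float:
    """Mean operating time between failures, in hours."""
    return sum(i.uptime_before_hours for i in incidents) / len(incidents)


def mttr(incidents: List[Incident]) -> float:
    """Mean time to restore service, in hours."""
    return sum(i.downtime_hours for i in incidents) / len(incidents)


def availability(incidents: List[Incident]) -> float:
    """Steady-state availability estimate: MTBF / (MTBF + MTTR)."""
    return mtbf(incidents) / (mtbf(incidents) + mttr(incidents))


log = [Incident(700, 2), Incident(500, 6), Incident(900, 1)]
rto_hours = 4

print(mtbf(log), mttr(log))
print(round(availability(log), 4))
# The average recovery time is under the RTO, yet one incident breached it.
breaches = [i for i in log if i.downtime_hours > rto_hours]
print(len(breaches))
```

The last lines illustrate why MTTR alone is a weak proof of meeting an RTO: an average can sit below the target while individual recoveries exceed it.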

Site type	What’s there	Time to operate	Cost
Hot	Hardware, current data, running systems	Near-zero (minutes)	Highest
Warm	Hardware + connectivity; data/apps need staging	Hours to days	Medium
Cold	Power, HVAC, floor space — nothing else	Weeks	Lowest
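
The table can be read as a lookup: pick the cheapest tier whose time to operate fits inside the RTO. A minimal sketch follows, assuming illustrative activation times (15 minutes for hot, 4 hours for warm, two weeks for cold); these hour figures are hypothetical planning values, and real ones should come from tested activations.

```python
from typing import Optional

# (tier, assumed hours to become operational), ordered cheapest first.
# The hour values are illustrative assumptions, not standard figures.
TIERS_CHEAPEST_FIRST = [
    ("cold", 14 * 24.0),
    ("warm", 4.0),
    ("hot", 0.25),
]


def cheapest_site_tier(rto_hours: float) -> Optional[str]:
    """Return the cheapest tier that can be operational within the RTO."""
    for tier, hours_to_operate in TIERS_CHEAPEST_FIRST:
        if hours_to_operate <= rto_hours:
            return tier
    return None  # no alternate site is fast enough; HA must cover the gap


print(cheapest_site_tier(30 * 24))  # RTO measured in weeks
print(cheapest_site_tier(4))        # RTO of a few hours
print(cheapest_site_tier(0.5))      # RTO of minutes
```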

Copy method	Protects against	Does NOT protect against
Snapshot	Accidental deletion, rollback of a bad change	Loss of the storage array (same failure domain)
Backup (onsite)	Data corruption, file deletion	Loss of the site
Backup (offsite)	Site loss, region-scoped disaster	Tight RTOs (restores are slower); media lost or mishandled in transit
Replication	Hardware failure, site loss with low RPO	Bad writes — they replicate too
Journaling	Point-in-time database recovery	Storage loss unless combined with backup
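
The replication and journaling rows differ in one key way: a replica mirrors the latest state, bad writes included, while a journal lets you stop just before the bad write. The toy sketch below replays a journal on top of a base backup up to a chosen point in time; the data model (integer timestamps, a dict as the database) is an assumption made purely for illustration.

```python
from typing import Dict, List, Optional, Tuple

# One journal entry: (timestamp, key, new value); a value of None means "delete".
JournalEntry = Tuple[int, str, Optional[str]]


def replay_journal(base_state: Dict[str, str],
                   journal: List[JournalEntry],
                   recover_to: int) -> Dict[str, str]:
    """Rebuild state from a base backup by replaying entries up to recover_to."""
    state = dict(base_state)  # never mutate the base backup itself
    for timestamp, key, value in sorted(journal, key=lambda e: e[0]):
        if timestamp > recover_to:
            break
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


base = {"order-1": "paid"}
journal = [
    (10, "order-2", "paid"),
    (20, "order-1", "refunded"),
    (30, "order-2", None),  # the bad write: an accidental deletion
]

# Recover to just before the bad write at t=30.
print(replay_journal(base, journal, recover_to=20))
```

Note that the replay still depends on the base backup and the journal surviving, which is why the table pairs journaling with a backup.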

Metric	Question it answers	Exam cue
RTO	How long until we’re back up?	“Service must be restored within 4 hours”
RPO	How much data loss is acceptable?	“Lose no more than 15 minutes of transactions”
MTTR	How fast do we actually recover?	“Historical average recovery time”
MTBF	How often does this fail?	“Expected lifetime between failures”

A regional logistics company has been hit by ransomware that encrypted most of its file servers and the primary database. The IT director is confident: “We take hourly snapshots on our storage array and nightly backups to a second appliance in the same rack.” The BCP lead asks harder questions before agreeing that the recovery plan is sound.

IT Director

“Snapshots at 6am are clean. We roll every volume back — we’re up in two hours.”

BCP Lead

“Those snapshots live on the same array that was encrypted. If the attacker had admin on the storage, the snapshots go too. Have you verified the snapshot volume is intact?”

IT Director

“Fair. Then the nightly backup appliance in the rack.”

BCP Lead

“Same rack means same failure domain. If the ransomware spread laterally to backup credentials, the appliance is also encrypted. Do we have an offsite, immutable, or air-gapped copy?”

IT Director

“Weekly to tape, pulled offsite Fridays.”

BCP Lead

“That’s our floor. Our achievable recovery point just became up to a week of lost data, not one hour. And have we tested the tape restore in the last 12 months?”

IT Director

“Not this year.”

BCP Lead

“Then we don’t know yet if we have a backup. We’re going to the tape and working the restore test simultaneously. And we’re adding immutable cloud backups to the plan this quarter.”

Key Insight

Snapshots and onsite backups share the fate of the primary. True resilience requires a copy in a different failure domain — offsite, immutable, or air-gapped — plus a tested restore. Without those two, backups are theatre.
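
The two conditions in this insight (a separate failure domain and a tested restore) can be expressed as a simple inventory check. The sketch below models the scenario's three copies; the `BackupCopy` fields and the failure-domain labels are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class BackupCopy:
    name: str
    failure_domain: str   # where the copy lives, e.g. "primary-array"
    restore_tested: bool  # has a restore from this copy been completed?


def resilient_copies(copies: List[BackupCopy],
                     primary_domains: Set[str]) -> List[BackupCopy]:
    """Copies outside every primary failure domain with a tested restore."""
    return [
        c for c in copies
        if c.failure_domain not in primary_domains and c.restore_tested
    ]


inventory = [
    BackupCopy("hourly-snapshots", "primary-array", restore_tested=True),
    BackupCopy("nightly-appliance", "primary-rack", restore_tested=True),
    BackupCopy("weekly-tape", "offsite-vault", restore_tested=False),
]
primary = {"primary-array", "primary-rack"}

print([c.name for c in resilient_copies(inventory, primary)])  # nothing qualifies yet

inventory[2].restore_tested = True  # after a successful tape restore drill
print([c.name for c in resilient_copies(inventory, primary)])
```

In a ransomware case, shared admin credentials effectively merge failure domains, so a real inventory would also need to track which credentials can reach each copy.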

Compensating Controls

If perfect offsite/immutable backups are not yet in place: tightly segment backup infrastructure from production, use separate credentials for backup admins, require MFA for all backup management, and schedule monthly restore drills on representative data.

You run a mid-size SaaS business with a stated RPO of 1 hour and an RTO of 4 hours. Revenue impact of a full-day outage is meaningful but not existential. The board wants strong continuity but is pushing back on cost. You must choose between a hot secondary region with continuous replication and a warm secondary with hourly snapshots shipped out of region.

Option A

Hot site — active/passive across regions

Full capacity running in a second region, continuous replication, automatic failover. Near-zero RTO/RPO. Highest cost by a wide margin.

Option B

Warm site — hardware ready, hourly snapshots shipped out of region

Standby infrastructure is provisioned but not running production load. Hourly snapshots match the 1-hour RPO; restore + promotion fits within the 4-hour RTO. Cost materially lower than hot.

Option B matches the stated RTO/RPO with meaningful cost savings.

The objectives tell you exactly what to buy: 1-hour RPO means hourly snapshots minimum — you do not need continuous replication to meet the target. A 4-hour RTO comfortably accommodates a warm-site promote/restore sequence. Paying hot-site prices for a 4-hour RTO is overbuying.

When Option A is right: hot/active-passive is the answer when RTO is minutes (financial trading, life-safety systems) or when the revenue cost of even an hour of downtime outweighs the delta in site cost. At 4-hour RTO and meaningful-but-not-existential revenue risk, it is not this scenario.

Both options meet the technical requirement; the exam tests whether you can match the site tier to the numbers. Don’t overbuy: the cheapest option that meets RTO/RPO is the right engineering answer.
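
The decision rule in this section (filter by RTO and RPO, then take the cheapest survivor) can be written down directly. In the sketch below, the achieved RTO/RPO figures and the cost units for each option are hypothetical values chosen to mirror the scenario, not measurements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DrOption:
    name: str
    rto_hours: float     # recovery time the option is expected to achieve
    rpo_hours: float     # data loss the option is expected to allow
    monthly_cost: float  # hypothetical relative cost units


def cheapest_option_meeting(options: List[DrOption],
                            rto_hours: float,
                            rpo_hours: float) -> Optional[DrOption]:
    """Cheapest option whose expected RTO and RPO both fit the targets."""
    candidates = [
        o for o in options
        if o.rto_hours <= rto_hours and o.rpo_hours <= rpo_hours
    ]
    return min(candidates, key=lambda o: o.monthly_cost, default=None)


options = [
    DrOption("A: hot, continuous replication", 0.1, 0.0, 100.0),
    DrOption("B: warm, hourly snapshots", 3.0, 1.0, 40.0),
]

choice = cheapest_option_meeting(options, rto_hours=4, rpo_hours=1)
print(choice.name if choice else "no option meets the targets")

# Tighten the RTO to 15 minutes and only the hot site remains.
strict = cheapest_option_meeting(options, rto_hours=0.25, rpo_hours=1)
print(strict.name if strict else "no option meets the targets")
```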

Conflating HA with DR

Trap: “We have a load-balanced cluster, so we’re covered for disaster.” HA protects against failures within a site. DR protects against loss of the site.

Exam signal: scenarios mentioning a regional outage, fire, or flood are DR questions; scenarios mentioning a server crash or AZ failure are HA questions.

Treating snapshots as backups

Trap: Relying on snapshots for ransomware recovery. Snapshots live on the primary storage system — they share its failure domain and often its credentials.

Exam signal: any answer that treats snapshots as a complete recovery strategy is almost certainly wrong. Pair snapshots with an offsite/immutable backup.

RTO vs. RPO confusion

Trap: Swapping the two under time pressure. RTO = time-to-up; RPO = tolerable data loss. Mnemonic: RTO = Time, RPO = Point (how far back in time).

Exam signal: every scenario with two numbers invites the swap. Read twice; the number describing acceptable data loss is always RPO, and the number describing time to restore service is always RTO.

Hot site as default answer

Trap: Picking hot site whenever DR is mentioned. Hot is right only when RTO/RPO is near-zero and cost supports it. For most workloads with hours of tolerance, warm is correct.

Exam signal: questions that emphasize cost constraints or “budget is tight” almost always want warm or cold, matched to the stated tolerance.

Skipping backup testing

Trap: Assuming a backup that runs is a backup that works. Media degrades, encryption keys get lost, processes drift — only a completed restore proves the chain works end to end.

Exam signal: the phrase “regular restore tests” is almost always a correct-answer ingredient in resilience questions.
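
A restore drill can start as small as the sketch below: bring a backup file back into a scratch location and confirm it matches the checksum recorded when the backup was taken. The file copy here is only a stand-in for a real restore through your backup tool, and the function names are assumptions for this example; a complete drill would also start the application against the restored data.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_and_verify(backup_file: Path, expected_sha256: str) -> bool:
    """Restore into a scratch directory and compare against the recorded hash."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / backup_file.name
        shutil.copy2(backup_file, restored)  # stand-in for the real restore step
        return sha256_of(restored) == expected_sha256


# Usage with a throwaway file standing in for real backup media.
with tempfile.TemporaryDirectory() as workdir:
    backup = Path(workdir) / "db.bak"
    backup.write_bytes(b"sample backup payload")
    recorded_hash = sha256_of(backup)  # normally stored at backup time
    print(restore_and_verify(backup, recorded_hash))
```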

Forgetting power is a layer

Trap: Designing for server and site failure but not grid failure. UPS bridges the gap; generator sustains; dual feeds protect against a single feed failure.

Exam signal: “generator is enough” is a trap — without a UPS, the generator start time is an outage.

Signal

Resilience questions almost always give you two numbers and a cost signal. Extract RTO, extract RPO, note any “budget-constrained” or “mission-critical” cue, and the site tier and backup pattern follow. Scenarios mentioning ransomware specifically test the snapshot-vs-offsite-backup distinction.

Practice Question 1 of 3

A hospital states that its electronic-medical-record system can tolerate no more than 15 minutes of data loss, and must be restored within 1 hour of any disruption. Which pairing best describes the stated objectives?

A RTO = 15 minutes; RPO = 1 hour
B RPO = 15 minutes; RTO = 1 hour
C MTTR = 15 minutes; MTBF = 1 hour
D Both values are MTTR targets

Correct: B. RPO is how much data loss is tolerable (15 minutes here); RTO is how long the outage may last (1 hour here). The exam tests this swap relentlessly.

A wrong: the labels are reversed. C wrong: MTTR and MTBF describe observed behavior, not stated objectives. D wrong: at least one value is a data-loss target, which is RPO.

Source: CompTIA SY0-701 Objectives v5.0 — 3.4 Resilience and Recovery

Practice Question 2 of 3

A finance firm’s internal HR system has a 30-day RTO and minimal regulatory pressure. The CFO wants to minimize continuity spend. Which site strategy best matches the requirement?

A Hot site with active/active replication
B Warm site with hourly snapshot shipping
C Cold site with nightly offsite backups
D Multi-cloud active/active across three providers

Correct: C. A 30-day RTO with minimal regulatory pressure means a cold site with nightly offsite backups is the cheapest tier that meets the requirement; a cold site needs weeks to activate, which still fits inside the window. Spending more would be overengineering.

A and D wrong: both are hot/active strategies that cost far more than a 30-day RTO justifies. B wrong: warm is still overbuilt for this tolerance.

Source: CompTIA SY0-701 Objectives v5.0 — 3.4 Resilience and Recovery

Practice Question 3 of 3

A manufacturer discovers that ransomware has encrypted its production file servers, its hourly storage snapshots, and the nightly backup appliance in the same rack. Which control would most directly have prevented this single-event wipeout?

A Additional hourly snapshots on the same storage array
B Faster replication between production servers
C Offsite, immutable (or air-gapped) backups stored in a separate failure domain
D Larger UPS capacity in the primary datacenter

Correct: C. The failure here is that snapshots and onsite backups share the production failure domain. An offsite, immutable, or air-gapped copy sits in a different failure domain that the attacker cannot reach from production, which is the separation the backup objectives in this domain are testing.

A wrong: more snapshots in the same failure domain still fall to the same event. B wrong: replication copies the encrypted state just as fast. D wrong: power is a separate concern and would not have stopped a ransomware encryption event.

Source: CompTIA SY0-701 Objectives v5.0 — 3.4 Resilience and Recovery
