3.4 Domain 3 · Security Architecture

Resilience and Recovery in Security Architecture

High availability, hot/warm/cold sites, multi-cloud, backups, power, and the metrics — RTO, RPO, MTTR, MTBF — that turn “stay up” into engineering decisions.

The Concept

Resilience is the engineering answer to “what happens when something fails”; recovery is the operational answer to “how do we get back up.” Most Domain 3.4 questions reduce to two numbers: RTO (how much downtime you can tolerate) and RPO (how much data you can afford to lose). Those numbers drive the other choices: hot vs. warm vs. cold site, replication vs. snapshots, active/active vs. active/passive.

Three mental hooks carry most questions: (1) HA is not DR. High availability addresses component failure while the site survives (server, rack, AZ); disaster recovery addresses loss of the site itself, and the exam punishes conflation. (2) RTO and RPO are the forcing functions. Given a stated tolerance, the site type and backup strategy almost pick themselves. (3) A backup you have not restored is not a backup. Untested recovery is the most common and most testable failure in this domain.

High availability (HA). Redundant components and automatic failover so that a single failure does not take the service down. Two common patterns: load balancing distributes independent requests across a pool of servers (usually active/active — all nodes serve traffic), while clustering tightly couples nodes that share state or storage and present as a single logical system. Active/active uses every node and scales capacity; active/passive keeps a hot standby that takes over on failure — simpler but wastes capacity.
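As a rough illustration of why redundancy helps, the sketch below estimates the availability of a pool that stays up as long as at least one node is up. It rests on two assumptions that are not in the text above and are optimistic in practice: node failures are independent, and failover is instantaneous. The function name and the 99% figure are illustrative only.

```python
def combined_availability(node_availability: float, nodes: int) -> float:
    """Availability of a pool that is up while at least one node is up.

    Assumes independent failures and instant failover. Real clusters share
    power, network, and software defects, so treat this as an upper bound.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # The pool is down only when every node is down at the same time.
    return 1.0 - (1.0 - node_availability) ** nodes


if __name__ == "__main__":
    for count in (1, 2, 3):
        print(count, f"{combined_availability(0.99, count):.6f}")
```

Each added node shrinks the modeled downtime by a factor of one hundred in this example, which is the intuition behind both active/active pools and hot standbys. Shared failure domains break the independence assumption, which is why HA alone does not cover site loss.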

Site considerations. When a whole site is lost, an alternate site absorbs the workload. Hot site is a fully operational duplicate with data replicated continuously — near-zero RTO, highest cost. Cold site is an empty facility with power, cooling, and connectivity but no hardware or data — weeks to activate, lowest cost. Warm site sits between the two: hardware and network are in place, but data and apps still need staging — hours to days. Geographic dispersion places sites in different regions so a regional disaster cannot take both down; this trades against latency, replication cost, and sovereignty.

Platform diversity and multi-cloud. Running on multiple operating systems, databases, or cloud providers reduces the blast radius of a single-vendor defect or outage. Multi-cloud spreads workloads across two or more cloud providers — advantages include resilience, lock-in avoidance, and negotiating leverage; disadvantages include inconsistent tooling, data-egress costs, and the operational overhead of maintaining two ways to do everything.

Continuity of operations. The Business Continuity Plan (BCP) is the business-wide plan for keeping the organization operating during and after a disruption — people, processes, facilities, vendors. The Disaster Recovery Plan (DRP) is the IT subset: how to restore technology services after a disaster. BCP is the umbrella; DRP sits inside it.

Capacity planning. Resilience requires headroom in three dimensions: people (on-call staffing, trained responders, surge support), technology (compute, storage, network headroom; scale-up vs. scale-out choices), and infrastructure (power, cooling, physical space). A DR plan that assumes unlimited hands at 3am is a plan that fails at 3am.

Testing. Resilience claims are hypotheses until tested. Tabletop exercises are discussion-based walkthroughs — cheapest, highest frequency, lowest confidence. Simulations inject a failure scenario (chaos engineering) against real systems. Failover tests cut production over to the DR site — highest confidence and highest risk. Parallel processing runs production and DR concurrently to validate DR output matches production without cutting over.

Backups. Backups are distinguished by where they live, how often they run, and how they protect the data. Onsite backups recover fast but share fate with the primary site; offsite backups survive site loss but recover more slowly. Frequency is set by RPO: one hour of allowable loss means backups at least hourly. Backups should always be encrypted, because unencrypted backup media is an attractive exfiltration target. Snapshots capture point-in-time storage state; they are fast and space-efficient, but they usually live on the same storage array and are not a substitute for real backups. Replication copies data continuously or near-continuously to a secondary location; RPO approaches zero, but bad writes are replicated too. Journaling stores transaction logs that can be replayed to a point in time (database-style recovery). Recovery testing is the control that validates all the others: a backup you have not restored is not a backup.
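The rule of thumb that RPO sets backup frequency treats each backup as instantaneous. A slightly more conservative sketch, shown here as an assumption rather than an exam rule, also budgets for the time a copy takes to finish: if a failure hits just before the next copy completes, the newest usable copy is older than one interval. The function names and the example durations are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta, copy_duration: timedelta) -> timedelta:
    """Largest gap between the newest usable copy and the moment of failure.

    A copy captures state when it starts and is usable only once it finishes,
    so a failure just before completion loses one interval plus the copy time.
    """
    return interval + copy_duration


def meets_rpo(interval: timedelta, copy_duration: timedelta, rpo: timedelta) -> bool:
    """True when the worst-case loss stays within the stated RPO."""
    return worst_case_data_loss(interval, copy_duration) <= rpo


if __name__ == "__main__":
    one_hour = timedelta(hours=1)
    # Hourly backups that take 10 minutes to land, checked against a 1-hour RPO.
    print(meets_rpo(timedelta(minutes=60), timedelta(minutes=10), one_hour))
    # Tightening the interval to 45 minutes leaves room for the copy time.
    print(meets_rpo(timedelta(minutes=45), timedelta(minutes=10), one_hour))
```

Under this stricter model, "hourly" is the ceiling for a 1-hour RPO, not a comfortable fit, so schedules are often set somewhat tighter than the objective.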

Power. Resilience dies first at the wall socket. Uninterruptible Power Supply (UPS) bridges the seconds-to-minutes gap between grid failure and generator start. Generators sustain operations for hours or days, limited by fuel. Dual power feeds (two utility feeds or feed plus generator) protect against a single-feed failure. Power is the easiest single point of failure to overlook.
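The relationship between UPS, generator, and fuel can be expressed as two simple checks. The sketch below is illustrative only: the `PowerPlan` fields, the function name, and every number in the usage example are hypothetical, and a real design would also account for load growth, battery aging, and refueling contracts.

```python
from dataclasses import dataclass


@dataclass
class PowerPlan:
    ups_runtime_min: float      # minutes the UPS can carry the load
    generator_start_min: float  # minutes until the generator accepts load
    fuel_runtime_hours: float   # hours of generator fuel stored on site


def power_gaps(plan: PowerPlan, expected_outage_hours: float) -> list[str]:
    """List the ways this plan fails to ride through a grid outage."""
    gaps = []
    # The UPS must last at least until the generator is carrying the load.
    if plan.ups_runtime_min < plan.generator_start_min:
        gaps.append("UPS runtime does not cover generator start")
    # The generator must outlast the outage it is meant to cover.
    if plan.fuel_runtime_hours < expected_outage_hours:
        gaps.append("fuel runs out before the expected outage ends")
    return gaps


if __name__ == "__main__":
    plan = PowerPlan(ups_runtime_min=10, generator_start_min=2, fuel_runtime_hours=48)
    print(power_gaps(plan, expected_outage_hours=24))
```

An empty list means the plan bridges the gap and sustains the load for the assumed outage; it says nothing about a single-feed failure, which is what dual feeds address.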

Metrics. Recovery Time Objective (RTO) is how long you can be down before the business is materially harmed. Recovery Point Objective (RPO) is how much data loss is tolerable — measured in time (15 minutes, 1 hour, 24 hours). Mean Time To Recover (MTTR) is the average time to restore service after a failure. Mean Time Between Failures (MTBF) is the average time a system runs between failures. RTO and RPO are the targets; MTTR and MTBF describe the observed behavior.
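To make the distinction between targets and observed behavior concrete, the sketch below derives MTTR and MTBF from a small incident log and combines them into the commonly used steady-state availability estimate, MTBF / (MTBF + MTTR). The `Incident` type, the function names, and the sample figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    uptime_before_hours: float  # hours the service ran before this failure
    repair_hours: float         # hours taken to restore service


def mean_time_to_recover(incidents: list[Incident]) -> float:
    """Average hours to restore service after a failure (MTTR)."""
    return sum(i.repair_hours for i in incidents) / len(incidents)


def mean_time_between_failures(incidents: list[Incident]) -> float:
    """Average hours of operation between failures (MTBF)."""
    return sum(i.uptime_before_hours for i in incidents) / len(incidents)


def availability(incidents: list[Incident]) -> float:
    """Steady-state availability estimate: MTBF / (MTBF + MTTR)."""
    mtbf = mean_time_between_failures(incidents)
    mttr = mean_time_to_recover(incidents)
    return mtbf / (mtbf + mttr)


if __name__ == "__main__":
    log = [Incident(700, 2), Incident(500, 4), Incident(900, 3)]
    print(mean_time_to_recover(log), mean_time_between_failures(log))
    print(f"{availability(log):.4f}")
```

With this sample log the observed MTTR is 3 hours. Comparing that figure with a stated RTO shows whether the objective is realistic: an average of 3 hours against a 4-hour RTO leaves little margin for a slower-than-average recovery.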

Site type | What’s there | Time to operate | Cost
Hot | Hardware, current data, running systems | Near-zero (minutes) | Highest
Warm | Hardware + connectivity; data/apps need staging | Hours to days | Medium
Cold | Power, HVAC, floor space; nothing else | Weeks | Lowest

Copy method | Protects against | Does NOT protect against
Snapshot | Accidental deletion, rollback of a bad change | Loss of the storage array (same failure domain)
Backup (onsite) | Data corruption, file deletion | Loss of the site
Backup (offsite) | Site loss, region-scoped disaster | Slower recovery; media handling risk
Replication | Hardware failure, site loss with low RPO | Bad writes (they replicate too)
Journaling | Point-in-time database recovery | Storage loss unless combined with backup

Metric | Question it answers | Exam cue
RTO | How long until we’re back up? | “Service must be restored within 4 hours”
RPO | How much data loss is acceptable? | “Lose no more than 15 minutes of transactions”
MTTR | How fast do we actually recover? | “Historical average recovery time”
MTBF | How often does this fail? | “Expected lifetime between failures”

A regional logistics company has been hit by ransomware that encrypted most of its file servers and the primary database. The IT director is confident: “We take hourly snapshots on our storage array and nightly backups to a second appliance in the same rack.” The BCP lead asks harder questions before agreeing that the recovery plan is sound.

War Room
IT Director ↔ BCP Lead
Recovery options review
IT Director
“Snapshots at 6am are clean. We roll every volume back — we’re up in two hours.”
BCP Lead
“Those snapshots live on the same array that was encrypted. If the attacker had admin on the storage, the snapshots go too. Have you verified the snapshot volume is intact?”
IT Director
“Fair. Then the nightly backup appliance in the rack.”
BCP Lead
“Same rack means same failure domain. If the ransomware spread laterally to backup credentials, the appliance is also encrypted. Do we have an offsite, immutable, or air-gapped copy?”
IT Director
“Weekly to tape, pulled offsite Fridays.”
BCP Lead
“That’s our floor. Our RPO just became one week of data, not one hour. And have we tested the tape restore in the last 12 months?”
IT Director
“Not this year.”
BCP Lead
“Then we don’t know yet if we have a backup. We’re going to the tape and working the restore test simultaneously. And we’re adding immutable cloud backups to the plan this quarter.”
Key Insight
Snapshots and onsite backups share the fate of the primary. True resilience requires a copy in a different failure domain — offsite, immutable, or air-gapped — plus a tested restore. Without those two, backups are theatre.
Compensating Controls
If perfect offsite/immutable backups are not yet in place: tightly segment backup infrastructure from production, use separate credentials for backup admins, require MFA for all backup management, and schedule monthly restore drills on representative data.
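The BCP lead's questions amount to a two-part test for each copy: is it isolated from the production failure domain, and has a restore been proven recently? The sketch below encodes that test. The `BackupCopy` fields, the domain labels, the dates, and the 365-day threshold are assumptions chosen to mirror the scenario, not a standard.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class BackupCopy:
    name: str
    failure_domain: str            # hypothetical label, e.g. "primary-array"
    immutable_or_air_gapped: bool
    last_restore_test: Optional[date]


def recoverable_copies(copies, production_domains, today,
                       max_test_age=timedelta(days=365)):
    """Return copies that are both isolated from production and restore-tested."""
    good = []
    for copy in copies:
        # Isolation: outside production's failure domains, or protected from
        # tampering by immutability or an air gap.
        isolated = (copy.failure_domain not in production_domains
                    or copy.immutable_or_air_gapped)
        # Proof: a restore test exists and is recent enough to trust.
        tested = (copy.last_restore_test is not None
                  and today - copy.last_restore_test <= max_test_age)
        if isolated and tested:
            good.append(copy)
    return good


if __name__ == "__main__":
    production = {"primary-array", "primary-rack"}
    copies = [
        BackupCopy("hourly snapshots", "primary-array", False, None),
        BackupCopy("nightly appliance", "primary-rack", False, None),
        BackupCopy("weekly tape", "offsite", True, None),  # never restore-tested
    ]
    print(recoverable_copies(copies, production, today=date(2024, 6, 1)))
```

Run against the scenario's three copies, the result is an empty list: the tape is isolated but unproven, which is the point the BCP lead makes when she says they do not yet know whether they have a backup.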

You run a mid-size SaaS business with a stated RPO of 1 hour and an RTO of 4 hours. Revenue impact of a full-day outage is meaningful but not existential. The board wants strong continuity, but is pushing back on cost. You must pick between a hot secondary region with continuous replication, or a warm secondary with hourly snapshots shipped out of region.

Option A
Hot site — active/passive across regions

Full capacity running in a second region, continuous replication, automatic failover. Near-zero RTO/RPO. Highest cost by a wide margin.

Option B
Warm site — hardware ready, hourly snapshots shipped out of region

Standby infrastructure is provisioned but not running production load. Hourly snapshots match the 1-hour RPO; restore + promotion fits within the 4-hour RTO. Cost materially lower than hot.

Option B matches the stated RTO/RPO with meaningful cost savings.

The objectives tell you exactly what to buy: 1-hour RPO means hourly snapshots minimum — you do not need continuous replication to meet the target. A 4-hour RTO comfortably accommodates a warm-site promote/restore sequence. Paying hot-site prices for a 4-hour RTO is overbuying.

When Option A is right: hot/active-passive is the answer when RTO is minutes (financial trading, life-safety systems) or when the revenue cost of even an hour of downtime outweighs the delta in site cost. At 4-hour RTO and meaningful-but-not-existential revenue risk, it is not this scenario.

Both options meet the technical requirement — the exam tests whether you can match the site tier to the numbers. Don’t oversell; the cheapest option that meets RTO/RPO is the right engineering answer.
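The decision rule in this section, pick the cheapest option that meets both objectives, can be sketched as a filter followed by a minimum. The recovery times, data-loss figures, and cost ranks below are hypothetical placeholders loosely based on the tiers described earlier; real values come from vendor quotes and tested recovery runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SiteOption:
    name: str
    recovery_hours: float   # estimated time to restore service
    data_loss_hours: float  # estimated worst-case data loss
    relative_cost: int      # 1 = cheapest


# Hypothetical figures for illustration only.
EXAMPLE_OPTIONS = [
    SiteOption("hot", 0.1, 0.0, 3),
    SiteOption("warm", 3.0, 1.0, 2),
    SiteOption("cold", 336.0, 24.0, 1),
]


def cheapest_fit(options, rto_hours: float, rpo_hours: float) -> Optional[SiteOption]:
    """Cheapest option whose estimates satisfy both RTO and RPO, else None."""
    fits = [o for o in options
            if o.recovery_hours <= rto_hours and o.data_loss_hours <= rpo_hours]
    return min(fits, key=lambda o: o.relative_cost, default=None)


if __name__ == "__main__":
    # The SaaS scenario: RTO of 4 hours, RPO of 1 hour.
    print(cheapest_fit(EXAMPLE_OPTIONS, rto_hours=4, rpo_hours=1))
```

With these placeholder numbers the 4-hour/1-hour scenario selects the warm option, and tightening the objectives to minutes leaves hot as the only fit. The value of writing it this way is that cost never enters the filter: an option that misses an objective is excluded regardless of price.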

1. Conflating HA with DR
Trap: “We have a load-balanced cluster, so we’re covered for disaster.” HA protects against failures within a site. DR protects against loss of the site.
Exam signal: scenarios mentioning a regional outage, fire, or flood are DR questions; scenarios mentioning a server crash or AZ failure are HA questions.
2. Treating snapshots as backups
Trap: Relying on snapshots for ransomware recovery. Snapshots live on the primary storage system, so they share its failure domain and often its credentials.
Exam signal: any answer that treats snapshots as a complete recovery strategy is almost certainly wrong. Pair snapshots with an offsite/immutable backup.
3. RTO vs. RPO confusion
Trap: Swapping the two under time pressure. RTO = time-to-up; RPO = tolerable data loss. Mnemonic: RTO = Time, RPO = Point (how far back in time).
Exam signal: every scenario with two numbers invites the swap. Read twice; the number describing acceptable data loss is always RPO, and the number describing time to restore service is always RTO.
4. Hot site as default answer
Trap: Picking hot site whenever DR is mentioned. Hot is right only when RTO/RPO is near-zero and cost supports it. For most workloads with hours of tolerance, warm is correct.
Exam signal: questions that emphasize cost constraints or “budget is tight” almost always want warm or cold, matched to the stated tolerance.
5. Skipping backup testing
Trap: Assuming a backup that runs is a backup that works. Media degrades, encryption keys get lost, processes drift; only a completed restore proves the chain works end to end.
Exam signal: the phrase “regular restore tests” is almost always a correct-answer ingredient in resilience questions.
6. Forgetting power is a layer
Trap: Designing for server and site failure but not grid failure. UPS bridges the gap; generator sustains; dual feeds protect against a single feed failure.
Exam signal: “generator is enough” is a trap: without a UPS, the generator start time is an outage.
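The backup-testing trap comes down to proving that a restored file matches what was backed up. A minimal sketch of such a drill follows, assuming a checksum was recorded when the backup was taken. `shutil.copyfile` stands in for whatever restore step a real backup product performs, and the function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(backup: Path, restore_target: Path, expected_sha256: str) -> bool:
    """Restore the backup and confirm it matches the checksum recorded at backup time."""
    # Stand-in for a real restore step (tape read, cloud download, and so on).
    shutil.copyfile(backup, restore_target)
    return sha256_of(restore_target) == expected_sha256
```

Comparing against a checksum recorded at backup time, rather than against the live source, matters because the source may be changed, encrypted, or gone by the time a restore is needed. A passing check shows only that the bytes came back intact; a full drill would also confirm that the application starts and reads the restored data.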
Signal
Resilience questions almost always give you two numbers and a cost signal. Extract RTO, extract RPO, note any “budget-constrained” or “mission-critical” cue, and the site tier and backup pattern follow. Scenarios mentioning ransomware specifically test the snapshot-vs-offsite-backup distinction.
Practice Question 1 of 3
A hospital states that its electronic-medical-record system can tolerate no more than 15 minutes of data loss, and must be restored within 1 hour of any disruption. Which pairing best describes the stated objectives?
  • A. RTO = 15 minutes; RPO = 1 hour
  • B. RPO = 15 minutes; RTO = 1 hour
  • C. MTTR = 15 minutes; MTBF = 1 hour
  • D. Both values are MTTR targets

Correct: B. RPO is how much data loss is tolerable (15 minutes here); RTO is how long the outage may last (1 hour here). The exam tests this swap relentlessly.

A wrong: the labels are reversed. C wrong: MTTR and MTBF describe observed behavior, not stated objectives. D wrong: at least one value is a data-loss target, which is RPO.

Source: CompTIA SY0-701 Objectives v5.0 — 3.4 Resilience and Recovery

Practice Question 2 of 3
A finance firm’s internal HR system has a 30-day RTO and minimal regulatory pressure. The CFO wants to minimize continuity spend. Which site strategy best matches the requirement?
  • A. Hot site with active/active replication
  • B. Warm site with hourly snapshot shipping
  • C. Cold site with nightly offsite backups
  • D. Multi-cloud active/active across three providers

Correct: C. A 30-day RTO with minimal regulatory pressure leaves room for the weeks a cold site takes to activate, so a cold site with nightly offsite backups is the cheapest tier that meets the requirement. Spending more would be overengineering.

A and D wrong: both are hot/active strategies that cost far more than a 30-day RTO justifies. B wrong: warm is still overbuilt for this tolerance.

Source: CompTIA SY0-701 Objectives v5.0 — 3.4 Resilience and Recovery

Practice Question 3 of 3
A manufacturer discovers that ransomware has encrypted its production file servers, its hourly storage snapshots, and the nightly backup appliance in the same rack. Which control would most directly have prevented this single-event wipeout?
  • A. Additional hourly snapshots on the same storage array
  • B. Faster replication between production servers
  • C. Offsite, immutable (or air-gapped) backups stored in a separate failure domain
  • D. Larger UPS capacity in the primary datacenter

Correct: C. The failure here is that snapshots and onsite backups share the production failure domain. An offsite, immutable, or air-gapped copy is in a different failure domain the attacker cannot reach, which is exactly what the exam objectives mean by resilience.

A wrong: more snapshots in the same failure domain still fall to the same event. B wrong: replication copies the encrypted state just as fast. D wrong: power is a separate concern and would not have stopped a ransomware encryption event.

Source: CompTIA SY0-701 Objectives v5.0 — 3.4 Resilience and Recovery

Disclaimer: This content is provided for educational and exam preparation purposes only. It is not official CompTIA content, is not endorsed by CompTIA, and does not guarantee exam success. All practice questions are original and based on the published CompTIA SY0-701 Exam Objectives (v5.0). Always refer to the official CompTIA Security+ Exam Objectives as your primary reference.