How to build a SaaS disaster recovery plan, step by step

PJ Muller

Last updated: June 18, 2026

Your backup policy documents the rules. A disaster recovery plan documents what you actually do when things go wrong.

They're different documents for different moments. The backup policy sits in your internal wiki. It tells you what gets backed up, how often, who owns it. It's written when things are calm and consulted during audits. The disaster recovery plan is what you open at 11pm when your ops lead calls to say the project board is gone. It tells you, step by step, what to do in the next two hours.

Most teams have something resembling a backup policy. Almost none have a disaster recovery plan for their SaaS apps. They assume they'll figure it out when it happens. That assumption fails in two ways: you make worse decisions under pressure, and the person who needs to act might not be the person who knows where the backups are. And because your SaaS provider won't be doing this for you, the plan needs to exist on your side.

This post walks you through building one. It's structured as a sequence of steps because that's how a DRP works: you don't improvise, you follow the plan.

What a SaaS disaster recovery plan is (and isn't)

A DRP for SaaS apps is a documented, tested procedure for restoring data and resuming normal operations after an incident.

It is not a backup policy (that's the governance document), a business continuity plan (which covers broader operational resilience, not just data), or an IT incident response plan (which covers security breaches at the system level). Each of those is a different document with a different scope.

Your SaaS DRP is narrower and more operational. It answers: which specific scenarios might affect our SaaS data, who does what when they happen, and how do we restore things in the right order. For most teams it fits on two pages. It should be findable when the person who wrote it is unavailable.

Step 1: Define your threat scenarios

Generic "data loss" is not a useful planning unit. The actions you take after an accidental bulk deletion are completely different from the actions you take after a ransomware attack. Start by listing the specific scenarios your plan needs to cover.

The six most common for SaaS-dependent teams:

Accidental bulk deletion. A team member deletes a project, board, or significant set of records. The most frequent scenario. Usually discovered within hours. Recovery is typically straightforward if backups exist.

Bad import or integration error. A CSV import overwrites field values across hundreds of records. A third-party integration misfires and clears or corrupts data. Nothing was deleted, so the recycle bin won't help. These are among the most common data loss scenarios and among the hardest to detect quickly.

Compromised account. A team member's credentials are stolen. The attacker, operating as that employee, deletes or exports data before the account is locked. The Snowflake customer breaches in 2024 showed this pattern at scale: stolen credentials, no MFA, and data theft across 160+ organisations. Your ransomware protection posture matters here.

Malicious insider. A departing employee with admin access systematically deletes data before leaving. This is different from an accident: the deletion is intentional and may include emptying the recycle bin to eliminate the most obvious recovery path.

Provider outage. Your SaaS app is unavailable for an extended period. This isn't a data loss scenario in the traditional sense, but if your team needs to access data during the outage (for client meetings, calls, ongoing work), your DRP should cover how to access it. Google Drive sync creates a readable copy of your backup data that's accessible even if the source app and ProBackup are both temporarily down.

Silent data corruption. Data is modified incorrectly over days or weeks before anyone notices. The recycle bin's retention period has long expired by the time the problem is found, and the clean snapshot you need may be weeks in the past.

For each scenario, document: how you'd detect it, the likely data impact, and the approximate time sensitivity (how long before the recovery window closes or the business impact becomes critical).

Step 2: Set your RTO and RPO per app

Before you can plan a recovery, you need to know what "recovered" means. That requires two numbers per app.

Recovery Time Objective (RTO) is how long the team can function without the data. Recovery Point Objective (RPO) is how much data loss is acceptable, measured in time since the last clean snapshot.
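As a rough illustration of how the two numbers are checked against a real incident, the sketch below compares the data-loss window (incident time minus last clean snapshot) with the RPO, and the downtime (restore time minus incident time) with the RTO. The helper name `objectives_met` and all timestamps are hypothetical, and measuring downtime from the moment of the incident rather than from detection is an assumption you may want to change.

```python
from datetime import datetime, timedelta


def objectives_met(last_clean_snapshot: datetime,
                   incident_at: datetime,
                   restored_at: datetime,
                   rto: timedelta,
                   rpo: timedelta) -> dict:
    """Compare one incident against its RTO and RPO (illustrative helper)."""
    # Everything changed between the snapshot and the incident is lost.
    data_loss_window = incident_at - last_clean_snapshot
    # Assumption: downtime is counted from the incident, not from detection.
    downtime = restored_at - incident_at
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical incident: snapshot at 02:00, deletion at 15:00, restored at 18:30.
result = objectives_met(
    last_clean_snapshot=datetime(2026, 6, 18, 2, 0),
    incident_at=datetime(2026, 6, 18, 15, 0),
    restored_at=datetime(2026, 6, 18, 18, 30),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=24),
)
print(result["rpo_met"], result["rto_met"])  # True True
```

Here 13 hours of changes are lost against a 24-hour RPO, and the team is without data for 3.5 hours against a 4-hour RTO, so both objectives hold.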

For a quick DRP, you don't need to calculate these for every app. Focus on your Tier 1 apps (the ones you identified in your SaaS data protection audit) and document a realistic RTO and RPO for each.

A practical way to set these is to think about operational dependency: the more the team is blocked without the data, the tighter the RTO needs to be. CRM data tends to block client-facing work within hours. Project boards can usually wait a day. Knowledge bases rarely cause a crisis if they're down overnight.

| App | RTO | RPO | Notes |
|---|---|---|---|
| Asana | 4 hours | 24 hours | Sales team blocked after half a day |
| HubSpot | 2 hours | 24 hours | Client calls require deal history |
| Monday.com | 8 hours | 24 hours | Can reconstruct daily standup from memory |
| Notion | 24 hours | 72 hours | Knowledge base, non-urgent |

These numbers shape your recovery prioritisation. When two things need restoring at once, you restore the app with the tighter RTO first.
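The prioritisation rule can be sketched as a sort on RTO. The example below uses the sample values from the table above. The names `RecoveryObjective` and `restore_order` are hypothetical, and breaking ties on the tighter RPO is an assumption, not something the rule itself specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjective:
    app: str
    rto_hours: int
    rpo_hours: int


# Sample values from the example table in this post.
OBJECTIVES = [
    RecoveryObjective("Asana", 4, 24),
    RecoveryObjective("HubSpot", 2, 24),
    RecoveryObjective("Monday.com", 8, 24),
    RecoveryObjective("Notion", 24, 72),
]


def restore_order(affected_apps, objectives=OBJECTIVES):
    """Return the affected apps sorted so the tightest RTO is restored first."""
    by_app = {o.app: o for o in objectives}
    unknown = [app for app in affected_apps if app not in by_app]
    if unknown:
        # An app with no documented objectives is a gap in the plan itself.
        raise KeyError(f"No RTO/RPO documented for: {unknown}")
    # Assumption: ties on RTO are broken by the tighter RPO.
    return sorted(
        affected_apps,
        key=lambda app: (by_app[app].rto_hours, by_app[app].rpo_hours),
    )


print(restore_order(["Notion", "Asana", "HubSpot"]))
# ['HubSpot', 'Asana', 'Notion']
```

Raising an error for an undocumented app is deliberate: discovering during an incident that an app has no agreed RTO is exactly the gap the plan is meant to close.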

Step 3: Map your recovery resources

When an incident happens, whoever is managing the recovery needs to know exactly where to go and what to use. Document this before you need it.

For each Tier 1 app, record:

Backup tool and login URL. Where is the backup vault? What credentials are needed? (Don't store passwords in the DRP document itself, but note where they're held, e.g. "1Password vault, Ops team folder.")
Who has restore access. Name specific people, not roles. "Head of IT" is useless if that person is the one who just left.
Snapshot location. For ProBackup users: Home page, then "Go to..." for the relevant app. For Google Drive sync users: the backup folder structure in Drive, organised by app and project.
Escalation contact. If the primary restore person can't be reached, who is second?
Backup tool support contact. For ProBackup: support is available via the in-app chat and at support.probackup.io. For incidents where the volume or complexity is high, Pro and Premium plans have priority support.

This section of your DRP should be a single reference page someone can scan in under two minutes.

ProBackup expert note: One thing that catches teams off guard during incidents: the person with admin access to the backup tool isn't always the same person who knows the SaaS app well enough to verify a restore. Build both roles into your recovery resources. You need someone who can operate the backup tool, and someone who can look at the restored data and confirm it's correct. These are often different people.

Step 4: Write your incident response runbook

This is the core of the DRP. A runbook is a documented, sequential procedure that anyone on the team can follow. It removes decision-making under pressure and ensures nothing important gets skipped.

The sequence for most SaaS data loss incidents follows six stages:

1. Detect

Establish that an incident has occurred. Common signals: a team member reports missing data, a ProBackup smart alert fires indicating unusual deletion activity, or a weekly status email shows a spike in deletions. Document who receives alerts and who is responsible for triaging them.

2. Assess

Before acting, understand the scope. What data is affected? Which app, which workspace, which records? When did it happen (or when might it have started)? Is it still ongoing (e.g. a misfiring integration still running) or contained? Stopping an ongoing cause before restoring prevents restoring data into a still-broken state.

3. Contain

If the cause is still active, stop it. Revoke access for a compromised account. Disconnect the misbehaving integration. Pause imports that are still running. A restore is pointless if the same thing will happen again immediately.

4. Restore

Using your documented recovery resources, identify the correct snapshot date, select the affected items, and trigger the restore. For ProBackup: navigate to the vault, select the snapshot date preceding the incident, identify affected records, and use "Restore as new records" to create copies without overwriting current data. Track progress in the Restore Report. The full step-by-step procedure is in our guide on how to test your SaaS backups, which covers the same navigation path used in a real restore.

For silent corruption scenarios where the start date is unknown, work backwards through available snapshots from the most recent until you find a clean version of the affected data.
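The walk-back for silent corruption can be sketched as a newest-first search. In the example below, `latest_clean_snapshot` is a hypothetical helper, and `is_clean` stands in for whatever check you apply to a snapshot, which in practice is usually a person inspecting the affected records. It does not represent any ProBackup API.

```python
from datetime import date, timedelta
from typing import Callable, Iterable, Optional


def latest_clean_snapshot(snapshot_dates: Iterable[date],
                          is_clean: Callable[[date], bool]) -> Optional[date]:
    """Check snapshots from newest to oldest and return the first clean one."""
    for snapshot in sorted(snapshot_dates, reverse=True):
        if is_clean(snapshot):
            return snapshot
    # No clean version exists within the retained snapshots.
    return None


# Hypothetical data: daily snapshots for 1-10 June, corruption began on 6 June.
snapshots = [date(2026, 6, 1) + timedelta(days=n) for n in range(10)]
corruption_started = date(2026, 6, 6)

clean = latest_clean_snapshot(snapshots, lambda d: d < corruption_started)
print(clean)  # 2026-06-05
```

Starting from the most recent snapshot keeps the restored data as fresh as possible: the first clean snapshot found is the one that loses the least legitimate work.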

5. Verify

Don't close the incident until someone has confirmed the restored data is accurate. This should be the person who knows the data well enough to spot something wrong: the project owner, the account manager, whoever works in that workspace daily. Check field values, comments, attachments, and any related records.

6. Document

Record what happened: the timeline, cause, data affected, snapshot date used, who performed the restore, time to resolution, and whether any data was unrecoverable. This log serves three purposes: it improves your next response, it's evidence for compliance audits (SOC 2's Availability criterion and ISO 27001 Annex A.12.3 both require incident documentation), and it's the input for a post-incident review to prevent recurrence. Store it alongside your restore test log.

Step 5: Assign roles and contacts

A runbook without names is a procedure without owners. For each stage of the runbook, assign a primary and a backup person:

| Role | Primary | Backup |
|---|---|---|
| Incident detection / triage | [Name] | [Name] |
| Scope assessment | [Name] | [Name] |
| Containment (access revocation) | [Name] | [Name] |
| Restore execution | [Name] | [Name] |
| Data verification | [Name] | [Name] |
| Stakeholder communication | [Name] | [Name] |
| Incident documentation | [Name] | [Name] |
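If you keep the roles table in a machine-readable form, a small check can flag gaps before an incident does. The sketch below is one way to do that. The function name `assignment_gaps` and the people named are hypothetical, and treating "primary and backup are the same person" as a gap is an assumption about what a useful backup means.

```python
PLACEHOLDER = "[Name]"


def assignment_gaps(roles: dict) -> list:
    """Return a description of every role lacking a distinct primary and backup."""
    gaps = []
    for role, (primary, backup) in roles.items():
        if not primary or not backup or PLACEHOLDER in (primary, backup):
            gaps.append(f"{role}: missing primary or backup")
        elif primary == backup:
            # Assumption: a backup only helps if it is a different person.
            gaps.append(f"{role}: primary and backup are the same person")
    return gaps


# Hypothetical assignments using role names from the table above.
roles = {
    "Restore execution": ("Ana", "Ben"),
    "Data verification": ("Ana", "[Name]"),
    "Stakeholder communication": ("Cy", "Cy"),
}

for gap in assignment_gaps(roles):
    print(gap)
# Data verification: missing primary or backup
# Stakeholder communication: primary and backup are the same person
```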

Also document your escalation path. If the primary restore person can't be reached within [30 minutes], who is contacted next? Who has authority to make the call to involve external support or notify affected clients?

For out-of-hours incidents, include personal contact numbers or Slack handles that work when email is too slow. Data loss at 9pm doesn't wait until Monday.

ProBackup expert note: Communication is the stage teams most often forget to plan. During an incident, someone needs to be updating stakeholders: what happened, what's being done, what the timeline looks like. This is a separate job from the technical recovery. If the person doing the restore is also managing client communications, both suffer. Assign them separately before you need to.

Step 6: Test the plan

A disaster recovery plan you've never tested is just an assumption. Run a tabletop exercise once a year: gather the relevant people, walk through a realistic scenario ("it's Monday morning, someone reports the HubSpot pipeline is empty"), and talk through each stage of the runbook. Who would do what? What would they need? Where would the friction be?

Tabletop exercises are low-cost and reveal gaps in the plan without the pressure of a real incident. Common things they surface: missing contact numbers, unclear ownership at the containment stage, backup access held only by someone who's since left the company.

Once a year, run a live drill: trigger an actual restore from a real snapshot, time the process end to end, and document how long each stage took. If your RTO for HubSpot is two hours and your live drill takes four, you've found the gap before it matters.
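The comparison between a drill and the RTO is simple arithmetic, sketched below. The helper name `drill_report` and the per-stage timings are hypothetical, chosen so the total matches the four-hour drill against a two-hour RTO described above.

```python
from typing import Mapping


def drill_report(stage_minutes: Mapping[str, float], rto_hours: float) -> dict:
    """Summarise a timed drill against the RTO for the app that was restored."""
    total_hours = sum(stage_minutes.values()) / 60
    return {
        "total_hours": total_hours,
        # Positive means the drill overran the RTO by that many hours.
        "over_rto_by_hours": total_hours - rto_hours,
        "slowest_stage": max(stage_minutes, key=stage_minutes.get),
    }


# Hypothetical timings from a live drill, in minutes per runbook stage.
timings = {"detect": 20, "assess": 40, "contain": 30, "restore": 120, "verify": 30}

report = drill_report(timings, rto_hours=2)
print(report)
# {'total_hours': 4.0, 'over_rto_by_hours': 2.0, 'slowest_stage': 'restore'}
```

Recording the slowest stage alongside the total shows where to look first when the drill overruns.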

After any real incident, treat it as a drill debrief. Update the runbook based on what actually happened versus what the plan assumed.

Template: SaaS DRP one-pager

Below is a condensed template. A working DRP doesn't need to be long. It needs to be findable, current, and specific enough to act on.

[Organisation name] SaaS disaster recovery plan

Version: 1.0 | Owner: [Name, Role] | Last tested: [Date]

Covered apps: [List Tier 1 SaaS apps]

Recovery objectives:

| App | RTO | RPO |
|---|---|---|
| [App] | [e.g. 4 hours] | [e.g. 24 hours] |

Recovery resources:

| App | Backup tool | Restore access | Escalation |
|---|---|---|---|
| [App] | [e.g. ProBackup] | [Names] | [Name, contact] |

Threat scenarios covered: Accidental deletion / Bad import or integration error / Compromised account / Malicious insider / Provider outage / Silent data corruption

Incident response sequence:

Detect: [who receives alerts, how incidents are reported]
Assess: [who assesses scope, what information is gathered]
Contain: [who revokes access or stops the cause]
Restore: [who performs the restore, backup tool used, procedure reference]
Verify: [who confirms data accuracy]
Document: [where the incident log is recorded]

Roles and contacts:

| Role | Primary | Backup | Out-of-hours contact |
|---|---|---|---|
| Triage | [Name] | [Name] | [Signal/mobile] |
| Restore | [Name] | [Name] | [Signal/mobile] |
| Verification | [Name] | [Name] | [Signal/mobile] |
| Communications | [Name] | [Name] | [Signal/mobile] |

Related documents: [Link to backup policy] | [Link to SaaS data protection audit] | [Link to restore test log]

What to do next

Getting to this point means you've thought seriously about what recovery actually looks like: who acts, in what order, with what tools. That's the work most teams skip. The next step is making sure the backup infrastructure your runbook relies on is actually in place.

If you already have a backup policy in place, the DRP is the natural next document: same apps, same ownership, different purpose. Start with the roles and contact table and the RTO/RPO table from your audit, then build the runbook around them.

If you haven't yet set up independent automated backups, there's no point building a runbook that points to a backup vault that doesn't exist. Start a free trial of ProBackup, get your first snapshot running, then build the plan around that. Setup takes about three minutes per app. Plans start at $25/month (billed yearly) and cover 19+ platforms under a single licence.

On our success stories page, you can see how other teams have used ProBackup when they actually needed to recover.

PJ Muller

CTO

ProBackup

PJ is the CTO of ProBackup, where he brings a passion for software craftsmanship to building the technical foundation that keeps customer data safe and reliable. When he's not architecting backup systems, you'll find him hiking in the mountains - fuelled by chocolate.