"Disaster Recovery" and related thoughts...

Backup, Archive, High Availability, Disaster Recovery, Business Continuity. All related. Yet all different.

One of my colleagues was recently faced with needing to run “a DR [disaster recovery] workshop” for a client. My initial impression was:

What disasters are they planning for?
I’ll bet they are thinking about Coronavirus and working remotely. That’s not really DR.
Or are they really thinking about a backup strategy?

So I decided to turn some of my rambling thoughts into a blog post. Each of these topics could be a post in its own right – I’m just scratching the surface here…

Let’s start with backup (and recovery)

Backups (of data) are a fairly simple concept. Anything that would create a problem if it was lost should be backed up. For example, my digital photos are considered to not exist at all unless they are synchronised (or backed up) to at least two other places (some network-attached storage, and the cloud).

In a business context, we run backups in order to be able to recover (restore) our content (configuration or data) within a given window. We may have weekly full backups and daily incremental or differential backups (perhaps with more regular snapshots), then retain parent, grandparent and great-grandparent copies of the full backups (four weeks) and keep each of these as (lunar) monthly backups for a year. That’s just an example – each organisation will have its own backup/retention policies and those backups may be stored on or off-site, on tape or disk.
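To make the example policy above a little more concrete, here is a minimal Python sketch of the retention decision for the full backups only. It assumes weekly fulls taken on a fixed weekday. The function name, the `epoch` parameter (the date of the first full, used to decide which weeks count as lunar-monthly copies) and the default counts are my own illustrative choices, not something taken from a particular backup product.

```python
from datetime import date, timedelta


def fulls_to_keep(full_backup_dates, today, epoch, weekly_copies=4, lunar_months=13):
    """Return the set of full-backup dates that the example policy retains.

    Assumption: fulls are taken weekly, and every fourth week counted from
    `epoch` is promoted to a "lunar monthly" copy kept for about a year.
    """
    keep = set()
    for taken in full_backup_dates:
        age_days = (today - taken).days
        if age_days < 0:
            continue  # ignore anything dated in the future
        weeks_since_epoch = (taken - epoch).days // 7
        if age_days < weekly_copies * 7:
            keep.add(taken)  # one of the most recent weekly fulls
        elif weeks_since_epoch % 4 == 0 and age_days <= lunar_months * 28:
            keep.add(taken)  # a lunar-monthly copy within the last year
    return keep


# Hypothetical schedule: 60 weekly fulls, starting on Sunday 5 January 2020.
first_full = date(2020, 1, 5)
all_fulls = [first_full + timedelta(weeks=i) for i in range(60)]
kept = fulls_to_keep(all_fulls, today=all_fulls[-1], epoch=first_full)
```

Anchoring the monthly copies to a fixed `epoch`, rather than counting back from the newest backup, means a copy that was kept last week is still eligible this week. Incremental and differential backups are deliberately left out, because the example policy doesn’t say how long they are retained.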

In summary, backups are about making sure we have an up-to-date copy of our important configuration information and data, so we can recover it if the primary copy is lost or damaged.

And for bonus content, some services we might consider in a modern infrastructure context include Azure Backup or AWS Backup.

Backups must be verified and periodically tested in order to have any use.
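One simple form of verification is to restore a file to a scratch location and compare its checksum with the original. The sketch below does that with SHA-256 from Python’s standard library. The function names are hypothetical and it only covers a single file; a real test would also confirm that the restored application or database actually starts and works.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(source_path, restored_path):
    """True if the restored copy is byte-for-byte identical to the source."""
    return sha256_of(source_path) == sha256_of(restored_path)
```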

Archiving information

When I wrote about backups above, I mentioned keeping multiple copies covering various points in time. Whilst some may consider this adequate for archival, archival is the long-term preservation of data with read-only access – for example, documents that must be retained for an extended period of time (say 7, 10, 25 or 99 years). Once that would have been paper documents, in boxes. Now it might be digital files (or database contents) on tape or disk (potentially cloud storage).

Archival might still use backup software and associated retention policies, but we’ll think carefully about the medium we store it on. For very long term physical storage we might need to consider the media formats (paper is bulky, so it’s transferred to microfiche; old magnetic media degrades, so it’s moved to optical storage – but then the hardware becomes obsolete, so it’s moved to yet another format). If storing on disk (on-premises or in the cloud), we can use slower (cheaper) disks and accept that restoration from the archive may take additional time.

In summary, archival is about long-term data storage, generally measured in many years and archives might be stored off-line, or near-line.

Technologies we might use for archival are similar to backups, but we could consider lower-cost storage – e.g. Azure Storage’s Cool or Archive tiers or Amazon S3 Glacier.

Keeping systems highly available

High Availability (HA) is about making sure that our systems are available for as much time as possible – or certainly within a given service level agreement (SLA).

Traditionally, we used technologies like redundant arrays of inexpensive disks (RAID), mirrored or error-correcting (ECC) memory, or redundant power supplies. We might also have created server clusters or farms. All of these methods have the intention of removing single points of failure (SPOFs).

In the cloud, we leave a lot of the infrastructure considerations to the cloud service provider and we design for failure in other ways.

We assume that virtual machines will fail and create availability sets.
We plan to scale out across multiple hosts for applications that can take advantage of that architecture.
We store data in multiple regions.
We may even consider multiple clouds.

Again, the level of redundancy built into the app and its supporting infrastructure must be designed according to requirements – as defined by the SLA. There may be no point in providing an expensive four nines uptime for an application that’s used once a month by one person, who works normal office hours. But, then again, what if that application is business critical – like payroll? Again, refer to the SLA – and maybe think about business continuity too… more on that in a moment.

Some of my clients have tried to implement Windows Server clusters in Azure. I’ve yet to be convinced and still consider that it’s old-world thinking applied in a contemporary scenario. There are better ways to design a highly available file service in 2020.

In summary, high availability is about ensuring that an application or service is available within the requirements of the associated service level agreement.

Technologies might include some of the hardware considerations I listed earlier, but these days we’re probably thinking more about:

Azure Virtual Machine Availability Sets.
Azure Virtual Machine Scale Sets.
Elastic database pools in Azure SQL Database, or Multi-AZ deployments in Amazon RDS.
Autoscaling and other capabilities in Azure App Service.
Azure Traffic Manager (DNS), Load Balancer (layer 4) or Application Gateway (layer 7)/AWS Elastic Load Balancing (various options).
AWS/Azure Availability Zones/Regions (e.g. for data replication).
Multi-cloud architectures (but think carefully).

Remember to also consider other applications/systems upon which an application relies.
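Dependencies matter because availability compounds. As a rough sketch, if an application needs every one of its dependencies to be up, and we assume (optimistically) that they fail independently of each other, the combined availability is the product of the individual figures. The function name and the example figures below are purely illustrative.

```python
from math import prod


def composite_availability(availabilities):
    """Combined availability of components that must ALL be up.

    Assumes independent failures; each value is a fraction such as 0.9995.
    """
    return prod(availabilities)


# Hypothetical chain: web tier, database and identity service.
chain = [0.9995, 0.9999, 0.999]
combined = composite_availability(chain)
print(f"Combined availability: {combined:.4%}")
```

The combined figure is always lower than the weakest link, which is why an application can’t promise a better SLA than the services it relies upon unless there is redundancy between them.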

Also, quoting from some of Microsoft’s training materials:

“To achieve four 9’s (99.99%), you probably can’t rely on manual intervention to recover from failures. The application must be self-diagnosing and self-healing.
Beyond four 9’s, it is challenging to detect outages quickly enough to meet the SLA.
Think about the time window that your SLA is measured against. The smaller the window, the tighter the tolerances. It probably doesn’t make sense to define your SLA in terms of hourly or daily uptime.”
Microsoft Learn: Design for recoverability and availability in Azure: High Availability
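The point about measurement windows is easier to see with some arithmetic. This short sketch converts an SLA percentage into the downtime allowed within a given window; the function name is my own and the calculation is simply window × (1 − SLA).

```python
from datetime import timedelta


def downtime_budget(sla_percent, window):
    """Maximum downtime allowed within `window` for a given SLA percentage."""
    return window * (1 - sla_percent / 100)


for label, window in [
    ("hour", timedelta(hours=1)),
    ("day", timedelta(days=1)),
    ("30-day month", timedelta(days=30)),
    ("365-day year", timedelta(days=365)),
]:
    budget = downtime_budget(99.99, window)
    print(f"99.99% over one {label}: {budget.total_seconds():.2f} seconds")
```

At four nines, the budget over a 30-day month is a little over four minutes, and over a single hour it is well under a second, which is why an hourly or daily window leaves almost no room for any kind of recovery.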

Disaster recovery

As the name suggests, Disaster Recovery (DR) is about recovering from a disaster, whatever that might be.

It could be physical damage to a piece of hardware (a switch, a server) that requires replacement or recovery from backup. It could be a whole server room or datacentre that’s been damaged or destroyed. It could be data loss as a result of malicious or accidental actions by an employee.

This is where DR plans come into play: firstly analysing the risks that might lead to disaster (including possible data loss and major downtime scenarios) and then looking at recovery objectives – the application’s recovery point objective (RPO) and recovery time objective (RTO).

Quoting Microsoft’s training materials again:

“Recovery Point Objective (RPO): The maximum duration of acceptable data loss. RPO is measured in units of time, not volume: “30 minutes of data”, “four hours of data”, and so on. RPO is about limiting and recovering from data loss, not data theft.
Recovery Time Objective (RTO): The maximum duration of acceptable downtime, where “downtime” needs to be defined by your specification. For example, if the acceptable downtime duration is eight hours in the event of a disaster, then your RTO is eight hours.”
Microsoft Learn: Design for recoverability and availability in Azure: Disaster Recovery

For example, I may have a database that needs to be able to withstand no more than 15 minutes’ data loss and an associated SLA that dictates no more than 4 hours’ downtime in a given period. For that, my RPO is 15 minutes and the RTO is 4 hours. I need to make sure that I take snapshots (e.g. of transaction logs for replay) at least every 15 minutes and that my restoration process to get from offline to fully recovered takes no more than 4 hours (which will, of course, determine the technologies used).
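That check can be written down as a minimal sketch. It assumes the worst-case data loss is roughly the interval between snapshots, and that the restore time comes from an actual timed test rather than an estimate. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum acceptable data loss, measured in time
    rto: timedelta  # maximum acceptable downtime


def objective_gaps(objectives, snapshot_interval, measured_restore_time):
    """Return a list of problems; an empty list means both objectives are met."""
    problems = []
    if snapshot_interval > objectives.rpo:
        problems.append(
            f"Snapshots every {snapshot_interval} can lose more than the RPO of {objectives.rpo}"
        )
    if measured_restore_time > objectives.rto:
        problems.append(
            f"Restore took {measured_restore_time}, longer than the RTO of {objectives.rto}"
        )
    return problems


# The database example above: RPO of 15 minutes, RTO of 4 hours.
database = RecoveryObjectives(rpo=timedelta(minutes=15), rto=timedelta(hours=4))
gaps = objective_gaps(
    database,
    snapshot_interval=timedelta(minutes=30),
    measured_restore_time=timedelta(hours=3, minutes=30),
)
```

Here the restore time is within the RTO but a 30-minute snapshot interval breaks the RPO, so `gaps` contains one entry.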

Considerations when creating a DR plan might include:

What are the requirements for each application/service?
How are systems linked – what are the dependencies between applications/services?
How will you recover within the required RPO and RTO constraints?
How can replicated data be switched over?
Are there multiple environments (e.g. dev, test and production)?
How will you recover from logical errors in a database that might impact several generations of backup, or that may have spread through multiple data replicas?
What about cloud services – do you need to back up SaaS data (e.g. Office 365)? (Possibly not, if you’re happy with a retention-period based restoration from a “recycle bin” or similar, but what if an administrator deletes some data?)
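On the question of dependencies, one way to turn a dependency map into a recovery order is a topological sort, so that each service is only restored after the things it relies upon. The sketch below uses `graphlib` from the Python standard library (Python 3.9 or later). The service names are made up for illustration, and a real plan would also need to weigh business priority, not just technical dependencies.

```python
from graphlib import TopologicalSorter


def recovery_order(depends_on):
    """Return services in an order where dependencies are restored first.

    `depends_on` maps each service to the set of services it relies upon.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependency map for a payroll application.
services = {
    "payroll-app": {"database", "identity"},
    "database": {"storage"},
    "identity": set(),
    "storage": set(),
}
order = recovery_order(services)
print(" -> ".join(order))
```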

As can be seen, there are many factors here – more than I can go into in this blog post, but a disaster recovery strategy needs to consider backup/recovery, archive, availability (high or otherwise), technology and service (it may help to think about some of the ITIL service design processes).

In summary, disaster recovery is about having a plan to be able to recover from an event that results in downtime, data loss or both.

Technologies that might help include Azure Site Recovery. Applications can also be designed with data replication and recovery in mind, for example, using geo-replication capabilities in Azure Storage/Amazon S3, Azure SQL Database/Amazon RDS or using a globally-distributed database such as Azure Cosmos DB. And DR plans must be periodically tested.

Business continuity

Finally, Business Continuity (BC). This is something that many organisations will have had to contend with over the last few weeks and months.

BC is often confused with DR but they are different. Business continuity is about continuing to conduct business when something goes wrong. That may be how to carry on working whilst recovering from a disaster. Or it may be how to adapt processes to allow a workforce to continue functioning in compliance with social distancing regulations.

Again, BC needs a plan. But many of those plans will be reconsidered now – if your BC arrangements are that in the event of an office closure, people go to a hosted DR site with some spare equipment that will be made available within an agreed timescale, that might not help in the event of a global pandemic, when everyone else wants to use that facility. Instead, how will your workforce continue to work at home? Which systems are important? How will you provide secure remote access to those systems? (How will you serve customers whilst employees are also looking after children?) The list goes on.

Technology may help with BC, but technology alone will not provide a solution. The use of modern approaches to End User Computing will certainly make secure remote and mobile working a possibility (indeed, organisations that have taken a modern approach will probably already be familiar with those practices) but a lot of the issues will relate to people and process.

In summary, Business Continuity plans may be invoked if there is a disaster but they are about adapting business processes to maintain service in times of disruption.

Wrapping up

As I was writing this post, I thought about many tangents that I could go off and cover. I’m pretty sure the topic could be a book and this post only scratches the surface. Nevertheless, I hope my thoughts are useful and show that disaster recovery cannot be considered in isolation.

[This is an edited version of a post that was originally published at markwilson.it]

About the author

Mark Wilson