To answer this question for your company, you first need to be clear about how critical your application and database are. In a high-availability environment, a standby database can be part of the technical solution to ensure minimum downtime for your database.
But when moving to a public or private cloud (IaaS, PaaS), the providers already ensure a certain level of high availability. So why should a standby database still be considered a viable configuration when high availability is needed?
In this post, I’ll cover the following topics:
- The root cause of downtime in a datacenter
- Cloud provider tools to cover downtime and availability implementations of the main cloud providers
- A few causes of downtime that are NOT covered by cloud providers (but standby databases do…)
- Does downtime of a public cloud occur often and what are the risks?
- Is a standby database expensive?
When talking about your IaaS or PaaS availability, you need to know which precautions you want to take. So… what are the root causes of downtime, and can they be avoided by choosing the right provider / platform / configuration?
Root causes of downtime
The following image gives an idea of the root causes of downtime. There are plenty of sources to choose from, but I chose this image from ScienceDirect in particular because it shows planned downtime in relation to unplanned downtime. As you can see, planned downtime is quite a big slice of the pie.
Provider tools and measures
Which tools and measures do public cloud providers generally use to cover these causes?
- Virtualization, like Hyper-V, Oracle VM, VMware.
- Regions / zones (areas with multiple data centers)
- Load balancing
- Storage mirroring
- Authentication / Authorization
- Dedicated network
- Backup / recovery
- Machine learning
These measures are pretty solid, and for a lot of applications and databases they are sufficient, so no standby database is needed.
For a good comparison of the different implementations in the public clouds, here is a ‘cheat-sheet‘.
What is not covered
What do these measures generally NOT cover, related to the earlier mentioned root causes of downtime?
- Planned downtime, either by you (e.g. patching the IaaS database or releasing new application software) or by the public cloud provider (maintenance, retirement of hardware)
- The time the provider needs, or chooses to take, to react to unplanned downtime
- Human errors / functional database corruptions
I’ll try to clarify these gaps.
Ad 1. Planned downtime / retirement.
Cloud or not, your application or database still runs on hardware that needs maintenance, performed by the provider. Amazon (AWS), for example, calls these ‘scheduled maintenance events’.
Most of the time, you are able to reschedule the event, but it could still result in downtime.
Worst case scenario: when AWS detects an irreparable failure of the underlying host of your instance, it schedules the instance to stop or terminate, depending on the type of root device of the instance.
You could receive scary emails announcing exactly this. It is beyond your control, but you should be prepared for this kind of event.
Your application software also needs attention, which could result in downtime. Fortunately, that scheduled downtime is in your hands: you are in control. But if you want to avoid downtime altogether, you also have to take measures, and a standby database could be a suitable tool for this.
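Staying ahead of these events means checking for them regularly rather than relying on the email. A minimal sketch of such a check, operating on data shaped like the `Events` entries that AWS returns for instance status (the instance IDs and dates below are hypothetical examples):

```python
from datetime import datetime, timezone

def upcoming_maintenance(instance_statuses):
    """Collect pending scheduled events (instance-retirement, system-reboot,
    etc.), skipping events AWS has already marked as finished."""
    pending = []
    for status in instance_statuses:
        for event in status.get("Events", []):
            # AWS marks finished events by prefixing the description
            # with "[Completed]" or "[Canceled]".
            if event["Description"].startswith(("[Completed]", "[Canceled]")):
                continue
            pending.append({
                "InstanceId": status["InstanceId"],
                "Code": event["Code"],
                "NotBefore": event["NotBefore"],
            })
    return pending

# Example payload, shaped like an EC2 instance-status response:
statuses = [
    {"InstanceId": "i-0abc", "Events": [
        {"Code": "instance-retirement",
         "Description": "The instance is running on degraded hardware",
         "NotBefore": datetime(2018, 9, 1, tzinfo=timezone.utc)}]},
    {"InstanceId": "i-0def", "Events": [
        {"Code": "system-reboot",
         "Description": "[Completed] scheduled reboot"}]},
]
print(upcoming_maintenance(statuses))
```

Run daily, a check like this gives you time to reschedule the event or fail over to a standby before AWS acts on its own schedule.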
Ad 2. Response time of the cloud provider.
What happens in case of (natural) disasters, attacks, a power down of a data center or a network failure? Well, the provider will fix it, won’t they? Yes, they will… eventually.
And there’s the catch. Providers are not capable of servicing individual clients in case of a big disaster, which is quite understandable. They even put it in writing:
A statement from Azure about this (GRS stands for Geo-Redundant Storage): https://blogs.msdn.microsoft.com/paulking/2016/04/15/storage-in-azure-and-how-to-plan-for-dr/
“GRS does not fail over to the secondary location unless there is TOTAL data center outage. If a few clusters are down and you are affected you may find yourself dead in the water without a paddle until those clusters are fixed. […] we will hold off on doing the failover and focus on recovering the data in the primary location….”
Which means you’re out of control, and you have to wait until the provider fixes it.
Ad 3. Some human errors.
The causes of human error are mostly not technical, and such errors may even be avoided by following tested procedures. The only thing you can do at a technical level is use the standard availability options within the database, e.g. Flashback, RMAN, export. When that’s insufficient, you could consider a standby database with a deliberate ‘gap’ in the synchronization. This means the standby database is always running a bit behind, which gives you more time to react.
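The arithmetic behind such a deliberate gap is simple: an erroneous change can still be caught on the standby as long as it is detected before the delayed apply catches up with it. A small sketch (the four-hour delay and the timestamps are hypothetical examples, not a recommendation):

```python
from datetime import datetime, timedelta

APPLY_DELAY = timedelta(hours=4)  # deliberate gap in applying redo to the standby

def still_recoverable(error_time, detection_time, apply_delay=APPLY_DELAY):
    """The standby has not yet applied the erroneous change as long as
    the error is detected within `apply_delay` of when it happened."""
    return detection_time - error_time < apply_delay

# An accidental DROP TABLE at 09:00, noticed at 11:30: the standby is
# still four hours behind, so the table still exists there.
err = datetime(2018, 6, 1, 9, 0)
print(still_recoverable(err, datetime(2018, 6, 1, 11, 30)))  # True
print(still_recoverable(err, datetime(2018, 6, 1, 14, 30)))  # False, already applied
```

The trade-off is equally simple: a larger delay buys more reaction time for human errors, but lengthens the failover time for real disasters, because the remaining redo must be applied first.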
What’s the risk
Should you be worried? Does downtime in public clouds occur often?
Not all the downtime of public cloud providers is publicly available, and data is always arguable, but this is a nice explanation about the figures.
A few examples:
- On Friday, March 2, 2018, a power outage hit the AWS East Region (Ashburn), affecting hundreds of critical enterprise services like Atlassian, Slack and Twilio. Significant corporate websites and Amazon’s own services were impacted as well. It’s analyzed and explained here.
- In May, 2018, Amazon was hit by an outage — the company witnessed a critical connectivity issue due to some hardware failure in a data center in North Virginia. AWS’s EC2, Relational Database Service, Workspaces, and Redshift were all impacted by the outage. The same day, Amazon said in an update: “customers with EC2 instances in the availability zone may see issues with connectivity to the affected instances.”
- In June, 2018, Microsoft Azure suffered a critical outage overnight that affected the platform’s storage and networking services. The outage affected the Northern Europe region. The reason was an underlying temperature issue in one of the data centers in the region. According to the company, the outage started at 5.45 PM and lasted till 4.30 AM. However, it seemed that many customers faced issues for a long time, despite Azure Support claiming that engineers had “mitigated the issue and impacted services should be recovered at this time”.
Is a standby database expensive?
Compared to the costs of downtime: no. But that’s not always easy to explain to the decision makers, especially if you’re designing or implementing a large VM with a database whose only task is to synchronize data, and which is not accessible to users…
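One way to make that case is plain arithmetic: compare the expected yearly cost of downtime against the yearly cost of running the standby. All the figures below are hypothetical placeholders; plug in your own numbers:

```python
def yearly_downtime_cost(hours_down_per_year, loss_per_hour):
    """Expected yearly cost of downtime, the simplest possible model."""
    return hours_down_per_year * loss_per_hour

# Hypothetical numbers: 8 hours of unplanned downtime a year at
# $10,000 per hour, versus a modest standby VM plus software.
downtime = yearly_downtime_cost(8, 10_000)   # $80,000 per year at risk
standby = 12 * 1_500                         # $18,000 per year for the standby
print(f"downtime: ${downtime:,}, standby: ${standby:,}")
print("standby pays off" if standby < downtime else "standby does not pay off")
```

Even this crude model usually lands in favor of the standby once the loss per hour of a critical application is honestly estimated.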
But there are several ways to make it more bearable and therefore easier to sell to the managers:
- Make the standby database accessible for reads, e.g. with Active Data Guard in the case of Oracle Enterprise Edition. Downside: the expensive license costs of the Active Data Guard option on top of Enterprise Edition
- Use the standby database for backup / export functionality. This will make it a useful database
- Downsize the standby database, and scale when needed
- Activate the standby database/VM only when syncing the data. Could be 1 or 2 hours a day
- Use software packages like Dbvisit: a disaster recovery solution for Oracle Standard Edition databases
A standby database is not a luxury product that is only needed and affordable for very, very critical applications or databases. You can use it more often than you may be aware of, and lift the availability of your application or database to a higher level at quite affordable cost.
When you trust your data to a provider, you are introducing an extra management layer that is outside your span of control. With a standby database, you retain at least some of that control.
Causes of downtime: https://www.sciencedirect.com/topics/computer-science/unplanned-downtime
AWS scheduled maintenance : https://aws.amazon.com/premiumsupport/knowledge-center/ec2-scheduled-maintenance-action/