Cloud Computing: Availability and SLA of Azure virtual machines
Updated: Jan 25, 2022
In this post, I want to discuss the availability of virtual machines. It's a critical topic because it answers how to keep your servers functioning through a local hardware failure or even a large-scale catastrophe.
Note that availability solutions for virtual machines usually involve creating additional instances of the virtual machine that work together. So the main question we need to answer here is how to distribute these instances using the solutions Azure offers.
Before diving into the technical solutions of the Azure cloud, let's start by reviewing the five SLA levels that Microsoft publishes for virtual machines:
For more information:
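To put any SLA percentage in perspective, it helps to convert it into the downtime it still permits. The snippet below is a minimal sketch of that arithmetic. The function name and the sample percentages are my own illustrative choices, not a restatement of Microsoft's figures, so check the official SLA for the actual values.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Return the downtime (in minutes) that an SLA still permits over `days` days."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = days * MINUTES_PER_DAY
    # The SLA covers sla_percent of the period; the remainder may be downtime.
    return total_minutes * (100 - sla_percent) / 100


if __name__ == "__main__":
    # Illustrative percentages only (assumption, not taken from the SLA table).
    for sla in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% -> {minutes:.1f} minutes per 30 days")
```

For example, a 99.9% commitment over a 30-day month leaves roughly 43 minutes of permitted downtime, while 99.99% leaves a little over 4 minutes.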
Availability concepts in Azure
There are four core concepts we need to be familiar with: fault domains, update domains, availability sets, and availability zones.
So let's go through them:
A fault domain is a logical group of physical hardware that shares a common power source and network switch, similar to a rack in a traditional datacenter (a cabinet with a single power source and network connection that can host multiple servers). If there's a problem with the power or networking in the domain (the rack), all the servers in it become unavailable. So it's apparent that you want to make sure your servers are spread across more than one fault domain.
An update domain is a group of physical hardware that can undergo maintenance and be rebooted at the same time. What this means is that Azure is in control of updating and rebooting the hardware underlying the virtual machines, since it always ensures that the host machines are up to date, secured, and have all the hotfixes they need. As part of this maintenance process, a host machine sometimes has to be rebooted, meaning that all the virtual machines running on it go down for some time.
It is also essential to understand that this whole maintenance process is done by Azure at its discretion, meaning that we don't have any control over when and where it will happen. So, if all your servers are in the same update domain, they'll reboot at the same time during maintenance. This is the main reason you want to make sure your servers are spread across more than one update domain.
I want to add a necessary clarification: an update domain is not identical to a fault domain. A fault domain is physical hardware (a rack), while an update domain is logical. So the hosts in an update domain can be spread across multiple fault domains but are still part of the same logical group.
An availability set is a collection of fault domains and update domains that your virtual machines will be spread across. A single availability set can contain up to 3 fault domains and 20 update domains. The main thing to remember when using availability sets is that all domains (fault and update) reside in the same datacenter (AKA: zone).
Using an availability set, we can ensure that the Azure virtual machines are deployed across multiple isolated hardware clusters. This is crucial because, in case of a hardware or software failure within Azure, only a subset of your VMs is impacted, so your overall solution stays safe and available for use.
Also, the availability set is free! You pay only for the additional VMs. So make sure that you deploy identical VMs into the same availability set to ensure they won't all go down simultaneously when a single fault domain fails or an update domain is rebooted for maintenance.
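To make the effect of spreading VMs concrete, here is a small simulation. It is only a sketch: it assumes a simple round-robin assignment of VMs to fault and update domains, and the function and VM names are hypothetical. It does not describe how Azure places VMs internally.

```python
def place_vms(vm_names, fault_domains=2, update_domains=5):
    """Assign each VM a (fault domain, update domain) pair in round-robin order.

    Round-robin is an assumption made for illustration only.
    """
    placement = {}
    for index, name in enumerate(vm_names):
        placement[name] = (index % fault_domains, index % update_domains)
    return placement


def survivors(placement, failed_fault_domain=None, rebooting_update_domain=None):
    """Return the VMs still running when one fault domain fails
    and/or one update domain is rebooted."""
    return sorted(
        name
        for name, (fault_domain, update_domain) in placement.items()
        if fault_domain != failed_fault_domain
        and update_domain != rebooting_update_domain
    )


if __name__ == "__main__":
    placement = place_vms(["web-0", "web-1", "web-2", "web-3"])
    # Losing fault domain 0 (for example, a rack power failure) keeps two VMs up.
    print(survivors(placement, failed_fault_domain=0))      # ['web-1', 'web-3']
    # Rebooting update domain 1 for maintenance takes down only one VM.
    print(survivors(placement, rebooting_update_domain=1))  # ['web-0', 'web-2', 'web-3']
```

In both cases part of the workload keeps running, which is exactly the guarantee an availability set is meant to provide.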
Here is a basic example of how you can create a new availability set:
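As a hedged sketch, the settings of an availability set can be written down as a resource definition. The Python below only builds and validates the JSON body; it does not call Azure. The property names (`platformFaultDomainCount`, `platformUpdateDomainCount`, and the `Aligned` SKU for managed disks) follow the Azure Resource Manager schema as I understand it, while the function name, the default counts, and the region are illustrative assumptions.

```python
import json

MAX_FAULT_DOMAINS = 3    # limit quoted earlier in this post
MAX_UPDATE_DOMAINS = 20  # limit quoted earlier in this post


def availability_set_body(location, fault_domains=2, update_domains=5):
    """Build the JSON body describing an availability set (hypothetical helper)."""
    if not 1 <= fault_domains <= MAX_FAULT_DOMAINS:
        raise ValueError(f"fault_domains must be between 1 and {MAX_FAULT_DOMAINS}")
    if not 1 <= update_domains <= MAX_UPDATE_DOMAINS:
        raise ValueError(f"update_domains must be between 1 and {MAX_UPDATE_DOMAINS}")
    return {
        "location": location,
        # "Aligned" is the SKU used with managed disks (assumption, see lead-in).
        "sku": {"name": "Aligned"},
        "properties": {
            "platformFaultDomainCount": fault_domains,
            "platformUpdateDomainCount": update_domains,
        },
    }


if __name__ == "__main__":
    # "westeurope" is just a placeholder region.
    print(json.dumps(availability_set_body("westeurope"), indent=2))
```

A body like this could then be submitted through whichever tool you prefer (portal, CLI, SDK, or template); the validation step simply catches counts outside the limits before anything is sent.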
Here is an architecture design of the availability set:
An availability zone is a physically separate datacenter within an Azure region. In that case, each datacenter functions as a fault and update domain. Because your machines are spread across different physical datacenters, your VMs remain available even if one datacenter is completely down (unlike an availability set, where all VMs reside in the same datacenter).