We should all have recovery plans to be used in the event of a disaster, we should also have plans that allow us to test these processes, we may even have the opportunity to test them occasionally.

Do we all have plans that can be used for a pre-emptive controlled failover i.e. where everything is up and working at the primary site but the risk to service is so high that the decision is taken to switch services to the secondary site, in a controlled way and at a time of our choosing rather than risk waiting for an issue that will cause a service outage at the most inconvenient time, even if there is a chance that risk will never be realized, e.g. when a datacentre is under threat from flooding the decision is made to switch services from that datacentre even though the point has not been reached that the flooding is certain.

Depending upon your configuration there are number of activities that can reduce the service disruption and potential data loss during a controlled failover of services. Some of these activities require preparation well in advance and need to be designed in, whereas others need only be built into the processes used to switch services between sites.

Design related

The following should be considered during the design stage and be includes as part of the overall SQL Server AlwaysOn Cluster design.

The File Share Witness

If you are running with a File Share Witness (FSW), I suspect you will have a single FSW located on the primary site. To prepare for a planned failover it is prudent to have a second FSW available that is located at the secondary site ready to be used in the event of a controlled failover.

Why the second FSW, there is a high probability that should your services be running on the secondary site when access is lost the FSW located at the primary site a service interruption will occur that requires intervention. I.e. the cluster will crash and will need to be forced up without a quorum, while running services on the secondary site also running the FSW on the secondary site alleviates the risk to the cluster if or when access is lost to the primary site.

Switchover process related

The following should be considered and included in the controlled failover processes, each section is dependent on the overall design and may not be relevant to all designs.

Change the SQL Availability Mode

This activity is only relevant for environments using the Asynchronous Commit Availability Mode.

Changing the Availability Mode from Asynchronous Commit to Synchronous Commit means that switchover between sites can occur without data loss and that the switch over less disruptive as it won’t need to be forced.

Prior to switching between sites for each Availability Group change the SQL Availability Group commit mode from Asynchronous Commit to Synchronous Commit

Monitor the environment and wait until all databases have a Failover Readiness of No Data Loss before progressing with the switch of Availability Groups from the primary site to the secondary site.

This should have no effect of SQL Database availability

Prepare Cluster to run on the secondary site.

Reconfigure the cluster so that should the primary site become un-available it continues to run without issues.

Note, in the short term the risk is reversed, should the primary site loose contact with the FSW on the secondary site before services have been switched it could cause cluster issues for the cluster at the primary site.

For each Cluster change the FSW so that it is now located on the secondary site rather than the primary site.

For clusters that don’t have equal votes for all members change the nodes on the secondary site to have an equal vote.

This should have no effect of SQL Database availability

Move Cluster to secondary site.

In preparation for the loss of the primary site migrate the core cluster resources from the primary site to the secondary site, not SQL Server related services.

This should have no effect of SQL Database availability

About the author