Dec 1, 2023, Infrastructure

OVH Fire – Activate Your Disaster Recovery Plan

Jacek Bochenek, Cloud and Security Team Leader – CISSP, CISM, CCSP
“Anything that can go wrong, will go wrong”

The epigram of Murphy's law quoted above is one of the most popular quotes circulating around the Internet. Most people, however, either don't understand it or don't take it seriously. Yes, there is this Murphy, and yes, things can go wrong, but surely it won't happen to me.

Unfortunately, things do go wrong, even in places designed with redundancy and resilience in mind. At 00:47 on Wednesday, March 10, 2021, things went terribly wrong at the OVH data center site in Strasbourg.

OVH Fire – a case of total data loss

In the early morning, the founder of OVH posted a note on Twitter.

In this incident all servers and data in SBG2 were completely destroyed, along with the servers and data in 4 out of 8 rooms in SBG1. Buildings SBG3 and SBG4 were not affected, but were shut down, at least until March 15 (SBG1 and SBG4) and March 19 (SBG3).

Due to this incident, many websites are down, and many critical internal IT systems are affected as well. Unfortunately, a number of companies will either go out of business or lose market share as a result.

Disaster Recovery Plan – who cares?

In his first post on Twitter, the founder of OVH said: “We recommend to activate Disaster Recovery Plan”. So what is this plan? Simply put, it is the technical complement to the Business Continuity Plan, enabling the restoration of services as quickly as possible. It is the sole responsibility of a company to prepare and test such a plan. That's why, even though there were a lot of critical responses below Octave Klaba's post, with some people implying that it is OVH's responsibility to provide fully recoverable services and to bring those systems up and restore the data, that is not true.

What happens if my data is lost forever?

Before we go any further, some facts about companies that lose their data:

  • 94% of companies suffering from a catastrophic data loss do not survive – 43% never reopen and 51% close within two years. (University of Texas)

  • 7 out of 10 small firms that experience a major data loss go out of business within a year. (DTI/Price Waterhouse Coopers)

  • 93% of companies that lost their data center for 10 days or more due to a disaster filed for bankruptcy within one year of the disaster.

How do we prepare for the worst case scenario?

What should a company do to protect its most valuable asset – data? First of all, it should do a thorough analysis of its assets: you can't protect what you don't know about. Secondly, it should do a comprehensive analysis of its business operations and processes to prepare a list of its critical systems and the threats posed to those resources, through the BIA – Business Impact Analysis – process. Within this process we should define a few important metrics:

  • MTD – Maximum Tolerable Downtime – the maximum length of time a business function can be inoperable without causing irreparable harm.

  • RTO – Recovery Time Objective – the time within which a system must be recovered after a disruption. The goal is to keep the RTO lower than the MTD.

  • RPO – Recovery Point Objective – the maximum amount of data, measured in time, that we can afford to lose in a major disaster.
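As a rough sketch, the relationship between these metrics can be expressed in a few lines of Python. The MTD, RTO and RPO values below are hypothetical examples for an imaginary system, not recommendations:

```python
from datetime import datetime, timedelta

# Illustrative BIA metrics for a hypothetical system.
# These values are assumptions for the example, not recommendations.
MTD = timedelta(hours=24)   # Maximum Tolerable Downtime
RTO = timedelta(hours=4)    # Recovery Time Objective
RPO = timedelta(hours=1)    # Recovery Point Objective

# A plan whose RTO meets or exceeds the MTD cannot satisfy the business need.
assert RTO < MTD

def rpo_violated(last_backup: datetime, now: datetime) -> bool:
    """If disaster struck right now, would we lose more data than the RPO allows?"""
    return now - last_backup > RPO

now = datetime(2021, 3, 10, 0, 47)        # time of the SBG2 fire
fresh = now - timedelta(minutes=30)       # backup taken 30 minutes ago
stale = now - timedelta(hours=3)          # backup taken 3 hours ago
print(rpo_violated(fresh, now))           # False: within the 1-hour RPO
print(rpo_violated(stale, now))           # True: up to 3 hours of data lost
```

The point of the sketch is that the metrics constrain each other: the backup (or replication) schedule must satisfy the RPO, and the recovery procedure must fit inside the RTO, which in turn must fit inside the MTD.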

Having each business function clearly defined, we can proceed to prepare detailed Disaster Recovery Plans. The way we protect each system will depend mainly on the metrics described above. In some cases a system is so critical that we'll need an active-active solution, with the same system running in two different locations at the same time. In contrast, some systems may only need to be restored within, say, two days, and a backup at a remote location or in the cloud will be sufficient. The most important thing is to have such a plan in place and to test it regularly – whether by reading through the procedures, running a tabletop exercise, or performing a full simulation test. The worst thing that can happen, apart from losing all your data, is to have a recovery plan that you can't execute.

Cloud to the rescue

“This sounds complicated”

So, what can we do to protect our systems? That, of course, will depend on how critical a given system is to our organization. One thing we should keep in mind: the more stringent the availability requirements, the more complicated the procedures and implementation will be. So, what can we do to protect our systems more efficiently and provide more secure and robust services to our customers?

One of the simplest things we can do is to at least send a copy of our backups to an offsite location. To do this, we can easily use the services of major cloud providers. Secondly, we should look at the services offered by the major public cloud providers – AWS, Azure, Google, Oracle. They've designed their services and infrastructure with reliability and security in mind. This is what sets these companies apart from their competitors.
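A minimal sketch of that first step might look like the following. Here a local directory stands in for the offsite target; in a real setup the destination would be a cloud object-storage bucket written via the provider's SDK or CLI, and the file names are purely illustrative:

```python
import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    """Return the SHA-256 digest of a file, used to verify the copy."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def replicate_backup(backup: Path, offsite_dir: Path) -> Path:
    """Copy a backup file to the offsite target and verify its integrity."""
    offsite_dir.mkdir(parents=True, exist_ok=True)
    copy = offsite_dir / backup.name
    shutil.copy2(backup, copy)
    if sha256(copy) != sha256(backup):   # detect a corrupted transfer
        raise IOError(f"checksum mismatch for {copy}")
    return copy

# Hypothetical backup file for the example.
backup = Path("db-backup.tar.gz")
backup.write_bytes(b"pretend this is a database dump")
copy = replicate_backup(backup, Path("offsite"))
print(copy.exists())   # True
```

Verifying the checksum after the transfer matters: an offsite copy that silently arrived corrupted is exactly the kind of "backup plan you can't execute" described above.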

“Regions, Availability Zones, Data Centers”

The big players design their infrastructure around the concepts of Regions, Availability Zones and Data Centers. Regions are isolated and independent of one another, and each cloud provider usually has many regions around the world. Each region consists of two or more Availability Zones, and all Availability Zones within a region are interconnected. Each Availability Zone, in turn, consists of one or more Data Centers, depending on the cloud provider.

Even though we already have a very reliable system at the Availability Zone level, we can easily deploy multi-AZ applications, or go even further and deploy them in multiple regions. We can even utilize a multi-cloud architecture for our mission-critical systems.
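The resilience gain from a multi-AZ deployment can be illustrated with a toy model, assuming a service stays up as long as at least one Availability Zone still hosts a healthy replica. The zone and instance names below are made up for the example:

```python
# Hypothetical replicas of one application spread across three
# Availability Zones in a single region (names are illustrative).
replicas = {
    "eu-west-1a": ["app-1", "app-2"],
    "eu-west-1b": ["app-3"],
    "eu-west-1c": ["app-4"],
}

def service_available(replicas: dict, failed_azs: set) -> bool:
    """True if at least one replica survives outside the failed zones."""
    return any(instances for az, instances in replicas.items()
               if az not in failed_azs)

# Losing one zone (as in a single-site fire) leaves the service up.
print(service_available(replicas, {"eu-west-1a"}))   # True
# Losing every zone at once takes it down.
print(service_available(replicas, set(replicas)))    # False
```

The SBG incident is the real-world version of the first case going wrong: all of a customer's "replicas" were in one physical location, so a single fire behaved like losing every zone at once.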

This infrastructure, together with the services offered on top of it, provides great flexibility with which we can easily design and deploy highly available applications and services.

I hope this gives a high-level overview of Disaster Recovery practices and of the possibilities the public cloud offers for building robust, reliable and highly available applications. Please feel free to contact us if you have any questions or would like to learn more about how our team can help you with Business Continuity Plans or with the migration of your infrastructure to the public cloud.

In the end, we should all try to be prepared, and I hope that nobody will ever have to write a message like this one, posted after the disaster at OVH (an actual quote from a post on Twitter):

I can’t find a Disaster Recovery Plan in OVH panel. Can you guide me?

[Update 2023] The aftermath of OVH Fire

In the years following the devastating fire at OVHcloud’s SBG2 data center in Strasbourg, the aftermath has unfolded with significant legal, financial, and reputational implications. Over a hundred companies joined a class action lawsuit, claiming over €9 million in damages, while four larger companies pursued individual lawsuits. OVHcloud’s initial compensation offer of €900 was deemed insufficient by the affected parties, leading to a push for an amicable settlement to avoid a commercial court trial. OVHcloud defended its position by stating the event was unforeseeable and that reasonable precautions were taken.

The financial toll on OVHcloud was substantial, with the fire potentially costing over €100 million, a significant figure compared to its annual turnover. Insurance coverage alleviated some of the financial burden, but the reputational damage and slowed growth were harder to quantify. The company’s IPO filings revealed a revenue shortfall and substantial costs for server replacements and other associated expenses.

This incident underscored the inherent risks in data center operations and the importance of robust disaster recovery plans. OVHcloud’s approach to using existing industrial buildings for data centers and its proprietary technology came under scrutiny, emphasizing the need for stringent safety and compliance measures. 

As the dust settled, the OVH fire became a case study in crisis management, highlighting the critical need for transparency, preparedness, and a solid understanding of the risks in cloud computing and data storage services.