The more we rely on computers and hosted services, the more exposed we are to outages, even as providers do their best to offer stable 24/7 services. This article describes some of the challenges that we, as service providers, all face in keeping our services constantly available.
Note that outages refer to UNPLANNED downtime and exclude services that are unavailable due to maintenance and other types of planned interventions.
Sampling across major outages in recent years
Since 2020, we have seen multiple outages in the IT world. Some affected more users than others, but all of them had a significant impact one way or another.
Here is a sample of outages impacting some of the major players in the IT industry:
- Google: Google has had multiple outages in the past few years. Three of them in 2020 alone lasted an average of 4 hours and affected most of its services, such as Google Docs, Gmail, Google Drive, and YouTube. More recently, Google's services were impacted during an update (August 2022).
- Rogers Canada: In July 2022, Canadian telecom Rogers Communications experienced a service outage affecting more than 12 million users. Everyone from cable internet to cell users was impacted, as were commercial services such as Interac payments and some federal government services. Even emergency services (911) were impacted. Services were partially recovered after 24 hours and took up to 4 days to be fully restored.
- Facebook: On October 4, 2021, Facebook and its subsidiaries (Messenger, Instagram, WhatsApp, etc.) were unavailable for about 7 hours.
- Amazon: Four major outages have affected AWS since 2020. The last one, in December 2021, included potential data loss. However, due to Amazon's massive infrastructure, most outages are limited to specific regions and are resolved in about an hour.
- Azure (Microsoft): In 2021, Microsoft Azure suffered two major outages affecting users worldwide for up to 16 hours.
- FAA: On January 11, 2023, more than 11,000 flights were delayed and at least 1,300 were cancelled after the NOTAM system went offline a day earlier.
The outages mentioned above affected organizations with massive resources that are recognized as leaders in the IT industry. "Smaller" players are also affected, for example:
- Roblox: Roblox is a gaming platform where people can create their own games and play games created by other users. It has millions of active users. Roblox suffered a 3-day outage in October 2021 after one of its special events ended up being "too popular" and the platform could not handle the demand.
- KY hospital: At the end of January 2022, Taylor Regional Hospital in Kentucky posted an urgent notice stating that its entire system, including phone lines, was down. The outage is believed to be the result of ransomware. It took close to 10 weeks for the outage to be resolved.
- SOLABS: In April 2021, a power issue caused an outage of our SOLABS QM10 solution for over 5,000 users. Our network went back and forth between offline and online at regular intervals. Our services were fully restored over a 5-day window, with our staff working day and night during that period.
The list above is only a small sample, and at the same time a tribute to the men and women who work behind the scenes day and night to keep our IT infrastructure available at all times.
COVID-19 and the switch to "remote work" brought challenges all over the globe for IT teams, who had to support a whole new way of working.
Our work is important.
Google outages: https://en.wikipedia.org/wiki/Google_services_outages
Facebook outage: https://en.wikipedia.org/wiki/2021_Facebook_outage
Healthcare (hospital outage): https://healthitsecurity.com/news/ky-hospital-systems-still-down-1-week-after-cybersecurity-incident and https://www.scmagazine.com/analysis/ransomware/amid-recovery-kentucky-hospital-details-cyberattack-discovered-in-january
FAA outage: https://www.faa.gov/newsroom/faa-notam-statement