Microsoft Global Outage: Cybersecurity flaw causes worldwide IT meltdown


A global IT meltdown caused by a CrowdStrike security upgrade left major airlines grounded, banks offline, and medical practices incapacitated. Swapnil Mishra from APAC News Network delves into the root causes of this widespread disruption and discusses critical strategies organizations can adopt to prevent such crises in the future.

New Delhi: The world woke up to an unusual morning this Friday, as people starting their workday found they could not log in to their systems. A faulty update from CrowdStrike, a cybersecurity company that serves a wide range of industries, caused outages in several regions, halting news broadcasts and forcing flight cancellations.

The Federal Aviation Administration has reported that all flights were suspended by at least three major U.S. airlines: American, United, and Delta. A global outage has resulted in the sudden shutdown of Windows computers with a blue screen of death, causing television channels, airports, and institutions to go offline.

Around the world, IT outages have forced businesses and organizations offline. GP practices in the UK have reported that they are unable to schedule appointments or access patient details. After a brief interruption, Sky News returned to its programming, and the largest train company in Britain alerted commuters to potential disruptions due to “widespread IT issues.”

Banks, supermarkets, and other large organizations worldwide reported computer issues that affected their operations. Some airlines issued delay alerts, while others grounded planes. This is what we currently know:

What caused the widespread Windows outage:

CrowdStrike released a flawed security update that led to the outage. The company was “actively working with customers impacted by a defect found in a single content update for Windows hosts,” according to a statement sent by CEO George Kurtz. “This is not a security incident or cyberattack,” he continued. The problem has been identified and isolated, and a fix has been deployed.

The problem seemed to stem from an update to CrowdStrike’s Falcon Sensor software, according to independent cybersecurity expert and consultant Lukasz Olejnik. A corrected update has been pushed to computers, but according to Mr. Olejnik, outages will likely continue because it is unclear how to repair the enormous number of machines that have already been affected.

Recovery is slow largely because of the currently recommended fix, which involves manually restarting each machine into safe mode, deleting a specific file, and then restarting the computer normally.

Although the operation is fairly straightforward, security experts said it cannot be automated on a large scale.

IT problems were reported by businesses worldwide, including banks, telecom companies, TV and radio broadcasters, and supermarkets. US carriers, including American Airlines, Delta, and United Airlines, were also affected and halted flights. Airports in Spain and Germany were also reporting problems.

Indian Computer Emergency Response Team (CERT-In) has said the following method can be used as a workaround:

– Boot Windows into Safe Mode or the Windows Recovery Environment.

– Navigate to the C:\Windows\System32\drivers\CrowdStrike directory.

– Locate the file matching “C-00000291*.sys” and delete it.

– Boot the host normally.
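
The file-matching step in this workaround can be illustrated with a short Python sketch. It is only an illustration of how the wildcard pattern selects files, not a recovery tool: the function names and the `dry_run` flag are hypothetical, the directory is passed in by the caller, and the workaround itself is carried out by hand from Safe Mode or the Windows Recovery Environment, where a Python interpreter is normally not available.

```python
from fnmatch import fnmatchcase
from pathlib import Path

# Wildcard pattern quoted in the CERT-In workaround.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def find_matching_files(directory, pattern=CHANNEL_FILE_PATTERN):
    """Return files in `directory` whose names match `pattern`, ignoring case."""
    wanted = pattern.lower()
    return sorted(
        entry
        for entry in Path(directory).iterdir()
        if entry.is_file() and fnmatchcase(entry.name.lower(), wanted)
    )


def remove_matching_files(directory, pattern=CHANNEL_FILE_PATTERN, dry_run=True):
    """Return the matching files; delete them only when dry_run is False."""
    matches = find_matching_files(directory, pattern)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `remove_matching_files(some_directory)` with the default `dry_run=True` only lists what would be deleted, which makes it easy to confirm that neighbouring files are left untouched.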

The issue has subsided but is not completely resolved:

Microsoft said the preliminary root cause was a “configuration change” in a portion of its Azure backend workloads. The change caused an interruption between storage and compute resources, which resulted in connectivity failures that affected downstream Microsoft 365 services dependent on these connections, the company said.

“Our services are still seeing continuous improvements while we continue to take mitigation actions,” Microsoft said in a post on X.

Ashwini Vaishnaw, the Union Minister of Information and Broadcasting, has disclosed that the Ministry of Electronics and Information Technology (MEITY) is collaborating with Microsoft and its partners to address a pervasive Windows 10 outage. This issue, which can be attributed to a recent update of CrowdStrike’s Falcon sensor, has resulted in a significant number of PCs experiencing a halt at the recovery screen. The issue has caused operational disruptions at government offices, banks, airports, and corporations worldwide.

The chief executive officer of CrowdStrike Holdings has also announced on the social media platform X that the company has identified the update that caused a global Windows system malfunction and that a “fix has been deployed.”

Commenting on the issue, Kumar Ritesh, CEO & Founder, CYFIRMA, said, “The massive outage in Microsoft systems caused by CrowdStrike updates was due to a compatibility issue between CrowdStrike’s Falcon sensor and a Windows update. When the CrowdStrike sensor, a critical endpoint protection agent, was updated, it conflicted with changes introduced in the latest Windows update. Such incidents underscore the importance of rigorous compatibility testing between security solutions and operating system updates to prevent widespread disruptions.”

He added, “We would always encourage organizations to implement monitoring solutions that detect anomalies, performance issues, or unexpected behavior.”

Omer Grossman, Chief Information Officer (CIO) at CyberArk commented on the issue saying that, “The current event appears – even in July – that it will be one of the most significant cyber issues of 2024. The damage to business processes at the global level is dramatic. The glitch is due to a software update of CrowdStrike’s EDR product. This is a product that runs with high privileges that protects endpoints. A malfunction in this can, as we are seeing in the current incident, cause the operating system to crash.”

He explained further saying, “There are two main issues on the agenda: The first is how customers get back online and regain continuity of business processes. It turns out that because the endpoints have crashed – the Blue Screen of Death – they cannot be updated remotely and this problem must be solved manually, endpoint by endpoint. This is expected to be a process that will take days.

The second is around what caused the malfunction. The range of possibilities ranges from human error – for instance a developer who downloaded an update without sufficient quality control – to the complex and intriguing scenario of a deep cyberattack, prepared ahead of time and involving an attacker activating a “doomsday command” or “kill switch”. CrowdStrike’s analysis and updates in the coming days will be of the utmost interest.”

Commenting on the issue, Jake Moore, Global Security Advisor at ESET said that, “These outages are increasing in volume due to the sheer increase in the number of online users and traffic. After witnessing the blue screen of death (BSOD), many people are quick to suspect a cyberattack or find similarities to Netflix’s Leave The World Behind but this can often add to the confusion. It highlights the importance of these services and the millions of people they serve.”

He added, “The inconvenience caused by the loss of access to services for thousands of people serves as a reminder of our dependence on Big Tech such as Microsoft in running our daily lives and businesses. Upgrades and maintenance to systems and networks can unintentionally include small errors, which can have wide-reaching consequences as experienced today by CrowdStrike’s customers.”

Securing the future of cybersecurity:

The recent global IT meltdown underscores the need for robust cybersecurity measures. To avoid such disruptions in the future, organizations can implement several strategies:

– Thorough Testing in a Controlled Environment: Before deploying updates, organizations should create a testing environment that mirrors their production systems. This helps identify compatibility issues or unexpected behavior.
– Gradual Rollout of Updates: Deploy updates incrementally, monitoring a subset of systems for adverse effects before a wider rollout.
– Regular Backups and Reliable Restore Points: Maintain and test regular backups of critical systems to quickly restore functionality if updates cause problems.
– Use of Patch Management Tools: Automate update deployment with tools that allow scheduling, tracking, and rollback of changes if needed.
– Diverse IT Infrastructure: Reduce reliance on a single type of infrastructure to mitigate the impact of widespread technical incidents.
– Robust Cyber-Resilience Plans: Implement and regularly update cyber-resilience plans that include multiple fail-safes to ensure business continuity.
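
The gradual rollout idea in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how any vendor or patch management product works: `deploy` and `is_healthy` are hypothetical callables supplied by the caller, and the stage fractions are arbitrary example values.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
) -> dict:
    """Deploy to growing fractions of `hosts`, halting at the first unhealthy stage."""
    updated = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated once this stage ends.
        target = max(1, round(len(hosts) * fraction))
        batch = list(hosts[len(updated):target])
        for host in batch:
            deploy(host)
        updated.extend(batch)
        # Check the hosts just updated before widening the rollout.
        failed = [host for host in batch if not is_healthy(host)]
        if failed:
            return {"completed": False, "updated": updated, "failed": failed}
    return {"completed": True, "updated": updated, "failed": []}
```

With 100 hosts and the default fractions, a fault that shows up in the second stage stops the rollout after 10 machines rather than all 100, which is the point of monitoring a subset first.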

Jake Moore also suggested a few strategies that organizations can implement.

“Businesses must test their infrastructure and have multiple fail safes in place, however large the company is, this is typically referred to as a cyber-resilience plan. But as often is the case, it is simply impossible to simulate the size and magnitude of the issue in a safe environment without testing the actual network.

Another aspect of this incident relates to “diversity” in the use of large-scale IT infrastructure. This applies to critical systems like operating systems (OSes), cybersecurity products, and other globally deployed (scaled) applications. Where diversity is low, a single technical incident, not to mention a security issue, can lead to global-scale outages with subsequent knock-on effects.”

On the measures organizations can take to avoid such issues in the future, Ritesh said, “There are measures that can be put in place to avoid such disruptions.

– Before deploying any security update or software patch, create a testing environment that mirrors production systems.
– Test the update thoroughly in this environment to identify any compatibility issues or unexpected behavior.
– Avoid deploying updates across all systems simultaneously. Instead, roll them out gradually to a subset of machines.
– Monitor these systems closely for any adverse effects. If everything looks good, proceed with a wider rollout.
– Regularly back up critical systems so that in case an update causes problems like the current situation with CrowdStrike updates, you can restore the system to a previous state.
– Ensure backups are tested and reliable.
– Use patch management tools to automate the deployment of updates. These tools allow you to schedule updates, track their status, and roll back changes if needed.”
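
The backup-and-rollback advice can be shown in miniature with a small Python sketch. It is a toy model under loud assumptions: the "system" is just a dictionary, and `update` and `verify` are hypothetical callables. Real patch management tools operate on disk images, packages, or restore points rather than in-memory data.

```python
import copy


def apply_update_with_rollback(state, update, verify):
    """Apply `update` to `state` in place; restore the backup if it fails.

    Returns True when the update was applied and verified, False when the
    state was rolled back to its previous contents.
    """
    backup = copy.deepcopy(state)  # take the backup before touching anything
    try:
        update(state)
        if not verify(state):
            raise RuntimeError("post-update verification failed")
    except Exception:
        # Roll back: discard partial changes and restore the saved copy.
        state.clear()
        state.update(backup)
        return False
    return True
```

The design choice worth noting is that the backup is taken before the update and the verification step runs before the change is accepted, so a bad update never becomes the new baseline.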

Conclusion: The recent global IT outage underscores the critical importance of rigorous testing, strategic update deployment, and robust cyber-resilience planning. To prevent such incidents in the future, organizations can implement several strategies including thorough testing in environments mirroring production systems. Enterprises can also implement a gradual rollout of updates, regular backups, and the use of patch management tools. Diversifying IT infrastructure and having robust cyber-resilience plans with multiple fail-safes are also crucial measures to ensure business continuity during disruptions.

Swapnil Mishra, APAC News Network
