[CPWG] ICANN position on the Facebook outage

Raymond Mamattah mamattah.raymond at gmail.com
Sat Oct 9 19:56:37 UTC 2021


Root Cause Analysis

Summary of issue:

This incident, on October 4, 2021, impacted Facebook’s backbone network.
This resulted in disruption across all Facebook systems and products
globally, including Workplace from Facebook.


This incident was an internal issue and there were no malicious third
parties or bad actors involved in causing the incident. Our investigation
shows no impact to user data confidentiality or integrity.


The underlying cause of the outage also impacted many internal systems,
making it harder to diagnose and resolve the issue quickly.

Cause of issue:

This outage was triggered by the system that manages our global backbone
network capacity. The backbone is the network Facebook has built to connect
all our computing facilities together, which consists of tens of thousands
of miles of fiber-optic cables crossing the globe and linking all our data
centers.


During a routine maintenance job, a command was issued with the intention
to assess the availability of global backbone capacity, which
unintentionally took down all the connections in our backbone network,
effectively disconnecting Facebook data centers globally. Our systems are
designed to audit commands like these to prevent mistakes like this, but a
bug in that audit tool prevented it from properly stopping the command.
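
To make the idea of a command audit concrete, here is a minimal Python sketch. It is an illustration only and not Facebook's actual tool: the `Command` type, the link names, and the `min_links_up` threshold are all hypothetical. The check refuses any command that would leave fewer than a minimum number of backbone links up.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Command:
    """Hypothetical maintenance command and the links it would take down."""
    name: str
    links_to_disable: FrozenSet[str]


def audit(command: Command, all_links: FrozenSet[str], min_links_up: int = 1) -> bool:
    """Return True if the command may run, False if it must be blocked.

    Assumed policy: at least `min_links_up` backbone links must stay up.
    """
    remaining = all_links - command.links_to_disable
    return len(remaining) >= min_links_up


# Hypothetical three-link backbone.
links = frozenset({"a-b", "b-c", "c-a"})
safe = Command("drain-one-link", frozenset({"a-b"}))
unsafe = Command("drain-everything", links)

print(audit(safe, links))    # True: two links remain
print(audit(unsafe, links))  # False: no links would remain
```

A bug in a check of this kind, for example one that computes `remaining` incorrectly, would let the unsafe command through, which is the type of failure described above.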


This change completely severed the connections between our data centers
and the internet. That total loss of connection then caused a second issue
that made things worse.


One of the jobs performed by our smaller facilities is to respond to DNS
queries. Those queries are answered by our authoritative name servers that
occupy well-known IP addresses themselves, which in turn are advertised to
the rest of the internet via another protocol called the Border Gateway
Protocol (BGP).


To ensure reliable operation, our DNS servers disable those BGP
advertisements if they themselves cannot speak to our data centers, since
this is an indication of an unhealthy network connection. In the recent
outage the entire backbone was removed from operation, making these
locations declare themselves unhealthy and withdraw those BGP
advertisements. The end result was that our DNS servers became unreachable
even though they were still operational. This made it impossible for the
rest of the internet to find our servers.
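
The health-check behaviour described above can be sketched in Python as follows. This is an assumption-laden illustration, not Facebook's implementation: the `DnsNode` class, the `can_reach` probe, the data center names, and the rule "withdraw only when no data center is reachable" are all hypothetical, and the prefix is taken from the 192.0.2.0/24 documentation range.

```python
from typing import Callable, Iterable


def should_advertise(data_centers: Iterable[str],
                     can_reach: Callable[[str], bool]) -> bool:
    """Assumed policy: keep advertising while at least one data center answers."""
    return any(can_reach(dc) for dc in data_centers)


class DnsNode:
    """Hypothetical edge DNS node that announces one anycast prefix via BGP."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self.advertised = True

    def reconcile(self, data_centers: Iterable[str],
                  can_reach: Callable[[str], bool]) -> bool:
        # Withdraw the route when the node considers itself unhealthy,
        # announce it again when connectivity returns.
        self.advertised = should_advertise(data_centers, can_reach)
        return self.advertised


node = DnsNode("192.0.2.0/24")
dcs = ["dc-east", "dc-west"]

# Normal operation: one data center is still reachable, so keep advertising.
print(node.reconcile(dcs, lambda dc: dc == "dc-east"))  # True

# Backbone down: nothing is reachable, so the route is withdrawn even though
# the DNS server process itself is still running.
print(node.reconcile(dcs, lambda dc: False))  # False
```

When every node applies the same rule at the same moment, all routes to the name servers are withdrawn together, which matches the outcome described above.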

Workplace timeline:

This incident related to a network outage that was experienced globally
across Facebook services and included Workplace. The outage was live for
around 6 hours, from approximately 16:40 - 23:30 BST.

Steps to mitigate:

The nature of the outage meant it was not possible to access our data
centers through our normal means because the networks were down, and the
total loss of DNS broke many of the internal tools we’d normally use to
investigate and resolve outages like this.


Our primary and out-of-band network access was down, so we sent engineers
onsite to the data centers to have them debug the issue and restart the
systems. But this took time, because these facilities are designed with
high levels of physical and system security in mind. They’re hard to get
into, and once you’re inside, the hardware and routers are designed to be
difficult to modify even when you have physical access to them. So it took
extra time to activate the secure access protocols needed to get people
onsite and able to work on the servers. Only then could we confirm the
issue and bring our backbone back online.


Once our backbone network connectivity was restored across our data center
regions, everything came back up with it. But the problem was not over — we
knew that flipping our services back on all at once could potentially cause
a new round of crashes due to a surge in traffic. Individual data centers
were reporting dips in power usage in the range of tens of megawatts, and
suddenly reversing such a dip in power consumption could put everything
from electrical systems to caches at risk.
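
One common way to avoid such a surge is to readmit traffic in stages rather than all at once. The report does not say how the ramp-up was actually done, so the Python sketch below is purely an assumption: the `ramp_schedule` helper and the choice of equal-sized steps are hypothetical.

```python
from typing import List


def ramp_schedule(steps: int) -> List[float]:
    """Return the fraction of traffic to admit at each stage, ending at 1.0."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [round(i / steps, 4) for i in range(1, steps + 1)]


# Four equal stages: admit a quarter of the traffic, then half, and so on.
print(ramp_schedule(4))  # [0.25, 0.5, 0.75, 1.0]
```

In practice each stage would be held until load, power draw, and cache hit rates look stable before moving to the next one.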


In the end, our services came back up relatively quickly without any
further systemwide failures.

Prevention of recurrence:

We’ve done extensive work hardening our systems to prevent unauthorized
access, and ultimately it was this hardening that slowed us down as we
tried to recover from an outage caused not by malicious activity, but by an
error of our own making. We believe that a tradeoff like this is
worth it — greatly increased day-to-day security vs. a slower recovery from
a rare event like this.


However, we’ll also be looking for ways to simulate events like this moving
forward to ensure better preparedness, and we will take every measure to
strengthen our testing, drills, and overall resilience so that events like
this happen as rarely as possible.

Regards,

Raymond Mamattah
Accra, Ghana



On Sat, Oct 9, 2021, 7:48 PM sivasubramanian muthusamy via CPWG <
cpwg at icann.org> wrote:

> An organisation such as Facebook most likely has a thoroughly designed
> redundancy scheme, so why did it fail? Isn't this a Security and Stability
> issue for ICANN to examine? How is this NOT an ICANN issue?
>
> Sivasubramanian M
>
> On Sat, Oct 9, 2021, 21:45 Jonathan Zuck via CPWG <cpwg at icann.org> wrote:
>
>> Sounds like consensus
>>
>> Jonathan Zuck
>> Executive Director
>> Innovators Network Foundation
>> www.InnovatorsNetwork.org
>> Main: +1 (202) 827-7594
>> Direct: +1 (202) 420-7483
>> ------------------------------
>> *From:* CPWG <cpwg-bounces at icann.org> on behalf of h.raiche--- via CPWG <
>> cpwg at icann.org>
>> *Sent:* Saturday, October 9, 2021 6:37:10 AM
>> *To:* Roberto Gaetano <roberto_gaetano at hotmail.com>
>> *Cc:* CPWG <cpwg at icann.org>
>> *Subject:* Re: [CPWG] ICANN position on the Facebook outage
>>
>> Agree with both Seun and the original post. It is NOT an ICANN issue.
>> That said, a brief post from ICANN with simple explanatory text would
>> help; the CloudFlare post is a really good example.
>>
>> Holly
>>
>> On Oct 9, 2021, at 7:30 PM, Roberto Gaetano via CPWG <cpwg at icann.org>
>> wrote:
>>
>> +1
>> agree, it is not an ICANN issue - but in absence of a formal ICANN
>> statement some Internet users might have a different impression
>> r
>>
>>
>> On 09.10.2021, at 10:25, Seun Ojedeji via CPWG <cpwg at icann.org> wrote:
>>
>> +1 to this; it's certainly not an ICANN issue.
>>
>> Regards
>> Sent from my mobile
>> Kindly excuse brevity and typos
>> Every word has consequences.
>> Every silence does too!
>>
>> On Wed, 6 Oct 2021, 15:43 John McCormac via CPWG, <cpwg at icann.org> wrote:
>>
>> The problem with Facebook was self inflicted. Perhaps the simplest
>> solution for ICANN would be a one page text with a graphic explaining
>> that Facebook (or other large company) is not the Internet.
>>
>> The discussion on today's call seemed like a kind of regulatory
>> overreach with a desire to have ICANN tell large companies how to
>> construct their own network architecture. This really is not an ICANN
>> issue.
>>
>> For those who haven't seen it yet, the CloudFlare blog post on what
>> happened with Facebook is worth reading:
>>
>> https://blog.cloudflare.com/october-2021-facebook-outage/
>>
>> Regards...jmcc
>> --
>> **********************************************************
>> John McCormac  *  e-mail: jmcc at hosterstats.com
>> MC2            *  web: http://www.hosterstats.com/
>> 22 Viewmount   *  Domain Registrations Statistics
>> Waterford      *  Domnomics - the business of domain names
>> Ireland        *  https://amzn.to/2OPtEIO
>> IE             *  Skype: hosterstats.com
>> **********************************************************
>>
>> _______________________________________________
>> CPWG mailing list
>> CPWG at icann.org
>> https://mm.icann.org/mailman/listinfo/cpwg
>>
>> _______________________________________________
>> By submitting your personal data, you consent to the processing of your
>> personal data for purposes of subscribing to this mailing list in accordance
>> with the ICANN Privacy Policy (https://www.icann.org/privacy/policy) and
>> the website Terms of Service (https://www.icann.org/privacy/tos). You
>> can visit the Mailman link above to change your membership status or
>> configuration, including unsubscribing, setting digest-style delivery or
>> disabling delivery altogether (e.g., for a vacation), and so on.
>>