* AWS outage affects major websites and apps globally
* Issue originated in US site known for outages
* Service improved, then problems recurred before recovery
* Gaming and financial platforms among those impacted by AWS issue
(Updates to reinstate, no change to text.)
By Greg Bensinger, Shubham Kalia and Deborah Mary Sophia
SAN FRANCISCO, Oct 20 (Reuters) - Amazon.com's ( AMZN ) cloud
service returned to normal operations on Monday afternoon, the
company said, after an internet outage that caused global
turmoil among thousands of sites, including some of the web's
most popular apps like Snapchat and Reddit ( RDDT ).
Still, Amazon ( AMZN ) said some AWS services had a backlog
of messages that would take a few hours to process.
AWS hosts applications and computer processes for companies
around the world, and the disruption knocked workers from London
to Tokyo offline and halted others from conducting normal
everyday tasks like paying hairdressers or changing their
airline tickets. Users on Monday afternoon had complained of
lingering difficulties using services such as digital wallet
Venmo and video calling site Zoom.
It was the largest internet disruption since last year's
CrowdStrike ( CRWD ) malfunction hobbled technology systems in hospitals,
banks and airports, highlighting the vulnerability of the
world's interconnected technologies.
It was at least the third time in five years that AWS's
northern Virginia cluster, known as US-EAST-1, contributed to a
major internet meltdown.
Amazon ( AMZN ) did not address a request for more clarity about why
that particular data center keeps being impacted. The problems
stemmed from a fault in what is known as the Domain Name System,
or DNS, which prevented applications from finding the correct address
for AWS's DynamoDB API, a cloud database relied upon to store
user information and other critical data.
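As a rough illustration of the failure described above, the
sketch below shows how an application might check whether a
service hostname still resolves to an address. It is a minimal
example under stated assumptions, not AWS's or any affected
company's code: the function name resolve_endpoint and its
resolver parameter are hypothetical, and only Python's standard
socket module is used.

```python
import socket


def resolve_endpoint(hostname, port=443, resolver=socket.getaddrinfo):
    """Return the sorted unique IP addresses for hostname, or [] if DNS fails.

    `resolver` is a hypothetical injection point so the lookup can be
    replaced in tests; by default it is the standard-library getaddrinfo.
    """
    try:
        results = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be resolved: the application has no address
        # to connect to, even if the service behind it is healthy.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({entry[4][0] for entry in results})
```

An application could call resolve_endpoint with the hostname of
the database API it depends on and treat an empty list as a
signal to retry later or switch to a backup, rather than failing
outright.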
ROOT CAUSE IS NETWORK HEALTH MONITOR
Earlier, AWS said the root cause of the outage was an
underlying subsystem that monitors the health of its network
load balancers used to distribute traffic across several
servers.
The issue, AWS said, originated from within the "EC2
internal network," a reference to Amazon's ( AMZN ) "Elastic Compute
Cloud" service, which provides on-demand cloud capacity within AWS.
Shortly after 3 p.m. PT (2200 GMT), Amazon ( AMZN ) said, "all AWS
services returned to normal operations. Some services such as
AWS Config, Redshift, and Connect continue to have a backlog of
messages that they will finish processing over the next few
hours."
Ken Birman, a computer science professor at Cornell
University, said software developers need to build better fault
tolerance. He said AWS provides tools developers can use to
protect themselves in the event of a problem at any one of the
data centers in its sprawling network, and developers can also
create backups with other cloud providers.
"When people cut costs and cut corners to try to get an
application up, and then forget that they skipped that last step
and didn't really protect against an outage, those companies are
the ones who really ought to be scrutinized later," Birman told
Reuters.
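The kind of fault tolerance Birman describes can be sketched in a
few lines. The example below is only an illustration under stated
assumptions: the function call_with_failover and the operation
callback are hypothetical names, not part of any AWS tool, and
real deployments would also need to keep data replicated between
locations.

```python
def call_with_failover(regions, operation):
    """Try `operation(region)` for each region in order.

    Returns (region, result) for the first region that succeeds.
    Raises RuntimeError if every region fails with a network-level error.
    Both the function and the `operation` callback are hypothetical.
    """
    errors = {}
    for region in regions:
        try:
            return region, operation(region)
        except OSError as exc:
            # OSError covers connection failures and DNS lookup errors;
            # record the failure and move on to the next region.
            errors[region] = exc
    raise RuntimeError(f"all regions failed: {sorted(errors)}")
```

A caller might pass a list such as ["us-east-1", "us-west-2"]
so that a request falls back to a second region when the first
is unreachable.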
ISSUE ORIGINATED FROM AWS SITE KNOWN FOR PREVIOUS OUTAGES
AWS provides computing power, data storage and other digital
services to companies, governments and individuals and is the
world's largest cloud provider, followed by Microsoft's ( MSFT )
Azure and Alphabet's Google Cloud.
Disruptions to its servers can cause outages across websites
and platforms - ranging from food delivery apps to gaming
platforms and airline systems - that rely on its cloud
infrastructure.
AWS said on its status page that Monday's outage originated
at its US-EAST-1 location, its oldest and largest for web
services. The site suffered outages in 2021 and 2020.
According to documentation on the AWS website, the US-EAST-1
site is often the default region for many AWS services.
"FRAGILE INFRASTRUCTURES"
The problem highlights how interconnected everyday digital
services have become and their reliance on a small number of
global cloud providers, with one glitch wreaking havoc on
business and day-to-day life, experts and academics said.
"This outage once again highlights the dependency we have on
relatively fragile infrastructures," said Jake Moore, global
cybersecurity advisor at European cybersecurity firm ESET.
In Britain, Lloyds Bank, Bank of Scotland and
telecom service providers Vodafone ( VOD ) and BT were
all hit, according to Downdetector's UK website, as was UK tax,
payments and customs authority HMRC's website.
"The main reason for this issue is that all these big
companies have relied on just one service," said Nishanth
Sastry, director of research at the University of Surrey's
Department of Computer Science.
Ookla, which owns Downdetector, said over 4 million users
reported issues due to the incident.
"For major businesses, hours of cloud downtime translate to
millions in lost productivity and revenue," said Ryan Griffin,
U.S. cyber practice leader at insurance broker McGill and
Partners.
Wall Street was largely unfazed, sending Amazon ( AMZN ) shares 1.6%
higher to $216.48.
FROM SNAPCHAT TO VENMO: OUTAGE TAKES DOWN APPS
Ookla said at least a thousand companies were affected by
the outage.
Apps like Reddit ( RDDT ), Roblox ( RBLX ), Snapchat
and Duolingo ( DUOL ) were all affected.
Artificial intelligence startup Perplexity, cryptocurrency
exchange Coinbase and trading app Robinhood
all experienced platform disruptions and attributed them to AWS.
Amazon's ( AMZN ) own services, including its shopping website, Prime
Video and Alexa, were also hit.
Fortnite, which is owned by Epic Games, as well as Clash Royale
and Clash of Clans, were among the gaming platforms affected.
Uber ( UBER ) rival Lyft ( LYFT ) was also knocked offline in the
United States.
In a post on X, Signal President Meredith Whittaker
confirmed the messaging app was hit by the outage, though
billionaire Elon Musk, who owns X, said his platform continued
to work.