Earlier this week, Amazon held its annual celebration of consumerism known as Prime Day, a 36-hour orgy of buying that spanned over a dozen countries. By Amazon’s own reporting, it was a success. Consumers bought over 100 million products, according to a statement from the company. They gobbled up deals, purchasing more than a quarter-million Instant Pots, more than a million smart-home gadgets, and stuff like water filters, DNA tests, and school supplies.
But it wasn’t smooth sailing. There were widespread reports of problems: shoppers saw an error page with a dog on it, found their shopping carts suddenly empty, or couldn’t get a “shop all deals” page to load. A chart at downdetector.com shows a spike in reported issues on Monday, July 16, when Prime Day kicked off.
Amazon has not explained what caused those problems. “It wasn’t all a walk in the (dog) park, we had a ruff start – we know some customers were temporarily unable to make purchases,” the company said in a statement, referring to those canine-filled error pages.
All this raises a question: if even a web behemoth like Amazon.com can suffer hiccups, how do companies prepare when they know to expect a flood of traffic—and why do those systems still sometimes fail?
Fault tolerance and the Chaos Monkey
Companies need to set themselves up in advance for a deluge of traffic, like stockpiling your kitchen before a crowd of hungry guests arrives while making sure you can still dash to the grocery store if you run short.
One tactic is to ensure they have enough computing capacity to adjust dynamically to the traffic they get. An easy way to do that is to take advantage of the vast scale of cloud computing from the likes of Amazon Web Services (AWS) and competitors such as Google Cloud Platform and Microsoft Azure. A company’s computing capacity can then do what the industry calls “elastic scaling”: as it needs more resources—computing power in response to web traffic—it can get them in real time. It’s the equivalent of calling in computing reinforcements.
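As a rough illustration, the scaling decision boils down to matching server count to demand. This toy sketch is not any real cloud provider’s API; the thresholds and capacity figures are invented for illustration:

```python
# Toy sketch of elastic scaling: add or remove servers as traffic changes.
# Capacity per server and the min/max fleet sizes are made-up numbers.

def scale(current_servers, requests_per_second, capacity_per_server=1000,
          min_servers=2, max_servers=100):
    """Return the number of servers needed to absorb current traffic."""
    needed = -(-requests_per_second // capacity_per_server)  # ceiling division
    return max(min_servers, min(max_servers, needed))

print(scale(4, 2500))    # modest traffic: 3 servers suffice
print(scale(4, 250000))  # a Prime Day-style surge: capped at 100 servers
```

Real autoscalers watch metrics like CPU load rather than raw request counts, but the core idea is the same: scale out as demand rises, scale back in as it falls.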
Of course, there’s a hint of irony in the fact that Amazon.com had problems on a day it knew it was going to receive a surge of traffic, given that it owns a service, AWS, that it sells to companies to avoid having such problems.
“The solution to every problem is to add more machines—you can’t do that if you’re a mom-and-pop shop,” says Justine Sherry, an assistant professor of computer science at Carnegie Mellon University who studies computer networks. “You’re probably still better off having resources that Amazon [via AWS] has put together, than what you can cobble together yourself.”
A related way companies ensure that traffic is routed smoothly is with load balancers—machines in a data center that decide which other machines in the same center handle incoming requests, an important task whether traffic is light or heavy. Those machines serve you a copy of the website you want to visit, called a replica.
“The load balancer is just choosing a replica to give you,” Sherry says. “That’s really the magic that makes cloud computing work—it looks like one machine, but it’s actually thousands or hundreds of thousands, and that’s why they can handle so much load.”
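Sherry’s description of “just choosing a replica” can be sketched in a few lines. This is a toy round-robin balancer, not how any real data center’s software works, and the replica names are made up:

```python
# Toy round-robin load balancer: each request is handed to the next replica
# in rotation, so no single copy of the site takes all the load.
from itertools import cycle

replicas = ["replica-1", "replica-2", "replica-3"]  # hypothetical server names
next_replica = cycle(replicas)

def route(request_id):
    """Pick a replica to serve this request."""
    return next(next_replica)

for i in range(5):
    print(i, "->", route(i))
```

Production load balancers use smarter policies—weighting by server health or current load, for instance—but round-robin captures the basic trick of spreading requests across many identical copies.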
That’s not all companies do when running data centers. They also prepare for the possibility that some part of the system will fail. If a piece does break, will the whole still work? For that, like an airplane, they need redundancy. The concept is called “fault tolerance.” And to test their fault tolerance, engineers will purposely conduct stress tests.
“I often find the kinds of things that they do really surprising—because they generally go in and try to break their own machines,” Sherry says. One tool for that is Netflix’s aptly named Chaos Monkey, which is software that disables parts of a system to see how it holds up to partial failure.
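In the same spirit, a bare-bones chaos experiment just disables a random machine and checks that the rest keep the site alive. This sketch only gestures at what the real Chaos Monkey does; the server names and health model are invented:

```python
# Toy chaos experiment in the spirit of Chaos Monkey: randomly "kill" one
# server, then check whether the remaining replicas can still serve traffic.
import random

servers = {"web-1": True, "web-2": True, "web-3": True}  # True = healthy

def chaos_monkey(fleet):
    """Disable one random server to test fault tolerance."""
    victim = random.choice(list(fleet))
    fleet[victim] = False
    return victim

def site_is_up(fleet):
    """The site survives as long as at least one replica is healthy."""
    return any(fleet.values())

killed = chaos_monkey(servers)
print(f"killed {killed}; site still up: {site_is_up(servers)}")
```

The point of running this on purpose, rather than waiting for real hardware to die, is that engineers find out during business hours which failures the system can absorb and which ones take the whole site down.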
Another strategy for ensuring server stability: don’t mess with it before all the traffic hits. It’s a common approach in the retail industry as the holiday shopping season approaches, says Shuman Ghosemajumder, the CTO of cybersecurity company Shape Security. “They will lock down their infrastructure well before the peak season begins,” he says. “Often in September, sometimes as early as August, they’ll say ‘no changes are going into our infrastructure—because we just don’t understand what effect they might have under load.’”