Imagine for a moment that the millions of computer chips inside the servers that power the world’s largest data centers have rare, nearly undetectable flaws. And the only way to find those flaws is to throw the chips at huge computing problems that would have been unthinkable just a decade ago.
As the tiny switches in computer chips shrink to the width of a few atoms, the reliability of the chips has become another worry for the people who run the world’s largest networks. Companies like Amazon, Facebook, Twitter and many other sites experienced surprising outages over the past year.
The outages had various causes, including programming mistakes and congestion on the networks. But there is growing concern that as cloud-computing networks have become larger and more complex, they still depend, at the most basic level, on computer chips that are now less reliable and, in some cases, less predictable.
Last year, researchers at both Facebook and Google published studies describing computer hardware failures whose causes were not easy to pinpoint. They argued that the problem was not in the software; it lay somewhere in the computer hardware made by various companies. Google declined to comment on its study, while Facebook did not respond to requests for comment on its study.
“They’re actually seeing these silent errors coming from the underlying hardware,” said Subhasish Mitra, an electrical engineer at Stanford University who specializes in testing computer hardware. There is a growing belief, Dr. Mitra said, that manufacturing defects are tied to these so-called silent errors, which cannot be easily caught.
Researchers worry that they are finding these rare flaws because they are trying to solve bigger and bigger computing problems, which stresses their systems in unexpected ways.
Companies operating large data centers began reporting systematic problems more than a decade ago. In 2015, in the engineering publication IEEE Spectrum, a group of computer scientists who study hardware reliability at the University of Toronto reported that each year as many as 4 percent of Google’s millions of computers encountered errors that went undetected and caused the machines to shut down unexpectedly.
In a microprocessor with billions of transistors or a computer memory card made up of trillions of tiny switches, each capable of storing 1 or 0, even the smallest mistake can now disrupt systems that routinely perform billions of calculations every second.
At the dawn of the semiconductor era, engineers worried that cosmic rays might occasionally flip a single transistor and change the outcome of a calculation. Now they worry that the switches themselves are becoming less and less reliable. The Facebook researchers even argue that the switches are becoming more prone to wearing out and that the life span of computer memories or processors may be shorter than previously believed.
There is growing evidence that the problem is worsening with each new generation of chips. Research published in 2020 by the chip maker Advanced Micro Devices found that the most advanced computer memory chips at the time were about 5.5 times less reliable than the previous generation. AMD did not respond to requests for comment on the report.
These flaws are hard to track down, said David Ditzel, a senior hardware engineer in Mountain View, Calif., who is president and founder of Esperanto Technologies, a maker of a new type of processor designed for artificial intelligence applications. The company’s new chip, which is just reaching the market, has 1,000 processors made from 28 billion transistors.
He likens the chip to an apartment building that would span the entire United States. Using Mr. Ditzel’s metaphor, Dr. Mitra said that finding the new flaws is a bit like searching for a single running faucet in one apartment in that building, one that malfunctions only when a bedroom light is on and the apartment door is open.
Until now, computer designers have tried to deal with hardware flaws by adding special circuits to chips that automatically detect and correct erroneous data. Such errors were once considered extremely rare. But several years ago, Google’s production teams began to report errors that were maddeningly hard to diagnose. According to their reports, the calculation errors happened intermittently and were difficult to reproduce.
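Those correcting circuits are built into the silicon, but the idea behind them can be sketched in software. Below is a minimal, illustrative Python version of the classic Hamming(7,4) scheme, which stores 4 data bits alongside 3 parity bits so that any single flipped bit can be located and repaired on readback. This is an analogy for how such circuits work, not the designs the companies actually use; real server memory relies on stronger codes.

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit codeword with 3 parity bits."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4            # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4            # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4            # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Recover the 4 data bits, repairing at most one flipped bit."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 0 means no error detected
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1          # flip the bit the syndrome points to
    return [c[2], c[4], c[5], c[6]]

# Simulate a single-bit fault and confirm the data comes back intact.
word = [1, 0, 1, 1]
damaged = hamming74_encode(word)
damaged[5] ^= 1
assert hamming74_decode(damaged) == word
```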
A group of researchers tried to trace the problem, and last year they published their findings. They concluded that the company’s vast data centers, composed of computer systems based on millions of processor “cores,” were experiencing new errors that were probably a combination of two factors: smaller transistors nearing physical limits and inadequate testing.
In their paper, titled “Cores That Don’t Count,” the Google researchers noted that the problem was challenging enough that they had already dedicated the equivalent of decades of engineering time to solving it.
Modern processor chips are made up of dozens of processor cores, the calculating engines that make it possible to break up tasks and solve them in parallel. The researchers found that a tiny subset of the cores produced inaccurate results infrequently and only under certain conditions. They described the behavior as sporadic. In some cases, the cores would produce errors only when computing speed or temperature was altered.
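One rough, hypothetical way to hunt for such sporadic cores in software is to pin a process to a single core, repeat a computation whose answer is known in advance, and flag the core if any run disagrees. The sketch below is illustrative only and is not the methodology of the Google paper; the affinity call is Linux-specific, and a real screener would also vary load, frequency and temperature.

```python
import hashlib
import os

def workload(rounds: int = 50_000) -> str:
    """A deterministic, CPU-heavy computation whose answer never changes."""
    h = hashlib.sha256(b"seed")
    for _ in range(rounds):
        h = hashlib.sha256(h.digest())
    return h.hexdigest()

EXPECTED = workload()   # reference answer, computed once up front

def screen_core(core_id: int, trials: int = 100) -> bool:
    """Pin this process to one core and check every trial against the reference."""
    os.sched_setaffinity(0, {core_id})    # Linux-only: restrict execution to this core
    return all(workload() == EXPECTED for _ in range(trials))

if __name__ == "__main__":
    suspect = [c for c in range(os.cpu_count()) if not screen_core(c)]
    print("cores with mismatched results:", suspect or "none")
```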
According to Google, increased complexity in processor design was one important cause of the failures. But the engineers also said that smaller transistors, three-dimensional chips and new designs that produce errors only in certain cases all contributed to the problem.
In a similar paper released last year, a group of Facebook researchers noted that some processors would pass manufacturers’ tests but then begin to exhibit failures once they were in the field.
Intel executives said they are familiar with Google and Facebook research papers and are working with both companies to develop new ways to detect and fix hardware errors.
Bryan Jorgensen, vice president of Intel’s data platforms group, said the researchers’ assertions were correct and that “the challenge they’ve put to the industry is the right place to go.”
He said Intel has recently started a project to help create standard, open-source software for data center operators. The software will make it possible for them to find and fix hardware errors that are not detected by the onboard circuits in the chips.
The challenge was underscored last year when several of Intel’s customers quietly issued warnings about undetected errors being created by their systems. Lenovo, the world’s largest maker of personal computers, informed its customers that design changes in several generations of Intel’s Xeon processors meant the chips could produce a greater number of uncorrectable errors than earlier Intel microprocessors.
Intel has not made a public statement on the matter, but Mr. Jorgensen acknowledged the issue and said it has now been fixed. The company has since changed its design.
Computer engineers are divided over how to respond to the challenge. One common response is a call for new kinds of software that proactively watch for hardware errors and make it possible for system operators to remove hardware when it begins to fail. That has created an opportunity for new start-ups offering software that monitors the health of the underlying chips in data centers.
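A minimal sketch of that “monitor and remove” idea, assuming hypothetical run_probe and drain_host hooks that stand in for whatever checks and orchestration a real fleet would use:

```python
import time

FAILURE_THRESHOLD = 3      # consecutive failed probes before a host is pulled
INTERVAL_SECONDS = 300.0   # how often to re-check the fleet

def monitor(hosts, run_probe, drain_host):
    """Probe each host on a timer and drain any host that keeps failing."""
    failures = {host: 0 for host in hosts}
    while True:
        for host in hosts:
            ok = run_probe(host)                     # e.g. a known-answer computation
            failures[host] = 0 if ok else failures[host] + 1
            if failures[host] >= FAILURE_THRESHOLD:
                drain_host(host)                     # remove from serving rotation
                failures[host] = 0
        time.sleep(INTERVAL_SECONDS)
```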
One such start-up is TidalScale, a company in Los Gatos, Calif., that makes proprietary software for companies trying to minimize hardware outages. Its chief executive, Gary Smerdon, suggested that TidalScale and others face an imposing challenge.
“It will be a bit like switching engines while a plane is still flying,” he said.
