Engineers at Meta Reveal How They Mitigate Hidden CPU Defects at Scale | Data Center Knowledge

Engineers at Meta (formerly Facebook) have published a paper outlining methods for detecting and mitigating so-called Silent Data Corruption (SDC) errors, which are caused by minor CPU defects that often develop over time.

The paper, titled Detecting Silent Data Corruptions in the Wild, describes the intensive testing regimen that the company has used internally for the past three years to keep its own servers performing as expected.

“Historically, each CPU went through only a few hours of testing as part of infrastructure burn-in tests. Further testing was typically conducted via sampling,” Harish Dattatraya Dixit, engineering lead at Meta, explained in a lengthy post on the company’s blog.

“We observe that novel detection approaches are required for application health and fleet resiliency. We demonstrate the ability to test at scale and get through billions of seconds of testing every month across a large fleet consistently. These novel techniques enable us to detect silent data corruptions and mitigate them at scale.”

Detect the undetectable

The main challenge with SDC errors is that they are not captured by conventional CPU error-reporting mechanisms.

Issues with silicon transistors, which measure just a few nanometers across, can result in serious application-level problems, but it is almost impossible to track where these originated. Likely causes for CPU defects can include temperature variance and age, among other factors.

Last year, Meta’s team of engineers outlined the problem in a separate paper that described common defect types observed in silicon manufacturing that lead to SDCs.

Now, the same team has shared the testing strategies it developed internally to detect and mitigate SDCs within large server fleets.

In a nutshell, there are two ways to test installed server CPUs: in production and out of production. Both come with their own benefits and drawbacks. In-production testing is faster, but can have a negative impact on live workloads. Out-of-production testing is safer, but requires system downtime, which makes it a less palatable option unless it is done during scheduled maintenance or a reboot. That is why Meta calls it “opportunistic testing.” According to Dixit, the trick to a healthy data center is doing both, all the time.

“At Meta, we implement multiple methods to detect SDCs, the two most effective of which are opportunistic and periodic testing and in-production/ripple testing,” he said.

“Opportunistic testing has been live within the fleet for around three years. However, in evaluating the trade-offs from one to the other, we’ve determined that both approaches are equally important to detecting SDCs, and we recommend using and deploying both in a large-scale fleet.”

According to Dixit, in Meta’s own data centers, in-production testing – that mostly just injects “bit patterns with expected results every now and then” – would detect 70 percent of the fleet data corruptions within 15 days, but the rest of the errors could only be identified with opportunistic testing. For this reason, Meta’s machines go through opportunistic testing an average of once every six months.
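The idea of injecting bit patterns with expected results can be illustrated with a small known-answer test. The sketch below is not Meta's implementation; the function names (make_pattern, workload, known_answer_test), the arithmetic mix, and the seed are all hypothetical, chosen only to show the shape of the technique: run a deterministic computation on a fixed input and compare the result with a value recorded on hardware assumed to be healthy.

```python
import random

MASK64 = (1 << 64) - 1  # keep arithmetic within 64 bits, like a CPU register


def make_pattern(seed: int, n_words: int = 1024) -> list[int]:
    """Build a reproducible list of 64-bit words from a seed (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.getrandbits(64) for _ in range(n_words)]


def workload(pattern: list[int]) -> int:
    """Deterministic mix of multiply, add, rotate and XOR over the pattern.

    This stands in for whatever instructions a real test would exercise.
    """
    acc = 0
    for word in pattern:
        acc = (acc * 6364136223846793005 + word) & MASK64
        rotated = ((acc << 13) | (acc >> 51)) & MASK64  # rotate left by 13
        acc ^= rotated
    return acc


def known_answer_test(seed: int, expected: int, run=workload) -> bool:
    """Return True if the computed result matches the recorded expected value."""
    return run(make_pattern(seed)) == expected


if __name__ == "__main__":
    # Assumption: the golden value is recorded once on a known-good machine.
    golden = workload(make_pattern(42))

    # A healthy core reproduces the golden value.
    print("healthy core passes:", known_answer_test(42, golden))

    # Simulate a defective core that silently flips one bit of the result.
    def faulty(pattern: list[int]) -> int:
        return workload(pattern) ^ 1

    print("faulty core passes:", known_answer_test(42, golden, run=faulty))
```

A mismatch raises no hardware error on its own, which is the point: the corruption only becomes visible because the expected answer was known in advance.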

“We shared our early results with CPU vendors, who have learned lessons from our fleet and implemented mechanisms within their tools to achieve similar results,” Dixit said.

In order to further advance the state of SDC detection, Meta has issued a request for proposals, looking to award up to five university research teams up to $50,000 each. Applications for this grant are open until March 22.
