Early detection of configuration errors to reduce failure damage Xu et al, OSDI ’16
Here’s one of those wonderful papers that you can read in the morning, and be taking advantage of the results the same afternoon! Remember the ‘ Simple testing can prevent most critical failures ‘ paper from OSDI’14 that we looked at last month? In that paper we learned that trivial mistakes in error handling, which are easy to test for, accounted for a vast majority of catastrophic production incidents. Well, as soon as you’ve got your error / exception handlers sorted out, you might want to read today’s paper to discover another class of easy-to-test for bugs that are also disproportionately responsible for nasty production failures.
Facebook’s ‘ Holistic configuration management ‘ paper stresses the importance of version control and testing for configuration, as configuration errors are a major source of site errors. Xu et al. study configuration parameters in the wild, focusing especially on those associated with reliability, availability, and serviceability (RAS) features. What they find is that very often configuration values are not tested as part of system initialization. The program runs along happily until reaching a point (say for example, it needs to failover) where it needs to read some configuration for the first time, and then it blows up – typically when you most need it.
So here’s the short takeaway test all of your configuration settings as part of system initialization and fail-fast if there’s a problem . Do that, and you’ll cut out another big source of major production errors. The paper itself is in two parts: the first part (where I’ll focus most of my attention in this short write-up) is an analysis of these latent configuration errors in real code bases; the second part introduces a tool called PCheck, which if your program is written in C or Java can even find latent configuration usage and automatically write tests for you!
Latent configuration errors can result in severe failures, as they are often associated with configurations used to control critical situations such as fail-over, error handling, backup, load balancing, mirroring, etc… Their detection or exposure is often too late to limit the failure damage.
In a study of real-world configuration issues in the the products of COMP-A, “major storage company in the US,” with footnote “we are required to keep the company and its products anonymous,” it turns out that 75% of all high severity configuration-related errors are caused by latent configuration errors. It may well be that the authors are required to keep COMP-A anonymous, but I couldn’t help noticing the author affiliations printed in big type on the front page. A more than fair chance that company is NetApp I would say!
The authors also studied a number of real-world open-source systems (see table below), and inspected usage of all of their RAS-related configuration parameters.
They looked at how many of those parameters were explicity checked vs simply being used when first required, yielding the results below:
Many of the studied RAS parameters do not have any special code for checking the correctness of their settings. Instead, the correctness is verified (implicitly) when the parameters’ values are actually used in operations such as a file open call.
Here’s an example of a real-world latent configuration error in MapReduce: