- Software applications are becoming more complex, outages are becoming more expensive, and consumers are becoming less tolerant of downtime.
- Chaos engineering is not quite mainstream yet, but the majority of respondents to this Q&A believe the practice is past the "innovator" stage and well into the "early adopter" segment of the adoption curve. Incident response and mitigation are becoming increasingly important and prioritized within organizations.
- The core concepts of chaos engineering are well established, but over the past several years the wider community’s understanding of it has grown. Engineers are starting to learn that it’s a principled practice of experimentation and information sharing, not the folklore of completely random attacks.
- Software engineers have always tested in production (even if they don’t realize it). Chaos engineering can provide a more formal approach.
- Focusing on the people and process aspects can add a lot of value by, for example, running incident response and management rehearsal game days. The biggest hurdle to adoption, as with most technological change, often lies with people. Defining the rationale for the practice, setting expectations, and building trust throughout the organization are key.
- Too many people think that the point of building and running systems is to not make mistakes. Once they understand that a great system is resilient and not perfect, then the organizational understanding of the benefits and practices of chaos engineering often develops naturally.
- One of the nice things about chaos engineering is that you don’t need a lot of additional tools to start. For example, you can use Linux’s native kill command to stop processes and iptables to introduce network connectivity issues.
- At Google, teams run DiRT (disaster and recovery testing) where they physically go into data centers and do “terrible things to the machines” that service Google’s internal systems — like unplugging hardware. Less invasive open-source and commercial tooling exists for organizations that don’t run at Google scale.
- In order for chaos engineering investments to be beneficial, an organization will first need to build a culture of learning from incidents. Without such learning, investments may fail to generate value and people may get frustrated. Being able to observe your systems and understanding the impact of downtime are also key prerequisites.
The second Chaos Conf event is taking place in San Francisco on September 25-26, and InfoQ will be providing summaries of key takeaways from the event. In preparation for the conference, InfoQ sat down with a number of the presenters to discuss topics such as the evolution and adoption of chaos engineering, important learning related to the people and process aspects of running chaos experiments, and the biggest obstacles to mainstream adoption.
Readers looking to continue their learning journey with chaos engineering can find a summary of the recent Chaos Community Day v4.0, a multi-article chaos engineering eMag, and a summary of various QCon talks on resilient systems.
InfoQ: Many thanks for taking part in the Chaos Conf 2019 pre-event Q&A. Could you briefly introduce yourself please?
Jason Yee: Hi, I’m a senior technical evangelist at Datadog. Datadog is a SaaS-based observability platform that enables developers, operations, and business teams to get better insight into user experiences and application performance.
Caroline Dickey: Hi! I’m a site-reliability engineer at Mailchimp, an all-in-one marketing platform for small businesses. I build tooling and configuration to support engineer momentum, develop monitoring and service-level objectives to promote the health of our application, and lead the chaos-engineering initiative at Mailchimp.
Joyce Lin: Hey! I’m a developer advocate at Postman — an API development platform used broadly in the development community. I work with organizations who are pioneering better software development practices.
Robert (Bobby) Ross: My name is Robert Ross but people like to call me Bobby Tables after the popular XKCD comic. I’m a bleeding-edge enthusiast for technology, so I like the pain of trying new things. I’m the CEO of FireHydrant, an incident response tool.
Kolton Andrus: I’m CEO and co-founder of Gremlin. I cut my teeth building reliable systems at Amazon, where my team was in charge of the retail website’s availability. I built their first "chaos engineering" platform (before that name had been coined) and helped other teams perform their first experiments. Then I joined Netflix, who had just started blogging on this topic. I had the opportunity to build their next generation of fault-injection testing (FIT), and worked with all of the critical streaming teams to improve our reliability from three nines to four nines of uptime (99.99%). After seeing the value to Amazon and Netflix, I felt strongly that everyone would need this approach and great tooling to support it, so I founded Gremlin.
Yury Niño: Hi, I am a software engineer from Universidad Nacional in Colombia. Currently, I am working as a DevOps engineer at Aval Digital Labs, an initiative that leads the digital transformation of a group of banks in my country.
Also, I am a chaos engineer advocate. I love breaking software applications, designing resilience strategies, running experiments in production and solving hard performance issues. I am studying how human errors and lack of observability are involved in the safety of software systems. At this moment, I am leading an initiative to create the first chaos community in Colombia. We are building Gaveta, a mobile app for supporting the execution of chaos game days.
Jose Esquivel: Hi, I’m an engineering manager at Backcountry.com where I work within supply-chain management for the fulfillment, marketing, and content teams. These teams run about 65 APIs that interact within our network, with other third-party systems, and with public APIs. My responsibility, from a technical perspective, is to make sure these APIs remain stable and that new features are added on time.
Subbu Allamaraju: I’m a VP at Expedia Group. My tenure at Expedia started with bootstrapping a strategic migration of our travel platforms to the cloud. These days, I spend a lot more time on continuing to energize the pace of this transformation, and more importantly, set us up for operational excellence. I feel there is a lot we can learn from each other, so I write on my blog at https://subbu.org and speak at conferences.
Dave Rensin: Howdy! I’m a senior engineering director at Google currently working on strategic projects for our CFO. At Google, I’ve run customer support, part of SRE, and global network capacity planning.
Paul Osman: I run the Site Reliability Engineering team at Under Armour Connected Fitness. My team works on products like MyFitnessPal, MapMyFitness, and Endomondo. Before joining Under Armour, I worked at PagerDuty, SoundCloud, 500px, and Mozilla. For most of my career, I’ve straddled the worlds of ops and software development, so I naturally found myself gravitating towards SRE work. I’ve practiced chaos engineering at several companies over the years.
InfoQ: How has chaos engineering evolved over the last year? And what about the adoption — is chaos engineering mainstream yet?
Yee: I don’t think chaos engineering itself has really evolved — the concepts are still the same, but our understanding of it has grown. Engineers are starting to learn that it’s a principled practice of experimentation and information sharing, not the folklore of completely random attacks. As the myth gets busted and best practices are established, it has fueled adoption. It’s not quite mainstream yet, but we’re certainly past the "innovator" stage and well into the "early adopter" segment of the adoption curve.
Dickey: Incident response and mitigation is becoming increasingly important and prioritized within organizations. Applications are becoming more complex, outages are becoming more expensive, and consumers are becoming less tolerant of downtime. And fascinatingly, the trend over the past year or two has been away from root-cause analysis or casting blame, and towards thinking of incidents as a perfect storm. Chaos engineering is the ideal counterpart for this industry paradigm shift since it allows engineers to identify and fix issues that could lead to a cascading failure before it ever occurs.
Chaos engineering is mainstream within large tech companies and early adopters but is still finding its footing within small and medium-sized companies. However, it’s a trend that’s here to stay, since software isn’t getting any less distributed, and humans aren’t getting any less error-prone! I’m excited to see what the adoption of chaos engineering and related practices looks like in a few years — it’s a fun time to be a part of this industry.
Lin: I believe that chaos engineering is still the new kid on the block that most teams think about implementing one day. Once a team has covered all the bases when it comes to pre-release testing, then initiating chaos experiments becomes a lot more feasible.
Ross: One big thing I’ve seen come out of the chaos-engineering movement is the popularization of game days. They’re not a totally new concept, but I’ve seen and heard of more and more companies performing them. People are seeing they already have these staging environments where they test software before release, so why not test their incident response process there too? Large companies have been doing this for years, but smaller shops are doing this more and more.
Andrus: Only a few years ago, I often had to explain why chaos engineering was necessary. Now, most teams doing serious operations understand the value and just need some help getting started. The early adopters that have been doing this for a couple of years now are having a lot of success, and are tackling how to scale it across their organizations. The early majority are still testing it out with projects and teams before embracing it more holistically.
Niño: I think chaos engineering has been moving ahead at a staggering speed in a very short amount of time. In 2016, the authors of the classic article "Chaos Engineering" asked whether it would be possible to build a set of tools for automating failure injection that was reusable across organizations. Nowadays, just three years later, we have more than 20 tools for implementing chaos engineering, and several organizations are using them to build more reliable systems in production.
According to the latest Technology Radar published by Thoughtworks, chaos engineering has moved from a much-talked-about idea to an accepted and mainstream approach to improving and assuring distributed system resilience.
Esquivel: Online business has been growing in regard to load, and for e-commerce the majority of the traffic is focused on a few weeks of the year. This means that during this period of high traffic, the stability is directly proportional to the revenue of the company. Adoption of test harnesses like unit, integration, security, load, and finally chaos testing has become mission critical. Adoption has been improving, and we are sure that we need chaos engineering. Now we need to find the time investment to mature it.
Allamaraju: Not yet.
Though this topic has been around for about a decade, the understanding that led to this idea called chaos engineering is still nascent in the industry.
Here is why I think so. Very few of us in this industry see the software we operate in our production environments as stochastic and non-linear systems, and we bring a number of assumptions into our systems with each change we make. Consequently, when teams pick up chaos engineering, they start testing operating-system, network, and application-level failures through various chaos-engineering tools but do not go far enough to learn about safety.
Even though most of the work we do to our production systems seems benign, our work can sometimes push our systems into unsafe zones, thus preventing us from delivering the value that customers need and want. So, learning from your incidents before you go deep into chaos engineering techniques is very important.
Rensin: I think chaos engineering has always been mainstream; we just called it something different. If your customers find it before you do, we call that "customer support". If your telco (or provider) finds it before you do, we call that "bad luck". That, I think, is the biggest change over the last year. As more people discover what chaos engineering is, the more they’re discovering that they’ve already been doing it (or needed to do it) in their human systems.
Osman: Interest is exploding, but lots of companies are still struggling with how to get started. I think that’s because it’s not a magic bullet — chaos engineering exists within a matrix of capabilities that together can be really effective, but without things like observability, a good incident response process, blameless postmortems, etc., you won’t likely get results.
Basically, if you’re not a learning organization, there’s no point in doing experiments, so chaos engineering won’t help you. On the bright side, more and more people are realizing that modern systems are complex and we can’t predict failures. Just having QA teams test in a staging environment isn’t going to cut it anymore, hence the interest in chaos engineering. I think this all probably means it’s mainstream now. 🙂
InfoQ: What is the biggest hurdle to adopting chaos-engineering practices?
Yee: The biggest hurdle, as with most technological change, lies with people. Getting approval from leaders to intentionally break your production systems is a very hard sell, especially without a clear understanding of the business risks and benefits. Creating a solid strategy to manage the blast radius of chaos tests, ensuring discovered weak points are improved, and communicating the business value are all key to winning support.
Dickey: The cultural shift — getting engineers to prioritize and value destructive testing when they have plenty of work sitting in their backlog — has been the most challenging part of driving chaos-engineering adoption at Mailchimp. Planning a game day or implementing chaos-engineering automation requires several hours of pre-work at a minimum, and without a dedicated chaos-engineering team, taking that time to plan and scope scenarios can feel contrary to forward momentum.
So it’s important to remember that every engineering norm required a cultural shift – whether that was adoption of unit tests, code reviews, daily standups, or work-from-home options. Since Mailchimp’s chaos-engineering program was largely bootstrapped by the SRE team, we had to rely on tech talks, internal newsletters, reaching out to teams directly, and making game days fun (we recommend chaos cupcakes) to drive engagement throughout the engineering department.
Lin: A common pitfall is not clearly communicating the reasons why you’re adopting chaos-engineering practices in the first place. When the rest of the organization doesn’t yet understand or believe in the benefit, there is going to be fear and confusion. Worse yet, there is often the perception that you’re just breaking stuff randomly and without legitimate hypotheses.
Ross: You need a process pusher. It’s easy to keep doing what you’re doing because that’s what you’ve always done. You need a champion for trying out chaos engineering. In a way, I think the term "chaos engineering" might scare some people away from trying it, so you need an individual (or a few) to really explain the value to the rest of the team and get everyone on board. Getting buy-in can be hard.
Andrus: I think many people understand the general concept but underestimate the value it can provide to their business. They see it as yet another thing to do instead of a way to save them time with their ongoing projects, whether it be migrating to the cloud or adopting a new service like Kubernetes. Too often, reliability is an afterthought, so we are putting a lot of work into educating the market and explaining the value of putting in the upfront effort.
Niño: I have had to jump many hurdles. Understanding how to build resilient systems involves software, infrastructure, networks, and data. However, I think the biggest challenge is the cultural change.
Getting the approval of customers to experiment with their systems is difficult, in my experience. When we talk about chaos engineering, customers get excited at first. For example, we hear expressions such as "if the top companies are doing this, we should look at chaos engineering too." However, when they understand the fundamental concepts, they say things such as "we are not going to inject failures into our production systems."
On the other hand, I think it’s important to mention the limitations too, to clarify that chaos engineering on its own won’t make the system more robust and that it is a mission for people who are designing the systems. Chaos engineering allows us to know how the system reacts to turbulent conditions in production and builds confidence in its reliability capacity, but designing strategies is our responsibility. The discipline is not a magic box that generates solutions to our weaknesses.
David Woods has a good point on this matter: "Expanding a system’s ability to handle some additional perturbations increases the system’s vulnerability in other ways to other kinds of events. This is a fundamental tradeoff for complex adaptive systems".
Esquivel: Enterprises must be IT savvy to overcome the biggest hurdle in adopting chaos engineering: the lack of a culture that gives high priority to stability. Upper management must understand technology and know that investing in quality translates into delivering effective long-term solutions. Delivering functionality is not enough; whatever is built as a functional feature must also guarantee the non-functional ones: maintainability and consistency through unit and integration testing, performance through load testing, and stability (for example, through chaos testing).
Allamaraju: The biggest hurdle, in my opinion, is understanding the rationale behind chaos-engineering practices, developing a culture of operating our systems safely, and articulating the relative value of such practices against all other work. You can’t succeed with chaos engineering if you can’t articulate and create hypotheses to demonstrate the value it can bring.
Once you cross those hurdles, the actual practice of chaos engineering falls into place naturally.
Rensin: Letting go of perfect. Too many people think that the point of systems is to not make mistakes. Once they get the idea that a great system is resilient, not perfect, then chaos engineering comes naturally.
Osman: I think trust is the most common hurdle. You have to have a culture where teams trust each other and non-engineering stakeholders trust that engineering teams have the customer’s best interest in mind. That trust is extremely important — if teams fear what might happen if they run a chaos experiment and it doesn’t go well, they’ll be understandably hesitant.
Chaos engineering reveals one of those inconvenient truths about complex systems: that we aren’t fully in control, and can’t predict what will happen in production. Unfortunately, some organizations will be reluctant to face this reality. This is why it’s incredibly important to have practices like blameless postmortems and a good incident response process in place before you look at introducing chaos engineering. Those practices impact the culture of an organization and make discussions about chaos engineering possible.
InfoQ: How important are the people aspects with chaos engineering? Can you provide any references for leaders looking to improve resilience across an organization?
Dickey: Chaos-engineering adoption is as much a marketing effort as it is a technical one. Leadership buy-in is crucial for a chaos-engineering program to succeed long term. Awesome Chaos Engineering is a comprehensive list of resources that I’ve found to be a helpful starting place.
Lin: Most organizations have an existing fear of delivering software faster, and mitigate that fear with more and more pre-release testing. Once an organization begins trusting their mean time to recovery, there’s a willingness to learn more with high-impact chaos experiments.
Ross: You might uncover some seriously defective code or infrastructure design when you chaos-test something, the kind of thing that makes you ask "how did this happen?" It’s important to keep in mind that organizations introduce problems, not individuals. You might discover a problem caused by some code, and while you won’t say the author’s name at all, the person who wrote it knows it was them. Depending on the person, they might feel bad and maybe even tighten up because of it. You need to make sure that person feels supported and enabled, because we are all going to make mistakes (in fact, if you have code in production, you already have). Avoid finger-pointing at all costs. If you can’t avoid it, you’re not ready for chaos engineering.
Andrus: One of my pet peeves in our industry is the lack of investment in training our teams before they are placed on call. There’s a reason we run fire drills: it’s so we can have the opportunity to practice in advance and build muscle memory so people aren’t running around with their hair on fire when something bad happens. Similarly, a large part of incident management is coordination across teams — often teams that haven’t interacted previously. Creating a space for those relationships to be built outside of urgent situations allows for greater preparation and better results.
Niño: I think chaos engineering is a discipline created by people, for people. We, the humans, our attitudes, and the tradeoffs we are making under pressure are all contributors to how the system runs and behaves. In chaos engineering, we design failures, run experiments in production, analyze the results, and generate resilience strategies.
Regarding references, I think some awesome people and organizations have improved their resilience using the benefits of chaos engineering. The Gremlin team is doing a great job leading this space; the tool and the documentation are very good references for leaders concerned about resilience.
Esquivel: Once people understand that what you build you own — and by ownership, I mean company ownership — it is easy to gain support, and people will easily participate. At Backcountry, improving resiliency is championed by the VP of product, the director of engineering, and several engineering managers. In this case, the top-to-bottom support has been critical in order to make progress.
Allamaraju: People are at the heart of this discipline. After all, it is people that build and operate our systems. It is people that can listen to feedback from those systems through operational metrics. It is people that can learn from incidents to know when systems work and when they don’t.
Unfortunately, we’re still learning about these through experience, which takes time.
Rensin: People are the most important part of any system. We build and run these things for the people. People are also the biggest barrier (or accelerator). The very best way I can think to create organizational resilience is to periodically "turn off" critical humans. What I mean is "Today, Sally will not be answering email or IM for any reason. Feel free to be social but she’s off limits for the purpose of work." Why would we do that? To find all the tribal knowledge that only Sally has and get it all written down — or at least spread around. Do that with enough people randomly, and suddenly you have a much more resilient team.
Osman: I think, based on my answers above, you won’t be surprised when I say that the people aspects are critical! The best tools and processes are nothing if people don’t trust them and buy into them. In terms of references, it’s seven years old now, but I always point people to John Allspaw’s article "Fault Injection in Production", published in ACM Queue. It’s a great introduction to this concept. From the same author and year, the "Blameless PostMortems and a Just Culture" blog post is great. I’ve mentioned the importance of a good incident response process a few times; if an organization doesn’t have a process they’re happy with, I always just refer them to PagerDuty’s excellent incident response documentation. It’s one of those cases where you can just point to it and say "if you don’t do anything else right now, just do this."
InfoQ: Can you recommend your favourite tooling and practices/games within the chaos space?
Yee: One of the nice things about chaos engineering is that you don’t need a lot of additional tools. For example, you can use Linux’s native kill command to stop processes and iptables to introduce network connectivity issues. Some other tools we use at Datadog include Comcast, a tool to simulate poor network conditions, and Vegeta, an HTTP load-testing tool.
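A minimal sketch of such a no-extra-tooling experiment might look like the following; the `sleep` process and port 5432 are hypothetical stand-ins for a real service and a database dependency, not anything specific to Datadog’s setup:

```shell
#!/bin/sh
# Process-failure experiment using only built-in tools.
# Start a disposable stand-in for a service process.
sleep 300 &
target=$!

# Hypothesis: if this "service" dies abruptly, our supervision
# and alerting should notice. Kill it without warning.
kill -9 "$target"

# Confirm how it died: wait returns 128 + signal number,
# so a SIGKILL-ed process reports 137.
wait "$target" 2>/dev/null
echo "target exited with status $?"

# Network experiment (requires root; commented out for safety):
# simulate a database on port 5432 becoming unreachable,
# observe application behavior, then always clean up the rule.
#   iptables -A INPUT -p tcp --dport 5432 -j DROP
#   ...run the experiment...
#   iptables -D INPUT -p tcp --dport 5432 -j DROP
```

The point is less the commands themselves than the shape of the experiment: a stated hypothesis, a controlled fault, observation, and cleanup.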
Dickey: Mailchimp has been a Gremlin customer for about a year, and we’ve been happy with the flexibility and functionality their tooling offers. We chose to use their tool rather than build our own because we believed that our developers’ time was better spent creating amazing products for our customers than developing an in-house tool for testing. We also have some unique technical constraints (come see my conference talk to learn more!) that prevented us from using many open-source tools.
The SRE team runs game days every month, and more often as requested by other teams. We have used incident-simulation game days (sometimes called wargames or fire drills) to help our engineers get more comfortable responding to incidents.
Ross: I really like trying chaos engineering without any tools as a precursor. Basically, you perform a game day where one team of engineers is responsible for breaking the site without telling the other team what they’re going to do (or when — that can be fun). It will reveal all sorts of things: where domain knowledge exists, whether people know the process when something breaks, whether you have an alert set up for it, whether they have the right tools to investigate, and so on.
Andrus: Some of the best tooling I can recommend is really getting to know your command-line and observability tools. Be able to grep/sort/cut/uniq a set of logs to narrow down symptoms. Be comfortable with tcpdump or other network-analysis tools so you can really understand what’s happening under the covers. Understand latency distributions, aggregations, and percentiles, and know exactly what you’re measuring; these are key to understanding your system’s behavior.
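As an illustration of that kind of log triage, here is a small pipeline over a made-up access log; the log format (space-separated, status code in the fourth field) is an assumption for the example:

```shell
#!/bin/sh
# Create a small sample log, a stand-in for a real service log.
log=$(mktemp)
cat > "$log" <<'EOF'
2019-09-25T10:00:01 GET /api/users 200
2019-09-25T10:00:02 GET /api/users 500
2019-09-25T10:00:03 GET /api/cart 500
2019-09-25T10:00:04 GET /api/users 200
2019-09-25T10:00:05 GET /api/cart 503
EOF

# Narrow down symptoms: pull out the status-code column,
# then count occurrences of each code, most frequent first.
cut -d' ' -f4 "$log" | sort | uniq -c | sort -rn

rm -f "$log"
```

The same cut/sort/uniq pattern works on paths, client IPs, or any other field, which is often enough to localize a failure before reaching for heavier tooling.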
There’s also this tool called Gremlin. 😉 We recently launched a free version for those just getting started that allows you to run shut-down and CPU experiments.
Niño: For industries that still have many restrictions for deploying to the cloud, I recommend using tools such as Chaos Monkey for Spring Boot. It is a great tool that works very well for conducting experiments that involve latency, controlled exceptions, and killing active applications in on-premise architectures.
I’ve also been exploring ChaoSlingr, a new tool for doing security chaos engineering. It is built for AWS and allows you to push failures into the system in a way that allows you not only to identify security issues but also to better understand the infrastructure.
Esquivel: We have gone from built-in tools to commercial software, and our recommendation is to go commercial unless you are willing to invest a lot of time and money. Our weapons of choice against mass instability are Jenkins to kick off load testing with Gatling, and Gremlin to unleash chaos within a predefined blast radius that lets us comfortably conduct chaos experiments in both production and development environments.
Allamaraju: I don’t think there is one favorite tool or practice. Instead, I would look at the architecture of the system, the assumptions we are making, observed weaknesses through incidents, develop hypotheses to test for failures, and then figure out what tools or practices can help test hypotheses.
Rensin: At Google, we run DiRT (disaster and recovery testing) where we physically go into data centers and do terrible things to the machines that service Google’s internal systems — like unplug hardware. There’s a small team that plans and executes it so that we don’t cause customer harm, but mostly our engineers experience it as a real outage. We find a lot of things this way, and the DiRT postmortems are some of the best reading at Google.
Osman: In terms of practices, I really like planning chaos experiments during incident postmortems. When you’re going through the timeline of an incident, take note of the surprises — perhaps a database failed and no one really knows why, or maybe a spike in latency in service A caused errors that nobody anticipated — these are all great things to run chaos experiments on so the team can learn more about the behavior of the system under certain kinds of stress. Richard Cook has talked about "systems as imagined" versus "systems as they are found" and postmortems are great opportunities for teasing out some of those differences. Those differences are perfect fodder for high-value chaos experiments.
InfoQ: Can you provide a two-sentence summary of your Chaos Conf talk? Why should readers attend?
Yee: If resilience is not optional but a requirement for modern applications, then we need to change how we build software. I’ll be talking about resilience-driven development, a way to apply chaos principles early in the development process in order to guarantee more resilient applications.
Dickey: For many companies, chaos engineering means testing dependencies between services or killing container instances. My talk covers why and how to practice chaos engineering when those approaches aren’t an option, like when your main application is a monolith running on bare metal.
Lin: Who is responsible for chaos in an organization? At very large and established companies, it’s probably the SRE or DevOps team that handles this, but what if you don’t have clearly defined functions? Who else would handle chaos experiments? My talk will cover research on this topic.
Ross: We’re going to show how you can combine the contributing factors of a previous high-severity incident into a chaos experiment. From there, we’re going to try to see if that incident happens again based on all of the learnings we gathered by responding to the last one.
Andrus: A lot has developed in the space this year. We’ve seen an increase in the number of public-facing outages, a trend we fear will only get worse before it gets better. The talk will cover how companies can prepare for common, well-known outages, so that attendees walk away knowing how to build a more reliable internet.
Niño: Nowadays, there are a lot of books, tools, decks, and papers that talk about chaos engineering. It can be overwhelming for those who are just starting to explore this discipline. My talk, "Hot recipes for doing Chaos Engineering", provides instructions and experiments to try as a way for new chaos engineers to get on board.
Esquivel: I would like to address a common problem when engaging in chaos experimentation: what is a good roadmap for chaos engineering? I want to kick this off by showing eight stability patterns, talk about three ways to achieve observability, and wrap up with a real-time chaos test.
Allamaraju: I plan to talk about how to form failure hypotheses. Based on my experience at Expedia Group, I plan to share the architectural guardrails we use to drive safety, and how to make value-based hypotheses that promote it.
Rensin: I think we can (and should) apply the principles of chaos engineering to how we develop and train our teams. There’s no reason to limit these ideas to just our software or hardware. I think I can convince attendees why this makes better teams and companies.
Osman: Ana Medina and I are going to talk about ways to introduce chaos engineering to your organization. We’ll draw from practical examples at companies we’ve worked at. We’ll also describe various onramps for introducing these practices. You should definitely attend our talk if you’ve struggled with getting support for the idea in your own company, or if you’re interested in chaos engineering but not sure how to get started.
About the interviewees
Jason Yee is a technical evangelist at Datadog, where he works to inspire developers and ops engineers with the power of metrics and monitoring. Previously, he was the community manager for DevOps & Performance at O’Reilly Media and a software engineer at MongoDB.
Caroline Dickey is a site-reliability engineer at Mailchimp in Atlanta, where she builds internal tooling, works with development teams to develop SLIs and SLOs, and leads the chaos-engineering initiative including coordinating monthly game days.
Joyce Lin is a lead developer advocate with Postman. Postman is used by 7 million developers and more than 300,000 companies to access 130 million APIs every month. She frequently works with API-first organizations, and has strong opinions about do’s and don’ts when it comes to modern software practices.