At Cask we are huge proponents of test automation. We routinely run unit, integration, and performance tests to ensure the correctness and stability of our software. While these tests are great at catching many issues early on, some aspects of distributed systems are hard to capture with them, for instance the impact of HBase compaction, resource availability of clusters over time, and slow resource leaks. We realized the importance of testing the stability of our software over a longer period of time, and addressed this with a suite of long running tests. In this blog post we touch upon some aspects of the long running test framework that we developed at Cask. We believe that the ideas and approaches behind it could have benefits beyond CDAP.
We designed the long running tests to run against a specific CDAP cluster over extended periods of time. There are two aspects to the design:
- The long running test framework itself
- Test cases that run as a part of the long running test suite
The long running test framework is designed to execute test cases periodically. Each iteration of the test executes data operations on a cluster and verifies these operations. Over an extended period of time this can be used to verify the stability of the cluster. The tests can save state between iterations so that a later iteration can use the saved state for verification.
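As a rough illustration of saving state between iterations, the sketch below persists a test's state to a local properties file and reads it back on the next iteration. The class name IterationStateStore, the file format, and the eventsSent key are assumptions made for this example; they are not taken from the CDAP framework.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical helper: keeps one test's state on disk between iterations.
class IterationStateStore {
    private final Path file;

    IterationStateStore(Path file) {
        this.file = file;
    }

    // Returns the state saved by the previous iteration,
    // or empty state if no iteration has run yet.
    Properties load() {
        Properties state = new Properties();
        if (!Files.exists(file)) {
            return state;
        }
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            state.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return state;
    }

    // Overwrites the saved state so the next iteration can verify against it.
    void save(Properties state) {
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            state.store(out, "long running test state");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An iteration would call load() before verifying, then record what it queued (for example state.setProperty("eventsSent", "100")) and call save() before exiting.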
A typical test runs some data operations and waits for them to complete before verifying them. A long running test iteration, by contrast, only queues data for ingestion and does not wait for the data operations to complete. The operations executed during one iteration are verified in the next iteration of the test. This structure allows the data operations of all long running tests to be executed in parallel. This design choice helped us keep the overall test pattern simple yet scalable.
The design of a test case involves defining three stages:
- Setup: A life-cycle stage that is called once, during the first run of the long running test. This stage allows for checking cluster status, deploying applications, starting any programs, and other setup operations.
- Run verification: This stage is called at the start of every run. It retrieves the state saved by the previous run and uses it to verify the correctness of the system.
- Run operations: This stage runs after the verification stage. It injects the data to be verified in the next run and saves the required state. This stage is called only if the verification stage is successful.
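The ordering of these stages can be sketched as a small driver. This is a minimal illustration under stated assumptions, not the framework's actual code: the LongRunningTest interface, the LongRunningTestDriver class, and the method names are hypothetical, and the first-run flag and state live in memory here, whereas a real framework would persist them between runs.

```java
// Hypothetical lifecycle contract mirroring the three stages described above.
interface LongRunningTest<S> {
    void setup();                       // called once, on the first run
    void verifyRuns(S previousState);   // throws AssertionError if verification fails
    S runOperations(S previousState);   // queues new operations, returns the state to save
}

// Hypothetical driver that executes one run of a test at a time.
class LongRunningTestDriver<S> {
    private final LongRunningTest<S> test;
    private boolean firstRun = true;    // assumption: kept in memory for this sketch
    private S state;

    LongRunningTestDriver(LongRunningTest<S> test, S initialState) {
        this.test = test;
        this.state = initialState;
    }

    // Executes one periodic run; returns true if verification passed.
    boolean runOnce() {
        if (firstRun) {
            test.setup();
            firstRun = false;
        }
        try {
            test.verifyRuns(state);
        } catch (AssertionError e) {
            // Verification failed: skip the operations stage for this run.
            return false;
        }
        state = test.runOperations(state);
        return true;
    }
}
```

A scheduler would call runOnce() at a fixed interval. Because the operations stage is skipped when verification fails, the saved state is not updated in that run.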
We leveraged JUnit’s test suite and runner infrastructure to implement the test cases, with each test case extending a LongRunningTestBase that has the lifecycle methods described above. The test cases are run periodically on Bamboo, with each run executing a verification stage and an operations stage.
Sample test case:
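What follows is a minimal sketch of what such a test case might look like, not the actual CDAP code. The abstract LongRunningTestBase shown here, its method names, the FakeCounterService stand-in for a deployed application, and the IncrementTest class are all assumptions made for illustration. The test verifies in each run what the previous run queued, and never waits for its own operations to complete.

```java
// Hypothetical base class; the real LongRunningTestBase may look different.
abstract class LongRunningTestBase<S> {
    abstract void setup();
    abstract S getInitialState();
    abstract void verifyRuns(S state);
    abstract S runOperations(S state);
}

// Hypothetical in-memory stand-in for a counter kept by a deployed application.
class FakeCounterService {
    private long queued;
    private long applied;

    // Queues increments and returns immediately, without applying them.
    void incrementAsync(long n) { queued += n; }

    // Simulates the cluster processing queued work between two runs.
    void catchUp() { applied += queued; queued = 0; }

    long read() { return applied; }
}

// The saved state is the total the counter should have reached by the next run.
class IncrementTest extends LongRunningTestBase<Long> {
    private static final long BATCH_SIZE = 100;
    private final FakeCounterService service;

    IncrementTest(FakeCounterService service) { this.service = service; }

    @Override
    void setup() {
        // A real test would check cluster status and deploy/start programs here.
    }

    @Override
    Long getInitialState() { return 0L; }

    @Override
    void verifyRuns(Long expectedTotal) {
        long actual = service.read();
        if (actual != expectedTotal) {
            throw new AssertionError("expected " + expectedTotal + " but was " + actual);
        }
    }

    @Override
    Long runOperations(Long expectedTotal) {
        service.incrementAsync(BATCH_SIZE);   // queue only; do not wait
        return expectedTotal + BATCH_SIZE;    // verified in the next run
    }
}
```

If the increments queued in one run have not been applied by the time the next run starts, verifyRuns fails and the problem surfaces one iteration later.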
Testing other scenarios
This framework can be extended to test other aspects of distributed systems. For example:
- High Availability testing: The framework can be used to test high availability by bringing down services while tests are running, to verify failover strategies.
- Upgrade testing: The tests can also be used to verify a CDAP upgrade by running a certain number of iterations before and after the software upgrade.
We have been using this framework to run long running tests against CDAP, and we hope this has been an interesting overview of our approach to long running tests in distributed systems. We are always trying to improve our efforts, so if you have comments or suggestions, please drop us a note and let us know what you think.