CERN runs OpenStack Ironic to provision all new hardware deliveries and to serve on-demand requests for bare-metal instances. It has already replaced most of the workflows and tools used to manage the lifecycle of physical nodes, and we continue to work with the upstream community to improve pre-production burn-in, up-front performance validation and the integration of retirement workflows.
Over the last two years the service has grown from 0 to ~3100 physical nodes. The goal is to manage all the physical resources in the data centre (~15k physical nodes) with Ironic.
Ironic integrates tightly with OpenStack Nova. For Nova there is very little distinction between a virtual and a physical instance: the only configuration change required is setting [DEFAULT]compute_driver=ironic.IronicDriver. However, at scale there are significant differences.
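For reference, a minimal nova.conf fragment for the compute node running the Ironic driver (only the driver line comes from this post; a real deployment also fills in the [ironic] section with API endpoint and credential options):

```ini
[DEFAULT]
compute_driver = ironic.IronicDriver
```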
In this blog I will describe how CERN configures Nova to work with Ironic and what are the challenges that we face.
The CERN Cloud Infrastructure uses Nova Cells to manage and scale the infrastructure. Currently we have ~80 cells, and Ironic nodes are managed by one of them (we call it the “Ironic Cell”). The “Ironic Cell” has a control plane (running nova-conductor and RabbitMQ) and a separate node that only runs nova-compute with the Ironic driver. This way we keep the same architecture across the whole deployment, which is easier to maintain.
Having only one nova-compute handling all requests to Ironic is a fault-tolerance issue. We always planned to have more compute nodes, but an Ironic resource can only be managed by one nova-compute at a time. The Ironic resources can, however, be sharded between different nova-computes: a hash ring ensures that each Ironic resource is mapped to exactly one nova-compute. Still, an already-provisioned Ironic instance can't be moved to a different compute node, so in case of a nova-compute failure the user can't perform any API operation on their instances.
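The sharding mechanism can be sketched as a consistent hash ring. This is a minimal illustration, not nova's actual implementation; all names below are hypothetical:

```python
# Hedged sketch of how a hash ring can map Ironic nodes to
# nova-compute services; nova's real implementation differs in
# detail, and all names below are illustrative.
import bisect
import hashlib


class HashRing:
    def __init__(self, hosts, replicas=32):
        # Each host gets several points on the ring to smooth the split.
        self.ring = {}
        for host in hosts:
            for r in range(replicas):
                self.ring[self._hash(f"{host}-{r}")] = host
        self.sorted_keys = sorted(self.ring)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_host(self, node_uuid):
        # A node UUID always lands on exactly one compute host.
        idx = bisect.bisect(self.sorted_keys, self._hash(node_uuid))
        return self.ring[self.sorted_keys[idx % len(self.sorted_keys)]]


ring = HashRing(["compute-1", "compute-2", "compute-3"])
# The mapping is deterministic: the same node always maps to the same host.
assert ring.get_host("some-node-uuid") == ring.get_host("some-node-uuid")
```

When a compute is added or removed, only the nodes that hashed near its points move to another host, which is why the hash ring is a good fit for sharding.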
Two years ago we decided to keep only one nova-compute node for Ironic. At that time we had only a few nodes available and the shard functionality had only recently been introduced.
During the last two years a lot has changed. We moved from the Newton to the Rocky release, the Placement service was introduced, and the number of nodes managed by Ironic increased significantly. With all these changes we observed that the running time of the periodic task in nova-compute that reports the available resources increased dramatically. This is partly due to the growing number of nodes that nova-compute needs to manage and report, but also to the more complex structures that are now used to express the resources.
The number of Placement requests from nova-compute is directly related to the update_available_resource periodic task. In this blog post I will use the number of Placement requests to show the cycle time of this periodic task and to evaluate its impact on the load on Placement.
For the ~3100 nodes managed by Ironic the update_available_resource periodic task in nova-compute can take hours to cycle through all the resources.
Fig. 2 shows the number of Placement requests from the update_available_resource periodic task in the “Ironic Cell”. This graph gives the illusion that the periodic task is always running. By default it runs every 60 seconds, and while it is running, all operations that require action from nova-compute (for example, creating or deleting an instance) are queued. These operations are only handled between cycles, which means a user request can wait hours before nova-compute executes it!
The configuration option [DEFAULT]update_resources_interval was introduced a few releases ago and allows setting the interval between resource updates. By default it is set to 0, which means the task runs at the default periodic task interval (60 seconds). To mitigate the problem of user requests being queued, we set this value to 24 hours: the compute resource updates for the “Ironic Cell” now only happen during the night. However, we need to send a SIGHUP to nova-compute every time a new set of resources is added to Ironic.
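The corresponding nova.conf fragment would look like this (the interval is expressed in seconds):

```ini
[DEFAULT]
# Run the resource-update cycle once every 24 hours instead of at the
# default 60-second periodic interval, so user requests are not queued
# behind it during the day.
update_resources_interval = 86400
```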
The next graphs in this blog post have [DEFAULT]update_resources_interval set to 3 hours for better readability. In Fig. 2 we can now clearly visualize the cycle taken to update the resources.
Since update_available_resource takes so long to update all the resources, we started to investigate what was consuming that much time. This work is based on the Rocky release that we run in production (18.0.2). Most of the problems found are already addressed in Stein or have work in progress. However, I believe this information is still useful for operators deploying Ironic with Nova Rocky or an earlier release, and it is interesting to understand how the Nova integration with Ironic behaves at scale.
In Nova, scalability issues are handled by introducing new cells. Unfortunately, we can't take the same approach with Ironic resources: an Ironic setup can only be mapped to a single Nova cell, otherwise the same resources would be created in different cells. Running multiple Ironic setups and mapping them to different Nova cells would be a management nightmare.
The preferred option is to introduce more nova-computes in the “Ironic Cell” to shard the deployment. However, since the cycle time scales linearly with the number of nodes, we would need several new nova-computes to get a reasonable update_available_resource cycle.
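A back-of-the-envelope sketch of that linear scaling. The 2-hour cycle time used below is an illustrative assumption in the range we observed, not a precise measurement:

```python
# Sketch of the linear scaling: with an even shard, the per-host cycle
# time divides by the number of nova-computes. 120 minutes for ~3100
# nodes is an illustrative assumption, not an exact figure.
def cycle_minutes(total_nodes, minutes_per_node, n_computes):
    return total_nodes / n_computes * minutes_per_node


per_node = 120 / 3100  # assumed per-node cost in minutes

print(cycle_minutes(3100, per_node, 1))  # one compute: ~120 minutes
print(cycle_minutes(3100, per_node, 4))  # four computes: ~30 minutes each
```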
The first thing we changed was the configuration option [compute]resource_provider_association_refresh, which sets the interval at which nova-compute refreshes its cache of the compute node resource providers' inventories, aggregates and traits. We set it to a very high value because it's not possible to disable it in Rocky; that possibility was introduced in Stein with 11a5fcbb6a – “Allow resource_provider_association_refresh=0”. In Fig. 3 we can observe that the number of Placement requests was reduced significantly.
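In nova.conf this looks as follows:

```ini
[compute]
# Rocky cannot disable the cache refresh outright, so we use a very
# large interval; from Stein onwards, 0 disables it.
resource_provider_association_refresh = 999999999
```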
All the results that are shown in this blog post use the patch 8c797450cb – “Perf: Use dicts for ProviderTree roots” merged for the Train release. Without this patch the update_available_resource cycle would have taken several orders of magnitude more.
We also “simplified” the scheduler.report.update_from_provider_tree function to only set the inventory for the specific compute resource being analysed. Let's call it the “CERN patch”:

```python
pd = new_tree.data(compute_uuid)
with catch_all(pd.uuid):
    self._set_inventory_for_provider(
        context, pd.uuid, pd.inventory)
```
With this patch we observed a reduction in the update_available_resource periodic task cycle from ~1h:50m to ~50 minutes (Fig. 5).
Diving into the code we also saw that the Ironic driver sets requires_allocation_refresh=True. This was required pre-Pike to correct allocation records before the flavour migration. In Rocky, if the operator only uses resource classes for Ironic resources, this is no longer required; in fact, it is removed in the Stein release with 6a68f9140d – “remove virt driver requires_allocation_refresh”.
Setting requires_allocation_refresh=False (this is not a configuration option but an Ironic driver variable) has very little impact on the update_available_resource cycle time, but the number of Placement requests is reduced as expected.
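To make the distinction concrete, here is an illustrative sketch (not nova's actual code) of how a driver-level class attribute, unlike a nova.conf option, gates extra per-node Placement calls and can only be changed by patching the driver source:

```python
# Illustrative sketch (not nova's actual code): a class attribute on
# the driver gates extra Placement calls per node.
class PatchedIronicDriver:
    # Hard-coded to True in the Rocky Ironic driver; our patch flips it
    # to False since we only use resource classes.
    requires_allocation_refresh = False


def report_node(driver, requests):
    if driver.requires_allocation_refresh:
        # Extra allocation-correction call per node when the flag is set.
        requests.append("GET /allocations/<consumer>")
    requests.append("PUT /resource_providers/<rp>/inventories")


requests = []
report_node(PatchedIronicDriver(), requests)
# With the flag off, only the inventory update is issued.
assert requests == ["PUT /resource_providers/<rp>/inventories"]
```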
Another issue affecting our deployment was the double resource update in compute.resource_tracker._update_available_resource. Again, this is fixed in Stein and backported to Rocky with c9b74bcfa09 – “Update resources once in update_available_resource”.
As you might expect, removing the _update duplication significantly reduces the update_available_resource cycle time: it went from ~50 minutes to ~25 minutes.
But the biggest bottleneck we found is the provider_tree “deepcopy” required in scheduler.report.get_provider_tree_and_ensure_root. When the provider tree grows to a significant number of Ironic resources, this operation can take up to 0.5 seconds per resource. With ~3100 nodes in Ironic, most of the time in the update_available_resource periodic task is spent performing the provider_tree deepcopy. We need to understand whether a deepcopy of the entire provider tree structure is really required, considering that with the Ironic driver it can contain thousands of resources.
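A minimal sketch of why this hurts at scale. The tree layout below is illustrative, not nova's exact schema:

```python
# Minimal sketch of why the deepcopy in
# scheduler.report.get_provider_tree_and_ensure_root hurts at scale:
# each update copies every nested structure in the whole tree.
import copy
import uuid

# One root provider per Ironic node; the structure is illustrative.
tree = {
    str(uuid.uuid4()): {
        "inventory": {"CUSTOM_BAREMETAL": {"total": 1, "reserved": 0}},
        "aggregates": set(),
        "traits": {"CUSTOM_SSD"},
    }
    for _ in range(3100)
}

snapshot = copy.deepcopy(tree)  # what the report client does per update

# The copy is fully independent of the original...
assert snapshot == tree and snapshot is not tree
# ...so mutating it leaves the cached tree untouched.
first = next(iter(snapshot))
snapshot[first]["inventory"]["CUSTOM_BAREMETAL"]["total"] = 2
assert tree[first]["inventory"]["CUSTOM_BAREMETAL"]["total"] == 1
```

The isolation is exactly what deepcopy buys, but paying for it once per compute-node update across thousands of providers is what dominates our cycle time.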
Something I haven't discussed yet is the evolution in the number of Placement requests after nova-compute is restarted (the “trapezium” shape visible in all the previous graphs). When nova-compute restarts, it rebuilds the provider tree cache on the node. During this first cycle there are many more Placement requests than in subsequent cycles (where the cache refresh is effectively disabled by resource_provider_association_refresh=999999999), and as the size of the cached structure grows, the rate of requests to Placement decreases because each “deepcopy” takes more time to finish.
This ended up being a very long article. Let's wrap up…
When using the Ironic driver, the update_available_resource periodic task reports all the resources managed by Ironic. Considering the number of nodes, this can take a significant amount of time, and while the task is running nova-compute will not perform any other operations. Can we parallelize this?
Obviously, having only one nova-compute does not provide a fault-tolerant solution. Having multiple compute nodes is the way to go! However, we haven't yet seen reports from any deployment using sharding at scale.
With changes to the default configuration options, backports and some new code, we reduced the update_available_resource periodic task cycle from several hours to ~25 minutes. Since the time scales linearly with the number of Ironic resources, and assuming the resources are evenly balanced between the available nova-computes when sharding, we will need several nova-computes to reach a reasonable update_available_resource cycle time.
Most of the bottlenecks that we found are already fixed in the Stein release.
The provider_tree “deepcopy” should be re-evaluated: when using the Ironic driver the provider tree can contain thousands of resources, which makes this a costly operation.
Considering all the improvements in the Stein release, we will finish the Nova upgrade and only then consider nova-compute sharding again. I will report our experience in a future blog post.