We’re excited to announce support for Amazon’s X1 instances. Now in Domino, you can do data science on machines with 128 cores and 2TB of RAM, with one click.
The X1 hardware tier is available in our cloud-hosted environment and can be made available to customers using Domino in their own VPCs.
Needless to say, with access to this unprecedented level of compute power, we had some fun. Read on for some of our reflections about doing data science with X1 instances.
Processing Power: Working with 128 Cores Under the Hood
Access to 128 cores on a single machine was nearly unheard of even just a few years ago, much less on a platform which could trivially be rented by the minute. Core counts at this scale were previously only the domain of distributed and HPC systems.
Distributing a machine learning workload across 128 cores is a non-trivial problem, but two common techniques are (1) parallelizing the machine learning algorithm itself and (2) parallelizing the fitting of the algorithm across multiple possible configurations (i.e., grid search).
Parallelizing grid search is fairly straightforward, and packages like scikit-learn and caret offer great solutions for this. Parallelization of a machine learning algorithm, however, is a challenging problem. There are a number of natural limitations to this approach, not least of which are the large-scale matrix operations at the core of many machine learning algorithms. These have natural bounds on the amount of parallelism that can be beneficial.
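To make the "fairly straightforward" case concrete, here is a minimal sketch of a parallel grid search using only Python's standard library. In scikit-learn the same idea is exposed through the n_jobs parameter of GridSearchCV. The toy_score function and the hyperparameter values below are hypothetical stand-ins for "fit one model and return its validation score"; they are not taken from the analysis in this post.

```python
import itertools
import os
from concurrent.futures import ProcessPoolExecutor


def toy_score(params):
    """Hypothetical stand-in for fitting one model and scoring it.

    The score is highest at learn_rate=0.1, max_depth=5.
    """
    learn_rate, max_depth = params
    return -((learn_rate - 0.1) ** 2) - 0.01 * (max_depth - 5) ** 2


def build_grid(learn_rates, max_depths):
    """Return every combination of the two hyperparameters."""
    return list(itertools.product(learn_rates, max_depths))


def grid_search(grid, workers=None):
    """Score each configuration in its own worker process and return the best."""
    workers = workers or os.cpu_count()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Configurations are independent, so no coordination is needed.
        scores = list(pool.map(toy_score, grid))
    best_score, best_params = max(zip(scores, grid))
    return best_params, best_score


if __name__ == "__main__":
    grid = build_grid([0.01, 0.1, 0.3], [3, 5, 7])
    print(grid_search(grid))
```

Because each configuration is fitted independently, this kind of search can only keep as many cores busy as there are configurations left to evaluate, unless each individual fit is itself parallel.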
To explore these limits, I undertook a short and incomplete analysis of two modern machine learning toolkits, H2O and XGBoost, on the task of fitting a GBM with 1,000 trees to the canonical airline dataset. I did not validate the goodness of fit of the models generated. In this case, I’m simply interested in seeing how much parallelism these two packages are able to leverage when given a large number of cores.
Using H2O’s R package version 126.96.36.199 and training on 100k rows of the airline dataset, the system was able to train a single model with 1,000 trees in 813 seconds. Full theoretical processor utilization would be 12,800%, that is, 100% utilization for each core. During training, processor utilization peaked at roughly 5,600%, implying 56 cores were in use.
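The utilization arithmetic above can be written out as a short sketch. The figures are the ones quoted in the text; the function and variable names are only illustrative, and the conversion assumes the usual convention that 100% equals one fully busy core.

```python
TOTAL_CORES = 128          # cores on the X1 instance, as quoted above
PEAK_CPU_PERCENT = 5600    # peak utilization observed during training


def cores_in_use(cpu_percent):
    """Convert a CPU percentage (100% per core) into busy-core equivalents."""
    return cpu_percent / 100


max_percent = TOTAL_CORES * 100            # theoretical ceiling for the machine
busy = cores_in_use(PEAK_CPU_PERCENT)      # equivalent number of saturated cores
share = busy / TOTAL_CORES                 # fraction of the machine in use

print(f"ceiling: {max_percent}%")
print(f"busy cores: {busy:.0f}")
print(f"share of machine: {share:.1%}")
```

In other words, at its peak the training run was using a little under 44% of the available cores.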
Given the nature of the GBM algorithm, this limitation is understandable. There is an explicit limit on the amount of parallelism possible for training as determined by the shape of the input to the algorithm. It is also interesting to note that while peak memory usage of 46GB is high for GBM, it was still a very small percentage of total available RAM on the X1. Although H2O’s GBM algorithm provides excellent performance, it was not able to harness most of the processing power and memory available with an X1 instance.
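One general way to reason about this kind of ceiling is Amdahl's law, which bounds the speedup from extra cores by the share of the work that can actually run in parallel. The sketch below is only an illustration of that rule of thumb. The parallel fractions are hypothetical; I did not measure the serial fraction of H2O's GBM, and the law says nothing specific about how H2O is implemented.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical parallel fractions; none of these were measured in this analysis.
for p in (0.50, 0.90, 0.99):
    print(
        f"p={p:.2f}: 56 cores -> {amdahl_speedup(p, 56):.1f}x, "
        f"128 cores -> {amdahl_speedup(p, 128):.1f}x"
    )
```

Under this model, even a small serial portion means that going from 56 to 128 cores buys much less than the core count alone would suggest.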
The X1 instance type and H2O’s Grid tools show their value when fitting multiple models to search a large hyperparameter space. Running H2O’s Grid Search example, the package was able to utilize roughly 35 cores.