CYCLADES: Conflict-free asynchronous machine learning Pan et al. NIPS 2016
“Conflict-free,” the magic words that mean we can process things concurrently or in parallel at full speed, with no need for coordination. Today’s paper introduces Cyclades, a system for speeding up machine learning on a single NUMA node. In the evaluation, the authors used NUMA nodes with 18 CPUs and 1TB memory. Extending Cyclades to work across NUMA nodes, or even in a distributed setting using a parameter server, is reserved for future investigation.
The overall hierarchy of ML speed and scale probably looks something like this:
Clusters of GPUs [w. CPU support] – Speed and Scale
Multi-GPUs on a single node [w. CPU support] for SPEED; clusters of CPUs for SCALE
Multi-core CPUs on a single node
Single core (thread) systems
Remember not to underestimate what can be achieved on a single thread. Because of their accessibility and ease of programming, systems that work on a single box can have great utility. Cyclades efficiently utilises multiple cores through a graph partitioning step that allocates connected-component subgraphs to cores, eliminating the need for cross-core coordination. One of the compelling things that is unique about Cyclades is that, unlike other systems we’ve seen (Asynchronous Complex Analytics, Petuum) that use asynchronous processing to speed up ML, Cyclades maintains serial equivalence. That is, it is guaranteed to give the same results as a serial implementation. Without serial equivalence, other systems rely on analyses showing that the deviation from serial execution is not significant (stays within some bound).
Since it returns exactly the same output as a serial implementation, any algorithm parallelized by our framework inherits the correctness proof of the serial counterpart without modifications. Additionally, if a particular heuristic serial algorithm is popular, but does not have a rigorous analysis, such as backpropagation on neural networks, Cyclades still guarantees that its execution will return a serially equivalent output.
Cyclades works for a large family of algorithms based on stochastic updates. This includes logistic regression, least squares, support vector machines, word embeddings, stochastic gradient descent, matrix completion and factorization, and more.
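The defining property of this family is that each stochastic update reads and writes only a small, sparse subset of the model variables. A minimal sketch (names and the logistic-regression choice are illustrative, not taken from the paper) of one such update:

```python
import math

def stochastic_update(w, sample, lr=0.1):
    """One sparse logistic-regression SGD step.

    The update touches only the model coordinates in the sample's
    support set -- the sparsity that Cyclades exploits.
    """
    idx, vals, label = sample  # sparse features: indices, values, +/-1 label
    margin = label * sum(w[i] * v for i, v in zip(idx, vals))
    g = -label / (1.0 + math.exp(margin))  # d(loss)/d(margin)
    for i, v in zip(idx, vals):
        w[i] -= lr * g * v                 # writes only coordinates in idx

w = [0.0] * 5
stochastic_update(w, ([0, 3], [1.0, 2.0], 1))
# only w[0] and w[3] change; w[1], w[2], w[4] stay 0.0
```

Two updates whose index sets do not overlap can run in parallel with no coordination, which is exactly the structure the partitioning below exploits.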
At the core of Cyclades is a clever partitioning strategy. Consider a bipartite graph with two types of vertices: update functions and model variables. We create an edge from an update function to a variable if that function reads or writes it.
From this we can create a conflict graph (Cyclades never actually fully materializes such a graph, but it’s a useful conceptual device). The conflict graph contains only update-function vertices, and links two of them with an edge if they share at least one variable in the update graph.
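To make the two graphs concrete, here is a toy sketch (the update names and variable sets are hypothetical): each update is represented by the set of variables it reads or writes, i.e. its edges in the bipartite update graph, and a conflict edge joins any two updates whose sets intersect.

```python
from itertools import combinations

# Bipartite update graph, as update -> set of variables it reads/writes.
updates = {
    "u1": {"x1", "x2"},
    "u2": {"x2", "x3"},
    "u3": {"x4"},
}

# Conflict graph: an edge between two updates that share a variable.
# (Materializing it like this is quadratic in the number of updates,
# which is why Cyclades avoids building it explicitly.)
conflicts = [
    (a, b) for a, b in combinations(updates, 2)
    if updates[a] & updates[b]
]
# → [("u1", "u2")]  -- only u1 and u2 share a variable (x2)
```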
The key now is to partition the conflict graph into connected components, and allocate those connected components to cores.
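The components can be found without ever materializing the conflict graph's edges: a union-find over the updates, merging two updates whenever they touch the same variable, yields the same connected components. A minimal sketch of that idea (the helper names are mine, not the paper's):

```python
def connected_components(updates):
    """Connected components of the (implicit) conflict graph.

    updates: dict mapping update id -> set of variables it reads/writes.
    Returns a list of sets of update ids, one per component.
    """
    parent = {u: u for u in updates}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    owner = {}  # variable -> representative of the component that owns it
    for u, variables in updates.items():
        for v in variables:
            if v in owner:
                parent[find(u)] = find(owner[v])  # shared variable => merge
            else:
                owner[v] = u

    components = {}
    for u in updates:
        components.setdefault(find(u), set()).add(u)
    return list(components.values())

updates = {"u1": {"x1", "x2"}, "u2": {"x2", "x3"}, "u3": {"x4"}}
# connected_components(updates) → [{"u1", "u2"}, {"u3"}]
```

Each resulting component is a self-contained unit of work: all conflicts are internal to it, so distinct components can be assigned to distinct cores and processed with no cross-core coordination.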