CYCLADES: Conflict-free asynchronous machine learning Pan et al. NIPS 2016
“Conflict-free,” the magic words that mean we can process things concurrently or in parallel at full speed, with no need for coordination. Today’s paper introduces Cyclades, a system for speeding up machine learning on a single NUMA node. In the evaluation, the authors used NUMA nodes with 18 CPUs and 1TB of memory. Extending Cyclades to work across NUMA nodes, or even in a distributed setting using a parameter server, is reserved for future investigation.
  The overall hierarchy of ML speed and scale probably looks something like this:
  
       
  • Clusters of GPUs (with CPU support) for both speed and scale
  • Multiple GPUs on a single node (with CPU support) for speed; clusters of CPUs for scale
  • Multi-core CPUs on a single node
  • Single-core (single-thread) systems
Remember not to underestimate what can be achieved on a single thread. Because of their accessibility and ease of programming, systems that work on a single box can have great utility. Cyclades efficiently utilises multiple cores through a graph partitioning step that allocates connected-component subgraphs to cores, eliminating the need for cross-core coordination. One of the compelling things that is unique about Cyclades is that, unlike other systems we’ve seen (Asynchronous Complex Analytics, Petuum) that use asynchronous processing to speed up ML, Cyclades maintains serial equivalence. That is, it guarantees to give the same results as a serial implementation. Without serial equivalence, other systems rely on analyses showing that the deviation from the serial execution is not significant (stays within some bound).
   Since it returns exactly the same output as a serial implementation, any algorithm parallelized by our framework inherits the correctness proof of the serial counterpart without modifications. Additionally, if a particular heuristic serial algorithm is popular, but does not have a rigorous analysis, such as backpropagation on neural networks, Cyclades still guarantees that its execution will return a serially equivalent output.
Cyclades works for a large family of algorithms based on stochastic updates. This includes logistic regression, least squares, support vector machines, word embeddings, stochastic gradient descent, matrix completion and factorization, and more.
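To make the “stochastic updates” family concrete, here’s a minimal sketch (my own code and naming, not the paper’s) of sparse logistic-regression SGD. The key property is that each update reads and writes only the model coordinates touched by one example — exactly the sparsity structure Cyclades exploits:

```python
import math
import random

def sgd_logistic_sparse(examples, n_features, lr=0.1, epochs=5, seed=0):
    """Serial stochastic updates: each step touches only the model
    coordinates present in one example's sparse feature vector."""
    rng = random.Random(seed)
    w = [0.0] * n_features
    order = list(range(len(examples)))
    for _ in range(epochs):
        rng.shuffle(order)                 # random update order, as in SGD
        for i in order:
            x, y = examples[i]             # x: dict {feature_index: value}
            margin = sum(w[j] * v for j, v in x.items())
            g = 1.0 / (1.0 + math.exp(-margin)) - y    # dloss/dmargin
            for j, v in x.items():         # write only touched coordinates
                w[j] -= lr * g * v
    return w

# Two examples touching disjoint coordinates: their updates commute, so
# a conflict-free scheduler could run them on different cores unchanged.
w = sgd_logistic_sparse([({0: 1.0}, 1), ({1: 1.0}, 0)], n_features=2)
```

Updates that touch disjoint coordinates commute, which is what makes conflict-free parallel scheduling possible in the first place.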
  At the core of Cyclades is a clever partitioning strategy. Consider a graph with two types of vertices (i.e., a bipartite graph): update functions and model variables. We create an edge from an update function to a variable if that function reads or writes it.
   

[Figure 1: the bipartite update-variable graph]

   From this we can create a conflict graph (Cyclades never actually fully materializes such a graph, but it’s a useful concept). The conflict graph contains only update-function vertices, and links two vertices with an edge if they share at least one variable in the update graph.
   

[Figure 2: the conflict graph]
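A sketch of how such a conflict graph could be derived from the update-variable relation (illustrative only — Cyclades avoids fully materializing this graph, and the function name is my own):

```python
from collections import defaultdict

def build_conflict_edges(updates):
    """updates[u] lists the variable indices update u reads/writes
    (the bipartite update-variable graph). Two updates conflict --
    i.e., get an edge in the conflict graph -- iff they share a variable."""
    touched_by = defaultdict(list)          # variable -> updates touching it
    for u, variables in enumerate(updates):
        for v in variables:
            touched_by[v].append(u)
    edges = set()
    for us in touched_by.values():
        for i in range(len(us)):            # clique among all sharers
            for j in range(i + 1, len(us)):
                edges.add((us[i], us[j]))
    return edges

# updates 0 and 1 both touch variable 2; update 2 conflicts with nothing
edges = build_conflict_edges([[0, 2], [1, 2], [3, 4]])
```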

  The key now is to partition the conflict graph into connected components, and allocate those connected components to cores.
   

[Figure 3: connected components of the conflict graph allocated to cores]
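The partition-and-allocate step can be sketched with a small union-find (again my own illustrative code, not Cyclades’ implementation):

```python
def allocate_conflict_groups(n_updates, conflict_edges, n_cores):
    """Union-find over the conflict graph: updates in one connected
    component stay together; distinct components share no variable,
    so each core can process its components without locks."""
    parent = list(range(n_updates))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for a, b in conflict_edges:
        parent[find(a)] = find(b)

    groups = {}
    for u in range(n_updates):
        groups.setdefault(find(u), []).append(u)

    # greedy: hand each component (largest first) to the least-loaded core
    cores = [[] for _ in range(n_cores)]
    for comp in sorted(groups.values(), key=len, reverse=True):
        min(cores, key=len).extend(comp)
    return cores

# updates 0-1-2 form one chain of conflicts; updates 3 and 4 are free
cores = allocate_conflict_groups(5, {(0, 1), (1, 2)}, n_cores=2)
```

The greedy largest-first allocation is just one plausible load-balancing choice; the guarantee that matters is that no variable is shared across cores.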

  This relies on a result established in a 2016 Electronic Journal of Combinatorics paper, “The phase transition in site percolation on pseudo-random graphs,” by Krivelevich. No, I’m not brave enough to try and cover a Journal of Combinatorics paper on this blog! The result is as follows:
   Let G be a graph on n vertices, with maximum vertex degree Δ. Let us sample each vertex independently with probability p = (1 − ε)/Δ and define as G′ the induced subgraph on the sampled vertices. Then, the largest connected component of G′ has size at most (4/ε²) log n, with high probability.
   This is pithily summarized by the authors as ‘frugal sampling shatters conflicts.’ It tells us that by careful sampling of the vertices in the conflict graph, we can induce a subgraph whose connected components each have size at most O((1/ε²) log n), giving us many small conflict groups to distribute across cores.
   (There’s something of importance in the ‘induced subgraph’ phrase that I feel I don’t fully understand. Because what really matters surely is the size of the largest connected component in G (not G′)? (We do the actual computation on G, right?). My interpretation is that by sampling we can create a subgraph G′ from which we can infer connected components that with high probability match the connected components of G (and have certain maximum size)… But then why would the number of components depend on our sampling frequency? The results indicate that the process clearly works, so we can press on, but if anyone can put me straight here I’d appreciate it!)
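Whatever the right reading, the shattering effect itself is easy to simulate. A toy experiment (my own code, with a random graph of average degree delta standing in for a real conflict graph, whereas the theorem is stated for maximum degree): sample vertices with p = (1 − ε)/Δ and measure the largest induced component.

```python
import random

def largest_component_after_sampling(n, delta, eps, seed=0):
    """Build a random graph with ~delta average degree, keep each
    vertex independently with p = (1 - eps)/delta, and return the
    size of the largest connected component of the induced subgraph."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    edges = 0
    while edges < n * delta // 2:           # average degree ~ delta
        a, b = rng.randrange(n), rng.randrange(n)
        if a != b and b not in adj[a]:
            adj[a].add(b)
            adj[b].add(a)
            edges += 1
    keep = {v for v in range(n) if rng.random() < (1 - eps) / delta}
    seen, best = set(), 0
    for v in keep:
        if v in seen:
            continue
        stack, size = [v], 0                # DFS restricted to kept vertices
        seen.add(v)
        while stack:
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w in keep and w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best

# n = 2000 vertices, average degree 20: induced components stay tiny
biggest = largest_component_after_sampling(2000, 20, eps=0.1)
```

With these numbers roughly 90 of the 2000 vertices survive the sampling, and each kept vertex expects fewer than one kept neighbour, so the induced subgraph is subcritical and its components stay tiny relative to n, in line with the theorem.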
  Moreover, since there are no conflicts between different conflict-groups, the processing of updates per any single group will never interact with the variables corresponding to the updates of another conflict group.
   Cyclades samples in batches of size B = (1 − ε)n/Δ, and identifies conflict groups for each batch using a connected components algorithm. The appendix shows that this can be done efficiently in time O(num_edges · log² n / P), where P is the number of processor cores available.
  We can put all the pieces together for the complete Cyclades algorithm:
   

[Figure 4: the complete Cyclades algorithm]
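An end-to-end toy sketch of the algorithm’s structure (my own names and a trivial update function; Python threads stand in for cores, so this illustrates the coordination pattern rather than real parallel speed-up, given the GIL):

```python
import random
import threading
from collections import defaultdict

def cyclades_epoch(updates, x, apply_update, batch_size=2, n_cores=2, seed=0):
    """One epoch in the Cyclades style: shuffle updates, take batches,
    group each batch's updates into connected components of the conflict
    graph (updates sharing a variable), then let each worker run whole
    groups -- groups touch disjoint variables, so no locks are needed."""
    rng = random.Random(seed)
    order = list(range(len(updates)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        parent = {u: u for u in batch}          # union-find over the batch

        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]   # path halving
                a = parent[a]
            return a

        owner = {}                              # variable -> first update seen
        for u in batch:
            for v in updates[u]:
                if v in owner:
                    parent[find(u)] = find(owner[v])   # shared variable: merge
                else:
                    owner[v] = u
        groups = defaultdict(list)
        for u in batch:
            groups[find(u)].append(u)

        work = [[] for _ in range(n_cores)]     # round-robin groups to workers
        for i, g in enumerate(groups.values()):
            work[i % n_cores].extend(g)

        def run(my_updates):
            for u in my_updates:                # lock-free: disjoint variables
                apply_update(x, updates[u])

        threads = [threading.Thread(target=run, args=(w,)) for w in work]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

def bump(x, variables):                         # toy update: +1 per touched var
    for v in variables:
        x[v] += 1.0

updates = [[0], [1], [0, 1], [2]]               # update -> variables it touches
x = [0.0, 0.0, 0.0]
cyclades_epoch(updates, x, bump)
```

Every update is applied exactly once, and updates that could conflict are confined to the same worker, which is the serial-equivalence-preserving structure described above.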

   The inner loop is parallelized and can be performed asynchronously with no need for memory locks, as all the cores access non-overlapping subsets of the input x. This gives good cache coherence, and since each core potentially accesses the same coordinates multiple times, good cache locality.
  Observe that Cyclades bypasses the need to establish convergence guarantees for the parallel algorithm.
  And we have the main theorem:
   

[Figure 5: the main theorem]

  Does it work in practice?

  The authors implemented Cyclades in C++ and tested it against HogWild! on an equivalent number of cores, and also against a single-threaded implementation.
  Cyclades is initially slower but converges faster:
   

[Figure 6: convergence comparison with HogWild!]

  It achieves a relative speed-up of up to 5x over HogWild! when both systems are given the same number of cores.
   

[Figure 7: relative speed-up over HogWild!]

  When we decrease the step-size after each epoch, Cyclades shows even stronger relative performance:
   

[Figure 8: performance with diminishing step-sizes]

  It would be interesting to see where else the Krivelevich connected component result can be applied; it seems that many graph computations could potentially benefit.


