Zero Cost Abstractions
If you read my blog, you can probably guess that I like Rust. I like Rust for a variety of reasons. Its error model, for example. But today I want to highlight a different reason: zero-cost abstractions.
  Recently I had the opportunity to do some optimisation work on Claxon, my FLAC decoder. In the process I found a piece of code which in my opinion demonstrates the power of Rust very well, and I would like to share that with you here:
    for i in 12..buffer.len() {
        let prediction = coefficients.iter()
                                     .zip(&buffer[i - 12..i])
                                     .map(|(&c, &s)| c * s as i64)
                                     .sum::<i64>() >> qlp_shift;
        let delta = buffer[i];
        buffer[i] = prediction as i32 + delta;
    }
For context, the following variables are in scope:
    buffer: &mut [i32];      // A mutable array slice.
    coefficients: [i64; 12]; // A fixed-size array of 12 elements.
    qlp_shift: i16;
The snippet is part of a function that restores sample values from residues. This is something that happens a lot during decoding, and this particular loop makes up roughly 20% of the total decoding time. It had better be efficient.
  Ingredients

  The snippet computes a fixed-point arithmetic inner product between the coefficients and a window that slides over the buffer. This value is the prediction for a sample. After adding the residue to the prediction, the window is advanced. An inner product can be neatly expressed by zipping the coefficients with the window, mapping multiplication over the pairs, and then taking the sum. I would call this an abstraction: it moves the emphasis away from how the value is computed, focusing on a declarative specification instead.
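  To make that contrast concrete, here is what the same computation might look like written out by hand, with explicit indexing instead of iterator combinators. This is an illustrative sketch of my own, not code taken from Claxon:

    // Illustrative sketch: the same fixed-point prediction written out by
    // hand, without iterator combinators.
    for i in 12..buffer.len() {
        let mut acc: i64 = 0;
        for j in 0..12 {
            acc += coefficients[j] * buffer[i - 12 + j] as i64;
        }
        let prediction = acc >> qlp_shift;
        let delta = buffer[i];
        buffer[i] = prediction as i32 + delta;
    }

  It computes the same values, but now the reader has to verify the index arithmetic to see that an inner product is being taken.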
  The snippet is short and clear — in my opinion — but there is quite a bit going on behind the scenes. Let’s break it down.
  
       
  • 12..buffer.len() constructs a Range structure with a lower and an upper bound. Iterating over it with a for loop implicitly creates an iterator; in the case of Range, that iterator is the structure itself.
  • coefficients.iter() constructs a slice iterator, and the call to zip implicitly constructs an iterator for the buffer slice as well.
  • zip and map both wrap their input iterators in a new iterator structure.
  • A closure is passed to map. In this case it does not capture anything from its environment.
  • sum repeatedly calls next() on its input iterator, pattern matches on the result, and adds up the values until the iterator is exhausted (a rough desugaring sketch follows this list).
  • Indexing into and slicing buffer will panic if an index is out of bounds, as Rust does not allow reading past the end of an array.
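  Roughly speaking, the whole chain behaves like the following hand-expanded loop for a single window. This is only a sketch of the control flow; the actual standard library adapters are separate structs with their own next() implementations:

    // Sketch of the control flow behind coefficients.iter().zip(..).map(..).sum()
    // for one window; not the actual standard library implementation.
    let window = &buffer[i - 12..i];
    let mut coef_iter = coefficients.iter();
    let mut samp_iter = window.iter();
    let mut sum: i64 = 0;
    loop {
        // zip: stop as soon as either input iterator is exhausted.
        match (coef_iter.next(), samp_iter.next()) {
            (Some(&c), Some(&s)) => sum += c * s as i64, // map, then sum
            _ => break,
        }
    }
    let prediction = sum >> qlp_shift;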
  It appears that these high-level constructs come at a price. Many intermediate structures are created which would not be present in a hand-written inner product. Fortunately these structures are not allocated on the heap, as they would likely be in a language like Java or Python. Iterators also introduce extra control flow: zip will terminate after one of its inner iterators is exhausted, so in principle it has to branch twice on every iteration (a sketch of such an adapter follows below). And of course iterating itself involves a call to next() on every iteration. Are we trading performance for convenience here?
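  To see where those two branches come from, here is a simplified zip adapter of my own. The real std::iter::Zip has extra specialisations, but producing an item boils down to the same two checks:

    // Simplified zip adapter, for illustration: producing one item requires
    // checking whether either input iterator is exhausted, i.e. two branches.
    struct MyZip<A, B> {
        a: A,
        b: B,
    }

    impl<A: Iterator, B: Iterator> Iterator for MyZip<A, B> {
        type Item = (A::Item, B::Item);

        fn next(&mut self) -> Option<Self::Item> {
            let x = self.a.next()?; // branch one: is the first iterator done?
            let y = self.b.next()?; // branch two: is the second iterator done?
            Some((x, y))
        }
    }

  Whether any of this survives optimisation is the question the next section answers.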
  Generated code

  The only proper way to reason about the cost of these abstractions is to inspect the generated machine code. Claxon comes with a decode example program, which I compiled in release mode with rustc 1.13 stable. Let’s take a look at the result:
    10c00:
    movslq %r14d,%r11
    movslq -0x2c(%r8,%rdi,4),%rsi
    imul   %r10,%rsi
    movslq -0x30(%r8,%rdi,4),%r14
    imul   %rbp,%r14
    add    %rsi,%r14
    movslq -0x28(%r8,%rdi,4),%rsi
    imul   %rdx,%rsi
    add    %rsi,%r14
    movslq -0x24(%r8,%rdi,4),%rsi
    imul   %rax,%rsi
    add    %rsi,%r14
    movslq -0x20(%r8,%rdi,4),%rsi
    imul   %rbx,%rsi
    add    %rsi,%r14
    movslq -0x1c(%r8,%rdi,4),%rsi
    imul   %r15,%rsi
    add    %rsi,%r14
    movslq -0x18(%r8,%rdi,4),%rsi
    imul   %r13,%rsi
    add    %rsi,%r14
    movslq -0x14(%r8,%rdi,4),%rsi
    imul   %r12,%rsi
    add    %rsi,%r14
    movslq -0x10(%r8,%rdi,4),%rsi
    imul   0x8(%rsp),%rsi
    add    %rsi,%r14
    movslq -0xc(%r8,%rdi,4),%rsi
    imul   0x18(%rsp),%rsi
    add    %rsi,%r14
    movslq -0x8(%r8,%rdi,4),%rsi
    imul   0x20(%rsp),%rsi
    add    %rsi,%r14
    imul   0x10(%rsp),%r11
    add    %r11,%r14
    sar    %cl,%r14
    add    (%r8,%rdi,4),%r14d
    mov    %r14d,(%r8,%rdi,4)
    inc    %rdi
    cmp    %r9,%rdi
    jb     10c00 <claxon::subframe::predict_lpc::h6c02f07b190820c0+0x2b0>
All overhead is gone completely. What happened here? First of all, no loop is used to compute the inner product. The input slices have a fixed size of 12 elements, and despite the use of iterators, the compiler was able to unroll everything here. The element-wise computations are efficient too. There are 12 movslqs which load a value from the buffer and simultaneously widen it. There are 12 multiplications and 12 additions: 11 for the inner product, and one to add the delta after arithmetically shifting the sum right (sar). Note that the coefficients are not loaded from the buffer inside the loop: eight are kept in registers and the remaining four are read from fixed stack slots. After the sample has been computed, it is stored with a single mov. All bounds checks have been elided. The final three instructions handle the control flow of the loop. Not a single instruction is redundant here, and I could not have written this better myself.
  I don’t want to end this post without at least touching briefly upon vectorisation. It might look like the compiler missed an opportunity for vectorisation here, but I do not think that this is the case. For various reasons the above snippet is not as obvious to vectorise as it might seem at first sight. But in any case, a missed opportunity for vectorisation is just that: a missed opportunity. It is not abstraction overhead.
  Conclusion

  In this post I’ve shown a small snippet of code that uses high-level constructs such as closures and iterator combinators, yet the code compiles down to the same instructions that a hand-written C program would compile to. Rust lives up to its promise: abstractions are truly zero-cost.