技术控

    今日:74| 主题:49270
收藏本版 (1)
最新软件应用技术尽在掌握

[其他] Beating the Compiler

[复制链接]

立即注册CoLaBug.com会员,免费获得投稿人的专业资料,享用更多功能,玩转个人品牌!

您需要 登录 才可以下载或查看,没有帐号?立即注册

x
    Beating The Compiler    November 28th, 2016  

  An oft-repeated fact on programming forums these days is that a decent optimizing compiler will always beat a puny human's attempt at hand-written assembler. There are rare cases, like MPEG decoders, where making good use of the SIMD intrinsics can allow assembly to massively beat the compiler. But generally you'll hear that the compiler will always do better.
  The reason given for this is usually that a modern CPU has so many pipelines and instruction hazards to deal with, and a naive assembly routine won't do as good a job dealing with them.
  But is it true? Let's not just take some guys word on the Internet as gospel, let's do a little experiment and find out.
  We'll just take a simple piece of code to look at here. I'm not going to pick an example that would benefit heavily from exotic intrinsics. Instead I'll just use an old standard,    quicksort.  
  Here's the naive C++ quicksort we'll be testing against:
  1. struct Item { int key; int value; };
  2. extern "C"
  3. void sortRoutine(Item *items, int count)
  4. {
  5.     if (count <= 1)
  6.         return; // already sorted if only zero/one element
  7.     // Pick the pivot.
  8.     Item pivot = items[count-1];
  9.     int low = 0;
  10.     for (int pos=0;pos<count-1;pos++)
  11.     {
  12.         if (items[pos].key <= pivot.key) {
  13.             // swap elements
  14.             Item tmp = items[pos];
  15.             items[pos] = items[low];
  16.             items[low] = tmp;
  17.             low++;
  18.         }
  19.     }
  20.     items[count-1] = items[low];
  21.     items[low] = pivot;
  22.     sortRoutine(items, low);
  23.     sortRoutine(items+low+1, count-1-low);
  24. }
复制代码
It's nothing fancy. It's not the best sorting algorithm in the world, and this isn't even the best implementation of it, but it's simple and it'll do just fine.
  Now let's try a direct naive assembly implementation of it:
  1. sortRoutine:
  2.     ; rcx = items
  3.     ; rdx = count
  4.     sub rdx, 1
  5.     jbe done
  6.     ; Pick the pivot.
  7.     mov r8, [rcx+rdx*8] ; r8 = pivot data
  8.     xor r9, r9          ; r9 = low
  9.     xor r10, r10        ; r10 = pos
  10. partition:
  11.     cmp [rcx+r10*8], r8d
  12.     jg noswap
  13.     ; swap elements
  14.     mov rax, [rcx+r10*8]
  15.     mov r11, [rcx+r9*8]
  16.     mov [rcx+r9*8], rax
  17.     mov [rcx+r10*8], r11
  18.     inc r9
  19. noswap:
  20.     inc r10
  21.     cmp r10, rdx
  22.     jb partition
  23.     ; move pivot into place
  24.     mov rax, [rcx+r9*8]
  25.     mov [rcx+rdx*8], rax
  26.     mov [rcx+r9*8], r8
  27.     ; recurse
  28.     sub rdx, r9
  29.     lea rax, [rcx+r9*8+8]
  30.     push rax
  31.     push rdx
  32.     mov rdx, r9
  33.     call sortRoutine
  34.     pop rdx
  35.     pop rcx
  36.     test rdx, rdx
  37.     jnz sortRoutine
  38. done:
  39.     ret
复制代码
It was quite easy to write this, largely due to Intel's flexible memory addressing operators. What's interesting is that I made no real attempt to pay attention to scheduling, pipelining, etc. I just wrote it out as simple assembly.
  Now let's time it. I wrote a simple test harness that sorts a 1000000-item array. I run the test 100 times and take the best-case across the whole set. I'll compile the C++ version with gcc 4.8.1, clang 3.8.0, and MSVC 2013.
  Results

  1. sort_cpp_recurse_gcc.exe      : Took 99ms best-case across 100 runs
  2. sort_cpp_recurse_clang.exe    : Took 99ms best-case across 100 runs
  3. sort_cpp_recurse_ms.exe       : Took 98ms best-case across 100 runs
  4. sort_asm_recurse.exe          : Took 92ms best-case across 100 runs
复制代码
Well that's interesting. The different compilers did mostly around the same, with MSVC having a slight edge.    But the assembly version ran fastest, being a good 7% faster in this case.  
  Now the thing about C++ is that it's not always a good representation of the underlying machine. It's fine when it comes to variables, but it's representation of the stack is very limited. C++ pretends we can only use the stack as a call stack, whereas in reality one of the things we can do is to use it as a    datastack instead.  
  So let's try that and see what happens. We'll remove the recursive calls to sortRoutine, and instead push/pop our data ranges directly from the stack. This allows us to run in a single loop without having to actually recurse. This can often have a strong benefit because it removes the overhead of entering/exiting the function each time.
  I won't post the code for it here, but it's in the zipfile below if you want to see it.
  1. sort_cpp_recurse_gcc.exe      : Took 99ms best-case across 100 runs
  2. sort_cpp_recurse_clang.exe    : Took 99ms best-case across 100 runs
  3. sort_cpp_recurse_ms.exe       : Took 98ms best-case across 100 runs
  4. sort_asm_recurse.exe          : Took 92ms best-case across 100 runs
  5. sort_cpp_iter_gcc.exe         : Took 106ms best-case across 100 runs
  6. sort_cpp_iter_clang.exe       : Took 97ms best-case across 100 runs
  7. sort_cpp_iter_ms.exe          : Took 95ms best-case across 100 runs
  8. sort_asm_iter.exe             : Took 92ms best-case across 100 runs
复制代码
Hmm. The assembly version comes out almost the same. I suspect this is because while an iterative approach removes the function-setup overhead, our hand-written x64 version doesn't actually    havemuch function-setup overhead, so doesn't really benefit.  
  But for the C++ versions, it's a different story. Most of them got a slight speedup, but gcc is actually slower! As far as I can tell from looking at the disassembly, it seems like it's managed to out-clever itself. The increased control paths and things for it to consider have given it too many balls to keep in the air at once.
  I compiled these tests on x64, where the default calling convention is fastcall. I suspect that if compiled for 32-bit instead, using the cdecl stack-based convention, the non-recursive version would have done better relatively. I didn't try it though, I'll leave that as an exercise to the reader.
  Conclusion

  So it seems like the old "modern compilers are always faster than you" spiel is not necessarily true. What    istrue however, is that the compiler did a    good enoughjob, and the code is easier to work with. So while you might be able to squeeze a little more speed out, it's probably not worth the maintenance hassle.  
  The assembly version    wasfaster though. I suppose if there's anything to be learned here, it's that people on the Internet    may sometimes be full of shit.  
    Download
    sorttest.zip - All the code used for this article.   
        Written by Richard Mitton,

    software engineer and travelling wizard.
    Follow me on twitter:      http://twitter.com/grumpygiant
友荐云推荐




上一篇:降维攻击:目标,比率指标
下一篇:CECPQ1 results
酷辣虫提示酷辣虫禁止发表任何与中华人民共和国法律有抵触的内容!所有内容由用户发布,并不代表酷辣虫的观点,酷辣虫无法对用户发布内容真实性提供任何的保证,请自行验证并承担风险与后果。如您有版权、违规等问题,请通过"联系我们"或"违规举报"告知我们处理。

寂静的神经 发表于 7 天前
围观了!
回复 支持 反对

使用道具 举报

星妍 发表于 6 天前
锄禾日当午,发帖真辛苦。谁知坛中餐,帖帖皆辛苦!
回复 支持 反对

使用道具 举报

承茗达z 发表于 6 天前
吾表兄,年四十余.始从文,连考三年而不中.遂习武,练武场上发一矢,中鼓吏,逐之出.改学医,自撰一良方,服之,卒.
回复 支持 反对

使用道具 举报

范敏 发表于 6 天前
看过相关报道,楼下后来二了
回复 支持 反对

使用道具 举报

*滑动验证:
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

我要投稿

推荐阅读

扫码访问 @iTTTTT瑞翔 的微博
回页顶回复上一篇下一篇回列表手机版
手机版/CoLaBug.com ( 粤ICP备05003221号 | 文网文[2010]257号 )|网站地图 酷辣虫

© 2001-2016 Comsenz Inc. Design: Dean. DiscuzFans.

返回顶部 返回列表