Optimisations in Verve
I have definitely heard that “Premature optimisation is the root of all evil” and all that stuff, but well, sometimes optimising things can be fun.
I started Verve as a Virtual Machine (VM) rather than as a programming language, just for fun, since I was starting to dig into JavaScriptCore (JSC) and thought it'd be nice to try implementing a small VM.
  MVP

My first milestone was the easiest possible: execute a simple program, like “Hello, World”. However, in order to achieve that, I needed to write this program somehow, and I chose to start with a simple lisp-like language, since it's as easy to parse as it gets.
     
    (print "Hello, World!")
   Since my main goal was building the VM, I started with some kind of bytecode right away, rather than interpreting the AST at first.
  As soon as my compiler front-end was smart enough to spit out some bytecode for simple function calls I jumped straight into the fun part: writing the interpreter.
It started out the simplest way I could think of: a big switch on the opcode (a simple int in this case). (original interpreter)
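As a rough sketch of that shape (a tiny stack machine for illustration, not Verve's actual interpreter):

    // a minimal sketch of switch-based dispatch, assuming a tiny stack
    // machine; this is illustrative, not Verve's actual code
    #include <cstdio>
    #include <vector>

    enum Opcode { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    void run(const std::vector<int> &bytecode) {
      int stack[256];
      int sp = 0;    // stack pointer
      size_t pc = 0; // program counter
      while (true) {
        switch (bytecode[pc++]) {     // the big switch on the opcode
          case OP_PUSH:  stack[sp++] = bytecode[pc++];     break;
          case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
          case OP_PRINT: printf("%d\n", stack[--sp]);      break;
          case OP_HALT:  return;
        }
      }
    }

Calling run({OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT}) prints 5.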
  I was pretty happy with the result, everything kinda worked and I had only spent a few days. But I needed some new milestone to keep the motivation going.
  fib(40)

And that would be my next milestone: beat JSC on fib(40). I know, it's meaningless, and please, I'm not saying my dummy VM is faster than ANY real VM, but still, it seemed like a fun goal.
     
    // the Fibonacci implementation
    fn fib(n: int) -> int {
      if (n < 2) n
      else fib(n-1) + fib(n-2)
    }
At first, I admit I thought it wouldn't be too hard: JSC is a full-featured VM, but that means it has to do way more work than my rather simplistic VM:
  
       
  • First of all: I wasn’t considering the JIT, only interpreter vs interpreter.   
  • JavaScript is not a simple language to parse, I only had my lisp-like dummy language.   
  • It deals with caches, profiles, counters, and many other things, which add overhead.   
  • It has much more code, which affects load time, cache locality, etc.   
• The interpreter is written in a high-level assembly language, which doesn't allow abusing platform-specific registers.
So I decided to give it a shot and see how it would compare. Of course, at first it wouldn't even compile on my VM, but after a few days, I finally got it working.
And how fast was it? I couldn't even tell: it took so long that I never saw it finish (definitely over 30min).
Damn, it was at least 15x slower than my x86 emulator, written in JavaScript itself!
  Optimisations

  Ok, so enough with history time… I’ll mention 3 key areas I worked on optimising:
  
       
  • Scope lookup   
  • Interpreter loop   
  • Function calls  
These cover pretty much all the time spent in a simple program such as fib(40). I never had any issues with parsing, despite having bits of the parser I know are far from optimal.
  Scope lookup

  First of all, by scope lookup I mean all the time spent on generating scopes, deleting scopes, and actually looking values up in the scope.
The scope is (TL;DR) where the variables are stored: you start with the global scope, and every function introduces a new scope, i.e. the variables defined inside the function can't be accessed outside of it.
     
    // global scope
    (function () {
      // scope a
      (function () {
        // scope b
      })
    })
Verve has always had lexical scoping:
     
    var a = 2;
    var plusA = (function () {
      var a = 10;
      return function(b) { return b + a; }
    })();
    plusA(5) // 15
Here the function plusA can access a, even though it's declared in its parent scope rather than in its body. And when the function is called, it still looks up the scope where it was defined, rather than where it was called from. This means that functions have to keep track of their enclosing scope.
My initial implementation was, again, the simplest I could think of: a Closure was an object that would hold a pointer to a Scope, and the Scope was a linked list, so you could look a variable up through the scope chain.
     
    // pseudocode
    class Closure {
      scope: Scope,
      fn: Function
    }
    class Scope {
      parentScope: Scope,
      _values: Map<String, Value>,
      get(key) { _values.get(key) || parentScope.get(key) }
    }
In order to deal with the case where a Closure lives longer than its Scope (e.g. a Closure that returns a Closure), I wrapped it in C++ shared pointers: the Scope remains alive as long as there's anything pointing to it. Three things could be pointing to a scope:
  
       
  • It could be the scope for the code the VM was currently executing   
  • It could be a parent of the scope mentioned above   
• A Closure could be holding onto it.
  Optimising the Scope

  The first optimisation follows my favorite motto: Nothing is faster than doing nothing.
  We only need this whole Scope thing if the function actually uses it! During parsing we can analyse all the variables used in the function, and if it doesn’t reference anything from the parent scope, we don’t need to keep track of the enclosing scope at all.
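As a hypothetical sketch of that analysis (the names here are mine, not Verve's): a function needs its enclosing scope only if it references a name it doesn't declare itself.

    // a hypothetical sketch of the parse-time analysis, not Verve's
    // actual code: a function needs its enclosing scope only if it
    // references a name it doesn't declare itself
    #include <set>
    #include <string>

    struct FunctionAST {
      std::set<std::string> declared;    // parameters and locals
      std::set<std::string> identifiers; // every name used in the body
    };

    bool capturesEnvironment(const FunctionAST &fn) {
      for (const auto &name : fn.identifiers)
        if (!fn.declared.count(name))
          return true; // free variable: keep the enclosing scope
      return false;    // no captures: skip scope creation entirely
    }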
This saves creating new scopes on every function invocation, but it doesn't speed up lookups in the scope: fib(40) is recursive, and for every invocation of fib(n) we'll look fib up twice, once for fib(n-1) and once for fib(n-2).
  In order to speed up the lookups, I worked on a few optimisations:
  
       
• std::shared_ptr wasn't fast enough. It was just way more robust than what I needed… Since it was only used in a few places, replacing it with a simple refCount field and doing manual reference counting worked just as well and was much faster (commit); see the sketch after this list.
• std::unordered_map wasn't fast enough. Same thing: most of the scopes hold very few values for my small programs, so replacing the std implementation with a simple hash map, built on a single small array with quick hashing that just uses the least significant bits of the pointer, was way faster (part of the commit above).
• Going to C++ for every lookup was too slow: once I had optimised the interpreter (read below), going to C++ meant saving the registers' state and aligning the stack. Moving the whole lookup into asm was faster (commit).
• Caching the lookups: adding a side table as a linear cache of the lookups, combined with the asm implementation of the lookup, made it so that cache hits take only 2 instructions! (commit)
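Here is a rough sketch of the first two items combined, assuming interned string keys (so equal names share a pointer); the types and helpers are illustrative, not the actual commits:

    // a minimal sketch, not Verve's actual code: an intrusive refCount
    // field instead of std::shared_ptr, plus a small array hashed on
    // the low bits of an interned key pointer
    #include <cstdint>

    struct Value;

    struct Scope {
      static const uint32_t kSize = 8; // power of two, so masking works
      Scope *parent = nullptr;
      uint32_t refCount = 1;           // starts owned by its creator
      const char *keys[kSize] = {};    // interned: equal names share a pointer
      Value *values[kSize] = {};

      static uint32_t hash(const char *key) {
        // quick hashing: just the pointer's least significant (useful) bits
        return ((uint32_t)(uintptr_t)key >> 3) & (kSize - 1);
      }

      Value *get(const char *key) {
        // linear probing; fine because scopes hold very few values
        for (uint32_t i = hash(key), n = 0; n < kSize; i = (i + 1) & (kSize - 1), n++)
          if (keys[i] == key) return values[i];
        return parent ? parent->get(key) : nullptr;
      }
    };

    inline void scopeRef(Scope *s)   { if (s) s->refCount++; }
    inline void scopeUnref(Scope *s) {
      if (s && --s->refCount == 0) {
        scopeUnref(s->parent); // a scope keeps its parent alive
        delete s;
      }
    }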
  Ok, that’s enough about scopes…
  Interpreter loop

  As mentioned above I started with a big switch statement in C++, but I already knew I wanted to write the interpreter in Assembly. Some people don’t agree, but I find it quite fun!
  Going to assembly has many benefits that can lead to massive performance wins, for example:
  
       
  • Fine-grained control over the stack layout: no need for keeping a virtual stack in C++!   
  • Control over registers: Writing platform specific assembly means that I can use every single callee-saved register to keep around the data I constantly need to access.  
I considered two options when I was writing the interpreter:
  
       
• JSC's model: at the end of every opcode's implementation, we check what the next opcode is and jump to its implementation.
  • “Traditional model”: Still have a central loop, but optimise it by hand in assembly.  
I started with JSC's model, since it was what inspired me to write the VM in the first place, but there were a few downsides raised by others I talked to:
  
       
  • It makes the bytecode big - you need the opcodes to be the address of their actual implementation, which means that you need word-sized instructions   
• It messes up branch prediction: at the end of every opcode you jump to a random address taken from a random location in the heap.
  Upon hearing that, I thought I’d try switching to the traditional model and see whether it was actually faster.
The loop would start by using the next opcode's value to calculate the offset into a list of jumps that would follow. But that was not really efficient, since it'd take the pointer arithmetic + two jumps to get to the desired opcode. (source code)
My next step was disassembling a switch and looking at how the compiler implemented jump tables, and it was much better than my code: instead of a list of jumps, you put the relative addresses after the loop, use the value of the opcode to read the right address, and add that to the current instruction pointer to get the absolute address of the function. (source code)
I kept on battling with some smaller optimisations (even though quite effective sometimes), such as reordering methods based on “hotness” and refactoring to remove some expensive instructions, but it still wasn't fast enough…
In a desperate attempt I tried to roll back to the JSC-style interpreter, and to my surprise it was much faster. Of course I didn't just throw away all the optimisations I had worked on while using the traditional loop implementation, but switching back was what pushed the interpreter from being about as fast as JSC to being faster.
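To illustrate the difference, here is a minimal sketch of JSC-style dispatch written with the GCC/Clang labels-as-values extension rather than real assembly (this is my approximation, not Verve's actual code): every bytecode slot holds the address of its handler, and every handler ends by jumping straight to the next one, with no central loop.

    // a minimal sketch of JSC-style threaded dispatch, using the
    // GCC/Clang labels-as-values extension; not Verve's actual code
    #include <cstdio>

    int main() {
      struct Insn { void *op; int arg; };

      // "bytecode" for: push 2; push 3; add; print; halt. Note that
      // each opcode is the word-sized address of its implementation.
      Insn code[] = {
        { &&op_push, 2 }, { &&op_push, 3 },
        { &&op_add, 0 }, { &&op_print, 0 }, { &&op_halt, 0 },
      };

      int stack[16], sp = 0;
      Insn *pc = code;

      // no central loop: every handler ends by jumping to the next one
      #define DISPATCH() goto *(pc++)->op

      DISPATCH();
    op_push:  stack[sp++] = pc[-1].arg;         DISPATCH();
    op_add:   sp--; stack[sp - 1] += stack[sp]; DISPATCH();
    op_print: printf("%d\n", stack[--sp]);      DISPATCH();
    op_halt:  return 0;
    }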
  Fast closures

fib(40) is nothing more than just calling the same function over and over again, billions of times. It'd be great if calling a function was fast!
In my naive implementation, every function implemented in Verve would be represented in memory by the Closure object I mentioned above, which has a reference to the function and another to its enclosing scope. When you make a call, we look the callee up in the scope, check whether it's a closure, and if so, we jump to C++ so it can “prepare the closure”.
  Preparing the closure means: checking whether it needs a new scope, and finding the implementation’s offset in the bytecode.
But, if we can figure out at parsing time that a closure doesn't capture its scope, we can be sure that we don't need a new scope: all we need is the bytecode offset.
The idea of fast closures is just a tagged pointer which only contains the offset of the function. For memory alignment reasons, some of the least significant bits will always be zero for a valid pointer, so we set the least significant bit to one in order to indicate that it's not an actual pointer, and right next to it we add the function's offset in the bytecode, shifted left by 1.
Real closure: 0x00FFF13320
Fast closure: 0x0000000321 // the address is 0x0190 (0x0321 >> 1)
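A minimal sketch of that tagging scheme (the helper names are illustrative, not Verve's actual API), assuming valid Closure pointers are at least 2-byte aligned so a real pointer's low bit is always zero:

    // a minimal sketch of the tagging scheme; helper names are
    // illustrative, not Verve's actual API
    #include <cstdint>

    struct Closure;

    inline uintptr_t makeFastClosure(uintptr_t offset) {
      return (offset << 1) | 1;  // e.g. offset 0x0190 becomes 0x0321
    }
    inline bool isFastClosure(uintptr_t v)          { return v & 1; }
    inline uintptr_t fastClosureOffset(uintptr_t v) { return v >> 1; }
    inline Closure *asRealClosure(uintptr_t v) {
      // low bit is 0: an ordinary, aligned Closure pointer
      return reinterpret_cast<Closure *>(v);
    }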
As I was writing the description above, I realised there's a bug: fast closures must also not contain nested closures, otherwise the parent scope will be polluted. The type checker should catch it nonetheless, though you can bypass it, so, well…
  Benchmarking

Given the following implementation of fib in Verve
     
    fn fib(n: int) -> int {
      if (n < 2) n
      else fib(n - 1) + fib(n - 2)
    }
    print(fib(40))
   And the following implementation in JavaScript
     
    function fib(n) {
      return n < 2
        ? n
        : fib(n - 1) + fib(n - 2)
    }
    print(fib(40))
Verve, on this commit*, takes 8.247s on an avg of 5 runs.
JSC, version 602.1.50, with JSC_useJIT=0, takes 14.558s on an avg of 5 runs.
Both were tested on an Early-2016 MacBook running macOS 10.12.
*This commit was picked because I stopped working on perf after it; lately I've been having more fun making Verve a better language, with proper type checking instead.
  Conclusion

Optimising things can be fun, but be mindful when optimising real-world projects. Most of the time optimisations are about trading readability and maintainability for speed, which doesn't always pay off.
None of the optimisations mentioned here were picked at random: even though it's a side project, everything was carefully profiled before and after optimising, because things don't always go as you expect.
Other than that, I don't mean to prove anything: this is not a real benchmark, and there's no way of comparing the Verve VM in its current state to any full-featured VM… I just wrote this post for the same reason I wrote the code: for fun, and hoping it might be helpful, or at least give some inspiration, for someone at some point.
  Thanks for reading! :)