The technical report “The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V” looks at how the RISC-V ISA compares to ARM and Intel by analyzing the results from running SPEC CINT2006. One thing that surprised me was that 30% of the instructions executed when running the 403.gcc benchmark (which compiles a few files using a modified GCC 3.2) come from doing memset.
The RISC-V memset loop writes 16 bytes in 4 instructions
    // RV64G, 4 instructions to move 16 bytes
    4a3814: sd   a1, 0(a4)
    4a3818: sd   a1, 8(a4)
    4a381c: addi a4, a4, 16
    4a3820: bltu a4, a3, 4a3814
which is somewhat less efficient than ARM and Intel, which write more data per instruction
    // armv8, 6 instructions to move 64 bytes
    6f0928: stp  x7, x7, [x8,#16]
    6f092c: stp  x7, x7, [x8,#32]
    6f0930: stp  x7, x7, [x8,#48]
    6f0934: stp  x7, x7, [x8,#64]!
    6f0938: subs x2, x2, #0x40
    6f093c: b.ge 6f0928
so this should translate to about 10% of the executed ARM instructions doing memset. But that is still much more than what I would have guessed.
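The estimate above can be sanity-checked with a quick back-of-envelope calculation. This is only a sketch: it uses the instruction and byte counts from the two listings and the 30% figure from the report, and it assumes (my assumption, not something the report states) that the non-memset code executes the same number of instructions on both ISAs.

```python
# Back-of-envelope check of the ARM estimate, using only the
# instruction counts quoted in the two listings.

riscv_instr_per_byte = 4 / 16   # RV64G loop: 4 instructions per 16 bytes
arm_instr_per_byte = 6 / 64     # armv8 loop: 6 instructions per 64 bytes

riscv_memset_share = 0.30       # share of executed RISC-V instructions (from the report)

# ARM instructions needed for the same memset work, relative to RISC-V.
ratio = arm_instr_per_byte / riscv_instr_per_byte

# memset instructions on ARM, expressed as a fraction of the RISC-V total.
arm_memset_vs_riscv_total = riscv_memset_share * ratio

# ASSUMPTION (not from the report): the non-memset code executes the
# same number of instructions on both ISAs.
non_memset = 1.0 - riscv_memset_share
arm_memset_share = arm_memset_vs_riscv_total / (arm_memset_vs_riscv_total + non_memset)

print(f"ARM/RISC-V instructions per byte: {ratio:.3f}")
print(f"ARM memset instructions vs. RISC-V total: {arm_memset_vs_riscv_total:.1%}")
print(f"ARM memset share of ARM total: {arm_memset_share:.1%}")
```

Under these assumptions the number lands somewhere between roughly 11% and 14% depending on which total you divide by, which is in the same ballpark as the "about 10%" figure.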
I do not have access to SPEC, and I have been too lazy to try to replicate this with other data, but a quick literature search indicates that this is not as insane as I thought. The papers I have found look at the cost of clearing data in garbage collection implementations, and they seem to get similar results for the cost. For example, “Why Nothing Matters: The Impact of Zeroing” says:
We show that existing approaches of zero initialization are surprisingly expensive. On three modern IA32 architectures, the direct cost is around 2.7-4.5% on average and as much as 12.7% of all cycles, in a high-performance Java Virtual Machine (JVM), without accounting for indirect costs due to cache displacement and memory bandwidth consumption.