Simplexhc – Haskell to LLVM compiler (design phase)

2017-06-25

Simplexhc - a STG to LLVM compiler

Link to Github repo

I’m trying to understand GHC’s design choices when it comes to compilation. The way GHC currently compiles is to first reduce Haskell to a Core language, which is, in some sense, a minimal subset of Haskell.

Next, it compiles Core to STG, an “abstract machine” that is Haskell’s view of what the hardware should be like. STG stands for the “Spineless Tagless G-machine” (yep, a totally badass name). The best source for it is the original paper, “Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine”.

Alas, reality is different: x86 is quite different from STG, so we compile STG down to C--, an imperative C-like language.

Usually, C-- gets compiled directly down to assembly. However, there is also a backend which converts it to LLVM. This is a terrible way to generate LLVM from STG, as:

  1. Going to C-- and then from there to LLVM loses a lot of the semantic information that one could provide to LLVM.

  2. STG’s lowering to C--, and the decisions taken when implementing it, were based in the 90’s. That’s not to say that the GHC team hasn’t done an awesome job keeping it up to date: they have! But I wonder what a complete rewrite of this lowering would look like. Hence, I’m trying to experiment in this space and see what happens.

The repo of simplexhc currently contains the original STG implementation as an interpreter. However, I’ve started work on the LLVM backend, so I decided to start blogging about it.

Some background knowledge

Since I’ll be discussing ideas related to compilation, I’d like to discuss some of the related ideas.

1. Tail call optimisation

Consider a function:

Original function

int factorial_(int n, int accum) {
    if(n == 1) return accum;
    return factorial_(n - 1, accum * n);
}

int factorial(int n) { return factorial_(n, 1); }

Now, if we consider factorial(4), the call stack will look something like this:

Call stack

factorial_(1, 24)
factorial_(2, 12)
factorial_(3, 4)
factorial_(4, 1)
; bottom

However, we can do something interesting with the factorial_ function: rather than calling again, we can simply replace accum := accum * n and n := n - 1, and restart evaluating factorial_. That is, in terms of code:

Optimised Function

int factorial_(int n, int accum) {
begin:
    if(n == 1) return accum;
    // value replacement.
    accum = accum * n;
    n = n - 1;
    goto begin; // jump to beginning of function with new variable values.
}

// unchanged
int factorial(int n) { return factorial_(n, 1); }

Note that the reason this works is that the last action of the function is a call to factorial_. Such a call is called a tail call, and this optimisation is known as tail call optimisation or tail call elimination.

2. STG - Push/Enter

I will not be going into the details of STG; I refer interested readers to the paper :). However, I do want to provide some information about how STG works, at least for certain cases. There is an excellent StackOverflow post covering, intuitively, what evaluation in a lazy language looks like. For a deeper dive, there are the slides called “A Haskell compiler”.

The point is that Haskell doesn’t have a “call stack” per se: it doesn’t have the notion of a stack of values left to be evaluated. Rather, Haskell (or rather, STG) uses tail calls as its primary method of “control flow”, due to laziness. Hence, we need to be able to compile these.

Current work: Representing the main loop

The heart of STG lies in following continuations. As a motivating example, let’s take a dummy example and see what this would look like in C:


void matcher(int i);
int globalSideEffect;

void function_one() {
    globalSideEffect = 0;
    matcher(2);
}

void function_two() {
    globalSideEffect = 42;
    matcher(1);
}

/* This is the "main loop" of STG. A function calls "matcher" with the
 * correct "tag" corresponding to the next function (continuation). The
 * matcher matches on the tag and dispatches the call.
 * If we do not perform tail call elimination, the call stack will eventually
 * blow up, since there are no returns anywhere!
 */
void matcher(int i) {
    switch (i) {
        case 1: function_one(); break;
        case 2: function_two(); break;
    }
}

// entry point
int main() { matcher(1); }

The problem with this approach is that since none of matcher, function_one, or function_two ever return, the call stack will eventually blow up. So we need to implement this some other way. I discuss two approaches: the one described in the original STG paper, and the one I am considering for simplexhc.

STG paper approach:

In the STG paper, here is how they describe doing this:

function_one() {
    globalSideEffect = 0;
    return &function_two;
}

function_two() {
    globalSideEffect = 42;
    return &function_one;
}

main() {
    cont = &function_one;
    while(1) {
        cont_next = (*cont)();
        cont = cont_next;
    }
}

The transformation is quite clever: they choose to return function pointers. The main function becomes the “continuation follower”: main calls the current continuation, cont, which returns the next continuation, cont_next. Then we assign cont_next to cont and repeat the process. Note that we don’t blow up our stack doing this, because we only ever use a constant amount of stack space.

However, this has problems:

Problem #1: We cannot provide a type to the functions

Specifically, function_one and function_two. Their types are roughly:

type Cont = () -> Cont*

which you can’t directly typedef in C (or LLVM). The workaround would be something along the lines of:

struct Continuation;
typedef struct Continuation *(*ContFuncTy)(void);
struct Continuation {
    ContFuncTy cont;
};

However, this adds a struct indirection as well. I’m not sure how much of an impact this would have on the generated code, but it just seems hacky.

Problem #2: returning function pointers wrecks inter-procedural optimisation

Since we have functions that return functions, most optimisation opportunities would require inter-procedural optimisation. However, LLVM’s inter-procedural optimisation story is not so great: it relies heavily on inlining, which is problematic if we generate code like what’s shown above, with its function-pointer chasing.


simplexhc approach:

If we reconsider the original solution with a couple of minor changes:

Annotated C code

void matcher(int i);
int globalSideEffect;

void function_one() alwaysinline {
    globalSideEffect = 0;
    tail call matcher(1); // "tail call" forces a call to be tail call optimised.
}

void function_two() alwaysinline {
    globalSideEffect = 42;
    tail call matcher(2);
}

void matcher(int i) {
    switch (i) {
        case 1: function_one(); break;
        case 2: function_two(); break;
    }
}

Think of the alwaysinline attribute as a hint to the compiler (in this case, LLVM) to always inline these function calls.

Next, tail call is an annotation telling the compiler to force these calls to be tail calls.

In the current state (without inlining), these don’t look like tail calls. However, after inlining, we get code that looks like this:

Annotated C code (After inlining)

void matcher(int i);
int globalSideEffect;

void function_one() alwaysinline {
    globalSideEffect = 0;
    tail call matcher(1);
}

void function_two() alwaysinline {
    globalSideEffect = 42;
    tail call matcher(2);
}

void matcher(int i) {
    switch (i) {
        case 1: {
            globalSideEffect = 0;
            tail call matcher(1);
        } break;

        case 2: {
            globalSideEffect = 42;
            tail call matcher(2);
        } break;
    }
}

in which the recursive calls to matcher can be tail call optimised, since each now occurs as the last action of its case.

The problem with this approach is that the matcher would need to cover every possible call in our Haskell program. While this could be quite large, I have a feeling that aggressive dead code elimination, combined with LLVM’s decent ability to generate code for switch, shouldn’t hurt the instruction cache too much.

The reason this is absolutely wonderful is that now the calls are explicit. There is no pointer-chasing that the compiler needs to perform, which should lead to better inter-procedural analysis.

Of course, it’s all conjecture at this point. I’ll come back with data 🙂

Future advantages because of LLVM

  • LLVM 5 introduced support for coroutines. This can possibly be used for CPS as well. In fact, there was a discussion about adding explicit CPS support. However, it seems like there is still work to be done in the area.

  • LLVM constantly receives updates and performance improvements because there are a bunch of people who contribute to it. That should (theoretically) allow GHC to become faster with no effort of its own.

I’d love to hear thoughts and comments about this. Please do leave a comment, or e-mail me at siddharth.bhat at research dot iiit dot ac dot in.

