Adventures in Type Inference Convergence

Julia uses type inference to determine the types of program variables and generate fast, optimized code. But how does it really work? I recently redesigned the implementation of Julia’s type inference algorithm, and decided to blog what I’ve learned. If you’re already comfortable with these algorithms, I might not say anything new. But since only 26 people have committed any changes to Julia’s `inference.jl`

file, I suspect that this is an unfamiliar topic for most people.

However, I’m not going to focus on explaining the typical algorithms, but instead focus on some of the implementation challenges.

A high level view of type inference when approached as a data-flow problem (as Julia does) is that it involves running an “interpreter” on the program, but only looking at types instead of values. Unlike normal program execution, this process can be proven to always converge and terminate if certain criteria are met. A typical explanation of the algorithm might end with a statement such as “now iterate to convergence”. But this simple phrase masks a significant design complexity, which can impact the speed and precision of the algorithm. This is the component of the Julia algorithm that needed a significant redesign to address a particular recursion bug and make inference on self-recursive functions faster.

## Basic Algorithm

Many algorithms for type inference exist. For statically typed languages, the algorithm of choice is usually a flow-insensitive unification-based algorithm like Hindley-Milner. Julia’s usage of data-flow analysis, however, is better suited for dynamic typing. This approach allows, for example, variables to have different types at different points in the program, adding extra flexibility and convenience. Unlike static typing systems, type inference is used as an optimization, rather than to guarantee correctness at compile time. To succeed, it does not need to prove a unique tightest bound on types, so it is allowed to widen (over-approximate) types heuristically. As we will see, these heuristics have an interesting design space.

I have illustrated Julia’s algorithm below for a simple for loop. In the animation below, `pc`

is the “program counter” indicating the line currently being processed and `pc´`

is highlighted to indicate the lines that remain to be processed next:

In Julia’s data-flow algorithm, inference execution starts with the statements in a function and initializes several states:

- The list of the current instruction pointers or active lines is set to be the entry point of the function.
- The types of all variables in the function are set to be
`Union{}`

(aka`Bottom`

) or the type of the corresponding argument. - The function also has an estimated return type of
`Union{}`

(does not return).

Then the algorithm acts like a “virtual machine”, iterating the following process until there are no lines waiting to be examined:

- The virtual machine takes one statement from the list of current instruction pointers.
- The side-effects of this statement are computed. For example, a variable assignment will result in the type of that variable changing. If the expression is a call, then the virtual machine looks up or computes the current best guess at the return value of that call, given the types of the input arguments. By applying this recursively, a current guess of the types of all of the elements of that statement is produced.
- The next line of the function is then added to the list of currently active instruction pointers, and the types of all variables on that next line are merged with the types from the current line, modified by any changes made by the current statement. This is not an easy update to describe in text, since it is merging two sets of types while taking into account the changes made in the previous statement. The best description I can give is that it should do exactly what you would expect: return the set of types possible for the variables on the next line, which is the same as the set of types on the current line, plus any changes made on the current line.
- For any
`if`

statement or other control operation (`while`

,`goto`

, etc.), the previous operation is performed for*all*of the possible next statements, instead of just the next one. For example, given an`if`

-statement, inference will take both the`if-true`

and`if-false`

paths through the function (and merge the results at the end). - To avoid infinite looping, the next instruction is only added to the list of current instruction pointers if the type of any variable on that line changed. Otherwise, the algorithm loops repeatedly until all of the types stop changing. The specific algorithm used to widen the types at each iteration can have dramatic effects on the runtime of the algorithm in the worst-cases. For this reason, Julia’s type-merging function typically tries to widen its estimate very quickly.

As I shall discuss next, once convergence has been reached, inference is done. The return type and variable types are marked and cached for code generation and future reuse (for example, if the function is inlined). The function can also then potentially be better optimized now that all of the types are known.

## Convergence

Now we come to the real challenge of type inference: reaching convergence.

The data-flow algorithm can be proven to converge if two conditions are met: monotonicity and finiteness.

This first requirement means that merging two types must result in a less specific type, and that knowing less about type for an argument to a function call does not lead to knowing more about its return type. This condition is fairly easy to meet, since all we need to do is describe the behavior of language primitives in a well-behaved way.

The second requirement means that there must be limit on number of types that can be constructed. This condition is more interesting, since many useful type lattices, including Julia’s, do not obey it. Instead, for the purposes of reaching convergence, the lattice is “forced” to finite height using a technique called “widening”. In widening, heuristics are applied to make types grow much faster than they would in a naive implementation of the algorithm, leading to faster convergence. This ensures that Julia’s type-inference algorithm is guaranteed to reach convergence eventually.

That still leaves the practical challenge of determining how to reach convergence. In Julia, detecting convergence is handled in several parts: Julia has local convergence at the statement level within a function, convergence of `static_typeof`

variables (comprehensions), and global convergence of all functions in the entire call graph.

### Historical Perspective

Initially, Julia only had the first test. This made inference quite simple to implement as a recursive descent over the depth-first search on the call graph. At each call, the algorithm pauses inference on the current function to recur into the callee. Eventually a leaf function is reached (one which does not make any calls to an uninferred function) and then the whole stack can unravel, filling in the return type information as it goes, and continuing with iterating convergence on the current function.

There is a slight wrinkle with this plan however if recursion is encountered. If there is a recursion, there is no leaf function. Instead, when entering a new function, the call stack is first checked to see if this function had already been encountered. Then the algorithm needed to take several actions. First, it marked the function as being recursive ( `toprec`

) and marked every intervening caller as depending on a recursive function ( `rec`

). Then as the recursion unwound, those functions would record that their return type was an incomplete, under-approximation of the real return type and discard their incomplete inference work (marked `true`

in the `tfunc`

cache). The topmost function in the recursion would then iterate this process until its return type reached a fixed-point.

Over time, this test was made more complicated to handle cases where recursion in the call graph can form multiple loops. Until v0.5, the algorithm computed a cycle-id for each recursion detected, to attempt to separate the dependency graph into separate recursion loops. This was not entirely correct because the graph is not always separable, which led to bug #9770 and then PR #9926 to compute it more correctly (I never attempted to prove it was fully correct, just that it was more correct on that issue). Unfortunately, that PR implementation also discarded and repeated significantly more incomplete work while handling recursion. The result was that inference became significantly slower on most tasks (such as inferring a single key handler for the repl), so the PR was never merged. To overcome these limitations, a new approach is required to deal with this case.

I mentioned previously that cycles cause issues for the existing type inference algorithm. The root of the problem is that using recursion to solve for the return types makes it inherently difficult to pause and resume type inference to incorporate new information. This impairs computation of the simultaneous convergence of all of the functions involved a cycle and instead led to a solution where only the top-most function in the call-graph is converged at a time. The resulting rework is very time-expensive in cases with heavy recursion, such as running inference on the inference code itself.

The solution is allow pausing and restarting inference on any part of a function at any time. This could still be written recursively, but once the state has been encapsulated in this way, it is easier to represent the algorithm as pure iteration. Indeed, this has obvious symmetries to inference within a function, where the working state is represented via an integer set of lines to revisit. This should not be particularly surprising, since there exists an isomorphism between statements and functions allowing one to be rewritten in terms of the other (e.g. replacing recursion with iteration or inlining one function into another).

The remaining construct that affects convergence is the `static_typeof`

expression. This construct appears in the loweredcode for a comprehension is an unusual complication in Julia’s type inference algorithm. It has required special care to make it look like the result of comprehensions are predictable, even though the current design is not. That is intended to be fixed soon, so I’ve relegated an overview of this construct to the footnotes.

### Corrected Convergence

The corrected convergence algorithm in PR #15300 works iteratively over all functions in the call graph, adding new functions to the queue as it comes across them, until all of them have collectively reached global convergence. When asked for the return type of a function, inference first checks the list in the `.tfunc`

field of the LambdaInfo method definition for whether the return type has been computed previously or is currently being computed (the `.tfunc`

list is a dictionary that maps from a call signature to either the in-process `InferenceState`

object or a pair of inferred function body and return type). If the type signature is not found, it is added to the work queue to be inferred. In any case, the current best-guess is returned. If inference has not finished with the function, an edge is also added to the inference state for these two functions (caller and callee), which tracks the dependency of the return type of the callee on that line number in the caller.

There are several partially-overlapping sets (or states) that a function can be in. I’ll list them all here for your reference, but I won’t fully explain what my terms mean until later:

- active: any function that is waiting to be inferred (superset of 2-6)
- in-work-queue: any function that is the work queue (yeah, I know, very insightful), which means it is ready for more processing (3)
- processing (one function): intermediate state after leaving the work queue while inferring all active lines and before it is then moved to the appropriate follow-on state (4-7)
- stuck: not in the work-queue, with edges ≠ ∅ but active lines = ∅
- stuck´: not in the work-queue, with edges ≠ ∅ and active lines ≠ ∅
- fixed-point: part of a cycle that may have collectively reached a fixed point
- finished: done

Any number of functions could be added to the work queue before starting global inference; the process doesn’t necessarily need to be rooted in one caller. Once the loop is called to start inference on everything, one function is selectedfrom the work queue for the algorithm to make progress on. It is important that the algorithm is always making progress to ensure that it will eventually terminate (more on this later). The algorithm then iterates over the instructions in the current function until it runs out of lines that can make progress (the algorithm could pause work earlier, but restarting frequently is less efficient and makes it harder to compute the “making progress” metric). Each time the algorithm reaches a call to another function that is not yet inferred, it adds an edge to the inference state for this function.

Once all active lines are processed, the algorithm determines if it is completely finished by counting the number of unresolved edges. If the number is zero, all child functions have also finished inference and the current return type result is fully converged. In that case, the function can be removed from the inference queue and recorded in the `.tfunc`

mapping (moved from the `active`

set to the `finished`

set). Then all parent functions are re-added to the work queue (by following the list of back-edges) so they can check for their own convergence. This is then repeated until the entire queue is empty (*the queue can also become empty due to a recursion cycle – I’ll get to that shortly). Alternatively, the item is moved to the `stuck`

set until the functions that it called have been inferred and can provide their actual return type.

If all active lines have not been processed, the algorithm picks another function off of the work queue and repeats the process. Because each call-edge between two functions is explicitly tracked, each time the return value of a function changes the algorithm can easily re-mark the appropriate lines in all of its callers for revisiting by the type inference algorithm, by following the back-edges (thereby moving the caller from `stuck`

to `stuck´`

).

Now comes the tricky part: dealing with cycles.

I mentioned earlier that the work queue may also become empty due to recursion. When this happens, all of the unresolved edges of the function that was just moved – from `work queue`

to `stuck`

– contain cycles. There are several ways to deal with this situation, depending on how the implementation chooses to represent, detect, and resolve the cycle. The following is a description of the implementation proposed for Julia. In terms of efficiency, this design optimizes for the case where most of the functions in the graph converge quickly (versus the possibility of needing to iterate the cycle many times before reaching a fixed point). Since most Julia code is written to have predictable “type-stable” leaf return types (and inference widens type information very quickly), this is expected to work out well in practice.

When the work queue becomes empty, the algorithm marks the current function as having reached a `fixed-point cycle`

. It then adds all of its non- `fixed-point`

edges to the work queue and marks them as having reached a `fixed-point`

. To guarantee convergence, the following invariant is maintained on this set at all times:

Any function marked as

`fixed-point`

must either:a. Be in the

`work queue`

set of functions (or`processing`

)b. Have all of its edges in the workqueue and marked as

`fixed-point`

This makes it remarkably easy to track this property and maintain these invariants. If the return type of the function changes, it and all of its immediate back-edges are marked as not being at a fixed point (moving them to `stuck´`

). The back-edges of each of those back-edges are then recursively also unmarked (moving them to `stuck`

). If a function marked as `fixed-point`

is retired from `processing`

, it adds all of its edges to `fixed-point`

.

If the work queue is still empty at this point, then the edges of all functions in the `fixed-point`

set are also in the `fixed-point`

set, and the entire cycle has reached convergence (yay!), and can be moved en masse to the finished set.

Finally, if the work queue is still empty, then the `active`

set contained multiple independent call graphs (and all of the remaining ones contain cycles). A random one gets moved to `processing`

to kickstart continued cycle resolution.

Once the `active`

set is empty, inference is done. It is now the compiler’s job to decide what to do with all this information to maximize speed.

But wait! There’s one more complication: in addition to cycles on a function with the same signature, because we specialize each signature on the inferred type signature there can be a cycle for recursively calling a method where the type signature changes at each call.

In this case, various heuristics are required to ensure this can still terminate. The current heuristics place a limit on the complexity of the type signaturein the recursion to force it to to be finite. But a full discussion of these heuristics will have to wait for a future blog post.

### Addendum: Speed

After writing this post, testing of the algorithm above showed that it was unacceptably slower and more memory-hungry than the existing code. So after all of that work to remove all recursion and make it a simple work queue, I made one final tweak: I made it recursive. This made its behavior considerably more complex to analyze, but it pays for its complexity in the speed benefit of being able to infer the common case (non-cyclic functions) without creating a deferred edge. The fall-back, however, is still the corrected looping algorithm given previously. This change however demonstrates one advantage of the states chosen above for the algorithm: it is highly decoupled, so that only the act of making and finalizing an edge needs to be synchronized. The rest of the work can be executed in any order – in the future, it could even be run multi-threaded.

## References

My thanks to @carnaval for correcting my convergence algorithm, encouraging me to simplify my state transitions, and generally for teaching me about this algorithm.

Also, his paper on static analysis capabilities for Julia helped me better understand some of the future features that may be desirable for Julia’s inference algorithm, and how those would affect the convergence criteria.

[M02] Markus Mohnen. A Graph-Free Approach to Data-flow Analysis. In R. Horspool, editor, Compiler Construction, volume 2304 of Lecture Notes in Computer Science, pages 185–213. Springer Berlin / Heidelberg, 2002.

[CC77] Patrick Cousot, Radhia Cousot – Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints (SIGPLAN 1977)