Stack switching in OCaml 5
Delimited continuations are implemented with dynamically allocated
stack. The compiler runtime uses malloc
to allocate space for a stack
and sets stack pointer to it everytime a match_with
or try_with
is
called. This introduces a new challenge: stack overflow checks.
Before OCaml 5, the runtime used system alloted stacks to manage stack frames. With heap allocated stacks, the runtime has to insert stack overflow checks in the function prologues (function prologues and epilogues are assembly/low-level code emitted by a compiler at the start and the end of a compiled function). But first, let's get some basics out of the way.
All OCaml programs are just C programs linked with OCaml modules OCaml is a high level language - none of OCaml values are exactly what they appear. On 64-bit machines, integers are 63 bits wide ie. only 63-bit long integers can be represents when a values is declared/inferred as integer type. Other types, like records etc., that are allocated on the heap are really just addresses in the heap with special bits for the garbage collector. The Real World OCaml has a great chapter on how OCaml values are represented.
What creates all these values and operates on them? The OCaml runtime is written in C. I find that Ghuloum's tutorial on compilers is great introduction to understanding how programs created with languages like OCaml work in the runtime. The tutorial implements a lisp, but OCaml runtime is very similar in this regard.
Here's a high-level overview.
C program's main()
calls a chain of functions that initialise a lot
of things - garbage collection, domain (threads) initialisation, CLI
argument array santisation and other utils. Domain initialisation also
initialises stacks and we revisit this topic later on.
After initialisation, C program calls an externally defined function
caml_start_program
. This is defined in the assembly part of the
runtime and is architecture and operating system specific. Why in asm?
C is, after all a high level language. It doesn't let us manipulate
stack pointers and other registers. Being able to do this crucial as
we'll soon realise.
Why is caml_program
- the entrypoint - created dynamically? Because
OCaml doesn't force programmer to define a entrypoint like C does. So,
programmers are free to define global (or you may call it "top-level")
expressions. camlprogram is a collection of all such global
expressions run together. Along with global expressions, it'll also
run some other initialisation that happens in the OCaml layer.
Delimited Continuations in the runtime
In an implementation (no matter the language), a delimited continuations are just a bunch of independent stacks which the execution can switch as necessary.
-
Stack initialisation
Threads are called domains in OCaml because 'threads' itself is a very overloaded term. Each domain maintains some state - one of which is it's stack.
As we learnt, we have two stacks: one for OCaml and one for C. Since they both share the same set of registers, the runtime also has to save-restore stack states to respect each languages calling conventions. You can see this in the
arm64.S
file of the runtime. Keeping this in mind, let's see how the OCaml stack is allocated and managed. The C stack is managed by the C compiler and isn't the focus of this post. We will however note how we prepare the registers before handing off the control to the C layer.It all starts with
domain.c
where a new domain gets created. Among other things in it's state, you find the stacks.domain_state->current_stack = caml_alloc_main_stack(stack_wsize);
// ...other things\ndomain_state->c_stack = NULL;caml_alloc_main_stack
is a wrapper aroundcaml_alloc_stack_noexc
which callsalloc_size_class_stack_noexc
- the indirections comes from stack caches and an internal policy to not raise OCaml exceptions in function names ending with _noexc.caml_call_realloc_stack
allocations are not expensive and rawmalloc()
but cached ones - from a cache of recently freed stacks (often referred to as stack cache). You can read more about such strategies of stack allocation for continutations in "From folklore to fact: comparing implementations of stacks and continuations". -
Fibers
Now that we know OCaml programs dont use a single program stack like they used to, but more than one, let's understand how they're managed. Crucial concept here is Fibers - not to be confused with the concurrency primitive.
Fibers are defined in the "Retrofitting effects.." paper as a tuple of frames and a handler. In
fiber.h
, there are organised with the struct,stack_info
.struct stack_info {
void* sp; /* stack pointer of the OCaml stack when suspended */
void* exception_ptr; /* exception pointer of OCaml stack when suspended */
struct stack_handler* handler; /* effect handling state for the fiber */
int cache_bucket;
size_t size; /* only used when USE_MMAP_MAP_STACK is defined */
uintnat magic;
int64_t id;
};And stackhandler contains the handlers for any effects or exceptions thrown for this stack - making
stack_info
the structure representing the state for a fiber.struct stack_handler {
value handle_value;
value handle_exn;
value handle_effect;
struct stack_info* parent; /* parent OCaml stack if any */
}; -
When are Fibers allocated?
They are allocated when,
- Programmer has created a new delimited continutation with trywith() or matchwith()
- We have run out of space on the current stack
The function
caml_alloc_stack
(visible to OCaml layer in effects.ml via allocstack) callsalloc_size_class_stack_noexc
which is an internal runtime function returning a new fiber (iestack_info
pointer).Notice, how
caml_alloc_stack
returns an OCaml valuevalue caml_alloc_stack (value hval, value hexn, value heff);
This is because it's the underlying c function of
alloc_stack
OCaml function ineffects.ml
.One can think of it as a OCaml layer friendly version of
alloc_size_class_stack_noexc
. -
How do Fibers relate to Stacks?
Fibers, or the tuple
(frames, handlers)
are set up on the heap-allocated stacks. Here's a description of the fiber layout on a stack as documented infiber.h
/* Stack layout for native code. Stack grows downwards.
*
* +------------------------+
* | struct stack_handler |
* +------------------------+ <--- Stack_high
* | caml_runstack / |
* | caml_start_program |
* +------------------------+
* | |
* . OCaml frames . <--- sp
* | |
* +------------------------+ <--- Stack_threshold
* | |
* . Red Zone .
* | |
* +------------------------+ <--- Stack_base
* | struct stack_info |
* +------------------------+ <--- Caml_state->current_stack
*/ -
How are frames created and added?
How OCaml creates stack frames and allocates registers is a big topic in itself, but here are the bullet points.
- OCaml has unified representation - a machine word can represent both integers and pointers. And OCaml makes it's best attempt to store them in the registers before calling the function (without necessarily creating a stack frame)
- When there aren't enough registers, OCaml pushes these function
arguments, that is the actual arguments, into the stack frames and
makes the most of whatever registers are available. To get the full
details of the available registers, we will have to look into the
file
arm64/proc.ml
, which contains information about the available registers, calling conventions and other architecture specific stuff. - OCaml doesn't use frame pointers (rbp/x29) like in C (by default). Frame pointers are used very often to create stack traces, help debugger unwind frames during exceptions.
For exceptions, OCaml maintains a linked list of stack pointers and program counters. For debuggers, OCaml generates DWARF expressions
The compiler does have a
+fp
variant where frame pointers are pushed as a part of the stack frame, but this is opt-in.- Functions being called are free to clobber the registers for their needs. That is, no callee-saved registers. This is true for exception handling too, which we'll discuss shortly.
-
Exceptions
Functions are simple and easy to understand as a mechanism for control flow. C's goto's are simpler but makes it hard to reason about the codebase.
Quick summary of functions: they view the code as a tree and try to do a depth-first traversal. They use stacks to create frames that manage the state of a given function as it executes, and allocate all the static, i.e. ahead-of-time, known memory requirements. They grow as we go deeper into the function call-tree and shrink in size as the control flow goes back to the root of the program.
Exceptions introduce the first level of non-local control flow, and in order to do so, "cut" stacks when exceptions are raised. Let's take a look at how this happens:
Compiling exceptions has two parts to it:
- Installing the handlers when a try/catch block is encountered.
- Unwinding the stack when a raise/throw is encountered.
During the installation phase, the compiler records the stack state which we can fall back to when the program throws the exception. HereÕs an example of the assembly generated.
For reference, in
meander.ml
we has,let omain v =
let { a; b } = ocaml_fn v (v + 1) in
Printf.printf "%d" (a + b);
try (* h1 *)
(try (ocaml_to_c ()) (* h2 *)
with E2 -> 0)
with E1 -> print_endline "In E1"; vAnd the generated assembly around the try/catch looks like the following:
adr x16, L108
str x26, [sp, -16]!
str x16, [sp, #8]
mov x26, sp
adr x16, L111
str x26, [sp, -16]!
str x16, [sp, #8]
.cfi_adjust_cfa_offset 16
mov x26, sp
orr x0, xzr, #1
.loc 1 17
adrp x8, _ocaml_to_c@GOTPAGE
ldr x8, [x8, _ocaml_to_c@GOTPAGEOFF]
bl _caml_c_callIn arm64,
x26
is the trap pointer (you can find the complete reference inarm64/proc.ml
). Trap pointer here refers to the address where exception handlers reside - i.e. the code that analyzes the throw exception and decides what action to take.For context,
L108
is where the exception handler for outertry
resides.The compiler.
- Pushes the last trap pointer to the stack
(sp - 16)!
- Pushes current try blocks corresponding exception handler to stack
- Saves a copy of current stack pointer as current trap pointer
All this is for the outer try block. Then, for the inner try block, it again,
- Saves inner exception handler, at
L111
, to x16 (temp register) - Pushes current trap pointer to stack
- Pushes L111Õs address (loaded in x16) onto the stack.
- Make the current stack pointer, the current trap pointer.
We note that, the runtime,
- Only concerns itself with where to find the last entered try block's exception handler
- What was the stack pointer state as we entered the last try block
- It has a linked list of exception handlers on the stack. And it knows
where to find the stack pointer to cut to right next to this linked list's nodes.
All this information is used by the
raise_exception
function defined in the assembly part of the runtime. Why assembly? Because if written in C, it would create it's own stack and not let us control it. We need to craft stack frames to our liking!CFI_STARTPROC
/* Test if backtrace is active */
ldr TMP, Caml_state(backtrace_active)
cbnz TMP, 2f
1:
JUMP_TO_TRAP_PTR
2: /* Zero backtrace_pos */
str xzr, Caml_state(backtrace_pos)
L(caml_reraise_exn_stash):
/* Preserve exception bucket in callee-save register x19 */
mov x19, x0
/* Stash the backtrace */
/* arg1: exn bucket, already in x0 */
mov x1, x30 /* arg2: pc of raise */
mov x2, sp /* arg3: sp of raise */
mov x3, TRAP_PTR /* arg4: sp of handler */
/* Switch to C stack */
ldr TMP, Caml_state(c_stack)
mov sp, TMP
bl G(caml_stash_backtrace)
/* Restore exception bucket and raise */
mov x0, x19
b 1b
CFI_ENDPROC
END_FUNCTION(caml_raise_exn)- Checks if it needs to collect backtrace. We'll skip this and instead focus on the stack management.
JUMP_TO_TRAP_PTR
is a macro - it cuts the stack to the point where the nearest exception handler is.caml_stash_backtrace
is a C function defined inbacktrace_nat.c
Lines afterJUMP_TO_TRAP_PTR
prepare the registers and the stack for this C function, in accordance with C's calling conventions.
Let's now look at
JUMP_TO_TRAP_PTR
.macro JUMP_TO_TRAP_PTR
/* Cut stack at current trap handler */
mov sp, TRAP_PTR
/* Pop previous handler and jump to it */
ldr TMP, [sp, 8]
ldr TRAP_PTR, [sp], 16
br TMP
.endm- Simply set the stack pointer to state stack pointer would be when executing the exception handler.
- Jump to the exception handler which is present on the stack -
(sp + 8)
- Before making the jump, set the next exception handler to
sp + 16
because it contains the parent exception handler of this enclosing try/catch block.
What happens at the exception handler? Back to
meander.s
! Recall, the try/catch blocks looked like this,try (* h1 *)
(try (ocaml_to_c ()) (* h2 *)
with E2 -> 0)
with E1 -> print_endline "In E1"; v
Inner most exception handler looks like this,
L111:
adrp x15, _camlMeander@GOTPAGE
ldr x15, [x15, _camlMeander@GOTPAGEOFF]
ldr x19, [x15, #8]
cmp x0, x19
b.ne L110
orr x0, xzr, #1
b L109x0
, at this point, contains the result ofocaml_to_c
call. We know, as we wroteocaml_to_c
, that it raises an exception. So unlike regular functions, it isn't going to return and shrink the stack. Will callcaml_raise_exn
and cut the stack to it's handler. This is how control gets passed this block of asm (at L111).x0
contains the raised value,E1
. The handler compares the received value against the exception declaration present in the global data section with the help of a offset table. If true, handler executes the handling expression (E2 -> 0
)We notice, just like functions, exceptions are light-weight. Languages like C/C++ often need to respect calling-conventions and have callee-saved registers to do this. In C/C++ caller expects certain registers to be un-clobbered whereas OCaml doesn't enforce this, which makes installing and running an exception handlers lighter, least compared to C/C++, by generating less prologue/epilogue code the save/restore states.
Notice also how the compiler pushes the program counter and the current stack pointer into the stack every time it encounters a try/block: very similar to what it would do when it encounters a function in a manner of speaking. Try catch are very similar but very similar to functions with a key difference being the linearity or how the linearity of the flow: functions always return and exceptions don't. When we see a function
a
calling b, callingc
, we can only return toa
afterb
andc
have finished and this is what we mean by linearity this is what we are referring to by linearity over here. On the other hand as we know exceptions jump from one place to another. Therefore exceptions have to create frames just like functions too but also discard frames which a function never has to do because function frames are going to grow and shrink in the last in first out order. Exceptions therefore have to associate the nearest try catch block with the exception being raised and this happens by every raise function knowing where the catch handler is installed in the memory and OCaml has a special pointer called the trap pointer to keep track or keep track of this exception nearest exception handler.What happens when raise exception doesn't match with any handler? Re-raise!
Effects
Effects, as we saw earlier, setup new stacks on the heap and switch to
it. You can see this your self in stdlib/effect.ml
let match_with comp arg handler = let effc eff k last_fiber = match handler.effc eff with | Some f -> cont_set_last_fiber k last_fiber; f k | None -> reperform eff k last_fiber in let s = alloc_stack handler.retc handler.exnc effc in runstack s comp arg
The user provided effc
is augmented with a continuation, k
, and
, last_fiber
, both of which are provided by the runtime function,
alloc_stack
.
Once a stack is allocated, it is used to run the computation, comp
with it's arguments, on the new stack with runstack
To understand the runtime functions, alloc_stack
and runstack
,
let's compile a simple program and examine it's assembly.
Here's a program whose assembly output we'll examine
type _ Effect.t += Hello : unit Effect.t
let main () =
Effect.perform Hello
let () =
let retc = Fun.id in
let exnc = raise in
let effc : type c. c Effect.t -> ((c, 'a) Effect.Deep.continuation -> 'a) option = function
| Hello -> Some (fun k -> print_endline "Hello"; Effect.Deep.continue k ())
| _ -> None
in
Effect.Deep.match_with main () { retc; exnc; effc }
Starting with main
, which only performs an effect,
_camlMain.main_277:
.cfi_startproc
sub sp, sp, #16
.cfi_adjust_cfa_offset 16
.cfi_offset 30, -8
str x30, [sp, #8]
L100:
ldr x16, [x28, #0]
sub x27, x27, #24
cmp x27, x16
b.lo L103
L102: add x1, x27, #8
movz x2, #2293, lsl #0
str x2, [x1, #-8]
orr x3, xzr, #1
str x3, [x1, #0]
orr x4, xzr, #1
str x4, [x1, #8]
adrp x5, _camlMain@GOTPAGE
ldr x5, [x5, _camlMain@GOTPAGEOFF]
ldr x0, [x5, #0]
ldr x30, [sp, #8]
add sp, sp, #16
.cfi_adjust_cfa_offset -16
b _caml_perform
.cfi_adjust_cfa_offset 16
L103: bl _caml_call_gc
L101: b L102
.cfi_endproc
- Grow the stack to push a new pointer
- Push
x30
, the program counter (seeasmcomp/arm64/proc.ml
for the full reference) - Load the young gc limit found in domain state (stored in
x28
) Sinceyoung_limit
is the first field of the struct inruntime/caml/domain_state.tbl
, the offet is#0
- Subtract alloc pointer (
x27
) by 3 bytes to accomodate the effect - Compare alloc pointer and young limit to check if garbage collection is to be triggered.
Performing an effect happens in the C realm, in caml_perform
. So, the rest of the
instructions are setting up the registers as per calling conventions.
From the runtime source, arm64.S
, we have the following comment.
/* x0: effect to perform
x1: continuation
x2: old_stack
x3: last_fiber */
So, caml_perform
expects these registers. So, before FFI'ing into
the C layer's caml_perform
, the compiler must setup this register in
main
And it does so after pushing the return address to th stack.
- Set the effect in
x0
adrp x5, _camlMain@GOTPAGE
ldr x5, [x5, _camlMain@GOTPAGEOFF]
ldr x0, [x5, #0]
-
Set the continuation, parent stack and parent fiber in
x1
,x2
andx3
respectively -
Call
caml_perform
Performed effect is passed to the effc
in a new stack and stack is switched
str xzr, Handler_parent(x9) /* Set parent of performer to NULL */
ldr TMP, Handler_effect(x9)
mov x2, x3 /* x2 := last_fiber */
mov x3, TMP /* x3 := effect handler */
b G(caml_apply3)
which basically calls the following in stdlib/effect.ml
let effc eff k last_fiber =
match handler.effc eff with
| Some f ->
cont_set_last_fiber k last_fiber;
f k
| None -> reperform eff k last_fiber
in
If parent handler is null,
ldr 9, Stack_handler(x2) /* x9 := old stack -> handler */
ldr x10, Handler_parent(x9) /* x10 := parent stack */
cbz x10, 1f
Then,
str xzr, Handler_parent(x9) /* Set parent of performer to NULL */
ldr TMP, Handler_effect(x9)
mov x2, x3 /* x2 := last_fiber */
mov x3, TMP /* x3 := effect handler */
b G(caml_apply3)
1:
/* switch back to original performer before raising Effect.Unhandled
(no-op unless this is a reperform) */
ldr x10, [x1] /* load performer stack from continuation */
sub x10, x10, 1 /* x10 := Ptr_val(x10) */
ldr x9, Caml_state(current_stack)
SWITCH_OCAML_STACKS x9, x10
ie raise unhandled exception
Now let's examine the runtime functions, runstack
and alloc_stack
.
runstack
runs our main
function, so let's look at it first.
From arm64.S
,
stp x29, x30, [sp, -16]!
Pushes frame and link register to the stack.
add x29, sp, #0
Set the frame register to current stack pointer.
Some self explanatory code,
/* save old stack pointer and exception handler */
ldr x8, Caml_state(current_stack) /* x8 := old stack */
mov TMP, sp
str TMP, Stack_sp(x8)
str TRAP_PTR, Stack_exception(x8)
/* Load new stack pointer and set parent */
ldr TMP, Stack_handler(x0)
str x8, Handler_parent(TMP)
str x0, Caml_state(current_stack)
ldr x9, Stack_sp(x0) /* x9 := sp of new stack */
Because, we'll be switch stacks shortly.
/* Create an exception handler on the target stack
after 16byte DWARF & gc_regs block (which is unused here) */
sub x9, x9, 32
adr TMP, L(fiber_exn_handler)
str TMP, [x9, 8]
Each new stack needs its own global exception handler
Another documented assembly,
/* Call the function on the new stack */
mov x0, x2
blr x3
How was this new stack created as we saw in stdlib/effect.ml
?
let s = alloc_stack handler.retc handler.exnc effc in
CAMLprim value caml_alloc_stack(value hval, value hexn, value heff)
{
value* sp;
const int64_t id = atomic_fetch_add(&fiber_id, 1);
struct stack_info* stack =
alloc_size_class_stack_noexc(caml_fiber_wsz, 0 /* first bucket */,
hval, hexn, heff, id);
if (!stack) caml_raise_out_of_memory();
sp = Stack_high(stack);
sp -= 1;
sp[0] = Val_long(1);
stack->sp = sp;
return Val_ptr(stack);
}
which leads to alloc_size_class_stack_noexc
- it allocates a new
stack double the size quickly from a cache of stacks.