Calling Conventions

This article is Part 5 in a 5-Part Series.

Part 1 - Journey through the .NET internals - Sorting
Part 2 - List.Sort internals
Part 3 - Array.Sort && TrySZSort
Part 4 - Managed vs Unmanaged code and interop
Part 5 - This Article

Calling Convention
CPU, Machine code and instruction sets
Functions in assembly
CDECL and FASTCALL

In this blog post we will answer the question: what is a calling convention? A calling convention is like a contract that describes how functions call each other at the assembly level, using CPU instructions.

Calling Convention

It defines things like:

the way arguments are passed to a function
how values are returned
how the function name is decorated
who (caller or callee) cleans up the stack and registers

It specifies how (at a low level) the compiler will pass input parameters to the function and retrieve its results once it’s been executed.¹

CPU, Machine code and instruction sets

If we go down to the lowest level of code, we find machine code².

8B542408 83FA0077 06B80000 0000C383
FA027706 B8010000 00C353BB 01000000
B9010000 008D0419 83FA0376 078BD989
C14AEBF1 5BC3

This, by the way, is machine code that generates Fibonacci numbers. I wouldn’t be able to write it that way, but what matters is that at this lowest level it really doesn’t matter whether the code comes from C++, Java, Python or C#. Writing code this way would be an (almost) impossible task. That is why we have a higher abstraction on top of machine code: assembly language.

The example below is the same Fibonacci number generation code, but in assembly.

fib:
    mov edx, [esp+8]
    cmp edx, 0
    ja @f
    mov eax, 0
    ret
    
    @@:
    cmp edx, 2
    ja @f
    mov eax, 1
    ret
    
    @@:
    push ebx
    mov ebx, 1
    mov ecx, 1
    
    @@:
        lea eax, [ebx+ecx]
        cmp edx, 3
        jbe @f
        mov ebx, ecx
        mov ecx, eax
        dec edx
    jmp @b
    
    @@:
    pop ebx
    ret
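
For comparison, here is a minimal C sketch of the same routine, one more level of abstraction up. It is a hand-written translation of the assembly above, not compiler input that is guaranteed to produce exactly that listing; the variable names prev, curr and next are my own and stand in for ebx, ecx and eax.

```c
/* Mirrors the assembly listing: n plays the role of edx,
   prev/curr/next play the roles of ebx/ecx/eax. */
unsigned int fib(unsigned int n) {
    if (n == 0) return 0;              /* cmp edx, 0 / ja @f  */
    if (n <= 2) return 1;              /* cmp edx, 2 / ja @f  */

    unsigned int prev = 1, curr = 1, next;
    for (;;) {
        next = prev + curr;            /* lea eax, [ebx+ecx]  */
        if (n <= 3) break;             /* cmp edx, 3 / jbe @f */
        prev = curr;                   /* mov ebx, ecx        */
        curr = next;                   /* mov ecx, eax        */
        n--;                           /* dec edx             */
    }
    return next;                       /* result is left in eax */
}
```

With this definition, fib(10) evaluates to 55.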

This level is still very low. We operate very close to the CPU, using registers, the stack, and CPU instructions like mov or jmp. Every CPU family supports different registers and instructions³.

The first microprocessor⁴ had 46 instructions⁵. These days there are hundreds of them, as you can see in this list⁶. It all started with simple instructions, which were combined to build more complex operations. As these operations became common, CPU designers added them as new instructions, often designing CPUs to execute them more efficiently.

Then there is also the difference between ARM (RISC) and x86 (CISC) processors. The former have a smaller number of instructions and require fewer transistors, making them more power efficient⁷.

You can see the difference below.

x86
-----

repe cmpsb         /* repeat while equal compare string byte-wise */

ARM
-----

top:
ldrb r2, [r0], #1  /* load a byte from address in r0 into r2, increment r0 after */
ldrb r3, [r1], #1  /* load a byte from address in r1 into r3, increment r1 after */
subs r2, r3, r2    /* subtract r2 from r3 and put result into r2      */
beq  top           /* branch(/jump) if result is zero                 */

It is the same code, but on different CPU families with different instruction sets. Because of this difference you need to compile the code for a specific machine. If you are familiar with the Linux world, it is a pretty standard procedure to download the source code of a program and build it yourself, on your machine, for your machine’s specific context. More popular distributions have packages with pre-compiled binaries. Usually, when you go to the release page of some software (for example ripgrep⁸), you will see different binaries for different operating systems, kernels and CPU families. (By the way, ripgrep is an amazing replacement for grep.)
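
To make "compiled for a specific machine" a bit more concrete, here is a small C sketch that reports which instruction-set family a binary was built for. It assumes GCC or Clang, which predefine the macros used below; other compilers may use different names. The function name target_arch is hypothetical.

```c
/* Returns the instruction-set family this file was compiled for.
   Relies on macros predefined by GCC and Clang (an assumption;
   other compilers may spell them differently). */
const char *target_arch(void) {
#if defined(__x86_64__)
    return "x86-64";
#elif defined(__i386__)
    return "x86 (32-bit)";
#elif defined(__aarch64__)
    return "ARM (64-bit)";
#elif defined(__arm__)
    return "ARM (32-bit)";
#else
    return "unknown";
#endif
}
```

The answer is fixed at compile time: the same source file gives a different string, and entirely different machine code, depending on the target it was built for.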

This is partially why virtual machines were created on platforms like Java or .NET. They help with software portability: instead of compiling your code to a specific instruction set, you compile it to an intermediate language (IL or Java bytecode), which the virtual machine then compiles, usually lazily on the fly, for the specific machine it runs on. It automates the whole process of building the code for your machine.

Functions in assembly

At this low level we operate with CPU instructions. The concepts of a function, an argument, or a return value don’t exist. We can only use simple primitives like the accumulator, other registers, the stack, labels and CPU instructions. These primitives can be used to build more complex code and something similar to functions.

int sum(int a, int b) {
    return a + b;
}

This code is readable and it has the concepts of types (int), a function, arguments, the + operator, return and of course scope {}.

sum:                          <- label
  mov edx, DWORD PTR [esp+4]  <- move value a from stack to register
  mov eax, DWORD PTR [esp+8]  <- move value b from stack to register
  add eax, edx                <- add b to a
  ret                         <- return

When you compile this code to assembly, you get a different view, with things like the label sum:, the CPU instructions mov, add and ret, a stack access [esp+4], the stack pointer esp and the registers edx and eax. It is a completely different world.

Looking at this code you might ask:

OK, I see the ret instruction, which I assume is return, but how does it work?
Which value is returned?
If I call it, how will another function know how to get the value?

And that is why we have calling conventions: they create a contract that tells functions how to call each other.

Calling conventions can differ in many ways:

where the arguments are stored: registers or stack
where the result of the function call is put: stack, register or memory
who is responsible for stack clean-up, caller or callee (this affects the size of the assembly code: if the caller cleans up the stack, the compiler has to generate clean-up instructions next to every function call)
who is responsible for saving registers and restoring them to their previous state (from before the function was called)
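
One way to picture these differences is to write a convention down as plain data. The C sketch below is only an illustration: the struct, its field names and the helper functions are hypothetical, and it covers just the two 32-bit x86 conventions used as examples in this post (with fastcall in its two-register Microsoft/GCC form and 4-byte integer arguments only).

```c
typedef enum { CLEANUP_BY_CALLER, CLEANUP_BY_CALLEE } StackCleanup;

/* A calling convention described as data (illustrative model only). */
typedef struct {
    const char  *name;
    int          register_args;    /* leading integer arguments passed in registers */
    StackCleanup cleanup;          /* who removes stack arguments after the call */
    const char  *result_register;  /* where an int result is returned */
} CallingConvention;

static const CallingConvention CDECL_CC    = { "cdecl",    0, CLEANUP_BY_CALLER, "eax" };
static const CallingConvention FASTCALL_CC = { "fastcall", 2, CLEANUP_BY_CALLEE, "eax" };

/* Number of 4-byte arguments that end up on the stack. */
int stack_args(const CallingConvention *cc, int argc) {
    return argc > cc->register_args ? argc - cc->register_args : 0;
}

/* Bytes the caller must pop after the call returns (the `add esp, N` case). */
int caller_pop_bytes(const CallingConvention *cc, int argc) {
    return cc->cleanup == CLEANUP_BY_CALLER ? 4 * stack_args(cc, argc) : 0;
}

/* Bytes the callee pops when it returns (the `ret N` case). */
int callee_pop_bytes(const CallingConvention *cc, int argc) {
    return cc->cleanup == CLEANUP_BY_CALLEE ? 4 * stack_args(cc, argc) : 0;
}
```

For a two-argument cdecl function, caller_pop_bytes gives 8; for a three-argument fastcall function, callee_pop_bytes gives 4, because only the third argument travels on the stack.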

You can check the list of x86 calling conventions here ⁹. We will use cdecl and fastcall as an example.

CDECL and FASTCALL

If a function expects to be called using the cdecl convention, it expects:

the arguments to be on the stack
the caller to clean up the stack

If we then call this function using the fastcall convention, neither requirement will be met:

with fastcall the first arguments are passed in registers (two in the Microsoft and GCC variants, three in some others)
the stack won’t be cleaned up by the caller, as fastcall assumes that the callee is responsible for that

__attribute__((cdecl)) int cdecl(int a, int b) {
    return a * b;
}

int caller() {
    return cdecl(2, 3);
}

Source code ¹⁰.

This simple function multiplies two numbers. The function cdecl is marked with the cdecl attribute to force this calling convention (it is actually the default, so the attribute is not needed).

I am compiling this code with these flags:

-m32 - forces a 32-bit executable. Without this flag the calling convention attributes are ignored, because they only apply to 32-bit x86; x86-64 targets use a single, mostly register-based convention
-O0 - I don’t want to optimize this code, as with such a simple example -O1 replaces the call in the caller with a constant (2 * 3 = 6)
-fomit-frame-pointer - an optimization that removes frame pointers to make the asm code a bit simpler. (At the end of this post there is an example without this optimization, explained, if you are curious what the difference is.)

-fomit-frame-pointer
Don’t keep the frame pointer in a register for functions that don’t need one. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions. It also makes debugging impossible on some machines. ¹¹

It removes these instructions.

  - push ebp      <- save the caller's frame pointer on the stack
  - mov ebp, esp  <- point ebp at this function's stack frame (create the stack frame)
...
  - pop ebp       <- restore the caller's frame pointer before returning

This simplifies the code to this form.

cdecl:
  mov eax,  DWORD PTR [esp+4] -> move value 'a' from the stack to eax
  imul eax, DWORD PTR [esp+8] -> multiply eax by 'b' from the stack
                              -> in cdecl called function expects arguments on the stack
  ret

caller:
  push 3                      -> push `b` to the stack
  push 2                      -> push `a` to the stack
                              -> in cdecl arguments are pushed to the stack, right to left
  call cdecl
  add esp, 8                  -> 'clean up' the stack by moving the pointer
                              -> in cdecl caller cleans up the stack
  ret
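
The same sequence can be replayed on a toy machine written in C. This is only a model, under loud assumptions: the Machine struct, the push helper and the function names are hypothetical, the stack is an array of 4-byte slots, and a dummy value stands in for the return address that call pushes.

```c
#include <stdint.h>

/* Toy 32-bit machine: a downward-growing stack and one result register. */
typedef struct {
    int32_t stack[16];
    int     sp;        /* index of the top slot; the stack grows toward index 0 */
    int32_t eax;
} Machine;

static void push(Machine *m, int32_t v) { m->stack[--m->sp] = v; }

/* Callee: at entry, slot sp holds the return address,
   slot sp+1 holds a ([esp+4]) and slot sp+2 holds b ([esp+8]). */
static void cdecl_mul(Machine *m) {
    m->eax = m->stack[m->sp + 1] * m->stack[m->sp + 2];
}

int32_t cdecl_caller(Machine *m) {
    push(m, 3);        /* push 3: b, arguments go right to left */
    push(m, 2);        /* push 2: a */
    push(m, 0);        /* stand-in for the return address pushed by `call` */
    cdecl_mul(m);
    m->sp += 1;        /* `ret` pops the return address */
    m->sp += 2;        /* add esp, 8: the caller removes both arguments */
    return m->eax;     /* the result comes back in eax */
}
```

Starting from an empty stack (sp equal to 16), cdecl_caller returns 6 and leaves sp where it started, which is exactly what the caller's clean-up is for.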

For comparison, let’s look at fastcall.

__attribute__((fastcall)) int fastcall(int a, int b, int c) {
    return a * b * c;
}

int caller() {
    return fastcall(2, 3, 4);
}

Source code ¹².

I added a third parameter to show that only the first two arguments are passed through registers.

fastcall:
  sub esp, 8                   -> reserve place on the stack
  mov DWORD PTR [esp+4], ecx   -> move `a` to the stack
  mov DWORD PTR [esp], edx     -> move `b` to the stack
  mov eax, DWORD PTR [esp+4]   -> move `a` from stack to eax
  imul eax, DWORD PTR [esp]    -> multiply `a` on eax by `b`
  imul eax, DWORD PTR [esp+12] -> multiply by `c` on the stack
  add esp, 8                   -> clean up the stack
  ret 4                        -> return to the caller and move stack pointer cleaning up `c`

For simplicity we can reduce this code to the following.

fastcall:
  mov eax, ecx                 -> move `a` to eax
  imul eax, edx                -> multiply eax by `b`
  imul eax, DWORD PTR [esp+4]  -> multiply eax by `c` ([esp] holds the return address)
  ret 4

There is no need to reserve space on the stack, move values from registers to the stack and then read them back from the stack. The compiler probably does this for consistency.

Arguments are first saved in stack then fetched from stack, rather than be used directly. This is because the compiler wants a consistent way to use all arguments via stack access, not only one compiler does like that. ¹³

In the end, this is the code we will analyse.

fastcall:
  mov eax, ecx                -> move `a` to eax
  imul eax, edx               -> multiply `a` in eax by `b` in edx
                              -> in fastcall called function expects arguments in the registers
  imul eax, DWORD PTR [esp+4] -> multiply `a*b` by `c` on the stack; [esp] holds the return address, so `c` is at [esp+4]
                              -> in fastcall third parameter is on the stack
  ret 4                       -> return to the caller and clean up `c` from the stack
                              -> in fastcall called function is cleaning up the stack
caller:
  push 4                      -> move `c` to the stack
                              -> in fastcall third argument is passed using stack
  mov edx, 3                  -> move `b` to edx
  mov ecx, 2                  -> move `a` to ecx
                              -> in fastcall first 2 arguments are passed by registers
  call fastcall
  ret

So this is it: examples of the differences between fastcall and cdecl. What would happen if we mixed conventions? The example below shows what happens when a caller and callee do not abide by the same convention.

fastcall:
  mov eax, ecx    -> `fastcall` expects arguments on the registers
  imul eax, edx   
  ret
caller:
  push 3          -> but caller pushed arguments to the stack
  push 2         
  call fastcall
  add esp, 8 
  ret

fastcall still assumes that the arguments were passed through registers, and of course the registers will contain some data. It is just not the data passed by the caller, which used the cdecl convention and passed the arguments on the stack. This produces unexpected and hard-to-debug behaviour. That is why calling conventions are important. There is a long history behind them ¹⁴¹⁵¹⁶¹⁷.
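
The mismatch can be replayed on a toy machine in C. As before, this is a model under stated assumptions rather than real generated code: the Machine struct and all function names are hypothetical, the return address is left out to keep the sketch short, and the "stale" register values are whatever the test harness puts there.

```c
#include <stdint.h>

/* Toy 32-bit machine with a stack and three registers. */
typedef struct {
    int32_t stack[16];
    int     sp;                  /* index of the top slot; grows toward index 0 */
    int32_t eax, ecx, edx;
} Machine;

static void push(Machine *m, int32_t v) { m->stack[--m->sp] = v; }

/* fastcall callee with two arguments: reads ecx and edx, never the stack. */
static void fastcall_mul(Machine *m) {
    m->eax = m->ecx * m->edx;
}

/* Matching caller: a and b are placed in ecx and edx. */
int32_t fastcall_caller(Machine *m) {
    m->edx = 3;                  /* b */
    m->ecx = 2;                  /* a */
    fastcall_mul(m);
    return m->eax;
}

/* Mismatched caller: passes the arguments cdecl-style on the stack. */
int32_t mismatched_caller(Machine *m) {
    push(m, 3);                  /* b */
    push(m, 2);                  /* a */
    fastcall_mul(m);             /* multiplies whatever ecx and edx already held */
    m->sp += 2;                  /* add esp, 8 */
    return m->eax;
}
```

If ecx and edx happen to hold 7 and 5 from earlier work, mismatched_caller returns 35 instead of 6. Nothing crashes and the stack stays balanced, which is part of what makes this kind of bug hard to track down.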

MichalFranc
