Understanding Calling Conventions and Stack Frames

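When a function's arguments mix integers, pointers, and floats, the integers and pointers are assigned, in order, to rdi, rsi, rdx, rcx, r8, and r9 (remember them as "dsdc89"), and the floats are assigned, in order, to xmm0 through xmm7. It's that simple!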
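The rule can be written down as a small lookup, assuming every argument is a plain scalar. The helper below is hypothetical (the name assign_registers and the one-letter kind codes are made up for this sketch), and it ignores structs, long double, and variadic calls, which the real ABI treats separately.

```c
#include <stddef.h>

/* Integer and pointer arguments draw from this list, in order ("dsdc89"). */
static const char *const INT_REGS[6] = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};
/* Floating-point arguments draw from this list, in order. */
static const char *const SSE_REGS[8] = {"xmm0", "xmm1", "xmm2", "xmm3",
                                        "xmm4", "xmm5", "xmm6", "xmm7"};

/*
 * Hypothetical helper, for illustration only.
 * kinds holds one letter per argument:
 *   'i' = integer, 'p' = pointer, 'f' = float or double.
 * out[k] receives the register name for argument k, or "stack" once the
 * registers of that class are used up. Returns the number of arguments
 * classified (at most cap).
 */
size_t assign_registers(const char *kinds, const char *out[], size_t cap)
{
    size_t next_int = 0, next_sse = 0, n = 0;

    while (n < cap && kinds[n] != '\0') {
        if (kinds[n] == 'f')
            out[n] = (next_sse < 8) ? SSE_REGS[next_sse++] : "stack";
        else
            out[n] = (next_int < 6) ? INT_REGS[next_int++] : "stack";
        n++;
    }
    return n;
}
```

For a function such as f(long a, double x, char *p, float y), that is kinds "ifpf", the sketch yields rdi, xmm0, rsi, xmm1: the two counters advance independently of each other.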

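A callee-saved register is one whose value, from the caller's point of view, will not be destroyed by the callee: the value is the same before and after the call instruction, so the caller can keep using it without doing its own push and pop. A caller-saved register carries no such guarantee across a call; if the caller needs the value after the call, it must push and pop the register itself. From the callee's side, a leaf function should prefer the caller-saved registers, because it can use them without any push and pop. The argument registers rdi, rsi, rdx, rcx, r8, and r9 are all caller-saved. A callee-saved register, on the other hand, may end up being pushed and popped at every level of the call chain.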
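As a quick reference, the split for the general-purpose registers under the System V AMD64 ABI can be sketched as a lookup. The function name is_callee_saved is made up for this example, and the table covers only the 64-bit integer registers; any name not in it is treated as caller-saved.

```c
#include <stdbool.h>
#include <string.h>

/* Registers whose values survive a call under the System V AMD64 ABI. */
static const char *const CALLEE_SAVED[] = {
    "rbx", "rbp", "rsp", "r12", "r13", "r14", "r15"
};

/* Hypothetical helper: true if reg must be preserved by the callee. */
bool is_callee_saved(const char *reg)
{
    size_t count = sizeof CALLEE_SAVED / sizeof CALLEE_SAVED[0];

    for (size_t i = 0; i < count; i++) {
        if (strcmp(reg, CALLEE_SAVED[i]) == 0)
            return true;
    }
    return false;   /* rax, rcx, rdx, rsi, rdi, r8..r11 are caller-saved */
}
```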

Linux Stack Frame Layout on x64

According to the ABI, the first 6 integer or pointer arguments to a function are passed in registers. The first is placed in rdi, the second in rsi, the third in rdx, and then rcx, r8 and r9. Only the 7th argument and onwards are passed on the stack.

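long utilfunc(long a, long b, long c);  /* defined further down */

/* a..f arrive in rdi, rsi, rdx, rcx, r8, r9; g and h arrive on the stack. */
long myfunc(long a, long b, long c, long d,
            long e, long f, long g, long h)
{
    long xx = a * b * c * d * e * f * g * h;
    long yy = a + b + c + d + e + f + g + h;
    long zz = utilfunc(xx, yy, xx % yy);
    return zz + 20;
}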

So the first 6 arguments are passed via registers. But other than that, this doesn't look very different from what happens on x86, except this strange "red zone". What is that all about?

Put simply, the red zone is an optimization. Code can assume that the 128 bytes below rsp will not be asynchronously clobbered by signals or interrupt handlers, and thus can use it for scratch data, without explicitly moving the stack pointer. The last sentence is where the optimization lies - decrementing rsp and restoring it are two instructions that can be saved when using the red zone for data.

However, keep in mind that the red zone will be clobbered by function calls, so it's usually most useful in leaf functions (functions that call no other functions).

long utilfunc(long a, long b, long c)
{
    long xx = a + 2;
    long yy = b + 3;
    long zz = c + 4;
    long sum = xx + yy + zz;

    return xx * yy * zz + sum;
}

Since utilfunc only has 3 arguments, calling it requires no stack usage since all the arguments fit into registers. In addition, since it's a leaf function, gcc chooses to use the red zone for all its local variables. Thus, rsp need not be decremented (and later restored) to allocate space for this data.

The base pointer rbp (and its predecessor ebp on x86), being a stable "anchor" to the beginning of the stack frame throughout the execution of a function, is very convenient for manual assembly coding and for debugging. However, some time ago it was noticed that compiler-generated code doesn't really need it (the compiler can easily keep track of offsets from rsp), and the DWARF debugging format provides means (CFI) to access stack frames without the base pointer.

This is why some compilers started omitting the base pointer for aggressive optimizations, thus shortening the function prologue and epilogue, and providing an additional register for general-purpose use (which, recall, is quite useful on x86 with its limited set of GPRs).

gcc keeps the base pointer by default on x86, but allows the optimization with the -fomit-frame-pointer compilation flag. How recommended it is to use this flag is a debated issue - you may do some googling if this interests you.

Anyhow, one other "novelty" the AMD64 ABI introduced is making the base pointer explicitly optional, stating:

gcc adheres to this recommendation and by default omits the frame pointer on x64, when compiling with optimizations. It gives an option to preserve it by providing the -fno-omit-frame-pointer flag. For clarity's sake, the stack frames shown above were produced without omitting the frame pointer.

16-Byte Stack Alignment

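A 64-bit CPU does not mean that 8-byte accesses are always the fastest; 64 is only the size of the general-purpose registers. Besides, the x64 architecture already has 128-bit registers (the xmm registers). The consequences of a misaligned stack are lower performance and, in many calls (though not all), a segmentation fault, because compilers emit aligned SSE instructions such as movaps that fault on a misaligned address.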
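To keep rsp a multiple of 16, the space reserved for locals is typically rounded up to a multiple of 16. A minimal sketch of that rounding follows; align_up_16 is a name made up for this example, not something gcc exposes.

```c
#include <stddef.h>

/*
 * Round size up to the next multiple of 16.
 * 16 is a power of two, so adding 15 and masking off the low four bits
 * is enough; no division is needed.
 */
size_t align_up_16(size_t size)
{
    return (size + 15u) & ~(size_t)15u;
}
```

With this rounding, 12 bytes of locals still reserve 16 bytes, and 17 bytes reserve 32.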

On x64, the block returned by any memory allocation function (malloc, calloc, or realloc) must start at an address that is a multiple of 16.
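This can be checked from C. The sketch below compares the address of a malloc'ed block against alignof(max_align_t), the strictest fundamental alignment; with gcc on x64 Linux that value should be 16, though the code does not hard-code it. The function name malloc_block_is_aligned is made up for this example.

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Returns true if a freshly allocated block of `size` bytes starts at an
 * address that is a multiple of alignof(max_align_t).
 */
bool malloc_block_is_aligned(size_t size)
{
    void *p = malloc(size);

    if (p == NULL)
        return false;   /* allocation failed, nothing to check */

    bool ok = ((uintptr_t)p % alignof(max_align_t)) == 0;
    free(p);
    return ok;
}
```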

The Windows ABI on x64

https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/x64-architecture

Unlike the x86, the C/C++ compiler only supports one calling convention on x64. This calling convention takes advantage of the increased number of registers available on x64:

The calling convention for C++ is very similar: the this pointer is passed as an implicit first parameter. The next three parameters are passed in remaining registers, while the rest are passed on the stack.

Windows on x64 implements an ABI of its own, which is somewhat different from the AMD64 ABI. I will only discuss the Windows x64 ABI briefly, mentioning how its stack frame layout differs from AMD64. These are the main differences:

Another important change that was made in the Windows x64 ABI is the cleanup of calling conventions. No more cdecl/stdcall/fastcall/thiscall/register/safecall madness - just a single "x64 calling convention". Cheers to that!

https://stackoverflow.com/questions/4429398/why-does-windows64-use-a-different-calling-convention-from-all-other-oses-on-x86

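(Outdated) Calling Conventions: cdecl, stdcall, and fastcall

What follows are notes I took before learning the material above; I'm keeping them for now...

A calling convention specifies how, at the moments of function call and function return, the compiler uses the CPU's registers and the call stack in virtual memory to pass the function's arguments and to return from the call.

cdecl is the standard calling convention of C/C++ compilers, stdcall is the usual convention for WinAPI functions, and fastcall is seen mainly in the Windows kernel. This knowledge matters mostly to compilers; ordinary developers only need a general understanding.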

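In the 1970s, the American Dennis Ritchie invented the C language and used it to write UNIX, which made him the father of both C and the UNIX operating system. UNIX was very efficient and easy to modify, thanks to being written in C. As UNIX spread, C became a very popular language too. For UNIX to be efficient, the design of C itself had to aim at efficiency.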

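A C function call may need to pass several arguments, and they can be passed in registers or on the stack. You may ask: why not use registers for everything? Because a call can have many arguments, say 7 or 8, and the CPUs of the era in which C was invented had far too few registers for that. Today's ARM and MIPS CPUs have more registers (32-bit ARM, for example, has 13 general-purpose ones), so passing everything in registers mostly solves the problem there. (Passing arguments in registers is fast because accessing a register is many times faster than accessing memory.)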

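In that environment, C compilers passed function arguments on the stack. That solved the shortage of registers, and it solved another problem as well: the number of arguments need not be limited and can even vary from call to call.

Passing arguments on the stack settles the question of how many arguments there can be. The remaining question is the order in which the arguments are pushed.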

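It is like a PE teacher lining up a class: tallest to shortest, or shortest to tallest? C faced the same two choices for pushing arguments: left to right, matching the order in the source code, or right to left. With variable argument counts in mind, the designers of C chose right to left, which has two advantages. First, the callee reads its arguments from left to right, and with right-to-left pushing the leftmost argument is the one nearest the top of the stack, so it is always found at a fixed offset. Second, when the number of arguments is variable, the callee discovers at run time which arguments it needs, for example by scanning a format string, and fetches the next argument from the stack each time one is called for; the arguments sit on the stack in the same order as that scan.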
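In C source, this left-to-right walk over a variable argument list is written with the standard <stdarg.h> macros. A minimal sketch: sum_longs is a made-up example function, and its fixed parameter count plays the role that the format string plays for printf.

```c
#include <stdarg.h>

/* Sums the `count` long arguments that follow the fixed parameter. */
long sum_longs(int count, ...)
{
    va_list ap;
    long total = 0;

    va_start(ap, count);              /* begin after the last fixed parameter */
    for (int i = 0; i < count; i++)
        total += va_arg(ap, long);    /* fetch the next argument, left to right */
    va_end(ap);

    return total;
}
```

A call such as sum_longs(3, 1L, 2L, 3L) adds the three values in the order they were written. The same source also compiles on x64, where the compiler arranges for va_arg to read the first arguments from the registers they arrived in rather than from the stack.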

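What is described above is the standard cdecl calling convention. It relies entirely on the stack, so it is relatively inefficient. The x64 assembly that modern gcc produces by default does not appear to pass arguments entirely on the stack (the test code further down shows this).

The sequence is: push the arguments, call the function, run to the end, and ret. The x86 call instruction simply pushes the return address, and ret pops it (in effect writing the eip register, which cannot be modified directly). Because the arguments are pushed first and the return address after them, once ret has popped the return address the caller still needs one more instruction to restore the call stack to its state before the call, that is, an addition on the stack pointer esp. This is what is called cleaning the stack (rebalancing the stack).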

x86 also has the ret n instruction, which pops the return address and adds n to esp in the same step.
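The two cleanup styles can be compared with a toy model that tracks nothing but esp. This is only a sketch under simplifying assumptions: every argument is 4 bytes, and the type and function names are made up for the example. The first function models caller cleanup as in cdecl; the second models callee cleanup with ret n, which is how stdcall works.

```c
#include <stdint.h>

/* Toy model of a 32-bit x86 CPU: only the stack pointer is tracked. */
typedef struct { uint32_t esp; } Cpu;

static void push32(Cpu *cpu) { cpu->esp -= 4; }   /* push one 4-byte value */
static void call(Cpu *cpu)   { push32(cpu); }     /* call pushes the return address */
static void ret(Cpu *cpu)    { cpu->esp += 4; }   /* ret pops the return address */
static void ret_n(Cpu *cpu, uint32_t n) { cpu->esp += 4 + n; }  /* ret n also drops n bytes */

/* Caller cleanup: after ret, the caller adds 4 * nargs to esp. */
uint32_t esp_after_cdecl_call(uint32_t esp, uint32_t nargs)
{
    Cpu cpu = { esp };

    for (uint32_t i = 0; i < nargs; i++)
        push32(&cpu);
    call(&cpu);
    ret(&cpu);
    cpu.esp += 4 * nargs;   /* the extra "add esp, 4*nargs" in the caller */
    return cpu.esp;
}

/* Callee cleanup: ret n removes the arguments together with the return address. */
uint32_t esp_after_stdcall_call(uint32_t esp, uint32_t nargs)
{
    Cpu cpu = { esp };

    for (uint32_t i = 0; i < nargs; i++)
        push32(&cpu);
    call(&cpu);
    ret_n(&cpu, 4 * nargs);
    return cpu.esp;
}
```

Both functions return the esp they started with; the difference is only in who executes the adjustment.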

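Out of idle curiosity I ran a test with gcc and found that, with no special declarations at all, gcc does not pass arguments by pushing them on the stack. Here is an example with a call to a three-argument function: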

$ cat test.c
#include <stdio.h>

int add(int a, int b, int c) {
    return a+b+c;
}

int main(void) {
    int a = 1;
    int b = 2;
    int c = 3;
    int d = add(a,b,c);
    printf("%d\n", d);
    return 0;
}

0000000000401126 <add>:
  401126:       55                      push   %rbp
  401127:       48 89 e5                mov    %rsp,%rbp
  40112a:       89 7d fc                mov    %edi,-0x4(%rbp)
  40112d:       89 75 f8                mov    %esi,-0x8(%rbp)
  401130:       89 55 f4                mov    %edx,-0xc(%rbp)
  401133:       8b 55 fc                mov    -0x4(%rbp),%edx
  401136:       8b 45 f8                mov    -0x8(%rbp),%eax
  401139:       01 c2                   add    %eax,%edx
  40113b:       8b 45 f4                mov    -0xc(%rbp),%eax
  40113e:       01 d0                   add    %edx,%eax
  401140:       5d                      pop    %rbp
  401141:       c3                      ret

0000000000401142 <main>:
  401142:       55                      push   %rbp
  401143:       48 89 e5                mov    %rsp,%rbp
  401146:       48 83 ec 10             sub    $0x10,%rsp
  40114a:       c7 45 fc 01 00 00 00    movl   $0x1,-0x4(%rbp)
  401151:       c7 45 f8 02 00 00 00    movl   $0x2,-0x8(%rbp)
  401158:       c7 45 f4 03 00 00 00    movl   $0x3,-0xc(%rbp)
  40115f:       8b 55 f4                mov    -0xc(%rbp),%edx
  401162:       8b 4d f8                mov    -0x8(%rbp),%ecx
  401165:       8b 45 fc                mov    -0x4(%rbp),%eax
  401168:       89 ce                   mov    %ecx,%esi
  40116a:       89 c7                   mov    %eax,%edi
  40116c:       e8 b5 ff ff ff          call   401126 <add>
  401171:       89 45 f0                mov    %eax,-0x10(%rbp)
  401174:       8b 45 f0                mov    -0x10(%rbp),%eax
  401177:       89 c6                   mov    %eax,%esi
  401179:       bf 10 20 40 00          mov    $0x402010,%edi
  40117e:       b8 00 00 00 00          mov    $0x0,%eax
  401183:       e8 a8 fe ff ff          call   401030 <printf@plt>
  401188:       b8 00 00 00 00          mov    $0x0,%eax
  40118d:       c9                      leave
  40118e:       c3                      ret

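The assembly clearly uses the three registers edi, esi, and edx to pass the arguments. (If you want a gentle introduction to assembly, see "x86 Assembly Basics".)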

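I thought this over carefully, and my answer is: for functions that are not exposed to third parties, however the compiler compiles them is internal to the code, so the compiler is of course free to choose. For exposed interfaces, such as those of a dynamic library, a calling method must be agreed upon, and that is what a calling convention is. On x64 Linux that agreement is the AMD64 ABI described at the top of these notes, which is why gcc passes these three arguments in rdi, rsi, and rdx.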
