[转]Linux 内核使用浮点问题

一、硬浮点与软浮点

1. 硬浮点

编译器将代码直接编译成硬件浮点协处理器(浮点运算单元FPU)能识别的指令，这些指令在执行的时候ARM核直接把它转给协处理器执行。FPU 通常有一套额外的寄存器来完成浮点参数传递和运算。使用实际的硬件浮点运算单元(FPU)会带来性能的提升

2. 软浮点

编译器把浮点运算转成浮点运算的函数调用和库函数调用(即用整数运算模拟浮点运算)，没有FPU的指令调用，也没有浮点寄存器的参数传递。浮点参数的传递也是通过ARM寄存器或者堆栈完成。现在的Linux系统默认编译选择使用hard-float,如果系统没有任何浮点处理器单元，这就会产生非法指令和异常。因而一般的系统镜像都采用软浮点以兼容没有VFP的处理器。

3. ARM软浮点与硬浮点编译

出于低功耗、封装限制等种种原因，以前的一些ARM处理器没有独立的硬件浮点运算单元，需要用软件来实现浮点运算。随着技术发展，现在高端的ARM处理器基本都具备了硬件执行浮点操作的能力。新旧两种架构之间的差异，就产生了两个不同的接口 – 软浮点与硬浮点。

编译选项：

-mfpu =name（neon or vfp）指定FPU 单元

-mfloat-abi= name（soft、hard、 softfp）：指定软件浮点或硬件浮点或兼容软浮点调用接口

-mfpu可以指定使用FPU单元类型，以ARM A76为例有两种：VFP和Neon

VFP与Neon

VFP (vector floating-point)

VFP是一个按顺序工作的浮点协处理器, 它对一组输入执行一个操作并返回一个输出。目的是加快浮点计算,支持单精度和双精度浮点。
NEON

Neon是SIMD (Single Instruction Multiple Data) ，支持单指令多数据操作，意味着在执行一条指令期间，将对多达16个数据集并行执行相同的操作，支持整数和单精度浮点数向量化(并行)操作。
二者区别

VFP 支持单精度和双精度浮点，顺序执行，目的加快浮点计算

Neon 是SIMD，支持单指令多数据操作，支持整数和单精度浮点数向量化(并行)操作，不支持双浮点。Neon最大的好处是如果想要执行矢量操作，如对视频的编码/解码，它可以并行执行单精度浮点（float）操作。从而提高对视频编码/解码性能

VFP 和Neon是共用浮点寄存器，只是执行的指令不同。

soft、hard、 softfp

在有fpu的情况下，可以通过-mfloat-abi来指定使用哪种，有如下三种值：

soft：不用fpu计算，即使有fpu浮点运算单元也不用。
armel：(arm eabi little endian) 也即softfp，用fpu计算（即会将浮点运算编译成对应的浮点指令），但是传参数用普通寄存器传，这样中断的时候，只需要保存普通寄存器，中断负荷小，但是参数需要转换成浮点的再计算。
armhf：(arm hard float)也即hard，用fpu计算，传参数用fpu中的浮点寄存器传，省去了转换性能最好，但是中断负荷高。

而arm64，64位的arm默认就是hard float的，因此不需要hf的后缀。kernel、rootfs和app编译的时候，指定的必须保持一致才行。

使用softfp模式，会存在不必要的浮点到整数、整数到浮点的转换。而使用hard模式，在每次浮点相关函数调用时，平均能节省20个CPU周期。

4. ARM启用硬浮点运算

需要使用硬浮点，需要内核开启对硬浮点支持，同时编译时要使用上面的softfp或hard，才可以使用FPU/Neon进行计算。

二、在内核代码中使用浮点问题

1. ARM64的用户空间进程切换 ---- 浮点寄存器切换

ARM64，默认使用的是硬浮点，在进程切换时会涉及对浮点寄存器的操作。

对于ARM64而言，进程切换的context包括：

（1）通用寄存器

（2）浮点寄存器

（3）地址空间寄存器（ttbr0_el1和ttbr1_el1）

（4）其他寄存器（ASID、thread process ID register等）

Path:arch/arm64/kernel/process.c
/*
 * Thread switching.
 */
__notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
                struct task_struct *next)
{
    struct task_struct *last;

    fpsimd_thread_switch(next); 
    /*fp是float-point的意思，和浮点运算相关。fpsimd_thread_switch
    其实就是把当前FPSIMD的状态保存到了内存中（task.thread.fpsimd_state），
    从要切入的next进程描述符中获取FPSIMD状态，并加载到CPU上。*/
    tls_thread_switch(next);
    hw_breakpoint_thread_switch(next);
    contextidr_thread_switch(next);
    ........
    return last;
}
---------------------------------------
Path:arch/arm64/kernel/fpsimd.c
void fpsimd_thread_switch(struct task_struct *next)
{
    ......
    /* Save unsaved fpsimd state, if any: */
    fpsimd_save(); //这里将struct fpsimd_state保存到通用寄存器X0中
    ......
}
-----------------------------------------
Path:arch/arm64/kernel/entry-fpsimd.S
/* Save the FP registers. x0 - pointer to struct fpsimd_state*/
ENTRY(fpsimd_save_state)
    fpsimd_save x0, 8
    ret
ENDPROC(fpsimd_save_state)
1234567891011121314151617181920212223242526272829303132333435

2. 内核使用浮点

关于内核使用浮点， Robert Love’s 《Linux Kernel Development》(Linux内核设计与实现)书中提到

No (Easy) Use of Floating Point

When a user-space process uses floating-point instructions, the kernel manages the transition from integer to floating point mode. What the kernel has to do when using floating-point instructions varies by architecture, but the kernel normally catches a trap and then initiates the transition from integer to floating point mode.

Unlike user-space, the kernel does not have the luxury of seamless support for floating point because it cannot easily trap itself. Using a floating point inside the kernel requires manually saving and restoring the floating point registers, among other possible chores. The short answer is: Don’t do it! Except in the rare cases, no floating-point operations are in the kernel.

后面这句话提到如果内核使用浮点则需要保存恢复浮点寄存器等其他杂项，这样会导致内核性能下降，所以一般不建议使用浮点，除非特殊情况。

但是Robert Love’s在这里并没有讨论如何在内核中正确使用浮点，以及未正确使用浮点可能会出现什么问题。当你需要在内核中使用浮点，如果按照用户空间的写法可能会出现一些意想不到的情况。如：程序Crash，内存越界、访问非法内存地址，浮点计算出错等稀奇古怪的问题。出现这种问题的原因是：内核由于性能原因，在内核运行的代码，内核在进行上下文切换时，不会主动保存和恢复浮点寄存器。这样可能会导致内核在进行浮点运算时，可能会破坏此时用户空间的浮点寄存器状态，导致用户空间的fpsimd_state状态异常，随后程序的行为将变的不可控。

如何在内核使用浮点

在内核中使用浮点，在不同架构下会有不同操作流程。这部分需要查阅内核文档，如在X86上要用 kernel_fpu_begin()/kernel_fpu_end()，在arm上用 kernel_neon_begin()/ kernel_neon_end()。但并非在使用浮点时简单的使用上面代码就可以，以arm为例，内核文档里介绍了为何要使用该函数，以及如何使用

https://www.kernel.org/doc/Documentation/arm/kernel_mode_neon.rst
================
Kernel mode NEON
================

TL;DR summary
-------------
* Use only NEON instructions, or VFP instructions that don't rely on support
  code
* Isolate your NEON code in a separate compilation unit, and compile it with
  '-march=armv7-a -mfpu=neon -mfloat-abi=softfp'
* Put kernel_neon_begin() and kernel_neon_end() calls around the calls into your
  NEON code
* Don't sleep in your NEON code, and be aware that it will be executed with
  preemption disabled

Introduction
------------
It is possible to use NEON instructions (and in some cases, VFP instructions) in
code that runs in kernel mode. However, for performance reasons, the NEON/VFP
register file is not preserved and restored at every context switch or taken
exception like the normal register file is, so some manual intervention is
required. Furthermore, special care is required for code that may sleep [i.e.,
may call schedule()], as NEON or VFP instructions will be executed in a
non-preemptible section for reasons outlined below. 
//这里就说明了在内核运行的代码，内核在进行上下文切换时，不会主动保存和恢复浮点寄存器。

Interruptions in kernel mode
----------------------------
For reasons of performance and simplicity, it was decided that there shall be no
preserve/restore mechanism for the kernel mode NEON/VFP register contents. This
implies that interruptions of a kernel mode NEON section can only be allowed if
they are guaranteed not to touch the NEON/VFP registers. For this reason, the
following rules and restrictions apply in the kernel:
* NEON/VFP code is not allowed in interrupt context;
* NEON/VFP code is not allowed to sleep;
* NEON/VFP code is executed with preemption disabled.
1234567891011121314151617181920212223242526272829303132333435363738

这里提到在使用 kernel_neon_begin()/ kernel_neon_end()规则：

NEON/VFP code 不允许在中断中使用
NEON/VFP code 不允许睡眠
NEON/VFP code 执行时是禁止抢占的

ARM架构内核使用浮点示例

下面分别为三段计算代码，其执行的结果是相同的。
ARM64 寄存器
x0 - x30 是31个通用整形寄存器 , w0 ~ w30来访问这31个64位寄存器的低32位，写入时会将高32位清零
V0 - V31 浮点寄存器(向量寄存器)，用于处理SIMD和浮点运算。长度不同称谓也不同，b，h，s，d，q，分别代表byte(8位)，half(16位)，single(32位)，double(64位)，quad(128位)。
------------------
int kernl_float_test_1(int test,int ratio){
    int result;
    result = test*2.6/ratio - test/20;
    return result;
}
ARM汇编：
    4084:       1e620280        scvtf   d0, w20 //将寄存器w20中的定点数转化为浮点数，存储在浮点寄存器d0中
    4088:       fd400101        ldr     d1, [x8] //浮点乘. dx为浮点寄存器，长度64(double)
    408c:       529999a8        mov     w8, #0xcccd                     // #52429
    4090:       72b99988        movk    w8, #0xcccc, lsl #16
    4094:       9ba87e88        umull   x8, w20, w8
    4098:       1e610800        fmul    d0, d0, d1 
    409c:       1e620301        scvtf   d1, w24
    40a0:       d364fd08        lsr     x8, x8, #36
    40a4:       1e611800        fdiv    d0, d0, d1  //浮点除
    40a8:       1e620101        scvtf   d1, w8
    40ac:       1e613800        fsub    d0, d0, d1
    40b0:       1e780014        fcvtzs  w20, d0    //将浮点数转化为定点数

-------------------------------------------------------------------------
#include <asm/neon.h>
int kernl_float_test_2(int test,int ratio){
    int result;
    if (!may_use_simd()) {
     //判断是否可以使用浮点，如果为判断在执行kernel_neon_begin可能会死机
        return;
    }
    kernel_neon_begin();
    result = test*2.6/ratio - test/20;
    kernel_neon_end();
    return result;
}
--------------------------------------
Path:arch/arm64/kernel/fpsimd.c
/*
 * kernel_neon_begin(): obtain the CPU FPSIMD registers for use by the calling
 * context
 * Must not be called unless may_use_simd() returns true.
 * Task context in the FPSIMD registers is saved back to memory as necessary.
 * A matching call to kernel_neon_end() must be made before returning from the
 * calling context.
 * The caller may freely use the FPSIMD registers until kernel_neon_end() is
 * called.
 */
void kernel_neon_begin(void)
{
    if (WARN_ON(!system_supports_fpsimd()))
        return;
    BUG_ON(!may_use_simd());
    get_cpu_fpsimd_context();
    /* Save unsaved fpsimd state, if any: */
    fpsimd_save();
    /* Invalidate any task state remaining in the fpsimd regs: */
    fpsimd_flush_cpu_state();
}
EXPORT_SYMBOL(kernel_neon_begin);
----------------------------------------
ARM汇编：
    4088:       94000000        bl      0 <kernel_neon_begin>
    408c:       90000008        adrp    x8, 0 <>
    4090:       1e620280        scvtf   d0, w20
    4094:       fd400101        ldr     d1, [x8]
    4098:       529999a8        mov     w8, #0xcccd                     // #52429
    409c:       72b99988        movk    w8, #0xcccc, lsl #16
    40a0:       9ba87e88        umull   x8, w20, w8
    40a4:       1e610800        fmul    d0, d0, d1
    40a8:       1e620301        scvtf   d1, w24
    40ac:       d364fd08        lsr     x8, x8, #36
    40b0:       1e611800        fdiv    d0, d0, d1
    40b4:       1e620101        scvtf   d1, w8
    40b8:       1e613800        fsub    d0, d0, d1
    40bc:       1e780014        fcvtzs  w20, d0
    40c0:       94000000        bl      0 <kernel_neon_end>

0000000000001580 <kernel_neon_begin>:
    1580:       f81d0ff5        str     x21, [sp,#-48]!
    ............................
    15d4:       37a00628        tbnz    w8, #20, 1698 <kernel_neon_begin+0x118>
    15d8:       b9404a68        ldr     w8, [x19,#72]
    15dc:       90000014        adrp    x20, 8 <fpsimd_save+0x8>

0000000000000000 <fpsimd_save>:
       0:       a9be4ff4        stp     x20, x19, [sp,#-32]!
       4:       a9017bfd        stp     x29, x30, [sp,#16]
      ..............
      68:       37b80088        tbnz    w8, #23, 78 <fpsimd_save+0x78>
      6c:       aa1303e0        mov     x0, x19
      70:       94000000        bl      0 <fpsimd_save_state>

0000000000000000 <fpsimd_save_state>: //这里就保存了浮点寄存器状态到X0
   0:   ad000400        stp     q0, q1, [x0]
   4:   ad010c02        stp     q2, q3, [x0,#32]
   8:   ad021404        stp     q4, q5, [x0,#64]
   c:   ad031c06        stp     q6, q7, [x0,#96]
  10:   ad042408        stp     q8, q9, [x0,#128]
  14:   ad052c0a        stp     q10, q11, [x0,#160]
  18:   ad06340c        stp     q12, q13, [x0,#192]
  1c:   ad073c0e        stp     q14, q15, [x0,#224]
  20:   ad084410        stp     q16, q17, [x0,#256]
  24:   ad094c12        stp     q18, q19, [x0,#288]
  28:   ad0a5414        stp     q20, q21, [x0,#320]
  2c:   ad0b5c16        stp     q22, q23, [x0,#352]
  30:   ad0c6418        stp     q24, q25, [x0,#384]
  34:   ad0d6c1a        stp     q26, q27, [x0,#416]
  38:   ad0e741c        stp     q28, q29, [x0,#448]
  3c:   ad8f7c1e        stp     q30, q31, [x0,#480]!
  40:   d53b4428        mrs     x8, fpsr
  44:   b9002008        str     w8, [x0,#32]
  48:   d53b4408        mrs     x8, fpcr
  4c:   b9002408        str     w8, [x0,#36]
  50:   d65f03c0        ret

---------------------------------------------------------------------
int kernl_float_test_3(int test,int ratio){
    int result;
    result = (test*26)/(ratio*10) - test/20;
    return result;
}
ARM汇编：
    4084:       529999a9        mov     w9, #0xcccd                     // #52429
    4088:       72b99989        movk    w9, #0xcccc, lsl #16
    408c:       1b087e88        mul     w8, w20, w8
    4090:       0b180b0a        add     w10, w24, w24, lsl #2
    4094:       9ba97e89        umull   x9, w20, w9
    4098:       531f794a        lsl     w10, w10, #1
    409c:       1aca0908        udiv    w8, w8, w10
    40a0:       d364fd29        lsr     x9, x9, #36
    40a4:       4b090114        sub     w20, w8, w9
------------------
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134

上面三段计算代码，其结果是相同的，但是在用户空间和内核空间的执行产生效果可能不同，执行效率不同：

kernl_float_test_1 在用户空间和内核空间都可以编译通过，在用户空间使用没有任何问题，但是在内核虽然可以编译通过，但使用时可能会破坏用户空间的浮点寄存器的状态
kernl_float_test_2 内核使用浮点，必须按照这种方式使用，浮点计算安全，但是效率不高，可以看到汇编代码需要执行对浮点寄存器进行保存，相比较于保存恢复浮点寄存器状态开销，使用浮点运算有点不划算。
kernl_float_test_3 内核和用户空间使用都没有问题，把浮点转化为整型计算，执行效率高于前两种

在什么情况下适合启用内核Neon/VFP

由前面可以了解到Neon是SIMD (Single Instruction Multiple Data) ，支持单指令多数据操作。也就是说不仅仅是浮点运算可以使用Neon，当你使用大量连续计算时，可以使用Neon对算法加速。其实内核中的加密算法基本都是了Neon加速，如sha1、sha2…sha512、aes、chacha20等。加密源码在 arch/arm64/crypto/