A friend who works on video processing told me that converting YUV420 to RGB had become his performance bottleneck.
My first reaction was disbelief.
To squeeze out more performance he then moved the CPU algorithm to the GPU (an OpenGL shader), but since he is not familiar with OpenGL, the result was actually slower. Things were getting awkward.
So I decided to run some tests myself.
I confirmed his workload: 16 video streams at 15 fps, with a combined resolution of 1920x1080, so each stream averages 480x270. His CPU is an Intel i5 8500 (6 cores, 6 threads).
My results are below (my CPU is an Intel i7 8700, 6 cores / 12 threads).
Time to convert a single 1920x1080 image 1000 times (seconds):
Threads | Simple | Integer | SSE2 |
---|---|---|---|
1 | 10.6 | 6.09 | 2.27 |
2 | 5.54 | 3.43 | 1.41 |
3 | 4.47 | 2.63 | 1.14 |
4 | 3.75 | 2.04 | 0.9 |
6 | 2.71 | 1.7 | 0.9 |
8 | 2.18 | 1.41 | 0.91 |
12 | 1.82 | 1.25 | 1.03 |
Time to convert a single 480x270 image 1000 times (seconds):
Threads | Simple | Integer | SSE2 |
---|---|---|---|
1 | 0.65 | 0.38 | 0.14 |
2 | 0.33 | 0.23 | 0.09 |
3 | 0.26 | 0.14 | 0.09 |
4 | 0.21 | 0.12 | 0.09 |
6 | 0.18 | 0.12 | 0.08 |
8 | 0.14 | 0.09 | 0.09 |
12 | 0.12 | 0.08 | 0.08 |
Observation: one 1920x1080 image and sixteen 480x270 images take roughly the same time; there is no multiple-fold difference.
My friend uses ffmpeg. I was too lazy to benchmark it myself, so I borrowed someone else's results from the web:
https://stackoom.com/question/3ejar/sws-scale-vs-libyuv%E6%80%A7%E8%83%BD-%E9%80%9F%E5%BA%A6
He used NV12, which is essentially a variant of YUV420P: U and V are stored together in one interleaved plane, which improves cache friendliness and gives a certain performance advantage.
I used plain YUV420P.
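For reference, here is a minimal sketch of the two layouts (the struct and field names are mine, purely for illustration): both formats store a full-resolution Y plane, but YUV420P keeps U and V in two separate quarter-size planes, while NV12 stores a single quarter-size plane of interleaved U/V pairs, so a U and its matching V always arrive together.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical containers, for illustration only (not from any real API).
struct Yuv420pFrame {              // planar: three separate planes
    std::vector<uint8_t> y;        // width * height bytes
    std::vector<uint8_t> u;        // (width/2) * (height/2) bytes
    std::vector<uint8_t> v;        // (width/2) * (height/2) bytes
};

struct Nv12Frame {                 // semi-planar: Y plane + one interleaved UV plane
    std::vector<uint8_t> y;        // width * height bytes
    std::vector<uint8_t> uv;       // (width/2) * (height/2) * 2 bytes: U0 V0 U1 V1 ...
};

// Chroma of pixel (x, y), assuming tightly packed planes of width w:
//   YUV420P: u[(y/2) * (w/2) + x/2] and v[(y/2) * (w/2) + x/2]  -> two separate streams
//   NV12:    uv[((y/2) * (w/2) + x/2) * 2] and the byte after it -> one stream
```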
Going by his timings: my image has (1920*1080) / (1280*692) ≈ 2.34 times as many pixels as his.
Assuming his machine is roughly comparable to mine and that he ran single-threaded, his timings need to be multiplied by 2.34, which gives:
Time to convert a single 1920x1080 image 1000 times (extrapolated, so not necessarily accurate; seconds):
Threads | ffmpeg | libyuv |
---|---|---|
1 | 0.003 * 2.34 * 1000 = 7.02 | 0.045 * 2.34 * 1000 = 105.3 |
On average ffmpeg takes about 7 ms per conversion; at 15 frames per second that is roughly 105 ms of work per second, i.e. around 10% of one core.
That is not a trivial cost.
My SSE2 version is faster than ffmpeg: about 2.27 ms per frame single-threaded, which works out to roughly 3.5% of one core at 15 fps.
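Spelling out the arithmetic (the per-frame numbers are taken from the tables above; this is purely illustrative):

```cpp
#include <cstdio>

int main()
{
    // Estimated share of one core = per-frame cost (ms) * frame rate / 1000 ms, as a percentage.
    const double fps = 15.0;
    const double ffmpeg_ms_per_frame = 7.02; // extrapolated single-thread cost per 1920x1080 frame
    const double sse2_ms_per_frame   = 2.27; // my SSE2 version, single thread, from the first table
    printf("ffmpeg: %.1f%% of one core\n", ffmpeg_ms_per_frame * fps / 10.0); // ~10.5%
    printf("SSE2:   %.1f%% of one core\n", sse2_ms_per_frame * fps / 10.0);   // ~3.4%
    return 0;
}
```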
Overall conclusions:
- One 1920x1080 image and sixteen 480x270 images take roughly the same time; no multiple-fold difference.
- My code appears to be faster than both ffmpeg and libyuv.
- The performance gap between the three versions is clear, especially with few threads. My SSE2 experience is also limited; someone more experienced could probably optimize it further.
- In the SSE2 version, adding more threads stops helping beyond a point. Giving each video stream its own dedicated thread is worth considering (see the sketch after this list).
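For the 16-stream case, here is a minimal sketch of the one-thread-per-stream idea using std::thread. The Stream struct and convert_all_streams are hypothetical names; it assumes the yuv420_to_rgb_openmp_sse function from the test code below, with OpenMP limited to one thread per call so the 16 workers do not oversubscribe the CPU.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// SSE2 converter from the test code below.
void yuv420_to_rgb_openmp_sse(int width, int height,
                              const uint8_t *y_data, int y_stride,
                              const uint8_t *u_data, int u_stride,
                              const uint8_t *v_data, int v_stride,
                              uint8_t *rgba_data, int rgba_stride);

// Hypothetical per-stream state; in a real program each stream would own its
// decoded YUV planes and its destination RGBA buffer.
struct Stream {
    int width, height;
    const uint8_t *y, *u, *v;
    int y_stride, u_stride, v_stride;
    uint8_t *rgba;
    int rgba_stride;
};

// Convert one frame of every stream, one dedicated thread per stream.
void convert_all_streams(std::vector<Stream> &streams)
{
    std::vector<std::thread> workers;
    workers.reserve(streams.size());
    for (size_t i = 0; i < streams.size(); ++i)
    {
        Stream *s = &streams[i]; // pointer captured by value so the lambda stays valid
        workers.emplace_back([s]() {
            // Each call should run single-threaded here (e.g. omp_set_num_threads(1)),
            // otherwise every worker would spawn its own OpenMP team.
            yuv420_to_rgb_openmp_sse(s->width, s->height,
                                     s->y, s->y_stride,
                                     s->u, s->u_stride,
                                     s->v, s->v_stride,
                                     s->rgba, s->rgba_stride);
        });
    }
    for (std::thread &t : workers)
        t.join();
}
```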
The test setup is as follows.
- For simplicity, OpenMP is used for multithreading.
- Original version: the yuv420_to_rgb function
- Simple version: the yuv420_to_rgb_openmp function
- Integer version: the yuv420_to_rgb_openmp_int function
- SSE2 version: the yuv420_to_rgb_openmp_sse function
- The main flow is in the init function, which
- first fabricates a block of RGBA data,
- then converts it to YUV420P,
- then repeatedly calls each of the conversion functions above to turn the YUV420P back into RGBA, timing them,
- and finally, to verify correctness, displays the resulting RGBA on screen (via OpenGL).
The full test code:
```cpp
#include <GL/freeglut.h>
#define GL_CLAMP_TO_EDGE 0x812F
#ifndef _OPENMP
# error "OpenMP is required"
#endif
#include <omp.h>
#include <emmintrin.h> // SSE2 intrinsics
#include <malloc.h>    // _aligned_malloc / _aligned_free (MSVC)
#include <cstring>     // memset
#include <ctime>
#include <cstdio>
#include <cstdlib>
#include <cstdint>
uint8_t clamp_float_to_byte(float val)
{
val += 0.5f;
if (val >= 255.0f) return 255;
if (val <= 0.0f) return 0;
return (uint8_t)val;
}
uint8_t clamp_int_to_byte(int val)
{
if (val >= 255) return 255;
if (val <= 0) return 0;
return (uint8_t)val;
}
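// Reference RGBA -> YUV420P conversion; only used in init() to fabricate the test input.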
void rgb_to_yuv420(int width, int height,
const uint8_t *rgba_data, int rgba_stride,
uint8_t *y_data, int y_stride,
uint8_t *u_data, int u_stride,
uint8_t *v_data, int v_stride)
{
for (int pos_y = 0; pos_y < height; ++pos_y)
{
const uint8_t *rgba_line = rgba_data + pos_y * rgba_stride;
uint8_t *y_line = y_data + pos_y * y_stride;
uint8_t *u_line = u_data + pos_y / 2 * u_stride;
uint8_t *v_line = v_data + pos_y / 2 * v_stride;
for (int pos_x = 0; pos_x < width; ++pos_x)
{
float r = rgba_line[pos_x * 4];
float g = rgba_line[pos_x * 4 + 1];
float b = rgba_line[pos_x * 4 + 2];
// alpha (rgba_line[pos_x * 4 + 3]) is ignored here
float y = 0.299f * r + 0.587f * g + 0.114f * b;
float u = -0.169f * r + -0.331f * g + 0.5f * b + 128.0f;
float v = 0.5f * r + -0.419f * g + -0.081f * b + 128.0f;
y_line[pos_x] = clamp_float_to_byte(y);
u_line[pos_x / 2] = clamp_float_to_byte(u); // strictly this should average the four neighboring pixels, but it's only test code, so keep it simple
v_line[pos_x / 2] = clamp_float_to_byte(v);
}
}
}
// Version 1: the simplest implementation, directly applying the formula
void yuv420_to_rgb(int width, int height,
const uint8_t *y_data, int y_stride,
const uint8_t *u_data, int u_stride,
const uint8_t *v_data, int v_stride,
uint8_t *rgba_data, int rgba_stride)
{
for (int pos_y = 0; pos_y < height; ++pos_y)
{
const uint8_t *y_line = y_data + pos_y * y_stride;
const uint8_t *u_line = u_data + pos_y / 2 * u_stride;
const uint8_t *v_line = v_data + pos_y / 2 * v_stride;
uint8_t *rgba_line = rgba_data + pos_y * rgba_stride;
for (int pos_x = 0; pos_x < width; ++pos_x)
{
float y = y_line[pos_x];
float u = u_line[pos_x / 2] - 128.0f;
float v = v_line[pos_x / 2] - 128.0f;
float r = y + 1.402f * v;
float g = y + -0.344f * u - 0.714f * v;
float b = y + 1.77f * u;
rgba_line[pos_x * 4] = clamp_float_to_byte(r);
rgba_line[pos_x * 4 + 1] = clamp_float_to_byte(g);
rgba_line[pos_x * 4 + 2] = clamp_float_to_byte(b);
rgba_line[pos_x * 4 + 3] = 255;
}
}
}
// Version 2: multithreaded (via OpenMP); each thread handles a contiguous band of rows
void yuv420_to_rgb_openmp(int width, int height,
const uint8_t *y_data, int y_stride,
const uint8_t *u_data, int u_stride,
const uint8_t *v_data, int v_stride,
uint8_t *rgba_data, int rgba_stride)
{
// thread_count threads in total; each thread processes height_of_thread rows of pixels
const int thread_count = omp_get_max_threads();
const int height_of_thread = (height + thread_count - 1) / thread_count;
#pragma omp parallel for
for (int i_thread = 0; i_thread < thread_count; ++i_thread)
{
int pos_y_start = i_thread * height_of_thread;
int pos_y_end = (i_thread + 1) * height_of_thread;
if (pos_y_end > height)
{
pos_y_end = height;
}
for (int pos_y = pos_y_start; pos_y < pos_y_end; ++pos_y)
{
const uint8_t *y_line = y_data + pos_y * y_stride;
const uint8_t *u_line = u_data + pos_y / 2 * u_stride;
const uint8_t *v_line = v_data + pos_y / 2 * v_stride;
uint8_t *rgba_line = rgba_data + pos_y * rgba_stride;
for (int pos_x = 0; pos_x < width; ++pos_x)
{
float y = y_line[pos_x];
float u = u_line[pos_x / 2] - 128.0f;
float v = v_line[pos_x / 2] - 128.0f;
float r = y + 1.402f * v;
float g = y + -0.344f * u - 0.714f * v;
float b = y + 1.77f * u;
rgba_line[pos_x * 4] = clamp_float_to_byte(r);
rgba_line[pos_x * 4 + 1] = clamp_float_to_byte(g);
rgba_line[pos_x * 4 + 2] = clamp_float_to_byte(b);
rgba_line[pos_x * 4 + 3] = 255;
}
}
}
}
// Version 3: remove floating-point math (with a slight loss of precision).
// v * 1.402 can be rewritten as (v * (1.402 * 8192)) / 8192,
// where (1.402 * 8192) is a constant that loses little precision when rounded to an integer,
// and the division by 8192 becomes a right shift.
void yuv420_to_rgb_openmp_int(int width, int height,
const uint8_t *y_data, int y_stride,
const uint8_t *u_data, int u_stride,
const uint8_t *v_data, int v_stride,
uint8_t *rgba_data, int rgba_stride)
{
// thread_count threads in total; each thread processes height_of_thread rows of pixels
const int thread_count = omp_get_max_threads();
const int height_of_thread = (height + thread_count - 1) / thread_count;
const int bits = 13;
const int c_1402 = (int)(1.402f * (1 << bits));
const int c_0344 = (int)(0.344f * (1 << bits));
const int c_0714 = (int)(0.714f * (1 << bits));
const int c_1770 = (int)(1.770f * (1 << bits));
#pragma omp parallel for
for (int i_thread = 0; i_thread < thread_count; ++i_thread)
{
int pos_y_start = i_thread * height_of_thread;
int pos_y_end = (i_thread + 1) * height_of_thread;
if (pos_y_end > height)
{
pos_y_end = height;
}
for (int pos_y = pos_y_start; pos_y < pos_y_end; ++pos_y)
{
const uint8_t *y_line = y_data + pos_y * y_stride;
const uint8_t *u_line = u_data + pos_y / 2 * u_stride;
const uint8_t *v_line = v_data + pos_y / 2 * v_stride;
uint8_t *rgba_line = rgba_data + pos_y * rgba_stride;
for (int pos_x = 0; pos_x < width; ++pos_x)
{
int y = y_line[pos_x];
int u = u_line[pos_x / 2] - 128;
int v = v_line[pos_x / 2] - 128;
int r = y + ((c_1402 * v) >> bits);
int g = y - ((c_0344 * u + c_0714 * v) >> bits);
int b = y + ((c_1770 * u) >> bits);
rgba_line[pos_x * 4] = clamp_int_to_byte(r);
rgba_line[pos_x * 4 + 1] = clamp_int_to_byte(g);
rgba_line[pos_x * 4 + 2] = clamp_int_to_byte(b);
rgba_line[pos_x * 4 + 3] = 255;
}
}
}
}
// YUV to RGB using SSE; converts four pixels per call.
// y, u and v each hold four epi32 values.
// Writes 16 bytes to result, which must be 16-byte aligned.
inline void sse_yuv_to_rgb(__m128i &y, __m128i &u, __m128i &v, void *result)
{
__m128 yf = _mm_cvtepi32_ps(y);
__m128 uf = _mm_cvtepi32_ps(u);
uf = _mm_add_ps(uf, _mm_set_ps1(-128.0f));
__m128 vf = _mm_cvtepi32_ps(v);
vf = _mm_add_ps(vf, _mm_set_ps1(-128.0f));
__m128 r, g, b;
r = _mm_add_ps(yf, _mm_mul_ps(_mm_set_ps1(1.402f), vf));
g = _mm_add_ps(_mm_add_ps(yf, _mm_mul_ps(_mm_set_ps1(-0.344f), uf)), _mm_mul_ps(_mm_set_ps1(-0.714f), vf));
b = _mm_add_ps(yf, _mm_mul_ps(_mm_set_ps1(1.77f), uf));
r = _mm_max_ps(_mm_min_ps(_mm_add_ps(r, _mm_set_ps1(0.5f)), _mm_set_ps1(255.0f)), _mm_set_ps1(0.0f));
g = _mm_max_ps(_mm_min_ps(_mm_add_ps(g, _mm_set_ps1(0.5f)), _mm_set_ps1(255.0f)), _mm_set_ps1(0.0f));
b = _mm_max_ps(_mm_min_ps(_mm_add_ps(b, _mm_set_ps1(0.5f)), _mm_set_ps1(255.0f)), _mm_set_ps1(0.0f));
__m128i ri = _mm_cvtps_epi32(r);
__m128i gi = _mm_cvtps_epi32(g);
gi = _mm_slli_epi32(gi, 8);
__m128i bi = _mm_cvtps_epi32(b);
bi = _mm_slli_epi32(bi, 16);
__m128i rgba_i = _mm_or_si128(_mm_or_si128(_mm_or_si128(ri, gi), bi), _mm_set1_epi32(0xFF000000));
_mm_store_si128((__m128i*)result, rgba_i);
}
// Version 4: SSE.
// To stay compatible with the oldest hardware, this uses SSE2 only, which lacks
// _mm_mullo_epi32 (it only has _mm_mullo_epi16); hence the float math in sse_yuv_to_rgb.
void yuv420_to_rgb_openmp_sse(int width, int height,
const uint8_t *y_data, int y_stride,
const uint8_t *u_data, int u_stride,
const uint8_t *v_data, int v_stride,
uint8_t *rgba_data, int rgba_stride)
{
// thread_count threads in total; each thread processes height_of_thread rows of pixels
const int thread_count = omp_get_max_threads();
const int height_of_thread = (height + thread_count - 1) / thread_count;
const int bits = 16;
const int c_1402 = (int)(1.402f * (1 << bits));
const int c_0344 = (int)(0.344f * (1 << bits));
const int c_0714 = (int)(0.714f * (1 << bits));
const int c_1770 = (int)(1.770f * (1 << bits));
memset(rgba_data, 0, rgba_stride * height);
#pragma omp parallel for
for (int i_thread = 0; i_thread < thread_count; ++i_thread)
{
int pos_y_start = i_thread * height_of_thread;
int pos_y_end = (i_thread + 1) * height_of_thread;
if (pos_y_end > height)
{
pos_y_end = height;
}
const __m128i zeros = _mm_set1_epi32(0);
for (int pos_y = pos_y_start; pos_y < pos_y_end; ++pos_y)
{
const uint8_t *y_line = y_data + pos_y * y_stride;
const uint8_t *u_line = u_data + pos_y / 2 * u_stride;
const uint8_t *v_line = v_data + pos_y / 2 * v_stride;
uint8_t *rgba_line = rgba_data + pos_y * rgba_stride;
// Each loop iteration processes 32 pixels
int pos_x = 0;
for (pos_x = 0; pos_x + 32 <= width; pos_x += 32)
{
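// Load 16 Y bytes plus 16 U and 16 V bytes; each chroma byte covers two adjacent
// pixels, so 16 chroma bytes serve all 32 pixels of this iteration.
// Unpacking a chroma register with itself duplicates every byte, spreading each
// U/V value over its pair of pixels before widening to 32-bit lanes.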
__m128i u_0_15 = _mm_load_si128((const __m128i*)(u_line + pos_x / 2));
__m128i v_0_15 = _mm_load_si128((const __m128i*)(v_line + pos_x / 2));
__m128i y_0_15 = _mm_load_si128((const __m128i*)(y_line + pos_x));
__m128i y, u, v;
// pixels 0-3
y = _mm_unpacklo_epi16(_mm_unpacklo_epi8(y_0_15, zeros), zeros);
u = _mm_unpacklo_epi16(_mm_unpacklo_epi8(_mm_unpacklo_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpacklo_epi16(_mm_unpacklo_epi8(_mm_unpacklo_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x) * 4);
// pixels 4-7
y = _mm_unpackhi_epi16(_mm_unpacklo_epi8(y_0_15, zeros), zeros);
u = _mm_unpackhi_epi16(_mm_unpacklo_epi8(_mm_unpacklo_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpackhi_epi16(_mm_unpacklo_epi8(_mm_unpacklo_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x + 4) * 4);
// pixels 8-11
y = _mm_unpacklo_epi16(_mm_unpackhi_epi8(y_0_15, zeros), zeros);
u = _mm_unpacklo_epi16(_mm_unpackhi_epi8(_mm_unpacklo_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpacklo_epi16(_mm_unpackhi_epi8(_mm_unpacklo_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x + 8) * 4);
// pixels 12-15
y = _mm_unpackhi_epi16(_mm_unpackhi_epi8(y_0_15, zeros), zeros);
u = _mm_unpackhi_epi16(_mm_unpackhi_epi8(_mm_unpacklo_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpackhi_epi16(_mm_unpackhi_epi8(_mm_unpacklo_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x + 12) * 4);
__m128i y_16_31 = _mm_loadu_si128((const __m128i*)(y_line + pos_x + 16));
// pixels 16-19
y = _mm_unpacklo_epi16(_mm_unpacklo_epi8(y_16_31, zeros), zeros);
u = _mm_unpacklo_epi16(_mm_unpacklo_epi8(_mm_unpackhi_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpacklo_epi16(_mm_unpacklo_epi8(_mm_unpackhi_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x + 16) * 4);
// pixels 20-23
y = _mm_unpackhi_epi16(_mm_unpacklo_epi8(y_16_31, zeros), zeros);
u = _mm_unpackhi_epi16(_mm_unpacklo_epi8(_mm_unpackhi_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpackhi_epi16(_mm_unpacklo_epi8(_mm_unpackhi_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x + 20) * 4);
// pixels 24-27
y = _mm_unpacklo_epi16(_mm_unpackhi_epi8(y_16_31, zeros), zeros);
u = _mm_unpacklo_epi16(_mm_unpackhi_epi8(_mm_unpackhi_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpacklo_epi16(_mm_unpackhi_epi8(_mm_unpackhi_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x + 24) * 4);
// pixels 28-31
y = _mm_unpackhi_epi16(_mm_unpackhi_epi8(y_16_31, zeros), zeros);
u = _mm_unpackhi_epi16(_mm_unpackhi_epi8(_mm_unpackhi_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpackhi_epi16(_mm_unpackhi_epi8(_mm_unpackhi_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x + 28) * 4);
}
// Whatever is left over (width not a multiple of 32) is handled with the scalar integer path
for (; pos_x < width; ++pos_x)
{
int y = y_line[pos_x];
int u = u_line[pos_x / 2] - 128;
int v = v_line[pos_x / 2] - 128;
int r = y + ((c_1402 * v) >> bits);
int g = y - ((c_0344 * u + c_0714 * v) >> bits);
int b = y + ((c_1770 * u) >> bits);
rgba_line[pos_x * 4] = clamp_int_to_byte(r);
rgba_line[pos_x * 4 + 1] = clamp_int_to_byte(g);
rgba_line[pos_x * 4 + 2] = clamp_int_to_byte(b);
rgba_line[pos_x * 4 + 3] = 255;
}
}
}
}
void init()
{
#if _DEBUG
omp_set_num_threads(1);
#endif
const int thread_count = omp_get_max_threads();
printf("OpenMP thread count: %d\n", thread_count);
const int width = 1920;
const int height = 1080;
const int half_width = width / 2;
const int half_height = height / 2;
#if _DEBUG
{
float r1 = 0.0f;
float g1 = 255.0f;
float b1 = 91.0f;
float y1 = 0.299f * r1 + 0.587f * g1 + 0.114f * b1;
float u1 = -0.169f * r1 + -0.331f * g1 + 0.5f * b1 + 128.0f;
float v1 = 0.5f * r1 + -0.419f * g1 + -0.081f * b1 + 128.0f;
float r1_ = y1 + 1.402f * (v1 - 128.0f);
float g1_ = y1 + -0.344f * (u1 - 128.0f) + -0.714f * (v1 - 128.0f);
float b1_ = y1 + 1.77f * (u1 - 128.0f);
__m128i y = _mm_set1_epi32(clamp_float_to_byte(y1));
__m128i u = _mm_set1_epi32(clamp_float_to_byte(u1));
__m128i v = _mm_set1_epi32(clamp_float_to_byte(v1));
uint8_t *pixel = (uint8_t*)_aligned_malloc(32, 16);
sse_yuv_to_rgb(y, u, v, pixel);
_aligned_free(pixel);
}
#endif
// SSE requires 16-byte alignment
const int alignment = 16;
const int y_stride = (width + alignment - 1) / alignment * alignment;
const int u_stride = (half_width + alignment - 1) / alignment * alignment;
const int v_stride = u_stride;
const int rgba_stride = (width * 4 + alignment - 1) / alignment * alignment;
// Fabricate some RGBA data as test input
uint8_t *input_rgba_data = (uint8_t*)_aligned_malloc(rgba_stride * height, alignment);
for (int pos_y = 0; pos_y < height; ++pos_y)
{
uint8_t *rgba_line = input_rgba_data + pos_y * rgba_stride;
for (int pos_x = 0; pos_x < width; ++pos_x)
{
rgba_line[pos_x * 4] = clamp_float_to_byte(((float)pos_x / width) * 255.0f); // R
rgba_line[pos_x * 4 + 1] = clamp_float_to_byte(((float)pos_y / height) * 255.0f); // G
rgba_line[pos_x * 4 + 2] = clamp_float_to_byte(((float)(pos_x + pos_y) / (width + height)) * 255.0f); // B
rgba_line[pos_x * 4 + 3] = 255;
}
}
uint8_t *y_data = (uint8_t*)_aligned_malloc(y_stride * height, alignment);
uint8_t *u_data = (uint8_t*)_aligned_malloc(u_stride * half_height, alignment);
uint8_t *v_data = (uint8_t*)_aligned_malloc(v_stride * half_height, alignment);
uint8_t *rgba_data = (uint8_t*)_aligned_malloc(rgba_stride * height, alignment);
rgb_to_yuv420(width, height, input_rgba_data, rgba_stride, y_data, y_stride, u_data, u_stride, v_data, v_stride);
#ifdef _DEBUG
const int TEST_COUNT = 1;
#else
const int TEST_COUNT = 1000;
#endif
clock_t t1, t2;
t1 = clock();
for (int test = 0; test < TEST_COUNT; ++test)
{
// if (test % 10 == 0) printf("yuv420_to_rgb_openmp %d\n", test);
yuv420_to_rgb_openmp(width, height, y_data, y_stride, u_data, u_stride, v_data, v_stride, rgba_data, rgba_stride);
}
t2 = clock();
printf("yuv420_to_rgb_openmp %d times cost: %.2f seconds\n", TEST_COUNT, (t2 - t1) / (double)CLOCKS_PER_SEC);
t1 = clock();
for (int test = 0; test < TEST_COUNT; ++test)
{
// if (test % 10 == 0) printf("yuv420_to_rgb_openmp_int %d\n", test);
yuv420_to_rgb_openmp_int(width, height, y_data, y_stride, u_data, u_stride, v_data, v_stride, rgba_data, rgba_stride);
}
t2 = clock();
printf("yuv420_to_rgb_openmp_int %d times cost: %.2f seconds\n", TEST_COUNT, (t2 - t1) / (double)CLOCKS_PER_SEC);
t1 = clock();
for (int test = 0; test < TEST_COUNT; ++test)
{
// if (test % 10 == 0) printf("yuv420_to_rgb_openmp_sse %d\n", test);
yuv420_to_rgb_openmp_sse(width, height, y_data, y_stride, u_data, u_stride, v_data, v_stride, rgba_data, rgba_stride);
}
t2 = clock();
printf("yuv420_to_rgb_openmp_sse %d times cost: %.2f seconds\n", TEST_COUNT, (t2 - t1) / (double)CLOCKS_PER_SEC);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, rgba_data);
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
_aligned_free(rgba_data);
_aligned_free(input_rgba_data);
_aligned_free(y_data);
_aligned_free(u_data);
_aligned_free(v_data);
}
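// Draw a full-screen textured quad so the converted image is shown on screen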
void display() {
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
glEnable(GL_TEXTURE_2D);
glColor3f(1.0f, 1.0f, 1.0f);
glBegin(GL_QUADS);
glTexCoord2f(0.0f, 0.0f); glVertex2f(-1.0f, -1.0f);
glTexCoord2f(1.0f, 0.0f); glVertex2f( 1.0f, -1.0f);
glTexCoord2f(1.0f, 1.0f); glVertex2f( 1.0f, 1.0f);
glTexCoord2f(0.0f, 1.0f); glVertex2f(-1.0f, 1.0f);
glEnd();
glutSwapBuffers();
}
int main(int argc, char* argv[]) {
glutInit(&argc, argv);
glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGBA | GLUT_DEPTH | GLUT_STENCIL);
glutInitWindowSize(800, 600);
glutInitWindowPosition(100, 100);
glutCreateWindow("");
init();
glutDisplayFunc(display);
glutMainLoop();
return 0;
}
```
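The code is Windows/MSVC oriented (`_aligned_malloc`, `_DEBUG`, freeglut). Assuming the freeglut headers and import library are on the compiler's search paths, and the file is saved as, say, yuv_test.cpp (the file name is arbitrary), a build command along these lines should work:

```
cl /O2 /EHsc /openmp yuv_test.cpp freeglut.lib opengl32.lib
```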