A friend who works on video processing told me that converting YUV420 to RGB had become his performance bottleneck.
My first reaction was disbelief.
To squeeze out more performance he then moved the CPU algorithm to the GPU (an OpenGL shader), but since he is not familiar with OpenGL, the result was actually slower. Things were getting awkward.
So I decided to run some tests myself.
I confirmed his workload: 16 video streams at 15 fps, with a combined resolution of 1920x1080, so each stream averages 480x270. His CPU is an Intel i5 8500 (6 cores, 6 threads).
My results are below (my CPU is an Intel i7 8700, 6 cores / 12 threads).
Time to convert a single 1920x1080 image 1000 times (seconds):
Threads | Simple | Integer | SSE2 |
---|---|---|---|
1 | 10.6 | 6.09 | 2.27 |
2 | 5.54 | 3.43 | 1.41 |
3 | 4.47 | 2.63 | 1.14 |
4 | 3.75 | 2.04 | 0.9 |
6 | 2.71 | 1.7 | 0.9 |
8 | 2.18 | 1.41 | 0.91 |
12 | 1.82 | 1.25 | 1.03 |
Time to convert a single 480x270 image 1000 times (seconds):
Threads | Simple | Integer | SSE2 |
---|---|---|---|
1 | 0.65 | 0.38 | 0.14 |
2 | 0.33 | 0.23 | 0.09 |
3 | 0.26 | 0.14 | 0.09 |
4 | 0.21 | 0.12 | 0.09 |
6 | 0.18 | 0.12 | 0.08 |
8 | 0.14 | 0.09 | 0.09 |
12 | 0.12 | 0.08 | 0.08 |
Observation: one 1920x1080 image and sixteen 480x270 images take roughly the same time; there is no multiple-fold difference.
My friend uses ffmpeg. I was too lazy to benchmark it myself, so I borrowed someone else's results from the web:
https://stackoom.com/question/3ejar/sws-scale-vs-libyuv%E6%80%A7%E8%83%BD-%E9%80%9F%E5%BA%A6
He used NV12, which is essentially a variant of YUV420P: U and V are stored together in one interleaved plane, which improves cache friendliness and gives a certain performance advantage.
I used plain YUV420P.
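For reference, here is a minimal sketch of the two layouts (the struct and field names are mine, purely for illustration): both formats store a full-resolution Y plane, but YUV420P keeps U and V in two separate quarter-size planes, while NV12 stores a single quarter-size plane of interleaved U/V pairs, so a U and its matching V always arrive together.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical containers, for illustration only (not from any real API).
struct Yuv420pFrame {              // planar: three separate planes
    std::vector<uint8_t> y;        // width * height bytes
    std::vector<uint8_t> u;        // (width/2) * (height/2) bytes
    std::vector<uint8_t> v;        // (width/2) * (height/2) bytes
};

struct Nv12Frame {                 // semi-planar: Y plane + one interleaved UV plane
    std::vector<uint8_t> y;        // width * height bytes
    std::vector<uint8_t> uv;       // (width/2) * (height/2) * 2 bytes: U0 V0 U1 V1 ...
};

// Chroma of pixel (x, y), assuming tightly packed planes of width w:
//   YUV420P: u[(y/2) * (w/2) + x/2] and v[(y/2) * (w/2) + x/2]  -> two separate streams
//   NV12:    uv[((y/2) * (w/2) + x/2) * 2] and the byte after it -> one stream
```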
Going by his timings: my image has (1920*1080) / (1280*692) ≈ 2.34 times as many pixels as his.
Assuming his machine is roughly comparable to mine and that he ran single-threaded, his timings need to be multiplied by 2.34, which gives:
Time to convert a single 1920x1080 image 1000 times (extrapolated, so not necessarily accurate; seconds):
Threads | ffmpeg | libyuv |
---|---|---|
1 | 0.003 * 2.34 * 1000 = 7.02 | 0.045 * 2.34 * 1000 = 105.3 |
On average ffmpeg takes about 7 ms per conversion; at 15 frames per second that is roughly 105 ms of work per second, i.e. around 10% of one core.
That is not a trivial cost.
My SSE2 version is faster than ffmpeg: about 2.27 ms per frame single-threaded, which works out to roughly 3.5% of one core at 15 fps.
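Spelling out the arithmetic (the per-frame numbers are taken from the tables above; this is purely illustrative):

```cpp
#include <cstdio>

int main()
{
    // Estimated share of one core = per-frame cost (ms) * frame rate / 1000 ms, as a percentage.
    const double fps = 15.0;
    const double ffmpeg_ms_per_frame = 7.02; // extrapolated single-thread cost per 1920x1080 frame
    const double sse2_ms_per_frame   = 2.27; // my SSE2 version, single thread, from the first table
    printf("ffmpeg: %.1f%% of one core\n", ffmpeg_ms_per_frame * fps / 10.0); // ~10.5%
    printf("SSE2:   %.1f%% of one core\n", sse2_ms_per_frame * fps / 10.0);   // ~3.4%
    return 0;
}
```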
Overall conclusions:
- One 1920x1080 image and sixteen 480x270 images take roughly the same time; no multiple-fold difference.
- My code appears to be faster than both ffmpeg and libyuv.
- The performance gap between the three versions is clear, especially with few threads. My SSE2 experience is also limited; someone more experienced could probably optimize it further.
- In the SSE2 version, adding more threads stops helping beyond a point. Giving each video stream its own dedicated thread is worth considering (see the sketch after this list).
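For the 16-stream case, here is a minimal sketch of the one-thread-per-stream idea using std::thread. The Stream struct and convert_all_streams are hypothetical names; it assumes the yuv420_to_rgb_openmp_sse function from the test code below, with OpenMP limited to one thread per call so the 16 workers do not oversubscribe the CPU.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// SSE2 converter from the test code below.
void yuv420_to_rgb_openmp_sse(int width, int height,
                              const uint8_t *y_data, int y_stride,
                              const uint8_t *u_data, int u_stride,
                              const uint8_t *v_data, int v_stride,
                              uint8_t *rgba_data, int rgba_stride);

// Hypothetical per-stream state; in a real program each stream would own its
// decoded YUV planes and its destination RGBA buffer.
struct Stream {
    int width, height;
    const uint8_t *y, *u, *v;
    int y_stride, u_stride, v_stride;
    uint8_t *rgba;
    int rgba_stride;
};

// Convert one frame of every stream, one dedicated thread per stream.
void convert_all_streams(std::vector<Stream> &streams)
{
    std::vector<std::thread> workers;
    workers.reserve(streams.size());
    for (size_t i = 0; i < streams.size(); ++i)
    {
        Stream *s = &streams[i]; // pointer captured by value so the lambda stays valid
        workers.emplace_back([s]() {
            // Each call should run single-threaded here (e.g. omp_set_num_threads(1)),
            // otherwise every worker would spawn its own OpenMP team.
            yuv420_to_rgb_openmp_sse(s->width, s->height,
                                     s->y, s->y_stride,
                                     s->u, s->u_stride,
                                     s->v, s->v_stride,
                                     s->rgba, s->rgba_stride);
        });
    }
    for (std::thread &t : workers)
        t.join();
}
```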
The test setup is as follows.
- For simplicity, OpenMP is used for multithreading.
- Original version: the yuv420_to_rgb function
- Simple version: the yuv420_to_rgb_openmp function
- Integer version: the yuv420_to_rgb_openmp_int function
- SSE2 version: the yuv420_to_rgb_openmp_sse function
- The main flow is in the init function, which
- first fabricates a block of RGBA data,
- then converts it to YUV420P,
- then repeatedly calls each of the conversion functions above to turn the YUV420P back into RGBA, timing them,
- and finally, to verify correctness, displays the resulting RGBA on screen (via OpenGL).
The full test code:
```cpp
#include <GL/freeglut.h>
#define GL_CLAMP_TO_EDGE 0x812F
#ifndef _OPENMP
# error "OpenMP is required"
#endif
#include <omp.h>
#include <emmintrin.h> // SSE2 intrinsics
#include <malloc.h>    // _aligned_malloc / _aligned_free (MSVC)
#include <cstring>     // memset
#include <ctime>
#include <cstdio>
#include <cstdlib>
#include <cstdint>
uint8_t clamp_float_to_byte(float val)
{
val += 0.5f;
if (val >= 255.0f) return 255;
if (val <= 0.0f) return 0;
return (uint8_t)val;
}
uint8_t clamp_int_to_byte(int val)
{
if (val >= 255) return 255;
if (val <= 0) return 0;
return (uint8_t)val;
}
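// Reference RGBA -> YUV420P conversion; only used in init() to fabricate the test input.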
void rgb_to_yuv420(int width, int height,
const uint8_t *rgba_data, int rgba_stride,
uint8_t *y_data, int y_stride,
uint8_t *u_data, int u_stride,
uint8_t *v_data, int v_stride)
{
for (int pos_y = 0; pos_y < height; ++pos_y)
{
const uint8_t *rgba_line = rgba_data + pos_y * rgba_stride;
uint8_t *y_line = y_data + pos_y * y_stride;
uint8_t *u_line = u_data + pos_y / 2 * u_stride;
uint8_t *v_line = v_data + pos_y / 2 * v_stride;
for (int pos_x = 0; pos_x < width; ++pos_x)
{
float r = rgba_line[pos_x * 4];
float g = rgba_line[pos_x * 4 + 1];
float b = rgba_line[pos_x * 4 + 2];
// alpha (rgba_line[pos_x * 4 + 3]) is ignored here
float y = 0.299f * r + 0.587f * g + 0.114f * b;
float u = -0.169f * r + -0.331f * g + 0.5f * b + 128.0f;
float v = 0.5f * r + -0.419f * g + -0.081f * b + 128.0f;
y_line[pos_x] = clamp_float_to_byte(y);
u_line[pos_x / 2] = clamp_float_to_byte(u); // strictly this should average the four neighboring pixels, but it's only test code, so keep it simple
v_line[pos_x / 2] = clamp_float_to_byte(v);
}
}
}
// Version 1: the simplest implementation, directly applying the formula
void yuv420_to_rgb(int width, int height,
const uint8_t *y_data, int y_stride,
const uint8_t *u_data, int u_stride,
const uint8_t *v_data, int v_stride,
uint8_t *rgba_data, int rgba_stride)
{
for (int pos_y = 0; pos_y < height; ++pos_y)
{
const uint8_t *y_line = y_data + pos_y * y_stride;
const uint8_t *u_line = u_data + pos_y / 2 * u_stride;
const uint8_t *v_line = v_data + pos_y / 2 * v_stride;
uint8_t *rgba_line = rgba_data + pos_y * rgba_stride;
for (int pos_x = 0; pos_x < width; ++pos_x)
{
float y = y_line[pos_x];
float u = u_line[pos_x / 2] - 128.0f;
float v = v_line[pos_x / 2] - 128.0f;
float r = y + 1.402f * v;
float g = y + -0.344f * u - 0.714f * v;
float b = y + 1.77f * u;
rgba_line[pos_x * 4] = clamp_float_to_byte(r);
rgba_line[pos_x * 4 + 1] = clamp_float_to_byte(g);
rgba_line[pos_x * 4 + 2] = clamp_float_to_byte(b);
rgba_line[pos_x * 4 + 3] = 255;
}
}
}
// Version 2: multithreaded (via OpenMP); each thread handles a contiguous band of rows
void yuv420_to_rgb_openmp(int width, int height,
const uint8_t *y_data, int y_stride,
const uint8_t *u_data, int u_stride,
const uint8_t *v_data, int v_stride,
uint8_t *rgba_data, int rgba_stride)
{
// thread_count threads in total; each thread processes height_of_thread rows of pixels
const int thread_count = omp_get_max_threads();
const int height_of_thread = (height + thread_count - 1) / thread_count;
#pragma omp parallel for
for (int i_thread = 0; i_thread < thread_count; ++i_thread)
{
int pos_y_start = i_thread * height_of_thread;
int pos_y_end = (i_thread + 1) * height_of_thread;
if (pos_y_end > height)
{
pos_y_end = height;
}
for (int pos_y = pos_y_start; pos_y < pos_y_end; ++pos_y)
{
const uint8_t *y_line = y_data + pos_y * y_stride;
const uint8_t *u_line = u_data + pos_y / 2 * u_stride;
const uint8_t *v_line = v_data + pos_y / 2 * v_stride;
uint8_t *rgba_line = rgba_data + pos_y * rgba_stride;
for (int pos_x = 0; pos_x < width; ++pos_x)
{
float y = y_line[pos_x];
float u = u_line[pos_x / 2] - 128.0f;
float v = v_line[pos_x / 2] - 128.0f;
float r = y + 1.402f * v;
float g = y + -0.344f * u - 0.714f * v;
float b = y + 1.77f * u;
rgba_line[pos_x * 4] = clamp_float_to_byte(r);
rgba_line[pos_x * 4 + 1] = clamp_float_to_byte(g);
rgba_line[pos_x * 4 + 2] = clamp_float_to_byte(b);
rgba_line[pos_x * 4 + 3] = 255;
}
}
}
}
// Version 3: remove floating-point math (with a slight loss of precision).
// v * 1.402 can be rewritten as (v * (1.402 * 8192)) / 8192,
// where (1.402 * 8192) is a constant that loses little precision when rounded to an integer,
// and the division by 8192 becomes a right shift.
void yuv420_to_rgb_openmp_int(int width, int height,
const uint8_t *y_data, int y_stride,
const uint8_t *u_data, int u_stride,
const uint8_t *v_data, int v_stride,
uint8_t *rgba_data, int rgba_stride)
{
// thread_count threads in total; each thread processes height_of_thread rows of pixels
const int thread_count = omp_get_max_threads();
const int height_of_thread = (height + thread_count - 1) / thread_count;
const int bits = 13;
const int c_1402 = (int)(1.402f * (1 << bits));
const int c_0344 = (int)(0.344f * (1 << bits));
const int c_0714 = (int)(0.714f * (1 << bits));
const int c_1770 = (int)(1.770f * (1 << bits));
#pragma omp parallel for
for (int i_thread = 0; i_thread < thread_count; ++i_thread)
{
int pos_y_start = i_thread * height_of_thread;
int pos_y_end = (i_thread + 1) * height_of_thread;
if (pos_y_end > height)
{
pos_y_end = height;
}
for (int pos_y = pos_y_start; pos_y < pos_y_end; ++pos_y)
{
const uint8_t *y_line = y_data + pos_y * y_stride;
const uint8_t *u_line = u_data + pos_y / 2 * u_stride;
const uint8_t *v_line = v_data + pos_y / 2 * v_stride;
uint8_t *rgba_line = rgba_data + pos_y * rgba_stride;
for (int pos_x = 0; pos_x < width; ++pos_x)
{
int y = y_line[pos_x];
int u = u_line[pos_x / 2] - 128;
int v = v_line[pos_x / 2] - 128;
int r = y + ((c_1402 * v) >> bits);
int g = y - ((c_0344 * u + c_0714 * v) >> bits);
int b = y + ((c_1770 * u) >> bits);
rgba_line[pos_x * 4] = clamp_int_to_byte(r);
rgba_line[pos_x * 4 + 1] = clamp_int_to_byte(g);
rgba_line[pos_x * 4 + 2] = clamp_int_to_byte(b);
rgba_line[pos_x * 4 + 3] = 255;
}
}
}
}
// YUV to RGB using SSE; converts four pixels per call.
// y, u and v each hold four epi32 values.
// Writes 16 bytes to result, which must be 16-byte aligned.
inline void sse_yuv_to_rgb(__m128i &y, __m128i &u, __m128i &v, void *result)
{
__m128 yf = _mm_cvtepi32_ps(y);
__m128 uf = _mm_cvtepi32_ps(u);
uf = _mm_add_ps(uf, _mm_set_ps1(-128.0f));
__m128 vf = _mm_cvtepi32_ps(v);
vf = _mm_add_ps(vf, _mm_set_ps1(-128.0f));
__m128 r, g, b;
r = _mm_add_ps(yf, _mm_mul_ps(_mm_set_ps1(1.402f), vf));
g = _mm_add_ps(_mm_add_ps(yf, _mm_mul_ps(_mm_set_ps1(-0.344f), uf)), _mm_mul_ps(_mm_set_ps1(-0.714f), vf));
b = _mm_add_ps(yf, _mm_mul_ps(_mm_set_ps1(1.77f), uf));
r = _mm_max_ps(_mm_min_ps(_mm_add_ps(r, _mm_set_ps1(0.5f)), _mm_set_ps1(255.0f)), _mm_set_ps1(0.0f));
g = _mm_max_ps(_mm_min_ps(_mm_add_ps(g, _mm_set_ps1(0.5f)), _mm_set_ps1(255.0f)), _mm_set_ps1(0.0f));
b = _mm_max_ps(_mm_min_ps(_mm_add_ps(b, _mm_set_ps1(0.5f)), _mm_set_ps1(255.0f)), _mm_set_ps1(0.0f));
__m128i ri = _mm_cvtps_epi32(r);
__m128i gi = _mm_cvtps_epi32(g);
gi = _mm_slli_epi32(gi, 8);
__m128i bi = _mm_cvtps_epi32(b);
bi = _mm_slli_epi32(bi, 16);
__m128i rgba_i = _mm_or_si128(_mm_or_si128(_mm_or_si128(ri, gi), bi), _mm_set1_epi32(0xFF000000));
_mm_store_si128((__m128i*)result, rgba_i);
}
// Version 4: SSE.
// To stay compatible with the oldest hardware, this uses SSE2 only, which lacks
// _mm_mullo_epi32 (it only has _mm_mullo_epi16); hence the float math in sse_yuv_to_rgb.
void yuv420_to_rgb_openmp_sse(int width, int height,
const uint8_t *y_data, int y_stride,
const uint8_t *u_data, int u_stride,
const uint8_t *v_data, int v_stride,
uint8_t *rgba_data, int rgba_stride)
{
// thread_count threads in total; each thread processes height_of_thread rows of pixels
const int thread_count = omp_get_max_threads();
const int height_of_thread = (height + thread_count - 1) / thread_count;
const int bits = 16;
const int c_1402 = (int)(1.402f * (1 << bits));
const int c_0344 = (int)(0.344f * (1 << bits));
const int c_0714 = (int)(0.714f * (1 << bits));
const int c_1770 = (int)(1.770f * (1 << bits));
memset(rgba_data, 0, rgba_stride * height);
#pragma omp parallel for
for (int i_thread = 0; i_thread < thread_count; ++i_thread)
{
int pos_y_start = i_thread * height_of_thread;
int pos_y_end = (i_thread + 1) * height_of_thread;
if (pos_y_end > height)
{
pos_y_end = height;
}
const __m128i zeros = _mm_set1_epi32(0);
for (int pos_y = pos_y_start; pos_y < pos_y_end; ++pos_y)
{
const uint8_t *y_line = y_data + pos_y * y_stride;
const uint8_t *u_line = u_data + pos_y / 2 * u_stride;
const uint8_t *v_line = v_data + pos_y / 2 * v_stride;
uint8_t *rgba_line = rgba_data + pos_y * rgba_stride;
// Each loop iteration processes 32 pixels
int pos_x = 0;
for (pos_x = 0; pos_x + 32 <= width; pos_x += 32)
{
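// Load 16 Y bytes plus 16 U and 16 V bytes; each chroma byte covers two adjacent
// pixels, so 16 chroma bytes serve all 32 pixels of this iteration.
// Unpacking a chroma register with itself duplicates every byte, spreading each
// U/V value over its pair of pixels before widening to 32-bit lanes.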
__m128i u_0_15 = _mm_load_si128((const __m128i*)(u_line + pos_x / 2));
__m128i v_0_15 = _mm_load_si128((const __m128i*)(v_line + pos_x / 2));
__m128i y_0_15 = _mm_load_si128((const __m128i*)(y_line + pos_x));
__m128i y, u, v;
// pixels 0-3
y = _mm_unpacklo_epi16(_mm_unpacklo_epi8(y_0_15, zeros), zeros);
u = _mm_unpacklo_epi16(_mm_unpacklo_epi8(_mm_unpacklo_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpacklo_epi16(_mm_unpacklo_epi8(_mm_unpacklo_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x) * 4);
// pixels 4-7
y = _mm_unpackhi_epi16(_mm_unpacklo_epi8(y_0_15, zeros), zeros);
u = _mm_unpackhi_epi16(_mm_unpacklo_epi8(_mm_unpacklo_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpackhi_epi16(_mm_unpacklo_epi8(_mm_unpacklo_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x + 4) * 4);
// pixels 8-11
y = _mm_unpacklo_epi16(_mm_unpackhi_epi8(y_0_15, zeros), zeros);
u = _mm_unpacklo_epi16(_mm_unpackhi_epi8(_mm_unpacklo_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpacklo_epi16(_mm_unpackhi_epi8(_mm_unpacklo_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x + 8) * 4);
// pixels 12-15
y = _mm_unpackhi_epi16(_mm_unpackhi_epi8(y_0_15, zeros), zeros);
u = _mm_unpackhi_epi16(_mm_unpackhi_epi8(_mm_unpacklo_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpackhi_epi16(_mm_unpackhi_epi8(_mm_unpacklo_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x + 12) * 4);
__m128i y_16_31 = _mm_loadu_si128((const __m128i*)(y_line + pos_x + 16));
// pixels 16-19
y = _mm_unpacklo_epi16(_mm_unpacklo_epi8(y_16_31, zeros), zeros);
u = _mm_unpacklo_epi16(_mm_unpacklo_epi8(_mm_unpackhi_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpacklo_epi16(_mm_unpacklo_epi8(_mm_unpackhi_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x + 16) * 4);
// pixels 20-23
y = _mm_unpackhi_epi16(_mm_unpacklo_epi8(y_16_31, zeros), zeros);
u = _mm_unpackhi_epi16(_mm_unpacklo_epi8(_mm_unpackhi_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpackhi_epi16(_mm_unpacklo_epi8(_mm_unpackhi_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x + 20) * 4);
// pixels 24-27
y = _mm_unpacklo_epi16(_mm_unpackhi_epi8(y_16_31, zeros), zeros);
u = _mm_unpacklo_epi16(_mm_unpackhi_epi8(_mm_unpackhi_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpacklo_epi16(_mm_unpackhi_epi8(_mm_unpackhi_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x + 24) * 4);
// pixels 28-31
y = _mm_unpackhi_epi16(_mm_unpackhi_epi8(y_16_31, zeros), zeros);
u = _mm_unpackhi_epi16(_mm_unpackhi_epi8(_mm_unpackhi_epi8(u_0_15, u_0_15), zeros), zeros);
v = _mm_unpackhi_epi16(_mm_unpackhi_epi8(_mm_unpackhi_epi8(v_0_15, v_0_15), zeros), zeros);
sse_yuv_to_rgb(y, u, v, rgba_line + (pos_x + 28) * 4);
}
// Whatever is left over (width not a multiple of 32) is handled with the scalar integer path
for (; pos_x < width; ++pos_x)
{
int y = y_line[pos_x];
int u = u_line[pos_x / 2] - 128;
int v = v_line[pos_x / 2] - 128;
int r = y + ((c_1402 * v) >> bits);
int g = y - ((c_0344 * u + c_0714 * v) >> bits);
int b = y + ((c_1770 * u) >> bits);
rgba_line[pos_x * 4] = clamp_int_to_byte(r);
rgba_line[pos_x * 4 + 1] = clamp_int_to_byte(g);
rgba_line[pos_x * 4 + 2] = clamp_int_to_byte(b);
rgba_line[pos_x * 4 + 3] = 255;
}
}
}
}
void init()
{
#if _DEBUG
omp_set_num_threads(1);
#endif
const int thread_count = omp_get_max_threads();
printf("OpenMP thread count: %d\n", thread_count);
const int width = 1920;
const int height = 1080;
const int half_width = width / 2;
const int half_height = height / 2;
#if _DEBUG
{
float r1 = 0.0f;
float g1 = 255.0f;
float b1 = 91.0f;
float y1 = 0.299f * r1 + 0.587f * g1 + 0.114f * b1;
float u1 = -0.169f * r1 + -0.331f * g1 + 0.5f * b1 + 128.0f;
float v1 = 0.5f * r1 + -0.419f * g1 + -0.081f * b1 + 128.0f;
float r1_ = y1 + 1.402f * (v1 - 128.0f);
float g1_ = y1 + -0.344f * (u1 - 128.0f) + -0.714f * (v1 - 128.0f);
float b1_ = y1 + 1.77f * (u1 - 128.0f);
__m128i y = _mm_set1_epi32(clamp_float_to_byte(y1));
__m128i u = _mm_set1_epi32(clamp_float_to_byte(u1));
__m128i v = _mm_set1_epi32(clamp_float_to_byte(v1));
uint8_t *pixel = (uint8_t*)_aligned_malloc(32, 16);
sse_yuv_to_rgb(y, u, v, pixel);
_aligned_free(pixel);
}
#endif
// SSE requires 16-byte alignment
const int alignment = 16;
const int y_stride = (width + alignment - 1) / alignment * alignment;
const int u_stride = (half_width + alignment - 1) / alignment * alignment;
const int v_stride = u_stride;
const int rgba_stride = (width * 4 + alignment - 1) / alignment * alignment;
// Fabricate some RGBA data as test input
uint8_t *input_rgba_data = (uint8_t*)_aligned_malloc(rgba_stride * height, alignment);
for (int pos_y = 0; pos_y < height; ++pos_y)
{
uint8_t *rgba_line = input_rgba_data + pos_y * rgba_stride;
for (int pos_x = 0; pos_x < width; ++pos_x)
{
rgba_line[pos_x * 4] = clamp_float_to_byte(((float)pos_x / width) * 255.0f); // R
rgba_line[pos_x * 4 + 1] = clamp_float_to_byte(((float)pos_y / height) * 255.0f); // G
rgba_line[pos_x * 4 + 2] = clamp_float_to_byte(((float)(pos_x + pos_y) / (width + height)) * 255.0f); // B
rgba_line[pos_x * 4 + 3] = 255;
}
}
uint8_t *y_data = (uint8_t*)_aligned_malloc(y_stride * height, alignment);
uint8_t *u_data = (uint8_t*)_aligned_malloc(u_stride * half_height, alignment);
uint8_t *v_data = (uint8_t*)_aligned_malloc(v_stride * half_height, alignment);
uint8_t *rgba_data = (uint8_t*)_aligned_malloc(rgba_stride * height, alignment);
rgb_to_yuv420(width, height, input_rgba_data, rgba_stride, y_data, y_stride, u_data, u_stride, v_data, v_stride);
#ifdef _DEBUG
const int TEST_COUNT = 1;
#else
const int TEST_COUNT = 1000;
#endif
clock_t t1, t2;
t1 = clock();
for (int test = 0; test < TEST_COUNT; ++test)
{
// if (test % 10 == 0) printf("yuv420_to_rgb_openmp %d\n", test);
yuv420_to_rgb_openmp(width, height, y_data, y_stride, u_data, u_stride, v_data, v_stride, rgba_data, rgba_stride);
}
t2 = clock();
printf("yuv420_to_rgb_openmp %d times cost: %.2f seconds\n", TEST_COUNT, (t2 - t1) / (double)CLOCKS_PER_SEC);
t1 = clock();
for (int test = 0; test < TEST_COUNT; ++test)
{
// if (test % 10 == 0) printf("yuv420_to_rgb_openmp_int %d\n", test);
yuv420_to_rgb_openmp_int(width, height, y_data, y_stride, u_data, u_stride, v_data, v_stride, rgba_data, rgba_stride);
}
t2 = clock();
printf("yuv420_to_rgb_openmp_int %d times cost: %.2f seconds\n", TEST_COUNT, (t2 - t1) / (double)CLOCKS_PER_SEC);
t1 = clock();
for (int test = 0; test < TEST_COUNT; ++test)
{
// if (test % 10 == 0) printf("yuv420_to_rgb_openmp_sse %d\n", test);
yuv420_to_rgb_openmp_sse(width, height, y_data, y_stride, u_data, u_stride, v_data, v_stride, rgba_data, rgba_stride);
}
t2 = clock();
printf("yuv420_to_rgb_openmp_sse %d times cost: %.2f seconds\n", TEST_COUNT, (t2 - t1) / (double)CLOCKS_PER_SEC);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, rgba_data);
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
_aligned_free(rgba_data);
_aligned_free(input_rgba_data);
_aligned_free(y_data);
_aligned_free(u_data);
_aligned_free(v_data);
}
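// Draw a full-screen textured quad so the converted image is shown on screen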
void display() {
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
glEnable(GL_TEXTURE_2D);
glColor3f(1.0f, 1.0f, 1.0f);
glBegin(GL_QUADS);
glTexCoord2f(0.0f, 0.0f); glVertex2f(-1.0f, -1.0f);
glTexCoord2f(1.0f, 0.0f); glVertex2f( 1.0f, -1.0f);
glTexCoord2f(1.0f, 1.0f); glVertex2f( 1.0f, 1.0f);
glTexCoord2f(0.0f, 1.0f); glVertex2f(-1.0f, 1.0f);
glEnd();
glutSwapBuffers();
}
int main(int argc, char* argv[]) {
glutInit(&argc, argv);
glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGBA | GLUT_DEPTH | GLUT_STENCIL);
glutInitWindowSize(800, 600);
glutInitWindowPosition(100, 100);
glutCreateWindow("");
init();
glutDisplayFunc(display);
glutMainLoop();
return 0;
}
```
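The code is Windows/MSVC oriented (`_aligned_malloc`, `_DEBUG`, freeglut). Assuming the freeglut headers and import library are on the compiler's search paths, and the file is saved as, say, yuv_test.cpp (the file name is arbitrary), a build command along these lines should work:

```
cl /O2 /EHsc /openmp yuv_test.cpp freeglut.lib opengl32.lib
```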