Maximum FLOPS
The theoretical maximum FLOPS (floating-point operations per second) of a GPU can be estimated with the following formula:
maximum_flop = CUDA_core_number * clock_speed * 2
Let's take the RTX 3070 as an example.
The RTX 3070 has two clock speeds:
- base clock speed: 1500 MHz
- boost clock speed: 1725 MHz
It has 5888 CUDA cores.
For single-precision (float32) arithmetic at the boost clock, maximum_flop = 1725 MHz * 5888 * 2 = 20.31 TFLOP/s
Why multiply by 2?
Each CUDA core can issue one fused multiply-add (FMA) per clock cycle, and an FMA counts as two floating-point operations: one multiplication and one addition.
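The formula above can be sketched in a few lines of Python. This is only an illustration: the function name peak_flops and the constant names are made up for this note, and the result is a theoretical upper bound that real kernels will not reach.

```python
def peak_flops(cuda_cores: int, clock_hz: float, flops_per_cycle: int = 2) -> float:
    """Theoretical peak FLOP/s = cores * clock * FLOPs per core per cycle.

    flops_per_cycle defaults to 2 because one FMA counts as a multiply plus an add.
    """
    return cuda_cores * clock_hz * flops_per_cycle


# RTX 3070 figures quoted in the text above
RTX3070_CORES = 5888
BASE_CLOCK_HZ = 1500e6   # 1500 MHz
BOOST_CLOCK_HZ = 1725e6  # 1725 MHz

boost_tflops = peak_flops(RTX3070_CORES, BOOST_CLOCK_HZ) / 1e12
base_tflops = peak_flops(RTX3070_CORES, BASE_CLOCK_HZ) / 1e12
print(f"boost: {boost_tflops:.2f} TFLOP/s, base: {base_tflops:.2f} TFLOP/s")
```

Using the base clock instead of the boost clock gives a more conservative estimate of the same quantity.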
Maximum Bandwidth
The theoretical maximum memory bandwidth of a GPU can be estimated with the following formula:
maximum_bandwidth = memory_clock_speed * memory_interface_width / 8
Here memory_clock_speed is the effective data rate per pin, and dividing by 8 converts bits to bytes.
The RTX 3070 has the following specification:
- memory clock speed: 14 Gbps
- memory_interface_width: 256 bit
Then maximum_bandwidth = 14 Gbps * 256 bit / 8 = 448 GB/s
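The same calculation as a small Python sketch. The function name peak_bandwidth_bytes and its parameter names are hypothetical, chosen for this note; the inputs are the RTX 3070 figures listed above.

```python
def peak_bandwidth_bytes(data_rate_bps_per_pin: float, bus_width_bits: int) -> float:
    """Theoretical peak memory bandwidth in bytes per second.

    data_rate_bps_per_pin: effective data rate of one memory pin, in bits/s
    bus_width_bits: width of the memory interface, in bits (one pin per bit)
    """
    bits_per_second = data_rate_bps_per_pin * bus_width_bits
    return bits_per_second / 8  # 8 bits per byte


# RTX 3070: 14 Gbps per pin, 256-bit interface
rtx3070_bw = peak_bandwidth_bytes(14e9, 256)
print(f"{rtx3070_bw / 1e9:.0f} GB/s")
```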
Compute-to-Memory Ratio
computing_memory_ratio = maximum_flop / maximum_bandwidth
For the RTX 3070 working on 32-bit floats (4 bytes each), computing_memory_ratio = (20.3136 TFLOP/s) / (0.448 TB/s / 4 B) ≈ 181.4 FLOPs per float
This means that in the time it takes to move one float32 value to or from memory, the GPU can perform about 181 floating-point operations.
A kernel that performs more than 181.4 floating-point operations per float32 value it moves is compute-bound; otherwise it is memory-bound.
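As a rough illustration of how this threshold can be used, the sketch below compares two kernels against the RTX 3070 ratio. The helper names (machine_balance, classify) are invented for this note, and the per-kernel FLOP and memory-traffic counts are simplifying assumptions: vector addition is counted as 1 FLOP per 3 floats moved (two loads, one store), and the matrix multiply assumes ideal on-chip reuse so that each matrix element crosses the memory bus only once.

```python
def machine_balance(peak_flops: float, peak_bandwidth_bytes: float,
                    bytes_per_element: int = 4) -> float:
    """FLOPs the GPU can perform in the time it moves one element."""
    elements_per_second = peak_bandwidth_bytes / bytes_per_element
    return peak_flops / elements_per_second


def classify(flops: float, elements_moved: float, balance: float) -> str:
    """Compare a kernel's FLOPs-per-element against the machine balance."""
    intensity = flops / elements_moved
    return "compute-bound" if intensity > balance else "memory-bound"


# RTX 3070: 1725 MHz * 5888 cores * 2 FLOP/s and 448 GB/s, float32 elements
RTX3070_BALANCE = machine_balance(20.3136e12, 448e9)

# Vector add c = a + b over n floats: n FLOPs, 3n floats moved (assumption)
n = 1 << 20
vector_add = classify(flops=n, elements_moved=3 * n, balance=RTX3070_BALANCE)

# m x m matrix multiply: 2*m^3 FLOPs, 3*m^2 floats moved (assumes ideal reuse)
m = 1024
matmul = classify(flops=2 * m**3, elements_moved=3 * m * m, balance=RTX3070_BALANCE)

print(f"balance: {RTX3070_BALANCE:.1f} FLOPs per float")
print(f"vector add: {vector_add}, matmul: {matmul}")
```

Under these assumptions vector addition falls far below the threshold, while a large matrix multiply exceeds it; a real matmul kernel only reaches that regime if it actually achieves the data reuse assumed here.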
Reference: CUDA: From Correctness to Performance