
[Triton Programming][Basics] A Minimal Introduction to Triton: Triton Vector Add

GiantPandaLLM · WeChat Official Account · 3D · 2025-05-24 10:21


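As a result, even an experienced CUDA programmer has to invest considerable effort to write a kernel whose performance approaches the theoretical peak. Triton lowers the difficulty of writing GPU kernels by automating several optimizations that otherwise need careful manual design, such as memory coalescing, shared memory (SRAM) allocation and management, and pipelining, so that programmers can spend more of their effort on the algorithm itself.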

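The comparison table published by the Triton authors shows that, with Triton, memory coalescing, shared memory (SRAM) management, and thread scheduling within an SM are all handled automatically; the only thing left to the programmer is the work distribution across SMs. In other words, Triton's programming granularity is the block (each block is scheduled onto exactly one SM), not the thread. We only need to decide what each block does; the layout and scheduling of threads and warps are handled by Triton. So how is the notion of a block expressed in Triton? The answer is: program.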
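To make the block-to-program mapping concrete, here is a small CPU-only sketch in plain Python (no GPU or Triton required). It only illustrates how a 1D grid of programs could partition a vector; the helper names `cdiv` and `program_ranges` are hypothetical and chosen for this illustration, with `cdiv` written to behave like the ceiling-division helper `triton.cdiv`.

```python
def cdiv(n: int, d: int) -> int:
    """Ceiling division: how many blocks of size d are needed to cover n items."""
    return (n + d - 1) // d


def program_ranges(n_elements: int, block_size: int) -> list:
    """Return the half-open [start, end) element range owned by each program."""
    num_programs = cdiv(n_elements, block_size)  # size of the 1D launch grid
    ranges = []
    for pid in range(num_programs):  # pid plays the role of tl.program_id(axis=0)
        start = pid * block_size
        end = min(start + block_size, n_elements)  # the last program may be partial
        ranges.append((start, end))
    return ranges


print(program_ranges(256, 64))  # [(0, 64), (64, 128), (128, 192), (192, 256)]
print(program_ranges(100, 64))  # [(0, 64), (64, 100)]
```

When the vector length is not a multiple of the block size, the last program covers fewer valid elements than the others, which is why a real kernel needs a bounds mask.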

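block -> program: in Triton, program_id uniquely identifies a program. The programmer only needs to think about the logic inside a single program (block), as in the simplest possible example, add_kernel. x_ptr, y_ptr, and output_ptr are pointers to the first input vector, the second input vector, and the output vector, all of which live in GPU memory. Triton is most often used together with PyTorch, and it treats a tensor passed into a kernel as a pointer rather than as a data tensor. BLOCK_SIZE: tl.constexpr declares a Triton compile-time constant that gives the number of elements each block processes. mask = offsets < n_elements creates a mask that keeps memory operations from going out of bounds. tl.load and tl.store are Triton's load and store operations, respectively. One point deserves attention here: to leave more room for performance optimization, Triton operates at the pointer level rather than at the tensor level.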
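The roles of the program id, the offsets, and the mask can be sketched without a GPU. The following plain-Python emulation is only an illustration under stated assumptions, not Triton code: `add_program` and `vector_add` are hypothetical names, Python lists stand in for GPU memory, and the programs run one after another instead of in parallel.

```python
def add_program(pid, x, y, out, n_elements, block_size):
    """Emulate one program (block) of a vector-add kernel on the CPU."""
    block_start = pid * block_size
    # One offset per lane of the block, analogous to block_start + tl.arange(0, BLOCK_SIZE).
    offsets = [block_start + i for i in range(block_size)]
    # Lanes whose offset falls past the end of the vector are masked off.
    mask = [off < n_elements for off in offsets]
    for off, in_bounds in zip(offsets, mask):
        if in_bounds:  # masked-off lanes neither load nor store
            out[off] = x[off] + y[off]


def vector_add(x, y, block_size=4):
    """Run every program of the 1D grid sequentially and return the result."""
    n = len(x)
    out = [0.0] * n
    num_programs = (n + block_size - 1) // block_size  # ceiling division
    for pid in range(num_programs):  # on a GPU these programs would run in parallel
        add_program(pid, x, y, out, n, block_size)
    return out


x = [float(i) for i in range(10)]
y = [10.0 * i for i in range(10)]
print(vector_add(x, y))
```

With 10 elements and a block size of 4, the grid has three programs, and the last one has two lanes masked off; without the mask those lanes would index past the end of the lists.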

0x02 Triton Vector Add

import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr,  # *Pointer* to first input vector.
               y_ptr,  # *Pointer* to second input vector.
               output_ptr,  # *Pointer* to output vector.
               n_elements,  # Size of the vector.
               BLOCK_SIZE: tl.constexpr,  # Number of elements each program should process.
               # NOTE: `constexpr` so it can be used as a shape value.
               ):
    # There are multiple 'programs' (that is, blocks) processing different data.
    # We identify which program we are here:
    pid = tl.program_id(axis=0)  # We use a 1D launch grid so axis is 0.
    # This program will process inputs that are offset from the initial data.
    # For instance, if you had a vector of length 256 and block_size of 64, the programs
    # would each access the elements [0:64, 64:128, 128:192, 192:256].
    # Note that `offsets` is a block of element offsets; adding it to a base pointer
    # yields a block of pointers:
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Create a mask to guard memory operations against out-of-bounds accesses.
    mask = offsets < n_elements
    # Load x and y from DRAM, masking out any extra elements in case the input is not a
    # multiple of the block size.
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load