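Background
Parts 1 and 2 of this series covered common GPU profiling techniques, focusing on timing a single kernel. In practice you may also need timing statistics across many kernel launches, for example:
- The same kernel can take a different amount of time at different moments, so one observed launch is not representative; you need the percentiles of its duration.
- Optimizing performance starts with finding the biggest time consumers, which requires the share of total time taken by each kind of kernel.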
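Parsing Nsys SQLite data
One way to get these numbers is to instrument each kernel directly with torch.cuda.Event timers. That is simple and blunt but tedious, so it is not covered here. This article instead analyzes the profile file produced by nsys.
Here is an example script: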
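import torch
import torch.nn.functional as F


def test():
    # Three square matrices of different sizes, moved to the GPU.
    matrix1 = torch.randn((1024, 1024)).cuda()
    matrix2 = torch.randn((2048, 2048)).cuda()
    matrix3 = torch.randn((4096, 4096)).cuda()
    # warm up
    for i in range(10):
        _ = F.linear(matrix2, matrix2)
    # Each iteration launches three kinds of work: add, CELU, and linear (matmul).
    for i in range(100):
        _ = matrix1 + matrix1
        _ = F.celu(matrix3)
        _ = F.linear(matrix2, matrix2)
    # Wait for all queued GPU work to finish before returning.
    torch.cuda.synchronize()


if __name__ == "__main__":
    test()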
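Profile it with nsys:
The profile data corresponds one-to-one with the operations in the code. To read nsys data programmatically, first export it as a SQLite file, as shown in the figure:
Once you have the SQLite file, you can analyze it with SQL.
It helps to pick a SQLite GUI first for a more intuitive view. This article uses SQLiteStudio, which is open source; any other tool works too. The opened SQLite file looks like this:
The file contains many tables that you can explore on your own. In general, two tables matter most:
1. CUPTI_ACTIVITY_KIND_KERNEL: kernel activity records
2. StringIds: the string lookup table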
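If you would rather explore the file from a script than from a GUI, Python's standard sqlite3 module can list the tables and their columns. The sketch below is only an illustration: the helper names are made up, and make_demo_db builds a tiny in-memory stand-in with a handful of the columns discussed in this article, not the full schema of a real export.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def list_columns(conn, table):
    """Return the column names of `table` via PRAGMA table_info."""
    # PRAGMA arguments cannot be bound as parameters, so quote the name by hand.
    quoted = '"' + table.replace('"', '""') + '"'
    return [row[1] for row in conn.execute(f"PRAGMA table_info({quoted})")]


def make_demo_db():
    """Hypothetical, heavily reduced stand-in for an exported nsys file."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT);
        CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (
            start INTEGER, "end" INTEGER,
            demangledName INTEGER, shortName INTEGER, mangledName INTEGER
        );
    """)
    return conn


conn = make_demo_db()
for table in list_tables(conn):
    print(table, list_columns(conn, table))
```

For a real export, replace make_demo_db() with sqlite3.connect on the path of your own SQLite file.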
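CUPTI_ACTIVITY_KIND_KERNEL holds the timeline information for each kernel launch and is where most of the useful data lives. Note that the kernel name columns in this table store ids; look each id up in StringIds to get the actual string.
Each kernel has three name columns: demangledName, shortName, and mangledName. The figure below shows the three names of one kernel:
Their meanings can be read off the figure below. mangledName is usually the most precise one for identifying a kernel.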
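The id-to-string lookup described above can be written as a join against StringIds, once per name column. This is a minimal sketch under assumptions: the helper names are hypothetical, and the demo database holds one invented kernel (a toy add function) rather than real profile data.

```python
import sqlite3

# Join the kernel table to StringIds three times, once for each name column.
# "end" is quoted because END is also an SQL keyword.
NAME_QUERY = """
SELECT k.start,
       k."end",
       s.value AS short_name,
       d.value AS demangled_name,
       m.value AS mangled_name
FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
JOIN StringIds AS s ON s.id = k.shortName
JOIN StringIds AS d ON d.id = k.demangledName
JOIN StringIds AS m ON m.id = k.mangledName
ORDER BY k.start
"""


def kernel_names(conn):
    """Return (start, end, short, demangled, mangled) for every kernel launch."""
    return conn.execute(NAME_QUERY).fetchall()


def make_demo_db():
    """Hypothetical in-memory stand-in with a single invented kernel."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT);
        CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (
            start INTEGER, "end" INTEGER,
            demangledName INTEGER, shortName INTEGER, mangledName INTEGER
        );
        INSERT INTO StringIds VALUES
            (1, 'add'), (2, 'add(float*)'), (3, '_Z3addPf');
        INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (100, 160, 2, 1, 3);
    """)
    return conn


for row in kernel_names(make_demo_db()):
    print(row)
```

To find the numeric id to filter on, query StringIds for the name string you care about and use the matching id in the kernel table.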
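Aggregating kernel statistics
In principle you can write SQL to compute any statistic about the kernels. The statement below computes duration percentiles for one kernel, based on the start, end, and mangledName columns: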
WITH kernel_data AS (
    SELECT
        mangledName,
        start, end,
        -- timestamps are in nanoseconds, so this is the duration in microseconds
        ((end - start)/1000.0) AS duration
    FROM
        CUPTI_ACTIVITY_KIND_KERNEL
    WHERE
        mangledName = 4882
        -- keep only launches that start after 0.4 s (presumably past the warm-up)
        AND start > 400000000
),
percentiles AS (
    -- 1-based rank of each percentile; MAX(1, ...) avoids rank 0 on small samples
    SELECT
        COUNT(*) AS total_count,
        MAX(1, CAST(COUNT(*) * 0.10 AS INTEGER)) AS p10,
        MAX(1, CAST(COUNT(*) * 0.30 AS INTEGER)) AS p30,
        MAX(1, CAST(COUNT(*) * 0.50 AS INTEGER)) AS p50,
        MAX(1, CAST(COUNT(*) * 0.70 AS INTEGER)) AS p70,
        MAX(1, CAST(COUNT(*) * 0.80 AS INTEGER)) AS p80,
        MAX(1, CAST(COUNT(*) * 0.90 AS INTEGER)) AS p90,
        MAX(1, CAST(COUNT(*) * 0.95 AS INTEGER)) AS p95,
        MAX(1, CAST(COUNT(*) * 0.98 AS INTEGER)) AS p98,
        MAX(1, CAST(COUNT(*) * 0.99 AS INTEGER)) AS p99
    FROM
        kernel_data
)
SELECT
    'P10' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p10) AS duration
FROM percentiles
UNION ALL
SELECT
    'P30' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p30) AS duration
FROM percentiles
UNION ALL
SELECT
    'P50' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p50) AS duration
FROM percentiles
UNION ALL
SELECT
    'P70' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p70) AS duration
FROM percentiles
UNION ALL
SELECT
    'P80' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p80) AS duration
FROM percentiles
UNION ALL
SELECT
    'P90' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p90) AS duration
FROM percentiles
UNION ALL
SELECT
    'P95' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p95) AS duration
FROM percentiles
UNION ALL
SELECT
    'P98' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p98) AS duration
FROM percentiles
UNION ALL
SELECT
    'P99' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p99) AS duration
FROM percentiles
UNION ALL
SELECT
    'P100' AS percentile,
    MAX(duration) AS duration
FROM kernel_data;
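Here mangledName = 4882 is the id of the celu kernel. The result is:
The celu kernel takes about 55 us at almost every percentile. This is a toy example, so the spread is small; kernels in real workloads can vary far more.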
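The same lookup can also be done outside SQL. The sketch below is one possible Python version, not the method used in this article: it reads the durations with the standard sqlite3 module and picks percentiles by rank (count × p / 100, truncated, and clamped to at least 1 so small samples still return a value). The helper names, the kernel id 7, and the in-memory demo data are invented for illustration.

```python
import sqlite3


def kernel_durations_us(conn, mangled_id):
    """Durations in microseconds for one kernel id (timestamps assumed to be ns)."""
    rows = conn.execute(
        'SELECT ("end" - start) / 1000.0 FROM CUPTI_ACTIVITY_KIND_KERNEL '
        "WHERE mangledName = ?",
        (mangled_id,),
    ).fetchall()
    return [duration for (duration,) in rows]


def rank_percentiles(durations, percents):
    """Map each percent to the value at rank max(1, n * p // 100), 1-based."""
    if not durations:
        return {}
    ordered = sorted(durations)
    n = len(ordered)
    return {p: ordered[max(1, n * p // 100) - 1] for p in percents}


def make_demo_db():
    """Hypothetical data: 100 launches of kernel id 7 lasting 1..100 us."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL '
        '(start INTEGER, "end" INTEGER, mangledName INTEGER)'
    )
    launches = [(i * 1_000_000, i * 1_000_000 + i * 1_000, 7) for i in range(1, 101)]
    conn.executemany("INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (?, ?, ?)", launches)
    return conn


durations = kernel_durations_us(make_demo_db(), 7)
for p, value in rank_percentiles(durations, [10, 50, 90, 99, 100]).items():
    print(f"P{p}: {value} us")
```

Keeping the rank rule in one small function makes it easy to swap in a different percentile definition, such as interpolation, if the rank-based one is too coarse.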
Other statistics can be built the same way. For example, the per-kernel time share mentioned at the beginning follows the same approach.
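As one possible take on the time-share statistic, the sketch below groups launches by mangledName, sums their durations, and divides by the total kernel time. It is an assumption-laden illustration rather than a query from this article: the helper names and the two kernels kernel_a and kernel_b are made up, and the percentages are relative to summed kernel time, not wall-clock time.

```python
import sqlite3

# Per-kernel call count, total time in microseconds, and share of all kernel time.
SHARE_QUERY = """
SELECT s.value AS kernel,
       COUNT(*) AS calls,
       SUM(k."end" - k.start) / 1000.0 AS total_us,
       100.0 * SUM(k."end" - k.start)
           / (SELECT SUM("end" - start) FROM CUPTI_ACTIVITY_KIND_KERNEL) AS pct
FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
JOIN StringIds AS s ON s.id = k.mangledName
GROUP BY k.mangledName, s.value
ORDER BY total_us DESC
"""


def kernel_time_share(conn):
    """Return (kernel, calls, total_us, pct) rows, largest consumer first."""
    return conn.execute(SHARE_QUERY).fetchall()


def make_demo_db():
    """Hypothetical data: kernel_a runs twice for 3 us, kernel_b once for 2 us."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT);
        CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (
            start INTEGER, "end" INTEGER, mangledName INTEGER
        );
        INSERT INTO StringIds VALUES (1, 'kernel_a'), (2, 'kernel_b');
        INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES
            (0, 3000, 1), (5000, 7000, 2), (9000, 12000, 1);
    """)
    return conn


for kernel, calls, total_us, pct in kernel_time_share(make_demo_db()):
    print(f"{kernel}: {calls} calls, {total_us} us, {pct}%")
```

Sorting by total time puts the heaviest kernels at the top, which is usually where optimization effort should start.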
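Afterword
Most of the field meanings described here were worked out by experimentation; corrections are welcome if anything is off.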