How to Elegantly Measure GPU CUDA Kernel Time? (3) - Kernel Timing Statistics with nsys

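Background

Parts 1 and 2 introduced common GPU performance analysis techniques, focusing on timing a single kernel. In practice, you may also need timing statistics across many kernel launches, for example:

  1. The same kernel can take a different amount of time at different moments, so observing one launch is not accurate enough. You need percentile statistics of the kernel's duration.

  2. Performance optimization starts with finding the biggest time consumers, which requires the share of total time taken by each kind of kernel.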

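Parsing Nsys SQLite Data

One way to get these numbers is to wrap each kernel with torch.cuda.Event timers and collect the statistics yourself. That is simple and direct, but too tedious, and this article does not cover it. Instead, we analyze the report file produced by nsys.

Here is the sample code: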

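import torch
import torch.nn.functional as F


def test():
    # Three square matrices of different sizes, copied to the GPU.
    matrix1 = torch.randn((1024, 1024)).cuda()
    matrix2 = torch.randn((2048, 2048)).cuda()
    matrix3 = torch.randn((4096, 4096)).cuda()

    # Warm up so one-time initialization does not distort the measured loop.
    for _ in range(10):
        _ = F.linear(matrix2, matrix2)

    # Each iteration launches an add, a celu, and a matmul kernel.
    for _ in range(100):
        _ = matrix1 + matrix1
        _ = F.celu(matrix3)
        _ = F.linear(matrix2, matrix2)

    # Kernel launches are asynchronous; wait for all of them before exiting.
    torch.cuda.synchronize()


if __name__ == "__main__":
    test()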

Profile it with nsys:


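The profile data corresponds one-to-one with the functions in the code. To read the nsys data, first export the report to a SQLite file, as shown in the figure.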
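The profile and export steps can also be done from the command line instead of the GUI. The sketch below only assembles the two `nsys` invocations and runs them if the CLI is installed. The script name `test.py`, the report name `report`, and the `.nsys-rep` extension are assumptions for illustration (older nsys releases write `.qdrep` files), and the helper names are hypothetical.

```python
import shutil
import subprocess


def build_nsys_commands(script="test.py", report="report"):
    """Return the profile and export command lines as argument lists."""
    # Assumption: the report lands in <report>.nsys-rep in the working directory.
    profile_cmd = ["nsys", "profile", "-o", report, "python", script]
    export_cmd = [
        "nsys", "export",
        "--type", "sqlite",
        "--output", f"{report}.sqlite",
        f"{report}.nsys-rep",
    ]
    return profile_cmd, export_cmd


def run_nsys(script="test.py", report="report"):
    """Run both commands if the nsys CLI is on PATH; return False otherwise."""
    if shutil.which("nsys") is None:
        return False
    for cmd in build_nsys_commands(script, report):
        subprocess.run(cmd, check=True)
    return True
```

Calling `run_nsys()` next to the sample script should leave a `report.sqlite` file that the rest of this article works with.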

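Once you have the SQLite file, you can analyze it with SQL.

It helps to pick a SQLite GUI first, since it makes the data easier to browse. This article uses SQLiteStudio, an open-source tool, but any other will do. The opened SQLite file looks like this: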


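The file contains a lot of data that you can explore on your own. In general, two tables matter most:
1. CUPTI_ACTIVITY_KIND_KERNEL: kernel activity records
2. StringIds: the string lookup table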
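If you prefer a script over a GUI, the same inspection can be sketched with Python's built-in `sqlite3` module. The helper names below are hypothetical, and the file name `report.sqlite` in the usage note is an assumption. The kernel table may be absent if the profile captured no kernels, so the second helper reports which of the two tables are missing.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the exported database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def missing_tables(conn, required=("CUPTI_ACTIVITY_KIND_KERNEL", "StringIds")):
    """Return the required tables that are not present (case-insensitive)."""
    present = {name.lower() for name in list_tables(conn)}
    return [name for name in required if name.lower() not in present]
```

For example, `missing_tables(sqlite3.connect("report.sqlite"))` returns an empty list when both tables exist.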

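CUPTI_ACTIVITY_KIND_KERNEL holds the timeline information for each kernel launch and is where most of the useful data lives. Note that the kernel name columns in this table store ids, which must be looked up in StringIds to get the actual strings.
Each kernel has three name columns: demangledName, shortName, and mangledName. The figure below shows the three names of the same kernel:


See the figure for what each one means. In general, mangledName is the most precise way to identify a kernel.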
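The id-to-string lookup described above can be sketched as a join against StringIds, once per name column. This assumes StringIds has `id` and `value` columns, which matches the exports I have seen but may differ between nsys versions. The function name is hypothetical.

```python
import sqlite3


def kernel_names(conn, limit=10):
    """Return (name id, short, demangled, mangled) for each distinct kernel."""
    sql = """
        SELECT DISTINCT
            k.mangledName AS name_id,
            s_short.value AS short_name,
            s_dem.value   AS demangled_name,
            s_man.value   AS mangled_name
        FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
        JOIN StringIds AS s_short ON s_short.id = k.shortName
        JOIN StringIds AS s_dem   ON s_dem.id   = k.demangledName
        JOIN StringIds AS s_man   ON s_man.id   = k.mangledName
        ORDER BY name_id
        LIMIT ?
    """
    return conn.execute(sql, (limit,)).fetchall()
```

The first column of each row is the id you would plug into a `WHERE mangledName = ...` filter.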

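Kernel Statistics

In principle you can write SQL to compute any statistic about the kernels. Here is a query for the percentiles of one kernel's duration, computed mainly from the start, end, and mangledName columns:


The SQL is as follows (generated by ChatGPT):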

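WITH kernel_data AS (
    -- One row per launch of the selected kernel.
    -- start/end are in nanoseconds, so duration is in microseconds.
    SELECT 
        mangledName,
        start, end,
        ((end - start)/1000.0) AS duration
    FROM 
        CUPTI_ACTIVITY_KIND_KERNEL
    WHERE 
        mangledName = 4882          -- StringIds id of the kernel to analyze
        AND start > 400000000       -- only launches after this timestamp
),
percentiles AS (
    -- Row index of each percentile in the sorted durations.
    -- MAX(1, ...) keeps the index valid when there are only a few samples.
    SELECT 
        COUNT(*) AS total_count,
        MAX(1, CAST(COUNT(*) * 0.10 AS INTEGER)) AS p10,
        MAX(1, CAST(COUNT(*) * 0.30 AS INTEGER)) AS p30,
        MAX(1, CAST(COUNT(*) * 0.50 AS INTEGER)) AS p50,
        MAX(1, CAST(COUNT(*) * 0.70 AS INTEGER)) AS p70,
        MAX(1, CAST(COUNT(*) * 0.80 AS INTEGER)) AS p80,
        MAX(1, CAST(COUNT(*) * 0.90 AS INTEGER)) AS p90,
        MAX(1, CAST(COUNT(*) * 0.95 AS INTEGER)) AS p95,
        MAX(1, CAST(COUNT(*) * 0.98 AS INTEGER)) AS p98,
        MAX(1, CAST(COUNT(*) * 0.99 AS INTEGER)) AS p99
    FROM 
        kernel_data
)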
SELECT 
    'P10' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p10) AS duration
FROM percentiles
UNION ALL
SELECT 
    'P30' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p30) AS duration
FROM percentiles
UNION ALL
SELECT 
    'P50' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p50) AS duration
FROM percentiles
UNION ALL
SELECT 
    'P70' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p70) AS duration
FROM percentiles
UNION ALL
SELECT 
    'P80' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p80) AS duration
FROM percentiles
UNION ALL
SELECT 
    'P90' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p90) AS duration
FROM percentiles
UNION ALL
SELECT 
    'P95' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p95) AS duration
FROM percentiles
UNION ALL
SELECT 
    'P98' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p98) AS duration
FROM percentiles
UNION ALL
SELECT 
    'P99' AS percentile,
    (SELECT duration FROM (SELECT duration, ROW_NUMBER() OVER (ORDER BY duration) AS row_num FROM kernel_data) WHERE row_num = p99) AS duration
FROM percentiles
UNION ALL
SELECT 
    'P100' AS percentile,
    MAX(duration) AS duration
FROM kernel_data;

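Here mangledName = 4882 refers to the celu kernel. The result is:


The celu kernel takes about 55 us almost every time. The spread is small because this is a toy example; real kernels can vary much more.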
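The same percentiles can also be computed outside SQL, which avoids repeating one sub-query per percentile. The sketch below reads the durations with `sqlite3` and applies the nearest-rank definition (rank = ceil(p * n / 100)). That is an assumption on my part and can differ by one sample from the truncating index used in the query above. The function names are hypothetical.

```python
import sqlite3


def kernel_durations_us(conn, name_id, min_start_ns=0):
    """Return durations (in microseconds) of every launch of one kernel."""
    rows = conn.execute(
        'SELECT ("end" - start) / 1000.0 '
        "FROM CUPTI_ACTIVITY_KIND_KERNEL "
        "WHERE mangledName = ? AND start > ?",
        (name_id, min_start_ns),
    ).fetchall()
    return [duration for (duration,) in rows]


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of an ascending list; pct is an integer 1..100."""
    if not sorted_values:
        raise ValueError("no samples")
    # Integer ceiling of pct * n / 100, clamped to at least 1.
    rank = max(1, (pct * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def percentile_table(durations, pcts=(10, 30, 50, 70, 80, 90, 95, 98, 99, 100)):
    """Map 'P10', 'P30', ... to the corresponding duration."""
    ordered = sorted(durations)
    return {f"P{p}": nearest_rank(ordered, p) for p in pcts}
```

With the values from this article, `percentile_table(kernel_durations_us(conn, 4882, 400000000))` would cover the same launches as the SQL query.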
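Other statistics can be built in the same way. For example, the per-kernel share of total time mentioned at the start follows the same principle.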
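As one possible version of that share-of-time statistic, the sketch below groups launches by mangledName, sums their durations, and expresses each total as a percentage of all kernel time. It again assumes StringIds has `id` and `value` columns, and the function name is hypothetical.

```python
import sqlite3


def kernel_time_share(conn):
    """Return (kernel name, calls, total us, percent of all kernel time)."""
    sql = """
        SELECT
            s.value                          AS kernel_name,
            COUNT(*)                         AS calls,
            SUM(k."end" - k.start) / 1000.0  AS total_us
        FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
        JOIN StringIds AS s ON s.id = k.mangledName
        GROUP BY k.mangledName
        ORDER BY total_us DESC
    """
    rows = conn.execute(sql).fetchall()
    grand_total = sum(total for _, _, total in rows)
    return [
        (name, calls, total, 100.0 * total / grand_total if grand_total else 0.0)
        for name, calls, total in rows
    ]
```

The rows come back sorted by total time, so the biggest time consumers are at the top.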

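Postscript

Most of what I say about the meaning of the table columns comes from my own exploration. Corrections are welcome if anything is off.

Series

  1. How to Elegantly Measure GPU CUDA Kernel Time? (1)
  2. How to Elegantly Measure GPU CUDA Kernel Time? (2)
  3. How to Elegantly Measure GPU CUDA Kernel Time? (3)
Link to this article: https://rainlin.top/archives/269
When reposting, please credit the source: https://rainlin.top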