NVIDIA GPU Marker API

Thomas Gruber requested to merge cb/gpumarker into main

Created by: carstenbauer

First 10-minute attempt to wrap LIKWID's GPU Marker API 😄. It might be functional already, but I couldn't really test it thoroughly, since I don't currently have easy access to a node with 1) likwid >= 5.0.0 compiled with NVIDIA_INTERFACE = true and 2) an NVIDIA GPU.

I tried to quickly build likwid on the JUWELS Booster and test things there with A100 cards, but I get quite a few "Permission denied" messages, so I probably screwed something up 😄 (apart from the NVIDIA setting, I only changed config.mk to use perf_counters and ran make without thinking further).

Anyway, here is my current test script:

using LIKWID
using CUDA
using LinearAlgebra

@assert CUDA.functional()

T = Float32
N = 100_000_000

a = convert(T, 3.141)
z = zeros(T, N)
x = rand(T, N)
y = rand(T, N)

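# NOTE: as written, these are host (CPU) arrays; device arrays would be
# created with CUDA.zeros(T, N) and CUDA.rand(T, N) instead.
z_gpu = zeros(T, N)
x_gpu = rand(T, N)
y_gpu = rand(T, N)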

function saxpy_cpu!(z,a,x,y)
    for i in eachindex(z)
        z[i] = a*x[i] + y[i]
    end
    return z
end

function saxpy_gpu!(z,a,x,y)
    CUDA.@sync z .= a .* x .+ y
end

println("CPU")
saxpy_cpu!(z,a,x,y)
LIKWID.Marker.startregion("saxpy_cpu!")
saxpy_cpu!(z,a,x,y)
LIKWID.Marker.stopregion("saxpy_cpu!")

println("GPU")
saxpy_gpu!(z_gpu,a,x_gpu,y_gpu)
LIKWID.GPUMarker.startregion("saxpy_gpu!")
saxpy_gpu!(z_gpu,a,x_gpu,y_gpu)
LIKWID.GPUMarker.stopregion("saxpy_gpu!")

And this is the output:

➜  bauer3@jwb0033 /p/scratch/chku27/hku273/likwid-test  likwid-perfctr -C 0 -g FLOPS_SP -G 0 -W FLOPS_SP -m julia --project=. likwid_gpu.jl
INFO: You are running LIKWID in a cpuset with 1 CPUs. Taking given IDs as logical ID in cpuset
--------------------------------------------------------------------------------
CPU name:       AMD EPYC 7402 24-Core Processor                
CPU type:       AMD K17 (Zen2) architecture
CPU clock:      2.80 GHz
ERROR - [./src/includes/perfmon_perfevent.h:perfmon_setupCountersThread_perfevent:881] Permission denied.
Setup of event ACTUAL_CPU_CLOCK on CPU 18 failed: Permission denied
ERROR - [./src/includes/perfmon_perfevent.h:perfmon_setupCountersThread_perfevent:881] Permission denied.
Setup of event MAX_CPU_CLOCK on CPU 18 failed: Permission denied
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
ERROR - [./src/includes/perfmon_perfevent.h:perfmon_setupCountersThread_perfevent:881] Permission denied.
Setup of event ACTUAL_CPU_CLOCK on CPU 18 failed: Permission denied
ERROR - [./src/includes/perfmon_perfevent.h:perfmon_setupCountersThread_perfevent:881] Permission denied.
Setup of event MAX_CPU_CLOCK on CPU 18 failed: Permission denied
CPU
GPU
--------------------------------------------------------------------------------
Region saxpy_cpu!, Group 1: FLOPS_SP
+-------------------+-------------+
|    Region Info    | HWThread 18 |
+-------------------+-------------+
| RDTSC Runtime [s] |    0.072155 |
|     call count    |           1 |
+-------------------+-------------+

+---------------------------+---------+-------------+
|           Event           | Counter | HWThread 18 |
+---------------------------+---------+-------------+
|      ACTUAL_CPU_CLOCK     |  FIXC1  |           0 |
|       MAX_CPU_CLOCK       |  FIXC2  |           0 |
|    RETIRED_INSTRUCTIONS   |   PMC0  |  1217671000 |
|    CPU_CLOCKS_UNHALTED    |   PMC1  |   239200700 |
| RETIRED_SSE_AVX_FLOPS_ALL |   PMC2  |   200000400 |
|           MERGE           |   PMC3  |           0 |
+---------------------------+---------+-------------+

+----------------------+-------------+
|        Metric        | HWThread 18 |
+----------------------+-------------+
|  Runtime (RDTSC) [s] |      0.0722 |
| Runtime unhalted [s] |           0 |
|      Clock [MHz]     |      -      |
|          CPI         |      0.1964 |
|     SP [MFLOP/s]     |   2771.7970 |
+----------------------+-------------+

Region saxpy_gpu!, Group 1: FLOPS_SP
+-------------------+----------+
|    Region Info    |   GPU 0  |
+-------------------+----------+
| RDTSC Runtime [s] | 0.061121 |
|     call count    |        1 |
+-------------------+----------+

+----------------------------------------------------+---------+-------+
|                        Event                       | Counter | GPU 0 |
+----------------------------------------------------+---------+-------+
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FADD_PRED_ON_SUM |   GPU0  |     0 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FMUL_PRED_ON_SUM |   GPU1  |     0 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FFMA_PRED_ON_SUM |   GPU2  |     0 |
+----------------------------------------------------+---------+-------+

+---------------------+--------+
|        Metric       |  GPU 0 |
+---------------------+--------+
| Runtime (RDTSC) [s] | 0.0611 |
|     SP [MFLOP/s]    |      0 |
+---------------------+--------+
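
As a quick sanity check on the CPU region, the counter values above can be compared with what saxpy should produce: one multiply and one add per element, so roughly 2N floating-point operations. The sketch below uses only numbers copied from the tables above and assumes that the SP [MFLOP/s] metric is simply the counted flops divided by the RDTSC runtime; the variable names are illustrative.

```julia
# Numbers copied from the saxpy_cpu! region tables above.
N       = 100_000_000   # vector length used in the test script
flops   = 200_000_400   # RETIRED_SSE_AVX_FLOPS_ALL
runtime = 0.072155      # RDTSC Runtime [s]

# One multiply and one add per element gives 2N floating-point operations.
expected_flops = 2 * N

# Relative deviation of the measured count from the 2N estimate.
rel_dev = abs(flops - expected_flops) / expected_flops

# Rate in MFLOP/s, assumed here to be counted flops over RDTSC runtime.
mflops = flops / runtime / 1e6

println("expected flops:     ", expected_flops)
println("relative deviation: ", rel_dev)
println("SP rate [MFLOP/s]:  ", round(mflops; digits = 1))
```

The measured count is within a tiny fraction of 2N and the derived rate is close to the reported 2771.7970 MFLOP/s, so the CPU marker region looks plausible despite the Permission denied messages. By the same reasoning, a GPU region that really executed the kernel would be expected to show a nonzero count in at least one of the three GPU events.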

I don't know why the SP counter for GPU 0 is zero (maybe I need to call GPUMarker.threadinit() somewhere?). I will test things more properly soon.
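
One possible contributor, offered as a guess and not verified on a GPU node: the `*_gpu` arrays in the script are built with Base's `zeros` and `rand`, which return ordinary host arrays, so the broadcast inside `saxpy_gpu!` may never have run on the device. The minimal sketch below only uses Base Julia to show the array types involved; the CUDA.jl constructors are mentioned in comments and are not executed here.

```julia
T = Float32
n = 8  # small length, for illustration only

# Same constructors as the *_gpu arrays in the test script.
z_gpu = zeros(T, n)
x_gpu = rand(T, n)

# Base.zeros and Base.rand return ordinary host arrays.
println(typeof(z_gpu))     # Vector{Float32}
println(z_gpu isa Array)   # true

# With CUDA.jl loaded, device arrays would instead come from, e.g.:
#   z_gpu = CUDA.zeros(T, n)   # returns a CuArray
#   x_gpu = CUDA.rand(T, n)    # returns a CuArray
# and `z_gpu isa CuArray` could be asserted before starting the GPU region.
```

Whether this fully explains the zeros, or whether the Permission denied errors or a missing GPUMarker initialization also play a role, is not clear from this run.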

cc: @JBlaschke @vchuravy @TomTheBear

Closes #2 (closed)
