NVIDIA GPU Marker API
Created by: carstenbauer
First 10-minute attempt to wrap LIKWID's GPU Marker API. Note that this currently requires 1) a LIKWID build with NVIDIA_INTERFACE = true
and 2) an NVIDIA GPU.
I tried to quickly build LIKWID on the JUWELS Booster and test things there with A100 cards, but I get quite a few "Permission denied"
messages, so I probably screwed something up (I only changed config.mk to use perf_counters and ran make without thinking further).
Anyway, here is my current test script:
```julia
using LIKWID
using CUDA
using LinearAlgebra

@assert CUDA.functional()

T = Float32
N = 100_000_000
a = convert(T, 3.141)

# Arrays for the CPU run
z = zeros(T, N)
x = rand(T, N)
y = rand(T, N)

# Arrays for the GPU run
z_gpu = zeros(T, N)
x_gpu = rand(T, N)
y_gpu = rand(T, N)

function saxpy_cpu!(z, a, x, y)
    for i in eachindex(z)
        z[i] = a * x[i] + y[i]
    end
    return z
end

function saxpy_gpu!(z, a, x, y)
    CUDA.@sync z .= a .* x .+ y
end

println("CPU")
saxpy_cpu!(z, a, x, y)  # first call outside the region (includes compilation)
LIKWID.Marker.startregion("saxpy_cpu!")
saxpy_cpu!(z, a, x, y)
LIKWID.Marker.stopregion("saxpy_cpu!")

println("GPU")
saxpy_gpu!(z_gpu, a, x_gpu, y_gpu)  # first call outside the region (includes compilation)
LIKWID.GPUMarker.startregion("saxpy_gpu!")
saxpy_gpu!(z_gpu, a, x_gpu, y_gpu)
LIKWID.GPUMarker.stopregion("saxpy_gpu!")
```
And this is the output:
```
➜ bauer3@jwb0033 /p/scratch/chku27/hku273/likwid-test likwid-perfctr -C 0 -g FLOPS_SP -G 0 -W FLOPS_SP -m julia --project=. likwid_gpu.jl
INFO: You are running LIKWID in a cpuset with 1 CPUs. Taking given IDs as logical ID in cpuset
--------------------------------------------------------------------------------
CPU name: AMD EPYC 7402 24-Core Processor
CPU type: AMD K17 (Zen2) architecture
CPU clock: 2.80 GHz
ERROR - [./src/includes/perfmon_perfevent.h:perfmon_setupCountersThread_perfevent:881] Permission denied.
Setup of event ACTUAL_CPU_CLOCK on CPU 18 failed: Permission denied
ERROR - [./src/includes/perfmon_perfevent.h:perfmon_setupCountersThread_perfevent:881] Permission denied.
Setup of event MAX_CPU_CLOCK on CPU 18 failed: Permission denied
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
ERROR - [./src/includes/perfmon_perfevent.h:perfmon_setupCountersThread_perfevent:881] Permission denied.
Setup of event ACTUAL_CPU_CLOCK on CPU 18 failed: Permission denied
ERROR - [./src/includes/perfmon_perfevent.h:perfmon_setupCountersThread_perfevent:881] Permission denied.
Setup of event MAX_CPU_CLOCK on CPU 18 failed: Permission denied
CPU
GPU
--------------------------------------------------------------------------------
Region saxpy_cpu!, Group 1: FLOPS_SP
+-------------------+-------------+
| Region Info       | HWThread 18 |
+-------------------+-------------+
| RDTSC Runtime [s] |    0.072155 |
| call count        |           1 |
+-------------------+-------------+
+---------------------------+---------+-------------+
| Event                     | Counter | HWThread 18 |
+---------------------------+---------+-------------+
| ACTUAL_CPU_CLOCK          | FIXC1   |           0 |
| MAX_CPU_CLOCK             | FIXC2   |           0 |
| RETIRED_INSTRUCTIONS      | PMC0    |  1217671000 |
| CPU_CLOCKS_UNHALTED       | PMC1    |   239200700 |
| RETIRED_SSE_AVX_FLOPS_ALL | PMC2    |   200000400 |
| MERGE                     | PMC3    |           0 |
+---------------------------+---------+-------------+
+----------------------+-------------+
| Metric               | HWThread 18 |
+----------------------+-------------+
| Runtime (RDTSC) [s]  |      0.0722 |
| Runtime unhalted [s] |           0 |
| Clock [MHz]          |           - |
| CPI                  |      0.1964 |
| SP [MFLOP/s]         |   2771.7970 |
+----------------------+-------------+
Region saxpy_gpu!, Group 1: FLOPS_SP
+-------------------+----------+
| Region Info       |    GPU 0 |
+-------------------+----------+
| RDTSC Runtime [s] | 0.061121 |
| call count        |        1 |
+-------------------+----------+
+----------------------------------------------------+---------+-------+
| Event                                              | Counter | GPU 0 |
+----------------------------------------------------+---------+-------+
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FADD_PRED_ON_SUM | GPU0    |     0 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FMUL_PRED_ON_SUM | GPU1    |     0 |
| SMSP_SASS_THREAD_INST_EXECUTED_OP_FFMA_PRED_ON_SUM | GPU2    |     0 |
+----------------------------------------------------+---------+-------+
+---------------------+--------+
| Metric              |  GPU 0 |
+---------------------+--------+
| Runtime (RDTSC) [s] | 0.0611 |
| SP [MFLOP/s]        |      0 |
+---------------------+--------+
```
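As a quick sanity check on the CPU region, the reported numbers can be cross-checked by hand. saxpy performs one multiply and one add per element, so about 2N FLOPs are expected. The sketch below only reuses values copied from the saxpy_cpu! tables above; the variable names are illustrative and not part of LIKWID.

```julia
# Values copied from the saxpy_cpu! region output
N              = 100_000_000
measured_flops = 200_000_400     # RETIRED_SSE_AVX_FLOPS_ALL
runtime_s      = 0.072155        # RDTSC Runtime [s]
instructions   = 1_217_671_000   # RETIRED_INSTRUCTIONS
cycles         = 239_200_700     # CPU_CLOCKS_UNHALTED

# saxpy: one multiply and one add per element
expected_flops = 2 * N

# Difference between the measured and the expected FLOP count
flop_diff = measured_flops - expected_flops

# Derived metrics, computed the same way the metric table labels them
mflops = measured_flops / runtime_s / 1e6   # SP [MFLOP/s]
cpi    = cycles / instructions              # cycles per instruction

println("expected FLOPs: ", expected_flops)
println("measured - expected: ", flop_diff)
println("SP [MFLOP/s]: ", round(mflops; digits=1))
println("CPI: ", round(cpi; digits=4))
```

The measured count differs from 2N by only 400 FLOPs, and the derived throughput and CPI agree with the metric table to within rounding of the printed runtime, so the CPU side of the Marker API looks plausible despite the failed fixed-counter setup.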
I don't know why the SP counters for GPU 0 are zero (maybe I need to call GPUMarker.threadinit()
somewhere?). I will test things more properly soon.
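One thing that may be worth checking first: in the script above, z_gpu, x_gpu and y_gpu are created with Base zeros and rand, which return ordinary host arrays, so the broadcast in saxpy_gpu! may never launch a GPU kernel. That would be consistent with the zero counters, but it is only a guess and has not been verified on the JUWELS setup. The sketch below needs no GPU and only inspects the array types; the CUDA.jl calls shown in the comments (CUDA.zeros, CUDA.rand) are the usual device-side counterparts and are not executed here.

```julia
T = Float32
N = 8  # small size, just to inspect the types

z_host = zeros(T, N)
x_host = rand(T, N)

# Base `zeros` and `rand` allocate ordinary host arrays
println(typeof(z_host))
println(typeof(x_host))

# With CUDA.jl loaded, device arrays would instead be created with, e.g.:
#   z_gpu = CUDA.zeros(T, N)
#   x_gpu = CUDA.rand(T, N)
#   y_gpu = CUDA.rand(T, N)
# and `z_gpu isa CuArray` should then hold.
```

If the arrays do turn out to be the cause, the GPUMarker region itself would not need to change, only the allocation of the three `_gpu` arrays.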
cc: @JBlaschke @vchuravy @TomTheBear
Closes #2 (closed)