Profiling¶
NVIDIA Profiling Tools¶
HPC systems typically favour batch jobs over interactive jobs to improve resource utilisation. The NVIDIA profiling tools can all capture the required data via the command line, producing files that can then be interrogated using the GUI tools locally.
Nsight Systems and Nsight Compute are the modern NVIDIA profiling tools, introduced with CUDA 10.0 and supporting Pascal+ and Volta+ respectively.
The NVIDIA Visual Profiler is the legacy profiling tool, with full support for GPUs up to Pascal (SM < 75), partial support for Turing (SM 75) and no support for Ampere (SM 80).
Compiler settings for profiling¶
Applications compiled with nvcc should pass -lineinfo (or --generate-line-info) to include source-level profile information.
Additionally, the NVIDIA Tools Extension SDK (NVTX) can be used to enhance these profiling tools.
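The compiler flag described in this section can be applied as in the following minimal sketch. The source file name myapplication.cu and the -O3 optimisation level are assumptions for illustration only; the build step is skipped when nvcc or the source file is not present.

```shell
# Sketch: compile a CUDA source with line information for the profilers.
# "myapplication.cu" is a placeholder source file name (assumption).
SRC="myapplication.cu"
OUT="myapplication"
NVCC_FLAGS="-O3 -lineinfo"

NVCC_CMD="nvcc ${NVCC_FLAGS} -o ${OUT} ${SRC}"
echo "${NVCC_CMD}"

# Only attempt the build where nvcc and the source file are present.
if command -v nvcc >/dev/null 2>&1 && [ -f "${SRC}" ]; then
    ${NVCC_CMD}
fi
```

Unlike the device debug flag -G, -lineinfo does not disable device code optimisation, so the profiled binary should behave like a normal optimised build.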
Nsight Systems and Nsight Compute¶
Note
- Nsight Systems supports Pascal and above (SM 60+)
- Nsight Compute supports Volta and above (SM 70+)
Generate an application timeline with the Nsight Systems CLI (nsys):
nsys profile -o timeline ./myapplication
Use the --trace argument to specify which APIs should be traced. See the nsys profiling command switch options for further information.
nsys profile -o timeline --trace cuda,nvtx,osrt,openacc ./myapplication <arguments>
Note
On Bede (Power9) the --trace option osrt can lead to SIGILL errors. As osrt is traced by default, consider passing --trace cuda,nvtx as a minimal alternative.
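One way to apply the advice in this note from a job script is sketched below. The helper select_trace is hypothetical (it is not part of nsys), and the sketch assumes that Power9 nodes report ppc64le from uname -m. The profiling step is skipped when nsys or ./myapplication is not available.

```shell
# Sketch: choose the nsys --trace list from the CPU architecture.
# select_trace is a hypothetical helper, not part of nsys.
select_trace() {
    case "$1" in
        ppc64le) echo "cuda,nvtx" ;;        # Power9: leave out osrt to avoid SIGILL
        *)       echo "cuda,nvtx,osrt" ;;   # elsewhere: also trace OS runtime calls
    esac
}

TRACE="$(select_trace "$(uname -m)")"
NSYS_CMD="nsys profile -o timeline --trace ${TRACE} ./myapplication"
echo "${NSYS_CMD}"

# Only profile where nsys and the application are present.
if command -v nsys >/dev/null 2>&1 && [ -x ./myapplication ]; then
    ${NSYS_CMD}
fi
```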
Once this file has been downloaded to your local machine, it can be opened in nsys-ui/nsight-sys via File > Open > timeline.qdrep.
Fine-grained kernel profile information can be captured remotely using the Nsight Compute CLI (ncu/nv-nsight-cu-cli):
ncu -o profile --set full ./myapplication <arguments>
Note
ncu is available since CUDA v11.0.194 and Nsight Compute v2020.1.1. For older versions of CUDA, use nv-nsight-cu-cli (if Nsight Compute is installed).
This captures the full set of available metrics and populates all sections of the Nsight Compute GUI, but collecting all of this information can lead to very long run times.
For long-running applications, it may be preferable to capture a smaller set of metrics using the --set, --section and --metrics flags, as described in the Nsight Compute Profile Command Line Options table.
The scope of the section being profiled can also be reduced using NVTX Filtering, or by targeting specific kernels using --kernel-id, --kernel-regex and/or --launch-skip (see the CLI docs for more information).
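A reduced capture along these lines might look like the sketch below. The section identifiers SpeedOfLight and Occupancy, the skip and count values, and the output name profile_small are illustrative assumptions; the identifiers available in a given release can be listed with ncu --list-sections. The profiling step is skipped when ncu or ./myapplication is not available.

```shell
# Sketch: collect two sections for a single kernel launch instead of --set full.
# Section names and launch numbers below are illustrative assumptions.
SKIP=10     # kernel launches to skip before profiling starts
COUNT=1     # kernel launches to profile after the skip
SECTIONS="--section SpeedOfLight --section Occupancy"

NCU_CMD="ncu -o profile_small ${SECTIONS} --launch-skip ${SKIP} --launch-count ${COUNT} ./myapplication"
echo "${NCU_CMD}"

# Only profile where ncu and the application are present.
if command -v ncu >/dev/null 2>&1 && [ -x ./myapplication ]; then
    ${NCU_CMD}
fi
```

Restricting both the sections and the number of profiled launches is usually the quickest way to shorten a capture, since each additional section can require the kernel to be replayed.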
Once the .ncu-rep file has been downloaded locally, it can be opened in the local Nsight Compute GUI (ncu-ui/nv-nsight-cu) via:
ncu-ui profile.ncu-rep
Alternatively, use File > Open > profile.ncu-rep, or drag profile.ncu-rep into the nv-nsight-cu window.
Note
Older versions of Nsight Compute (CUDA < v11.0.194) used nv-nsight-cu rather than ncu-ui.
Note
Older versions of Nsight Compute generated .nsight-cuprof-report files instead of .ncu-rep files.
More info¶
Use the following Nsight report files to follow the tutorial.
Cluster Modules¶
module load nvidia/20.5
Visual Profiler (legacy)¶
Note
- Nvprof does not support CUDA kernel profiling for Turing GPUs (SM75)
- Nvprof does not support Ampere GPUs (SM80+)
Application timelines can be generated using nvprof:
nvprof -o timeline.nvprof ./myapplication
Fine-grained kernel profile information can be generated remotely using nvprof:
nvprof --analysis-metrics -o analysis.nvprof ./myapplication
This captures the full set of metrics required to complete the guided analysis, and may take a very long time.
For large applications, request fewer metrics (via --metrics), fewer events (via --events) or target specific kernels (via --kernels). See the nvprof command line options for further information.
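As a sketch of such a reduced capture, the command below restricts collection to two metrics for one kernel. The kernel name mykernel, the metric names and the output name analysis_small.nvprof are illustrative assumptions; nvprof --query-metrics lists the metrics available on a given GPU. The profiling step is skipped when nvprof or ./myapplication is not available.

```shell
# Sketch: collect a small set of metrics for one kernel with nvprof.
# "mykernel" is a placeholder kernel name; the metric names are illustrative.
KERNEL="mykernel"
METRICS="achieved_occupancy,gld_efficiency"

# --kernels is placed before --metrics so that the filter applies to them.
NVPROF_CMD="nvprof --kernels ${KERNEL} --metrics ${METRICS} -o analysis_small.nvprof ./myapplication"
echo "${NVPROF_CMD}"

# Only profile where nvprof and the application are present.
if command -v nvprof >/dev/null 2>&1 && [ -x ./myapplication ]; then
    ${NVPROF_CMD}
fi
```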
Once these files are downloaded to your local machine, import them into the Visual Profiler GUI (nvvp):
- File > Import
- Select Nvprof
- Select Single process
- Select timeline.nvprof for Timeline data file
- Add analysis.nvprof to Event/Metric data files
Documentation¶
Cluster Modules¶
module load cuda/10.1
module load cuda/10.2
module load nvidia/20.5
NVIDIA Tools Extension¶
NVIDIA Tools Extension (NVTX) is a C-based API for annotating events and ranges in applications. These markers and ranges can be used to increase the usability of the NVIDIA profiling tools.
- For CUDA >= 10.0, NVTX version 3 is distributed as a header-only library.
- For CUDA < 10.0, NVTX is distributed as a shared library.
The location of the headers and shared libraries may vary between operating systems and CUDA installations (i.e. CUDA toolkit, PGI compilers or HPC SDK).
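The difference between the two distributions mostly shows up at build time, which the following sketch illustrates. The helper nvtx_flags and the file name nvtx_example.c are hypothetical, the CUDA major version defaults to an assumed value of 11, and the include path shown is the NVTX version 3 one (older releases would include nvToolsExt.h directly). The build step is skipped when nvcc is not installed.

```shell
# Sketch: pick NVTX link flags from the CUDA major version.
# nvtx_flags is a hypothetical helper, not part of the CUDA toolkit.
nvtx_flags() {
    if [ "$1" -ge 10 ]; then
        echo "-ldl"           # NVTX v3 is header only; it loads tools via dlopen
    else
        echo "-lnvToolsExt"   # older releases link the NVTX shared library
    fi
}

# Assumed CUDA major version; override by exporting CUDA_MAJOR.
CUDA_MAJOR="${CUDA_MAJOR:-11}"

# Write a tiny annotated program using the NVTX v3 include path.
WORKDIR="$(mktemp -d)"
SRC="${WORKDIR}/nvtx_example.c"
cat > "${SRC}" <<'EOF'
#include <nvtx3/nvToolsExt.h>

int main(void) {
    nvtxRangePushA("setup");   /* open a named range on this thread */
    nvtxRangePop();            /* close the most recently opened range */
    return 0;
}
EOF

BUILD_CMD="nvcc -o ${WORKDIR}/nvtx_example ${SRC} $(nvtx_flags "${CUDA_MAJOR}")"
echo "${BUILD_CMD}"

# Only build where nvcc is installed.
if command -v nvcc >/dev/null 2>&1; then
    ${BUILD_CMD}
fi
```

Ranges opened with nvtxRangePushA appear as named spans on the timeline when the nvtx trace option is enabled in the profiler.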
The NVIDIA Developer blog contains several posts on using NVTX:
- Generate Custom Application Profile Timelines with NVTX (Jiri Kraus)
- Track MPI Calls In The NVIDIA Visual Profiler (Jeff Larkin)
- Customize CUDA Fortran Profiling with NVTX (Massimiliano Fatica)
Custom CMake find_package modules can be written to enable use within CMake, e.g. ptheywood/cuda-cmake-NVTX on GitHub.