NVIDIA Profiling Tools

HPC systems typically favour batch jobs over interactive jobs to improve resource utilisation. The NVIDIA profiling tools can all capture the required profile data from the command line, and the resulting files can then be interrogated locally using the GUI tools.

Nsight Systems and Nsight Compute are the modern NVIDIA profiling tools. They were introduced with CUDA 10.0 and support Pascal+ and Volta+ GPUs respectively.

The NVIDIA Visual Profiler is the legacy profiling tool, with full support for pre-Turing GPUs (SM < 75), partial support for Turing (SM 75) and no support for Ampere (SM 80+).
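The support summary above can be sketched as a small shell helper. This is an illustrative lookup only: the function name profilers_for_sm is hypothetical, and the thresholds are copied from the text above rather than queried from the tools themselves.

```shell
# Print the profiling tools usable for a given compute capability,
# written as major*10 + minor (e.g. 70 for SM 7.0).
# profilers_for_sm is a hypothetical helper; thresholds follow the text above.
profilers_for_sm() {
    sm="$1"
    tools=""
    [ "$sm" -ge 60 ] && tools="$tools nsys"    # Nsight Systems: Pascal+
    [ "$sm" -ge 70 ] && tools="$tools ncu"     # Nsight Compute: Volta+
    if [ "$sm" -lt 75 ]; then
        tools="$tools nvprof"                  # full legacy support
    elif [ "$sm" -lt 80 ]; then
        tools="$tools nvprof(partial)"         # Turing: partial support
    fi
    # Unquoted on purpose so the leading space is dropped.
    echo $tools
}

profilers_for_sm 70
```

Calling the helper with the SM value of the target GPU gives a quick reminder of which command line tools are worth loading for that device.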

Compiler settings for profiling

Applications compiled with nvcc should pass -lineinfo (or --generate-line-info) to include source-level profile information.
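As a minimal sketch, the flag can simply be added to an existing optimised build. The file names main.cu and myapplication are placeholders, and the build step is skipped (printing the command instead) when nvcc or the source file is not present.

```shell
# Optimised build with source line information for the profilers.
# main.cu and myapplication are placeholder names.
NVCC_FLAGS="-O3 -lineinfo"

if command -v nvcc >/dev/null 2>&1 && [ -f main.cu ]; then
    # NVCC_FLAGS is intentionally unquoted so it splits into separate flags.
    nvcc $NVCC_FLAGS -o myapplication main.cu
else
    echo "Would run: nvcc $NVCC_FLAGS -o myapplication main.cu"
fi
```

Unlike the device debug flag -G, -lineinfo does not disable device code optimisation, so it is suitable for profiling release builds.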

Additionally, the NVIDIA Tools Extension SDK (NVTX) can be used to annotate applications and enhance the output of these profiling tools.

Nsight Systems and Nsight Compute


  • Nsight Systems supports Pascal and above (SM 60+)
  • Nsight Compute supports Volta and above (SM 70+)

Generate an application timeline with Nsight Systems CLI (nsys):

nsys profile -o timeline ./myapplication

Use the --trace argument to specify which APIs should be traced. See the nsys profiling command switch options for further information.

nsys profile -o timeline --trace cuda,nvtx,osrt,openacc ./myapplication <arguments>


On Bede (Power9), tracing osrt can lead to SIGILL errors. As osrt is traced by default, consider passing --trace cuda,nvtx as an alternative minimum.
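One way to apply this in a job script shared between systems is to choose the traced APIs from the CPU architecture. The sketch below assumes that Power9 nodes report ppc64le from uname -m; the helper name nsys_trace_apis is hypothetical, and the profiler is only invoked when nsys and the placeholder ./myapplication exist.

```shell
# Choose the APIs for nsys to trace based on the CPU architecture.
# Assumption: Power9 nodes report "ppc64le" from `uname -m`.
nsys_trace_apis() {
    arch="${1:-$(uname -m)}"
    case "$arch" in
        ppc64le) echo "cuda,nvtx" ;;        # avoid osrt, which can raise SIGILL
        *)       echo "cuda,nvtx,osrt" ;;
    esac
}

TRACE="$(nsys_trace_apis)"
echo "nsys profile -o timeline --trace ${TRACE} ./myapplication"

# Only run the profiler when nsys and the (placeholder) application exist.
if command -v nsys >/dev/null 2>&1 && [ -x ./myapplication ]; then
    nsys profile -o timeline --trace "${TRACE}" ./myapplication
fi
```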

Once the report file (timeline.qdrep) has been downloaded to your local machine, it can be opened in nsys-ui/nsight-sys via File > Open > timeline.qdrep.

Fine-grained kernel profile information can be captured remotely using the Nsight Compute CLI (ncu/nv-nsight-cu-cli):

ncu -o profile --set full ./myapplication <arguments>


ncu has been available since CUDA v11.0.194 and Nsight Compute v2020.1.1. For older versions of CUDA, use nv-nsight-cu-cli (if Nsight Compute is installed).

The --set full option captures the full set of available metrics, populating all sections of the Nsight Compute GUI. However, capturing all of this information can lead to very long run times.

For long-running applications, it may be preferable to capture a smaller set of metrics using the --set, --section and --metrics flags, as described in the Nsight Compute Profile Command Line Options table.

The scope of the section being profiled can also be reduced using NVTX Filtering, or by targeting specific kernels using --kernel-id, --kernel-regex and/or --launch-skip (see the CLI docs for more information).
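A reduced capture might look like the following sketch. It selects whichever Nsight Compute CLI name is installed, collects a single section, and limits profiling to a window of kernel launches. The helper first_available, the output name profile-reduced, the choice of the SpeedOfLight section and the launch counts are all illustrative assumptions; check --help on older releases, as the available options differ between versions.

```shell
# Print the first command name from the arguments that is found on PATH.
# first_available is a hypothetical helper, not part of the NVIDIA tools.
first_available() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Prefer ncu, falling back to the older nv-nsight-cu-cli name.
if NCU="$(first_available ncu nv-nsight-cu-cli)" && [ -x ./myapplication ]; then
    # Collect one section only, skipping the first 10 kernel launches
    # and profiling the next 5. The counts are arbitrary examples.
    "$NCU" -o profile-reduced --section SpeedOfLight \
        --launch-skip 10 --launch-count 5 ./myapplication
else
    echo "Nsight Compute CLI or ./myapplication not found; nothing profiled"
fi
```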

Once the .ncu-rep file has been downloaded locally, it can be opened in the local Nsight Compute GUI (ncu-ui/nv-nsight-cu) via:

ncu-ui profile.ncu-rep

Alternatively, use File > Open > profile.ncu-rep, or drag profile.ncu-rep into the nv-nsight-cu window.


Older versions of Nsight Compute (CUDA < v11.0.194) used nv-nsight-cu rather than ncu-ui.


Older versions of Nsight Compute generated .nsight-cuprof-report files, instead of .ncu-rep files.

Cluster Modules

  • module load nvidia/20.5

Visual Profiler (legacy)


  • Nvprof does not support CUDA kernel profiling for Turing GPUs (SM75)
  • Nvprof does not support Ampere GPUs (SM80+)

Application timelines can be generated using nvprof:

nvprof -o timeline.nvprof ./myapplication

Fine-grained kernel profile information can be generated remotely using nvprof:

nvprof --analysis-metrics -o analysis.nvprof ./myapplication

This captures the full set of metrics required to complete the guided analysis, and may take a very long time. For large applications, request fewer metrics (via --metrics), fewer events (via --events) or target specific kernels (via --kernels). See the nvprof command line options for further information.
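As a hedged example of a reduced capture, the sketch below requests two metrics for a single kernel. The kernel name myKernel and the output file metrics.nvprof are hypothetical, and the metric names are examples only; the metrics available on a given GPU can be listed with nvprof --query-metrics. Note that --kernels is placed before --metrics, as it limits the scope of the options that follow it.

```shell
# Collect two metrics for kernels whose name matches "myKernel" only.
# myKernel is a hypothetical kernel name; the metric names are examples.
NVPROF_METRICS="achieved_occupancy,gld_efficiency"

if command -v nvprof >/dev/null 2>&1 && [ -x ./myapplication ]; then
    nvprof --kernels myKernel --metrics "$NVPROF_METRICS" \
        -o metrics.nvprof ./myapplication
else
    echo "Would run: nvprof --kernels myKernel --metrics $NVPROF_METRICS -o metrics.nvprof ./myapplication"
fi
```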

Once these files are downloaded to your local machine, import them into the Visual Profiler GUI (nvvp):

  • File > Import
  • Select Nvprof
  • Select Single process
  • Select timeline.nvprof for Timeline data file
  • Add analysis.nvprof to Event/Metric data files


Cluster Modules

  • module load cuda/10.1
  • module load cuda/10.2
  • module load nvidia/20.5

NVIDIA Tools Extension

NVIDIA Tools Extension (NVTX) is a C-based API for annotating events and ranges in applications. These markers and ranges can be used to increase the usability of the NVIDIA profiling tools.

  • For CUDA >= 10.0, NVTX version 3 is distributed as a header-only library.
  • For CUDA < 10.0, NVTX is distributed as a shared library.

The location of the headers and shared libraries may vary between operating systems and CUDA installations (e.g. CUDA toolkit, PGI compilers or HPC SDK).
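The following sketch shows what basic NVTX annotation can look like with the header-only NVTX version 3. It writes a small C program that wraps two phases in named ranges and emits one marker, then builds it if nvcc is available. The file name nvtx_demo.c and the range and function names are placeholders, and the include path is assumed to be provided by nvcc; other installations may need an explicit -I flag.

```shell
# Write a small C program annotated with NVTX push/pop ranges and a marker.
# File, range and function names are placeholders.
cat > nvtx_demo.c <<'EOF'
#include <stdio.h>
#include <nvtx3/nvToolsExt.h>  /* NVTX v3, header only (CUDA >= 10.0) */

static double work(int n) {
    double sum = 0.0;
    for (int i = 1; i <= n; ++i) sum += 1.0 / i;
    return sum;
}

int main(void) {
    nvtxRangePushA("initialise");   /* open a named range */
    double a = work(1000000);
    nvtxRangePop();                 /* close the most recent range */

    nvtxMarkA("halfway");           /* instantaneous marker */

    nvtxRangePushA("compute");
    double b = work(2000000);
    nvtxRangePop();

    printf("%f %f\n", a, b);
    return 0;
}
EOF

# Build only when nvcc is available. -ldl is needed on Linux because the
# header-only NVTX v3 loads the profiler's injection library at run time.
if command -v nvcc >/dev/null 2>&1; then
    nvcc -o nvtx_demo nvtx_demo.c -ldl
fi
```

Profiling the resulting binary with nsys profile --trace nvtx -o nvtx_timeline ./nvtx_demo should show the "initialise" and "compute" ranges on the timeline. Without a profiler attached, the NVTX calls do very little work.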

The NVIDIA Developer blog contains several posts on using NVTX.

Custom CMake find_package modules can be written to enable the use of NVTX within CMake, e.g. ptheywood/cuda-cmake-NVTX on GitHub.