[Online] Scaling CUDA-Accelerated Applications

Name: [Online] Scaling CUDA-Accelerated Applications
Start: 2026-09-07T09:00:00+02:00
End: 2026-09-09T15:00:00+02:00
Location: Online

7 Sept 2026, 09:00 → 9 Sept 2026, 15:00 Europe/Berlin


Description

Scaling CUDA-Accelerated Applications

NHR@FAU

Schedule & Format

Date: 2026, September 7-9
Times:
- Sep 7: 9:00 - 15:00 CEST
- Sep 8: 9:00 - 15:00 CEST
- Sep 9: 9:00 - 15:00 CEST
Format: Three-day
Location: Online via Zoom
Language: English

Registered participants will receive the video conferencing link via email on the day before the course.

From Zero to Multi-Node GPU Programming

This event is part of the From Zero to Multi-Node GPU Programming series. Registration is done individually for each part of the series.

Part 1 - Introduction to CUDA C/C++ (2026, September 3-4) (Register)
Part 2 - Scaling CUDA-Accelerated Applications (this course) (2026, September 7-9) (Register)

Instructors

Dr. Sebastian Kuckuk, NHR@FAU, certified NVIDIA DLI Ambassador
Aditya Ujeniya, NHR@FAU
Markus Velten, NHR@TUD, certified NVIDIA DLI Ambassador

This course is organized by the Erlangen National High Performance Computing Center (NHR@FAU) in collaboration with NHR@TUD.

Course Description

Scaling a GPU application beyond a single accelerator requires both intra-node and inter-node parallelism. This course provides a comprehensive treatment of both: part one covers CUDA streams, multi-GPU execution within a node, and direct peer-to-peer GPU memory access; part two extends that foundation across compute nodes using CUDA-aware MPI and NVSHMEM, including 1D domain decomposition and halo-exchange patterns, with copy/compute overlap as a recurring optimization. A single 2D heat-diffusion stencil serves as the running example, refined step by step from a CPU baseline through managed memory and algorithmic partitioning to distributed multi-GPU execution. Each hands-on step is provided at multiple difficulty levels, from guided starting points to full solutions.
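To make the running example concrete, the following is a minimal CPU-only C++ sketch of a 2D heat-diffusion stencil that is split into two row patches (a 1D decomposition) with a halo exchange after every step. It is not taken from the course material: the names Grid, jacobi_step, run_single and run_two_patches, the diffusion coefficient c, and the fixed-boundary setup are all assumptions made for illustration. In a multi-GPU version, each patch would live on its own device and the two halo copies would become peer-to-peer, MPI, or NVSHMEM transfers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Row-major grid with ny rows and nx columns (hypothetical helper type).
struct Grid {
    std::size_t ny, nx;
    std::vector<double> v;
    Grid(std::size_t ny_, std::size_t nx_, double init = 0.0)
        : ny(ny_), nx(nx_), v(ny_ * nx_, init) {}
    double& at(std::size_t y, std::size_t x) { return v[y * nx + x]; }
    double at(std::size_t y, std::size_t x) const { return v[y * nx + x]; }
};

// One explicit diffusion step on all interior points; the outermost
// rows and columns of `out` are left untouched (fixed boundary or halo).
void jacobi_step(const Grid& in, Grid& out, double c) {
    for (std::size_t y = 1; y + 1 < in.ny; ++y)
        for (std::size_t x = 1; x + 1 < in.nx; ++x)
            out.at(y, x) = in.at(y, x) + c * (in.at(y - 1, x) + in.at(y + 1, x) +
                                              in.at(y, x - 1) + in.at(y, x + 1) -
                                              4.0 * in.at(y, x));
}

// Copy one full row from src (row sy) to dst (row dy).
void copy_row(const Grid& src, std::size_t sy, Grid& dst, std::size_t dy) {
    std::copy_n(src.v.data() + sy * src.nx, src.nx, dst.v.data() + dy * dst.nx);
}

// Reference: the whole domain in one piece, with ping-pong buffers.
Grid run_single(Grid u, int steps, double c) {
    Grid next = u;
    for (int s = 0; s < steps; ++s) {
        jacobi_step(u, next, c);
        std::swap(u, next);
    }
    return u;
}

// Same computation on two row patches with one halo row each.
Grid run_two_patches(const Grid& init, int steps, double c) {
    assert(init.ny >= 4);               // each patch must own at least two rows
    const std::size_t h = init.ny / 2;  // patch a owns rows [0, h), patch b owns [h, ny)
    Grid a(h + 1, init.nx);             // owned rows plus halo row at the bottom
    Grid b(init.ny - h + 1, init.nx);   // halo row at the top plus owned rows
    for (std::size_t y = 0; y < a.ny; ++y) copy_row(init, y, a, y);
    for (std::size_t y = 0; y < b.ny; ++y) copy_row(init, h - 1 + y, b, y);
    Grid a_next = a, b_next = b;
    for (int s = 0; s < steps; ++s) {
        jacobi_step(a, a_next, c);
        jacobi_step(b, b_next, c);
        copy_row(b_next, 1, a_next, h);      // global row h     -> halo of patch a
        copy_row(a_next, h - 1, b_next, 0);  // global row h - 1 -> halo of patch b
        std::swap(a, a_next);
        std::swap(b, b_next);
    }
    Grid out(init.ny, init.nx);              // gather owned rows only
    for (std::size_t y = 0; y < h; ++y) copy_row(a, y, out, y);
    for (std::size_t y = h; y < init.ny; ++y) copy_row(b, y - (h - 1), out, y);
    return out;
}

// Largest absolute difference between two grids of equal shape.
double max_abs_diff(const Grid& p, const Grid& q) {
    assert(p.ny == q.ny && p.nx == q.nx);
    double m = 0.0;
    for (std::size_t i = 0; i < p.v.size(); ++i)
        m = std::max(m, std::fabs(p.v[i] - q.v[i]));
    return m;
}
```

Comparing run_single and run_two_patches on the same initial grid with max_abs_diff is a simple correctness check for the decomposition: both paths apply the same update to every interior cell, so the results should agree. The halo exchange is the only point where the patches communicate, which is also where copy/compute overlap becomes relevant once the patches sit on different devices.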

This course was developed to replace two formerly separate NVIDIA DLI courses, Accelerating CUDA C++ Applications with Multiple GPUs and Scaling CUDA C++ Applications to Multiple Nodes, which were first put on hold and then discontinued in 2025 and 2026.

Prerequisites

Knowledge

Experience with CUDA C++ GPU programming, including memory allocation, kernel launches, grid-stride loops, and error handling (equivalent to the Introduction to CUDA C/C++ course)
Familiarity with the Linux command line as well as compiling and running CUDA applications

Technical

A modern web browser (for JupyterHub access to NHR@FAU's HPC clusters)
A local installation of NVIDIA Nsight Systems

Course Structure

Motivation and the running example: a 2D heat-diffusion stencil scaled throughout the course
CPU baseline and single-GPU port: managed memory, 2D execution configuration, and prefetching
Algorithmic work partitioning: decomposing the domain into patches for multi-GPU execution
CUDA streams: concurrent per-patch execution and Nsight Systems timeline analysis
Multi-GPU within a node: device management, per-patch allocations, and halo exchange
Direct inter-GPU communication: unified virtual addressing and peer-to-peer transfers
Overlapping communication and computation with multiple streams
Multi-node parallelism with MPI: rank-to-GPU mapping, CUDA-aware MPI, and GPUDirect RDMA
NVSHMEM: the symmetric-memory model and GPU-initiated one-sided communication
Outlook: hierarchical reductions (CUB, NCCL), multi-dimensional domain decomposition, and parallel I/O
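
As an illustration of the rank-to-GPU mapping and 1D domain decomposition topics listed above, the following C++ sketch shows the index arithmetic only. It makes no MPI or CUDA calls, and the names device_for_local_rank, rows_for_rank and RowRange are hypothetical. It assumes a round-robin assignment of node-local ranks to devices and a row split in which the first ranks absorb the remainder; in a real application the node-local rank would typically come from a communicator created with MPI_Comm_split_type and MPI_COMM_TYPE_SHARED, and the resulting device index would be passed to cudaSetDevice.

```cpp
#include <cassert>
#include <cstddef>

// Half-open row interval [begin, end) owned by one rank (hypothetical type).
struct RowRange {
    std::size_t begin, end;
};

// Round-robin mapping of a node-local rank to a device index.
// Assumption: every node exposes the same number of devices.
int device_for_local_rank(int local_rank, int devices_per_node) {
    assert(local_rank >= 0 && devices_per_node > 0);
    return local_rank % devices_per_node;
}

// Split `rows` rows across `ranks` ranks as evenly as possible;
// the first (rows % ranks) ranks receive one extra row.
RowRange rows_for_rank(std::size_t rows, int rank, int ranks) {
    assert(ranks > 0 && rank >= 0 && rank < ranks);
    const std::size_t r = static_cast<std::size_t>(rank);
    const std::size_t n = static_cast<std::size_t>(ranks);
    const std::size_t base = rows / n;
    const std::size_t extra = rows % n;
    const std::size_t begin = r * base + (r < extra ? r : extra);
    const std::size_t len = base + (r < extra ? 1 : 0);
    return {begin, begin + len};
}
```

For example, 10 rows on 4 ranks yield the ranges [0, 3), [3, 6), [6, 8) and [8, 10); each rank would then allocate its owned rows plus halo rows for its neighbours.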

Learning Outcomes

After completing this course, you will be able to:

Port a CPU application to a single GPU using CUDA managed memory and prefetching
Use concurrent CUDA streams to overlap memory transfers with GPU computation
Scale CUDA C++ workloads across multiple GPUs within a single compute node
Enable and exploit direct peer-to-peer GPU memory access for efficient intra-node communication
Write portable, scalable SPMD code using CUDA-aware MPI with inter-node GPU communication
Apply NVSHMEM for GPU-initiated data transfers using the symmetric memory model
Implement domain decomposition and halo exchange patterns for distributed GPU workloads
Profile multi-GPU execution and identify performance bottlenecks with NVIDIA Nsight Systems

Registration, Wait List and Withdrawal Policy

Registration

Please register at the bottom of this page. Registration is open until a few days before the course starts, or until the course is fully booked.

Prices and Eligibility

This course is open and free of charge for participants affiliated with academic institutions in European Union (EU) member states and Horizon 2020-associated countries.

Wait List

If the course reaches its maximum capacity, you can request to join the wait list by sending an email to nhr-training@fau.de. Please include your name and university affiliation in the message.

Withdrawal Policy

Please only register if you are committed to attending the course. No-shows will be blacklisted and excluded from future events.

If you need to withdraw your registration, please either cancel it directly through the registration system or send an email to nhr-training@fau.de.

Additional Courses

You can find an up-to-date list of all courses offered by NHR@FAU at https://hpc.fau.de/teaching/tutorials-and-courses/.
