Since 2024, NVIDIA’s driver ecosystem has undergone significant restructuring.
Starting with r560, open GPU kernel modules became the default, accompanied by the introduction of the nvidia-driver-assistant utility to simplify automatic setup.
This post is not an installation guide, but a personal attempt to organize my understanding of the driver structure, the differences between the open/server/proprietary driver variants, and the scenarios in which Fabric Manager is required. If you notice any inaccuracies or have suggestions for improvement, I would greatly appreciate your feedback.
Driver Architecture Overview #
As of r580, the NVIDIA driver stack can be viewed as consisting of two kernel-space implementations and one user-space component.
| Category | Component | Description |
|---|---|---|
| Kernel-space (implementation) | Open GPU Kernel Modules | MIT/GPLv2 dual-licensed modules for Turing and later GPUs. |
| Kernel-space (implementation) | Proprietary GPU Kernel Modules | Closed-source implementation for older architectures (Maxwell–Volta). |
| User-space | CUDA Driver / Runtime | Proprietary user-space binaries providing CUDA APIs and GPU management. |
The following points summarize the key aspects of this structure:
- Depending on the GPU generation and packaging configuration, kernel modules may use either the open or proprietary implementation.
- The CUDA driver and runtime remain proprietary in either case: they are user-space libraries that implement the CUDA APIs, and only the kernel-space layer has been open-sourced.
- The nvidia-driver-server package belongs to the datacenter/server driver stream and is commonly used on NVSwitch-based systems (e.g., HGX/DGX).
- NVSwitch systems require Fabric Manager and NSCQ components to initialize and manage the GPU fabric topology.
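To see which of the two kernel-space implementations a machine actually has, one option is to look at the license string of the installed nvidia module. The sketch below assumes that the open modules report "Dual MIT/GPL" and the proprietary ones report "NVIDIA". The helper name module_flavor is my own, not part of any NVIDIA tool.

```shell
#!/bin/sh
# Classify an NVIDIA kernel module from the license string modinfo reports.
# Assumption: open modules report "Dual MIT/GPL", proprietary ones "NVIDIA".
module_flavor() {
    case "$1" in
        *MIT*|*GPL*) echo "open" ;;
        NVIDIA*)     echo "proprietary" ;;
        "")          echo "not-installed" ;;
        *)           echo "unknown" ;;
    esac
}

# Query the installed module; an empty result means no nvidia module was found.
license=$(modinfo -F license nvidia 2>/dev/null || true)
echo "kernel module flavor: $(module_flavor "$license")"
```

On a machine without the driver this prints "not-installed" rather than failing, which makes it usable in provisioning checks.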
Comparing Open, Server, and Proprietary Packages #
| Package | Purpose | Notes |
|---|---|---|
| nvidia-open | Open GPU kernel module stack | Recommended for Turing and later GPUs |
| nvidia-driver | Standard NVIDIA driver stack | May use open or proprietary modules depending on distro/repository packaging |
| nvidia-driver-server | Datacenter/server driver stream | Commonly used for HGX/NVSwitch systems |
| cuda-toolkit | CUDA SDK/toolchain | Includes nvcc, CUDA libraries, and development tools |
| cuda-runtime | CUDA runtime libraries | Runtime-only userspace components |
| cuda-drivers | Convenience meta-package for NVIDIA drivers | May replace/conflict with open-driver packages on Ubuntu |
CUDA Userspace Components Remain Proprietary #
Even though GPU kernel modules have been open-sourced, user-space components such as libcuda.so and libnvidia-ml.so remain proprietary.
Open GPU Kernel Modules replace only the kernel-space GPU driver layer.
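One way to make this split visible is to list the user-space driver libraries the dynamic linker knows about, independently of which kernel module flavor is loaded. This is a minimal sketch that assumes the usual `ldconfig -p` output format ("name (arch) => path"). The function name userspace_driver_libs is hypothetical.

```shell
#!/bin/sh
# Read `ldconfig -p` style lines on stdin and print the resolved paths of the
# proprietary user-space driver libraries named in this post.
userspace_driver_libs() {
    awk '/libcuda\.so|libnvidia-ml\.so/ { print $NF }'
}

# Prints nothing when the user-space driver libraries are not installed.
ldconfig -p 2>/dev/null | userspace_driver_libs
```

These libraries ship with the driver packages regardless of whether the open or the proprietary kernel modules are installed.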
Fabric Manager: When and Why #
Fabric Manager is required only on NVSwitch-based configurations (e.g., HGX, DGX), where it monitors the NVSwitch interconnects and manages the GPU fabric topology. On every other system it can be left out.
In practice, NVSwitch platforms typically use the nvidia-driver-server driver stream, and Fabric Manager and the NSCQ components are installed alongside it. Using Fabric Manager with nvidia-open or the standard nvidia-driver stream is not supported.
For PCIe GPUs (e.g., H100 PCIe, L40S), Fabric Manager is not required.
NVLink vs NVSwitch #
| Type | Name | Function |
|---|---|---|
| Link-level | NVLink | Point-to-point GPU interconnect (used in 2–4 GPU setups). |
| Fabric-level | NVSwitch | Crossbar switch connecting all GPUs in 8+ GPU systems. |
NVLink-only systems typically do not require Fabric Manager. NVSwitch systems require it for fabric initialization and management.
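To decide whether a given host falls into the NVSwitch category, a rough check is to look for NVSwitch devices on the PCI bus. The sketch below assumes that NVSwitch chips show up in `lspci -nn` as bridge-class devices ([0680]) with NVIDIA's vendor ID (10de). That held on the HGX systems I have looked at but may not hold on every platform generation. The function name has_nvswitch is my own.

```shell
#!/bin/sh
# Read `lspci -nn` style lines on stdin; succeed if any line is both a
# bridge-class [0680] device and carries the NVIDIA vendor ID 10de.
# Assumption: this is how NVSwitch devices are enumerated on the host.
has_nvswitch() {
    grep -i '\[0680\]' | grep -qi '\[10de:'
}

if command -v lspci >/dev/null 2>&1 && lspci -nn | has_nvswitch; then
    echo "NVSwitch detected: Fabric Manager is required"
else
    echo "No NVSwitch detected: Fabric Manager is not needed"
fi
```

If the check is inconclusive, the platform documentation for the server is the more reliable source.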
nvidia-driver-assistant Utility #
nvidia-driver-assistant, introduced alongside the r560 move to open modules as the default, detects the installed GPU and OS version and installs the matching driver variant.
sudo apt install nvidia-driver-assistant
nvidia-driver-assistant --install
# To explicitly choose proprietary modules:
sudo nvidia-driver-assistant --install --module-flavor closed
Decision Rules #
NVSwitch-based systems (HGX / DGX) #
Use nvidia-driver-server together with nvidia-fabricmanager. Install cuda-toolkit separately if CUDA development or ML workloads are required.
- Required for HGX and DGX platforms with NVSwitch interconnects.
- nvidia-driver-server relies on proprietary kernel modules and includes the NSCQ components needed for NVSwitch.
- Fabric Manager must be installed and active for NVSwitch initialization.
Non-NVSwitch GPU servers (standard ML compute nodes) #
Use nvidia-open (or nvidia-driver) for the kernel-space driver, and install cuda-toolkit separately if CUDA development is required.
Avoid the generic cuda-drivers meta-package if you want to preserve the open GPU kernel module stack on Ubuntu.
- A kernel-space driver (nvidia-open or nvidia-driver) is required to register GPUs and enable CUDA communication.
- cuda-toolkit provides CUDA compilers, libraries, and development tools.
- cuda-runtime provides runtime-only CUDA userspace libraries.
- Fabric Manager is not required on PCIe-based systems.
Containerized compute environments (Docker / Kubernetes / Slurm) #
Host system: nvidia-open (or nvidia-driver)
Optional on host: cuda-toolkit
Containers: cuda-runtime
- The host provides the NVIDIA kernel driver and /dev/nvidia* devices.
- Containers typically require only CUDA userspace libraries.
- NVIDIA kernel modules are not loaded inside containers.
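The three rules above can be condensed into a small lookup, which is how I tend to encode them in provisioning scripts. This is only a sketch: the platform labels (nvswitch, pcie, container) and the function name recommend_packages are hypothetical, and real package names are often version-suffixed depending on the repository.

```shell
#!/bin/sh
# Map a platform type to the package set suggested in this post.
# Labels and function name are illustrative, not an NVIDIA convention.
recommend_packages() {
    case "$1" in
        nvswitch)  echo "nvidia-driver-server nvidia-fabricmanager" ;;
        pcie)      echo "nvidia-open" ;;
        container) echo "cuda-runtime" ;;
        *)         echo "unknown platform: $1" >&2; return 1 ;;
    esac
}

# Print the suggestion for each platform type.
for platform in nvswitch pcie container; do
    printf '%-10s -> %s\n' "$platform" "$(recommend_packages "$platform")"
done
```

cuda-toolkit is deliberately left out of the lookup because it is optional in all three cases and only needed where CUDA development happens.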
Troubleshooting Notes #
CUDA runtime error (802): system not yet initialized #
This error may appear on NVSwitch-based systems when the Fabric Manager service is not installed or not running.
To resolve this issue, make sure that Fabric Manager is installed and that its service is active. It is required for initializing and managing NVSwitch interconnects.
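A quick first check is the state of the Fabric Manager systemd unit. The sketch below assumes the unit is named nvidia-fabricmanager, as in the package mentioned earlier. The helper fabricmanager_hint is my own and simply turns the reported state into a next step.

```shell
#!/bin/sh
# Turn the state reported by `systemctl is-active` into a short hint.
fabricmanager_hint() {
    case "$1" in
        active)
            echo "running" ;;
        activating)
            echo "starting; retry shortly" ;;
        *)
            echo "not running; install it if needed, then: sudo systemctl enable --now nvidia-fabricmanager" ;;
    esac
}

# An empty state means systemctl is unavailable or the query failed.
state=$(systemctl is-active nvidia-fabricmanager 2>/dev/null || true)
echo "nvidia-fabricmanager: $(fabricmanager_hint "$state")"
```

If the service is already running and error 802 persists, the cause is likely elsewhere and the Fabric Manager logs are the next place to look.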
References #
For detailed installation steps and the latest package commands, refer to the official NVIDIA documentation: Driver Installation Guide