Since 2024, NVIDIA’s driver ecosystem has undergone significant restructuring.
Starting with r560, the open GPU kernel modules became the default, and with r580 the nvidia-driver-assistant utility was introduced to simplify automated setup.
This post is not an installation guide, but a personal attempt to organize my understanding of the driver structure, the differences between the open/server/proprietary driver variants, and the scenarios in which Fabric Manager is required. If you notice any inaccuracies or have suggestions for improvement, I would greatly appreciate your feedback.
Driver Architecture Overview #
As of r580, the NVIDIA driver stack can be viewed as consisting of two kernel-space implementations and one user-space component.
| Category | Component | Description |
|---|---|---|
| Kernel-space (implementation) | Open GPU Kernel Modules | MIT/GPLv2 dual-licensed modules for Turing and later GPUs. |
| Kernel-space (implementation) | Proprietary GPU Kernel Modules | Closed-source implementation for older architectures (Maxwell–Volta). |
| User-space | CUDA Driver / Runtime | Proprietary user-space binaries providing CUDA APIs and GPU management. |
The following points summarize the key aspects of this structure:
- Depending on the GPU generation, kernel modules may use either the open or proprietary implementation.
- The CUDA driver and runtime, which provide the user-space CUDA APIs, remain proprietary.
- The nvidia-driver-server package is intended for NVSwitch-based systems, where Fabric Manager and NSCQ components must also be installed. These components rely on proprietary kernel modules.
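On an existing system, the kernel module's license string is a quick way to tell which kernel-space implementation is actually loaded. This is a minimal check, and the exact strings can differ between driver releases:

```bash
# Open GPU kernel modules report a dual MIT/GPL license,
# while the proprietary modules report "NVIDIA".
modinfo -F license nvidia

# The loaded module version and build are also visible here:
cat /proc/driver/nvidia/version
```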
Comparing Open, Server, and Proprietary Packages #
| Package | Kernel Module Type | Intended Use | Notes |
|---|---|---|---|
| nvidia-open | open | Standard systems (Turing+) | Uses open GPU kernel modules |
| nvidia-driver | open (default) / proprietary (optional) | General-purpose systems | `--module-flavor` to choose module variant |
| nvidia-driver-server | proprietary | NVSwitch / HGX systems | Requires Fabric Manager / NSCQ |
| cuda-drivers | proprietary | CUDA development / execution stacks | Driver + CUDA user-space libraries |
| cuda-runtime | proprietary | Compute-only / headless nodes | User-space CUDA runtime only (no graphics) |
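As a rough sketch of what installation looks like on Ubuntu with the CUDA network repository configured (metapackage names vary by distribution and driver branch, so treat these as illustrative):

```bash
# Standard Turing+ workstation or PCIe GPU server: open kernel modules
sudo apt install nvidia-open

# NVSwitch-based HGX/DGX system: server driver stream plus Fabric Manager
sudo apt install nvidia-driver-server nvidia-fabricmanager

# User-space CUDA libraries for compute workloads
sudo apt install cuda-drivers
```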
CUDA Driver Is Proprietary #
The CUDA driver (distributed via the cuda-drivers package) remains a closed-source component.
Even though the kernel modules have been open-sourced, the user-space binaries (e.g., libcuda.so, libnvidia-ml.so) are still proprietary.
The open GPU kernel modules replace only the kernel-side layer.
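A quick way to confirm this on an installed system is to look for the user-space libraries themselves; they come from the proprietary driver packages regardless of which kernel module flavor is in use:

```bash
# Both libraries are proprietary user-space components.
ldconfig -p | grep -E 'libcuda\.so|libnvidia-ml\.so'
```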
Fabric Manager: When and Why #
Fabric Manager is an optional component required only for NVSwitch-based configurations (e.g., HGX, DGX). It initializes the NVSwitch fabric, monitors the interconnects, and manages the GPU fabric topology.
NVSwitch-based systems must use the nvidia-driver-server package, which provides the proprietary kernel module stack needed for NVSwitch operation. Fabric Manager and the NSCQ components are installed alongside this server-driver stream; using Fabric Manager with nvidia-open or the standard nvidia-driver stream is not supported.
For PCIe GPUs (e.g., H100 PCIe, L40S), Fabric Manager is not required.
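On NVSwitch systems, Fabric Manager runs as a systemd service that has to be enabled alongside the server driver stream. A minimal sketch, assuming the nvidia-fabricmanager unit name used by NVIDIA's packages:

```bash
# Enable and start the Fabric Manager service, then confirm it is running.
sudo systemctl enable --now nvidia-fabricmanager
systemctl is-active nvidia-fabricmanager   # prints "active" once the fabric is initialized
```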
NVLink vs NVSwitch #
| Type | Name | Function |
|---|---|---|
| Link-level | NVLink | Point-to-point GPU interconnect (used in 2–4 GPU setups). |
| Fabric-level | NVSwitch | Crossbar switch connecting all GPUs in 8+ GPU systems. |
NVLink-only systems do not require Fabric Manager. NVSwitch systems must use it.
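If you are unsure which category a machine falls into, the topology report makes it visible. The exact output depends on the platform, so treat this as a rough check:

```bash
# "NV#" entries between GPU pairs indicate NVLink connections;
# NVSwitch systems additionally expose /dev/nvidia-nvswitch* device nodes.
nvidia-smi topo -m
ls /dev/nvidia-nvswitch* 2>/dev/null || echo "no NVSwitch devices found"
```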
nvidia-driver-assistant Utility #
Introduced in r580, nvidia-driver-assistant automatically detects GPU and OS versions, and installs the correct driver variant.
```bash
sudo apt install nvidia-driver-assistant
nvidia-driver-assistant --install

# To explicitly choose proprietary modules:
sudo nvidia-driver-assistant --install --module-flavor closed
```
Decision Rules #
NVSwitch-based systems (HGX / DGX) #
Use nvidia-driver-server together with nvidia-fabricmanager, plus cuda-drivers for ML workloads.
- Required for HGX and DGX platforms with NVSwitch interconnects.
- `nvidia-driver-server` relies on proprietary kernel modules and includes the NSCQ components needed for NVSwitch.
- Fabric Manager must be installed and active for NVSwitch initialization.
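A short sanity check after installing this stack (assuming the unit and library names shipped by NVIDIA's packages):

```bash
modinfo -F license nvidia                  # proprietary modules report "NVIDIA"
systemctl is-active nvidia-fabricmanager   # Fabric Manager must be running
ldconfig -p | grep -i nscq                 # NSCQ user-space library used to query NVSwitch
```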
Non-NVSwitch GPU servers (standard ML compute nodes) #
Use nvidia-driver (or nvidia-open) together with cuda-drivers.
- `cuda-drivers` provides the user-space CUDA libraries required for training and inference.
- A kernel-space driver (`nvidia-driver` or `nvidia-open`) is required to register GPUs and enable CUDA communication.
- Fabric Manager is not required on PCIe-based systems.
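Once the driver and cuda-drivers are in place, a single nvidia-smi query confirms that the kernel-space module and the user-space NVML library can talk to each other:

```bash
# Lists each GPU with the driver version reported by the loaded kernel module.
nvidia-smi --query-gpu=name,driver_version --format=csv
```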
Containerized compute environments (Docker / Kubernetes / Slurm) #
Host system: nvidia-driver (or nvidia-open) + cuda-drivers
Containers: cuda-runtime or cuda-drivers
- The host already provides the `/dev/nvidia*` devices and the kernel-space driver.
- Containers require only the user-space CUDA libraries; no kernel-space modules run inside containers.
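For example, with Docker and the NVIDIA Container Toolkit configured on the host, a container built on a CUDA base image needs no kernel-space components of its own (the image tag below is only illustrative):

```bash
# The toolkit injects the host's /dev/nvidia* devices and user-space driver libraries;
# the image itself only carries the CUDA user-space stack.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```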
Troubleshooting Notes #
CUDA runtime error (802): system not yet initialized #
This error may appear on NVSwitch-based systems when the Fabric Manager service is not installed or not running.
To resolve this issue, make sure that Fabric Manager is installed and active. It is required for initializing and managing NVSwitch interconnects.
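When I hit this, the first things I check are the service state and the fabric status that nvidia-smi reports on NVSwitch systems (field names may differ slightly across driver versions):

```bash
systemctl status nvidia-fabricmanager                       # should be active (running)
journalctl -u nvidia-fabricmanager --no-pager | tail -n 20  # recent Fabric Manager logs
nvidia-smi -q | grep -iA 3 'Fabric'                         # per-GPU fabric state/status
```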
References #
For detailed installation steps and the latest package commands, refer to the official NVIDIA documentation: Driver Installation Guide