Understanding NVIDIA GPU Driver Architecture and Server Configuration

Since 2024, NVIDIA’s driver ecosystem has undergone significant restructuring. Starting with r560, open GPU kernel modules became the default, accompanied by the introduction of the nvidia-driver-assistant utility to simplify automatic setup.

This post is not an installation guide, but a personal attempt to organize my understanding of the driver structure, the differences between the open/server/proprietary driver variants, and the scenarios in which Fabric Manager is required. If you notice any inaccuracies or have suggestions for improvement, I would greatly appreciate your feedback.

Driver Architecture Overview

As of r580, the NVIDIA driver stack can be viewed as consisting of two kernel-space implementations and one user-space component.

Category	Component	Description
Kernel-space (implementation)	Open GPU Kernel Modules	MIT/GPLv2 dual-licensed modules for Turing and later GPUs.
Kernel-space (implementation)	Proprietary GPU Kernel Modules	Closed-source implementation for older architectures (Maxwell–Volta).
User-space	CUDA Driver / Runtime	Proprietary user-space binaries providing CUDA APIs and GPU management.

The following points summarize the key aspects of this structure:

Depending on the GPU generation and packaging configuration, kernel modules may use either the open or proprietary implementation.
The CUDA driver and runtime remain proprietary: they are user-space components that provide the CUDA APIs, and they were not part of the kernel-module open-sourcing.
The nvidia-driver-server package belongs to the datacenter/server driver stream and is commonly used on NVSwitch-based systems (e.g., HGX/DGX).
NVSwitch systems require Fabric Manager and NSCQ components to initialize and manage the GPU fabric topology.
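
To see which of the two kernel-space implementations is actually installed on a machine, one option is to read the module's license string. This is a minimal sketch, assuming `modinfo` is available and that the open modules report `Dual MIT/GPL` while the proprietary ones report `NVIDIA`. The function name `flavor_from_license` is my own and not part of any NVIDIA tool.

```shell
# Map a kernel-module license string to a driver flavor.
# Assumption: open modules report "Dual MIT/GPL", proprietary ones "NVIDIA".
flavor_from_license() {
    case "$1" in
        *MIT*|*GPL*) echo "open" ;;
        NVIDIA*)     echo "proprietary" ;;
        "")          echo "not-installed" ;;
        *)           echo "unknown" ;;
    esac
}

# Only query the system when modinfo exists; do nothing otherwise.
if command -v modinfo >/dev/null 2>&1; then
    license=$(modinfo -F license nvidia 2>/dev/null)
    echo "nvidia kernel module flavor: $(flavor_from_license "$license")"
fi
```

On a host where the module is already loaded, `cat /proc/driver/nvidia/version` should give a similar hint, since the open modules identify themselves as an "Open Kernel Module" there.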

Comparing Open, Server, and Proprietary Packages

Package	Purpose	Notes
nvidia-open	Open GPU kernel module stack	Recommended for Turing and later GPUs
nvidia-driver	Standard NVIDIA driver stack	May use open or proprietary modules depending on distro/repository packaging
nvidia-driver-server	Datacenter/server driver stream	Commonly used for HGX/NVSwitch systems
cuda-toolkit	CUDA SDK/toolchain	Includes nvcc, CUDA libraries, and development tools
cuda-runtime	CUDA runtime libraries	Runtime-only userspace components
cuda-drivers	Convenience meta-package for NVIDIA drivers	May replace/conflict with open-driver packages on Ubuntu

CUDA Userspace Components Remain Proprietary

Even though GPU kernel modules have been open-sourced, user-space components such as libcuda.so and libnvidia-ml.so remain proprietary.

Open GPU Kernel Modules replace only the kernel-space GPU driver layer.

Fabric Manager: When and Why

Fabric Manager is required only on NVSwitch-based configurations (e.g., HGX, DGX), where it monitors the NVSwitch interconnects and manages the GPU fabric topology. On every other system it can be left out.

In practice, NVSwitch platforms typically run the nvidia-driver-server driver stream, with Fabric Manager and the NSCQ components installed alongside it. As far as I can tell, pairing Fabric Manager with nvidia-open or the standard nvidia-driver stream is not a supported combination in this packaging. Verify this against the Fabric Manager documentation for your driver branch, particularly for Blackwell-generation GPUs, which are supported only by the open kernel modules.

For PCIe GPUs (e.g., H100 PCIe, L40S), Fabric Manager is not required.

NVLink vs NVSwitch

Type	Name	Function
Link-level	NVLink	Point-to-point GPU interconnect (used in 2–4 GPU setups).
Fabric-level	NVSwitch	Crossbar switch connecting all GPUs in 8+ GPU systems.

NVLink-only systems typically do not require Fabric Manager. NVSwitch systems require it for fabric initialization and management.
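
Whether a given host falls on the NVSwitch side of this split can be estimated from its PCI device list. The sketch below is a heuristic, not an official detection method: it assumes that NVSwitch chips enumerate as NVIDIA devices (vendor ID `10de`) with PCI class `0680` (bridge), and the helper names `count_nvswitch` and `needs_fabric_manager` are hypothetical.

```shell
# Count NVSwitch-like devices in `lspci -nn -d 10de:` output read from stdin.
# Assumption: NVSwitch chips appear as NVIDIA devices of PCI class 0680 (bridge).
count_nvswitch() {
    # grep -c exits non-zero when the count is 0, so neutralize that status.
    grep -c '\[0680\]' || true
}

# Apply the rule from the text: Fabric Manager only when NVSwitch is present.
needs_fabric_manager() {
    if [ "$1" -gt 0 ]; then echo "yes"; else echo "no"; fi
}

# Only inspect the system when lspci exists.
if command -v lspci >/dev/null 2>&1; then
    n=$(lspci -nn -d 10de: 2>/dev/null | count_nvswitch)
    echo "NVSwitch devices: $n, Fabric Manager needed: $(needs_fabric_manager "$n")"
fi
```

If the heuristic and the hardware documentation disagree, trust the documentation for the platform.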

nvidia-driver-assistant Utility

Introduced alongside the r560 release, nvidia-driver-assistant detects the GPU and OS version and installs the matching driver variant.

bash

sudo apt install nvidia-driver-assistant
sudo nvidia-driver-assistant --install
# To explicitly choose proprietary modules:
sudo nvidia-driver-assistant --install --module-flavor closed

Decision Rules

NVSwitch-based systems (HGX / DGX)

Use nvidia-driver-server together with nvidia-fabricmanager. Install cuda-toolkit separately if CUDA development or ML workloads are required.

Required for HGX and DGX platforms with NVSwitch interconnects.
nvidia-driver-server typically relies on proprietary kernel modules and is paired with the NSCQ components needed for NVSwitch.
Fabric Manager must be installed and active for NVSwitch initialization.
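
One practical detail, as I understand NVIDIA's Fabric Manager documentation: the Fabric Manager version has to match the installed driver version. The sketch below compares two version strings. The helper `same_release` is hypothetical, and the idea that a package version carries a `-<revision>` suffix is an assumption about distro packaging. How to obtain the Fabric Manager version depends on the package manager, so it is left to the caller.

```shell
# Compare a driver version with a Fabric Manager package version.
# Assumption: the package version may carry a "-<revision>" suffix, which is ignored.
same_release() {
    driver=$1
    fm=${2%%-*}
    [ -n "$driver" ] && [ "$driver" = "$fm" ]
}

# Typical service commands on an NVSwitch host (commented out: they need root):
#   sudo systemctl enable --now nvidia-fabricmanager
#   systemctl is-active nvidia-fabricmanager

# Print the driver version when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
    echo "driver version: ${drv:-unknown}"
fi
```

A caller would then run something like `same_release "$drv" "$fm_version" || echo "version mismatch"`, where `fm_version` is whatever the package manager reports for the Fabric Manager package.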

Non-NVSwitch GPU servers (standard ML compute nodes)

Use nvidia-open (or nvidia-driver) for the kernel-space driver, and install cuda-toolkit separately if CUDA development is required.

Avoid the generic cuda-drivers meta-package if you want to preserve the open GPU kernel module stack on Ubuntu.

A kernel-space driver (nvidia-open or nvidia-driver) is required to register the GPUs with the kernel and to let the CUDA user-space libraries communicate with them.
cuda-toolkit provides CUDA compilers, libraries, and development tools.
cuda-runtime provides runtime-only CUDA userspace libraries.
Fabric Manager is not required on PCIe-based systems.

Containerized compute environments (Docker / Kubernetes / Slurm)

Host system: nvidia-open (or nvidia-driver)
Optional on host: cuda-toolkit
Containers: cuda-runtime

The host provides the NVIDIA kernel driver and /dev/nvidia* devices.
Containers typically require only CUDA userspace libraries.
NVIDIA kernel modules are not loaded inside containers.
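
The three decision rules in this section can be condensed into a small lookup. This is only a sketch of the rules as stated in this post: the labels `nvswitch`, `pcie`, `host`, and `container` are hypothetical arguments chosen for the example, and the package names are the generic ones used here, which may differ from the exact names in a given distro repository.

```shell
# Print the packages suggested by the decision rules for a given setup.
# Arguments (hypothetical labels chosen for this sketch):
#   $1 = fabric: "nvswitch" or "pcie"
#   $2 = role:   "host" or "container"
recommend_packages() {
    fabric=$1
    role=$2
    if [ "$role" = "container" ]; then
        # Containers reuse the host's kernel driver and need only user-space libraries.
        echo "cuda-runtime"
    elif [ "$fabric" = "nvswitch" ]; then
        echo "nvidia-driver-server nvidia-fabricmanager"
    else
        echo "nvidia-open"
    fi
}

recommend_packages nvswitch host
recommend_packages pcie host
recommend_packages pcie container
```

cuda-toolkit is deliberately left out of the output because the rules treat it as optional in every case.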

Troubleshooting Notes

CUDA runtime error (802): system not yet initialized

This error may appear on NVSwitch-based systems when the Fabric Manager service is not installed or not running.

To resolve this issue, make sure that Fabric Manager is installed and active. It is required for initializing and managing NVSwitch interconnects.
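
A first check for this error is the state of the Fabric Manager service. The sketch below maps the result of `systemctl is-active` to a suggested next step. It assumes the service is named `nvidia-fabricmanager`, and the helper `advise_for_state` and its messages are my own wording rather than output from any NVIDIA tool.

```shell
# Turn a `systemctl is-active` result into a next step for error 802.
# Note: systemd may also report a unit that is not installed as "inactive".
advise_for_state() {
    case "$1" in
        active)   echo "Fabric Manager is running; check its logs and the driver version match" ;;
        inactive) echo "start it: sudo systemctl enable --now nvidia-fabricmanager" ;;
        failed)   echo "inspect logs: journalctl -u nvidia-fabricmanager" ;;
        *)        echo "Fabric Manager may not be installed" ;;
    esac
}

# Only query systemd when systemctl exists.
if command -v systemctl >/dev/null 2>&1; then
    state=$(systemctl is-active nvidia-fabricmanager 2>/dev/null)
    echo "nvidia-fabricmanager: ${state:-unknown} -> $(advise_for_state "$state")"
fi
```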

References

For detailed installation steps and the latest package commands, refer to the official NVIDIA documentation: Driver Installation Guide
