
Understanding NVIDIA GPU Driver Architecture and Server Configuration

Makoto Morinaga

Since 2024, NVIDIA’s driver ecosystem has undergone significant restructuring. Starting with r560, the open GPU kernel modules became the default, and the nvidia-driver-assistant utility was later introduced (with r580) to simplify automated setup.

This post is not an installation guide, but a personal attempt to organize my understanding of the driver structure, the differences between the open/server/proprietary driver variants, and the scenarios in which Fabric Manager is required. If you notice any inaccuracies or have suggestions for improvement, I would greatly appreciate your feedback.

Driver Architecture Overview

As of r580, the NVIDIA driver stack consists of two kernel-space implementations and a shared user-space component.

| Category | Component | Description |
| --- | --- | --- |
| Kernel-space | Open GPU Kernel Modules | MIT/GPLv2 dual-licensed modules for Turing and later GPUs. |
| Kernel-space | Proprietary GPU Kernel Modules | Closed-source implementation for older architectures (Maxwell–Volta). |
| User-space | CUDA Driver / Runtime | Proprietary user-space binaries providing CUDA APIs and GPU management. |

The following points summarize the key aspects of this structure (a quick way to verify them on a running system is shown after the list):

  • Depending on the GPU generation, kernel modules may use either the open or proprietary implementation.
  • The CUDA driver and runtime remain proprietary because they provide the user-space CUDA APIs.
  • The nvidia-driver-server package is intended for NVSwitch-based systems, where Fabric Manager and NSCQ components must also be installed. These components rely on proprietary kernel modules.
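As a quick check (assuming a driver is already installed), the kernel-module flavor and the user-space driver version can be inspected separately; the exact output strings vary by driver branch:

```bash
# The open kernel modules report a "Dual MIT/GPL" license; the proprietary
# modules report "NVIDIA".
modinfo -F license nvidia

# Once the module is loaded, /proc also names the flavor, e.g.
# "NVIDIA UNIX Open Kernel Module for x86_64 ...".
cat /proc/driver/nvidia/version

# The user-space side (libnvidia-ml.so) is what nvidia-smi talks to.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```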

Comparing Open, Server, and Proprietary Packages

| Package | Kernel Module Type | Intended Use | Notes |
| --- | --- | --- | --- |
| nvidia-open | open | Standard systems (Turing+) | Uses open GPU kernel modules |
| nvidia-driver | open (default) / proprietary (optional) | General-purpose systems | `--module-flavor` to choose module variant |
| nvidia-driver-server | proprietary | NVSwitch / HGX systems | Requires Fabric Manager / NSCQ |
| cuda-drivers | proprietary | CUDA development / execution stacks | Driver + CUDA user-space libraries |
| cuda-runtime | proprietary | Compute-only / headless nodes | User-space CUDA runtime only (no graphics) |
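On an apt-based system, a rough way to see which of these streams is installed (the package name patterns are assumptions and differ between Ubuntu's own packages and NVIDIA's CUDA repository):

```bash
# List installed driver/CUDA metapackages.
dpkg -l | grep -E 'nvidia-open|nvidia-driver|nvidia-fabricmanager|cuda-drivers|cuda-runtime'

# Preview what a metapackage would pull in before committing to it.
apt-cache depends nvidia-open
```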

CUDA Driver Is Proprietary

The CUDA driver remains a closed-source component. Even though the kernel modules have been open-sourced, the user-space binaries (e.g., libcuda.so, libnvidia-ml.so) are still proprietary. The open GPU kernel modules replace only the kernel-side layer.
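This split is easy to see on an installed system; the user-space libraries are ordinary closed-source shared objects registered with the dynamic linker:

```bash
# Locate the proprietary user-space libraries.
ldconfig -p | grep -E 'libcuda\.so|libnvidia-ml\.so'

# On Debian/Ubuntu, show which package ships libcuda.so (the package name
# varies by driver branch).
dpkg -S libcuda.so | head -n 3
```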

Fabric Manager: When and Why

Fabric Manager is an optional component required only for NVSwitch-based configurations (e.g., HGX, DGX). It monitors NVSwitch interconnects and manages GPU fabric topology.

Such systems must use the nvidia-driver-server package, which provides the proprietary kernel module stack needed for NVSwitch operation. Fabric Manager and the NSCQ components are installed alongside this server-driver stream; using Fabric Manager with nvidia-open or the standard nvidia-driver stream is not supported.

For PCIe GPUs (e.g., H100 PCIe, L40S), Fabric Manager is not required.
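A simple heuristic for checking whether a machine actually has NVSwitch hardware (assuming the kernel driver is already loaded) is to look for the NVSwitch device nodes the driver creates:

```bash
# NVSwitch device nodes exist only on NVSwitch baseboards (HGX/DGX class).
ls /dev/nvidia-nvswitch* 2>/dev/null || echo "no NVSwitch devices: Fabric Manager not needed"
```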

NVLink vs NVSwitch

| Type | Name | Function |
| --- | --- | --- |
| Link-level | NVLink | Point-to-point GPU interconnect (used in 2–4 GPU setups). |
| Fabric-level | NVSwitch | Crossbar switch connecting all GPUs in 8+ GPU systems. |

NVLink-only systems do not require Fabric Manager. NVSwitch systems must use it.
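The interconnect layout can be inspected with nvidia-smi: on NVSwitch systems the topology matrix shows NVLink connectivity between every GPU pair, while NVLink-only systems show it just for directly bridged pairs.

```bash
# GPU-to-GPU connectivity matrix (NV#, PIX, PHB, SYS, ...).
nvidia-smi topo -m

# Per-link NVLink status for GPU 0.
nvidia-smi nvlink --status -i 0
```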

nvidia-driver-assistant Utility

Introduced in r580, nvidia-driver-assistant automatically detects the GPU and OS version and installs the correct driver variant.

```bash
sudo apt install nvidia-driver-assistant
sudo nvidia-driver-assistant --install

# To explicitly choose the proprietary modules:
sudo nvidia-driver-assistant --install --module-flavor closed
```

Decision Rules

NVSwitch-based systems (HGX / DGX)

Use nvidia-driver-server together with nvidia-fabricmanager, plus cuda-drivers for ML workloads.

  • Required for HGX and DGX platforms with NVSwitch interconnects.
  • nvidia-driver-server relies on proprietary kernel modules and includes the NSCQ components needed for NVSwitch.
  • Fabric Manager must be installed and active for NVSwitch initialization.
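A minimal sketch of that setup on an apt-based system; the versioned package names below (580 branch) are assumptions, so check the installation guide for the names that match your distribution and driver branch:

```bash
# Server driver stream plus Fabric Manager (package names are illustrative).
sudo apt install nvidia-driver-580-server nvidia-fabricmanager-580

# Fabric Manager must be enabled and running before CUDA workloads start.
sudo systemctl enable --now nvidia-fabricmanager
systemctl is-active nvidia-fabricmanager
```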

Non-NVSwitch GPU servers (standard ML compute nodes)

Use nvidia-driver (or nvidia-open) together with cuda-drivers.

  • cuda-drivers provides the user-space CUDA libraries required for training and inference.
  • A kernel-space driver (nvidia-driver or nvidia-open) is required to register GPUs and enable CUDA communication.
  • Fabric Manager is not required on PCIe-based systems.
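A sketch of the equivalent install on an apt-based system with NVIDIA's CUDA repository configured (metapackage names as used in NVIDIA's documentation; adjust if your distribution packages them differently):

```bash
# Open kernel modules plus the user-space CUDA driver libraries.
sudo apt install nvidia-open cuda-drivers

# After a reboot (or module load), confirm the GPUs are visible.
nvidia-smi
```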

Containerized compute environments (Docker / Kubernetes / Slurm)

Host system: nvidia-driver (or nvidia-open) + cuda-drivers
Containers: cuda-runtime or cuda-drivers

  • The host already provides /dev/nvidia* devices and the kernel-space driver.
  • Containers require only the user-space CUDA libraries; no kernel-space modules run inside containers.
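For example, with Docker and the NVIDIA Container Toolkit already set up on the host (the image tag below is illustrative), a runtime-only CUDA image uses the host's driver directly:

```bash
# The container brings only user-space CUDA libraries; the kernel module and
# /dev/nvidia* devices come from the host via the container toolkit.
docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi
```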

Troubleshooting Notes

CUDA runtime error (802): system not yet initialized

This error typically appears on NVSwitch-based systems when the Fabric Manager service is not installed or not running.

To resolve this issue, make sure that Fabric Manager is installed and active. It is required for initializing and managing NVSwitch interconnects.
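On such systems, checking the service state and its logs is usually enough to confirm the cause (service name as shipped by the nvidia-fabricmanager package):

```bash
# Is the Fabric Manager service present and running?
systemctl status nvidia-fabricmanager --no-pager

# Look for NVSwitch initialization errors in its logs.
journalctl -u nvidia-fabricmanager --no-pager | tail -n 50

# Restart it if it is installed but inactive.
sudo systemctl restart nvidia-fabricmanager
```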

References

For detailed installation steps and the latest package commands, refer to the official NVIDIA documentation: Driver Installation Guide