
Understanding NVIDIA GPU Driver Architecture and Server Configuration

Makoto Morinaga

Since 2024, NVIDIA’s driver ecosystem has undergone significant restructuring. Starting with r560, the open GPU kernel modules became the default, and the nvidia-driver-assistant utility was later introduced (with r580) to simplify automated setup.

This post is not an installation guide, but a personal attempt to organize my understanding of the driver structure, the differences between the open/server/proprietary driver variants, and the scenarios in which Fabric Manager is required. If you notice any inaccuracies or have suggestions for improvement, I would greatly appreciate your feedback.

Driver Architecture Overview

As of r580, the NVIDIA driver stack consists of two kernel-space implementations and a shared user-space component.

| Category | Component | Description |
| --- | --- | --- |
| Kernel-space | Open GPU Kernel Modules | MIT/GPLv2 dual-licensed modules for Turing and later GPUs. |
| Kernel-space | Proprietary GPU Kernel Modules | Closed-source implementation for older architectures (Maxwell–Volta). |
| User-space | CUDA Driver / Runtime | Proprietary user-space binaries providing CUDA APIs and GPU management. |

The following points summarize the key aspects of this structure (a quick way to verify them on a running system is shown after the list):

  • Depending on the GPU generation, kernel modules may use either the open or proprietary implementation.
  • The CUDA driver and runtime remain proprietary because they provide the user-space CUDA APIs.
  • The nvidia-driver-server package is intended for NVSwitch-based systems, where Fabric Manager and NSCQ components must also be installed. These components rely on proprietary kernel modules.
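As a quick check (assuming a driver is already installed), the kernel-module flavor and the user-space driver version can be inspected separately; the exact output strings vary by driver branch:

```bash
# The open kernel modules report a "Dual MIT/GPL" license; the proprietary
# modules report "NVIDIA".
modinfo -F license nvidia

# Once the module is loaded, /proc also names the flavor, e.g.
# "NVIDIA UNIX Open Kernel Module for x86_64 ...".
cat /proc/driver/nvidia/version

# The user-space side (libnvidia-ml.so) is what nvidia-smi talks to.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```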

Comparing Open, Server, and Proprietary Packages

| Package | Kernel Module Type | Intended Use | Notes |
| --- | --- | --- | --- |
| nvidia-open | open | Standard systems (Turing+) | Uses open GPU kernel modules |
| nvidia-driver | open (default) / proprietary (optional) | General-purpose systems | `--module-flavor` to choose module variant |
| nvidia-driver-server | proprietary | NVSwitch / HGX systems | Requires Fabric Manager / NSCQ |
| cuda-drivers | proprietary | CUDA development / execution stacks | Driver + CUDA user-space libraries |
| cuda-runtime | proprietary | Compute-only / headless nodes | User-space CUDA runtime only (no graphics) |
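On an apt-based system, a rough way to see which of these streams is installed (the package name patterns are assumptions and differ between Ubuntu's own packages and NVIDIA's CUDA repository):

```bash
# List installed driver/CUDA metapackages.
dpkg -l | grep -E 'nvidia-open|nvidia-driver|nvidia-fabricmanager|cuda-drivers|cuda-runtime'

# Preview what a metapackage would pull in before committing to it.
apt-cache depends nvidia-open
```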

CUDA Driver Is Proprietary

The CUDA driver remains a closed-source component. Even though the kernel modules have been open-sourced, the user-space binaries (e.g., libcuda.so, libnvidia-ml.so) are still proprietary. The open GPU kernel modules replace only the kernel-side layer.
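This split is easy to see on an installed system; the user-space libraries are ordinary closed-source shared objects registered with the dynamic linker:

```bash
# Locate the proprietary user-space libraries.
ldconfig -p | grep -E 'libcuda\.so|libnvidia-ml\.so'

# On Debian/Ubuntu, show which package ships libcuda.so (the package name
# varies by driver branch).
dpkg -S libcuda.so | head -n 3
```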

Fabric Manager: When and Why

Fabric Manager is an optional component required only for NVSwitch-based configurations (e.g., HGX, DGX). It monitors NVSwitch interconnects and manages GPU fabric topology.

Such systems must use the nvidia-driver-server package, which provides the proprietary kernel module stack needed for NVSwitch operation. Fabric Manager and the NSCQ components are installed alongside this server-driver stream; using Fabric Manager with nvidia-open or the standard nvidia-driver stream is not supported.

For PCIe GPUs (e.g., H100 PCIe, L40S), Fabric Manager is not required.
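A simple heuristic for checking whether a machine actually has NVSwitch hardware (assuming the kernel driver is already loaded) is to look for the NVSwitch device nodes the driver creates:

```bash
# NVSwitch device nodes exist only on NVSwitch baseboards (HGX/DGX class).
ls /dev/nvidia-nvswitch* 2>/dev/null || echo "no NVSwitch devices: Fabric Manager not needed"
```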

NVLink vs NVSwitch

| Type | Name | Function |
| --- | --- | --- |
| Link-level | NVLink | Point-to-point GPU interconnect (used in 2–4 GPU setups). |
| Fabric-level | NVSwitch | Crossbar switch connecting all GPUs in 8+ GPU systems. |

NVLink-only systems do not require Fabric Manager. NVSwitch systems must use it.
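The interconnect layout can be inspected with nvidia-smi: on NVSwitch systems the topology matrix shows NVLink connectivity between every GPU pair, while NVLink-only systems show it just for directly bridged pairs.

```bash
# GPU-to-GPU connectivity matrix (NV#, PIX, PHB, SYS, ...).
nvidia-smi topo -m

# Per-link NVLink status for GPU 0.
nvidia-smi nvlink --status -i 0
```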

nvidia-driver-assistant Utility

Introduced in r580, nvidia-driver-assistant automatically detects the GPU and OS version and installs the correct driver variant.

```bash
sudo apt install nvidia-driver-assistant
sudo nvidia-driver-assistant --install

# To explicitly choose the proprietary modules:
sudo nvidia-driver-assistant --install --module-flavor closed
```

Decision Rules

NVSwitch-based systems (HGX / DGX)

Use nvidia-driver-server together with nvidia-fabricmanager, plus cuda-drivers for ML workloads.

  • Required for HGX and DGX platforms with NVSwitch interconnects.
  • nvidia-driver-server relies on proprietary kernel modules and includes the NSCQ components needed for NVSwitch.
  • Fabric Manager must be installed and active for NVSwitch initialization.
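A minimal sketch of that setup on an apt-based system; the versioned package names below (580 branch) are assumptions, so check the installation guide for the names that match your distribution and driver branch:

```bash
# Server driver stream plus Fabric Manager (package names are illustrative).
sudo apt install nvidia-driver-580-server nvidia-fabricmanager-580

# Fabric Manager must be enabled and running before CUDA workloads start.
sudo systemctl enable --now nvidia-fabricmanager
systemctl is-active nvidia-fabricmanager
```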

Non-NVSwitch GPU servers (standard ML compute nodes)

Use nvidia-driver (or nvidia-open) together with cuda-drivers.

  • cuda-drivers provides the user-space CUDA libraries required for training and inference.
  • A kernel-space driver (nvidia-driver or nvidia-open) is required to register GPUs and enable CUDA communication.
  • Fabric Manager is not required on PCIe-based systems.
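A sketch of the equivalent install on an apt-based system with NVIDIA's CUDA repository configured (metapackage names as used in NVIDIA's documentation; adjust if your distribution packages them differently):

```bash
# Open kernel modules plus the user-space CUDA driver libraries.
sudo apt install nvidia-open cuda-drivers

# After a reboot (or module load), confirm the GPUs are visible.
nvidia-smi
```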

Containerized compute environments (Docker / Kubernetes / Slurm)

Host system: nvidia-driver (or nvidia-open) + cuda-drivers
Containers: cuda-runtime or cuda-drivers

  • The host already provides /dev/nvidia* devices and the kernel-space driver.
  • Containers require only the user-space CUDA libraries; no kernel-space modules run inside containers.
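For example, with Docker and the NVIDIA Container Toolkit already set up on the host (the image tag below is illustrative), a runtime-only CUDA image uses the host's driver directly:

```bash
# The container brings only user-space CUDA libraries; the kernel module and
# /dev/nvidia* devices come from the host via the container toolkit.
docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi
```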

Troubleshooting Notes

CUDA runtime error (802): system not yet initialized

This error typically appears on NVSwitch-based systems when the Fabric Manager service is not installed or not running.

To resolve this issue, make sure that Fabric Manager is installed and active. It is required for initializing and managing NVSwitch interconnects.
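On such systems, checking the service state and its logs is usually enough to confirm the cause (service name as shipped by the nvidia-fabricmanager package):

```bash
# Is the Fabric Manager service present and running?
systemctl status nvidia-fabricmanager --no-pager

# Look for NVSwitch initialization errors in its logs.
journalctl -u nvidia-fabricmanager --no-pager | tail -n 50

# Restart it if it is installed but inactive.
sudo systemctl restart nvidia-fabricmanager
```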

References

For detailed installation steps and the latest package commands, refer to the official NVIDIA documentation: Driver Installation Guide