1. Installing the NVIDIA CUDA Driver on the Ubuntu Host
This section follows the NVIDIA Driver Installation Quickstart Guide (NVIDIA Tesla Documentation).
It describes how to install the NVIDIA driver on Ubuntu 16.04 LTS and Ubuntu 18.04 LTS using the package manager.
- At install time the NVIDIA driver needs the kernel headers and development packages that match the running kernel. For example, if the kernel is 4.4.0, linux-headers-4.4.0 must be installed.
$ sudo apt-get install linux-headers-$(uname -r)
- Make sure packages from the CUDA repository take priority over those from the Canonical repository.
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
$ wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-$distribution.pin
$ sudo mv cuda-$distribution.pin /etc/apt/preferences.d/cuda-repository-pin-600
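The `distribution` variable above is built by concatenating the `ID` and `VERSION_ID` fields of /etc/os-release and stripping the dot, so Ubuntu 18.04 becomes `ubuntu1804`. The sketch below isolates that step so it can be checked before any URL is fetched. The helper name `cuda_distro_tag` is made up for this illustration and is not part of any NVIDIA tooling.

```shell
#!/bin/sh
# cuda_distro_tag (hypothetical helper): join ID and VERSION_ID and drop
# the dots, e.g. "ubuntu" + "18.04" -> "ubuntu1804".
cuda_distro_tag() {
    # $1 = ID from /etc/os-release, $2 = VERSION_ID from /etc/os-release
    printf '%s%s\n' "$1" "$2" | sed -e 's/\.//g'
}

# On a real host, source /etc/os-release inside a subshell so its
# variables do not leak into the current shell.
if [ -r /etc/os-release ]; then
    distribution=$( . /etc/os-release; cuda_distro_tag "$ID" "$VERSION_ID" )
    echo "https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/"
fi
```

Printing the resulting URL first makes it easy to confirm that the repository path matches the host release before the pin file is downloaded.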
- Install the GPG public key of the CUDA repository.
$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/7fa2af80.pub
- Add the CUDA repository.
$ echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda.list
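- Update the APT cache and install the driver from the CUDA repository. The --no-install-recommends option can be used for a lean driver installation without any X dependencies, which is especially useful for headless installs on cloud instances.
$ sudo apt-get update
$ sudo apt-get -y install cuda-drivers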
- Verify the NVIDIA driver installation.
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 30%   28C    P8    17W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 30%   25C    P8    12W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
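Reading the table by eye works for one host; for several hosts a scripted check may be more convenient. The following is a minimal sketch that uses nvidia-smi's CSV query mode and compares the reported driver version against a minimum. The helper `version_ge` and the threshold `450.80` are assumptions chosen for illustration, and the comparison relies on GNU `sort -V`.

```shell
#!/bin/sh
# version_ge A B (hypothetical helper): succeed when dotted version A >= B.
# Relies on GNU sort -V, which ships with Ubuntu's coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

min_driver="450.80"   # assumed threshold, for illustration only

if command -v nvidia-smi >/dev/null 2>&1; then
    # One CSV row per GPU: "index, name, driver_version"
    nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader |
    while IFS=, read -r idx name ver; do
        name=$(echo "$name" | sed 's/^ *//')   # drop the space after the comma
        ver=$(echo "$ver" | tr -d ' ')
        if version_ge "$ver" "$min_driver"; then
            echo "GPU $idx ($name): driver $ver OK"
        else
            echo "GPU $idx ($name): driver $ver is older than $min_driver"
        fi
    done
else
    echo "nvidia-smi not found; the driver is probably not installed"
fi
```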
2. Installing Docker and the NVIDIA Container Toolkit
This section follows the Installation Guide (NVIDIA Cloud Native Technologies documentation).
- Install Docker.
$ curl -fsSL https://get.docker.com | bash -s docker --mirror Aliyun
- Add the nvidia-docker repository and its GPG public key.
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
- Install nvidia-docker2.
$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
- Change Docker's default runtime from runc to nvidia-container-runtime.
$ vim /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "registry-mirrors": ["https://hub-mirror.c.163.com"]
}
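On hosts that are provisioned by script, editing the file in vim is not practical. The sketch below writes the same configuration non-interactively with a here-document. The helper name `write_daemon_json` and the scratch path are assumptions for this example; note that the helper overwrites its target, so it writes to a scratch file first rather than straight to /etc/docker/daemon.json.

```shell
#!/bin/sh
# write_daemon_json (hypothetical helper): write the configuration from
# this step to the given path. It OVERWRITES the target file.
write_daemon_json() {
    cat > "$1" <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "registry-mirrors": ["https://hub-mirror.c.163.com"]
}
EOF
}

# Write to a scratch file so the result can be reviewed (and merged with
# any existing settings) before it replaces the live configuration.
target="${TMPDIR:-/tmp}/daemon.json.new"
write_daemon_json "$target"
echo "wrote $target; review it, then: sudo cp $target /etc/docker/daemon.json"
```

Once Docker has been restarted, `docker info --format '{{.DefaultRuntime}}'` should report `nvidia` if the change took effect.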
- Restart Docker Engine.
$ systemctl restart docker
- Verify nvidia-docker.
$ docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 30%   27C    P8    17W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 30%   25C    P8    14W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
3. Adding the Host to the KubeSphere Cluster
- Edit config-sample.yaml and add the GPU host to the configuration file.
$ vim config-sample.yaml
- Use kubekey to join the node to the KubeSphere cluster automatically, based on the configuration file.
$ ./kk add nodes -f config-sample.yaml
- Label the node as a GPU node. This is done in the graphical console; see KubeSphere - Node Management.
- Install the k8s-device-plugin plugin in the KubeSphere cluster. See Schedule GPUs | Kubernetes.
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
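After the device plugin is running, each GPU node should advertise an `nvidia.com/gpu` resource. The sketch below is one way to confirm that from the command line, assuming `kubectl` is already configured for the cluster. The helper `gpu_nodes` is a name invented for this example; nodes without GPUs show `<none>` in the GPU column and are filtered out.

```shell
#!/bin/sh
# gpu_nodes (hypothetical helper): read "NAME GPU" rows on stdin and print
# the names of nodes that advertise at least one allocatable GPU.
gpu_nodes() {
    awk '$2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

if command -v kubectl >/dev/null 2>&1; then
    # The dot in "nvidia.com" is escaped so it is not read as a path separator.
    kubectl get nodes --no-headers \
        -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
        gpu_nodes
fi
```

If the newly added host is missing from the output, checking that the device plugin pod is running on that node is a reasonable next step.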