PyTorch on K8s: Diagnosing a Shared Memory Problem
Background
Running PyTorch on K8s fails with the following error:
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Diagnosis
The PyTorch README says:
Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g. for multithreaded data loaders) the default shared memory segment size that container runs with is not enough, and you should increase shared memory size either with --ipc=host or --shm-size command line options to nvidia-docker run.
This shows that PyTorch's inter-process communication (IPC) relies on shared memory, so the shared memory segment must be large enough.
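The idea of sharing data between processes through shared memory can be sketched as follows. This is only an illustration using Python's standard multiprocessing.shared_memory module, not PyTorch's own implementation, which differs in detail. On Linux, segments created this way are backed by /dev/shm, and writing to a mapping when that tmpfs is full can trigger a bus error like the one shown above. The helper name roundtrip is hypothetical.

```python
from multiprocessing import shared_memory


def roundtrip(payload: bytes) -> bytes:
    """Write payload into a shared memory segment and read it back
    through a second handle attached by name."""
    # Create a new segment; on Linux this is backed by /dev/shm.
    writer = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        writer.buf[:len(payload)] = payload
        # Attach by name, the way a separate worker process would.
        reader = shared_memory.SharedMemory(name=writer.name)
        try:
            return bytes(reader.buf[:len(payload)])
        finally:
            reader.close()
    finally:
        writer.close()
        writer.unlink()  # remove the segment


if __name__ == "__main__":
    print(roundtrip(b"batch of tensors"))
```

In a real data loader the reader would live in another process, but the segment still occupies space in /dev/shm for as long as it exists.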
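Docker's default shared memory size is 64 MB, and it can be changed with docker run --shm-size. But how can this be done on K8s? According to the API documentation, K8s has no way to set it directly, so a workaround is needed.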
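To confirm how much shared memory a container actually has, one can inspect /dev/shm from inside it. The following is a minimal sketch, assuming a Linux container in which shared memory is mounted at /dev/shm; the helper name shm_total_bytes is hypothetical.

```python
import os
import shutil
from typing import Optional


def shm_total_bytes(path: str = "/dev/shm") -> Optional[int]:
    """Return the total size in bytes of the filesystem mounted at path,
    or None if the path does not exist (e.g. on a non-Linux host)."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total


if __name__ == "__main__":
    total = shm_total_bytes()
    if total is None:
        print("/dev/shm not found")
    else:
        print(f"/dev/shm total: {total / 2**20:.0f} MiB")
```

Running this inside the pod before and after changing its configuration is a quick way to check whether the shared memory size has changed.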
Reference: issue
The final solution:
volumes:
- name: dshm
  emptyDir:
    medium: Memory
containers:
- volumeMounts:
  - mountPath: /dev/shm
    name: dshm
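It turns out that emptyDir also supports a memory-backed medium. Mounting such a volume at the container's /dev/shm directory enlarges the container's shared memory. It is a creative trick, and a lesson learned.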
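The fragments above belong under the spec of a Pod (or a Pod template). As a sketch of where they fit, the following builds a complete Pod manifest as a Python dict and prints it as JSON, which kubectl also accepts. The function name build_pod_manifest, the pod name, the container name, and the image are hypothetical placeholders. The optional sizeLimit field of emptyDir is included as an assumption about how one might cap the volume; it is not part of the original solution.

```python
import json
from typing import Optional


def build_pod_manifest(name: str, image: str,
                       shm_size_limit: Optional[str] = None) -> dict:
    """Build a Pod manifest that mounts a memory-backed emptyDir at /dev/shm."""
    empty_dir = {"medium": "Memory"}
    if shm_size_limit is not None:
        # Assumption: cap the volume size, e.g. "2Gi".
        empty_dir["sizeLimit"] = shm_size_limit
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "volumes": [{"name": "dshm", "emptyDir": empty_dir}],
            "containers": [{
                "name": "trainer",  # hypothetical container name
                "image": image,
                "volumeMounts": [{"mountPath": "/dev/shm", "name": "dshm"}],
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image, for illustration only.
    manifest = build_pod_manifest("pytorch-job", "pytorch/pytorch", "2Gi")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f. Note that memory-backed volumes count against the container's memory usage, so the memory limit should leave room for them.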