
Failed to discover NVIDIA GPU in the running container started by buildah (vfs + chroot) #5227


Open
enihcam opened this issue Dec 16, 2023 · 16 comments

Comments

@enihcam

enihcam commented Dec 16, 2023

Description
Failed to discover NVIDIA GPU in the running container started by buildah (vfs + chroot)

Steps to reproduce the issue:

  1. start a GPU container that does NOT support Docker-in-Docker (for security reasons)
  2. install buildah
  3. configure the storage driver (export STORAGE_DRIVER=vfs) and isolation (export BUILDAH_ISOLATION=chroot)
  4. build a PyTorch+CUDA image with buildah and run with buildah
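The steps above can be sketched as a shell session. The image name and build commands are placeholders, not taken from the report, and are shown commented out since they need buildah and a Containerfile:

```shell
# Sketch of the reproduction, run inside the outer GPU container.
export STORAGE_DRIVER=vfs          # step 3: vfs storage driver
export BUILDAH_ISOLATION=chroot    # step 3: chroot isolation

# Step 4 (hypothetical image name; requires buildah on the PATH):
# buildah bud -t localhost/pytorch-cuda .
# ctr=$(buildah from localhost/pytorch-cuda)
# buildah run "$ctr" -- python3 -c 'import torch; print(torch.cuda.is_available())'
echo "driver=$STORAGE_DRIVER isolation=$BUILDAH_ISOLATION"
```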

Describe the results you received:
[screenshot omitted: PyTorch fails to discover the NVIDIA GPU inside the container]

Describe the results you expected:
PyTorch finds the GPU and runs the code successfully.

Output of rpm -q buildah or apt list buildah:

# rpm -q buildah
buildah-1.30.0-1.tl4.x86_64

Output of buildah version:

# buildah version
Version:         1.30.0
Go Version:      go1.19
Image Spec:      1.0.2-dev
Runtime Spec:    1.1.0-rc.1
CNI Spec:        1.0.0
libcni Version:  v1.1.2
image Version:   5.25.0
Git Commit:
Built:           Fri Jul 14 19:36:27 2023
OS/Arch:         linux/amd64
BuildPlatform:   linux/amd64

Output of podman version if reporting a podman build issue:

(paste your output here)

Output of cat /etc/*release:

# cat /etc/*release
NAME="TencentOS Server"
VERSION="4.0"
ID="tencentos"
ID_LIKE="tencentos"
VERSION_ID="4.0"
PLATFORM_ID="platform:tl4.0"
PRETTY_NAME="TencentOS Server 4.0"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:tencentos:tencentos:4.0"
HOME_URL="https://cloud.tencent.com/product/ts"
BUG_REPORT_URL="https://cloud.tencent.com/product/ts"
TencentOS Server 4.0

Output of uname -a:

# uname -a
Linux root-pvkf3ma0a 5.4.119-19.0009.28 #1 SMP Thu May 18 10:37:10 CST 2023 x86_64 GNU/Linux

Output of cat /etc/containers/storage.conf:

# cat /etc/containers/storage.conf
[storage]
driver = "vfs"
runroot = "/data/containers/storage"
graphroot = "/data/containers/storage"
rootless_storage_path = "/data/containers/storage"

[storage.options.vfs]
ignore_chown_errors = "true"
@rhatdan
Member

rhatdan commented Dec 16, 2023

Isn't the GPU a device? Say /dev/gpu?

Could you try

ctr=$(buildah from --device /dev/gpu ...)
buildah run $ctr ...


A friendly reminder that this issue had no activity for 30 days.

@enihcam
Author

enihcam commented Jul 10, 2024

Isn't the GPU a device? Say /dev/gpu?

Could you try

ctr=$(buildah from --device /dev/gpu ...)
buildah run $ctr ...

Sorry for the late reply. I tried the following:

ctr=$(buildah --device /dev/nvidia0 from for.example.com/gpu_image_for_test)
buildah run $ctr /bin/bash

Then nvidia-smi gave no output at all.

By the way, this container is itself running inside another container (vfs + chroot mode).
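A check worth running at this point (a sketch, not something from the thread): nvidia-smi typically needs /dev/nvidiactl as well, and CUDA additionally needs /dev/nvidia-uvm, so passing only /dev/nvidia0 is often not enough. Listing which device nodes are actually visible inside the container narrows the problem down:

```shell
# Sketch: count which NVIDIA device nodes are visible in this container.
# Passing only /dev/nvidia0 is usually insufficient for nvidia-smi/CUDA.
visible=0
for dev in /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm; do
  if [ -e "$dev" ]; then
    echo "present: $dev"
    visible=$((visible + 1))
  else
    echo "missing: $dev"
  fi
done
echo "visible=$visible"
```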

@rhatdan
Member

rhatdan commented Jul 11, 2024

Could you try

buildah --device=nvidia.com/gpu=all from ...

@enihcam
Author

enihcam commented Jul 13, 2024

Could you try

buildah --device=nvidia.com/gpu=all from ...

stat nvidia.com/gpu=all: no such file or directory

@rhatdan
Member

rhatdan commented Jul 13, 2024

What version of buildah are you using?

@enihcam
Author

enihcam commented Jul 15, 2024

What version of buildah are you using?

~ # buildah version
Version:         1.33.7
Go Version:      go1.21.9 (Red Hat 1.21.9-1.module+el8.8.0+632+2dde9914)
Image Spec:      1.1.0-rc.5
Runtime Spec:    1.1.0
CNI Spec:        1.0.0
libcni Version:  v1.1.2
image Version:   5.29.2
Git Commit:
Built:           Tue Jun 18 11:12:42 2024
OS/Arch:         linux/amd64
BuildPlatform:   linux/amd64
~ # env | grep BUILDAH
BUILDAH_FORMAT=docker
BUILDAH_ISOLATION=chroot
~ # env | grep STORAGE
STORAGE_DRIVER=vfs

@rhatdan
Member

rhatdan commented Jul 15, 2024

Any chance you can update the version?

$ buildah -v
buildah version 1.36.0 (image-spec 1.1.0, runtime-spec 1.2.0)
tmp $ buildah version
Version:         1.36.0
Go Version:      go1.22.3
Image Spec:      1.1.0
Runtime Spec:    1.2.0
CNI Spec:        1.0.0
libcni Version:  
image Version:   5.31.0
Git Commit:      
Built:           Mon May 27 09:11:54 2024
OS/Arch:         linux/amd64
BuildPlatform:   linux/amd64

@rhatdan
Member

rhatdan commented Jul 15, 2024

$ git show 7658d9ed7e02ec5cf90cc397f78a5755599b0a32
commit 7658d9ed7e02ec5cf90cc397f78a5755599b0a32
Author: Daniel J Walsh <[email protected]>
Date:   Mon Mar 25 11:55:50 2024 -0400

    Support nvidia.com/gpus as devices
    

    Signed-off-by: Daniel J Walsh <[email protected]>

diff --git a/pkg/parse/parse_unix.go b/pkg/parse/parse_unix.go
index ff8ce854e..d3f3dc14c 100644
--- a/pkg/parse/parse_unix.go
+++ b/pkg/parse/parse_unix.go
@@ -7,6 +7,7 @@ import (
        "fmt"
        "os"
        "path/filepath"
+       "strings"
 
        "github.com/containers/buildah/define"
        "github.com/opencontainers/runc/libcontainer/devices"
@@ -18,6 +19,12 @@ func DeviceFromPath(device string) (define.ContainerDevices, error) {
        if err != nil {
                return nil, err
        }
+       if strings.HasPrefix(src, "nvidia.com") {
+               device := define.BuildahDevice{Source: src, Destination: dst}
+               devs = append(devs, device)
+               return devs, nil
+       }
+
        srcInfo, err := os.Stat(src)
        if err != nil {
                return nil, fmt.Errorf("getting info of source device %s: %w", src, err)

@rhatdan rhatdan closed this as completed Jul 15, 2024
@rhatdan
Member

rhatdan commented Jul 16, 2024

Yes 1.36 has the patch.

@enihcam
Author

enihcam commented Jul 17, 2024

Yes 1.36 has the patch.

https://github.com/containers/buildah/blob/release-1.36/pkg/parse/parse_unix.go

It seems like the patch is missing. Could you confirm? Thanks.

@forwardmeasure

Hello all, any update here? I don't see the patch mentioned above in parse_unix.go.

@enihcam
Author

enihcam commented Jul 29, 2024

@rhatdan your input is needed.

@rhatdan rhatdan reopened this Jul 29, 2024
@nalind
Member

nalind commented Aug 5, 2024

Does the container have access to the necessary CDI configuration in its /etc/cdi directory, either volume-mounted from the host where nvidia-ctk cdi generate was run to generate it, or via some other mechanism?

@enihcam
Author

enihcam commented Aug 29, 2024

Any workaround before the PR is merged?

@nalind
Member

nalind commented Aug 29, 2024

I think the current expectation is that, if the data in /etc/cdi is provided to the container, we won't need this PR, since the CDI logic in 1.36 (and 1.37) already gets a crack at device specifications.

Does the container have access to the necessary CDI configuration in its /etc/cdi directory, either volume-mounted from the host where nvidia-ctk cdi generate was run to generate it, or via some other mechanism?
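A sketch of what the CDI wiring described above could look like. The spec below is a hand-written illustration of the kind of file nvidia-ctk cdi generate places under /etc/cdi; the actual device names, node paths, and spec version vary per host and toolkit version:

```json
{
  "cdiVersion": "0.5.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "all",
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/nvidia0" },
          { "path": "/dev/nvidiactl" },
          { "path": "/dev/nvidia-uvm" }
        ]
      }
    }
  ]
}
```

With a file like this visible inside the container (for example, volume-mounted at /etc/cdi from the host), buildah's CDI logic should resolve a name such as nvidia.com/gpu=all through the spec instead of trying to stat() it as a filesystem path.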
