aardvark-dns locks mount points #25994

Open
Lalufu opened this issue Apr 27, 2025 · 7 comments · May be fixed by containers/common#2431
Assignees
Labels
kind/bug (Categorizes issue or PR as related to a bug), network (Networking related issue or feature), triaged (Issue has been triaged)

Comments

@Lalufu

Lalufu commented Apr 27, 2025

Issue Description

Note: I'm not sure if this is a podman or an aardvark-dns issue.

Running aardvark-dns processes block the devices backing mount points (such as partitions, logical volumes, ...) from being destroyed if the file systems on those devices were mounted when aardvark-dns was started. The aardvark-dns process must be terminated before the device can be freed.

Steps to reproduce the issue

Note: I'm using ZFS for demonstration purposes here, as this is easiest on the machine in question, but this is not a ZFS-specific problem. The problem also manifests with partitions and logical volumes.

All podman commands run as rootless.

Create a very simple pod that does not mount anything from the host:

$ cat foo.yaml
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: test
  name: test
spec:
  containers:
  - image: fedora:42
    name: test
    securityContext:
      readOnlyRootFilesystem: true
    command:
      - "sleep"
    args:
      - "7200"
  # Use user namespaces
  hostUsers: false

Create a new file system on the host:

$ sudo zfs create tank/mirror/solaris
$ findmnt tank/mirror/solaris
TARGET               SOURCE              FSTYPE OPTIONS
/tank/mirror/solaris tank/mirror/solaris zfs    rw,nosuid,nodev,relatime,seclabel,xattr,noacl,casesensitive

Note that the file system is empty; it's not used by anything, it's just mounted.

Start the pod

$ podman kube play ./foo.yaml 
Pod:
6f891b76a212ef24a7cb8c63b1a9e2ed4c27e02d1571dd99d84706848ebdafb1
Container:
bc95925056f27b08038fe85001f48ff449701ae1d714a15f3f86245abcec5fea

$ podman container list
CONTAINER ID  IMAGE                                    COMMAND     CREATED        STATUS        PORTS       NAMES
56f36f75e305  localhost/podman-pause:5.4.2-1743552000              6 seconds ago  Up 5 seconds              6f891b76a212-infra
bc95925056f2  registry.fedoraproject.org/fedora:42     7200        6 seconds ago  Up 5 seconds              test-test

Try to destroy the just-created file system

$ sudo zfs destroy tank/mirror/solaris
cannot destroy 'tank/mirror/solaris': dataset is busy

There are no users of that path

$ sudo lsof /tank/mirror/solaris
$ 

But the mount point is still held

$ sudo grep /tank/mirror/solaris /proc/*/mounts
/proc/2840545/mounts:tank/mirror/solaris /tank/mirror/solaris zfs rw,seclabel,nosuid,nodev,relatime,xattr,noacl,casesensitive 0 0
$ ps q 2840545
    PID TTY      STAT   TIME COMMAND
2840545 ?        Ssl    0:00 /usr/libexec/podman/aardvark-dns --config /run/user/1000/containers/networks/aardvark-dns -p 53 run

Stop the pod

$ podman kube play ./foo.yaml --down
WARN[0010] StopSignal SIGTERM failed to stop container test-test in 10 seconds, resorting to SIGKILL 
Pods stopped:
6f891b76a212ef24a7cb8c63b1a9e2ed4c27e02d1571dd99d84706848ebdafb1
Pods removed:
6f891b76a212ef24a7cb8c63b1a9e2ed4c27e02d1571dd99d84706848ebdafb1
Secrets removed:
Volumes removed:

The file system can now be destroyed

$ sudo zfs destroy tank/mirror/solaris
$ 

Describe the results you received

Devices backing file systems that were mounted when aardvark-dns was started cannot be destroyed until aardvark-dns is stopped

Describe the results you expected

Devices whose file systems have been unmounted on the host can be destroyed, regardless of whether aardvark-dns is still running.

podman info output

host:
  arch: amd64
  buildahVersion: 1.39.4
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.13-1.fc41.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.13, commit: '
  cpuUtilization:
    idlePercent: 91.39
    systemPercent: 4.81
    userPercent: 3.81
  cpus: 8
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    version: "41"
  eventLogger: journald
  freeLocks: 2042
  hostname: ethan.home.dn.lalufu.net
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 100
      size: 1
    - container_id: 1
      host_id: 558752
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 558752
      size: 65536
  kernel: 6.13.12-200.fc41.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 15318974464
  memTotal: 134943457280
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.14.0-1.fc41.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.14.0
    package: netavark-1.14.1-1.fc41.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.14.1
  ociRuntime:
    name: crun
    package: crun-1.21-1.fc41.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.21
      commit: 10269840aa07fb7e6b7e1acff6198692d8ff5c88
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20250415.g2340bbf-1.fc41.x86_64
    version: ""
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.3.1-1.fc41.x86_64
    version: |-
      slirp4netns version 1.3.1
      commit: e5e368c4f5db6ae75c2fce786e31eef9da6bf236
      libslirp: 4.8.0
      SLIRP_CONFIG_VERSION_MAX: 5
      libseccomp: 2.5.5
  swapFree: 34357637120
  swapTotal: 34359734272
  uptime: 67h 32m 28.00s (Approximately 2.79 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
store:
  configFile: /home/sun/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: vfs
  graphOptions: {}
  graphRoot: /home/sun/.local/share/containers/storage
  graphRootAllocated: 107374182400
  graphRootUsed: 43418984448
  graphStatus: {}
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 4
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/sun/.local/share/containers/storage/volumes
version:
  APIVersion: 5.4.2
  BuildOrigin: Fedora Project
  Built: 1743552000
  BuiltTime: Wed Apr  2 00:00:00 2025
  GitCommit: be85287fcf4590961614ee37be65eeb315e5d9ff
  GoVersion: go1.23.7
  Os: linux
  OsArch: linux/amd64
  Version: 5.4.2


Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

No

Additional environment details


Additional information


@Lalufu Lalufu added the kind/bug label Apr 27, 2025
@ninja-quokka
Collaborator

I attempted to reproduce this on the latest main branch without success.

I can't install ZFS (OpenZFS doesn't ship arm64 builds), so I made my attempt with tmpfs.

sudo mkdir /mnt/mytmp
sudo mount -t tmpfs -o size=1G tmpfs /mnt/mytmp
findmnt /mnt/mytmp
./bin/podman kube play ./foo.yaml
sudo umount /mnt/mytmp
findmnt /mnt/mytmp

Could you also try with tmpfs?

@ninja-quokka ninja-quokka added the network and needs-info labels Apr 28, 2025

A reviewer has determined we need more information to understand the reported issue. A comment on what is missing should be provided. Be certain you:

  • provide an exact reproducer where possible
  • verify you have provided all relevant information - minimum is podman info
  • answer any follow up questions

If no response to the needs-info is provided in 30 days, this issue may be closed by our stale bot.

For more information on reporting issues on this repository, consult our issue guide.

@Lalufu
Author

Lalufu commented Apr 28, 2025

There might have been some confusion in my initial report. zfs destroy <dataset> does two things:

  • Unmount the file system on the dataset
  • Remove the dataset from the pool

The first step works fine, even with the container running. After running zfs destroy tank/mirror/solaris, /tank/mirror/solaris is no longer mounted in the default mount namespace on the host. It is the second part (removing the dataset) that fails, because it is still mounted in the mount namespace that aardvark-dns uses (at least I think that's what's going on).

Hence, the tmpfs test is probably not sufficient to test this, because tmpfs doesn't have a real device backing it whose destruction could be affected by this?
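As an aside, one way to double-check which mount namespace still holds the mount is util-linux's findmnt with its --task option, which reads /proc/<pid>/mountinfo instead of the caller's own table. A minimal sketch, reusing the aardvark-dns PID from the earlier ps output as an example:

# hypothetical check; 2840545 stands for the aardvark-dns PID found via grep/ps
findmnt --task 2840545 /tank/mirror/solaris
# if the dataset is still listed here after the host-side umount, it is the
# aardvark-dns mount namespace that keeps the dataset busy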

To make this easier to reproduce, I've also done the test with LVM. This is a bit more involved, but basically tests the same thing. /dev/sdi1 is a partition on a USB stick, otherwise unused.

Create a PV, a VG, and an LV, then format with XFS

$ sudo pvcreate /dev/sdi1
  Physical volume "/dev/sdi1" successfully created.
  Creating devices file /etc/lvm/devices/system.devices

$ sudo vgcreate vg-test /dev/sdi1
  Volume group "vg-test" successfully created

$ sudo lvcreate -n lv-test -l 100%FREE vg-test
  Logical volume "lv-test" created.

$ sudo mkfs.xfs /dev/vg-test/lv-test
meta-data=/dev/vg-test/lv-test   isize=512    agcount=4, agsize=1877504 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=7510016, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Mount the file system somewhere

$ sudo mkdir /test
$ sudo mount /dev/vg-test/lv-test /test
$ sudo lvs
  LV      VG      Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv-test vg-test -wi-ao---- <28.65g

Note the "o" in the lvs attributes: the logical volume is "open", i.e. it is in use by something (the file system mounted on it).

Start the pod

$ podman kube play ./foo.yaml
Pod:                          
7be71e02b9632d6edd64c77101033e60d5909d05c8d61bcafb511369975245f4
Container:                    
fb408c0196200548a0b017fc961da283c1e2743bf813475bc93fb6777db189bd

Unmount the file system on the host. Note that this succeeds.

$ sudo umount /test
$ findmnt /test
$

Looking at the LV, it is still open

$ sudo lvs
  LV      VG      Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv-test vg-test -wi-ao---- <28.65g

And it cannot be removed

$ sudo lvremove /dev/vg-test/lv-test
  Logical volume vg-test/lv-test contains a filesystem in use.

The LV is still open

$ sudo lvs
  LV      VG      Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv-test vg-test -wi-ao---- <28.65g

$ sudo grep '/test' /proc/*/mounts
/proc/3541779/mounts:/dev/mapper/vg--test-lv--test /test xfs rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0

$ ps q 3541779
    PID TTY      STAT   TIME COMMAND
3541779 ?        Ssl    0:00 /usr/libexec/podman/aardvark-dns --config /run/user/1000/containers/networks/aardvark-dns -p 53 run

Stopping the container allows removal of the LV

$ podman kube play ./foo.yaml --down
WARN[0010] StopSignal SIGTERM failed to stop container test-test in 10 seconds, resorting to SIGKILL
Pods stopped:                 
7be71e02b9632d6edd64c77101033e60d5909d05c8d61bcafb511369975245f4
Pods removed:                 
7be71e02b9632d6edd64c77101033e60d5909d05c8d61bcafb511369975245f4
Secrets removed:              
Volumes removed:              

$ sudo lvremove /dev/vg-test/lv-test
Do you really want to remove active logical volume vg-test/lv-test? [y/n]: y
  Logical volume "lv-test" successfully removed.

The warning is because the volume is still "active", but it is no longer "open".

@Luap99
Member

Luap99 commented Apr 28, 2025

diff --git a/vendor/github.com/containers/common/libnetwork/internal/rootlessnetns/netns_linux.go b/vendor/github.com/containers/common/libnetwork/internal/rootlessnetns/netns_linux.go
index 2655587654..06ed3cb5b5 100644
--- a/vendor/github.com/containers/common/libnetwork/internal/rootlessnetns/netns_linux.go
+++ b/vendor/github.com/containers/common/libnetwork/internal/rootlessnetns/netns_linux.go
@@ -369,7 +369,7 @@ func (n *Netns) setupMounts() error {
 
        // Ensure we mount private in our mountns to prevent accidentally
        // overwriting the host mounts in case the default propagation is shared.
-       err = unix.Mount("", "/", "", unix.MS_PRIVATE|unix.MS_REC, "")
+       err = unix.Mount("", "/", "", unix.MS_SLAVE|unix.MS_REC, "")
        if err != nil {
                return wrapError("make tree private in new mount namespace", err)
        }

Could you try building podman with this and test it? I think the private mount means that the umount on the host is not propagated into the rootless-netns mount namespace, so it stays mounted there. With slave, the event should be propagated correctly.

I can try to reproduce this later; it would likely be best to use a loop device for the LVM setup so the reproducer doesn't depend on an external device.
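For illustration only (this is not the podman code path, just a minimal shell sketch of the propagation semantics, assuming util-linux unshare and the usual shared root propagation set up by systemd): with private propagation a host-side umount is not seen inside the new mount namespace, with slave it is.

sudo sh -c '
  mkdir -p /mnt/demo
  mount -t tmpfs tmpfs /mnt/demo
  # new mount namespace with *private* propagation (MS_PRIVATE|MS_REC)
  unshare --mount --propagation private sleep 300 &
  ns_pid=$!
  sleep 1
  umount /mnt/demo
  # the mount is still listed: the host umount did not propagate inwards
  grep /mnt/demo /proc/$ns_pid/mounts
  # rerun with --propagation slave (MS_SLAVE|MS_REC) and the grep finds
  # nothing, because umount events propagate from the host into the slave ns
  kill $ns_pid
  rmdir /mnt/demo
'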

@Lalufu
Author

Lalufu commented Apr 28, 2025 via email

@Luap99
Member

Luap99 commented Apr 28, 2025

PODMAN=podman

loop=/tmp/disk.img
sudo fallocate -l 100m  ${loop}
loop_dev=$(sudo losetup -f --show $loop)

sudo pvcreate $loop_dev
sudo vgcreate vg-test $loop_dev
sudo lvcreate -n lv-test -l 100%FREE vg-test

sudo mkfs.ext4 /dev/vg-test/lv-test

mount_point=/tmp/test
sudo mkdir -p $mount_point
sudo mount /dev/vg-test/lv-test $mount_point

$PODMAN network create net1
$PODMAN run --network net1 -d quay.io/libpod/testimage:20241011 sleep inf

sudo umount $mount_point
sudo lvs
sudo lvremove /dev/vg-test/lv-test

I have tested my patch with this, which seems to work, so I will open a PR with it.
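For completeness, a possible teardown of that reproducer (a sketch assuming the same shell variables are still set and nothing else uses the loop device):

$PODMAN rm -f --latest          # remove the test container
$PODMAN network rm net1         # remove the test network
sudo umount $mount_point 2>/dev/null || true   # in case it is still mounted
sudo lvremove -y /dev/vg-test/lv-test
sudo vgremove vg-test
sudo pvremove $loop_dev
sudo losetup -d $loop_dev
sudo rm -f $loop
sudo rmdir $mount_point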

private was intentional, to avoid the mounts being shared, which would mean the mounts in the rootless-netns namespace could propagate to the host (containers/common@4225302).
However, using slave should serve the same purpose: slave means the mounts are propagated only in one direction, from the host into our namespace, and not the other way around, which is what we want.

@Luap99 Luap99 added the triaged label and removed the needs-info label Apr 28, 2025
@Luap99 Luap99 self-assigned this Apr 28, 2025
Luap99 added a commit to Luap99/common that referenced this issue Apr 28, 2025
We don't want to leak our mounts to the host but we still like to
get mount/umount event updates from the host. This is so that when a fs is
unmounted on the host we don't happen to keep it open in aardvark-dns.

Fixes: containers/podman#25994
Fixes: 4225302 ("libnetwork/rootlessnetns: make mountns tree private")

Signed-off-by: Paul Holzinger <[email protected]>
@Lalufu
Author

Lalufu commented Apr 28, 2025 via email

Luap99 added a commit to Luap99/common that referenced this issue May 2, 2025
We don't want to leak our mounts to the host but we still like to
get mount/umount event updates from the host. This is so that when a fs is
unmounted on the host we don't happen to keep it open in aardvark-dns.

Fixes: containers/podman#25994
Fixes: 4225302 ("libnetwork/rootlessnetns: make mountns tree private")

Signed-off-by: Paul Holzinger <[email protected]>