Devices with cgroup v2

Docker and other container systems by default restrict access to devices on the host. They used to do this with cgroups with the cgroup v1 system, however, the second version of cgroups removed this controller and the man page says:

Cgroup v2 device controller has no interface files and is implemented on top of cgroup BPF.
https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v2.rst

That is just awesome, nothing to see here, go look at the BPF documents if you have cgroup v2.

With cgroup v1 if you wanted to know what devices were permitted, you just would cat /sys/fs/cgroup/XX/devices.allow and you were done!

The kernel documentation is not very helpful, sure its something in BPF and has something to do with the cgroup BPF specifically, but what does that mean?

There doesn’t seem to be an easy corresponding method to get the same information. So to see what restrictions a docker container has, we will have to:

Find what cgroup the programs running in the container belong to
Find what is the eBPF program ID that is attached to our container cgroup
Dump the eBPF program to a text file
Try to interpret the eBPF syntax

The last step is by far the most difficult.

Finding a container’s cgroup

All containers have a short ID and a long ID. When you run the docker ps command, you get the short id. To get the long id you can either use the --no-trunc flag or just guess from the short ID. I usually do the second.

$ docker ps 
CONTAINER ID   IMAGE            COMMAND       CREATED          STATUS          PORTS     NAMES
a3c53d8aaec2   debian:minicom   "/bin/bash"   19 minutes ago   Up 19 minutes             inspiring_shannon

So the short ID is a3c53d8aaec2 and the long ID is a big ugly hex string starting with that. I generally just paste the relevant part in the next step and hit tab. For this container the cgroup is /sys/fs/cgroup/system.slice/docker-a3c53d8aaec23c256124f03d208732484714219c8b5f90dc1c3b4ab00f0b7779.scope/ Notice that the last directory has “docker-” then the short ID.

If you’re not sure of the exact path. The “/sys/fs/cgroup” is the cgroup v2 mount point which can be found with mount -t cgroup2 and then rest is the actual cgroup name. If you know the process running in the container then the cgroup column in ps will show you.

$ ps -o pid,comm,cgroup 140064
    PID COMMAND         CGROUP
 140064 bash            0::/system.slice/docker-a3c53d8aaec23c256124f03d208732484714219c8b5f90dc1c3b4ab00f0b7779.scope

Either way, you will have your cgroup path.

eBPF programs and cgroups

Next we will need to get the eBPF program ID that is attached to our recently found cgroup. To do this, we will need to use the bpftool. One thing that threw me for a long time is when the tool talks about a program or a PROG ID they are talking about the eBPF programs, not your processes! With that out of the way, let’s find the prog id.

$ sudo bpftool cgroup list /sys/fs/cgroup/system.slice/docker-a3c53d8aaec23c256124f03d208732484714219c8b5f90dc1c3b4ab00f0b7779.scope/
ID       AttachType      AttachFlags     Name
90       cgroup_device   multi

Our cgroup is attached to eBPF prog with ID of 90 and the type of program is cgroup _device.

Dumping the eBPF program

Next, we need to get the actual code that is run every time a process running in the cgroup tries to access a device. The program will take some parameters and will return either a 1 for yes you are allowed or a zero for permission denied. Don’t use the file option as it dumps the program in binary format. The text version is hard enough to understand.

sudo bpftool prog dump xlated id 90 > myebpf.txt

Congratulations! You now have the eBPF program in a human-readable (?) format.

Interpreting the eBPF program

The eBPF format as dumped is not exactly user friendly. It probably helps to first go and look at an example program to see what is going on. You’ll see that the program splits type (lower 4 bytes) and access (higher 4 bytes) and then does comparisons on those values. The eBPF has something similar:

   0: (61) r2 = *(u32 *)(r1 +0)
   1: (54) w2 &= 65535
   2: (61) r3 = *(u32 *)(r1 +0)
   3: (74) w3 >>= 16
   4: (61) r4 = *(u32 *)(r1 +4)
   5: (61) r5 = *(u32 *)(r1 +8)

What we find is that once we get past the first few lines filtering the given value that the comparison lines have:

r2 is the device type, 1 is block, 2 is character.
r3 is the device access, it’s used with r1 for comparisons after masking the relevant bits. mknod, read and write are 1,2 and 3 respectively.
r4 is the major number
r5 is the minor number

For a even pretty simple setup, you are going to have around 60 lines of eBPF code to look at. Luckily, you’ll often find the lines for the command options you added will be near the end, which makes it easier. For example:

  63: (55) if r2 != 0x2 goto pc+4
  64: (55) if r4 != 0x64 goto pc+3
  65: (55) if r5 != 0x2a goto pc+2
  66: (b4) w0 = 1
  67: (95) exit

This is a container using the option --device-cgroup-rule='c 100:42 rwm'. It is checking if r2 (device type) is 2 (char) and r4 (major device number) is 0x64 or 100 and r5 (minor device number) is 0x2a or 42. If any of those are not true, move to the next section, otherwise return with 1 (permit). We have all access modes permitted so it doesn’t check for it.

The previous example has all permissions for our device with id 100:42, what about if we only want write access with the option --device-cgroup-rule='c 100:42 r'. The resulting eBPF is:

  63: (55) if r2 != 0x2 goto pc+7  
  64: (bc) w1 = w3
  65: (54) w1 &= 2
  66: (5d) if r1 != r3 goto pc+4
  67: (55) if r4 != 0x64 goto pc+3
  68: (55) if r5 != 0x2a goto pc+2
  69: (b4) w0 = 1
  70: (95) exit

The code is almost the same but we are checking that w3 only has the second bit set, which is for reading, effectively checking for X==X&2. It’s a cautious approach meaning no access still passes but multiple bits set will fail.

The device option

docker run allows you to specify files you want to grant access to your containers with the --device flag. This flag actually does two things. The first is to great the device file in the containers /dev directory, effectively doing a mknod command. The second thing is to adjust the eBPF program. If the device file we specified actually did have a major number of 100 and a minor of 42, the eBPF would look exactly like the above snippets.

What about privileged?

So we have used the direct cgroup options here, what does the --privileged flag do? This lets the container have full access to all the devices (if the user running the process is allowed). Like the --device flag, it makes the device files as well, but what does the filtering look like? We still have a cgroup but the eBPF program is greatly simplified, here it is in full:

   0: (61) r2 = *(u32 *)(r1 +0)
   1: (54) w2 &= 65535
   2: (61) r3 = *(u32 *)(r1 +0)
   3: (74) w3 >>= 16
   4: (61) r4 = *(u32 *)(r1 +4)
   5: (61) r5 = *(u32 *)(r1 +8)
   6: (b4) w0 = 1
   7: (95) exit

There is the usual setup lines and then, return 1. Everyone is a winner for all devices and access types!

Comments

2 responses to “Devices with cgroup v2”

Links 25/05/2023: Istio 1.16.5 and Curl 8.1.1 | Techrights

May 25, 2023

[…] Craig Small: Devices with cgroup v2 […]

Mike

March 21, 2024

Thanks for this well investigated overview of the cgroup v2 devices implementation! I suppose it taught me always find a way to indirectly set the eBPF program, never write it yourself :P. But it’s good to know how it fundamentally works anyway.