epoll on pidfd

The Linux kernel has an interesting file descriptor called pidfd. As the name implies, it is a file descriptor that refers to a specific process. The nice thing about it is that it is guaranteed to refer to the process you expected when you got that pidfd. A process ID, or PID, carries no such guarantee: PIDs get reused, so what you think process 1234 is and what the kernel knows process 1234 to be could be different, because your process exited and the process IDs have wrapped around.

pidfds are *odd*; they’re half a “normal” file descriptor and half… something else. That means some file descriptor operations work and some fail in odd ways. stat() works, but using one as the first parameter of openat() will fail.

One thing you can do with them is use epoll() on them to get process status. In fact, the pidfd_open() manual page says:

A PID file descriptor returned by pidfd_open() (or by clone(2) with the CLONE_PIDFD flag) can be used for the following purposes:

…

A PID file descriptor can be monitored using poll(2), select(2), and epoll(7). When the process that it refers to terminates, these interfaces indicate the file descriptor as readable.

So if you want to wait until something terminates, you can just get a pidfd for the process and sit an epoll_wait() on it. Simple, right? Except it’s not quite true.
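As a rough sketch of that "simple" approach, the function below opens a pidfd, adds it to an epoll instance and waits for it to become readable. The name wait_for_pid_event is made up for this example, and it calls the raw pidfd_open system call on the assumption that the C library may not provide a wrapper.

```c
#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Wait until the pidfd for `pid` reports an event, or timeout_ms elapses.
 * Returns the epoll event mask, 0 on timeout, or -1 on error.
 * (wait_for_pid_event is a hypothetical name used only in this sketch.)
 */
int wait_for_pid_event(pid_t pid, int timeout_ms)
{
    /* Raw syscall, assuming the libc may lack a pidfd_open() wrapper. */
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -1;

    int result = -1;
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = pidfd };
        struct epoll_event out;

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pidfd, &ev) == 0) {
            int n = epoll_wait(epfd, &out, 1, timeout_ms);
            if (n == 1)
                result = (int)out.events;   /* e.g. EPOLLIN */
            else if (n == 0)
                result = 0;                 /* timed out */
        }
        close(epfd);
    }
    close(pidfd);
    return result;
}
```

Calling wait_for_pid_event(pid, -1) blocks until the first event arrives. A result with EPOLLIN set is what the manual page describes as "terminated".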

procps issue #386 stated that if you had a list of processes, then pidwait only finds half of them. I’d like to thank Steve, the issue reporter, for the initial work on this. The odd thing is that for every exited process, you get two epoll events. You get an EPOLLIN first, then an EPOLLIN | EPOLLHUP after that. Steve suggested the first was when the process exits, the second when the process has been collected by the parent.

I have a collection of oddball processes, including ones that make zombies. A zombie is a child that has exited but has not been wait()ed for by its parent. In other words, if a parent doesn’t collect its dead child, then the child becomes a zombie. The test program spawns a child, which exits after some seconds. The parent waits longer, calls wait(), waits some more, then exits. Running pidwait, we can see the following epoll events:

When the child exits, EPOLLIN on the child is triggered. At this stage the child is a zombie.
When the parent calls wait(), then EPOLLIN | EPOLLHUP on the child is triggered.
When the parent exits, EPOLLIN then EPOLLIN | EPOLLHUP on the parent is triggered. That is, two events for the one thing.
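The first two observations can be reproduced in miniature. This is not the test program described above, just a sketch: it forks a child that exits at once, samples the child's pidfd while it is still a zombie, reaps it, then samples again. The names poll_once and observe_child are invented here. Whether the second sample carries EPOLLHUP depends on the kernel; kernels that predate that behaviour will presumably show plain EPOLLIN both times.

```c
#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Poll once; returns the event mask, 0 if nothing is ready, -1 on error. */
int poll_once(int epfd, int timeout_ms)
{
    struct epoll_event out;
    int n = epoll_wait(epfd, &out, 1, timeout_ms);

    if (n < 0)
        return -1;
    return n == 1 ? (int)out.events : 0;
}

/*
 * Fork a short-lived child and sample its pidfd twice:
 *   *as_zombie  - mask seen after the child exits, before it is reaped
 *   *after_reap - mask seen after the parent has called waitpid()
 * Returns 0 on success, -1 if fork or the pidfd/epoll setup failed.
 */
int observe_child(int *as_zombie, int *after_reap)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0)
        _exit(0);                             /* child: exit straight away */

    int rc = -1;
    int pidfd = (int)syscall(SYS_pidfd_open, child, 0);
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pidfd };

    if (pidfd >= 0 && epfd >= 0 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, pidfd, &ev) == 0) {
        *as_zombie = poll_once(epfd, 5000);   /* waits for the child to exit */
        waitpid(child, NULL, 0);              /* parent collects the child */
        *after_reap = poll_once(epfd, 5000);
        rc = 0;
    } else {
        waitpid(child, NULL, 0);              /* still reap on setup failure */
    }
    if (epfd >= 0)
        close(epfd);
    if (pidfd >= 0)
        close(pidfd);
    return rc;
}
```

Because the pidfd is registered level-triggered, the second epoll_wait() returns straight away with whatever the kernel reports for a reaped process.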

If you want to use epoll() to know when a process terminates, then you need to decide on what you mean by that:

If you mean it has exited but has not been collected yet (i.e. possibly a zombie), then you need to look for events that are EPOLLIN only.
If you mean the process is fully gone, then EPOLLHUP is a better choice. You can even change the epoll_ctl() call to use this instead.
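One way to express the two choices in code is sketched below. classify_pidfd_events maps an event mask onto the two meanings of "terminated", and epoll_add_hup_only shows the changed epoll_ctl() call. Both names and the enum are made up for illustration. The second helper relies on documented epoll behaviour: EPOLLHUP is always reported, so leaving EPOLLIN out of the mask means the exit itself no longer wakes the waiter.

```c
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical labels for what a pidfd event mask tells us. */
enum proc_state {
    PROC_NO_EVENT,   /* nothing reported: not known to have exited */
    PROC_EXITED,     /* EPOLLIN without EPOLLHUP: exited, maybe a zombie */
    PROC_GONE        /* EPOLLHUP set: collected by its parent */
};

/* Classify the events field from a struct epoll_event for a pidfd. */
enum proc_state classify_pidfd_events(uint32_t events)
{
    if (events & EPOLLHUP)
        return PROC_GONE;
    if (events & EPOLLIN)
        return PROC_EXITED;
    return PROC_NO_EVENT;
}

/*
 * Register pidfd so that only EPOLLHUP (or an error) wakes epoll_wait().
 * epoll reports EPOLLHUP whether or not it is requested; naming it here
 * just makes the intent obvious. Returns the epoll_ctl() result.
 */
int epoll_add_hup_only(int epfd, int pidfd)
{
    struct epoll_event ev = { .events = EPOLLHUP, .data.fd = pidfd };

    return epoll_ctl(epfd, EPOLL_CTL_ADD, pidfd, &ev);
}
```

With the HUP-only registration, a waiter on a kernel that never raises EPOLLHUP for pidfds would wait forever, so a timeout or a fallback to EPOLLIN is probably sensible.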

A “zombie trigger” (EPOLLIN with no subsequent EPOLLHUP) is a bit tricky to work out. There is no guarantee the two events arrive in the same epoll_wait() call, especially if the parent is a bit tardy with its wait() call.
