Introduce process file descriptor (pidfd) based process monitoring for Linux #125
base: main
Resolves #111 |
|
// MARK: - ProcesIdentifier | ||
|
||
/// A platform independent identifier for a Subprocess. | ||
public struct ProcessIdentifier: Sendable, Hashable { |
Is it worthwhile to make this a protocol given the repetition?
Could you elaborate on how we would use this protocol? It's repeated because different platforms have different sets of fields. I don't think having a protocol would solve this problem because we'd still need to offer different concrete types.
File this one under premature optimization, but a protocol might help if the shared, non-platform-specific code started relying on the existence of methods or attributes of `ProcessIdentifier`. I took a look and don't see any at the moment beyond `description`.
If there are more expectations requiring the various `ProcessIdentifier` definitions to stay in sync, it might be helpful to introduce a protocol, not because any given platform needs more than one concrete type, but because the protocol will serve as a contract to keep the implementations in sync with the expectations of the shared code.
} | ||
} | ||
|
||
let unmanaged = Unmanaged<MonitorThreadContext>.fromOpaque(args!) |
Is `unmanaged` used again?
I also renamed it so it's more clear.
@@ -681,12 +681,26 @@ extension Environment { | |||
// MARK: - ProcessIdentifier | |||
|
|||
/// A platform independent identifier for a subprocess. | |||
public struct ProcessIdentifier: Sendable, Hashable, Codable { | |||
public struct ProcessIdentifier: Sendable, Hashable { |
Same comment here about whether a protocol would help with ProcessIdentifier
@@ -35,16 +35,13 @@ public struct Execution: Sendable { | |||
public let processIdentifier: ProcessIdentifier | |||
|
|||
#if os(Windows) | |||
internal nonisolated(unsafe) let processInformation: PROCESS_INFORMATION | |||
internal let consoleBehavior: PlatformOptions.ConsoleBehavior |
nit: unrelated, but you could delete `consoleBehavior` as well, as nothing actually uses it.
public struct ProcessIdentifier: Sendable, Hashable { | ||
/// The platform specific process identifier value | ||
public let value: pid_t | ||
internal let processFileDescriptor: PlatformFileDescriptor |
waitid with P_PIDFD was introduced in Linux kernel 5.4, which focal should have. I'm looking into what's missing
@iCharlesHu One thing I'm concerned about with this change is that unfortunately I don't think we can guarantee that the Linux kernel is version 5.4 or later on all Linux distributions officially supported by Swift.
(Also note that PIDFD_NONBLOCK, if we want to use it, is Linux kernel 5.10 or later)
Containers aren't particularly useful for testing this because they're not going to be using the same kernel as the actual OS distribution. Also, the Swift project is dropping support for Focal in Swift 6.2. Are you planning to keep support for Swift 6.1 in SwiftSubprocess?
Here's the minimum kernel versions associated with each Linux distribution currently officially supported by the Swift project and that I expect to be supported for the Swift 6.2 release:
- Amazon Linux 2 (kernel 4.14): https://docs.aws.amazon.com/linux/al2/ug/aml2-kernel.html
- Debian 12 (kernel 6.1): https://www.debian.org/News/2023/20230610.en.html
- Fedora 39 (kernel 6.5): https://en.wikipedia.org/wiki/Fedora_Linux_release_history, also that version is EoL anyways
- RHEL UBI9 (kernel 5.14): https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/9.0_release_notes/new-features#enhancement_kernel
- Ubuntu 20.04 (kernel 5.4): https://en.wikipedia.org/wiki/Ubuntu_version_history#Table_of_versions
- Ubuntu 22.04 (kernel 5.15 or 5.17): https://en.wikipedia.org/wiki/Ubuntu_version_history#Table_of_versions
- Ubuntu 24.04 (kernel 6.8): https://en.wikipedia.org/wiki/Ubuntu_version_history#Table_of_versions
Thus, until the Swift project moves from Amazon Linux 2 to Amazon Linux 2023 (6.1 kernel), I think we have to retain the original process termination path for compatibility with that distribution, as a fallback path. And we'll need that implementation for OpenBSD and other platforms anyways.
There's also Android. As you may know, the Android ecosystem is often running quite old OS versions compared to iOS, and I think only as of Android 13 (2022) does the OS guarantee kernel 5.4 or later. I'm not sure which minimum Android version we're planning to target in the Swift project, but that would be a good question for @finagolfin or @marcprux or someone else from the Android working group.
Perhaps you could do this by making `processFileDescriptor` an optional property and falling back to the pid based paths when it is `nil`?
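At the syscall level, the optional-descriptor fallback suggested above might look roughly like the following C sketch. It is not the PR's Swift code: `try_pidfd_open` and `wait_for_exit` are hypothetical names, the sketch assumes Linux, and it treats any `pidfd_open` failure (for example `ENOSYS` on a kernel older than 5.3) as "no descriptor available". The fallback here is a plain blocking `waitpid`, which stands in for the existing pid based path.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: generic syscall number, for old headers */
#endif

/* Hypothetical helper: returns a pidfd for `pid`, or -1 when none is available. */
static int try_pidfd_open(pid_t pid) {
    return (int)syscall(SYS_pidfd_open, pid, 0);
}

/* Hypothetical helper: waits for `pid` to exit and reaps it.
   With pidfd >= 0 it first blocks on the descriptor (and closes it afterwards);
   with pidfd < 0 it goes straight to the pid based wait.
   Returns the exit code, or -1 on error or abnormal termination. */
static int wait_for_exit(pid_t pid, int pidfd) {
    if (pidfd >= 0) {
        struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
        /* A pidfd becomes readable once the process has terminated. */
        while (poll(&pfd, 1, -1) < 0 && errno == EINTR) {
        }
        close(pidfd);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass the result of `try_pidfd_open` straight into `wait_for_exit`, so the same call site works on kernels with and without pidfd support.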
@jakepetroules thanks for the analysis. I did not know about Ubuntu's kernel release schedule; I thought that since 20.04 has 5.4, the later ones must have newer ones...
I do have a backup implementation, which is to use `signalfd`. I initially opted for `pidfd` because it's more modern and precise. I guess we'll have to keep both.
Nice, I wasn't aware of `signalfd` (just read up on it). Still, it's Linux specific, so OpenBSD and some other platforms would probably have to continue to rely on the waitid-loop implementation.
I'll try natively compiling this repo and then this pull on Android and running the tests, will let you know what I find.
-1 | ||
) | ||
if eventCount < 0 { | ||
if errno == EINTR || errno == EAGAIN { |
suggestion: it might be worth introducing a helper function to handle EINTR/EAGAIN since it's such a common pattern throughout this codebase; see https://github.com/apple/swift-system/blob/6ee9a58c36ad98f4bd917a64d153dd211512e65d/Sources/System/Util.swift#L27 for example.
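For illustration, such a helper could be shaped like the C sketch below. The names `retry_on_interrupt` and `flaky_op` are hypothetical and not taken from this codebase or from swift-system. The sketch retries only on `EINTR`; whether `EAGAIN` should also be retried depends on the call site, since retrying it on a non-blocking descriptor turns into a busy loop.

```c
#include <errno.h>
#include <sys/types.h>

/* A wrapped operation: returns a negative value and sets errno on failure. */
typedef ssize_t (*syscall_fn)(void *ctx);

/* Hypothetical helper: calls fn(ctx) again for as long as it fails with EINTR. */
static ssize_t retry_on_interrupt(syscall_fn fn, void *ctx) {
    ssize_t result;
    do {
        result = fn(ctx);
    } while (result < 0 && errno == EINTR);
    return result;
}

/* Demo operation: fails with EINTR until the counter behind ctx reaches zero,
   then succeeds with 42. */
static ssize_t flaky_op(void *ctx) {
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        errno = EINTR;
        return -1;
    }
    return 42;
}
```

A real call site would wrap something like `epoll_wait` in a small function with this signature and keep its own handling for every other errno value.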
underlyingError: .init(rawValue: errno) | ||
) | ||
|
||
#if os(Linux) |
I think you may want `|| os(Android)` here too, I'm not sure if `os(Linux)` applies there.
This is not a bug in the existing implementation. It is a bug in the POSIX specification (and a bug in the program.) |
@@ -664,6 +504,10 @@ int _subprocess_fork_exec( | |||
// If we reached this point, something went wrong | |||
write_error_and_exit; | |||
} else { | |||
int _pidfd = _pidfd_open(childPid); |
Could we use `clone` + `CLONE_PIDFD` (Linux 5.2) instead of `fork` + `pidfd_open`? Like FreeBSD's `pdfork`, this avoids races since combining the latter two functions is not atomic.
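As a rough illustration of that suggestion, the C sketch below creates a child and receives its pidfd from the same `clone()` call, using the glibc wrapper. `spawn_with_pidfd` and `child_main` are hypothetical names, the child only exits with a fixed code instead of calling `exec`, and on kernels without `CLONE_PIDFD` support the descriptor is expected to stay at -1. It is not a drop-in replacement for `_subprocess_fork_exec`.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000 /* value from the Linux UAPI headers */
#endif

#define CHILD_STACK_SIZE (64 * 1024)

/* Child entry point: the returned value becomes the child's exit status. */
static int child_main(void *arg) {
    return *(int *)arg;
}

/* Hypothetical helper: creates a child and obtains its pidfd atomically.
   Returns the child's PID (or -1) and stores the pidfd in *pidfd_out,
   which stays -1 if the kernel did not provide one. */
static pid_t spawn_with_pidfd(int *exit_code, int *pidfd_out) {
    *pidfd_out = -1;
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL) return -1;
    /* The stack grows downward on common architectures, so pass its top.
       With CLONE_PIDFD, clone() writes the pidfd through the parent_tid slot. */
    pid_t pid = clone(child_main, stack + CHILD_STACK_SIZE,
                      CLONE_PIDFD | SIGCHLD, exit_code, pidfd_out);
    /* Without CLONE_VM the child runs on its own copy of this memory,
       so the parent's copy of the stack can be released right away. */
    free(stack);
    return pid;
}
```

Because the descriptor is produced by the call that creates the process, there is no window between process creation and `pidfd_open` in which the PID could be reaped and reused.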
That may be true, but the implementation Charles is proposing here is more defensive against other parts of the program misbehaving, which seems like a good thing. Including scenarios where zombies are being reaped correctly throughout the entire program, but maybe the body of one particular |
// - musl 1.1.24 (October 2019) | ||
// - FreeBSD 13.1 (May 2022) | ||
// - Android 14 (API level 34) (October 2023) | ||
return posix_spawn_file_actions_addchdir_np(file_actions, path); |
This will emit a deprecation warning as of *OS 26 since the standardized version has been added.
// MARK: - ProcessIdentifier | ||
|
||
/// A platform independent identifier for a Subprocess. | ||
public struct ProcessIdentifier: Sendable, Hashable { |
Could you make this type move-only and incorporate the `close()` operation into `deinit`?
Closing might involve closing FDs right? Which might be an asynchronous and throwing operation.
That `close()` can fail at all is an unfortunate weird corner of POSIX that I personally tend to ignore, because a failure in `close()` other than `EINTR`/`EAGAIN` is basically non-recoverable. What are you even supposed to do? What can a user do to fix the problem? Generally nothing.
So I just about always just drop a `close()` failure on the floor. </hottake>
(As for asynchronous, it's a blocking operation in userland but it can't fail to make forward progress in this case because there's no network I/O involved unless we're doing something really wonky.)
We've had this discussion in another place (here?). We ended up not calling `close`, but asserting that `close` has already been called in `deinit`.
But it does seem like a design people are going to reach for repeatedly. I wonder if we can put our thought process down somewhere.
> That `close()` can fail at all is an unfortunate weird corner of POSIX that I personally tend to ignore, because a failure in `close()` other than `EINTR`/`EAGAIN` is basically non-recoverable. What are you even supposed to do? What can a user do to fix the problem? Generally nothing.
I agree that it is a weirdness; nevertheless, we need to handle it and most likely surface it to the user. We shouldn't just swallow those errors.
> (As for asynchronous, it's a blocking operation in userland but it can't fail to make forward progress in this case because there's no network I/O involved unless we're doing something really wonky.)
This is not entirely true. If you are using io_uring you can asynchronously listen for the subprocess termination with `pidfd` and `signalfd` via io_uring. We need to account for changes in the underlying I/O system where closing can become asynchronous; otherwise we will lock ourselves in a corner API-wise.
The only pattern that keeps us flexible is a `with-style` based approach.
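To make the shape of that pattern concrete, here is a small C analogue, offered only as a sketch: the discussion is about a Swift API, and `with_fd`, `fd_body_fn`, and `write_marker` are hypothetical names. The scope function owns the descriptor, closes it exactly once after the body returns, and reports a `close()` failure to the caller instead of dropping it.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* The caller's work, run while the descriptor is guaranteed to be open. */
typedef int (*fd_body_fn)(int fd, void *ctx);

/* Hypothetical scope function: runs body(fd, ctx), then closes fd once.
   Returns the body's result; a close() failure is surfaced through
   *close_error (0 on success, otherwise the errno value). */
static int with_fd(int fd, fd_body_fn body, void *ctx, int *close_error) {
    int result = body(fd, ctx);
    *close_error = (close(fd) == 0) ? 0 : errno;
    return result;
}

/* Example body: writes a single byte to the descriptor. */
static int write_marker(int fd, void *ctx) {
    (void)ctx;
    return write(fd, "x", 1) == 1 ? 0 : -1;
}
```

Since the close happens inside the scope function rather than in a destructor, that step could later become throwing or asynchronous without changing what the body looks like.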
The current process monitoring code for Linux has a flaw that makes it susceptible to infinite hangs under specific conditions:
`Subprocess`.
This is because we currently rely on running `waitid()` with `P_ALL` and `WNOWAIT` in an infinite loop to detect possible child process state transitions. However, we don’t reap the child process (by specifying `WNOWAIT`) unless we (Subprocess) actually spawned the process.
Here’s a simplified pseudo-code to illustrate the issue:
With this setup, if the process table contains zombie children that are never reaped, `waitid(P_ALL)` will repeatedly return the same (non-Subprocess-spawned) PID with every call, causing an infinite loop.
You can observe this behavior with the following sample code:
After running this example, you’ll notice that the parent process seems to be stuck, and the “cat finished” message is never printed. This is because the parent process never calls `waitid` on the `echo` process, leaving it in the process table. Consequently, the monitor thread runs in an infinite loop.
Some may argue that this is not a bug in `Subprocess` but rather an issue with the parent code, since the POSIX standard mandates that a process spawning a child process must reap it via `waitid`. However, Subprocess should still not hang due to someone else’s bug.
To resolve this issue, switch to a Linux-specific process monitoring method by creating and observing the process file descriptor (pidfd) using epoll. This approach is similar to the epoll implementation introduced in #117, with the only difference being that we’re polling pidfd instead of a regular file descriptor.
As part of this change, I also unified the “process handle” design to make it easier to expose process handles to clients later (after the 1.0 release, as requested by #101). We chose to use `ProcessIdentifier` to host platform-specific process file descriptors and process handles because it perfectly aligns with the original use case. To ensure flexibility, we opted for a concrete `ProcessIdentifier` type instead of just a number, allowing us to add more information if necessary.