JENKINS-65911

Jenkins shuts down instead of restarting on Mac M1

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Component: core
    • Affects Versions: 2.297, 2.289.2

    Description

      This issue appeared after migrating our Jenkins instance from a Macbook to a new Mac mini (M1). I could also reproduce it with a clean install on the M1 machine.

      It has been present in the last few versions (at least).

      Steps to reproduce:

      1. Go to http://[JENKINS_URL]/updateCenter/
      2. Check "Restart Jenkins when installation is complete and no jobs are running" (regardless of whether there's an upgrade or not)
      3. Observe what happens in the terminal: after 10 seconds, the Jenkins process is killed instead of restarted.

      This is what the log looks like:

      2021-06-17 08:27:47.284+0000 [id=1066]	INFO	jenkins.model.Jenkins$20#run: Restart in 10 seconds
      2021-06-17 08:27:57.287+0000 [id=1066]	INFO	jenkins.model.Jenkins$20#run: Restarting VM as requested by admin
      2021-06-17 08:27:57.287+0000 [id=1066]	INFO	jenkins.model.Jenkins#cleanUp: Stopping Jenkins
      2021-06-17 08:27:57.339+0000 [id=1066]	INFO	jenkins.model.Jenkins$16#onAttained: Started termination
      2021-06-17 08:27:57.370+0000 [id=1066]	INFO	jenkins.model.Jenkins$16#onAttained: Completed termination
      2021-06-17 08:27:57.370+0000 [id=1066]	INFO	jenkins.model.Jenkins#_cleanUpDisconnectComputers: Starting node disconnection
      2021-06-17 08:27:57.379+0000 [id=1066]	INFO	jenkins.model.Jenkins#_cleanUpShutdownPluginManager: Stopping plugin manager
      2021-06-17 08:27:57.440+0000 [id=1066]	INFO	jenkins.model.Jenkins#_cleanUpPersistQueue: Persisting build queue
      2021-06-17 08:27:57.443+0000 [id=1066]	INFO	jenkins.model.Jenkins#_cleanUpAwaitDisconnects: Waiting for node disconnection completion
      2021-06-17 08:27:57.443+0000 [id=1066]	INFO	jenkins.model.Jenkins#cleanUp: Jenkins stopped
      [1]    72539 killed     jenkins
      

      Jenkins runs fine when started again manually (with the "jenkins" command).

          Activity

            transcendd Jonatan added a comment -

            This seems to be fixed in version 2.299, though it's not documented.

            markewaite Mark Waite added a comment -

            It is documented in the 2.297 changelog and in the 2.289.2 changelog.

            timja Tim Jacomb added a comment -

            This was not fixed; that was another issue.

            timja Tim Jacomb added a comment -

            I've been debugging it.

            It fails on this line:
            https://github.com/jenkinsci/jenkins/blob/195dacbf556fdf0a393d5d139df27ffba3a82b11/core/src/main/java/hudson/lifecycle/UnixLifecycle.java#L80

            The file descriptor it fails on changes from run to run, normally in the 820-840 range.
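
            For context, the loop around that line in UnixLifecycle.restart (quoted from the diff further down this thread, lightly reformatted with explanatory comments) is:

            int sz = LIBC.getdtablesize();
            for (int i = 3; i < sz; i++) {                  // skip stdin/stdout/stderr
                int flags = LIBC.fcntl(i, F_GETFD);         // read the FD flags; negative means the FD isn't open
                if (flags < 0) continue;
                LIBC.fcntl(i, F_SETFD, flags | FD_CLOEXEC); // the call the process dies on
            }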

            Wrapping the call in a try/catch doesn't help:

            try { ... } catch (Error e) { ... }  // never reached; the kernel kills the process outright
            

            Not sure if basil would have any ideas on what to try from here =/

            basil Basil Crow added a comment -

            timja Unfortunately I don't have access to a macOS system to do real debugging on, and debugging by telegram isn't very efficient. But I can try to help.

            The logic makes sense at a high level: we are about to exec, so to avoid leaking open file handles we need to set the close-on-exec flag on all open file descriptors. That this is not only failing in the first place but actually crashing the process is quite surprising.

            The most direct form of analysis would be to work our way backwards from the crash. Do you have a stack trace, registers, or any other diagnostic information for the crash? I don't really know where macOS puts these things, but this page seems to have some answers.

            Another piece of interesting information might be which file descriptor we are trying to set the close-on-exec flag on prior to the crash. If you can put a breakpoint at the line you mentioned and run lsof to list the process's open file descriptors, then keep going until it crashes, then tell us what type of file descriptor caused the crash, that might be interesting. E.g. was it a socket or a regular file, if it was a regular file what was its path, etc.

            timja Tim Jacomb added a comment -

            Thanks, I'll add some more debug info. Currently I've just got the ID; there's no stack trace.

            timja Tim Jacomb added a comment -

            Interesting: there was a crash log in 'Crash Reports'.

            Exception Type:        EXC_GUARD
            Exception Codes:       GUARD_TYPE_FD
            

            Looks like some Apple protection on one of the file descriptors.

            Full log at: https://gist.github.com/timja/0167c76924c5e44a09c64a545c87ef8b

            timja Tim Jacomb added a comment -

            The thread always crashes at:

            Thread 64 Crashed:: Java: restart thread
            0   libsystem_kernel.dylib        	0x000000018c105964 __fcntl + 8
            1   libsystem_kernel.dylib        	0x000000018c105930 fcntl + 88
            
            timja Tim Jacomb added a comment -

            Apple docs: https://developer.apple.com/documentation/xcode/understanding-the-exception-types-in-a-crash-report
            timja Tim Jacomb added a comment -

            This looks to be the correct fix; there's no need to set `FD_CLOEXEC` on a file descriptor that already has it set:

            diff --git a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
            index 391998cdff..97288d5fad 100644
            --- a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
            +++ b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
            @@ -76,7 +76,9 @@ public class UnixLifecycle extends Lifecycle {
                     int sz = LIBC.getdtablesize();
                     for(int i=3; i<sz; i++) {
                         int flags = LIBC.fcntl(i, F_GETFD);
            -            if(flags<0) continue;
            +            if (flags < 0 || flags == FD_CLOEXEC) {
            +                continue;
            +            }
                         LIBC.fcntl(i, F_SETFD,flags| FD_CLOEXEC);
                     }
            

            It gets further after the above; it now fails with:

            2021-10-04 21:48:05.292+0000 [id=1]	INFO	winstone.Logger#logInternal: Jetty shutdown successfully
            java.io.IOException: Failed to start Jetty
            	at winstone.Launcher.<init>(Launcher.java:194)
            	at winstone.Launcher.main(Launcher.java:369)
            	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
            	at Main._main(Main.java:311)
            	at Main.main(Main.java:114)
            Caused by: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
            	at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
            	at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
            	at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
            	at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
            	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
            	at org.eclipse.jetty.server.Server.doStart(Server.java:401)
            	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
            	at winstone.Launcher.<init>(Launcher.java:192)
            	... 7 more
            Caused by: java.net.BindException: Address already in use
            	at java.base/sun.nio.ch.Net.bind0(Native Method)
            	at java.base/sun.nio.ch.Net.bind(Net.java:455)
            	at java.base/sun.nio.ch.Net.bind(Net.java:447)
            	at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
            	at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
            	at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
            	... 14 more
            2021-10-04 21:48:05.292+0000 [id=1]	SEVERE	winstone.Logger#logInternal: Container startup failed
            java.net.BindException: Address already in use
            	at java.base/sun.nio.ch.Net.bind0(Native Method)
            	at java.base/sun.nio.ch.Net.bind(Net.java:455)
            	at java.base/sun.nio.ch.Net.bind(Net.java:447)
            	at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
            	at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
            	at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
            Caused: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
            	at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
            	at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
            	at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
            	at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
            	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
            	at org.eclipse.jetty.server.Server.doStart(Server.java:401)
            	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
            	at winstone.Launcher.<init>(Launcher.java:192)
            Caused: java.io.IOException: Failed to start Jetty
            	at winstone.Launcher.<init>(Launcher.java:194)
            	at winstone.Launcher.main(Launcher.java:369)
            	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
            	at Main._main(Main.java:311)
            	at Main.main(Main.java:114)
            
            timja Tim Jacomb added a comment -

            Any thoughts on where to check for that?

            This might be where it's supposed to be shut down: https://github.com/jenkinsci/winstone/blob/825889d1a93a49feabca297308fff7ecafba70cd/src/main/java/winstone/ShutdownHook.java#L15

            basil Basil Crow added a comment - - edited

            timja Good insight! I think you're almost there, but your fix isn't quite right. The flags value is a bit field, so you can't check whether FD_CLOEXEC is set with flags == FD_CLOEXEC – that check is probably missing a lot of descriptors, including open sockets, which is probably why you then get the "Address already in use" error. You want to use bitwise arithmetic to check whether the FD_CLOEXEC bit is set in the flags while ignoring the other bits that you don't care about.
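
            A minimal sketch of that check, using the same JNA constants UnixLifecycle already uses (only the bitwise test differs from the current code):

            int flags = LIBC.fcntl(i, F_GETFD);
            if (flags < 0 || (flags & FD_CLOEXEC) != 0) {
                // FD isn't open, or the close-on-exec bit is already set; leave it alone
                continue;
            }
            LIBC.fcntl(i, F_SETFD, flags | FD_CLOEXEC);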

            basil Basil Crow added a comment -

            timja Come to think of it, you might not be almost there after all, because I see no evidence that whatever file descriptor macOS is freaking out about has the FD_CLOEXEC already set on it – that seems like an assumption on your part based on a faulty understanding of how the bit is represented in the flags. I don't see any reason why your theory would be false but I also don't see any reason why it would be true. So back to the drawing board...

            What is clear to me is this is some weird Apple thing where certain file descriptors are "guarded resources" where "the operating system marks a file descriptor as guarded so that normal file descriptor APIs can’t modify them." So when we try to set FD_CLOEXEC on it, the OS "helpfully" crashes the process rather than trusting that we know what we're doing. I had been initially worried that we were hitting some JNA issue, but this seems unlikely. We're bumping into some macOS restriction. My main question is: "what file is so special that macOS forbids us from closing it?" Can you run lsof -p ${PID} and please let me know what file handle is so sacred that Apple does not allow developers to close it.

            timja Tim Jacomb added a comment - - edited

            It's not very informative...

            java    37417 timja  844   NPOLICY
            

            Only reference I found was someone asking what it is:
            https://developer.apple.com/forums/thread/659475

            The evidence I saw was from logging out the flags returned by 'F_GETFD' and seeing that the failing descriptor had '1' returned, which is FD_CLOEXEC from what I understand; but I haven't had to do bitwise arithmetic in a long time, so that's likely a very faulty assumption.

            (There was also one other descriptor with flag 1, which was the preceding FD.)

            basil Basil Crow added a comment -

            Maybe you'd have more luck figuring out what this weird file descriptor is with DTrace. sudo opensnoop -p ${PID} will log every file descriptor as it's opened, giving you a bunch of information about it like its file descriptor and its path. It's based on DTrace, so you'll have to disable System Integrity Protection (SIP) in order to use it. (It's this kind of thing that drove me nuts which is why I eventually quit using macOS. Now I run Linux and can do system-level tracing to my heart's content.)

            basil Basil Crow added a comment -

            timja You're absolutely sure that the file descriptor we're crashing on while trying to set FD_CLOEXEC is of type NPOLICY? Check out this page:

            A major addition to Darwin's network stack are Network Extension Control Policies (NECPs), added to Darwin 14. NECPs are described in bsd/net/necp.c as "..high-level policy sessions, which are ingested into low-level kernel policies that control and tag traffic at the application, socket, and IP layers". In other words, NECPs enable user mode programs to control the kernel's network routing and scheduling decisions. Naturally, these also allow QoS.

            […]

            Starting with Darwin 16, just about every network-enabled process in the system uses NECPs, oftentimes without the developer even knowing what they are. This is because libnetwork.dylib calls necp_open() (#501) as part of its initialization (specifically, from nw_endpoint_handler_start). This creates a necp client, a file descriptor of type NPOLICY, which is readily visible in the output of lsof(1) or procexp ..fds. The descriptor does not offer the traditional operations (read(2)/write(2)/ioctl(2)), and only supports select(2), or use in a kqueue.

            […]

            Darwin 17 extends the idea of NECP client descriptors, and adds the NECP session (also an NPOLICY file descriptor*). These descriptors are created with necp_session_open (#522), and support just the close(2) operation (which deletes the associated session).

            So it looks like an NPOLICY file descriptor could be either an NECP client or an NECP session. Now, if trying to set FD_CLOEXEC on one of these is triggering a EXC_GUARD/GUARD_TYPE_FD "guarded resource protection", then my next question would be: who created the NECP client/session, and why is it guarded? More importantly, how can we get it to release the resource before we exec?

            timja Tim Jacomb added a comment -

            > You're absolutely sure that the file descriptor we're crashing trying to set FD_CLOEXEC on is of type NPOLICY?
            "Sure" is a strong word here; this sort of thing is quite out of my comfort zone, but yes, it appears so.

            These are the related entries from the last restart I just did:

            java    40095 timja  840u     IPv6 0x7a007b86cdf33161       0t0                 TCP localhost:http-alt->localhost:58463 (ESTABLISHED)
            java    40095 timja  841u     IPv6 0x7a007b86c2092461       0t0                 TCP *:ndmp (LISTEN)
            java    40095 timja  842u    systm 0x7a007b936831da81       0t0                     [ctl com.apple.netsrc id 6 unit 26]
            java    40095 timja  843u     unix 0x7a007b86a4514369       0t0                     ->0x7a007b86a450efd1
            java    40095 timja  844   NPOLICY
            java    40095 timja  845u     IPv6 0x7a007b86c208eb01       0t0                 TCP 192.168.4.151:58480->20.190.160.136:https (ESTABLISHED)
            java    40095 timja  846u     IPv6 0x7a007b86c208f7c1       0t0                 TCP 192.168.4.151:58484->20.190.160.136:https (ESTABLISHED)
            java    40095 timja  847u     IPv6 0x7a007b86c2091141       0t0                 TCP 192.168.4.151:58488->40.126.41.96:https (ESTABLISHED)
            java    40095 timja  848u     IPv6 0x7a007b86cdf36461       0t0                 TCP localhost:http-alt->localhost:58490 (ESTABLISHED)
            java    40095 timja  849u     IPv6 0x7a007b86cdf337c1       0t0                 TCP localhost:http-alt->localhost:58491 (ESTABLISHED)
            java    40095 timja  850u     IPv6 0x7a007b86cdf34481       0t0                 TCP localhost:http-alt->localhost:58493 (ESTABLISHED)
            java    40095 timja  851u     IPv6 0x7a007b86cdaebe21       0t0                 TCP localhost:http-alt->localhost:58494 (ESTABLISHED)
            java    40095 timja  852u     IPv6 0x7a007b86cdaed7a1       0t0                 TCP localhost:http-alt->localhost:58495 (ESTABLISHED)
            

            and the relevant part of the crash log:

            Exception Type:        EXC_GUARD
            Exception Codes:       0x400000040000034c, 0xe20c7f41b3616e04
            Exception Subtype:     GUARD_TYPE_FD, id=0xe20c7f41b3616e04, fd=844, flavor=0x00000004 (NOCLOEXEC)
            
            timja Tim Jacomb added a comment -

            This is the result from opensnoop: https://gist.github.com/timja/65e67174af7a3a3f2e721bbeef2a9c5f

            On this run the FD that caused the crash was 847, and it was of type 'NPOLICY'.

            timja Tim Jacomb added a comment -

            Interesting: I found that the file descriptor is only created after logging in to Jenkins.

            If you disable auth and restart, the process doesn't get killed. After some more digging, it seems the azure-ad plugin is creating that file descriptor somehow; the built-in security realm doesn't have the problem.

            But the restart doesn't work either way; it still hits:

            2021-10-05 07:26:21.441+0000 [id=1]	SEVERE	winstone.Logger#logInternal: Container startup failed
            java.net.BindException: Address already in use
            	at java.base/sun.nio.ch.Net.bind0(Native Method)
            	at java.base/sun.nio.ch.Net.bind(Net.java:455)
            	at java.base/sun.nio.ch.Net.bind(Net.java:447)
            	at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
            	at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
            	at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
            Caused: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
            	at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
            	at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
            	at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
            	at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
            	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
            	at org.eclipse.jetty.server.Server.doStart(Server.java:401)
            	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
            	at winstone.Launcher.<init>(Launcher.java:192)
            Caused: java.io.IOException: Failed to start Jetty
            	at winstone.Launcher.<init>(Launcher.java:194)
            	at winstone.Launcher.main(Launcher.java:369)
            	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
            	at Main._main(Main.java:311)
            	at Main.main(Main.java:114)
            
            basil Basil Crow added a comment -

            timja Setting aside the Azure AD issue, it is interesting that with the built-in security realm the restart fails with java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080. I looked into how this works (correctly) on my Linux system. Before restarting Jenkins, here are the open TCP file descriptors:

            $ sudo lsof -P -p 3916237 | grep TCP
            java    3916237 basil    5u     IPv6            5821933       0t0     TCP localhost:8080->localhost:49530 (ESTABLISHED)
            java    3916237 basil    6u     IPv4            5807038       0t0     TCP localhost:5005->localhost:48452 (ESTABLISHED)
            java    3916237 basil  120u     IPv6            5812861       0t0     TCP *:8080 (LISTEN)
            

            The call to LIBC.fcntl(i, F_SETFD, flags | FD_CLOEXEC) marks them as CLOEXEC, then the call to LIBC.execvp execs to self, during which the file descriptors are closed and the socket released. During startup there are no open TCP file descriptors, so Jetty has no problem binding to the port.

            At what point is this going off the rails on M1?

            Note that winstone.Launcher#shutdown isn't executed in this process, since this isn't a "regular" JVM shutdown where the shutdown hooks are executed. Whatever Winstone/Jetty might be doing to gracefully shut down, I can't see how closing all file descriptors and calling exec would be less than that - there's really not much left to the process once its program text and open file handles are gone. Unless there's some weird Apple resource that winstone.Launcher#shutdown releases which isn't a file descriptor or in the process's address space - which would be quite strange.
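
            A minimal standalone sketch of that point (hypothetical demo code, not part of Jenkins; assumes JNA is on the classpath): a registered shutdown hook runs on a normal exit, but nothing runs after a successful execvp, because exec replaces the process image without unwinding the JVM.

            import com.sun.jna.Library;
            import com.sun.jna.Native;
            import com.sun.jna.StringArray;

            public class ExecVsExit {
                interface CLib extends Library {
                    CLib INSTANCE = Native.load("c", CLib.class);
                    int execvp(String file, StringArray argv);
                }

                public static void main(String[] args) {
                    // Runs on System.exit()/normal return, but never after a successful exec.
                    Runtime.getRuntime().addShutdownHook(
                            new Thread(() -> System.out.println("shutdown hook ran")));

                    if (args.length > 0 && args[0].equals("exec")) {
                        // Replace this JVM with /bin/true; no Java code (hooks included) runs after this.
                        CLib.INSTANCE.execvp("/bin/true", new StringArray(new String[] {"true"}));
                    }
                    System.exit(0);
                }
            }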

            basil Basil Crow added a comment -

            I can reproduce the problem on an M1 system with a stock Jenkins installation. Tracing the process with dtruss(1) I cannot see any calls to necp_session_open, but I see a call to necp_open:

              927/0x364b:      3733      12      8 necp_open(0x0, 0x0, 0x0)          = 3 0
            
                          0x18e4d1b78
                          0x1926e07ec
                          0x1926df85c
                          0x18f1429c8
                          0x102ee0450
                          0x102ede014
                          0x102eddf68
                          0x102eddea0
                          0x18e25c1d8
                          0x18e25abf8
                          0x18e34ac6c
                          0x18e366f68
                          0x18e352208
                          0x18e367cb8
                          0x18e35d708
                          0x18e505304
                          0x18e504018
            

            For some reason DTrace's ustack() isn't resolving symbols, but I got the symbols from lldb(1):

            target create /opt/homebrew/opt/openjdk@11/bin/java
            process launch --stop-at-entry -- -jar war/target/jenkins.war
            process handle -p true -s false SIGBUS
            process handle -p true -s false SIGSEGV
            process handle -p true -s false SIGSTOP
            break set -n necp_open -s libsystem_kernel.dylib
            continue
            bt
            * thread #53, name = 'Java: GrapeHack.hack', stop reason = breakpoint 1.1
              * frame #0: 0x000000018e4d1b70 libsystem_kernel.dylib`necp_open
                frame #1: 0x0000000192715b2c libnetwork.dylib`nw_path_shared_necp_fd + 220
                frame #2: 0x00000001926e07ec libnetwork.dylib`nw_path_evaluator_evaluate + 540
                frame #3: 0x00000001926df85c libnetwork.dylib`nw_path_create_evaluator_for_endpoint + 108
                frame #4: 0x0000000192e1e8cc libnetwork.dylib`nw_nat64_v4_address_requires_synthesis + 252
                frame #5: 0x000000018e5298e4 libsystem_info.dylib`si_addrinfo + 2308
                frame #6: 0x000000018e528f34 libsystem_info.dylib`getaddrinfo + 168
                frame #7: 0x00000001021702c4 libnet.dylib`Java_java_net_Inet6AddressImpl_lookupAllHostAddr + 144
                frame #8: 0x000000011f90bbfc
                frame #9: 0x000000011f906fc8
                frame #10: 0x000000011f906fc8
                frame #11: 0x000000011f906ee0
                frame #12: 0x000000011f906fc8
                frame #13: 0x000000011f906ee0
                frame #14: 0x000000011f906ee0
                frame #15: 0x000000011f906ee0
                frame #16: 0x000000011f90719c
                frame #17: 0x000000011f90719c
                frame #18: 0x000000011f90719c
                frame #19: 0x000000011f900144
                frame #20: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
                frame #21: 0x00000001028d4898 libjvm.dylib`invoke(InstanceKlass*, methodHandle const&, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 1796
                frame #22: 0x00000001028d4d8c libjvm.dylib`Reflection::invoke_constructor(oopDesc*, objArrayHandle, Thread*) + 392
                frame #23: 0x00000001026d0554 libjvm.dylib`JVM_NewInstanceFromConstructor + 232
                frame #24: 0x00000001274e1360
                frame #25: 0x00000001203c8c84
                frame #26: 0x000000011f906ee0
                frame #27: 0x000000011f906ee0
                frame #28: 0x000000011f906ee0
                frame #29: 0x000000011f90719c
                frame #30: 0x000000011f9071f4
                frame #31: 0x000000011f9071f4
                frame #32: 0x00000001200a6434
                frame #33: 0x000000011f906ab0
                frame #34: 0x000000011f9067e4
                frame #35: 0x000000011f90678c
                frame #36: 0x000000011f9071f4
                frame #37: 0x000000011f90719c
                frame #38: 0x000000011f90719c
                frame #39: 0x000000011f90719c
                frame #40: 0x000000011f90719c
                frame #41: 0x000000011f90719c
                frame #42: 0x000000011f90719c
                frame #43: 0x000000011f90719c
                frame #44: 0x000000011f90719c
                frame #45: 0x000000011f906fc8
                frame #46: 0x000000011f906ee0
                frame #47: 0x000000011f906ee0
                frame #48: 0x000000011f906fc8
                frame #49: 0x000000011f900144
                frame #50: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
                frame #51: 0x00000001028d4898 libjvm.dylib`invoke(InstanceKlass*, methodHandle const&, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 1796
                frame #52: 0x00000001028d4d8c libjvm.dylib`Reflection::invoke_constructor(oopDesc*, objArrayHandle, Thread*) + 392
                frame #53: 0x00000001026d0554 libjvm.dylib`JVM_NewInstanceFromConstructor + 232
                frame #54: 0x00000001274e1360
                frame #55: 0x00000001203c8c84
                frame #56: 0x000000011f906ee0
                frame #57: 0x000000011f900144
                frame #58: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
                frame #59: 0x00000001028d4898 libjvm.dylib`invoke(InstanceKlass*, methodHandle const&, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 1796
                frame #60: 0x00000001028d4138 libjvm.dylib`Reflection::invoke_method(oopDesc*, Handle, objArrayHandle, Thread*) + 276
                frame #61: 0x00000001026d038c libjvm.dylib`JVM_InvokeMethod + 508
                frame #62: 0x00000001276a1270
                frame #63: 0x000000012097c16c
                frame #64: 0x000000011f90719c
                frame #65: 0x000000011f9071f4
                frame #66: 0x000000011f90719c
                frame #67: 0x000000011f90719c
                frame #68: 0x000000011f9071f4
                frame #69: 0x000000011f9071f4
                frame #70: 0x000000011f9071f4
                frame #71: 0x000000011f90719c
                frame #72: 0x000000011f9071f4
                frame #73: 0x000000011f900144
                frame #74: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
                frame #75: 0x00000001026468d0 libjvm.dylib`JavaCalls::call_virtual(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, Thread*) + 236
                frame #76: 0x0000000102646998 libjvm.dylib`JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*) + 100
                frame #77: 0x00000001026cd0dc libjvm.dylib`thread_entry(JavaThread*, Thread*) + 120
                frame #78: 0x0000000102973134 libjvm.dylib`JavaThread::thread_main_inner() + 124
                frame #79: 0x0000000102972f64 libjvm.dylib`JavaThread::run() + 376
                frame #80: 0x0000000102970c88 libjvm.dylib`Thread::call_run() + 120
                frame #81: 0x0000000102874b94 libjvm.dylib`thread_native_entry(Thread*) + 316
                frame #82: 0x000000018e509240 libsystem_pthread.dylib`_pthread_start + 148
            

            So some thread named GrapeHack.hack is calling Java_java_net_Inet6AddressImpl_lookupAllHostAddr, which eventually calls necp_open, which creates a file descriptor of type NPOLICY, which we then try to set FD_CLOEXEC on as part of the exec, and that fcntl call is what the process crashes on.
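
            Any Java-level host name lookup goes down that native path; an illustrative trigger (the host name is just an example, not taken from the trace) would be:

            import java.net.InetAddress;
            import java.net.UnknownHostException;

            public class LookupDemo {
                public static void main(String[] args) throws UnknownHostException {
                    // getAllByName() -> Inet6AddressImpl.lookupAllHostAddr (native) -> getaddrinfo(3)
                    //   -> nw_nat64_v4_address_requires_synthesis -> necp_open on macOS,
                    // which is what leaves the guarded NPOLICY descriptor behind.
                    for (InetAddress addr : InetAddress.getAllByName("updates.jenkins.io")) {
                        System.out.println(addr);
                    }
                }
            }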

            I looked into disabling IPv6 in Java and tried running Java with -Djava.net.preferIPv4Stack=true. That eliminates the calls to necp_open and the NPOLICY FD:

            $ sudo lsof -p `pgrep java` | grep NPOLICY
            $
            

            That eliminates the crash, but Jenkins still doesn't restart properly. Now I'm getting:

            2022-02-22 06:38:34.228+0000 [id=1]	SEVERE	winstone.Logger#logInternal: Container startup failed
            java.net.BindException: Address already in use
            	at java.base/sun.nio.ch.Net.bind0(Native Method)
            	at java.base/sun.nio.ch.Net.bind(Net.java:455)
            	at java.base/sun.nio.ch.Net.bind(Net.java:447)
            	at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
            	at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
            	at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
            Caused: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
            	at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
            	at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
            	at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
            	at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
            	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
            	at org.eclipse.jetty.server.Server.doStart(Server.java:401)
            	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
            	at winstone.Launcher.<init>(Launcher.java:198)
            Caused: java.io.IOException: Failed to start Jetty
            	at winstone.Launcher.<init>(Launcher.java:200)
            	at winstone.Launcher.main(Launcher.java:376)
            	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
            	at Main._main(Main.java:304)
            	at Main.main(Main.java:108)
            

            A mystery for another day.

            basil Basil Crow added a comment -

            More late night debugging with java -Djava.net.preferIPv4Stack=true -jar war/target/jenkins.war... after instrumenting Winstone with

            diff --git a/src/main/java/winstone/Launcher.java b/src/main/java/winstone/Launcher.java
            index 2657c71..2c0b72c 100644
            --- a/src/main/java/winstone/Launcher.java
            +++ b/src/main/java/winstone/Launcher.java
            @@ -195,6 +195,7 @@ public class Launcher implements Runnable {
                         }
             
                         try {
            +                Thread.sleep(10 * 1000L);
                             server.start();
                         } catch (Exception e) {
                             throw new IOException("Failed to start Jetty",e);
            

            I could confirm that after the restart not only was the port still open but all the other files were also still open. It appears as if nothing got closed before the exec. Instrumenting Jenkins with

            diff --git a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
            index ca8f50b9f5..b6d28ea313 100644
            --- a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
            +++ b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
            @@ -82,6 +82,7 @@ public class UnixLifecycle extends Lifecycle {
                     }
             
                     // exec to self
            +        Thread.sleep(10 * 1000L);
                     String exe = args.get(0);
                     LIBC.execvp(exe, new StringArray(args.toArray(new String[0])));
                     throw new IOException("Failed to exec '" + exe + "' " + LIBC.strerror(Native.getLastError()));
            

            and running sudo lsof +f g -p $PID I could confirm that no file descriptor had the CX (close-on-exec) flag. We are simply failing to set the close-on-exec flag.

            Watching the fcntl system calls with dtruss -t fcntl I see:

            fcntl(0x3, 0x1, 0x0)             = 0 0
            fcntl(0x3, 0x2, 0xFFFFFFFFC5EC5B50)              = 0 0
            fcntl(0x4, 0x1, 0x0)             = 0 0
            fcntl(0x4, 0x2, 0x29ECA00)               = 0 0
            fcntl(0x5, 0x1, 0x0)             = 0 0
            fcntl(0x5, 0x2, 0xFFFFFFFFC5EC5AB0)              = 0 0
            fcntl(0x6, 0x1, 0x0)             = 0 0
            fcntl(0x6, 0x2, 0xFFFFFFFFC5EC5AB0)              = 0 0
            fcntl(0x7, 0x1, 0x0)             = 0 0
            fcntl(0x7, 0x2, 0xFFFFFFFFC5EC5AB0)              = 0 0
            fcntl(0x8, 0x1, 0x0)             = 0 0
            fcntl(0x8, 0x2, 0xFFFFFFFFC5EC5AB0)              = 0 0
            fcntl(0x9, 0x1, 0x0)             = 0 0
            fcntl(0x9, 0x2, 0xFFFFFFFFC5EC5AB0)              = 0 0
            fcntl(0xA, 0x1, 0x0)             = 0 0
            fcntl(0xA, 0x2, 0xFFFFFFFFC5EC5AB0)              = 0 0
            

            and on and on. The calls where the second argument is 0x2 correspond to F_SETFD. Most of them pass garbage values such as 0xFFFFFFFFC5EC5AB0 or 0xFFFFFFFFC5EC5B50 as the third argument, including the call for the TCP socket, rather than valid flag bits.

            For comparison I wrote a simple Python program:

            import fcntl
            import time
            
            with open('foo.txt') as foo:
                time.sleep(10)
                print('about to fcntl')
                flags = fcntl.fcntl(foo.fileno(), fcntl.F_GETFD)
                fcntl.fcntl(foo.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
                print('finished fcntl')
                time.sleep(10)
            

            I confirmed that after the F_SETFD the file descriptor had CX set in lsof +f g as expected. The dtruss -t fcntl output here was:

            fcntl(0x3, 0x1, 0x0)		 = 0 0
            fcntl(0x3, 0x2, 0x1)		 = 0 0
            

            So Jenkins is clearly in the wrong here, calling F_SETFD with 0xFFFFFFFFC5EC5AB0 as the flags argument. It should be 0x1.

            Funnily enough I got this working very easily with JNR (which I am trying to get rid of in core...):

            diff --git a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
            index ca8f50b9f5..a68bc4ef7b 100644
            --- a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
            +++ b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
            @@ -26,18 +26,19 @@ package hudson.lifecycle;
             
             import static hudson.util.jna.GNUCLibrary.FD_CLOEXEC;
             import static hudson.util.jna.GNUCLibrary.F_GETFD;
            -import static hudson.util.jna.GNUCLibrary.F_SETFD;
             import static hudson.util.jna.GNUCLibrary.LIBC;
             
             import com.sun.jna.Native;
             import com.sun.jna.StringArray;
             import hudson.Platform;
            +import hudson.os.PosixAPI;
             import java.io.IOException;
             import java.util.List;
             import java.util.logging.Level;
             import java.util.logging.Logger;
             import jenkins.model.Jenkins;
             import jenkins.util.JavaVMArguments;
            +import jnr.constants.platform.Fcntl;
             
             /**
              * {@link Lifecycle} implementation when Hudson runs on the embedded
            @@ -78,7 +79,7 @@ public class UnixLifecycle extends Lifecycle {
                     for (int i = 3; i < sz; i++) {
                         int flags = LIBC.fcntl(i, F_GETFD);
                         if (flags < 0) continue;
            -            LIBC.fcntl(i, F_SETFD, flags | FD_CLOEXEC);
            +            PosixAPI.jnr().fcntlInt(i, Fcntl.F_SETFD, flags | FD_CLOEXEC);
                     }
             
                     // exec to self
            

            With that, java -Djava.net.preferIPv4Stack=true -jar war/target/jenkins.war is fully working and the restart succeeds.

            To summarize, there are two separate bugs here:

            1. When running without -Djava.net.preferIPv4Stack=true, calls to Java_java_net_Inet6AddressImpl_lookupAllHostAddr call necp_open, which creates a file descriptor with NPOLICY, and if we then set FD_CLOEXEC on that file descriptor the process crashes. I am not sure offhand what the best way to deal with this is. Perhaps we should just skip those file descriptors? But we would need to make sure they don't cause problems after exec'ing. This needs more investigation. In any case, -Djava.net.preferIPv4Stack=true works around the problem.
            2. Our call to fcntl(fd, F_SETFD, ...) passes a bogus flags argument. The JNR change above shows one way to fix this. I'm sure this can also be done with JNA; I am just too tired right now to come up with the JNA version. (A rough sketch of a possible JNA approach follows below.)
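
            For the record, a JNA-only version might look roughly like the sketch below. It declares its own fixed-arity fcntl binding so that the flags argument is marshalled as a plain C int rather than reusing GNUCLibrary; whether GNUCLibrary itself can be adjusted the same way is left open, and the class, interface, and constant names here are illustrative, not the actual core code.

            import com.sun.jna.Library;
            import com.sun.jna.Native;

            // Hedged sketch: a local, fixed-arity binding so the third argument of
            // fcntl(2) is passed as a plain C int. Constants match macOS/Linux <fcntl.h>.
            public class CloexecSketch {
                public interface FcntlLib extends Library {
                    FcntlLib INSTANCE = Native.load("c", FcntlLib.class);

                    int fcntl(int fd, int cmd, int arg);
                }

                private static final int F_GETFD = 1;
                private static final int F_SETFD = 2;
                private static final int FD_CLOEXEC = 1;

                /** Mark every descriptor from 3 up to (but excluding) maxFd close-on-exec. */
                static void closeOnExec(int maxFd) {
                    for (int fd = 3; fd < maxFd; fd++) {
                        int flags = FcntlLib.INSTANCE.fcntl(fd, F_GETFD, 0);
                        if (flags < 0) {
                            continue; // not an open descriptor
                        }
                        FcntlLib.INSTANCE.fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
                    }
                }

                public static void main(String[] args) {
                    closeOnExec(1024);
                }
            }

            Traced with dtruss -t fcntl, a run of this sketch should show a small value like 0x1 as the third argument of every F_SETFD call, matching the Python comparison above.
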
            basil Basil Crow added a comment -

            Of note is that fcntl is a variadic function whose flags argument is an unsigned 64-bit integer. There is probably something wrong with our JNA signatures. Ironically, when I do a Google search for "fcntl jna", all I can find is the (incorrect) code in Jenkins.
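
            To check that hypothesis, one option would be a tiny standalone harness that issues the same F_GETFD/F_SETFD pair through JNA and is traced with dtruss -t fcntl, the same way the Python comparison above was. A rough sketch with a locally declared binding follows; swapping in hudson.util.jna.GNUCLibrary.LIBC for the interface below would show whether core's mapping marshals the flags differently.

            import com.sun.jna.Library;
            import com.sun.jna.Native;

            // Minimal harness: set FD_CLOEXEC on stdin, with pauses long enough to
            // attach `sudo dtruss -t fcntl -p <pid>` and to inspect `lsof +f g -p <pid>`.
            public class FcntlTrace {
                public interface CLib extends Library {
                    CLib INSTANCE = Native.load("c", CLib.class);

                    int fcntl(int fd, int cmd, int arg);
                }

                private static final int F_GETFD = 1;
                private static final int F_SETFD = 2;
                private static final int FD_CLOEXEC = 1;

                public static void main(String[] args) throws InterruptedException {
                    Thread.sleep(10 * 1000L); // time to attach dtruss
                    int flags = CLib.INSTANCE.fcntl(0, F_GETFD, 0);
                    System.out.println("F_GETFD returned " + flags);
                    int rc = CLib.INSTANCE.fcntl(0, F_SETFD, flags | FD_CLOEXEC);
                    System.out.println("F_SETFD returned " + rc);
                    Thread.sleep(10 * 1000L); // time to check lsof for the CX flag
                }
            }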


            People

              Assignee: Unassigned
              Reporter: transcendd Jonatan
              Votes: 1
              Watchers: 6
