JENKINS-65911

Jenkins shuts down instead of restarting on Mac M1


Details

    • Bug
    • Status: Open (View Workflow)
    • Minor
    • Resolution: Unresolved
    • core
    • 2.297, 2.289.2

    Description

      This issue appeared after migrating our Jenkins instance from a MacBook to a new Mac mini (M1). I could also reproduce it with a clean install on the M1 machine.

      It has been present in the last few versions (at least).

      Steps to reproduce:

      1. Go to http://[JENKINS_URL]/updateCenter/
      2. Check "Restart Jenkins when installation is complete and no jobs are running" (regardless of whether there's an upgrade or not)
      3. Observe what happens in the terminal: after 10 seconds, the Jenkins process is killed instead of restarted.

      This is what the log looks like:

      2021-06-17 08:27:47.284+0000 [id=1066]	INFO	jenkins.model.Jenkins$20#run: Restart in 10 seconds
      2021-06-17 08:27:57.287+0000 [id=1066]	INFO	jenkins.model.Jenkins$20#run: Restarting VM as requested by admin
      2021-06-17 08:27:57.287+0000 [id=1066]	INFO	jenkins.model.Jenkins#cleanUp: Stopping Jenkins
      2021-06-17 08:27:57.339+0000 [id=1066]	INFO	jenkins.model.Jenkins$16#onAttained: Started termination
      2021-06-17 08:27:57.370+0000 [id=1066]	INFO	jenkins.model.Jenkins$16#onAttained: Completed termination
      2021-06-17 08:27:57.370+0000 [id=1066]	INFO	jenkins.model.Jenkins#_cleanUpDisconnectComputers: Starting node disconnection
      2021-06-17 08:27:57.379+0000 [id=1066]	INFO	jenkins.model.Jenkins#_cleanUpShutdownPluginManager: Stopping plugin manager
      2021-06-17 08:27:57.440+0000 [id=1066]	INFO	jenkins.model.Jenkins#_cleanUpPersistQueue: Persisting build queue
      2021-06-17 08:27:57.443+0000 [id=1066]	INFO	jenkins.model.Jenkins#_cleanUpAwaitDisconnects: Waiting for node disconnection completion
      2021-06-17 08:27:57.443+0000 [id=1066]	INFO	jenkins.model.Jenkins#cleanUp: Jenkins stopped
      [1]    72539 killed     jenkins
      

      Jenkins runs fine when manually started again (with the "jenkins" command).

          Activity

            timja Tim Jacomb added a comment -

            Interesting: I found that the file descriptor is only created after logging in to Jenkins.

            If you disable auth and restart, it doesn't get killed. After some more digging, it seems that the azure-ad plugin is creating that file descriptor somehow.
            The built-in security realm doesn't have the problem.

            But the restart doesn't work either way; it still hits:

            2021-10-05 07:26:21.441+0000 [id=1]	SEVERE	winstone.Logger#logInternal: Container startup failed
            java.net.BindException: Address already in use
            	at java.base/sun.nio.ch.Net.bind0(Native Method)
            	at java.base/sun.nio.ch.Net.bind(Net.java:455)
            	at java.base/sun.nio.ch.Net.bind(Net.java:447)
            	at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
            	at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
            	at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
            Caused: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
            	at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
            	at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
            	at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
            	at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
            	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
            	at org.eclipse.jetty.server.Server.doStart(Server.java:401)
            	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
            	at winstone.Launcher.<init>(Launcher.java:192)
            Caused: java.io.IOException: Failed to start Jetty
            	at winstone.Launcher.<init>(Launcher.java:194)
            	at winstone.Launcher.main(Launcher.java:369)
            	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
            	at Main._main(Main.java:311)
            	at Main.main(Main.java:114)
            
            basil Basil Crow added a comment -

            timja Setting aside the Azure AD issue, it is interesting that with the built-in security realm the restart fails with java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080. I looked into how this works (correctly) on my Linux system. Before restarting Jenkins, here are the open TCP file descriptors:

            $ sudo lsof -P -p 3916237 | grep TCP
            java    3916237 basil    5u     IPv6            5821933       0t0     TCP localhost:8080->localhost:49530 (ESTABLISHED)
            java    3916237 basil    6u     IPv4            5807038       0t0     TCP localhost:5005->localhost:48452 (ESTABLISHED)
            java    3916237 basil  120u     IPv6            5812861       0t0     TCP *:8080 (LISTEN)
            

            The call to LIBC.fcntl(i, F_SETFD,flags | FD_CLOEXEC) marks them as CLOEXEC, then the call to LIBC.execvp execs to self, during which the file descriptors are closed and the socket released. During startup there are no open TCP file descriptors, so Jetty has no problem binding to the port.
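The close-on-exec behaviour described above can be observed outside Jenkins. The following is a minimal Python sketch, assuming a POSIX system; the helper name `fd_survives_exec` and the probe script are made up for illustration and are not part of Jenkins or Winstone. It opens a listening socket, sets or clears FD_CLOEXEC on it, and then checks from a freshly exec'd child whether the descriptor is still a socket.

```python
import fcntl
import socket
import subprocess
import sys

# Child-side probe (hypothetical helper): exit 0 if the given fd number
# refers to an open socket in the new process image, 1 otherwise.
PROBE = (
    "import os, stat, sys\n"
    "try:\n"
    "    mode = os.fstat(int(sys.argv[1])).st_mode\n"
    "except OSError:\n"
    "    sys.exit(1)\n"
    "sys.exit(0 if stat.S_ISSOCK(mode) else 1)\n"
)


def fd_survives_exec(close_on_exec: bool) -> bool:
    """Report whether a listening socket is still open after fork+exec."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
        sock.listen(1)
        fd = sock.fileno()

        # Read-modify-write of the descriptor flags, as UnixLifecycle intends.
        flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        if close_on_exec:
            flags |= fcntl.FD_CLOEXEC
        else:
            flags &= ~fcntl.FD_CLOEXEC
        fcntl.fcntl(fd, fcntl.F_SETFD, flags)

        # close_fds=False makes the child obey the per-descriptor flag.
        result = subprocess.run(
            [sys.executable, "-c", PROBE, str(fd)], close_fds=False
        )
        return result.returncode == 0
    finally:
        sock.close()


if __name__ == "__main__":
    print("without FD_CLOEXEC:", fd_survives_exec(False))
    print("with FD_CLOEXEC:   ", fd_survives_exec(True))
```

If the flag is not actually set, the listening socket leaks into the new process image, which is consistent with the "Address already in use" failure seen on restart.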

            At what point is this going off the rails on M1?

            Note that winstone.Launcher#shutdown isn't executed in this process, since this isn't a "regular" JVM shutdown where the shutdown hooks are executed. Whatever Winstone/Jetty might be doing to gracefully shut down, I can't see how closing all file descriptors and calling exec would be less than that - there's really not much left to the process once its program text and open file handles are gone. Unless there's some weird Apple resource that winstone.Launcher#shutdown releases which isn't a file descriptor or in the process's address space - which would be quite strange.

            basil Basil Crow added a comment -

            I can reproduce the problem on an M1 system with a stock Jenkins installation. Tracing the process with dtruss(1) I cannot see any calls to necp_session_open, but I see a call to necp_open:

              927/0x364b:      3733      12      8 necp_open(0x0, 0x0, 0x0)          = 3 0
            
                          0x18e4d1b78
                          0x1926e07ec
                          0x1926df85c
                          0x18f1429c8
                          0x102ee0450
                          0x102ede014
                          0x102eddf68
                          0x102eddea0
                          0x18e25c1d8
                          0x18e25abf8
                          0x18e34ac6c
                          0x18e366f68
                          0x18e352208
                          0x18e367cb8
                          0x18e35d708
                          0x18e505304
                          0x18e504018
            

            For some reason DTrace's ustack() isn't resolving symbols, but I got the symbols from lldb(1):

            target create /opt/homebrew/opt/openjdk@11/bin/java
            process launch --stop-at-entry -- -jar war/target/jenkins.war
            process handle -p true -s false SIGBUS
            process handle -p true -s false SIGSEGV
            process handle -p true -s false SIGSTOP
            break set -n necp_open -s libsystem_kernel.dylib
            continue
            bt
            * thread #53, name = 'Java: GrapeHack.hack', stop reason = breakpoint 1.1
              * frame #0: 0x000000018e4d1b70 libsystem_kernel.dylib`necp_open
                frame #1: 0x0000000192715b2c libnetwork.dylib`nw_path_shared_necp_fd + 220
                frame #2: 0x00000001926e07ec libnetwork.dylib`nw_path_evaluator_evaluate + 540
                frame #3: 0x00000001926df85c libnetwork.dylib`nw_path_create_evaluator_for_endpoint + 108
                frame #4: 0x0000000192e1e8cc libnetwork.dylib`nw_nat64_v4_address_requires_synthesis + 252
                frame #5: 0x000000018e5298e4 libsystem_info.dylib`si_addrinfo + 2308
                frame #6: 0x000000018e528f34 libsystem_info.dylib`getaddrinfo + 168
                frame #7: 0x00000001021702c4 libnet.dylib`Java_java_net_Inet6AddressImpl_lookupAllHostAddr + 144
                frame #8: 0x000000011f90bbfc
                frame #9: 0x000000011f906fc8
                frame #10: 0x000000011f906fc8
                frame #11: 0x000000011f906ee0
                frame #12: 0x000000011f906fc8
                frame #13: 0x000000011f906ee0
                frame #14: 0x000000011f906ee0
                frame #15: 0x000000011f906ee0
                frame #16: 0x000000011f90719c
                frame #17: 0x000000011f90719c
                frame #18: 0x000000011f90719c
                frame #19: 0x000000011f900144
                frame #20: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
                frame #21: 0x00000001028d4898 libjvm.dylib`invoke(InstanceKlass*, methodHandle const&, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 1796
                frame #22: 0x00000001028d4d8c libjvm.dylib`Reflection::invoke_constructor(oopDesc*, objArrayHandle, Thread*) + 392
                frame #23: 0x00000001026d0554 libjvm.dylib`JVM_NewInstanceFromConstructor + 232
                frame #24: 0x00000001274e1360
                frame #25: 0x00000001203c8c84
                frame #26: 0x000000011f906ee0
                frame #27: 0x000000011f906ee0
                frame #28: 0x000000011f906ee0
                frame #29: 0x000000011f90719c
                frame #30: 0x000000011f9071f4
                frame #31: 0x000000011f9071f4
                frame #32: 0x00000001200a6434
                frame #33: 0x000000011f906ab0
                frame #34: 0x000000011f9067e4
                frame #35: 0x000000011f90678c
                frame #36: 0x000000011f9071f4
                frame #37: 0x000000011f90719c
                frame #38: 0x000000011f90719c
                frame #39: 0x000000011f90719c
                frame #40: 0x000000011f90719c
                frame #41: 0x000000011f90719c
                frame #42: 0x000000011f90719c
                frame #43: 0x000000011f90719c
                frame #44: 0x000000011f90719c
                frame #45: 0x000000011f906fc8
                frame #46: 0x000000011f906ee0
                frame #47: 0x000000011f906ee0
                frame #48: 0x000000011f906fc8
                frame #49: 0x000000011f900144
                frame #50: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
                frame #51: 0x00000001028d4898 libjvm.dylib`invoke(InstanceKlass*, methodHandle const&, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 1796
                frame #52: 0x00000001028d4d8c libjvm.dylib`Reflection::invoke_constructor(oopDesc*, objArrayHandle, Thread*) + 392
                frame #53: 0x00000001026d0554 libjvm.dylib`JVM_NewInstanceFromConstructor + 232
                frame #54: 0x00000001274e1360
                frame #55: 0x00000001203c8c84
                frame #56: 0x000000011f906ee0
                frame #57: 0x000000011f900144
                frame #58: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
                frame #59: 0x00000001028d4898 libjvm.dylib`invoke(InstanceKlass*, methodHandle const&, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 1796
                frame #60: 0x00000001028d4138 libjvm.dylib`Reflection::invoke_method(oopDesc*, Handle, objArrayHandle, Thread*) + 276
                frame #61: 0x00000001026d038c libjvm.dylib`JVM_InvokeMethod + 508
                frame #62: 0x00000001276a1270
                frame #63: 0x000000012097c16c
                frame #64: 0x000000011f90719c
                frame #65: 0x000000011f9071f4
                frame #66: 0x000000011f90719c
                frame #67: 0x000000011f90719c
                frame #68: 0x000000011f9071f4
                frame #69: 0x000000011f9071f4
                frame #70: 0x000000011f9071f4
                frame #71: 0x000000011f90719c
                frame #72: 0x000000011f9071f4
                frame #73: 0x000000011f900144
                frame #74: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
                frame #75: 0x00000001026468d0 libjvm.dylib`JavaCalls::call_virtual(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, Thread*) + 236
                frame #76: 0x0000000102646998 libjvm.dylib`JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*) + 100
                frame #77: 0x00000001026cd0dc libjvm.dylib`thread_entry(JavaThread*, Thread*) + 120
                frame #78: 0x0000000102973134 libjvm.dylib`JavaThread::thread_main_inner() + 124
                frame #79: 0x0000000102972f64 libjvm.dylib`JavaThread::run() + 376
                frame #80: 0x0000000102970c88 libjvm.dylib`Thread::call_run() + 120
                frame #81: 0x0000000102874b94 libjvm.dylib`thread_native_entry(Thread*) + 316
                frame #82: 0x000000018e509240 libsystem_pthread.dylib`_pthread_start + 148
            

            So some thread named GrapeHack.hack is calling Java_java_net_Inet6AddressImpl_lookupAllHostAddr, which eventually calls necp_open, which creates a file descriptor with NPOLICY, which we then set FD_CLOEXEC on, which we then crash on when exec'ing.

            I looked into disabling IPv6 in Java and tried running Java with -Djava.net.preferIPv4Stack=true. That eliminates the calls to necp_open and the NPOLICY FD:

            $ sudo lsof -p `pgrep java` | grep NPOLICY
            $
            

            That eliminates the crash, but Jenkins still doesn't restart properly. Now I'm getting:

            2022-02-22 06:38:34.228+0000 [id=1]	SEVERE	winstone.Logger#logInternal: Container startup failed
            java.net.BindException: Address already in use
            	at java.base/sun.nio.ch.Net.bind0(Native Method)
            	at java.base/sun.nio.ch.Net.bind(Net.java:455)
            	at java.base/sun.nio.ch.Net.bind(Net.java:447)
            	at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
            	at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
            	at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
            Caused: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
            	at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
            	at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
            	at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
            	at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
            	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
            	at org.eclipse.jetty.server.Server.doStart(Server.java:401)
            	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
            	at winstone.Launcher.<init>(Launcher.java:198)
            Caused: java.io.IOException: Failed to start Jetty
            	at winstone.Launcher.<init>(Launcher.java:200)
            	at winstone.Launcher.main(Launcher.java:376)
            	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
            	at Main._main(Main.java:304)
            	at Main.main(Main.java:108)
            

            A mystery for another day.

            basil Basil Crow added a comment -

            More late night debugging with java -Djava.net.preferIPv4Stack=true -jar war/target/jenkins.war... after instrumenting Winstone with

            diff --git a/src/main/java/winstone/Launcher.java b/src/main/java/winstone/Launcher.java
            index 2657c71..2c0b72c 100644
            --- a/src/main/java/winstone/Launcher.java
            +++ b/src/main/java/winstone/Launcher.java
            @@ -195,6 +195,7 @@ public class Launcher implements Runnable {
                         }
             
                         try {
            +                Thread.sleep(10 * 1000L);
                             server.start();
                         } catch (Exception e) {
                             throw new IOException("Failed to start Jetty",e);
            

            I could confirm that after the restart not only was the port still open but all the other files were also still open. It appears as if nothing got closed before the exec. Instrumenting Jenkins with

            diff --git a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
            index ca8f50b9f5..b6d28ea313 100644
            --- a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
            +++ b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
            @@ -82,6 +82,7 @@ public class UnixLifecycle extends Lifecycle {
                     }
             
                     // exec to self
            +        Thread.sleep(10 * 1000L);
                     String exe = args.get(0);
                     LIBC.execvp(exe, new StringArray(args.toArray(new String[0])));
                     throw new IOException("Failed to exec '" + exe + "' " + LIBC.strerror(Native.getLastError()));
            

            and running sudo lsof +f g -p $PID I could confirm that no file descriptor had the CX (close-on-exec) flag. We are simply failing to set the close-on-exec flag.
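The same check that `lsof +f g` performs from outside can be approximated from inside a process with F_GETFD. This is only a rough sketch, assuming a POSIX system; `cloexec_status` is a hypothetical helper name, and the upper bound of 256 descriptors is an arbitrary choice for illustration.

```python
import fcntl
import os


def cloexec_status(max_fd: int = 256) -> dict:
    """Map each open descriptor below max_fd to whether FD_CLOEXEC is set."""
    status = {}
    for fd in range(max_fd):
        try:
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        except OSError:
            continue  # descriptor number is not open in this process
        status[fd] = bool(flags & fcntl.FD_CLOEXEC)
    return status


if __name__ == "__main__":
    print("pid", os.getpid())
    for fd, cx in sorted(cloexec_status().items()):
        # "CX" mirrors the flag name that lsof +f g prints.
        print(fd, "CX" if cx else "-")
```

A descriptor reported without CX here would be inherited by whatever program the process execs next.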

            Watching the fcntl system calls with dtruss -t fcntl I see:

            fcntl(0x3, 0x1, 0x0)             = 0 0
            fcntl(0x3, 0x2, 0xFFFFFFFFC5EC5B50)              = 0 0
            fcntl(0x4, 0x1, 0x0)             = 0 0
            fcntl(0x4, 0x2, 0x29ECA00)               = 0 0
            fcntl(0x5, 0x1, 0x0)             = 0 0
            fcntl(0x5, 0x2, 0xFFFFFFFFC5EC5AB0)              = 0 0
            fcntl(0x6, 0x1, 0x0)             = 0 0
            fcntl(0x6, 0x2, 0xFFFFFFFFC5EC5AB0)              = 0 0
            fcntl(0x7, 0x1, 0x0)             = 0 0
            fcntl(0x7, 0x2, 0xFFFFFFFFC5EC5AB0)              = 0 0
            fcntl(0x8, 0x1, 0x0)             = 0 0
            fcntl(0x8, 0x2, 0xFFFFFFFFC5EC5AB0)              = 0 0
            fcntl(0x9, 0x1, 0x0)             = 0 0
            fcntl(0x9, 0x2, 0xFFFFFFFFC5EC5AB0)              = 0 0
            fcntl(0xA, 0x1, 0x0)             = 0 0
            fcntl(0xA, 0x2, 0xFFFFFFFFC5EC5AB0)              = 0 0
            

            and on and on. The calls where the second argument is 0x1 correspond to F_GETFD, and those where it is 0x2 correspond to F_SETFD. Most of the F_SETFD calls have 0xFFFFFFFFC5EC5AB0 as the third argument, including the one for the TCP socket.
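To make the trace easier to read, the lines can be decoded mechanically. The sketch below is a hypothetical helper, not an existing tool: it assumes the `fcntl(fd, cmd, arg)` layout shown in the dtruss output above and the usual values F_GETFD = 1, F_SETFD = 2 and FD_CLOEXEC = 1, and it lists every F_SETFD call whose argument does not have the FD_CLOEXEC bit set.

```python
import re

# Command numbers and flag value as commonly defined in <fcntl.h>
# on macOS and Linux (assumed here, not read from the system headers).
F_GETFD, F_SETFD, FD_CLOEXEC = 1, 2, 1

HEX = r"(0x[0-9A-Fa-f]+)"
CALL = re.compile(r"fcntl\(" + HEX + r", " + HEX + r", " + HEX + r"\)")


def suspicious_setfd_calls(trace: str) -> list:
    """Return (fd, arg) for each F_SETFD call whose arg lacks FD_CLOEXEC."""
    bad = []
    for match in CALL.finditer(trace):
        fd, cmd, arg = (int(group, 16) for group in match.groups())
        if cmd == F_SETFD and not (arg & FD_CLOEXEC):
            bad.append((fd, arg))
    return bad


# First four lines of the dtruss output quoted in this comment.
TRACE = """
fcntl(0x3, 0x1, 0x0)             = 0 0
fcntl(0x3, 0x2, 0xFFFFFFFFC5EC5B50)              = 0 0
fcntl(0x4, 0x1, 0x0)             = 0 0
fcntl(0x4, 0x2, 0x29ECA00)               = 0 0
"""

if __name__ == "__main__":
    for fd, arg in suspicious_setfd_calls(TRACE):
        print(f"fd {fd}: F_SETFD called with {arg:#x}, FD_CLOEXEC bit not set")
```

Every third argument in the quoted trace ends in an even hex digit, so the low bit that FD_CLOEXEC occupies is clear in all of them.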

            For comparison I wrote a simple Python program:

            import fcntl
            import time
            
            with open('foo.txt') as foo:
                time.sleep(10)
                print('about to fcntl')
                flags = fcntl.fcntl(foo.fileno(), fcntl.F_GETFD)
                fcntl.fcntl(foo.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
                print('finished fcntl')
                time.sleep(10)
            

            I confirmed that after the F_SETFD the file descriptor had CX set in lsof +f g as expected. The dtruss -t fcntl output here was:

            fcntl(0x3, 0x1, 0x0)		 = 0 0
            fcntl(0x3, 0x2, 0x1)		 = 0 0
            

            So clearly Jenkins is in the wrong here by calling F_SETFD with 0xFFFFFFFFC5EC5AB0 as the flags. It should be 0x1.
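As a quick arithmetic check, the sketch below tests the FD_CLOEXEC bit in each third argument seen in the Jenkins trace. Every one of them has that bit clear, which would be consistent with lsof showing no CX flag, under the assumption that the kernel only inspects the FD_CLOEXEC bit of the argument. The `has_cloexec` helper is illustrative.

```python
import fcntl

# Third arguments observed in the Jenkins dtruss output above.
observed = [0xFFFFFFFFC5EC5B50, 0x29ECA00, 0xFFFFFFFFC5EC5AB0]


def has_cloexec(flags):
    """True if the FD_CLOEXEC bit is set in a descriptor-flags value."""
    return bool(flags & fcntl.FD_CLOEXEC)


if __name__ == "__main__":
    for value in observed:
        print(f"{value:#x}: FD_CLOEXEC set = {has_cloexec(value)}")
    # The value passed by the Python comparison program.
    print(f"{0x1:#x}: FD_CLOEXEC set = {has_cloexec(0x1)}")
```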

            Funnily enough I got this working very easily with JNR (which I am trying to get rid of in core...):

            diff --git a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
            index ca8f50b9f5..a68bc4ef7b 100644
            --- a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
            +++ b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
            @@ -26,18 +26,19 @@ package hudson.lifecycle;
             
             import static hudson.util.jna.GNUCLibrary.FD_CLOEXEC;
             import static hudson.util.jna.GNUCLibrary.F_GETFD;
            -import static hudson.util.jna.GNUCLibrary.F_SETFD;
             import static hudson.util.jna.GNUCLibrary.LIBC;
             
             import com.sun.jna.Native;
             import com.sun.jna.StringArray;
             import hudson.Platform;
            +import hudson.os.PosixAPI;
             import java.io.IOException;
             import java.util.List;
             import java.util.logging.Level;
             import java.util.logging.Logger;
             import jenkins.model.Jenkins;
             import jenkins.util.JavaVMArguments;
            +import jnr.constants.platform.Fcntl;
             
             /**
              * {@link Lifecycle} implementation when Hudson runs on the embedded
            @@ -78,7 +79,7 @@ public class UnixLifecycle extends Lifecycle {
                     for (int i = 3; i < sz; i++) {
                         int flags = LIBC.fcntl(i, F_GETFD);
                         if (flags < 0) continue;
            -            LIBC.fcntl(i, F_SETFD, flags | FD_CLOEXEC);
            +            PosixAPI.jnr().fcntlInt(i, Fcntl.F_SETFD, flags | FD_CLOEXEC);
                     }
             
                     // exec to self
            

            With that, java -Djava.net.preferIPv4Stack=true -jar war/target/jenkins.war is fully working and the restart succeeds.

            To summarize, there are two separate bugs here:

            1. When running without -Djava.net.preferIPv4Stack=true, Java_java_net_Inet6AddressImpl_lookupAllHostAddr calls necp_open, which creates a file descriptor with NPOLICY, and if we then set FD_CLOEXEC on that file descriptor the process crashes. I am not sure offhand what the best way to deal with this is. Perhaps we should just skip those file descriptors? But we would need to make sure they don't cause problems after exec'ing. Needs more investigation. In any case, -Djava.net.preferIPv4Stack=true works around the problem.
            2. Our call to fcntl(fd, F_SETFD, flags) has a bogus flags argument. The above code for JNR shows one way to fix this. I'm sure this can also be done with JNA. I am just too tired right now to come up with the JNA version.
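For reference, the intended behavior of the loop in UnixLifecycle can be sketched in Python as follows: read each descriptor's flags, skip descriptors that cannot be queried, and write the flags back with FD_CLOEXEC added. This is only an illustration of the logic, not the Jenkins code. The `mark_close_on_exec` name and its `skip` parameter are hypothetical, and how to identify the NPOLICY descriptors from point 1 is left open here.

```python
import fcntl
import os


def mark_close_on_exec(fds, skip=frozenset()):
    """Set FD_CLOEXEC on each descriptor in fds and return the ones marked.

    Descriptors listed in skip, or that cannot be queried (for example
    because they are not open), are left untouched.
    """
    marked = []
    for fd in fds:
        if fd in skip:
            continue
        try:
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        except OSError:
            continue  # not an open descriptor
        fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
        marked.append(fd)
    return marked


if __name__ == "__main__":
    r, w = os.pipe()
    # Python creates descriptors non-inheritable, so clear the flag first.
    os.set_inheritable(r, True)
    os.set_inheritable(w, True)
    print("marked:", mark_close_on_exec([r, w]))
    print("inheritable after:", os.get_inheritable(r), os.get_inheritable(w))
    os.close(r)
    os.close(w)
```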
            basil Basil Crow added a comment -

            Of note is that fcntl is a variadic function whose flags argument is an unsigned 64-bit integer. There is probably something wrong with our JNA signatures. Ironically, when I do a Google search for "fcntl jna", all I can find is the (incorrect) code in Jenkins.
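The same class of problem exists in other foreign-function layers. As an analogy only (this is Python's ctypes, not JNA, and whether the Jenkins binding needs the equivalent change is an assumption), the ctypes documentation notes that on platforms such as ARM64 Apple systems the fixed parameters of a variadic function must be declared via `argtypes` so that any further arguments are passed using the variadic calling convention. A minimal sketch, with `set_cloexec` as an illustrative helper:

```python
import ctypes
import fcntl
import os

# Load the symbols of the current process, which include libc's fcntl.
libc = ctypes.CDLL(None, use_errno=True)

# Declare only the two fixed parameters (fd, cmd). Arguments passed beyond
# these are then treated by ctypes as variadic arguments.
libc.fcntl.argtypes = [ctypes.c_int, ctypes.c_int]
libc.fcntl.restype = ctypes.c_int


def set_cloexec(fd):
    """Set FD_CLOEXEC on fd by calling the C fcntl function through ctypes."""
    flags = libc.fcntl(fd, fcntl.F_GETFD)
    if flags < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    result = libc.fcntl(fd, fcntl.F_SETFD, ctypes.c_int(flags | fcntl.FD_CLOEXEC))
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    r, w = os.pipe()
    os.set_inheritable(r, True)  # start with close-on-exec cleared
    set_cloexec(r)
    print("FD_CLOEXEC set:", bool(fcntl.fcntl(r, fcntl.F_GETFD) & fcntl.FD_CLOEXEC))
    os.close(r)
    os.close(w)
```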

            People

              Unassigned Unassigned
              transcendd Jonatan