Type: Bug
Resolution: Fixed
Priority: Minor
Jenkins: 2.298
OS: Mac OS X - 11.0
---
jsch:0.1.55.2
ws-cleanup:0.39
blueocean-commons:1.24.7
mercurial:2.14
echarts-api:5.1.0-2
structs:1.23
sse-gateway:1.24
conditional-buildstep:1.4.1
apache-httpcomponents-client-4-api:4.5.13-1.0
parameterized-trigger:2.40
sshd:3.0.3
pipeline-model-extensions:1.8.4
build-with-parameters:1.5.1
workflow-aggregator:2.6
mailer:1.34
git:4.7.1
handy-uri-templates-2-api:2.1.8-1.0
theme-manager:0.6
bootstrap4-api:4.6.0-3
jquery3-api:3.6.0-1
delivery-pipeline-plugin:1.4.2
command-launcher:1.6
dark-theme:0.0.12
workflow-api:2.42
docker-build-publish:1.3.3
extended-choice-parameter:0.82
workflow-job:2.40
ssh-credentials:1.18.2
authentication-tokens:1.4
blueocean-rest-impl:1.24.7
github-branch-source:2.10.4
htmlpublisher:1.25
javadoc:1.6
workflow-cps-global-lib:2.19
blueocean-web:1.24.7
jackson2-api:2.12.3
disable-failed-job:1.15
ssh-slaves:1.31.7
docker-workflow:1.26
pipeline-stage-tags-metadata:1.8.4
blueocean-pipeline-scm-api:1.24.7
pipeline-milestone-step:1.3.2
credentials:2.3.19
docker-java-api:3.1.5.2
cloudbees-bitbucket-branch-source:2.9.9
jquery:1.12.4-1
github:1.33.1
lockable-resources:2.10
jquery-detached:1.2.1
blueocean-personalization:1.24.7
workflow-scm-step:2.12
matrix-auth:2.6.6
matrix-project:1.18
pipeline-stage-step:2.5
pipeline-build-step:2.13
antisamy-markup-formatter:2.1
pipeline-input-step:2.12
bouncycastle-api:2.20
checks-api:1.7.0
handlebars:3.0.8
blueocean:1.24.7
variant:1.4
momentjs:1.1.1
blueocean-jwt:1.24.7
plain-credentials:1.7
docker-commons:1.17
aws-java-sdk:1.11.995
docker-plugin:1.2.2
jobConfigHistory:2.27
copyartifact:1.46
git-client:3.7.1
timestamper:1.13
gradle:1.36
pipeline-rest-api:2.19
workflow-basic-steps:2.23
github-api:1.123
blueocean-i18n:1.24.7
ldap:2.6
blueocean-events:1.24.7
jaxb:2.3.0.1
blueocean-core-js:1.24.7
maven-plugin:3.10
aws-credentials:1.29
thinBackup:1.10
blueocean-config:1.24.7
blueocean-github-pipeline:1.24.7
font-awesome-api:5.15.3-2
credentials-binding:1.24
pipeline-model-definition:1.8.4
config-file-provider:3.8.0
pipeline-stage-view:2.19
sonar:2.13.1
token-macro:2.15
slack:2.47
blueocean-display-url:2.4.1
workflow-multibranch:2.23
script-security:1.76
git-server:1.9
snakeyaml-api:1.27.0
docker-build-step:2.7
discard-old-build:1.05
ssh-agent:1.22
workflow-step-api:2.23
run-condition:1.5
rebuild:1.32
okhttp-api:3.14.9
pipeline-graph-analysis:1.10
blueocean-git-pipeline:1.24.7
pipeline-model-api:1.8.4
plugin-util-api:2.2.0
jenkins-design-language:1.24.7
windows-slaves:1.8
workflow-cps:2.90
popper-api:1.16.1-2
blueocean-autofavorite:1.2.4
workflow-durable-task-step:2.38
trilead-api:1.0.13
branch-api:2.6.4
jdk-tool:1.5
whitesource:21.1.2
cloudbees-folder:6.15
blueocean-pipeline-editor:1.24.7
Parameterized-Remote-Trigger:3.1.5.1
blueocean-dashboard:1.24.7
jjwt-api:0.11.2-9.c8b45b8bb173
durable-task:1.36
kpp-management-plugin:1.0.0
junit:1.49
pam-auth:1.6
pubsub-light:1.14
next-build-number:1.6
scm-api:2.6.4
blueocean-pipeline-api-impl:1.24.7
ace-editor:1.1
display-url-api:2.3.4
workflow-support:3.8
resource-disposer:0.15
blueocean-rest:1.24.7
build-timeout:1.20
favorite:2.3.3
versionnumber:1.9
blueocean-bitbucket-pipeline:1.24.7
mapdb-api:1.0.9.0
git-parameter:0.9.13
Also reproducible with a minimal set of plugins and a clean install.
Installed with homebrew, running as "jenkins" command (not a brew service).
Local instance behind a firewall and accessed through the machine's IP.
Released As: 2.449
This issue appeared after migrating our Jenkins instance from a Macbook to a new Mac mini (M1). I could also reproduce it with a clean install on the M1 machine.
It has been present in the last few versions (at least).
Steps to reproduce:
- Go to http://[JENKINS_URL]/updateCenter/
- Check "Restart Jenkins when installation is complete and no jobs are running" (regardless of whether there's an upgrade or not)
- Observe what happens in the terminal: after 10 seconds, the Jenkins process is killed instead of restarted.
This is what the log looks like:
2021-06-17 08:27:47.284+0000 [id=1066] INFO jenkins.model.Jenkins$20#run: Restart in 10 seconds
2021-06-17 08:27:57.287+0000 [id=1066] INFO jenkins.model.Jenkins$20#run: Restarting VM as requested by admin
2021-06-17 08:27:57.287+0000 [id=1066] INFO jenkins.model.Jenkins#cleanUp: Stopping Jenkins
2021-06-17 08:27:57.339+0000 [id=1066] INFO jenkins.model.Jenkins$16#onAttained: Started termination
2021-06-17 08:27:57.370+0000 [id=1066] INFO jenkins.model.Jenkins$16#onAttained: Completed termination
2021-06-17 08:27:57.370+0000 [id=1066] INFO jenkins.model.Jenkins#_cleanUpDisconnectComputers: Starting node disconnection
2021-06-17 08:27:57.379+0000 [id=1066] INFO jenkins.model.Jenkins#_cleanUpShutdownPluginManager: Stopping plugin manager
2021-06-17 08:27:57.440+0000 [id=1066] INFO jenkins.model.Jenkins#_cleanUpPersistQueue: Persisting build queue
2021-06-17 08:27:57.443+0000 [id=1066] INFO jenkins.model.Jenkins#_cleanUpAwaitDisconnects: Waiting for node disconnection completion
2021-06-17 08:27:57.443+0000 [id=1066] INFO jenkins.model.Jenkins#cleanUp: Jenkins stopped
[1] 72539 killed jenkins
Jenkins runs fine when manually started again (with "jenkins" command).
[JENKINS-65911] Jenkins shuts down instead of restarting on Mac M1
I've been debugging it.
It fails on this line:
https://github.com/jenkinsci/jenkins/blob/195dacbf556fdf0a393d5d139df27ffba3a82b11/core/src/main/java/hudson/lifecycle/UnixLifecycle.java#L80
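For context, the loop around that line (quoted in the patches further down) does roughly the following; LIBC, F_GETFD, F_SETFD, and FD_CLOEXEC are static imports from hudson.util.jna.GNUCLibrary:

// Paraphrased sketch of UnixLifecycle.restart(): mark every descriptor except
// stdin/stdout/stderr as close-on-exec, then exec the same JVM command line.
int sz = LIBC.getdtablesize();
for (int i = 3; i < sz; i++) {
    int flags = LIBC.fcntl(i, F_GETFD);
    if (flags < 0) continue;                    // not an open file descriptor
    LIBC.fcntl(i, F_SETFD, flags | FD_CLOEXEC); // this is the call that crashes
}
// exec to self; the kernel closes all FD_CLOEXEC descriptors during exec
String exe = args.get(0);
LIBC.execvp(exe, new StringArray(args.toArray(new String[0])));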
The file descriptor it fails on varies between runs, normally between 820 and 840.
Catching the error doesn't help:
try {} catch (Error e)
Not sure if basil would have any ideas on what to try from here =/
timja Unfortunately I don't have access to a macOS system to do real debugging on, and debugging by telegram isn't very efficient. But I can try to help.
The logic makes sense at a high level: we are about to exec, so to avoid leaking open file handles we need to set the close-on-exec flag for all open file descriptors. That this is not only failing in the first place but actually crashing the process is quite surprising.
The most direct form of analysis would be to work our way backwards from the crash. Do you have a stack trace, registers, or any other diagnostic information for the crash? I don't really know where macOS puts these things, but this page seems to have some answers.
Another piece of interesting information would be which file descriptor we are trying to set the close-on-exec flag on just prior to the crash. If you can put a breakpoint at the line you mentioned, run lsof to list the process's open file descriptors, and keep going until it crashes, it would be useful to know what type of file descriptor caused the crash: e.g. was it a socket or a regular file, and if it was a regular file, what was its path?
Thanks, I'll add some more debug info. Currently I've just got the ID; there's no stack trace.
Interestingly, there was a crash log in 'Crash reports':
Exception Type: EXC_GUARD Exception Codes: GUARD_TYPE_FD
Looks like some Apple protection on one of the file descriptors.
Full log at: https://gist.github.com/timja/0167c76924c5e44a09c64a545c87ef8b
The thread always crashes at:
Thread 64 Crashed:: Java: restart thread
0 libsystem_kernel.dylib 0x000000018c105964 __fcntl + 8
1 libsystem_kernel.dylib 0x000000018c105930 fcntl + 88
Apple docs: https://developer.apple.com/documentation/xcode/understanding-the-exception-types-in-a-crash-report
This looks to be the correct fix; there's no need to set `FD_CLOEXEC` on a file descriptor that already has it set:
diff --git a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
index 391998cdff..97288d5fad 100644
--- a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
+++ b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
@@ -76,7 +76,9 @@ public class UnixLifecycle extends Lifecycle {
         int sz = LIBC.getdtablesize();
         for(int i=3; i<sz; i++) {
             int flags = LIBC.fcntl(i, F_GETFD);
-            if(flags<0) continue;
+            if (flags < 0 || flags == FD_CLOEXEC) {
+                continue;
+            }
             LIBC.fcntl(i, F_SETFD,flags| FD_CLOEXEC);
         }
It gets further after the above; it now fails at:
2021-10-04 21:48:05.292+0000 [id=1] INFO winstone.Logger#logInternal: Jetty shutdown successfully
java.io.IOException: Failed to start Jetty
    at winstone.Launcher.<init>(Launcher.java:194)
    at winstone.Launcher.main(Launcher.java:369)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at Main._main(Main.java:311)
    at Main.main(Main.java:114)
Caused by: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
    at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
    at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
    at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at org.eclipse.jetty.server.Server.doStart(Server.java:401)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at winstone.Launcher.<init>(Launcher.java:192)
    ... 7 more
Caused by: java.net.BindException: Address already in use
    at java.base/sun.nio.ch.Net.bind0(Native Method)
    at java.base/sun.nio.ch.Net.bind(Net.java:455)
    at java.base/sun.nio.ch.Net.bind(Net.java:447)
    at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
    at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
    ... 14 more
2021-10-04 21:48:05.292+0000 [id=1] SEVERE winstone.Logger#logInternal: Container startup failed
java.net.BindException: Address already in use
    at java.base/sun.nio.ch.Net.bind0(Native Method)
    at java.base/sun.nio.ch.Net.bind(Net.java:455)
    at java.base/sun.nio.ch.Net.bind(Net.java:447)
    at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
    at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
Caused: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
    at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
    at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
    at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at org.eclipse.jetty.server.Server.doStart(Server.java:401)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at winstone.Launcher.<init>(Launcher.java:192)
Caused: java.io.IOException: Failed to start Jetty
    at winstone.Launcher.<init>(Launcher.java:194)
    at winstone.Launcher.main(Launcher.java:369)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at Main._main(Main.java:311)
    at Main.main(Main.java:114)
Any thoughts on where to check for that?
This might be where it's supposed to be shut down: https://github.com/jenkinsci/winstone/blob/825889d1a93a49feabca297308fff7ecafba70cd/src/main/java/winstone/ShutdownHook.java#L15
timja Good insight! I think you're almost there, but your fix isn't quite right. The flags value is a bit field, so you can't check whether FD_CLOEXEC is set with flags == FD_CLOEXEC – that check probably misses a lot of descriptors, including open sockets, which is probably why you then get the "Address already in use" error. You want to use bitwise arithmetic to check whether the FD_CLOEXEC bit is set in the flags while ignoring the other bits that you don't care about.
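In code, the check being suggested is something like this (a sketch of the bitwise test only, not a tested patch):

int flags = LIBC.fcntl(i, F_GETFD);
// (flags & FD_CLOEXEC) tests just the close-on-exec bit; flags == FD_CLOEXEC
// only matches descriptors whose entire flag word happens to equal 1.
if (flags < 0 || (flags & FD_CLOEXEC) != 0) {
    continue; // invalid FD, or already marked close-on-exec
}
LIBC.fcntl(i, F_SETFD, flags | FD_CLOEXEC);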
timja Come to think of it, you might not be almost there after all, because I see no evidence that whatever file descriptor macOS is freaking out about has the FD_CLOEXEC already set on it – that seems like an assumption on your part based on a faulty understanding of how the bit is represented in the flags. I don't see any reason why your theory would be false but I also don't see any reason why it would be true. So back to the drawing board...
What is clear to me is that this is some weird Apple thing where certain file descriptors are "guarded resources": "the operating system marks a file descriptor as guarded so that normal file descriptor APIs can't modify them." So when we try to set FD_CLOEXEC on one, the OS "helpfully" crashes the process rather than trusting that we know what we're doing. I had initially been worried that we were hitting some JNA issue, but that seems unlikely; we're bumping into some macOS restriction. My main question is: what file is so special that macOS forbids us from closing it? Can you run lsof -p ${PID} and let me know what file handle is so sacred that Apple does not allow developers to close it?
It's not very informative...
java 37417 timja 844 NPOLICY
The only reference I found was someone asking what it is:
https://developer.apple.com/forums/thread/659475
The evidence I saw was from logging the flags returned by 'F_GETFD' and seeing that the failing FD had '1' returned, which is FD_CLOEXEC from what I understand, but I haven't had to do bitwise arithmetic in a long time, so that's likely a very faulty assumption.
(There was also one other FD with flag 1, which was the preceding FD.)
Maybe you'd have more luck figuring out what this weird file descriptor is with DTrace. sudo opensnoop -p ${PID} will log every file descriptor as it's opened, giving you a bunch of information about it, like its file descriptor number and its path. It's based on DTrace, so you'll have to disable System Integrity Protection (SIP) in order to use it. (It's this kind of thing that drove me nuts, which is why I eventually quit using macOS. Now I run Linux and can do system-level tracing to my heart's content.)
timja You're absolutely sure that the file descriptor we're crashing on while trying to set FD_CLOEXEC is of type NPOLICY? Check out this page:
A major addition to Darwin's network stack are Network Extension Control Policies (NECPs), added to Darwin 14. NECPs are described in bsd/net/necp.c as "..high-level policy sessions, which are ingested into low-level kernel policies that control and tag traffic at the application, socket, and IP layers". In other words, NECPs enable user mode programs to control the kernel's network routing and scheduling decisions. Naturally, these also allow QoS.
[…]
Starting with Darwin 16, just about every network-enabled process in the system uses NECPs, oftentimes without the developer even knowing what they are. This is because libnetwork.dylib calls necp_open() (#501) as part of its initialization (specifically, from nw_endpoint_handler_start). This creates a necp client, a file descriptor of type NPOLICY, which is readily visible in the output of lsof(1) or procexp ..fds. The descriptor does not offer the traditional operations (read(2)/write(2)/ioctl(2)), and only supports select(2), or use in a kqueue.
[…]
Darwin 17 extends the idea of NECP client descriptors, and adds the NECP session (also an NPOLICY file descriptor^*^). These descriptors are created with necp_session_open (#522), and support just the close(2) operation (which deletes the associated session).
So it looks like an NPOLICY file descriptor could be either an NECP client or an NECP session. Now, if trying to set FD_CLOEXEC on one of these is triggering a EXC_GUARD/GUARD_TYPE_FD "guarded resource protection", then my next question would be: who created the NECP client/session, and why is it guarded? More importantly, how can we get it to release the resource before we exec?
> You're absolutely sure that the file descriptor we're crashing trying to set FD_CLOEXEC on is of type NPOLICY?
Sure is a strong word here; this sort of thing is quite outside my comfort zone, but yes, it appears so.
These are the related entries from the last restart I just did:
java 40095 timja 840u IPv6 0x7a007b86cdf33161 0t0 TCP localhost:http-alt->localhost:58463 (ESTABLISHED)
java 40095 timja 841u IPv6 0x7a007b86c2092461 0t0 TCP *:ndmp (LISTEN)
java 40095 timja 842u systm 0x7a007b936831da81 0t0 [ctl com.apple.netsrc id 6 unit 26]
java 40095 timja 843u unix 0x7a007b86a4514369 0t0 ->0x7a007b86a450efd1
java 40095 timja 844 NPOLICY
java 40095 timja 845u IPv6 0x7a007b86c208eb01 0t0 TCP 192.168.4.151:58480->20.190.160.136:https (ESTABLISHED)
java 40095 timja 846u IPv6 0x7a007b86c208f7c1 0t0 TCP 192.168.4.151:58484->20.190.160.136:https (ESTABLISHED)
java 40095 timja 847u IPv6 0x7a007b86c2091141 0t0 TCP 192.168.4.151:58488->40.126.41.96:https (ESTABLISHED)
java 40095 timja 848u IPv6 0x7a007b86cdf36461 0t0 TCP localhost:http-alt->localhost:58490 (ESTABLISHED)
java 40095 timja 849u IPv6 0x7a007b86cdf337c1 0t0 TCP localhost:http-alt->localhost:58491 (ESTABLISHED)
java 40095 timja 850u IPv6 0x7a007b86cdf34481 0t0 TCP localhost:http-alt->localhost:58493 (ESTABLISHED)
java 40095 timja 851u IPv6 0x7a007b86cdaebe21 0t0 TCP localhost:http-alt->localhost:58494 (ESTABLISHED)
java 40095 timja 852u IPv6 0x7a007b86cdaed7a1 0t0 TCP localhost:http-alt->localhost:58495 (ESTABLISHED)
and the relevant part of the crash log:
Exception Type:    EXC_GUARD
Exception Codes:   0x400000040000034c, 0xe20c7f41b3616e04
Exception Subtype: GUARD_TYPE_FD, id=0xe20c7f41b3616e04, fd=844, flavor=0x00000004 (NOCLOEXEC)
This is the result from opensnoop: https://gist.github.com/timja/65e67174af7a3a3f2e721bbeef2a9c5f
On this run the FD that caused it to crash was 847, and was of type 'NPOLICY'
Interestingly, I found that the file descriptor is only created after logging in to Jenkins.
If you disable auth and restart, the process doesn't get killed,
and after some more digging it seems the azure-ad plugin is opening that file descriptor somehow.
The built-in security realm doesn't have the problem.
But the restart doesn't work either way; it still hits the following:
2021-10-05 07:26:21.441+0000 [id=1] SEVERE winstone.Logger#logInternal: Container startup failed
java.net.BindException: Address already in use
    at java.base/sun.nio.ch.Net.bind0(Native Method)
    at java.base/sun.nio.ch.Net.bind(Net.java:455)
    at java.base/sun.nio.ch.Net.bind(Net.java:447)
    at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
    at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
Caused: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
    at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
    at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
    at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at org.eclipse.jetty.server.Server.doStart(Server.java:401)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at winstone.Launcher.<init>(Launcher.java:192)
Caused: java.io.IOException: Failed to start Jetty
    at winstone.Launcher.<init>(Launcher.java:194)
    at winstone.Launcher.main(Launcher.java:369)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at Main._main(Main.java:311)
    at Main.main(Main.java:114)
timja Setting aside the Azure AD issue, it is interesting that with the built-in security realm the restart fails with java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080. I looked into how this works (correctly) on my Linux system. Before restarting Jenkins, here are the open TCP file descriptors:
$ sudo lsof -P -p 3916237 | grep TCP
java 3916237 basil   5u IPv6 5821933 0t0 TCP localhost:8080->localhost:49530 (ESTABLISHED)
java 3916237 basil   6u IPv4 5807038 0t0 TCP localhost:5005->localhost:48452 (ESTABLISHED)
java 3916237 basil 120u IPv6 5812861 0t0 TCP *:8080 (LISTEN)
The call to LIBC.fcntl(i, F_SETFD,flags | FD_CLOEXEC) marks them as CLOEXEC, then the call to LIBC.execvp execs to self, during which the file descriptors are closed and the socket released. During startup there are no open TCP file descriptors, so Jetty has no problem binding to the port.
At what point is this going off the rails on M1?
Note that winstone.Launcher#shutdown isn't executed in this process, since this isn't a "regular" JVM shutdown where the shutdown hooks are executed. Whatever Winstone/Jetty might be doing to gracefully shut down, I can't see how closing all file descriptors and calling exec would do any less than that - there's really not much left of the process once its program text and open file handles are gone. Unless there's some weird Apple resource that winstone.Launcher#shutdown releases which isn't a file descriptor or in the process's address space - which would be quite strange.
I can reproduce the problem on an M1 system with a stock Jenkins installation. Tracing the process with dtruss(1) I cannot see any calls to necp_session_open, but I see a call to necp_open:
927/0x364b: 3733 12 8 necp_open(0x0, 0x0, 0x0) = 3 0
0x18e4d1b78
0x1926e07ec
0x1926df85c
0x18f1429c8
0x102ee0450
0x102ede014
0x102eddf68
0x102eddea0
0x18e25c1d8
0x18e25abf8
0x18e34ac6c
0x18e366f68
0x18e352208
0x18e367cb8
0x18e35d708
0x18e505304
0x18e504018
For some reason DTrace's ustack() isn't resolving symbols, but I got the symbols from lldb(1):
target create /opt/homebrew/opt/openjdk@11/bin/java
process launch --stop-at-entry -- -jar war/target/jenkins.war
process handle -p true -s false SIGBUS
process handle -p true -s false SIGSEGV
process handle -p true -s false SIGSTOP
break set -n necp_open -s libsystem_kernel.dylib
continue
bt
* thread #53, name = 'Java: GrapeHack.hack', stop reason = breakpoint 1.1
  * frame #0: 0x000000018e4d1b70 libsystem_kernel.dylib`necp_open
    frame #1: 0x0000000192715b2c libnetwork.dylib`nw_path_shared_necp_fd + 220
    frame #2: 0x00000001926e07ec libnetwork.dylib`nw_path_evaluator_evaluate + 540
    frame #3: 0x00000001926df85c libnetwork.dylib`nw_path_create_evaluator_for_endpoint + 108
    frame #4: 0x0000000192e1e8cc libnetwork.dylib`nw_nat64_v4_address_requires_synthesis + 252
    frame #5: 0x000000018e5298e4 libsystem_info.dylib`si_addrinfo + 2308
    frame #6: 0x000000018e528f34 libsystem_info.dylib`getaddrinfo + 168
    frame #7: 0x00000001021702c4 libnet.dylib`Java_java_net_Inet6AddressImpl_lookupAllHostAddr + 144
    frame #8: 0x000000011f90bbfc
    frame #9: 0x000000011f906fc8
    frame #10: 0x000000011f906fc8
    frame #11: 0x000000011f906ee0
    frame #12: 0x000000011f906fc8
    frame #13: 0x000000011f906ee0
    frame #14: 0x000000011f906ee0
    frame #15: 0x000000011f906ee0
    frame #16: 0x000000011f90719c
    frame #17: 0x000000011f90719c
    frame #18: 0x000000011f90719c
    frame #19: 0x000000011f900144
    frame #20: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
    frame #21: 0x00000001028d4898 libjvm.dylib`invoke(InstanceKlass*, methodHandle const&, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 1796
    frame #22: 0x00000001028d4d8c libjvm.dylib`Reflection::invoke_constructor(oopDesc*, objArrayHandle, Thread*) + 392
    frame #23: 0x00000001026d0554 libjvm.dylib`JVM_NewInstanceFromConstructor + 232
    frame #24: 0x00000001274e1360
    frame #25: 0x00000001203c8c84
    frame #26: 0x000000011f906ee0
    frame #27: 0x000000011f906ee0
    frame #28: 0x000000011f906ee0
    frame #29: 0x000000011f90719c
    frame #30: 0x000000011f9071f4
    frame #31: 0x000000011f9071f4
    frame #32: 0x00000001200a6434
    frame #33: 0x000000011f906ab0
    frame #34: 0x000000011f9067e4
    frame #35: 0x000000011f90678c
    frame #36: 0x000000011f9071f4
    frame #37: 0x000000011f90719c
    frame #38: 0x000000011f90719c
    frame #39: 0x000000011f90719c
    frame #40: 0x000000011f90719c
    frame #41: 0x000000011f90719c
    frame #42: 0x000000011f90719c
    frame #43: 0x000000011f90719c
    frame #44: 0x000000011f90719c
    frame #45: 0x000000011f906fc8
    frame #46: 0x000000011f906ee0
    frame #47: 0x000000011f906ee0
    frame #48: 0x000000011f906fc8
    frame #49: 0x000000011f900144
    frame #50: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
    frame #51: 0x00000001028d4898 libjvm.dylib`invoke(InstanceKlass*, methodHandle const&, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 1796
    frame #52: 0x00000001028d4d8c libjvm.dylib`Reflection::invoke_constructor(oopDesc*, objArrayHandle, Thread*) + 392
    frame #53: 0x00000001026d0554 libjvm.dylib`JVM_NewInstanceFromConstructor + 232
    frame #54: 0x00000001274e1360
    frame #55: 0x00000001203c8c84
    frame #56: 0x000000011f906ee0
    frame #57: 0x000000011f900144
    frame #58: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
    frame #59: 0x00000001028d4898 libjvm.dylib`invoke(InstanceKlass*, methodHandle const&, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 1796
    frame #60: 0x00000001028d4138 libjvm.dylib`Reflection::invoke_method(oopDesc*, Handle, objArrayHandle, Thread*) + 276
    frame #61: 0x00000001026d038c libjvm.dylib`JVM_InvokeMethod + 508
    frame #62: 0x00000001276a1270
    frame #63: 0x000000012097c16c
    frame #64: 0x000000011f90719c
    frame #65: 0x000000011f9071f4
    frame #66: 0x000000011f90719c
    frame #67: 0x000000011f90719c
    frame #68: 0x000000011f9071f4
    frame #69: 0x000000011f9071f4
    frame #70: 0x000000011f9071f4
    frame #71: 0x000000011f90719c
    frame #72: 0x000000011f9071f4
    frame #73: 0x000000011f900144
    frame #74: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
    frame #75: 0x00000001026468d0 libjvm.dylib`JavaCalls::call_virtual(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, Thread*) + 236
    frame #76: 0x0000000102646998 libjvm.dylib`JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*) + 100
    frame #77: 0x00000001026cd0dc libjvm.dylib`thread_entry(JavaThread*, Thread*) + 120
    frame #78: 0x0000000102973134 libjvm.dylib`JavaThread::thread_main_inner() + 124
    frame #79: 0x0000000102972f64 libjvm.dylib`JavaThread::run() + 376
    frame #80: 0x0000000102970c88 libjvm.dylib`Thread::call_run() + 120
    frame #81: 0x0000000102874b94 libjvm.dylib`thread_native_entry(Thread*) + 316
    frame #82: 0x000000018e509240 libsystem_pthread.dylib`_pthread_start + 148
So some thread named GrapeHack.hack is calling Java_java_net_Inet6AddressImpl_lookupAllHostAddr, which eventually calls necp_open, which creates a file descriptor with NPOLICY, which we then set FD_CLOEXEC on, which we then crash on when exec'ing.
I looked into disabling IPv6 in Java and tried running Java with -Djava.net.preferIPv4Stack=true. That eliminates the calls to necp_open and the NPOLICY FD:
$ sudo lsof -p `pgrep java` | grep NPOLICY
$
That eliminates the crash, but Jenkins still doesn't restart properly. Now I'm getting:
2022-02-22 06:38:34.228+0000 [id=1] SEVERE winstone.Logger#logInternal: Container startup failed
java.net.BindException: Address already in use
    at java.base/sun.nio.ch.Net.bind0(Native Method)
    at java.base/sun.nio.ch.Net.bind(Net.java:455)
    at java.base/sun.nio.ch.Net.bind(Net.java:447)
    at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
    at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
Caused: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
    at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
    at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
    at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at org.eclipse.jetty.server.Server.doStart(Server.java:401)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at winstone.Launcher.<init>(Launcher.java:198)
Caused: java.io.IOException: Failed to start Jetty
    at winstone.Launcher.<init>(Launcher.java:200)
    at winstone.Launcher.main(Launcher.java:376)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at Main._main(Main.java:304)
    at Main.main(Main.java:108)
A mystery for another day.
More late night debugging with java -Djava.net.preferIPv4Stack=true -jar war/target/jenkins.war... after instrumenting Winstone with
diff --git a/src/main/java/winstone/Launcher.java b/src/main/java/winstone/Launcher.java
index 2657c71..2c0b72c 100644
--- a/src/main/java/winstone/Launcher.java
+++ b/src/main/java/winstone/Launcher.java
@@ -195,6 +195,7 @@ public class Launcher implements Runnable {
         }
 
         try {
+            Thread.sleep(10 * 1000L);
             server.start();
         } catch (Exception e) {
             throw new IOException("Failed to start Jetty",e);
I could confirm that after the restart not only was the port still open but all the other files were also still open. It appears as if nothing got closed before the exec. Instrumenting Jenkins with
diff --git a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
index ca8f50b9f5..b6d28ea313 100644
--- a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
+++ b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
@@ -82,6 +82,7 @@ public class UnixLifecycle extends Lifecycle {
         }
 
         // exec to self
+        Thread.sleep(10 * 1000L);
         String exe = args.get(0);
         LIBC.execvp(exe, new StringArray(args.toArray(new String[0])));
         throw new IOException("Failed to exec '" + exe + "' " + LIBC.strerror(Native.getLastError()));
and running sudo lsof +f g -p $PID I could confirm that no file descriptor had the CX (close-on-exec) flag. We are simply failing to set the close-on-exec flag.
Watching the fcntl system calls with dtruss -t fcntl I see:
fcntl(0x3, 0x1, 0x0) = 0 0
fcntl(0x3, 0x2, 0xFFFFFFFFC5EC5B50) = 0 0
fcntl(0x4, 0x1, 0x0) = 0 0
fcntl(0x4, 0x2, 0x29ECA00) = 0 0
fcntl(0x5, 0x1, 0x0) = 0 0
fcntl(0x5, 0x2, 0xFFFFFFFFC5EC5AB0) = 0 0
fcntl(0x6, 0x1, 0x0) = 0 0
fcntl(0x6, 0x2, 0xFFFFFFFFC5EC5AB0) = 0 0
fcntl(0x7, 0x1, 0x0) = 0 0
fcntl(0x7, 0x2, 0xFFFFFFFFC5EC5AB0) = 0 0
fcntl(0x8, 0x1, 0x0) = 0 0
fcntl(0x8, 0x2, 0xFFFFFFFFC5EC5AB0) = 0 0
fcntl(0x9, 0x1, 0x0) = 0 0
fcntl(0x9, 0x2, 0xFFFFFFFFC5EC5AB0) = 0 0
fcntl(0xA, 0x1, 0x0) = 0 0
fcntl(0xA, 0x2, 0xFFFFFFFFC5EC5AB0) = 0 0
and on and on. The calls where the second argument is 0x2 correspond to F_SETFD. Most of them have 0xFFFFFFFFC5EC5AB0 as the third argument, including the one for the TCP socket.
For comparison I wrote a simple Python program:
import fcntl
import time

with open('foo.txt') as foo:
    time.sleep(10)
    print('about to fcntl')
    flags = fcntl.fcntl(foo.fileno(), fcntl.F_GETFD)
    fcntl.fcntl(foo.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
    print('finished fcntl')
    time.sleep(10)
I confirmed that after the F_SETFD the file descriptor had CX set in lsof +f g as expected. The dtruss -t fcntl output here was:
fcntl(0x3, 0x1, 0x0) = 0 0
fcntl(0x3, 0x2, 0x1) = 0 0
So clearly Jenkins is in the wrong here by calling F_SETFD with 0xFFFFFFFFC5EC5AB0 as the flags. It should be 0x1.
Funnily enough I got this working very easily with JNR (which I am trying to get rid of in core...):
diff --git a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
index ca8f50b9f5..a68bc4ef7b 100644
--- a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
+++ b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
@@ -26,18 +26,19 @@ package hudson.lifecycle;
 
 import static hudson.util.jna.GNUCLibrary.FD_CLOEXEC;
 import static hudson.util.jna.GNUCLibrary.F_GETFD;
-import static hudson.util.jna.GNUCLibrary.F_SETFD;
 import static hudson.util.jna.GNUCLibrary.LIBC;
 
 import com.sun.jna.Native;
 import com.sun.jna.StringArray;
 import hudson.Platform;
+import hudson.os.PosixAPI;
 import java.io.IOException;
 import java.util.List;
 import java.util.logging.Level;
 import java.util.logging.Logger;
 import jenkins.model.Jenkins;
 import jenkins.util.JavaVMArguments;
+import jnr.constants.platform.Fcntl;
 
 /**
  * {@link Lifecycle} implementation when Hudson runs on the embedded
@@ -78,7 +78,7 @@ public class UnixLifecycle extends Lifecycle {
         for (int i = 3; i < sz; i++) {
             int flags = LIBC.fcntl(i, F_GETFD);
             if (flags < 0) continue;
-            LIBC.fcntl(i, F_SETFD, flags | FD_CLOEXEC);
+            PosixAPI.jnr().fcntlInt(i, Fcntl.F_SETFD, flags | FD_CLOEXEC);
         }
 
         // exec to self
With that, java -Djava.net.preferIPv4Stack=true -jar war/target/jenkins.war is fully working and the restart succeeds.
To summarize, there are two separate bugs here:
- When running without -Djava.net.preferIPv4Stack=true, calls to Java_java_net_Inet6AddressImpl_lookupAllHostAddr call necp_open, which creates an NPOLICY file descriptor, and if we then set FD_CLOEXEC on that file descriptor the process crashes. I am not sure offhand what the best way to deal with this is. Perhaps we should just skip those file descriptors? But we would need to make sure they don't cause problems after exec'ing. Needs more investigation. In any case, -Djava.net.preferIPv4Stack=true works around the problem.
- Our call to fcntl(fd, F_SETFD, ...) has a bogus flags argument. The JNR code above shows one way to fix this. I'm sure this can also be done with JNA; I am just too tired right now to come up with the JNA version.
Of note is that fcntl is a variadic function whose flags argument is an unsigned 64-bit integer. There is probably something wrong with our JNA signatures. Ironically, when I do a Google search for "fcntl jna", all I can find is the (incorrect) code in Jenkins.
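For what it's worth, one possible shape of a JNA-side fix would be to declare the trailing fcntl argument as Java varargs so that JNA marshals it with the variadic calling convention; this is only a sketch under that assumption, not the actual Jenkins change. The constant values are the macOS ones from <sys/fcntl.h>, and it assumes JNA 5.x (Native.load):

import com.sun.jna.Library;
import com.sun.jna.Native;

// Hypothetical minimal binding for illustration; not Jenkins' GNUCLibrary.
public interface CLibrary extends Library {
    CLibrary LIBC = Native.load("c", CLibrary.class);

    int F_GETFD = 1;     // from macOS <sys/fcntl.h>
    int F_SETFD = 2;
    int FD_CLOEXEC = 1;

    // fcntl(2) is variadic; on Apple Silicon, variadic and fixed arguments are
    // passed differently, so declaring the tail as Java varargs lets JNA use
    // the variadic convention instead of passing garbage as the flags value.
    int fcntl(int fd, int command, Object... args);
}

// Usage, mirroring the close-on-exec loop:
//   int flags = CLibrary.LIBC.fcntl(fd, CLibrary.F_GETFD);
//   CLibrary.LIBC.fcntl(fd, CLibrary.F_SETFD, flags | CLibrary.FD_CLOEXEC);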
A few years later, I looked at this again and can't reproduce the original problem on OpenJDK 11 (Homebrew) on my macOS system, but I could reproduce an fcntl issue in a unit test, fixed in https://github.com/jenkinsci/jenkins/pull/9026. Perhaps someone who is still experiencing the issue can verify that https://github.com/jenkinsci/jenkins/pull/9026 resolves the problem, or at least does not make things any worse.
This ticket contains a number of different problems, and is no longer reproducible for me end-to-end, although one specific problem previously identified has been fixed toward 2.449. To facilitate efficient troubleshooting, I am closing this ticket. If issues persist, please file new ticket(s) as necessary with specific symptoms.
In my case, it turned out that when Jenkins was installed via Homebrew but not upgraded via Homebrew (i.e. it was upgraded through the Jenkins UI), Jenkins would not restart. Once we updated via Homebrew past a certain version (can't remember which), everything just started working. Having two different upgrade paths to keep up w/the changelog is a bit confusing. Feels like we should force Homebrew installs to update via Homebrew?
> Having two different upgrade paths to keep up w/the changelog is a bit confusing. Feels like we should force Homebrew installs to update via Homebrew?
The Homebrew package could remove write permission on jenkins.war, then Jenkins will not offer to update itself. Note that that package is not maintained by the Jenkins project.
Or someone could contribute launchd integration that parallels the systemd integration (waiting until the service is fully running before the start command finishes, restarting by exiting a specific exit code to let the service manager restart the process, etc.). In any case, that would be a topic for a separate ticket.
This seems to be fixed in version 2.299, though it's not documented.