Type: Bug
Resolution: Fixed
Priority: Minor
Jenkins: 2.298
OS: Mac OS X - 11.0
---
jsch:0.1.55.2
ws-cleanup:0.39
blueocean-commons:1.24.7
mercurial:2.14
echarts-api:5.1.0-2
structs:1.23
sse-gateway:1.24
conditional-buildstep:1.4.1
apache-httpcomponents-client-4-api:4.5.13-1.0
parameterized-trigger:2.40
sshd:3.0.3
pipeline-model-extensions:1.8.4
build-with-parameters:1.5.1
workflow-aggregator:2.6
mailer:1.34
git:4.7.1
handy-uri-templates-2-api:2.1.8-1.0
theme-manager:0.6
bootstrap4-api:4.6.0-3
jquery3-api:3.6.0-1
delivery-pipeline-plugin:1.4.2
command-launcher:1.6
dark-theme:0.0.12
workflow-api:2.42
docker-build-publish:1.3.3
extended-choice-parameter:0.82
workflow-job:2.40
ssh-credentials:1.18.2
authentication-tokens:1.4
blueocean-rest-impl:1.24.7
github-branch-source:2.10.4
htmlpublisher:1.25
javadoc:1.6
workflow-cps-global-lib:2.19
blueocean-web:1.24.7
jackson2-api:2.12.3
disable-failed-job:1.15
ssh-slaves:1.31.7
docker-workflow:1.26
pipeline-stage-tags-metadata:1.8.4
blueocean-pipeline-scm-api:1.24.7
pipeline-milestone-step:1.3.2
credentials:2.3.19
docker-java-api:3.1.5.2
cloudbees-bitbucket-branch-source:2.9.9
jquery:1.12.4-1
github:1.33.1
lockable-resources:2.10
jquery-detached:1.2.1
blueocean-personalization:1.24.7
workflow-scm-step:2.12
matrix-auth:2.6.6
matrix-project:1.18
pipeline-stage-step:2.5
pipeline-build-step:2.13
antisamy-markup-formatter:2.1
pipeline-input-step:2.12
bouncycastle-api:2.20
checks-api:1.7.0
handlebars:3.0.8
blueocean:1.24.7
variant:1.4
momentjs:1.1.1
blueocean-jwt:1.24.7
plain-credentials:1.7
docker-commons:1.17
aws-java-sdk:1.11.995
docker-plugin:1.2.2
jobConfigHistory:2.27
copyartifact:1.46
git-client:3.7.1
timestamper:1.13
gradle:1.36
pipeline-rest-api:2.19
workflow-basic-steps:2.23
github-api:1.123
blueocean-i18n:1.24.7
ldap:2.6
blueocean-events:1.24.7
jaxb:2.3.0.1
blueocean-core-js:1.24.7
maven-plugin:3.10
aws-credentials:1.29
thinBackup:1.10
blueocean-config:1.24.7
blueocean-github-pipeline:1.24.7
font-awesome-api:5.15.3-2
credentials-binding:1.24
pipeline-model-definition:1.8.4
config-file-provider:3.8.0
pipeline-stage-view:2.19
sonar:2.13.1
token-macro:2.15
slack:2.47
blueocean-display-url:2.4.1
workflow-multibranch:2.23
script-security:1.76
git-server:1.9
snakeyaml-api:1.27.0
docker-build-step:2.7
discard-old-build:1.05
ssh-agent:1.22
workflow-step-api:2.23
run-condition:1.5
rebuild:1.32
okhttp-api:3.14.9
pipeline-graph-analysis:1.10
blueocean-git-pipeline:1.24.7
pipeline-model-api:1.8.4
plugin-util-api:2.2.0
jenkins-design-language:1.24.7
windows-slaves:1.8
workflow-cps:2.90
popper-api:1.16.1-2
blueocean-autofavorite:1.2.4
workflow-durable-task-step:2.38
trilead-api:1.0.13
branch-api:2.6.4
jdk-tool:1.5
whitesource:21.1.2
cloudbees-folder:6.15
blueocean-pipeline-editor:1.24.7
Parameterized-Remote-Trigger:3.1.5.1
blueocean-dashboard:1.24.7
jjwt-api:0.11.2-9.c8b45b8bb173
durable-task:1.36
kpp-management-plugin:1.0.0
junit:1.49
pam-auth:1.6
pubsub-light:1.14
next-build-number:1.6
scm-api:2.6.4
blueocean-pipeline-api-impl:1.24.7
ace-editor:1.1
display-url-api:2.3.4
workflow-support:3.8
resource-disposer:0.15
blueocean-rest:1.24.7
build-timeout:1.20
favorite:2.3.3
versionnumber:1.9
blueocean-bitbucket-pipeline:1.24.7
mapdb-api:1.0.9.0
git-parameter:0.9.13
Also reproducible with a minimal set of plugins and a clean install.
Installed with homebrew, running as "jenkins" command (not a brew service).
Local instance behind a firewall and accessed through the machine's IP.
Released As: 2.449
This issue appeared after migrating our Jenkins instance from a Macbook to a new Mac mini (M1). I could also reproduce it with a clean install on the M1 machine.
It has been present in the last few versions (at least).
Steps to reproduce:
- Go to http://[JENKINS_URL]/updateCenter/
- Check "Restart Jenkins when installation is complete and no jobs are running" (regardless of whether there's an upgrade or not)
- Observe what happens in the terminal: after 10 seconds, the Jenkins process is killed instead of restarted.
This is what the log looks like:
2021-06-17 08:27:47.284+0000 [id=1066] INFO jenkins.model.Jenkins$20#run: Restart in 10 seconds
2021-06-17 08:27:57.287+0000 [id=1066] INFO jenkins.model.Jenkins$20#run: Restarting VM as requested by admin
2021-06-17 08:27:57.287+0000 [id=1066] INFO jenkins.model.Jenkins#cleanUp: Stopping Jenkins
2021-06-17 08:27:57.339+0000 [id=1066] INFO jenkins.model.Jenkins$16#onAttained: Started termination
2021-06-17 08:27:57.370+0000 [id=1066] INFO jenkins.model.Jenkins$16#onAttained: Completed termination
2021-06-17 08:27:57.370+0000 [id=1066] INFO jenkins.model.Jenkins#_cleanUpDisconnectComputers: Starting node disconnection
2021-06-17 08:27:57.379+0000 [id=1066] INFO jenkins.model.Jenkins#_cleanUpShutdownPluginManager: Stopping plugin manager
2021-06-17 08:27:57.440+0000 [id=1066] INFO jenkins.model.Jenkins#_cleanUpPersistQueue: Persisting build queue
2021-06-17 08:27:57.443+0000 [id=1066] INFO jenkins.model.Jenkins#_cleanUpAwaitDisconnects: Waiting for node disconnection completion
2021-06-17 08:27:57.443+0000 [id=1066] INFO jenkins.model.Jenkins#cleanUp: Jenkins stopped
[1] 72539 killed jenkins
Jenkins runs fine when manually started again (with "jenkins" command).
[JENKINS-65911] Jenkins shuts down instead of restarting on Mac M1
I've been debugging it.
It fails on this line:
https://github.com/jenkinsci/jenkins/blob/195dacbf556fdf0a393d5d139df27ffba3a82b11/core/src/main/java/hudson/lifecycle/UnixLifecycle.java#L80
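For context, the loop around that line (quoted in the patches further down) does roughly the following; LIBC, F_GETFD, F_SETFD, and FD_CLOEXEC are static imports from hudson.util.jna.GNUCLibrary:

// Paraphrased sketch of UnixLifecycle.restart(): mark every descriptor except
// stdin/stdout/stderr as close-on-exec, then exec the same JVM command line.
int sz = LIBC.getdtablesize();
for (int i = 3; i < sz; i++) {
    int flags = LIBC.fcntl(i, F_GETFD);
    if (flags < 0) continue;                    // not an open file descriptor
    LIBC.fcntl(i, F_SETFD, flags | FD_CLOEXEC); // this is the call that crashes
}
// exec to self; the kernel closes all FD_CLOEXEC descriptors during exec
String exe = args.get(0);
LIBC.execvp(exe, new StringArray(args.toArray(new String[0])));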
The file descriptor it fails on varies between runs, normally between 820 and 840.
Catching the error doesn't help:
try {} catch (Error e)
Not sure if basil would have any ideas on what to try from here =/
timja Unfortunately I don't have access to a macOS system to do real debugging on, and debugging by telegram isn't very efficient. But I can try to help.
The logic makes sense at a high level: we are about to exec, so to avoid leaking open file handles we need to set the close-on-exec flag for all open file descriptors. That this is not only failing in the first place but actually crashing the process is quite surprising.
The most direct form of analysis would be to work our way backwards from the crash. Do you have a stack trace, registers, or any other diagnostic information for the crash? I don't really know where macOS puts these things, but this page seems to have some answers.
Another piece of interesting information would be which file descriptor we are trying to set the close-on-exec flag on just prior to the crash. If you can put a breakpoint at the line you mentioned, run lsof to list the process's open file descriptors, and keep going until it crashes, it would be useful to know what type of file descriptor caused the crash: e.g. was it a socket or a regular file, and if it was a regular file, what was its path?
Thanks, I'll add some more debug info. Currently I've just got the ID; there's no stack trace.
Interestingly, there was a crash log in 'Crash reports':
Exception Type: EXC_GUARD Exception Codes: GUARD_TYPE_FD
Looks like some Apple protection on one of the file descriptors.
Full log at: https://gist.github.com/timja/0167c76924c5e44a09c64a545c87ef8b
The thread always crashes at:
Thread 64 Crashed:: Java: restart thread
0 libsystem_kernel.dylib 0x000000018c105964 __fcntl + 8
1 libsystem_kernel.dylib 0x000000018c105930 fcntl + 88
Apple docs: https://developer.apple.com/documentation/xcode/understanding-the-exception-types-in-a-crash-report
This looks to be the correct fix; there's no need to set `FD_CLOEXEC` on a file descriptor that already has it set:
diff --git a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
index 391998cdff..97288d5fad 100644
--- a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
+++ b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
@@ -76,7 +76,9 @@ public class UnixLifecycle extends Lifecycle {
         int sz = LIBC.getdtablesize();
         for(int i=3; i<sz; i++) {
             int flags = LIBC.fcntl(i, F_GETFD);
-            if(flags<0) continue;
+            if (flags < 0 || flags == FD_CLOEXEC) {
+                continue;
+            }
             LIBC.fcntl(i, F_SETFD,flags| FD_CLOEXEC);
         }
It gets further after the above; it now fails at:
2021-10-04 21:48:05.292+0000 [id=1] INFO winstone.Logger#logInternal: Jetty shutdown successfully
java.io.IOException: Failed to start Jetty
    at winstone.Launcher.<init>(Launcher.java:194)
    at winstone.Launcher.main(Launcher.java:369)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at Main._main(Main.java:311)
    at Main.main(Main.java:114)
Caused by: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
    at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
    at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
    at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at org.eclipse.jetty.server.Server.doStart(Server.java:401)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at winstone.Launcher.<init>(Launcher.java:192)
    ... 7 more
Caused by: java.net.BindException: Address already in use
    at java.base/sun.nio.ch.Net.bind0(Native Method)
    at java.base/sun.nio.ch.Net.bind(Net.java:455)
    at java.base/sun.nio.ch.Net.bind(Net.java:447)
    at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
    at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
    ... 14 more
2021-10-04 21:48:05.292+0000 [id=1] SEVERE winstone.Logger#logInternal: Container startup failed
java.net.BindException: Address already in use
    at java.base/sun.nio.ch.Net.bind0(Native Method)
    at java.base/sun.nio.ch.Net.bind(Net.java:455)
    at java.base/sun.nio.ch.Net.bind(Net.java:447)
    at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
    at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
Caused: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
    at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
    at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
    at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at org.eclipse.jetty.server.Server.doStart(Server.java:401)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at winstone.Launcher.<init>(Launcher.java:192)
Caused: java.io.IOException: Failed to start Jetty
    at winstone.Launcher.<init>(Launcher.java:194)
    at winstone.Launcher.main(Launcher.java:369)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at Main._main(Main.java:311)
    at Main.main(Main.java:114)
Any thoughts on where to check for that?
This might be where it's supposed to be shut down: https://github.com/jenkinsci/winstone/blob/825889d1a93a49feabca297308fff7ecafba70cd/src/main/java/winstone/ShutdownHook.java#L15
timja Good insight! I think you're almost there, but your fix isn't quite right. The flags value is a bit field, so you can't check whether FD_CLOEXEC is set with flags == FD_CLOEXEC – that check probably misses a lot of descriptors, including open sockets, which is probably why you then get the "Address already in use" error. You want to use bitwise arithmetic to check whether the FD_CLOEXEC bit is set in the flags while ignoring the other bits that you don't care about.
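In code, the check being suggested is something like this (a sketch of the bitwise test only, not a tested patch):

int flags = LIBC.fcntl(i, F_GETFD);
// (flags & FD_CLOEXEC) tests just the close-on-exec bit; flags == FD_CLOEXEC
// only matches descriptors whose entire flag word happens to equal 1.
if (flags < 0 || (flags & FD_CLOEXEC) != 0) {
    continue; // invalid FD, or already marked close-on-exec
}
LIBC.fcntl(i, F_SETFD, flags | FD_CLOEXEC);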
timja Come to think of it, you might not be almost there after all, because I see no evidence that whatever file descriptor macOS is freaking out about has the FD_CLOEXEC already set on it – that seems like an assumption on your part based on a faulty understanding of how the bit is represented in the flags. I don't see any reason why your theory would be false but I also don't see any reason why it would be true. So back to the drawing board...
What is clear to me is that this is some weird Apple thing where certain file descriptors are "guarded resources": "the operating system marks a file descriptor as guarded so that normal file descriptor APIs can't modify them." So when we try to set FD_CLOEXEC on one, the OS "helpfully" crashes the process rather than trusting that we know what we're doing. I had initially been worried that we were hitting some JNA issue, but that seems unlikely; we're bumping into some macOS restriction. My main question is: what file is so special that macOS forbids us from closing it? Can you run lsof -p ${PID} and let me know what file handle is so sacred that Apple does not allow developers to close it?
It's not very informative...
java 37417 timja 844 NPOLICY
The only reference I found was someone asking what it is:
https://developer.apple.com/forums/thread/659475
The evidence I saw was from logging the flags returned by 'F_GETFD' and seeing that the failing FD had '1' returned, which is FD_CLOEXEC from what I understand, but I haven't had to do bitwise arithmetic in a long time, so that's likely a very faulty assumption.
(There was also one other FD with flag 1, which was the preceding FD.)
Maybe you'd have more luck figuring out what this weird file descriptor is with DTrace. sudo opensnoop -p ${PID} will log every file descriptor as it's opened, giving you a bunch of information about it, like its file descriptor number and its path. It's based on DTrace, so you'll have to disable System Integrity Protection (SIP) in order to use it. (It's this kind of thing that drove me nuts, which is why I eventually quit using macOS. Now I run Linux and can do system-level tracing to my heart's content.)
timja You're absolutely sure that the file descriptor we're crashing on while trying to set FD_CLOEXEC is of type NPOLICY? Check out this page:
A major addition to Darwin's network stack are Network Extension Control Policies (NECPs), added to Darwin 14. NECPs are described in bsd/net/necp.c as "..high-level policy sessions, which are ingested into low-level kernel policies that control and tag traffic at the application, socket, and IP layers". In other words, NECPs enable user mode programs to control the kernel's network routing and scheduling decisions. Naturally, these also allow QoS.
[…]
Starting with Darwin 16, just about every network-enabled process in the system uses NECPs, oftentimes without the developer even knowing what they are. This is because libnetwork.dylib calls necp_open() (#501) as part of its initialization (specifically, from nw_endpoint_handler_start). This creates a necp client, a file descriptor of type NPOLICY, which is readily visible in the output of lsof(1) or procexp ..fds. The descriptor does not offer the traditional operations (read(2)/write(2)/ioctl(2)), and only supports select(2), or use in a kqueue.
[…]
Darwin 17 extends the idea of NECP client descriptors, and adds the NECP session (also an NPOLICY file descriptor^*^). These descriptors are created with necp_session_open (#522), and support just the close(2) operation (which deletes the associated session).
So it looks like an NPOLICY file descriptor could be either an NECP client or an NECP session. Now, if trying to set FD_CLOEXEC on one of these is triggering a EXC_GUARD/GUARD_TYPE_FD "guarded resource protection", then my next question would be: who created the NECP client/session, and why is it guarded? More importantly, how can we get it to release the resource before we exec?
> You're absolutely sure that the file descriptor we're crashing trying to set FD_CLOEXEC on is of type NPOLICY?
Sure is a strong word here; this sort of thing is quite outside my comfort zone, but yes, it appears so.
These are the related entries from the last restart I just did:
java 40095 timja 840u IPv6 0x7a007b86cdf33161 0t0 TCP localhost:http-alt->localhost:58463 (ESTABLISHED)
java 40095 timja 841u IPv6 0x7a007b86c2092461 0t0 TCP *:ndmp (LISTEN)
java 40095 timja 842u systm 0x7a007b936831da81 0t0 [ctl com.apple.netsrc id 6 unit 26]
java 40095 timja 843u unix 0x7a007b86a4514369 0t0 ->0x7a007b86a450efd1
java 40095 timja 844 NPOLICY
java 40095 timja 845u IPv6 0x7a007b86c208eb01 0t0 TCP 192.168.4.151:58480->20.190.160.136:https (ESTABLISHED)
java 40095 timja 846u IPv6 0x7a007b86c208f7c1 0t0 TCP 192.168.4.151:58484->20.190.160.136:https (ESTABLISHED)
java 40095 timja 847u IPv6 0x7a007b86c2091141 0t0 TCP 192.168.4.151:58488->40.126.41.96:https (ESTABLISHED)
java 40095 timja 848u IPv6 0x7a007b86cdf36461 0t0 TCP localhost:http-alt->localhost:58490 (ESTABLISHED)
java 40095 timja 849u IPv6 0x7a007b86cdf337c1 0t0 TCP localhost:http-alt->localhost:58491 (ESTABLISHED)
java 40095 timja 850u IPv6 0x7a007b86cdf34481 0t0 TCP localhost:http-alt->localhost:58493 (ESTABLISHED)
java 40095 timja 851u IPv6 0x7a007b86cdaebe21 0t0 TCP localhost:http-alt->localhost:58494 (ESTABLISHED)
java 40095 timja 852u IPv6 0x7a007b86cdaed7a1 0t0 TCP localhost:http-alt->localhost:58495 (ESTABLISHED)
and the relevant part of the crash log:
Exception Type:    EXC_GUARD
Exception Codes:   0x400000040000034c, 0xe20c7f41b3616e04
Exception Subtype: GUARD_TYPE_FD, id=0xe20c7f41b3616e04, fd=844, flavor=0x00000004 (NOCLOEXEC)
This is the result from opensnoop: https://gist.github.com/timja/65e67174af7a3a3f2e721bbeef2a9c5f
On this run the FD that caused it to crash was 847, and was of type 'NPOLICY'
Interestingly, I found that the file descriptor is only created after logging in to Jenkins.
If you disable auth and restart, the process doesn't get killed,
and after some more digging it seems the azure-ad plugin is opening that file descriptor somehow.
The built-in security realm doesn't have the problem.
But the restart doesn't work either way; it still hits the following:
2021-10-05 07:26:21.441+0000 [id=1] SEVERE winstone.Logger#logInternal: Container startup failed
java.net.BindException: Address already in use
    at java.base/sun.nio.ch.Net.bind0(Native Method)
    at java.base/sun.nio.ch.Net.bind(Net.java:455)
    at java.base/sun.nio.ch.Net.bind(Net.java:447)
    at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
    at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
Caused: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
    at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
    at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
    at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at org.eclipse.jetty.server.Server.doStart(Server.java:401)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at winstone.Launcher.<init>(Launcher.java:192)
Caused: java.io.IOException: Failed to start Jetty
    at winstone.Launcher.<init>(Launcher.java:194)
    at winstone.Launcher.main(Launcher.java:369)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at Main._main(Main.java:311)
    at Main.main(Main.java:114)
timja Setting aside the Azure AD issue, it is interesting that with the built-in security realm the restart fails with java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080. I looked into how this works (correctly) on my Linux system. Before restarting Jenkins, here are the open TCP file descriptors:
$ sudo lsof -P -p 3916237 | grep TCP
java 3916237 basil   5u IPv6 5821933 0t0 TCP localhost:8080->localhost:49530 (ESTABLISHED)
java 3916237 basil   6u IPv4 5807038 0t0 TCP localhost:5005->localhost:48452 (ESTABLISHED)
java 3916237 basil 120u IPv6 5812861 0t0 TCP *:8080 (LISTEN)
The call to LIBC.fcntl(i, F_SETFD,flags | FD_CLOEXEC) marks them as CLOEXEC, then the call to LIBC.execvp execs to self, during which the file descriptors are closed and the socket released. During startup there are no open TCP file descriptors, so Jetty has no problem binding to the port.
At what point is this going off the rails on M1?
Note that winstone.Launcher#shutdown isn't executed in this process, since this isn't a "regular" JVM shutdown where the shutdown hooks are executed. Whatever Winstone/Jetty might be doing to gracefully shut down, I can't see how closing all file descriptors and calling exec would do any less than that - there's really not much left of the process once its program text and open file handles are gone. Unless there's some weird Apple resource that winstone.Launcher#shutdown releases which isn't a file descriptor or in the process's address space - which would be quite strange.
I can reproduce the problem on an M1 system with a stock Jenkins installation. Tracing the process with dtruss(1) I cannot see any calls to necp_session_open, but I see a call to necp_open:
927/0x364b: 3733 12 8 necp_open(0x0, 0x0, 0x0) = 3 0
0x18e4d1b78
0x1926e07ec
0x1926df85c
0x18f1429c8
0x102ee0450
0x102ede014
0x102eddf68
0x102eddea0
0x18e25c1d8
0x18e25abf8
0x18e34ac6c
0x18e366f68
0x18e352208
0x18e367cb8
0x18e35d708
0x18e505304
0x18e504018
For some reason DTrace's ustack() isn't resolving symbols, but I got the symbols from lldb(1):
target create /opt/homebrew/opt/openjdk@11/bin/java
process launch --stop-at-entry -- -jar war/target/jenkins.war
process handle -p true -s false SIGBUS
process handle -p true -s false SIGSEGV
process handle -p true -s false SIGSTOP
break set -n necp_open -s libsystem_kernel.dylib
continue
bt
* thread #53, name = 'Java: GrapeHack.hack', stop reason = breakpoint 1.1
  * frame #0: 0x000000018e4d1b70 libsystem_kernel.dylib`necp_open
    frame #1: 0x0000000192715b2c libnetwork.dylib`nw_path_shared_necp_fd + 220
    frame #2: 0x00000001926e07ec libnetwork.dylib`nw_path_evaluator_evaluate + 540
    frame #3: 0x00000001926df85c libnetwork.dylib`nw_path_create_evaluator_for_endpoint + 108
    frame #4: 0x0000000192e1e8cc libnetwork.dylib`nw_nat64_v4_address_requires_synthesis + 252
    frame #5: 0x000000018e5298e4 libsystem_info.dylib`si_addrinfo + 2308
    frame #6: 0x000000018e528f34 libsystem_info.dylib`getaddrinfo + 168
    frame #7: 0x00000001021702c4 libnet.dylib`Java_java_net_Inet6AddressImpl_lookupAllHostAddr + 144
    frame #8: 0x000000011f90bbfc
    frame #9: 0x000000011f906fc8
    frame #10: 0x000000011f906fc8
    frame #11: 0x000000011f906ee0
    frame #12: 0x000000011f906fc8
    frame #13: 0x000000011f906ee0
    frame #14: 0x000000011f906ee0
    frame #15: 0x000000011f906ee0
    frame #16: 0x000000011f90719c
    frame #17: 0x000000011f90719c
    frame #18: 0x000000011f90719c
    frame #19: 0x000000011f900144
    frame #20: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
    frame #21: 0x00000001028d4898 libjvm.dylib`invoke(InstanceKlass*, methodHandle const&, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 1796
    frame #22: 0x00000001028d4d8c libjvm.dylib`Reflection::invoke_constructor(oopDesc*, objArrayHandle, Thread*) + 392
    frame #23: 0x00000001026d0554 libjvm.dylib`JVM_NewInstanceFromConstructor + 232
    frame #24: 0x00000001274e1360
    frame #25: 0x00000001203c8c84
    frame #26: 0x000000011f906ee0
    frame #27: 0x000000011f906ee0
    frame #28: 0x000000011f906ee0
    frame #29: 0x000000011f90719c
    frame #30: 0x000000011f9071f4
    frame #31: 0x000000011f9071f4
    frame #32: 0x00000001200a6434
    frame #33: 0x000000011f906ab0
    frame #34: 0x000000011f9067e4
    frame #35: 0x000000011f90678c
    frame #36: 0x000000011f9071f4
    frame #37: 0x000000011f90719c
    frame #38: 0x000000011f90719c
    frame #39: 0x000000011f90719c
    frame #40: 0x000000011f90719c
    frame #41: 0x000000011f90719c
    frame #42: 0x000000011f90719c
    frame #43: 0x000000011f90719c
    frame #44: 0x000000011f90719c
    frame #45: 0x000000011f906fc8
    frame #46: 0x000000011f906ee0
    frame #47: 0x000000011f906ee0
    frame #48: 0x000000011f906fc8
    frame #49: 0x000000011f900144
    frame #50: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
    frame #51: 0x00000001028d4898 libjvm.dylib`invoke(InstanceKlass*, methodHandle const&, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 1796
    frame #52: 0x00000001028d4d8c libjvm.dylib`Reflection::invoke_constructor(oopDesc*, objArrayHandle, Thread*) + 392
    frame #53: 0x00000001026d0554 libjvm.dylib`JVM_NewInstanceFromConstructor + 232
    frame #54: 0x00000001274e1360
    frame #55: 0x00000001203c8c84
    frame #56: 0x000000011f906ee0
    frame #57: 0x000000011f900144
    frame #58: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
    frame #59: 0x00000001028d4898 libjvm.dylib`invoke(InstanceKlass*, methodHandle const&, Handle, bool, objArrayHandle, BasicType, objArrayHandle, bool, Thread*) + 1796
    frame #60: 0x00000001028d4138 libjvm.dylib`Reflection::invoke_method(oopDesc*, Handle, objArrayHandle, Thread*) + 276
    frame #61: 0x00000001026d038c libjvm.dylib`JVM_InvokeMethod + 508
    frame #62: 0x00000001276a1270
    frame #63: 0x000000012097c16c
    frame #64: 0x000000011f90719c
    frame #65: 0x000000011f9071f4
    frame #66: 0x000000011f90719c
    frame #67: 0x000000011f90719c
    frame #68: 0x000000011f9071f4
    frame #69: 0x000000011f9071f4
    frame #70: 0x000000011f9071f4
    frame #71: 0x000000011f90719c
    frame #72: 0x000000011f9071f4
    frame #73: 0x000000011f900144
    frame #74: 0x0000000102647570 libjvm.dylib`JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*) + 736
    frame #75: 0x00000001026468d0 libjvm.dylib`JavaCalls::call_virtual(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, Thread*) + 236
    frame #76: 0x0000000102646998 libjvm.dylib`JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*) + 100
    frame #77: 0x00000001026cd0dc libjvm.dylib`thread_entry(JavaThread*, Thread*) + 120
    frame #78: 0x0000000102973134 libjvm.dylib`JavaThread::thread_main_inner() + 124
    frame #79: 0x0000000102972f64 libjvm.dylib`JavaThread::run() + 376
    frame #80: 0x0000000102970c88 libjvm.dylib`Thread::call_run() + 120
    frame #81: 0x0000000102874b94 libjvm.dylib`thread_native_entry(Thread*) + 316
    frame #82: 0x000000018e509240 libsystem_pthread.dylib`_pthread_start + 148
So some thread named GrapeHack.hack is calling Java_java_net_Inet6AddressImpl_lookupAllHostAddr, which eventually calls necp_open, which creates a file descriptor with NPOLICY, which we then set FD_CLOEXEC on, which we then crash on when exec'ing.
I looked into disabling IPv6 in Java and tried running Java with -Djava.net.preferIPv4Stack=true. That eliminates the calls to necp_open and the NPOLICY FD:
$ sudo lsof -p `pgrep java` | grep NPOLICY
$
That eliminates the crash, but Jenkins still doesn't restart properly. Now I'm getting:
2022-02-22 06:38:34.228+0000 [id=1] SEVERE winstone.Logger#logInternal: Container startup failed
java.net.BindException: Address already in use
    at java.base/sun.nio.ch.Net.bind0(Native Method)
    at java.base/sun.nio.ch.Net.bind(Net.java:455)
    at java.base/sun.nio.ch.Net.bind(Net.java:447)
    at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
    at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
Caused: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8080
    at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
    at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
    at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
    at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at org.eclipse.jetty.server.Server.doStart(Server.java:401)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
    at winstone.Launcher.<init>(Launcher.java:198)
Caused: java.io.IOException: Failed to start Jetty
    at winstone.Launcher.<init>(Launcher.java:200)
    at winstone.Launcher.main(Launcher.java:376)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at Main._main(Main.java:304)
    at Main.main(Main.java:108)
A mystery for another day.
More late night debugging with java -Djava.net.preferIPv4Stack=true -jar war/target/jenkins.war... after instrumenting Winstone with
diff --git a/src/main/java/winstone/Launcher.java b/src/main/java/winstone/Launcher.java
index 2657c71..2c0b72c 100644
--- a/src/main/java/winstone/Launcher.java
+++ b/src/main/java/winstone/Launcher.java
@@ -195,6 +195,7 @@ public class Launcher implements Runnable {
         }
 
         try {
+            Thread.sleep(10 * 1000L);
             server.start();
         } catch (Exception e) {
             throw new IOException("Failed to start Jetty",e);
I could confirm that after the restart not only was the port still open but all the other files were also still open. It appears as if nothing got closed before the exec. Instrumenting Jenkins with
diff --git a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
index ca8f50b9f5..b6d28ea313 100644
--- a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
+++ b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
@@ -82,6 +82,7 @@ public class UnixLifecycle extends Lifecycle {
         }
 
         // exec to self
+        Thread.sleep(10 * 1000L);
         String exe = args.get(0);
         LIBC.execvp(exe, new StringArray(args.toArray(new String[0])));
         throw new IOException("Failed to exec '" + exe + "' " + LIBC.strerror(Native.getLastError()));
and running sudo lsof +f g -p $PID I could confirm that no file descriptor had the CX (close-on-exec) flag. We are simply failing to set the close-on-exec flag.
Watching the fcntl system calls with dtruss -t fcntl I see:
fcntl(0x3, 0x1, 0x0) = 0 0
fcntl(0x3, 0x2, 0xFFFFFFFFC5EC5B50) = 0 0
fcntl(0x4, 0x1, 0x0) = 0 0
fcntl(0x4, 0x2, 0x29ECA00) = 0 0
fcntl(0x5, 0x1, 0x0) = 0 0
fcntl(0x5, 0x2, 0xFFFFFFFFC5EC5AB0) = 0 0
fcntl(0x6, 0x1, 0x0) = 0 0
fcntl(0x6, 0x2, 0xFFFFFFFFC5EC5AB0) = 0 0
fcntl(0x7, 0x1, 0x0) = 0 0
fcntl(0x7, 0x2, 0xFFFFFFFFC5EC5AB0) = 0 0
fcntl(0x8, 0x1, 0x0) = 0 0
fcntl(0x8, 0x2, 0xFFFFFFFFC5EC5AB0) = 0 0
fcntl(0x9, 0x1, 0x0) = 0 0
fcntl(0x9, 0x2, 0xFFFFFFFFC5EC5AB0) = 0 0
fcntl(0xA, 0x1, 0x0) = 0 0
fcntl(0xA, 0x2, 0xFFFFFFFFC5EC5AB0) = 0 0
and on and on. The calls where the second argument is 0x2 correspond to F_SETFD. Most of them have 0xFFFFFFFFC5EC5AB0 as the third argument, including the one for the TCP socket.
For comparison I wrote a simple Python program:
import fcntl
import time

with open('foo.txt') as foo:
    time.sleep(10)
    print('about to fcntl')
    flags = fcntl.fcntl(foo.fileno(), fcntl.F_GETFD)
    fcntl.fcntl(foo.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
    print('finished fcntl')
    time.sleep(10)
I confirmed that after the F_SETFD the file descriptor had CX set in lsof +f g as expected. The dtruss -t fcntl output here was:
fcntl(0x3, 0x1, 0x0) = 0 0
fcntl(0x3, 0x2, 0x1) = 0 0
So clearly Jenkins is in the wrong here by calling F_SETFD with 0xFFFFFFFFC5EC5AB0 as the flags. It should be 0x1.
Funnily enough I got this working very easily with JNR (which I am trying to get rid of in core...):
diff --git a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
index ca8f50b9f5..a68bc4ef7b 100644
--- a/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
+++ b/core/src/main/java/hudson/lifecycle/UnixLifecycle.java
@@ -26,18 +26,19 @@ package hudson.lifecycle;
 
 import static hudson.util.jna.GNUCLibrary.FD_CLOEXEC;
 import static hudson.util.jna.GNUCLibrary.F_GETFD;
-import static hudson.util.jna.GNUCLibrary.F_SETFD;
 import static hudson.util.jna.GNUCLibrary.LIBC;
 
 import com.sun.jna.Native;
 import com.sun.jna.StringArray;
 import hudson.Platform;
+import hudson.os.PosixAPI;
 import java.io.IOException;
 import java.util.List;
 import java.util.logging.Level;
 import java.util.logging.Logger;
 import jenkins.model.Jenkins;
 import jenkins.util.JavaVMArguments;
+import jnr.constants.platform.Fcntl;
 
 /**
  * {@link Lifecycle} implementation when Hudson runs on the embedded
@@ -78,7 +78,7 @@ public class UnixLifecycle extends Lifecycle {
         for (int i = 3; i < sz; i++) {
             int flags = LIBC.fcntl(i, F_GETFD);
             if (flags < 0) continue;
-            LIBC.fcntl(i, F_SETFD, flags | FD_CLOEXEC);
+            PosixAPI.jnr().fcntlInt(i, Fcntl.F_SETFD, flags | FD_CLOEXEC);
         }
 
         // exec to self
With that, java -Djava.net.preferIPv4Stack=true -jar war/target/jenkins.war is fully working and the restart succeeds.
To summarize, there are two separate bugs here:
- When running without -Djava.net.preferIPv4Stack=true, calls to Java_java_net_Inet6AddressImpl_lookupAllHostAddr call necp_open, which creates an NPOLICY file descriptor, and if we then set FD_CLOEXEC on that file descriptor the process crashes. I am not sure offhand what the best way to deal with this is. Perhaps we should just skip those file descriptors? But we would need to make sure they don't cause problems after exec'ing. Needs more investigation. In any case, -Djava.net.preferIPv4Stack=true works around the problem.
- Our call to fcntl(fd, F_SETFD, ...) has a bogus flags argument. The JNR code above shows one way to fix this. I'm sure this can also be done with JNA; I am just too tired right now to come up with the JNA version.
Of note is that fcntl is a variadic function whose flags argument is an unsigned 64-bit integer. There is probably something wrong with our JNA signatures. Ironically, when I do a Google search for "fcntl jna", all I can find is the (incorrect) code in Jenkins.
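For what it's worth, one possible shape of a JNA-side fix would be to declare the trailing fcntl argument as Java varargs so that JNA marshals it with the variadic calling convention; this is only a sketch under that assumption, not the actual Jenkins change. The constant values are the macOS ones from <sys/fcntl.h>, and it assumes JNA 5.x (Native.load):

import com.sun.jna.Library;
import com.sun.jna.Native;

// Hypothetical minimal binding for illustration; not Jenkins' GNUCLibrary.
public interface CLibrary extends Library {
    CLibrary LIBC = Native.load("c", CLibrary.class);

    int F_GETFD = 1;     // from macOS <sys/fcntl.h>
    int F_SETFD = 2;
    int FD_CLOEXEC = 1;

    // fcntl(2) is variadic; on Apple Silicon, variadic and fixed arguments are
    // passed differently, so declaring the tail as Java varargs lets JNA use
    // the variadic convention instead of passing garbage as the flags value.
    int fcntl(int fd, int command, Object... args);
}

// Usage, mirroring the close-on-exec loop:
//   int flags = CLibrary.LIBC.fcntl(fd, CLibrary.F_GETFD);
//   CLibrary.LIBC.fcntl(fd, CLibrary.F_SETFD, flags | CLibrary.FD_CLOEXEC);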
A few years later, I looked at this again and can't reproduce the original problem on OpenJDK 11 (Homebrew) on my macOS system, but I could reproduce an fcntl issue in a unit test, fixed in https://github.com/jenkinsci/jenkins/pull/9026. Perhaps someone who is still experiencing the issue can verify that https://github.com/jenkinsci/jenkins/pull/9026 resolves the problem, or at least does not make things any worse.
This ticket contains a number of different problems, and is no longer reproducible for me end-to-end, although one specific problem previously identified has been fixed toward 2.449. To facilitate efficient troubleshooting, I am closing this ticket. If issues persist, please file new ticket(s) as necessary with specific symptoms.
In my case, it turned out that when Jenkins was installed via Homebrew but not upgraded via Homebrew (i.e. it was upgraded through the Jenkins UI), Jenkins would not restart. Once we updated via Homebrew past a certain version (can't remember which), everything just started working. Having two different upgrade paths to keep up w/the changelog is a bit confusing. Feels like we should force Homebrew installs to update via Homebrew?
> Having two different upgrade paths to keep up w/the changelog is a bit confusing. Feels like we should force Homebrew installs to update via Homebrew?
The Homebrew package could remove write permission on jenkins.war, then Jenkins will not offer to update itself. Note that that package is not maintained by the Jenkins project.
Or someone could contribute launchd integration that parallels the systemd integration (waiting until the service is fully running before the start command finishes, restarting by exiting a specific exit code to let the service manager restart the process, etc.). In any case, that would be a topic for a separate ticket.
This seems to be fixed in version 2.299, though it's not documented.