-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
Jenkins 2.73.3 LTS
org.jenkins-ci.main:jenkins-war:2.73.3
org.jenkins-ci:crypto-util:1.1
commons-httpclient:commons-httpclient:3.1-jenkins-1
org.jenkins-ci.main:jenkins-core:2.73.3
aopalliance:aopalliance:1.0
com.google.inject:guice:4.0
org.springframework:spring-dao:1.2.9
org.jenkins-ci.modules:instance-identity:2.1
javax.servlet:jstl:1.1.0
org.jenkins-ci:constant-pool-scanner:1.2
org.connectbot.jbcrypt:jbcrypt:1.0.0
org.jenkins-ci.modules:ssh-cli-auth:1.4
org.jenkins-ci.modules:windows-slave-installer:1.9.1
org.jenkins-ci.modules:sshd:2.0
org.ow2.asm:asm-commons:5.0.3
org.jenkins-ci:symbol-annotation:1.1
com.github.jnr:jnr-constants:0.9.8
commons-digester:commons-digester:2.1
commons-io:commons-io:2.4
org.kohsuke:trilead-putty-extension:1.2
org.kohsuke:libzfs:0.8
org.kohsuke.stapler:stapler:1.250
org.kohsuke.stapler:stapler-groovy:1.250
org.jenkins-ci.ui:jquery-detached:1.2
org.kohsuke.stapler:json-lib:2.4-jenkins-2
org.jvnet.robust-http-client:robust-http-client:1.2
org.ow2.asm:asm:5.0.3
com.google.code.findbugs:jsr305:1.3.9
net.java.sezpoz:sezpoz:1.12
org.kohsuke.stapler:stapler-adjunct-timeline:1.5
org.jenkins-ci:winstone:4.1
org.jenkins-ci:version-number:1.4
org.codehaus.groovy:groovy-all:2.4.11
org.jvnet.hudson:commons-jelly-tags-define:1.0.1-hudson-20071021
org.jenkins-ci:jmdns:3.4.0-jenkins-3
commons-lang:commons-lang:2.6
org.springframework:spring-jdbc:1.2.9
org.codehaus.woodstox:wstx-asl:3.2.9
org.springframework:spring-core:2.5.6.SEC03
org.springframework:spring-aop:2.5.6.SEC03
org.samba.jcifs:jcifs:1.3.17-kohsuke-1
org.jenkins-ci:bytecode-compatibility-transformer:1.8
com.sun.solaris:embedded_su4j:1.1
javax.inject:javax.inject:1
org.jenkins-ci.modules:upstart-slave-installer:1.1
org.apache.commons:commons-compress:1.10
commons-beanutils:commons-beanutils:1.8.3
org.jvnet.localizer:localizer:1.24
org.fusesource.jansi:jansi:1.11
org.springframework:spring-beans:2.5.6.SEC03
javax.xml.stream:stax-api:1.0-2
org.jvnet.hudson:activation:1.1.1-hudson-1
org.jenkins-ci.main:cli:2.73.3
commons-jelly:commons-jelly-tags-fmt:1.0
net.i2p.crypto:eddsa:0.2.0
jfree:jfreechart:1.0.9
org.jenkins-ci:task-reactor:1.4
org.apache.ant:ant-launcher:1.8.4
org.apache.sshd:sshd-core:1.6.0
oro:oro:2.0.8
org.jenkins-ci:commons-jexl:1.1-jenkins-20111212
org.kohsuke:access-modifier-annotation:1.11
org.slf4j:slf4j-api:1.7.7
org.jenkins-ci.plugins.icon-shim:icon-set:1.0.5
stax:stax-api:1.0.1
org.kohsuke:windows-package-checker:1.2
org.acegisecurity:acegi-security:1.0.7
commons-fileupload:commons-fileupload:1.3.1-jenkins-2
org.jenkins-ci.modules:launchd-slave-installer:1.2
org.jenkins-ci:annotation-indexer:1.12
org.kohsuke:libpam4j:1.8
jline:jline:2.12
com.github.jnr:jffi:1.2.15
org.kohsuke.stapler:stapler-adjunct-zeroclipboard:1.3.5-1
org.kohsuke.stapler:stapler-jelly:1.250
org.kohsuke.stapler:stapler-adjunct-codemirror:1.3
org.ow2.asm:asm-util:5.0.3
org.kohsuke:akuma:1.10
javax.mail:mail:1.4.4
org.hamcrest:hamcrest-core:1.3
jfree:jcommon:1.0.12
org.springframework:spring-context-support:2.5.6.SEC03
org.slf4j:jcl-over-slf4j:1.7.7
com.google.guava:guava:11.0.1
org.jvnet.hudson:jtidy:4aug2000r7-dev-hudson-1
org.jenkins-ci:commons-jelly:1.1-jenkins-20120928
org.jenkins-ci.ui:handlebars:1.1.1
org.springframework:spring-context:2.5.6.SEC03
org.jenkins-ci.ui:jquery-detached:1.2.1
org.ow2.asm:asm-analysis:5.0.3
io.github.stephenc.crypto:self-signed-cert-generator:1.0.0
com.github.jnr:jffi:1.2.15
org.jvnet.winp:winp:1.25
commons-discovery:commons-discovery:0.4
org.jenkins-ci.dom4j:dom4j:1.6.1-jenkins-4
org.jenkins-ci:memory-monitor:1.9
org.jenkins-ci.modules:systemd-slave-installer:1.1
org.jvnet.hudson:xstream:1.4.7-jenkins-1
org.jvnet:tiger-types:2.2
com.sun.xml.txw2:txw2:20110809
org.springframework:spring-web:2.5.6.SEC03
org.slf4j:log4j-over-slf4j:1.7.7
org.kohsuke.jinterop:j-interop:2.0.6-kohsuke-1
org.jruby.ext.posix:jna-posix:1.0.3-jenkins-1
com.github.jnr:jnr-ffi:2.1.4
com.github.jnr:jnr-posix:3.0.41
javax.annotation:javax.annotation-api:1.2
org.jenkins-ci.main:remoting:3.10.2
org.kohsuke.jinterop:j-interopdeps:2.0.6-kohsuke-1
com.infradna.tool:bridge-method-annotation:1.13
org.ow2.asm:asm-tree:5.0.3
args4j:args4j:2.0.31
org.kohsuke:asm5:5.0.1
antlr:antlr:2.7.6
relaxngDatatype:relaxngDatatype:20020414
com.jcraft:jzlib:1.1.3-kohsuke-1
org.kohsuke.stapler:stapler-jrebel:1.250
org.jenkins-ci.ui:bootstrap:1.3.2
commons-collections:commons-collections:3.2.2
org.jenkins-ci.modules:slave-installer:1.5
net.java.dev.jna:jna:4.2.1
junit:junit:4.12
org.slf4j:slf4j-jdk14:1.7.7
org.jenkins-ci:trilead-ssh2:build-217-jenkins-11
net.sf.ezmorph:ezmorph:1.0.6
org.apache.ant:ant:1.8.4
commons-codec:commons-codec:1.8
org.springframework:spring-webmvc:2.5.6.SEC03
com.github.jnr:jnr-x86asm:1.0.2
xpp3:xpp3:1.1.4c
jaxen:jaxen:1.1-beta-11
commons-jelly:commons-jelly-tags-xml:1.1
-
-
durable-task 1.26
A few of my Jenkins pipelines failed last night with this failure mode:
01:19:19 Running on blackbox-slave2 in /var/tmp/jenkins_slaves/jenkins-regression/path/to/workspace. [Note: this is an SSH slave]
[Pipeline] {
[Pipeline] ws
01:19:19 Running in /net/nas.delphix.com/nas/regression-run-workspace/jenkins-regression/workspace@10. [Note: This is an NFS share on a NAS]
[Pipeline] {
[Pipeline] sh
01:20:10 [qa-gate] Running shell script
[... script output ...]
01:27:19 Running test_create_domain at 2017-11-29 01:27:18.887531...
[Pipeline] // dir
[Pipeline] }
[Pipeline] // ws
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // timestamps
[Pipeline] }
[Pipeline] // timeout
ERROR: script returned exit code -1
Finished: FAILURE
As far as I can tell the script was running fine, but apparently Jenkins killed it prematurely because Jenkins didn't think the process was still alive.
The interesting thing is that this normally works, but it failed last night at exactly the same time in multiple pipeline jobs. I only started seeing this after upgrading durable-task-plugin from 1.14 to 1.17. I looked at the code changes and saw that the main one is that ProcessLiveness moved from a ps-based check to a timestamp-based check. What I suspect is that the NFS server hosting this workspace wasn't processing I/O operations fast enough when the problem occurred, so the timestamp wasn't updated even though the script kept running. Note that I am not using Docker here; this is just a regular SSH slave.
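To make the suspected failure mode concrete, here is a minimal sketch of a timestamp-based liveness check. It is not the durable-task implementation: the class name, method names and the idea of reading a single heartbeat file are assumptions for illustration, and only the 15-second window comes from this report. The point is that the check only sees how old the last recorded timestamp is, so a stalled filesystem and a dead process look the same.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration only; not code from durable-task.
public class TimestampLiveness {

    /** Pure decision: the process counts as alive only if the heartbeat is recent enough. */
    public static boolean looksAlive(Instant lastHeartbeat, Instant now, Duration threshold) {
        return Duration.between(lastHeartbeat, now).compareTo(threshold) <= 0;
    }

    /** Reads the heartbeat from the modification time of a (hypothetical) heartbeat file. */
    public static boolean looksAlive(Path heartbeatFile, Duration threshold) throws IOException {
        Instant lastHeartbeat = Files.getLastModifiedTime(heartbeatFile).toInstant();
        return looksAlive(lastHeartbeat, Instant.now(), threshold);
    }

    public static void main(String[] args) {
        Duration threshold = Duration.ofSeconds(15);
        Instant now = Instant.parse("2017-11-29T01:27:19Z");

        // Heartbeat written 5 seconds ago: considered alive.
        System.out.println(looksAlive(now.minusSeconds(5), now, threshold));

        // Script still running, but the NFS write stalled for 40 seconds:
        // the check cannot tell this apart from a dead process.
        System.out.println(looksAlive(now.minusSeconds(40), now, threshold));
    }
}
```

A ps-based check asks the operating system whether the process exists, so it does not depend on workspace I/O; the sketch shows why the timestamp variant does.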
The ps-based approach may have been suboptimal, but it was more reliable for us than the new timestamp-based approach, at least when using NFS-based workspaces. Expecting a timestamp to increase on a file every 15 seconds may be a tall order for some system and network administrators, especially over NFS. Network issues can and do happen, and they shouldn't take down Jenkins jobs when they do. Our Jenkins jobs used to just hang when there was an NFS outage; now the script liveness check kills the job. I view this as a regression. As flawed as the old approach may have been, it was immune to this failure mode. Is there anything I can do here besides increasing various timeouts to avoid hitting this? The fact that no diagnostic information was printed to the Jenkins log or the SSH slave remoting log is also problematic here.
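As a sketch of what a more forgiving check could look like, the following tolerates a configurable number of consecutive stale observations before giving up, and records a diagnostic message each time. Everything here is hypothetical (class name, constructor parameters, message format); it does not describe how durable-task behaves or how the issue was eventually fixed.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not code from durable-task.
public class TolerantLiveness {
    private final Duration threshold;
    private final int maxStaleChecks;
    private int staleChecks = 0;
    private final List<String> diagnostics = new ArrayList<>();

    public TolerantLiveness(Duration threshold, int maxStaleChecks) {
        this.threshold = threshold;
        this.maxStaleChecks = maxStaleChecks;
    }

    /** Returns false only after maxStaleChecks consecutive stale observations. */
    public boolean check(Instant lastHeartbeat, Instant now) {
        Duration age = Duration.between(lastHeartbeat, now);
        if (age.compareTo(threshold) <= 0) {
            staleChecks = 0; // a fresh heartbeat resets the counter
            return true;
        }
        staleChecks++;
        // Record why the check is unhappy, so a later failure is explainable.
        diagnostics.add("heartbeat is " + age.getSeconds() + "s old (threshold "
                + threshold.getSeconds() + "s), stale check "
                + staleChecks + " of " + maxStaleChecks);
        return staleChecks < maxStaleChecks;
    }

    public List<String> diagnostics() {
        return diagnostics;
    }

    public static void main(String[] args) {
        TolerantLiveness liveness = new TolerantLiveness(Duration.ofSeconds(15), 3);
        Instant now = Instant.parse("2017-11-29T01:27:19Z");
        for (int i = 0; i < 3; i++) {
            System.out.println(liveness.check(now.minusSeconds(40), now));
        }
        liveness.diagnostics().forEach(System.out::println);
    }
}
```

The trade-off is that a genuinely dead process takes longer to detect, in exchange for surviving a short NFS or network stall.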
- relates to
-
JENKINS-50892 Pipeline jobs stuck after restart
-
- Closed
-
-
JENKINS-47791 Eliminate ProcessLiveness
-
- Resolved
-
-
JENKINS-50379 Jenkins kills long running sh script with no output
-
- Open
-
- links to
[JENKINS-48300] Pipeline shell step aborts prematurely with ERROR: script returned exit code -1
Link |
New:
This issue relates to |
Remote Link | New: This issue links to "durable-task PR 57 (Web Link)" [ 19953 ] |
Remote Link | New: This issue links to "workflow-durable-task-step PR 62 (Web Link)" [ 19954 ] |
Resolution | New: Fixed [ 1 ] | |
Status | Original: Open [ 1 ] | New: Closed [ 6 ] |
Comment |
[ [~svanoort]: We're periodically running into this even though we don't use NFS on either the master or the slaves and even though we're using the fastest durability setting, so I did some research. It looks like between two heartbeat checks, there are a lot of network I/O operations between master and slave which can easily cause a timeout, even without NFS. Therefore, the current error message was extremely misleading in our case. At the very least, the error message should be changed to make people aware that the heartbeat timestamps are compared on the Jenkins master and that there are a lot of other network operations happening in between those two heartbeat checks. Without a code review of both plugins involved (Durable Task, Durable Task Step), I would have never figured that out. But I'm also questioning whether the defaults are sensible at all. Why should Jenkins assume that the shell process is dead just because a bunch of network operations between master and slave took more than 15 seconds to complete? That's an awfully short time span. Please reconsider the default value for this. I think something in the order of minutes might be more reasonable; short-term network congestion can happen from time to time and shouldn't cause builds to fail. ] |
Assignee | New: Sam Van Oort [ svanoort ] |
Assignee | Original: Sam Van Oort [ svanoort ] | New: Jesse Glick [ jglick ] |
Resolution | Original: Fixed [ 1 ] | |
Status | Original: Closed [ 6 ] | New: Reopened [ 4 ] |
Remote Link | New: This issue links to "durable-task PR 81 (Web Link)" [ 21336 ] |