Jenkins / JENKINS-33412

Jenkins locks when started in HTTPS mode on a host with 37+ processors

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Component: winstone-jetty
    • Environment: Jenkins 1.652
      org.jenkins-ci:winstone 2.9
      Tested with JDK 1.7 and 1.8 on Linux and Solaris (both OpenJDK and Oracle JDK). Reproduces on Ubuntu, Debian, CentOS and SmartOS.

      Summary
      Using Winstone 2.9 (i.e. the embedded Jetty wrapper) or below, Jenkins will not run in HTTPS mode on hosts with 37 or more cores/processors. The problem reproduces regardless of the JDK or the operating system.

      Reproduction
      The easiest way to reproduce the error is to use qemu to virtualize a system with 37 or more cores. You can do that with the -smp <cores> parameter. For example, for testing I run:

      qemu-system-x86_64 -hda ubuntu.img -m 4096 -smp 48
      

      Once you have a VM with 37 or more cores set up, install Jenkins 1.652 and configure it to use HTTPS. Start it and try to connect to either the HTTP or the HTTPS port. The connection will time out on either port, with the server effectively locked until you send it a SIGTERM. Please refer to the attached log file to see its startup process.
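      For reference, "HTTPS mode" here means launching the WAR with Winstone's TLS options; a minimal invocation looks like the following (the keystore path and password are placeholders, not values from this report):

      ```shell
      # Start Jenkins with the embedded Winstone/Jetty listening for TLS.
      # The keystore path and password are placeholders, not from the report.
      java -jar jenkins.war \
        --httpsPort=8443 \
        --httpsKeyStore=/path/to/keystore.jks \
        --httpsKeyStorePassword=changeit
      ```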

      Why is this important?
      You may ask: who is running Jenkins on that big of a server? Well, with containerization technologies (e.g. Docker) taking center stage, we are seeing more and more deployments where there is no VM involved, so a container gets a slice of CPU but has visibility into all of the processors on the system. The official Jenkins Docker image suffers from this defect. Granted, a user can set up their own reverse proxy to terminate TLS, but that adds unneeded complexity for users looking to containerize their Jenkins environment.
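      The visibility problem is easy to demonstrate: a CPU-share limit caps how much CPU time a container may consume, but tools inside it (and the JVM's availableProcessors()) still report every host core. A quick check, assuming a local Docker daemon and any stock image:

      ```shell
      # Even with a CPU-share limit, the container still "sees" every host core;
      # nproc (like the JVM) reports the host's full core count, not the share.
      docker run --rm --cpu-shares=512 debian nproc
      ```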

      Solution
      I've done some work to narrow down the root cause. I removed the Jenkins WAR entirely from the Jenkins Winstone runner (https://github.com/jenkinsci/winstone) and ran a simple hello-world WAR instead. With HTTPS enabled, the issue still reproduced. It took a lot of fiddling to determine that the hang occurs at exactly 37 cores.

      Lastly, I tried the exact same reproduction steps with winstone-3.1. Luckily, with the embedded Jetty upgrade in version 3.1, the issue is resolved.

      Can we upgrade the next Jenkins release to use the winstone-3.1 component?

      This would be the easiest and best fix. I would be happy to contribute to any effort to get this into a release.


          Elijah Zupancic created issue -
          Elijah Zupancic made changes -
          Description Original: *Summary*
          Using Winstone Jetty 3.1 or below it will not run in HTTPS mode on hosts with 37 cores/processors or more. This problem replicates regardless of the JDK or the operating system.

          *Reproduction*
          The easiest way to reproduce the error is to use qemu to virtualize a 37 core system. You can do that with the -smp <cores> parameter. For example, for testing I run:

          {code:none}
          qemu-system-x86_64 -hda ubuntu.img -m 4096 -smp 48
          {code}

          Once you have a VM with more than 37 cores setup, install Jenkins 1.652 and configure it to use HTTPS. Attempt to start it and connect to either the HTTP or the HTTPS port. The connection will time out for either port with the server effectively locked until you send it a SIGTERM. Please refer to the attached log file to see its start process.

          *Why is this important?*
          You may ask - who is running Jenkins on that big of a server? Well, with containerization technologies (e.g. Docker) taking a center stage, we are seeing more and more deployments where there is no VM involved and hence a container gets a slice of CPU but has visibility to all of the processors on a system. The official Docker image of Jenkins suffers from this defect. Sure a user can set up their own reverse proxy to run TLS through, but it adds unneeded complexity for users looking to containerize their Jenkins environment.

          *Solution*
          I've done the work of reducing the surface area of root cause analysis. I removed the entire jenkins war from the jenkins winstone runner (https://github.com/jenkinsci/winstone) and ran a simple hello world war instead. With HTTPS enabled, the issue still reproduced. It took a lot of fiddling to determine that it was exactly at 37 cores in which the hang occurred.

          Lastly, I tried the exact same reproduction steps with winstone-3.2-SNAPSHOT. Luckily, with the upgrade to embedded Jetty in the 3.2 version the issue is resolved.

          _*Can we upgrade the next Jenkins release to use the winstone-3.2 component?*_

          This would be the easiest and the best fix. I would be happy to contribute to any efforts that would allow for us to get this into a release.
          New: *Summary*
          Using Winstone Jetty 3.1 or below it will not run in HTTPS mode on hosts with 37 cores/processors or more. This problem replicates regardless of the JDK or the operating system.

          *Reproduction*
          The easiest way to reproduce the error is to use qemu to virtualize a 37 core system. You can do that with the -smp <cores> parameter. For example, for testing I run:

          {code:none}
          qemu-system-x86_64 -hda ubuntu.img -m 4096 -smp 48
          {code}

          Once you have a VM with more than 37 cores setup, install Jenkins 1.652 and configure it to use HTTPS. Attempt to start it and connect to either the HTTP or the HTTPS port. The connection will time out for either port with the server effectively locked until you send it a SIGTERM. Please refer to the attached log file to see its start process.

          *Why is this important?*
          You may ask - who is running Jenkins on that big of a server? Well, with containerization technologies (e.g. Docker) taking a center stage, we are seeing more and more deployments where there is no VM involved and hence a container gets a slice of CPU but has visibility to all of the processors on a system. The official Docker image of Jenkins suffers from this defect. Sure a user can set up their own reverse proxy to run TLS through, but it adds unneeded complexity for users looking to containerize their Jenkins environment.

          *Solution*
          I've done the work of reducing the surface area of root cause analysis. I removed the entire jenkins war from the jenkins winstone runner (https://github.com/jenkinsci/winstone) and ran a simple hello world war instead. With HTTPS enabled, the issue still reproduced. It took a lot of fiddling to determine that it was exactly at 37 cores in which the hang occurred.

          Lastly, I tried the exact same reproduction steps with winstone-3.2-SNAPSHOT. Luckily, with the upgrade to embedded Jetty in the 3.2 version the issue is resolved.

          _Can we upgrade the next Jenkins release to use the winstone-3.2 component?_

          This would be the easiest and the best fix. I would be happy to contribute to any efforts that would allow for us to get this into a release.
          Elijah Zupancic made changes -
          Description Original: *Summary*
          Using Winstone Jetty 3.1 or below it will not run in HTTPS mode on hosts with 37 cores/processors or more. This problem replicates regardless of the JDK or the operating system.

          *Reproduction*
          The easiest way to reproduce the error is to use qemu to virtualize a 37 core system. You can do that with the -smp <cores> parameter. For example, for testing I run:

          {code:none}
          qemu-system-x86_64 -hda ubuntu.img -m 4096 -smp 48
          {code}

          Once you have a VM with more than 37 cores setup, install Jenkins 1.652 and configure it to use HTTPS. Attempt to start it and connect to either the HTTP or the HTTPS port. The connection will time out for either port with the server effectively locked until you send it a SIGTERM. Please refer to the attached log file to see its start process.

          *Why is this important?*
          You may ask - who is running Jenkins on that big of a server? Well, with containerization technologies (e.g. Docker) taking a center stage, we are seeing more and more deployments where there is no VM involved and hence a container gets a slice of CPU but has visibility to all of the processors on a system. The official Docker image of Jenkins suffers from this defect. Sure a user can set up their own reverse proxy to run TLS through, but it adds unneeded complexity for users looking to containerize their Jenkins environment.

          *Solution*
          I've done the work of reducing the surface area of root cause analysis. I removed the entire jenkins war from the jenkins winstone runner (https://github.com/jenkinsci/winstone) and ran a simple hello world war instead. With HTTPS enabled, the issue still reproduced. It took a lot of fiddling to determine that it was exactly at 37 cores in which the hang occurred.

          Lastly, I tried the exact same reproduction steps with winstone-3.2-SNAPSHOT. Luckily, with the upgrade to embedded Jetty in the 3.2 version the issue is resolved.

          _Can we upgrade the next Jenkins release to use the winstone-3.2 component?_

          This would be the easiest and the best fix. I would be happy to contribute to any efforts that would allow for us to get this into a release.
          New: *Summary*
          Using Winstone 3.1 (i.e. the embedded Jetty wrapper) or below it will not run in HTTPS mode on hosts with 37 cores/processors or more. This problem replicates regardless of the JDK or the operating system.

          *Reproduction*
          The easiest way to reproduce the error is to use qemu to virtualize a 37 core system. You can do that with the -smp <cores> parameter. For example, for testing I run:

          {code:none}
          qemu-system-x86_64 -hda ubuntu.img -m 4096 -smp 48
          {code}

          Once you have a VM with more than 37 cores setup, install Jenkins 1.652 and configure it to use HTTPS. Attempt to start it and connect to either the HTTP or the HTTPS port. The connection will time out for either port with the server effectively locked until you send it a SIGTERM. Please refer to the attached log file to see its start process.

          *Why is this important?*
          You may ask - who is running Jenkins on that big of a server? Well, with containerization technologies (e.g. Docker) taking a center stage, we are seeing more and more deployments where there is no VM involved and hence a container gets a slice of CPU but has visibility to all of the processors on a system. The official Docker image of Jenkins suffers from this defect. Sure a user can set up their own reverse proxy to run TLS through, but it adds unneeded complexity for users looking to containerize their Jenkins environment.

          *Solution*
          I've done the work of reducing the surface area of root cause analysis. I removed the entire jenkins war from the jenkins winstone runner (https://github.com/jenkinsci/winstone) and ran a simple hello world war instead. With HTTPS enabled, the issue still reproduced. It took a lot of fiddling to determine that it was exactly at 37 cores in which the hang occurred.

          Lastly, I tried the exact same reproduction steps with winstone-3.2-SNAPSHOT. Luckily, with the upgrade to embedded Jetty in the 3.2 version the issue is resolved.

          _Can we upgrade the next Jenkins release to use the winstone-3.2 component?_

          This would be the easiest and the best fix. I would be happy to contribute to any efforts that would allow for us to get this into a release.

          Daniel Beck added a comment -

          Lastly, I tried the exact same reproduction steps with winstone-3.2-SNAPSHOT. Luckily, with the upgrade to embedded Jetty in the 3.2 version the issue is resolved.

          It's not clear to me what you did. There are no open PRs, and master… well… https://github.com/jenkinsci/winstone/compare/winstone-3.1...master


          With that many CPUs, glibc can do crazy memory allocations, like reported here: https://issues.apache.org/jira/browse/HADOOP-7154

          I wonder if switching to latest jetty is working by luck, as memory arena creation depends on thread contention. Do you have a /proc/PID/status content of the hanging JVM?
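          To collect the data being asked for, something like the following works (1673 is just the PID from this report; substitute whatever ps shows for the hung JVM). If glibc arena growth were the culprit, capping the arena count is the usual mitigation from the HADOOP-7154 thread:

          ```shell
          # Pull the fields of interest from the hung JVM's status file.
          # 1673 is the PID from this report; substitute your own.
          grep -E '^(State|Threads|VmRSS|Cpus_allowed_list)' /proc/1673/status

          # Workaround if glibc arenas were to blame: cap them before launch.
          MALLOC_ARENA_MAX=4 java -jar jenkins.war --httpsPort=8443
          ```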

          Elijah Zupancic made changes -
          Description Original: *Summary*
          Using Winstone 3.1 (i.e. the embedded Jetty wrapper) or below it will not run in HTTPS mode on hosts with 37 cores/processors or more. This problem replicates regardless of the JDK or the operating system.

          *Reproduction*
          The easiest way to reproduce the error is to use qemu to virtualize a 37 core system. You can do that with the -smp <cores> parameter. For example, for testing I run:

          {code:none}
          qemu-system-x86_64 -hda ubuntu.img -m 4096 -smp 48
          {code}

          Once you have a VM with more than 37 cores setup, install Jenkins 1.652 and configure it to use HTTPS. Attempt to start it and connect to either the HTTP or the HTTPS port. The connection will time out for either port with the server effectively locked until you send it a SIGTERM. Please refer to the attached log file to see its start process.

          *Why is this important?*
          You may ask - who is running Jenkins on that big of a server? Well, with containerization technologies (e.g. Docker) taking a center stage, we are seeing more and more deployments where there is no VM involved and hence a container gets a slice of CPU but has visibility to all of the processors on a system. The official Docker image of Jenkins suffers from this defect. Sure a user can set up their own reverse proxy to run TLS through, but it adds unneeded complexity for users looking to containerize their Jenkins environment.

          *Solution*
          I've done the work of reducing the surface area of root cause analysis. I removed the entire jenkins war from the jenkins winstone runner (https://github.com/jenkinsci/winstone) and ran a simple hello world war instead. With HTTPS enabled, the issue still reproduced. It took a lot of fiddling to determine that it was exactly at 37 cores in which the hang occurred.

          Lastly, I tried the exact same reproduction steps with winstone-3.2-SNAPSHOT. Luckily, with the upgrade to embedded Jetty in the 3.2 version the issue is resolved.

          _Can we upgrade the next Jenkins release to use the winstone-3.2 component?_

          This would be the easiest and the best fix. I would be happy to contribute to any efforts that would allow for us to get this into a release.
          New: *Summary*
          Using Winstone 2.9 (i.e. the embedded Jetty wrapper) or below it will not run in HTTPS mode on hosts with 37 cores/processors or more. This problem replicates regardless of the JDK or the operating system.

          *Reproduction*
          The easiest way to reproduce the error is to use qemu to virtualize a 37 core system. You can do that with the -smp <cores> parameter. For example, for testing I run:

          {code:none}
          qemu-system-x86_64 -hda ubuntu.img -m 4096 -smp 48
          {code}

          Once you have a VM with more than 37 cores setup, install Jenkins 1.652 and configure it to use HTTPS. Attempt to start it and connect to either the HTTP or the HTTPS port. The connection will time out for either port with the server effectively locked until you send it a SIGTERM. Please refer to the attached log file to see its start process.

          *Why is this important?*
          You may ask - who is running Jenkins on that big of a server? Well, with containerization technologies (e.g. Docker) taking a center stage, we are seeing more and more deployments where there is no VM involved and hence a container gets a slice of CPU but has visibility to all of the processors on a system. The official Docker image of Jenkins suffers from this defect. Sure a user can set up their own reverse proxy to run TLS through, but it adds unneeded complexity for users looking to containerize their Jenkins environment.

          *Solution*
          I've done the work of reducing the surface area of root cause analysis. I removed the entire jenkins war from the jenkins winstone runner (https://github.com/jenkinsci/winstone) and ran a simple hello world war instead. With HTTPS enabled, the issue still reproduced. It took a lot of fiddling to determine that it was exactly at 37 cores in which the hang occurred.

          Lastly, I tried the exact same reproduction steps with winstone-3.1. Luckily, with the upgrade to embedded Jetty in the 3.1 version the issue is resolved.

          _Can we upgrade the next Jenkins release to use the winstone-3.1 component?_

          This would be the easiest and the best fix. I would be happy to contribute to any efforts that would allow for us to get this into a release.
          Elijah Zupancic made changes -
          Environment Original: Jenkins 1.652
          org.jenkins-ci:winstone 2.9, 3.0, 3.1
          Testing in the Linux JDK 1.7 and 1.8 as well as the Solaris JDK 1.7 1.8 (both OpenJDK and OracleJDK). Reproduces in Ubuntu, Debian, CentOS and SmartOS.
          New: Jenkins 1.652
          org.jenkins-ci:winstone 2.9
          Testing in the Linux JDK 1.7 and 1.8 as well as the Solaris JDK 1.7 1.8 (both OpenJDK and OracleJDK). Reproduces in Ubuntu, Debian, CentOS and SmartOS.

          Elijah Zupancic added a comment - edited

          danielbeck You are completely correct. I did the bulk of my testing with winstone-2.9 and I lightly tested with winstone-3.1. I made a mistake with my network setup. I just validated that winstone-3.1 also works correctly and updated the bug to that effect.

          I would still recommend upgrading the next version of Jenkins to winstone-3.1 in order to fix this bug; this seems like one of those messy problems where, if a core-library upgrade fixes it, taking the upgrade is simply the better path.

          That said, read below if we want to go down the root cause analysis route:

          I work at Joyent, and if any developers want an environment in which this can be reproduced, please email me your public key (elijah.zupancic@joyent.com) and I will create an instance.

          ydubreuil The behavior is present across different operating systems, including ones that do not use glibc as part of their JVM implementation (e.g. SmartOS). When I inspected the application with a debugger, the best I could tell was that even though a request had come in, epoll would not wake and a processing thread would not be dispatched. We can see that Winstone receives the request (when FINE logging is enabled):

          FINE: created SCEP@3c108ef8{l(/97.113.3.1:54175)<->r(/165.225.168.215:8443),s=0,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=0}-{SslConnection@56a764e3 SSL NOT_HANDSHAKING i/o/u=-1/-1/-1 ishut=false oshut=false {AsyncHttpConnection@23fe1e75,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0}}
          

          However, the process just continues to sleep after this.

          Here's the output from /proc/PID/status when it is in a hung state:

          Name:   java
          State:  S (sleeping)
          Tgid:   1673
          Ngid:   0
          Pid:    1673
          PPid:   1672
          TracerPid:      0
          Uid:    1000    1000    1000    1000
          Gid:    1000    1000    1000    1000
          FDSize: 256
          Groups: 4 24 27 30 46 110 111 1000 
          NStgid: 1673
          NSpid:  1673
          NSpgid: 1672
          NSsid:  1551
          VmPeak:  8983212 kB
          VmSize:  8983208 kB
          VmLck:         0 kB
          VmPin:         0 kB
          VmHWM:    199776 kB
          VmRSS:    199276 kB
          VmData:  8922448 kB
          VmStk:       136 kB
          VmExe:         4 kB
          VmLib:     17216 kB
          VmPTE:      1080 kB
          VmPMD:        48 kB
          VmSwap:        0 kB
          Threads:        98
          SigQ:   0/15699
          SigPnd: 0000000000000000
          ShdPnd: 0000000000000000
          SigBlk: 0000000000000000
          SigIgn: 0000000000000000
          SigCgt: 2000000181005ccf
          CapInh: 0000000000000000
          CapPrm: 0000000000000000
          CapEff: 0000000000000000
          CapBnd: 0000003fffffffff
          Seccomp:        0
          Cpus_allowed:   ffff,ffffffff
          Cpus_allowed_list:      0-47
          Mems_allowed:   00000000,00000001
          Mems_allowed_list:      0
          voluntary_ctxt_switches:        3
          nonvoluntary_ctxt_switches:     1
          


          Thanks for the attachments. It's pretty clear that glibc is not to blame here, with only about 200 MB of RSS for the process.

          The Dropwizard team hit the same issue, which was fixed in Jetty after their report.

          danielbeck I think we should consider upgrading Jetty to 9.2.2.v20140723 to get this fix.


          Thanks for your prompt action.

          That fix looks like it will solve the problem, but it makes me sad: it derives the thread count from the number of available cores. In the container world, we are seeing CPU shares and fair-share scheduling algorithms used to dice up CPU while still exposing all of the cores to the OS. This leads to a bunch of weird performance problems, especially if CPU-to-thread affinity is set. This way of configuring applications is not going to be sustainable long-term given where OS containerization is heading.

          I'll add a feature request after this issue is fixed to allow for the manual configuration of the number of threads for acceptors, selectors, etc. Once again - thanks for your work.
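          For illustration, the kind of core-derived sizing being objected to looks roughly like this (the heuristic below is hypothetical, not Jetty's actual formula); the point is that thread counts silently scale with whatever core count the OS exposes to the container:

          ```shell
          # Hypothetical core-derived sizing, for illustration only; Jetty's
          # real formula differs. Thread counts scale with the visible cores,
          # not with the CPU share the container can actually use.
          selectors() {
            cores=$1
            s=$(( cores / 8 ))
            [ "$s" -lt 1 ] && s=1
            echo "$s"
          }

          for cores in 4 36 37 48; do
            echo "$cores cores -> $(selectors "$cores") selectors"
          done
          ```

          Explicit acceptor/selector settings, as requested above, would let operators pin these numbers regardless of how many cores are visible.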


            Assignee: Olivier Lamy (olamy)
            Reporter: Elijah Zupancic (elijah)
            Votes: 0
            Watchers: 11
