Jenkins / JENKINS-63750

Thousands of JNLP4-connect threads blocked on object monitor


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: remoting
    • Labels: None

      Description

      Our JNLP agents often get stuck – the process is still running, and Jenkins thinks they are still executing the build that was assigned to them, but the build has not progressed for days.

      When we started investigating the state of such stuck agents, the first thing we noticed was the following message in the agent's log:

      Sep 17, 2020 2:17:11 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connected
      Sep 17, 2020 2:17:30 PM hudson.remoting.RemoteInvocationHandler$Unexporter run
      SEVERE: Couldn't clean up oid=2 from null
      java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:717)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
        at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
        at hudson.remoting.DelegatingExecutorService.submit(DelegatingExecutorService.java:42)
        at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:46)
        at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:41)
        at org.jenkinsci.remoting.util.AnonymousClassWarnings.check(AnonymousClassWarnings.java:66)
        at org.jenkinsci.remoting.util.AnonymousClassWarnings$1.annotateClass(AnonymousClassWarnings.java:122)
        at java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1290)
        at java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1231)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1427)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
        at java.lang.Throwable.writeObject(Throwable.java:1014)
        at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at hudson.remoting.Command.writeTo(Command.java:111)
        at hudson.remoting.AbstractByteBufferCommandTransport.write(AbstractByteBufferCommandTransport.java:287)
        at hudson.remoting.Channel.send(Channel.java:764)
        at hudson.remoting.RemoteInvocationHandler$PhantomReferenceImpl.cleanup(RemoteInvocationHandler.java:395)
        at hudson.remoting.RemoteInvocationHandler$PhantomReferenceImpl.access$1000(RemoteInvocationHandler.java:354)
        at hudson.remoting.RemoteInvocationHandler$Unexporter.run(RemoteInvocationHandler.java:612)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:111)
        at java.lang.Thread.run(Thread.java:748)
      

      Because the process is still running, we can obtain a thread dump; it contains 3,653 entries for threads like this:

      "pool-1-thread-157 for JNLP4-connect connection to resources-ci-master-jenkins.grid.hosting.cerence.net/10.179.225.4:50003 id=1138934" #172 daemon prio=5 os_prio=0 tid=0x00007f2a1c002800 nid=0x3d3a waiting for monitor entry [0x00007f2a49c81000]
      "pool-1-thread-155 for JNLP4-connect connection to resources-ci-master-jenkins.grid.hosting.cerence.net/10.179.225.4:50003 id=1138866" #170 daemon prio=5 os_prio=0 tid=0x00007f29d8002000 nid=0x3d38 waiting for monitor entry [0x00007f2a4ad92000]
      "pool-1-thread-148 for JNLP4-connect connection to resources-ci-master-jenkins.grid.hosting.cerence.net/10.179.225.4:50003 id=1138261" #163 daemon prio=5 os_prio=0 tid=0x00007f2a8c043800 nid=0x131b8 waiting for monitor entry [0x00007f2a4a388000]
      
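      As a rough sketch of how we quantified this: the following shell function counts the JNLP4-connect pool threads in a saved thread dump, and how many of them are BLOCKED. The dump file name is hypothetical; one can be captured with `jstack <agent-pid> > agent-threads.txt`.

```shell
# Count JNLP4-connect pool threads in a saved thread dump, and how many
# of them are BLOCKED on an object monitor. The dump file is whatever
# was captured from the agent JVM (e.g. via jstack).
count_jnlp_threads() {
  dump="$1"
  total=$(grep -c 'JNLP4-connect connection' "$dump")
  blocked=$(grep -c 'BLOCKED (on object monitor)' "$dump")
  echo "JNLP4 threads: $total, blocked: $blocked"
}
```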

      From a quick scan of the thread dump, each of those threads seems to be stuck in the same call:

      "pool-1-thread-13682 for JNLP4-connect connection to resources-ci-master-jenkins.grid.hosting.cerence.net/10.179.225.4:50003 id=2766416" #13698 daemon prio=5 os_prio=0 tid=0x00007f2a8cba6800 nid=0x12b54 waiting for monitor entry [0x00007f1fdd8a0000]
         java.lang.Thread.State: BLOCKED (on object monitor)
        at sun.misc.Unsafe.defineClass(Native Method)
        at sun.reflect.ClassDefiner.defineClass(ClassDefiner.java:63)
        at sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:399)
        at sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:394)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:393)
        at sun.reflect.MethodAccessorGenerator.generateMethod(MethodAccessorGenerator.java:75)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:53)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2177)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2068)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2286)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2210)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2068)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:430)
        at hudson.remoting.UserRequest.deserialize(UserRequest.java:290)
        at hudson.remoting.UserRequest.perform(UserRequest.java:189)
        at hudson.remoting.UserRequest.perform(UserRequest.java:54)
        at hudson.remoting.Request$2.run(Request.java:369)
        at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
        at hudson.remoting.Engine$1$$Lambda$3/231593000.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:748)
      

      This massive pile-up of blocked threads grinds the agent to a halt – I suppose that is because, in this state, the agent's user has reached its ulimit on the number of (lightweight) processes (ulimit -u), which is 4096.

      The trigger may have been a transmission error on the network between the Jenkins master and the agent – I don't trust the network much, and we also have occasional file-system reliability issues. Still, I am filing this ticket now, expecting that the problem is at least partly caused by an issue in Jenkins or Remoting and will recur.

        Attachments

          Activity

          oleg_nenashev Oleg Nenashev added a comment -

          Hard to say; this is a request handler that fails somewhere inside Java internals. AFAICT the only plausible reason is metaspace overflow, which causes new defineClass() calls to wait until garbage collection. "-XX:MetaspaceSize" could help.

          We could set a deserialization timeout in Remoting.  
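          For illustration, a sketch of how the suggested flag could be passed when launching an inbound agent – the sizes, URL, and paths here are placeholders, not values from this issue; only the -XX:MetaspaceSize / -XX:MaxMetaspaceSize flags are standard HotSpot options:

```shell
# Hypothetical agent launch line: pre-size metaspace so class-definition
# stalls during GC are less likely. Tune the sizes for your workload.
java -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=512m \
     -jar agent.jar \
     -jnlpUrl https://jenkins.example.com/computer/agent1/slave-agent.jnlp \
     -secret @secret-file \
     -workDir /home/jenkins/agent
```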

          jthompson Jeff Thompson added a comment -

          Is this something you've seen over a period of time? Is it related to any specific versions?

          I'm not aware of others reporting this problem, so there's a good chance it's related to a local, installation specific issue. It could be networking related, as you noted. This might have something to do with some weird or wrong behavior from some plugin. Or possibly even something going on in the controller that disrupts the response.

          A timeout around this could be a good idea. Previous attempts at instituting timeouts have gotten bogged down in a variety of issues, so I don't expect a quick fix there.


            People

            Assignee:
            jthompson Jeff Thompson
            Reporter:
            mkorvas Matěj Korvas
            Votes:
            0
            Watchers:
            3
