JENKINS-64178

Agent disconnects during a build due to JVM crash

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • remoting, ws-cleanup-plugin
    • None

      Jenkins brings all 10 nodes online. A build is started on one of the nodes. The JVM crashes, disconnecting the node, and failing the build.

      Steps to reproduce:

      1. Jenkins brings each node online (see file nwb-sol11-test1_connection_log.txt)
      2. Directory /x1/jenkins/agent_directory/remoting/jarCache is populated with multiple directories containing jar files (see file jarCache_directory_before_build.txt)
      3. Start a build that runs on a node (nwb-sol11-test1 in this case) (see files job_config.xml and nwb-sol11-test1_config.xml)
      4. Very early in the build execution, the JVM running remoting.jar crashes, causing the build to fail (see file build_log.txt)

      Once the build starts on the node, directory /x1/jenkins/agent_directory/remoting/jarCache gains further directories containing jar files, and all jar files have their modification timestamps updated (see file jarCache_directory_after_build.txt). This suggests that existing jar files in the cache are rewritten during the build, although that has not been confirmed.
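      One way to test the assumption that existing jar files are rewritten is to snapshot the cache before and after a build and compare inode, size, and modification time per file. The following is only a sketch: the function names `snapshot` and `diff_snapshots` are hypothetical helpers, not part of Jenkins or remoting, and the script would have to run on the agent itself.

```python
import os


def snapshot(root):
    """Map each .jar path under root (relative to root) to (inode, size, mtime_ns)."""
    snap = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(".jar"):
                path = os.path.join(dirpath, name)
                st = os.stat(path)
                snap[os.path.relpath(path, root)] = (st.st_ino, st.st_size, st.st_mtime_ns)
    return snap


def diff_snapshots(before, after):
    """Classify jars that appear in `after`.

    added     - not present before the build
    replaced  - same path, different inode (a new file was moved into place)
    rewritten - same inode, but size or modification time changed
    """
    added = sorted(set(after) - set(before))
    replaced, rewritten = [], []
    for path in sorted(set(before) & set(after)):
        b, a = before[path], after[path]
        if a[0] != b[0]:
            replaced.append(path)
        elif a[1:] != b[1:]:
            rewritten.append(path)
    return {"added": added, "replaced": replaced, "rewritten": rewritten}
```

      Calling `snapshot("/x1/jenkins/agent_directory/remoting/jarCache")` once before and once after the build, then passing both results to `diff_snapshots`, would show whether the existing files were replaced, rewritten in place, or left alone. The report does not establish which of these, if any, leads to the crash.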

      The JVM crash generates a core file. The core file indicates the JVM crashes due to signal 10 (SIGBUS) (see file jvm_core_file_where.txt).

      The following truss command was run against the PID of the JVM before starting the build: truss -a -d -D -E -f -o /x1/truss.out -p <jvm PID>
      The truss output shows the JVM incurs a fault when trying to read a file in /x1/jenkins/agent_directory/remoting/jarCache (see file truss.txt).
      Note these lines in the truss output:

      1818/220: 98.175569 0.000047 0.000017 stat("/x1/jenkins/agent_directory/remoting/jarCache/DD/891A07A8C64C7162B75516CD586859.jar", 0xFFFF80FF9F9CCAB0) = 0
      1818/220: 98.175657 0.000088 0.000016 lseek(27, 0x0016D598, SEEK_SET) = 1496472
      1818/220: 98.175703 0.000046 0.000021 read(27, " P K01021403\n\0\0\b\b\0".., 160) = 160
      1818/220: 98.175766 0.000063 0.000015 lseek(27, 0x0001918B, SEEK_SET) = 102795
      1818/220: 98.175808 0.000042 0.000018 read(27, " P K0304\n\0\0\b\b\0 o a".., 30) = 30
      1818/220: 98.175857 0.000049 0.000015 lseek(27, 0x000191D4, SEEK_SET) = 102868
      1818/220: 98.175914 0.000057 0.000017 read(27, "8D92 M OC2 @1086DF85 BA1".., 353) = 353
      1818/220: 98.176729 0.000815 0.000832 Incurred fault #5, FLTACCESS %pc = 0xFFFF80FFBD63F8C0
      1818/220: siginfo: SIGBUS BUS_OBJERR addr=0xFFFF80FFBD63F8C0 errno=151(ESTALE)
      1818/220: 98.177078 0.000349 0.001182 Received signal #10, SIGBUS [caught]
      1818/220: siginfo: SIGBUS BUS_OBJERR addr=0xFFFF80FFBD63F8C0 errno=151(ESTALE)
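      Because truss.txt is over 4 MB, a small script can help locate every such fault. The sketch below assumes each line carries the `pid/lwp:` prefix shown above and pairs each SIGBUS/ESTALE siginfo line with the most recent `stat()` path seen on the same LWP. That pairing is a heuristic, since the fault is reported against a memory address rather than a file name. `find_estale_faults` is a hypothetical helper name.

```python
import re

# "pid/lwp:" prefix that appears on every line of the truss output above.
PREFIX = re.compile(r'^\s*(\d+)/(\d+):\s+(.*)$')
# Quoted path argument of a stat() call (does not match fstat/lstat).
STAT = re.compile(r'\bstat\("([^"]+)"')


def find_estale_faults(lines):
    """Return a list of (pid, lwp, last_stat_path) for SIGBUS siginfo lines with ESTALE.

    last_stat_path is the most recent stat() path seen on that LWP, or None.
    Identical tuples are reported once, so the same fault printed twice
    (once for the fault, once for the signal) collapses to one entry.
    """
    last_stat = {}
    faults = []
    for line in lines:
        m = PREFIX.match(line)
        if not m:
            continue
        pid, lwp, rest = m.group(1), m.group(2), m.group(3)
        key = (pid, lwp)
        s = STAT.search(rest)
        if s:
            last_stat[key] = s.group(1)
        elif rest.startswith("siginfo: SIGBUS") and "ESTALE" in rest:
            fault = (pid, lwp, last_stat.get(key))
            if fault not in faults:
                faults.append(fault)
    return faults
```

      For example, `find_estale_faults(open("truss.txt", errors="replace"))` would iterate over the attached trace line by line.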
      

      Updating the jar files in the jarCache directory while the JVM is running appears to trigger a stale file handle error (ESTALE) in the JVM, which then crashes with SIGBUS.

      The JVM is run with the property "-Dsun.zip.disableMemoryMapping=true" in an attempt to avoid the crash, but this does not help.

      This example shows the JVM crashing on one of the 10 Solaris Intel nodes. The crash happens on all 10 nodes (both Solaris 10 Intel and Solaris 11 Intel).

      The crash happens often, but occasionally a build runs to completion without crashing. This might suggest a timing issue.

        1. build_log.txt
          3 kB
        2. jarCache_directory_after_build.txt
          3 kB
        3. jarCache_directory_before_build.txt
          2 kB
        4. job_config.xml
          10 kB
        5. jvm_core_file_where.txt
          5 kB
        6. nwb-sol11-test1_config.xml
          0.9 kB
        7. nwb-sol11-test1_connection_log.txt
          7 kB
        8. truss.txt
          4.38 MB

            Assignee: Unassigned
            Reporter: John Davey (jdavey)
            Votes: 0
            Watchers: 2