
Caused: java.io.IOException: Unexpected termination of the channel for Azure Agents

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • _unsorted
    • None
    • Jenkins 2.138.2
      Azure VM Plugin: 0.7.5

      We are seeing this problem frequently, and only on our autoscaled agents.

      The agent gets disconnected during the execution of a job and is deleted in Azure as well. This gives us no time to log in to the VM and look at the agent logs.

      The job continues to hold the agent even though it is shown as "disconnected" in Jenkins and the actual VM has already been deleted.

      The Jenkins agent log shows:

      Connection was broken

      java.io.EOFException
          at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2681)
          at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3156)
          at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
          at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
          at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
          at hudson.remoting.Command.readFrom(Command.java:140)
          at hudson.remoting.Command.readFrom(Command.java:126)
          at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
      Caused: java.io.IOException: Unexpected termination of the channel
          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)

       

       

      We then looked at the logs on the Jenkins server and found this:

      2019-03-12T13:04:28.614306215Z INFO: AzureVMCloud: createProvisionedAgent: Deployment winconfig7-0312130351356 not yet finished (Running): Microsoft.Compute/virtualMachines:winconfi9e19b0 - waited 30 seconds
      2019-03-12T13:05:01.195797620Z INFO: AzureVMCloud: createProvisionedAgent: Deployment winconfig7-0312130351356 not yet finished (Running): Microsoft.Compute/virtualMachines:winconfi9e19b0 - waited 60 seconds
      2019-03-12T13:05:34.204246893Z INFO: AzureVMCloud: createProvisionedAgent: Deployment winconfig7-0312130351356 not yet finished (Running): Microsoft.Compute/virtualMachines:winconfi9e19b0 - waited 90 seconds
      2019-03-12T13:06:07.630586283Z INFO: AzureVMCloud: createProvisionedAgent: Deployment winconfig7-0312130351356 not yet finished (Running): Microsoft.Compute/virtualMachines:winconfi9e19b0 - waited 120 seconds
      2019-03-12T13:06:40.605311559Z INFO: AzureVMCloud: createProvisionedAgent: VM available: winconfi9e19b0
      2019-03-12T13:06:40.682752835Z  found agent winconfi9e19b0
      2019-03-12T13:06:41.085375927Z nodeNamewinconfi9e19b0
      2019-03-12T13:06:41.085392927Z INFO: Azure Cloud: provision: Adding agent winconfi9e19b0 to Jenkins nodes
      2019-03-12T13:06:41.085557426Z INFO: AzureVMAgent: createComputer: start for agent winconfi9e19b0
      2019-03-12T13:06:41.086645120Z INFO: AzureVMCloudRetensionStrategy: start: azureComputer name winconfi9e19b0
      2019-03-12T13:06:41.124592912Z INFO: AzureVMAgentSSHLauncher: launch: launch method called for agent winconfi9e19b0
      2019-03-12T13:07:31.163904528Z INFO: AzureVMAgentCleanUpTask: cleanVMs: node winconfi9e19b0 blocked to cleanup
      2019-03-12T13:21:34.609221023Z  Suppressed: hudson.remoting.Channel$CallSiteStackTrace: Remote call to winconfi9e19b0
      2019-03-12T14:16:07.211594879Z  Suppressed: hudson.remoting.Channel$CallSiteStackTrace: Remote call to winconfi9e19b0
      2019-03-12T14:33:49.513820368Z  Suppressed: hudson.remoting.Channel$CallSiteStackTrace: Remote call to winconfi9e19b0
      2019-03-12T14:45:50.112174433Z  Suppressed: hudson.remoting.Channel$CallSiteStackTrace: Remote call to winconfi9e19b0
      2019-03-12T15:04:38.977113250Z  Suppressed: hudson.remoting.Channel$CallSiteStackTrace: Remote call to winconfi9e19b0
      2019-03-12T15:11:23.332151176Z SEVERE: I/O error in channel winconfi9e19b0
      2019-03-12T15:23:24.569513881Z WARNING: Issue with creating launcher for agent winconfi9e19b0. The agent has not been fully initialized yetProbably there is a race condition with Agent reconnection or disconnection, check other log entries
      2019-03-12T15:27:31.917195333Z INFO: AzureVMManagementServiceDelegate: virtualMachineExists: check for winconfi9e19b0
      2019-03-12T15:27:32.116311424Z INFO: AzureVMManagementServiceDelegate: virtualMachineExists: winconfi9e19b0 doesnt exist
      2019-03-12T15:27:32.116318724Z INFO: AzureVMAgentCleanUpTask: cleanVMs: node winconfi9e19b0 doesnt exist, removing
      2019-03-12T15:28:21.183757963Z INFO: cleanLeakedResources: deleting winconfi9e19b0NIC from resource group jenkins-agents
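
One way to read a log like the one above is to filter it down to a single agent and print each event as an offset from the launch. The following is a minimal sketch, not part of the plugin or of Jenkins: the helper names `parse_ts` and `agent_timeline` are made up for illustration, and the three sample lines are copied from the log above.

```python
from datetime import datetime, timezone

def parse_ts(line):
    """Parse the leading timestamp of a log line (nanoseconds are cut to microseconds)."""
    stamp = line.split()[0]                      # e.g. 2019-03-12T13:06:41.124592912Z
    date_part, frac = stamp.rstrip("Z").split(".")
    parsed = datetime.strptime(f"{date_part}.{frac[:6]}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)

def agent_timeline(lines, agent):
    """Return (timestamp, message) pairs for every line that mentions the agent."""
    return [(parse_ts(l), l.split(None, 1)[1].strip()) for l in lines if agent in l]

# Sample lines copied from the controller log in this report.
LOG = """\
2019-03-12T13:06:41.124592912Z INFO: AzureVMAgentSSHLauncher: launch: launch method called for agent winconfi9e19b0
2019-03-12T15:11:23.332151176Z SEVERE: I/O error in channel winconfi9e19b0
2019-03-12T15:27:32.116311424Z INFO: AzureVMManagementServiceDelegate: virtualMachineExists: winconfi9e19b0 doesnt exist
""".splitlines()

events = agent_timeline(LOG, "winconfi9e19b0")
launch = events[0][0]
for ts, msg in events:
    # Offset from launch makes the roughly two-hour gap before the failure visible.
    print(f"+{ts - launch}  {msg}")
```

With these sample lines, the I/O error appears a little over two hours after the launch call, which matches the behaviour described in the report.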

      Is there any suggestion as to where we can look to find out what exactly is causing this?

      Everything was working fine just a week ago, and we have not updated anything.

          [JENKINS-56535] Caused: java.io.IOException: Unexpected termination of the channel for Azure Agents

          Jie Shen added a comment -

          Thanks for reporting the issue here. I think this issue may be related to https://github.com/jenkinsci/jenkins/blob/1c9eb43283e7321ee4d3a0e1e9995453493ff04a/core/src/main/java/hudson/model/Slave.java#L496-L500 . The agent does not successfully set up its connection with the master. Are you using SSH to connect to the slave? The code is not robust enough for now. I think we should add a timeout and a retry strategy for setting up the connection.

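
The "timeout and retry strategy" suggested in the comment above could look roughly like the sketch below. This is a generic illustration only, not the plugin's or Jenkins core's actual code: `connect_with_retry`, its parameters, and the `flaky_connect` stand-in are all hypothetical names introduced for this example.

```python
import time

def connect_with_retry(connect, attempts=3, timeout_s=30.0, backoff_s=5.0, sleep=time.sleep):
    """Call connect(timeout_s) up to `attempts` times, waiting longer after each failure.

    `connect` is any callable that either returns a channel or raises
    OSError/TimeoutError; it is a placeholder for the real connection setup.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect(timeout_s)
        except (OSError, TimeoutError) as exc:
            last_error = exc
            if attempt < attempts:
                sleep(backoff_s * attempt)       # linear backoff between attempts
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error

# Hypothetical stand-in for an agent that only answers on the third attempt.
calls = []
def flaky_connect(timeout_s):
    calls.append(timeout_s)
    if len(calls) < 3:
        raise TimeoutError("agent did not answer in time")
    return "channel-open"

# The sleep is stubbed out so the example runs instantly.
result = connect_with_retry(flaky_connect, sleep=lambda s: None)
print(result, "after", len(calls), "attempts")
```

A bounded number of attempts with a per-attempt timeout keeps a hung connection from blocking the launch indefinitely, while still tolerating a slow-starting VM.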

          Muhammad Faizan ul haq added a comment -

          It was more stable up until two weeks ago. And yes, we are using SSH.

          It seems the agent was up and serving jobs for two hours. Then it was allocated to yet another job, and during the execution of that job the VM disappeared (the deletion seems to have been ordered by the plugin). The job then fails with errors like:

          ERROR: Issue with creating launcher for agent testtempba65f0. The agent has not been fully initialized yet
          Channel "unknown": Remote call on testtempba65f0 failed. The channel is closing down or has closed down

          We do not get a chance to look into the agent to see what has gone wrong. This is a major blocker for us right now.

          Chance Davies added a comment -

          Just like Muhammad, we have begun to see these errors only after the last update. We now have many slaves failing to run jobs that were very reliable in the past.


          Jie Shen added a comment -

          cdavies Which version were you using before you updated? The latest version of this plugin only fixes error messages on the configuration page and adds another way to provide images for agents. Connection errors of this kind are always annoying and hard to debug. I plan to give the vm-agent-plugin's stability high priority in my next work plan.


          Muhammad Faizan ul haq added a comment -

          After some struggle, we found the issue.

          We had spun up a test Jenkins from a backup of our production Jenkins server. Both servers had the same ID, and somehow, approximately two hours after an agent was created, the agents created by one server were being removed by the other server. jieshe can probably explain this.
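
The finding above can be illustrated with a toy model. It rests on an assumption the thread does not confirm: that each VM carries an ownership marker equal to the controller's ID, and that a cleanup task deletes any VM with its own ID that it does not know as a node. The `Controller` class, the `cleanup` method, and the ID values are hypothetical and do not reflect the plugin's real implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Controller:
    name: str
    instance_id: str                       # assumed ownership marker, copied by a backup restore
    nodes: set = field(default_factory=set)

    def provision(self, cloud_vms, vm_name):
        """Create a VM, mark it with our ID, and register it as one of our nodes."""
        cloud_vms[vm_name] = self.instance_id
        self.nodes.add(vm_name)

    def cleanup(self, cloud_vms):
        """Delete VMs marked with our ID that we do not know as nodes (assumed logic)."""
        doomed = [vm for vm, owner in cloud_vms.items()
                  if owner == self.instance_id and vm not in self.nodes]
        for vm in doomed:
            del cloud_vms[vm]
        return doomed

cloud_vms = {}                                           # VM name -> owner ID
prod = Controller("production", instance_id="abc123")
test = Controller("test-from-backup", instance_id="abc123")   # same ID after restore

prod.provision(cloud_vms, "winconfi9e19b0")
deleted = test.cleanup(cloud_vms)                        # test server sees an "orphan" it owns
print("deleted by test server:", deleted)

# With a distinct ID the test server leaves production's VM alone.
cloud_vms_ok = {}
test_fixed = Controller("test-from-backup", instance_id="def456")
prod.provision(cloud_vms_ok, "winconfi9e19b0")
untouched = test_fixed.cleanup(cloud_vms_ok)
```

Under this assumption, giving the restored test server its own ID would be enough to stop the cross-deletion.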

          Tim Jacomb added a comment -

          All issues have been transferred to GitHub.

          See https://github.com/jenkinsci/azure-vm-agents-plugin/issues

          Search the issue title to find it.

          (This is a bulk comment and can't link to the specific issue)


            Assignee: Jie Shen (jieshe)
            Reporter: Muhammad Faizan ul haq (faizan)