Infrastructure / INFRA-2739

Out of disk space in EC2 agents


    Details


      Description

      Has happened to me several times today, e.g. in https://ci.jenkins.io/job/Plugins/job/nodejs-plugin/job/PR-36/2/console

      /home/jenkins/workspace/Plugins: No space left on device
      

      A workaround is to use ACI agents, though in this case I cannot: I lack write permission to the repository, so I cannot meaningfully edit the Jenkinsfile.
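      For someone who does have write access, a minimal sketch of that workaround is switching the agent label in the Jenkinsfile from the EC2 pool to the ACI pool. The label names below ('aci', 'ec2-linux') are assumptions, not the actual ci.jenkins.io labels:

          pipeline {
              // Hypothetical labels: replace with the real ci.jenkins.io agent labels.
              // agent { label 'ec2-linux' }   // EC2 agents currently running out of disk
              agent { label 'aci' }            // ACI agents used as the workaround
              stages {
                  stage('Build') {
                      steps {
                          sh 'mvn -B clean verify'
                      }
                  }
              }
          }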

        Attachments

          Issue Links

            Activity

            markewaite Mark Waite added a comment -

            We had an agent connected to ci.jenkins.io that was reporting its free disc space as "N/A". I'm accustomed to seeing "N/A" for some period after the agent has connected. In this case, the "N/A" seems to have been an indication that the file system was so full that it could not process the request to report available disc space.

            I deleted the agent that was reporting "N/A" for free disc space but am not sure what we should do to resolve the problem more generally.
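            One way to surface this condition earlier is a preflight stage that reports and checks free space on the workspace filesystem before the real build starts. A sketch only; the agent label and the 1 GiB threshold are assumptions:

                pipeline {
                    agent { label 'ec2-linux' }   // hypothetical label
                    stages {
                        stage('Preflight') {
                            steps {
                                // Fail fast if the workspace filesystem is nearly full,
                                // instead of hitting "No space left on device" mid-build.
                                sh '''
                                    df -h "$WORKSPACE"
                                    avail_kb=$(df --output=avail "$WORKSPACE" | tail -1 | tr -d ' ')
                                    [ "$avail_kb" -gt 1048576 ] || { echo "Less than 1 GiB free"; exit 1; }
                                '''
                            }
                        }
                    }
                }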

            markewaite Mark Waite added a comment -

            No out of disc space reports since end of day Sep 30, 2020.

            jglick Jesse Glick added a comment -

            Has been breaking all EC2 builds now for a day or two.

            markewaite Mark Waite added a comment -

            Broken Oct 19, 2020 by a change to the AMI definition. That change was corrected Oct 20, 2020. We detected it Oct 19, but I didn't have the necessary skills to correct it; Olivier fixed it Oct 20.

            Still need to update the Windows agents with more disc space, but that's different from this issue.

            olblak Olivier Vernin added a comment -

            The default ec2 machines were using ephemeral devices (8 GB) instead of the EBS volume defined by the AMI (100 GB).
            This has been fixed in the ci.jenkins.io configuration.
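            A quick way to confirm which device actually backs an agent's workspace (an 8 GB ephemeral disk vs. the intended 100 GB EBS volume) is a throwaway scripted job; the label is an assumption:

                node('ec2-linux') {   // hypothetical label
                    // The size column makes an 8 GB ephemeral disk vs. a 100 GB EBS volume obvious.
                    sh 'df -h /home/jenkins'
                    sh 'lsblk -o NAME,SIZE,TYPE,MOUNTPOINT'
                }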

            jglick Jesse Glick added a comment -

            I think it would be fine to use ephemeral disks with modest sizes, if each build got a fresh agent. The problem AFAICT is reusing agents across builds.
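            If agents do keep being reused, per-build workspace cleanup at least limits how much each build leaves behind. A sketch using the Workspace Cleanup plugin's cleanWs() step; the label is an assumption:

                pipeline {
                    agent { label 'ec2-linux' }   // hypothetical label
                    stages {
                        stage('Build') {
                            steps {
                                sh 'mvn -B clean verify'
                            }
                        }
                    }
                    post {
                        // Remove the workspace even on failure, so a reused agent does not
                        // accumulate checkouts and build outputs across builds.
                        cleanup {
                            cleanWs()
                        }
                    }
                }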

            markewaite Mark Waite added a comment -

            I implemented the workaround yesterday of limiting each AWS EC2 machine to a single build. The 8 GB ephemeral disc was not enough for some of the builds even with single use. Combine that with the penalty of 5 minutes or more to start and connect a new AWS EC2 machine, and single-use AWS virtual machines are quite painful unless we maintain a pool of available single-use machines. Eventually, we'll switch ci.jenkins.io to a Kubernetes and JCasC controlled environment and let Kubernetes handle the creation and destruction of agents.

            olblak Olivier Vernin added a comment -

            The disk space issue is one of the most common issues we have to face, and often for different reasons.

            > ephemeral disks with modest sizes

            The ec2 plugin doesn't allow us to set the size, so we can't use it for now.

            > Eventually, we'll switch ci.jenkins.io to a Kubernetes and JCasC controlled environment and allow Kubernetes

            Switching to JCasC is a must-have to allow everybody to review and fix issues, but updating every Pipeline to run on k8s instead of a virtual machine won't happen in one day.
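            For reference, moving a Pipeline from a VM label to a Kubernetes agent looks roughly like the sketch below (Kubernetes plugin). The pod spec and image are assumptions, not the eventual ci.jenkins.io setup:

                pipeline {
                    agent {
                        kubernetes {
                            // Hypothetical pod definition; real builds would pin an agreed image.
                            yaml '''
                                apiVersion: v1
                                kind: Pod
                                spec:
                                  containers:
                                  - name: maven
                                    image: maven:3.6-jdk-8
                                    command: ["sleep"]
                                    args: ["infinity"]
                            '''
                            defaultContainer 'maven'
                        }
                    }
                    stages {
                        stage('Build') {
                            steps {
                                sh 'mvn -B clean verify'
                            }
                        }
                    }
                }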


              People

              Assignee:
              olblak Olivier Vernin
              Reporter:
              jglick Jesse Glick
              Votes:
              0
              Watchers:
              3

                Dates

                Created:
                Updated:
                Resolved: