Hi,

      I need to poll hundreds of (fat) Git repos for changes and I can't use webhooks for that.

      The problem: using 'dir() + checkout scmGit' inside a loop gives an unexpected result: a given iteration seems to start before the previous one has finished, so the .git directory of the current iteration is polluted with information from the previous one. Each checkout seems to run asynchronously.

      Consequence: with the pollSCM trigger, the next poll detects changes, because the last built revision recorded for each dir/Git repo has been filled with wrong data.

      This problem doesn't occur using parallel as each checkout runs in a separate thread/stage, but it's not an option right now for me.

       

      Code sample:

      stage('Checkout repos') {
          steps {
              script {
                  int size = reposMap.size()
      
                  reposMap.eachWithIndex {name, branches, index ->
                      println("\n######\n>> Checkout [${index}/${size}]: ${name}\nBranches:${branches}\n######")
      
                      for (branch in branches) {
                          println(">>> Starting Checkout of ${name} on ${branch}")
                          dir("${TOP_DIR}/${name}/${branch}") {
                              checkout scmGit(
                                  [
                                      branches: [[name: branch]],
                                      extensions: [
                                          cloneOption(depth: 1, noTags: true, shallow: true),
                                          pruneStaleBranch(),
                                          pruneTags(true)
                                      ],
                                      userRemoteConfigs: [[url: "${BASE_URL}/${name}", credentialsId: CRED_ID]]
                                  ]
                              )
                          }
                          println(">>> Ending Checkout of ${name} on ${branch}")
                      }
                  }
              }
          }
      } 

       

      And the output with comments

      ######
      >> Checkout [7/118]: my_project/my_repo_B
      Branches:[my_branch_B]
      ######
      [Pipeline] echo
      >>> Starting Checkout of my_project/my_repo_B on my_branch_B
      [Pipeline] dir
      Running in /opt/jenkins/workspace/test_poll_error/my_project/my_repo_B/my_branch_B
      [Pipeline] {
      [Pipeline] checkout
      The recommended git tool is: NONE
      using credential integration
      Cloning the remote Git repository
      Using shallow clone with depth 1
      Avoid fetching tags
      Avoid second fetch
      Checking out Revision c34c9426ff31a1832829b0ceda6df3d575ce3f0a (origin/my_branch_B)
      Commit message: "<BLANK>"
      First time build. Skipping changelog.
      [Pipeline] }
       > git config remote.origin.url https://my_server.com/scm/my_project/my_repo_A # timeout=10  // Here the previous iteration A writes in the current dir of B  
       > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
       > git rev-parse origin/my_branch_A^{commit} # timeout=10
       > git config core.sparsecheckout # timeout=10
       > git checkout -f 7f4423aabd9865ad2bdfd9708d31ef115ce8b239 # timeout=10  // Wrong revision for B
      [Pipeline] // dir
      [Pipeline] echo
      >>> Ending Checkout of my_project/my_repo_B on my_branch_B
      [Pipeline] echo
      
      ######
      >> Checkout [8/118]: my_project/my_repo_C
      Branches:[my_branch_C]
      ######
      [Pipeline] echo
      >>> Starting Checkout of my_project/my_repo_C on my_branch_C
      [Pipeline] dir
      Running in /opt/jenkins/workspace/test_poll_error/my_project/my_repo_C/my_branch_C
      [Pipeline] {
      [Pipeline] checkout
      The recommended git tool is: NONE
      using credential integration
      Cloning the remote Git repository
      Using shallow clone with depth 1
      Avoid fetching tags
      Cloning repository https://my_server.com/scm/my_project/my_repo_B     // Here the previous iteration B writes in the current iteration C 
       > git init /opt/jenkins/workspace/test_poll_error/my_project/my_repo_B/my_branch_B # timeout=10
      Fetching upstream changes from https://my_server.com/scm/my_project/my_repo_B
       > git --version # timeout=10
       > git --version # 'git version 2.25.1'
      using GIT_ASKPASS to set credentials
       > git fetch --no-tags --force --progress --depth=1 -- https://my_server.com/scm/my_project/my_repo_B +refs/heads/*:refs/remotes/origin/* # timeout=10
       > git config remote.origin.url https://my_server.com/scm/my_project/my_repo_B # timeout=10
       > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
       > git rev-parse origin/my_branch_B^{commit} # timeout=10
       > git config core.sparsecheckout # timeout=10
       > git checkout -f c34c9426ff31a1832829b0ceda6df3d575ce3f0a # timeout=10
      Cloning repository https://my_server.com/scm/my_project/my_repo_C
       > git init /opt/jenkins/workspace/test_poll_error/my_project/my_repo_C/my_branch_C # timeout=10
      Fetching upstream changes from https://my_server.com/scm/my_project/my_repo_C
       > git --version # timeout=10
       > git --version # 'git version 2.25.1'
      using GIT_ASKPASS to set credentials
       > git fetch --no-tags --force --progress --depth=1 -- https://my_server.com/scm/my_project/my_repo_C +refs/heads/*:refs/remotes/origin/* # timeout=10
      Avoid second fetch
      Checking out Revision 6b299692657347e89b3249303cc3ad644a306400 (origin/my_branch_C)
       > git config remote.origin.url https://my_server.com/scm/my_project/my_repo_C # timeout=10
       > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
       > git rev-parse origin/my_branch_C^{commit} # timeout=10
       > git config core.sparsecheckout # timeout=10
       > git checkout -f 6b299692657347e89b3249303cc3ad644a306400 # timeout=10
      Commit message: "<BLANK>"
      First time build. Skipping changelog.
      [Pipeline] }
      [Pipeline] // dir
      [Pipeline] echo
      >>> Ending Checkout of my_project/my_repo_C on my_branch_C
      [Pipeline] echo
      

       

          [JENKINS-72713] checkout scm in loop seems async

          Markus Winter added a comment - - edited

          Instead of 

          for (branch in branches) {

          can you try

          branches.each { branch ->

          Mark Waite added a comment -

          As mawinter69 noted, the for construct in Jenkins Pipeline surprises many people. Try the each construct. My test without using a loop was enough to confirm for me that checkout is synchronous. I tested with:

          pipeline {
              agent {
                  label '!windows'
              }
              stages {
                  stage('checkout') {
                      steps {
                          dir('git-client-plugin') {
                              checkout scmGit(branches: [[name: 'master']],
                                              userRemoteConfigs: [[url: 'https://github.com/jenkinsci/git-client-plugin.git']])
                              sh 'git remote -v'
                          }
                          dir('git-plugin') {
                              checkout scmGit(branches: [[name: 'master']],
                                              userRemoteConfigs: [[url: 'https://github.com/jenkinsci/git-plugin.git']])
                              sh 'git remote -v'
                          }
                          dir('implied-labels-plugin') {
                              checkout scmGit(branches: [[name: 'master']],
                                              userRemoteConfigs: [[url: 'https://github.com/jenkinsci/implied-labels-plugin.git']])
                              sh 'git remote -v'
                          }
                      }
                  }
              }
          }
          

          Lionel added a comment - - edited

          I have tried both 'each' and 'for', with the same result.

          markewaite, with a few small repos there is no error. I'm seeing the issue with hundreds of repos that have very large codebases.

          Mark Waite added a comment -

          Then I suspect there is some other flaw in your script. I ran the following with a loop over a map of repository names and repository URLs:

          pipeline {
              agent {
                  label '!windows'
              }
              stages {
                  stage('checkout') {
                      steps {
                          cleanWs()
                          script {
                              def reposMap = [ 
                                  'git-client-plugin':'https://github.com/jenkinsci/git-client-plugin.git',
                                  'git-plugin':'https://github.com/jenkinsci/git-plugin.git',
                                  'implied-labels-plugin':'https://github.com/jenkinsci/implied-labels-plugin.git',
                              ]
                              reposMap.each { name, url ->
                                  dir(name) {
                                      checkout scmGit(branches: [[name: 'master']], 
                                                      userRemoteConfigs: [[url: url]])
                                      sh 'pwd ; git remote -v'
                                  }
                              }
                          }
                      }
                  }
              }
          }
          

          Lionel added a comment -

          markewaite 

          You tried 3 small repos... As I said, the same base code works when using parallel instead of a loop, but the stage view is then unusable.

          Try with all the repos defined in an AOSP manifest file, such as https://android.googlesource.com/platform/manifest/+/refs/heads/main/default.xml, and you will see the problem.

          Mark Waite added a comment -

          lnlrbr I'm sorry, but I'm not willing to debug your script.

          I've seen no indication that repository size is related to the synchronous execution of the checkout scm Pipeline step. I have many Pipelines that depend on the synchronous execution of the checkout scm step. I've seen no reports from any other user that checkout scm is executed asynchronously. If checkout scm were not synchronous, there would have been many, many bug reports.

          Markus Winter added a comment - - edited

          I think this is a problem of delayed transfer of logs from the checkout step to the controller.
          I ran the same test and added a check after each checkout that a file from the repo exists. The check always passed, even though the logs appeared jumbled (I used a small repo so the checkout is fast).
          When you look at a specific step by going to "Pipeline steps" and selecting a checkout step, all the logs for that specific checkout are there.

          AFAIK logs are not transferred instantly from an agent to the controller; this happens in chunks every second or so.

          But an echo step (or println which is mapped to echo) is executed on the controller and then there is no delay in the logs. This makes it appear as if things are executed in parallel but this is not the case.
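
          A minimal sketch of such an on-disk check, for anyone who wants to confirm this independently of the console log order. It is not the exact test used above: the two-entry map, the 'master' branch, and the use of 'git config --get remote.origin.url' as the check are assumptions for illustration, and it assumes a Unix-like agent with git on the PATH.

```groovy
pipeline {
    agent any
    stages {
        stage('Checkout and verify') {
            steps {
                script {
                    // Hypothetical sample map; names and URLs reuse repos mentioned in this thread.
                    def repos = [
                        'git-client-plugin': 'https://github.com/jenkinsci/git-client-plugin.git',
                        'git-plugin'       : 'https://github.com/jenkinsci/git-plugin.git',
                    ]
                    repos.each { name, url ->
                        dir(name) {
                            checkout scmGit(branches: [[name: 'master']],
                                            userRemoteConfigs: [[url: url]])
                            // Read back what is actually configured on disk,
                            // so the result does not depend on log ordering.
                            def actual = sh(returnStdout: true,
                                            script: 'git config --get remote.origin.url').trim()
                            if (actual != url) {
                                error("Workspace ${name} points at ${actual}, expected ${url}")
                            }
                        }
                    }
                }
            }
        }
    }
}
```

          If checkouts really overlapped, the error step should fail the build at the first directory whose remote does not match.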

          Markus Winter added a comment - - edited

          btw when enabling timestamps one can see that the logs are not in the correct order

          16:07:33  + echo '--- Starting Checkout of repo_1 on branch_0'
          16:07:33  --- Starting Checkout of repo_1 on branch_0
          16:07:33  [Pipeline] dir
          16:07:33  Running in /data/jenkins/workspace/_pipeline_test/repo_1/branch_0
          16:07:33  [Pipeline] {
          16:07:33  [Pipeline] checkout
          16:07:33  The recommended git tool is: NONE
          16:07:33  using credential ******
          16:07:33  Cloning the remote Git repository
          16:07:34  Avoid second fetch
          16:07:34  Checking out Revision 784c15b0c2511c3910f6349be9793e91a0b8e1e3 (refs/remotes/origin/master)
          16:07:34  Commit message: "Merge branch 'test'"
          16:07:34  [Pipeline] sh
          16:07:34  + echo '--- Ending Checkout of repo_1 on branch_0'
          16:07:34  --- Ending Checkout of repo_1 on branch_0
          16:07:34  [Pipeline] fileExists
          16:07:34  [Pipeline] sh
          16:07:34  + echo 'All good the checkout was successful'
          16:07:34  All good the checkout was successful
          16:07:34  [Pipeline] }
          16:07:34  [Pipeline] // dir
          16:07:34  [Pipeline] sh
          16:07:33  Cloning repository ssh://<url>
          16:07:33   > git init /data/jenkins/workspace/_pipeline_test/repo_1/branch_0 # timeout=10
          16:07:33  Fetching upstream changes from ssh://<url>
          16:07:33   > git --version # timeout=10
          16:07:33   > git --version # 'git version 2.33.0'
          16:07:33  using GIT_SSH to set credentials ****** read user
          16:07:33  Verifying host key using known hosts file

          Lionel added a comment -

          I'm sorry to have bothered you; indeed, it's not related to async checkout.

          After testing, it seems that the problem comes from using more than one branch per repo.

          Based on your example and with this Jenkinsfile:

           

          BASE_URL='https://github.com/jenkinsci'
          
          // Where to checkout repos
          TOP_DIR="pollSCM"
          
          reposMap = [
              'git-client-plugin.git': ['master', 'stable-2.x'],
              'jenkins.git': ['master', 'stable-2.190'],
              'git-plugin.git': ['master']
          ]
          
          pipeline {
          
              agent {label 'Docker'}
              triggers {pollSCM('*/5 * * * *')}
          
              stages {
                  stage('Checkout repos') {
                      steps {
                          script {
                              int size = reposMap.size()
          
                              // Launch series jobs
                              reposMap.eachWithIndex {name, branches, index ->
                                  println("\n######\n>> Checkout [${index+1}/${size}]: ${name}\nBranches:${branches}\n######")
          
                                  branches.each { branch ->
                                      dir("${TOP_DIR}/${name}/${branch}") {
                                          checkout scmGit(
                                              branches: [[name: branch]],
                                              extensions: [
                                                  cloneOption(depth: 1, noTags: true, shallow: true),
                                                  pruneStaleBranch(),
                                                  pruneTags(true)
                                              ],
                                              userRemoteConfigs: [[url: "${BASE_URL}/${name}"]]
                                          )
                                      }
                                  }
                              }
                          }
                      }
                  }
          
                  stage('Build') {
                      steps {
                          echo 'Done'
                      }
                  }
          
              }
          } 

           

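          There is a mismatch between the two branches of git-client-plugin.git (same for jenkins.git), as seen in the polling log: when polling refs/heads/master, the last built revision reported is the one from origin/stable-2.x: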

          Using strategy: Default
          [poll] Last Built Revision: Revision 37fa961f57f71fa567a972c9e6e676fed8a73e20 (origin/stable-2.x)
          The recommended git tool is: NONE
          No credentials specified
           > git --version # timeout=10
           > git --version # 'git version 2.25.1'
           > git ls-remote -h -- https://github.com/jenkinsci/git-client-plugin.git # timeout=10
          Found 10 remote heads on https://github.com/jenkinsci/git-client-plugin.git
          [poll] Latest remote head revision on refs/heads/master is: 394e59405da00912b29b66a10d8516067cdfd941
          Using strategy: Default
          [poll] Last Built Revision: Revision 37fa961f57f71fa567a972c9e6e676fed8a73e20 (origin/stable-2.x)
          The recommended git tool is: NONE
          No credentials specified
           > git --version # timeout=10
           > git --version # 'git version 2.25.1'
           > git ls-remote -h -- https://github.com/jenkinsci/git-client-plugin.git # timeout=10
          Found 10 remote heads on https://github.com/jenkinsci/git-client-plugin.git
          [poll] Latest remote head revision on refs/heads/stable-2.x is: 37fa961f57f71fa567a972c9e6e676fed8a73e20 - already built by 2 

           

           

           

          Markus Winter added a comment -

          Can you repeat that test with timestamps enabled? And maybe add a sleep of 2s after each checkout step. I think then you will see that everything is serial.
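
          A rough sketch of what that could look like, assuming the Timestamper plugin is installed (it provides the timestamps() option). The shortened map and the hard-coded 'pollSCM' directory and GitHub base URL are taken from the Jenkinsfile above purely for illustration.

```groovy
pipeline {
    agent any
    options {
        timestamps()   // prefixes every console line with a time (Timestamper plugin)
    }
    stages {
        stage('Checkout repos') {
            steps {
                script {
                    // Shortened version of the map from the Jenkinsfile above.
                    def reposMap = [
                        'git-client-plugin.git': ['master', 'stable-2.x'],
                        'git-plugin.git'       : ['master'],
                    ]
                    reposMap.each { name, branches ->
                        branches.each { branch ->
                            dir("pollSCM/${name}/${branch}") {
                                checkout scmGit(
                                    branches: [[name: branch]],
                                    userRemoteConfigs: [[url: "https://github.com/jenkinsci/${name}"]]
                                )
                            }
                            // Pause between checkouts; the assumption is that this gives
                            // delayed agent logs time to reach the controller.
                            sleep(time: 2, unit: 'SECONDS')
                        }
                    }
                }
            }
        }
    }
}
```

          With the pause in place, the timestamps of one checkout's git commands should all fall before the start of the next iteration if execution is serial.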

            Assignee: Unassigned
            Reporter: Lionel (lnlrbr)