  1. Infrastructure
  2. INFRA-1447

https://repo.jenkins-ci.org/ is slow and/or unreachable

      Description

      Reported this morning by Oleg Nenashev

      {quote}

      [1:33 PM] Oleg Nenashev: @here could somebody with Jenkins INFRA access permissions take a look at https://repo.jenkins-ci.org/ performance? It blocks PCT tests which I need to perform as a part of JEP-200
      [1:34 PM] Oleg Nenashev: Merry Christmas though
      [1:34 PM] Oleg Nenashev: Likely nobody is online, CC @aheritier just in case

      {quote}

          Activity

          aheritier Arnaud Héritier created issue -
          danielbeck Daniel Beck added a comment - - edited

          Artifactory was restarted half an hour ago (2017-12-26 13:13:47,671)

          It seems to be down again now. https://repo.jenkins-ci.org/api/storage/releases/.index/nexus-maven-repository-index.gz returns:

          {
            "errors" : [ {
              "status" : 500,
              "message" : "Unable to connect to Access server: Read timed out"
            } ]
          }
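A minimal sketch of how a client could pull the status and message out of an error body shaped like the one above. The `errors` / `status` / `message` layout is taken from this one response only; `access_errors` is a hypothetical helper name, and other error responses may be shaped differently (for example an HTML 502 page from a proxy).

```python
import json


def access_errors(body: str) -> list:
    """Return (status, message) pairs from an error body shaped like the one above.

    Assumption: the body is a JSON object with an "errors" list, as in the
    single sample quoted in this issue. Anything else yields an empty list.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Not JSON at all, e.g. an HTML error page.
        return []
    if not isinstance(payload, dict):
        return []
    return [
        (entry.get("status"), entry.get("message"))
        for entry in payload.get("errors", [])
        if isinstance(entry, dict)
    ]


# The response body quoted in the comment above.
SAMPLE = """{
  "errors" : [ {
    "status" : 500,
    "message" : "Unable to connect to Access server: Read timed out"
  } ]
}"""
```

Calling `access_errors(SAMPLE)` gives one pair with status 500 and the "Read timed out" message; a non-JSON body gives an empty list, so the caller can fall back to the HTTP status code.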
          danielbeck Daniel Beck added a comment - - edited

          It's accessible again, but lots of messages like this are in the artifactory.log:

          org.apache.http.conn.HttpHostConnectException: Connect to acolshared1b.gcoss-use1.jfrog.net:8343 [acolshared1b.gcoss-use1.jfrog.net/10.250.30.41] failed: Connection refused (Connection refused)
          
          2017-12-26 13:36:42,220 [http-nio-8083-exec-232] [ERROR] (o.a.r.d.DbStoringRepoMixin:264) - IO error while trying to save resource public:com/mapbox/mapboxsdk/mapbox-android-telemetry/2.2.2/mapbox-android-telemetry-2.2.2.pom'': org.apache.http.conn.HttpHostConnectException: Connect to acolshared1b.gcoss-use1.jfrog.net:8343 [acolshared1b.gcoss-use1.jfrog.net/10.250.30.41] failed: Connection refused (Connection refused)

          Exceptions started at log time 2017-12-26 13:32:13,521 at least through 2017-12-26 13:37:07,541

          danielbeck Daniel Beck added a comment -

          Another restart:

          2017-12-26 13:42:08,505 [art-init] [INFO ] (o.a.w.s.ArtifactoryContextConfigListener:282) -
                          _   _  __           _                      _____ _                 _
               /\        | | (_)/ _|         | |                    / ____| |               | |
              /  \   _ __| |_ _| |_ __ _  ___| |_ ___  _ __ _   _  | |    | | ___  _   _  __| |
             / /\ \ | '__| __| |  _/ _` |/ __| __/ _ \| '__| | | | | |    | |/ _ \| | | |/ _` |
            / ____ \| |  | |_| | || (_| | (__| || (_) | |  | |_| | | |____| | (_) | |_| | (_| |
           /_/    \_\_|   \__|_|_| \__,_|\___|\__\___/|_|   \__, |  \_____|_|\___/ \__,_|\__,_|
                                                             __/ |
           Revision: 50603900                               |___/
           Artifactory Home: '/data/aolback/homes/jenkinsci'
          
          2017-12-26 13:42:08,507 [art-init] [WARN ] (o.a.f.l.ArtifactoryLockFile:65) - Found existing lock file. Artifactory was not shutdown properly. [/data/aolback/homes/jenkinsci/data/.lock]
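
The startup times tracked in this issue can be pulled out of artifactory.log mechanically. A minimal sketch, assuming the line layout shown in the excerpt above and treating any `[art-init]` line from `ArtifactoryContextConfigListener` as a startup marker. That is an inference from this excerpt, not documented Artifactory behaviour, and `find_startups` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: timestamp [thread] [LEVEL] (logger:line) - message
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[art-init\] \[(\w+)\s*\] \(([^)]+)\) - ?(.*)$"
)


def find_startups(lines):
    """Return one dict per detected startup: {"time": datetime, "unclean": bool}."""
    startups = []
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # banner art, stack traces, other threads
        stamp, _level, logger, message = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
        if "ArtifactoryContextConfigListener" in logger:
            startups.append({"time": when, "unclean": False})
        elif (
            "ArtifactoryLockFile" in logger
            and "not shutdown properly" in message
            and startups
        ):
            # The lock-file warning follows the banner of the same startup.
            startups[-1]["unclean"] = True
    return startups
```

Run over the whole log (for example `find_startups(open("artifactory.log", encoding="utf-8"))`), this would list each restart and whether it followed an unclean shutdown.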
          danielbeck Daniel Beck added a comment -

          Since 2017-12-26 14:34:26,279, exceptions appeared again:

          2017-12-26 14:34:26,279 [http-nio-8083-exec-621] [ERROR] (o.a.r.d.DbStoringRepoMixin:271) - Couldn't save resource, reason:
          org.artifactory.concurrent.LockingException: Lock on LockEntryId public:co/cask/cdap/cdap-ui/3.4.3/cdap-ui-3.4.3.pom not acquired in 120 seconds. Lock info: org.artifactory.storage.fs.lock.provider.JVMLockWrapper@29f00ef9.
          danielbeck Daniel Beck added a comment -

          Latest Artifactory startup at 2017-12-26 18:09:09,638

          danielbeck Daniel Beck added a comment -

          Latest startup at 2017-12-26 19:03:14,329

          danielbeck Daniel Beck added a comment -

          Startup at 2017-12-26 19:25:20,033

          danielbeck Daniel Beck added a comment -

          Startup at 2017-12-26 19:45:16,725

          danielbeck Daniel Beck added a comment -

          Startup at 2017-12-26 20:07:51,922

          danielbeck Daniel Beck made changes -
          Field Original Value New Value
          Priority Minor [ 4 ] Blocker [ 1 ]
          danielbeck Daniel Beck added a comment -

          Multiple emails to JFrog about this in http://lists.jenkins-ci.org/pipermail/jenkins-infra/2017-December/thread.html
          http://lists.jenkins-ci.org/pipermail/jenkins-infra/2017-December/001343.html
          http://lists.jenkins-ci.org/pipermail/jenkins-infra/2017-December/001346.html

          tiffany_loon Tiffany Loon added a comment - - edited

          Would it have anything to do with this? http://status.artifactoryonline.com/incidents/d7rz689dlhz2

          danielbeck Daniel Beck added a comment - - edited

          Tiffany Loon Possible, although our problems started ~9 hours before their initial status update (21:19 UTC vs. initially reported by Oleg Nenashev at 12:03 UTC in #jenkins-infra). They had another incident earlier today for ~6 hours, seems like both might have affected us.

          However, the status page says the problems were resolved an hour ago, but they appear to be ongoing on our instance, even though we have suspended all regular services that create even minimal load (update site generation and kicking the Maven indexer).

          oleg_nenashev Oleg Nenashev added a comment -

          It seems the instance is still unstable.
          I tried running a fresh build about 10 times. In most cases the build fails with a 502 from Artifactory or with a missing artifact.

          This issue seems to be reproducible:

          [ERROR] Failed to execute goal on project job-restrictions: Could not resolve dependencies for project com.synopsys.arc.jenkinsci.plugins:job-restrictions:hpi:0.7-SNAPSHOT: The following artifacts could not be resolved: org.jenkins-ci.plugins:workflow-step-api:jar:2.14, org.jenkins-ci.plugins:ace-editor:jar:1.1, org.jenkins-ci.plugins:jquery-detached:jar:1.2.1, org.jenkins-ci.plugins:workflow-job:jar:2.16, org.jenkins-ci.plugins:workflow-cps:jar:2.42, org.jenkins-ci.plugins:workflow-support:jar:2.16, org.jenkins-ci.plugins:workflow-scm-step:jar:2.6, org.jenkins-ci.plugins:workflow-api:jar:2.24: Could not find artifact org.jenkins-ci.plugins:workflow-step-api:jar:2.14 in repo.jenkins-ci.org (https://repo.jenkins-ci.org/public/) -> [Help 1]
          
          danielbeck Daniel Beck added a comment -

          Restart at 2017-12-27 08:09:19,382

          Another restart just now at 2017-12-27 09:41:17,432


          JFrog responded with a canned response pointing to http://status.artifactoryonline.com/incidents/d7rz689dlhz2 and saying:

          Thank you for contacting JFrog Support.
          We have experienced performance difficulties with some servers, including jenkins-ci. Please be advised that in the meantime, to mitigate this, we have blocked the directory browsing for https://repo.jenkins-ci.org/public. You may still use this repository for uploading and downloading files. In addition, we improved performance related to Access service.
          repo.jenkins-ci.org is currently up and running.

          danielbeck Daniel Beck added a comment -

          I sent an email to JFrog support: http://lists.jenkins-ci.org/pipermail/jenkins-infra/2017-December/001350.html

          oleg_nenashev Oleg Nenashev added a comment -

          FYI Raul Arabaolaza

          danielbeck Daniel Beck added a comment -

          Artifactory restarted at 2017-12-27 13:32:57,130

          aflat aflat added a comment -

          Restart hasn't fixed the issue.

          tiffany_loon Tiffany Loon added a comment -

          It was working after the restart, but stopped working an hour or two in, around 11:20 EST.

          tiffany_loon Tiffany Loon added a comment -

          Looks like it is up again and working fast.

          aflat aflat added a comment -

          Yup, looks better now.

          danielbeck Daniel Beck added a comment -

          A period of errors and delays in responses from Wed Dec 27 19:16:36 CET 2017 to Wed Dec 27 22:57:31 CET 2017.

          The log level is so high that I can no longer find the time of the last restart in artifactory.log.

          danielbeck Daniel Beck added a comment - - edited

          Artifactory seems to be holding up well since I posted the previous comment. We have regular Maven index updates every 100 minutes or so, and there have been no outages I could confirm (a single request of those I send once per minute failed, but that could have been my connection).
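
A once-per-minute availability check like the one mentioned above could look roughly like this. It is a sketch only: the issue does not say which URL or tool was actually used, so the URL in the usage note, the 10 second timeout, and the helper names `probe` and `monitor` are all assumptions.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Issue one GET and return (ok, detail); never raises on network errors."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 500 or 502.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # URLError, timeouts and connection failures are all OSError subclasses.
        return False, f"unreachable: {exc}"
    elapsed = time.monotonic() - started
    return 200 <= status < 300, f"HTTP {status} in {elapsed:.2f}s"


def monitor(url, interval=60.0, rounds=3, sleep=time.sleep, probe_fn=probe):
    """Probe `url` `rounds` times, pausing `interval` seconds between probes."""
    results = []
    for index in range(rounds):
        ok, detail = probe_fn(url)
        results.append((ok, detail))
        print(time.strftime("%Y-%m-%d %H:%M:%S"), "OK  " if ok else "FAIL", detail)
        if index + 1 < rounds:
            sleep(interval)
    return results
```

For example, `monitor("https://repo.jenkins-ci.org/", interval=60, rounds=100)` would cover roughly one Maven index cycle. A single failed probe, as noted above, may just reflect the client's own connection, so isolated failures are worth confirming before being counted as outages.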

          danielbeck Daniel Beck made changes -
          Resolution Fixed [ 1 ]
          Status Open [ 1 ] Resolved [ 5 ]
          oleg_nenashev Oleg Nenashev added a comment -

          It works well for me. I was able to release some plugins, with more to come. Thanks a lot!

          aheritier Arnaud Héritier added a comment -

          From JFrog

          Hi Daniel,

          Thank you for your patience. We want to share some of our findings with you. We have noticed that the load on your server has increased, and we are still looking into the exact root cause. We do know with certainty that the load is related to what appears to be a script that runs on your side, that updates/creates Permission Targets via the api/security/permissions endpoint, using a user with username 'permission-updater'. We've noticed that this script is running every 30 minutes, and this, along with repository browsing that happens constantly in the background, causes the load on your server to spike every 30 minutes. We are still looking into the root cause of the increase in load, to see if it's related to a change in one of the recent versions of Artifactory. 

          In the meantime, can you kindly elaborate on the use-case behind updating the permission targets every 30 minutes? additionally, if possible, please consider staggering the executions of this script as a short-term remediation strategy, at least until we complete our RCA.    

          Best regards,
          Uriah Levy
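
The staggering suggested in the email could be sketched as follows. This is purely illustrative: the issue does not show how the permission update script works, so the idea of splitting its work into `count` separate API calls, the helper name `staggered_offsets`, and the jitter parameter are all assumptions.

```python
import random


def staggered_offsets(count, window_seconds, jitter_seconds=0.0, rng=None):
    """Return start offsets (seconds) spreading `count` tasks over one window.

    Instead of issuing every permission update in one burst at the start of
    each 30 minute cycle, each update gets its own slot within the window.
    Optional random jitter keeps the slots from lining up with other
    periodic load on the server.
    """
    if count <= 0:
        return []
    rng = rng or random.Random()
    step = window_seconds / count
    offsets = []
    for index in range(count):
        jitter = rng.uniform(0.0, jitter_seconds) if jitter_seconds else 0.0
        # Clamp so that no task slips past the end of the window.
        offsets.append(min(index * step + jitter, window_seconds))
    return offsets
```

With `staggered_offsets(3, 1800)` the three tasks would start at 0, 600 and 1200 seconds into the 30 minute window rather than all at once.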

           

          aheritier Arnaud Héritier added a comment -

          Daniel Beck it seems that our update permissions script could be a part of the issue ...

          aheritier Arnaud Héritier made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          rtyler R. Tyler Croy added a comment -

          Going to go ahead and close this since normal service appears to have resumed

          rtyler R. Tyler Croy made changes -
          Resolution Fixed [ 1 ]
          Status Reopened [ 4 ] Closed [ 6 ]

            People

            Assignee:
            danielbeck Daniel Beck
            Reporter:
            aheritier Arnaud Héritier
            Votes:
            3
            Watchers:
            6
