Details
- Type: Epic
- Status: In Progress
- Priority: Minor
- Resolution: Unresolved
- Component/s: etc
- Labels:
- Epic Name: Infra Contributor UX Revamp 2019Q4
Description
Original list from R. Tyler Croy, which needs to be reviewed and implemented.
Incidents/alerts for which "the right way to handle" needs to be documented:
- Jenkins
- not responding to requests/high CPU
- Inspecting for slow requests, restarting Jenkins properly
- Upgrading plugins/restarting to pick up new core changes
- ideally also how the jenkinsci docker org image creation works in trusted.ci as precondition for core security updates?
- trusted-ci
- Agents have stuck pipelines and don't appear to do anything – Docker daemon stuck, needs manual reboot
- Disk space issues:
- LDAP - prune old transaction logs
- eggplant - truncate old Apache logs
- celery, or other Jenkins agents
- ci.jenkins.io - the master has /var/lib/jenkins filling up
- also needs to be made into a proper alert (perhaps metrics plugin based?) rather than admin monitor on the UI
- Confluence
- Dealing with spammers:
- Delete the user
- Delete the pages
- Delete the cached pages
- undo the edits, etc
- Letsencrypt certificates expire 'soon': can be fixed by /etc/init.d/apache2 reload
- Mapping AWS instances from Datadog to actual hostnames that are usable
- Release/distribution architecture documentation
- Defining all the moving components related to the release and distribution of core and plugins
- How to perform a manual sync of mirrors
- Manual syncing for plugin specific updates
- Blacklist some mirrors
- Puppet
- Where is the Puppet dashboard?
- Figuring out when a Puppet agent is not responding properly (from Datadog)
- Running puppet manually
- Manually running an r10k deployment in the event that the webhooks from GitHub to puppet.jenkins.io fail
- Accounts App
- Processing account signup rejections
- Deleting spammers' accounts
- Kubernetes
- Where does it live
- how do you know it's healthy
- what do you do if a service (account-app, plugins, etc.) is not working properly
- How do you manually renew a Letsencrypt certificate for a Kubernetes-based application
- would perhaps be interesting to know what's not documented? known unknowns
- Document black boxes from the infra, e.g. the release process in KK's basement
- Legacy services which are "not managed"
- jenkins.ci.cloudbees.com
- Any "tyler-only" jobs?
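One possible starting point for the disk space items above (turning "/var/lib/jenkins filling up" into a check that can run outside the UI) is sketched here. This is only an illustration under assumptions: the 90% threshold and the function names `disk_usage_percent` and `check_paths` are hypothetical, and nothing here reflects how the infra project actually alerts. It uses only Python's standard `shutil.disk_usage`.

```python
import os
import shutil


def disk_usage_percent(path):
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_paths(paths, threshold_percent=90.0):
    """Return (path, percent) pairs whose usage is at or above the threshold."""
    alerts = []
    for path in paths:
        percent = disk_usage_percent(path)
        if percent >= threshold_percent:
            alerts.append((path, round(percent, 1)))
    return alerts


if __name__ == "__main__":
    # /var/lib/jenkins comes from the list above; the 90% threshold is an
    # assumed value for illustration, not an agreed alerting policy.
    candidates = ["/var/lib/jenkins"]
    existing = [p for p in candidates if os.path.exists(p)]
    for path, percent in check_paths(existing, threshold_percent=90.0):
        print(f"WARNING: {path} is {percent}% full")
```

A script along these lines could run from cron on each host, or its output could be fed to whatever monitoring is chosen. Whether that is a metrics-plugin-based alert, as suggested above, or something else is still to be decided.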
Issue Links
- is related to: INFRA-2401 Optimize ci.jenkins.io workload (Open)
R. Tyler Croy, Olivier Vernin: I will take over this epic if you do not mind.