
Support logging of full detail from master node and not using slaves to create logs

      As an engineer

      I want the option of having full CloudWatch logs created by the master node rather than the slave

      So that I can support deployment patterns where the master runs in a controlling account and slaves utilise an IAM role with very limited privilege, and with privilege ONLY over the target account (i.e. no access back to logs in the account containing the master server).

       

      Additional context:

      A pattern we are working with utilises EKS-based slaves that are given their roles through an OIDC connector into a target account. The associated role has limited access to the target account and NO access to the account where the Jenkins master is created. Our use case is better served by having the logs in this master account and having the master provide all the logging capability.

      I created a feature for the aws-cloudwatch-logs-publisher, noted here https://issues.jenkins.io/browse/JENKINS-65917, to remove ANSI escape sequences for readability because, for now, that plugin fits our deployment pattern better, although I would have preferred to contribute to this project.

          [JENKINS-65918] Support logging of full detail from master node and not using slaves to create logs

          Andy added a comment - - edited

          In our model the slave role, in the target account, does not have rights to connect back. There is an IAM role assumed through the use of k8s annotations that gives the slave access through OIDC connection and web federation into the target account from the master. We do not use AWS credentials at all with this config and purely rely upon IAM instance roles for the master and OIDC web assume role for the slave, so the slave operates purely in the target account with no access back to the account containing the CloudWatch log groups that the master has access to.

           

          For additional context, the role assumed by the slave is for infrastructure deployment, so it really needs the minimal roles that we assigned to it. We tested fallback to the Fargate instance role in the source account and that didn't work. If this had been a role on a Jenkins slave that needed no other IAM-based rights in the target account then possibly passing credentials could be fine. However, a) we can't for these reasons and b) this still allows the slave to operate on costed resources (logging) in the master account, so we feel this model is cleaner for us.

           

          It also seems of less value to spend the time creating a 3rd plugin to get the real feature (live streaming of events vs copy up in a stage) when it already exists in another plugin where you have done such a great job.

           

          I hope this clarifies.

          -Andy


          Andy added a comment - - edited

          Obviously this is a bigger delta than the story I played for the other plugin, and I thought there may be more of this kind of conversation to be had before considering attempting this feature.


          Jesse Glick added a comment -

          There is an IAM role assumed through the use of k8s annotations that gives the slave access […] into the target account from the master [and] we […] rely upon IAM instance roles for the master and OIDC web assume role for the slave

          So what I am saying is: remove all configuration granting any CWL-related rights to the agent. It is just some box with a JVM and access to the Internet. Just use this plugin, with the controller running with IAM instance identity sufficient to work with CWL. The agent should be able to connect to CWL with a very specific and restrictive policy using temporary credentials. If the agent happens to have some IAM roles in another account for an unrelated purpose, fine, should not matter for this plugin.


          Andy added a comment - - edited

          Which is what we can't do because the role in the target account is needed to grant enough policy to perform infrastructure deployments in that account.


          Jesse Glick added a comment -

          Hence my last sentence: the agent can have some unrelated role, perhaps in an unrelated account, for an unrelated reason. It should not matter because when it is sending data to CWL it should be using the temporary credentials it has been issued by the controller, with its own access key, secret access key, session token, even region, all produced on the controller using either AssumeRole or GetFederationToken according to the nature of the controller’s credentials. (Again, not something I have specifically tested, so possibly buggy.)
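          As an illustration of the "very specific and restrictive policy" idea, here is a minimal sketch of a policy document that permits writing to a single log stream and nothing else. This is not the plugin's actual code; the function name, region, account ID, log group and log stream names are all hypothetical. A document of this shape can be supplied as the Policy parameter of AssumeRole or GetFederationToken to scope down the temporary credentials.

```python
import json


def build_agent_log_policy(region, account_id, log_group, log_stream):
    """Return an IAM policy document allowing PutLogEvents on one log stream only.

    All argument values are supplied by the caller; nothing here is taken
    from the plugin itself.
    """
    stream_arn = (
        f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}"
        f":log-stream:{log_stream}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "logs:PutLogEvents",
                "Resource": stream_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    policy = build_agent_log_policy(
        "eu-west-1", "111111111111", "jenkins-builds", "example-stream"
    )
    print(json.dumps(policy, indent=2))
```

          Credentials scoped this way are independent of whatever role the agent pod holds for other purposes, which is the point being made above.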


          Andy added a comment - - edited

          Here's the complete setup. Account A will refer to the account where the Jenkins master operates along with the EKS cluster running the slaves; account B will refer to the account we are provisioning resources into, where we use a deployment role. Apologies as this is a bit lengthy.

          1. The master Jenkins server runs in account A using a limited IAM role which has access to the EKS cluster using the Jenkins k8s plugin.
          2. EKS nodes within nodegroups and Fargate profiles are configured with IAM permissions to be able to run Fluent Bit configurations and allow them to write their own logs to CloudWatch, which represent the API logs and console logs for the containers. This happens through regular IAM and is independent of roles assumed on pods. This is a standard Amazon logging configuration using Fluent Bit as a DaemonSet on the NodeGroup and the baked-in implementation of Fluent Bit offered on the Fargate workers.
          3. The AWS standard for giving k8s service accounts access to IAM roles is to create an OIDC connector between the target account (account B) and the k8s endpoint, not a local IAM role within account A at all. The role in account B is configured with an Assume Role policy that allows an AssumeRoleWithWebIdentity for the k8s user by name, through the OIDC connector only. This role is given the deployment permissions for account B.
          4. Service Accounts are configured within a namespace on EKS containing an 'eks.amazonaws.com/role-arn' annotation which specifies the role that the service account should take on.
          5. The podTemplate for the deployment job within Jenkins specifies the namespace where the pod is to run and the service account name.
          6. When the pod is launched, AWS calls (the SDK must have AssumeRoleWithWebIdentity support baked in, following a hint) automatically match the role from the service account annotation and connect to the target account. Being challenged for a token, the k8s endpoint issues the token to the OIDC connector, allowing the k8s user to assume the role. This happens somewhat 'automagically' once the configuration is right.
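
          To make steps 3 and 4 concrete, here is a rough sketch of the two pieces of configuration involved: the trust (Assume Role) policy on the role in account B and the annotation on the service account. The function names, account ID, OIDC provider path, namespace, service account and role names are all hypothetical placeholders, not values from our environment.

```python
import json


def build_web_identity_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Trust policy letting one k8s service account assume the role via OIDC.

    account_id is the account holding the IAM OIDC provider and the role
    (account B in the setup above); oidc_provider is the issuer host and
    path without the https:// prefix.
    """
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                # Restrict the role to the named service account only.
                "Condition": {"StringEquals": {f"{oidc_provider}:sub": subject}},
            }
        ],
    }


def service_account_annotation(role_arn):
    """Annotation placed on the k8s service account to select the IAM role."""
    return {"eks.amazonaws.com/role-arn": role_arn}


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    provider = "oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE0123456789"
    trust = build_web_identity_trust_policy(
        "222222222222", provider, "deployments", "jenkins-deployer"
    )
    print(json.dumps(trust, indent=2))
    print(service_account_annotation("arn:aws:iam::222222222222:role/example-deploy-role"))
```

          The role built this way carries only the deployment permissions for account B; nothing in it grants access to the log groups in account A.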

          Note that this configuration is not only needed for federated accounts. This is Amazon's model for assigning roles to EKS service accounts regardless, and in a single-account scenario the OIDC connector and the role would be in the same account. The mechanism is the same though: one uses AssumeRoleWithWebIdentity, and this happens automatically with the federated principal coming from k8s rather than IAM.

          In this configuration, either cross-account or within the same account, we get the same experience: the run completes and the master log is produced (lacking agent activity, as noted in script areas etc.), and log streams are created for the agent runs, but these are empty because the agent appears not to be able to write to them.

          Note that standard AWS Fluent Bit configurations produce logs perfectly both for the node group and the EKS cluster, but of course this does not offer job logging for Jenkins.


          Jesse Glick added a comment -

          this is a bit lengthy

          Indeed, and I only follow parts of it, but none of that should matter for purposes of this plugin. All that should matter is the controller’s credentials and its access to CWL, and physical connectivity from the agent pod to the CWL endpoint. The agent code sending to CWL deliberately ignores the default search path for AWS credentials.

          the agent appears not to be able to write to them

          So there is a bug, but I have not seen a diagnosis of the cause. Does the controller call AssumeRole or GetFederationToken to create agent credentials? Are there error messages logged in the agent JVM? Can you turn on some fine-level logging?

          It is probably straightforward to have a configuration option to disable serialized agent CWL credentials, causing agents to fall back to the system built into Jenkins core whereby log messages are streamed over the Remoting channel and then handled by the controller however it would handle other general messages. But this would be a workaround for a bug, and would degrade performance.


          Andy added a comment -

          I'll take a look jglick. I'll have to swing my test environment back to this plugin to try it out, as I'm not sure I would find anything sensible in CloudTrail from previous tests to confirm the exact behaviour, due to the volume of stuff we have going through our accounts.

           


          Andy added a comment - - edited

          I'm not spotting this at the moment. I haven't seen either message logged, and I have the logging set to FINEST for LogStreamState.

          I did manage to spot an agent exception (AccessDenied) on PutLogEvents but lost the message in the limited system console capability, so I'll do a bit more digging. I may add some temporary extra debug-level output to the authentication methods to see if I can track this down, but it's getting a bit O/T for this request.


          Andy added a comment - - edited

          TBH there's another issue, which I'll report separately, which means the agent logging isn't great at the moment. I think a story will help to outline that one: https://issues.jenkins.io/browse/JENKINS-65930

           

          Having logging purely in the master somewhat obviates this story, but in terms of creating the full picture, if folks prefer logging to be agent-based and we can solve the auth issue as well, then the use of the plugin seems more viable.


            jglick Jesse Glick
            iamasmith Andy