Configuration Anomaly Detection (CAD) is responsible for reducing manual SRE effort by pre-investigating alerts, detecting cluster anomalies and sending relevant communications to the cluster owner.
To contribute to CAD, please see our CONTRIBUTING Document.
- cadctl -- Performs investigation workflow.
Every alert managed by CAD corresponds to an investigation, which is the code CAD executes for that alert.
Investigation-specific documentation can be found in the corresponding investigation folder, e.g. for ClusterHasGoneMissing.
- AWS -- Logging into the cluster, retrieving instance info and AWS CloudTrail events.
- PagerDuty -- Retrieving alert info, escalating or silencing incidents, and adding notes.
- OCM -- Retrieving cluster info, sending service logs, and managing (post, delete) limited support reasons.
- osd-network-verifier -- Tool to verify the pre-configured networking components for ROSA and OSD CCS clusters.
- CAD is a command line tool that is run in Tekton pipelines.
- The Tekton service is running on an app-sre cluster.
- CAD is triggered by PagerDuty webhooks configured on selected services, meaning that every alert in one of those services triggers a CAD pipeline.
- CAD uses the data received via the webhook to determine which investigation to start.
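
The selection step can be sketched as follows. This is a minimal illustration, not CAD's actual implementation: the payload shape (`event.data.title`), the `INVESTIGATIONS` registry, and the `select_investigation` helper are assumptions made for the example. Only the investigation name ClusterHasGoneMissing comes from this document.

```python
import json
from typing import Optional

# Hypothetical registry mapping an alert-title marker to an investigation name.
# Only ClusterHasGoneMissing is taken from this README; the lookup scheme is assumed.
INVESTIGATIONS = {
    "ClusterHasGoneMissing": "ClusterHasGoneMissing",
}


def select_investigation(webhook_body: str) -> Optional[str]:
    """Return the investigation matching the alert title, or None if none matches."""
    payload = json.loads(webhook_body)
    # Assumed payload layout: {"event": {"data": {"title": "<alert title>"}}}
    title = payload.get("event", {}).get("data", {}).get("title", "")
    for marker, investigation in INVESTIGATIONS.items():
        if marker in title:
            return investigation
    return None
```

Under these assumptions, an alert whose title contains no known marker yields `None`, which a caller could treat as "no investigation available, leave the incident untouched".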
- Update-Template -- Updating configuration-anomaly-detection-template.Template.yaml.
- OpenShift -- Used by app-interface to deploy the CAD resources on a target cluster.
Grafana dashboard configmaps are stored in the Dashboards directory. See app-interface for further documentation on dashboards.
- Tekton -- Installation/configuration of Tekton and triggering pipeline runs.
- Skip Webhooks -- Skipping the EventListener and creating the PipelineRun directly.
- Namespace -- Allowing the code to ignore the namespace.
- Boilerplate -- Conventions for OSD containers.
- PipelinePruner -- Documentation about PipelineRun pruning.
- CAD_OCM_CLIENT_ID -- the OCM client ID used by CAD to initialize the OCM client.
- CAD_OCM_CLIENT_SECRET -- the OCM client secret used by CAD to initialize the OCM client.
- CAD_OCM_URL -- the OCM URL used by CAD to initialize the OCM client.
- AWS_ACCESS_KEY_ID -- the access key ID of the base AWS account used by CAD.
- AWS_SECRET_ACCESS_KEY -- the secret access key of the base AWS account used by CAD.
- CAD_AWS_CSS_JUMPROLE -- the ARN of the RH-SRE-CCS-Access jumprole.
- CAD_AWS_SUPPORT_JUMPROLE -- the ARN of the RH-Technical-Support-Access jumprole.
- CAD_ESCALATION_POLICY -- the escalation policy CAD should use to escalate an incident.
- CAD_PD_EMAIL -- the email for a login via mail/pw credentials.
- CAD_PD_PW -- the password for a login via mail/pw credentials.
- CAD_PD_TOKEN -- the generated private access token for token-based authentication.
- CAD_PD_USERNAME -- the username of CAD on PagerDuty.
- CAD_SILENT_POLICY -- the silent policy CAD should use to silence an incident.
- PD_SIGNATURE -- the PagerDuty webhook signature (HMAC+SHA256).
- X_SECRET_TOKEN -- our custom secret token for authenticating against our pipeline.
- CAD_PROMETHEUS_PUSHGATEWAY -- the URL CAD will push metrics to.
For Red Hat employees, these environment variables can be found in the SRE-P vault.
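
Before running CAD locally, it can help to confirm that the variables listed above are set. The following is a minimal sketch of such a check. The `REQUIRED_VARS` tuple and the `missing_variables` helper are hypothetical, and treating every variable as required is an assumption: the email/password and token variables, for instance, may be alternative ways to authenticate.

```python
import os
from typing import List, Mapping

# Names copied from the list above; treating all of them as required is an assumption.
REQUIRED_VARS = (
    "CAD_OCM_CLIENT_ID",
    "CAD_OCM_CLIENT_SECRET",
    "CAD_OCM_URL",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "CAD_AWS_CSS_JUMPROLE",
    "CAD_AWS_SUPPORT_JUMPROLE",
    "CAD_ESCALATION_POLICY",
    "CAD_PD_EMAIL",
    "CAD_PD_PW",
    "CAD_PD_TOKEN",
    "CAD_PD_USERNAME",
    "CAD_SILENT_POLICY",
    "PD_SIGNATURE",
    "X_SECRET_TOKEN",
    "CAD_PROMETHEUS_PUSHGATEWAY",
)


def missing_variables(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names from REQUIRED_VARS that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
```

Passing the environment as a parameter keeps the check easy to exercise with a plain dictionary instead of the real process environment.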