---
apiVersion: indicatorprotocol.io/v1
kind: IndicatorDocument
metadata:
  labels:
    deployment: <%= spec.deployment %>
    component: auctioneer
spec:
  product:
    name: diego
    version: latest
  indicators:
  - name: auctioneer_lrp_auctions_failed
    promql: rate(AuctioneerLRPAuctionsFailed{source_id="auctioneer"}[5m]) * 60
    documentation:
      title: Auctioneer - App Instance (AI) Placement Failures
      description: |
        The number of Long Running Process (LRP) instances that the auctioneer failed to place on Diego cells. This metric is cumulative over the lifetime of the auctioneer job.
        Use: This metric can indicate that Diego is out of container space or that there is a lack of resources within your environment. This indicator also increases when the LRP is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work.
        This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no app instances being scheduled.
        This error is most commonly due to capacity issues, for example, if cells do not have enough resources, or if cells are going back and forth between a healthy and unhealthy state.
        Origin: Firehose
        Type: Counter (Integer)
        Frequency: During each auction
      recommended_alert_thresholds: |
        Yellow warning: ≥ 0.5
        Red critical: ≥ 1
      recommended_response: |
        1. To best determine the root cause, examine the Auctioneer logs. Depending on the specific error and resource constraint, you may also find a failure reason in the Cloud Controller (CC) API.
        2. Investigate the health of your Diego cells to determine if they are the resource type causing the problem.
        3. Consider scaling additional cells.
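  # The commented block below is an illustrative sketch, not part of this indicator
  # document: one way the thresholds above could be expressed as a Prometheus
  # alerting rule, assuming the same metric and labels are scraped into a
  # Prometheus-compatible system. The group and alert names are hypothetical.
  #
  # groups:
  # - name: auctioneer
  #   rules:
  #   - alert: AuctioneerLRPPlacementFailuresCritical
  #     expr: rate(AuctioneerLRPAuctionsFailed{source_id="auctioneer"}[5m]) * 60 >= 1
  #     for: 5m
  #     labels:
  #       severity: critical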
  - name: auctioneer_states_duration
    promql: max_over_time(AuctioneerFetchStatesDuration{source_id="auctioneer"}[5m]) / 1000000000
    documentation:
      title: Auctioneer - Time to Fetch Cell State
      description: |
        Time in ns that the auctioneer took to fetch state from all the Diego cells when running its auction.
        Use: Indicates how the cells themselves are performing. Alerting on this metric helps warn that app staging requests to Diego may be failing.
        Origin: Firehose
        Type: Gauge, integer in ns
        Frequency: During event, during each auction
      recommended_alert_thresholds: |
        Yellow warning: ≥ 2s
        Red critical: ≥ 5s
      recommended_response: |
        1. Check the health of the cells by reviewing the logs and looking for errors.
        2. Review IaaS console metrics.
        3. Inspect the Auctioneer logs to determine if one or more cells is taking significantly longer to fetch state than other cells. Relevant log lines will have wording like `fetched cell state`.
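  # Illustrative sketch, not part of this indicator document: the red-critical
  # threshold above as a Prometheus alerting rule entry (it would sit in a rule
  # group like the sketch above), assuming the same metric is available. The
  # division converts nanoseconds to seconds, so 5 here means 5 seconds. The
  # alert name is hypothetical.
  #
  # - alert: AuctioneerFetchStatesDurationCritical
  #   expr: max_over_time(AuctioneerFetchStatesDuration{source_id="auctioneer"}[5m]) / 1000000000 >= 5
  #   for: 5m
  #   labels:
  #     severity: critical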
  - name: auctioneer_lrp_auctions_started
    promql: rate(AuctioneerLRPAuctionsStarted{source_id="auctioneer"}[5m]) * 60
    documentation:
      title: Auctioneer - App Instance Starts
      description: |
        The number of LRP instances that the auctioneer successfully placed on Diego cells. This metric is cumulative over the lifetime of the auctioneer job.
        Use: Provides a sense of running system activity levels in your environment. Can also give you a sense of how many app instances have been started over time. The provided measurement can help indicate a significant amount of container churn. However, for capacity planning purposes, it is more helpful to observe deltas over a long time window.
        This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no app instances being scheduled.
        Origin: Firehose
        Type: Counter (Integer)
        Frequency: During event, during each auction
      recommended_alert_thresholds: Dynamic
      recommended_response: |
        When observing a significant amount of container churn, do the following:
        1. Look to eliminate explainable causes of temporary churn, such as a deployment or increased developer activity.
        2. If container churn appears to continue over an extended period, inspect Diego Auctioneer and BBS logs.
        When observing extended periods of high or low activity trends, scale up or down CF components as needed.
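  # Illustrative sketch, not part of this indicator document: because the threshold
  # for this indicator is dynamic, a long-window delta (as suggested in the
  # description above) is often more useful for capacity planning than the
  # per-minute rate. The 24h window is an assumption; pick one that matches your
  # planning cadence.
  #
  #   sum(increase(AuctioneerLRPAuctionsStarted{source_id="auctioneer"}[24h]))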
  - name: auctioneer_task_auctions_failed
    promql: rate(AuctioneerTaskAuctionsFailed{source_id="auctioneer"}[5m]) * 60
    documentation:
      title: Auctioneer - Task Placement Failures
      description: |
        The number of Tasks that the auctioneer failed to place on Diego cells. This metric is cumulative over the lifetime of the auctioneer job.
        Use: Failing Task auctions indicate a lack of resources within your environment and that you likely need to scale. This indicator also increases when the Task is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work.
        This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no tasks being scheduled.
        This error is most commonly due to capacity issues, for example, if cells do not have enough resources, or if cells are going back and forth between a healthy and unhealthy state.
        Origin: Firehose
        Type: Counter (Float)
        Frequency: During event, during each auction
      recommended_alert_thresholds: |
        Yellow warning: ≥ 0.5
        Red critical: ≥ 1
      recommended_response: |
        1. To best determine the root cause, examine the Auctioneer logs. Depending on the specific error or resource constraint, you may also find a failure reason in the CC API.
        2. Investigate the health of Diego cells.
        3. Consider scaling additional cells.
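  # Illustrative sketch, not part of this indicator document: the warning and
  # critical thresholds above as Prometheus alerting rule entries (in a rule group
  # like the first sketch), assuming the same metric is available. Alert names are
  # hypothetical.
  #
  # - alert: AuctioneerTaskPlacementFailuresWarning
  #   expr: rate(AuctioneerTaskAuctionsFailed{source_id="auctioneer"}[5m]) * 60 >= 0.5
  #   for: 5m
  #   labels:
  #     severity: warning
  # - alert: AuctioneerTaskPlacementFailuresCritical
  #   expr: rate(AuctioneerTaskAuctionsFailed{source_id="auctioneer"}[5m]) * 60 >= 1
  #   for: 5m
  #   labels:
  #     severity: critical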
  - name: auctioneer_lock_held
    promql: max_over_time(LockHeld{source_id="auctioneer"}[5m])
    documentation:
      title: Auctioneer - Lock Held
      description: |
        Whether an Auctioneer instance holds the expected Auctioneer lock (in Locket). 1 means the active Auctioneer holds the lock, and 0 means the lock was lost.
        Use: This metric is complementary to Active Locks, and it offers an Auctioneer-level version of the Locket metrics. Although it is emitted per Auctioneer instance, only one active lock is held by Auctioneer at a time. Therefore, the expected value is 1. The metric may occasionally be 0 when the Auctioneer instances are performing a leader transition, but a prolonged value of 0 indicates an issue with Auctioneer.
        Origin: Firehose
        Type: Gauge
        Frequency: Periodically
      recommended_alert_thresholds: |
        Red critical: any value other than 1
      recommended_response: |
        1. Run `monit status` on the instance group that the Auctioneer job is running on to check for failing processes.
        2. If there are no failing processes, then review the logs for Auctioneer.
           - Recent logs for Auctioneer should show that all but one of its instances are currently waiting on locks, and the active Auctioneer should show a record of when it last attempted to execute work. This attempt should correspond to app development activity, such as `cf push`.
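  # Illustrative sketch, not part of this indicator document: the "any value other
  # than 1" condition above as a Prometheus alerting rule entry (in a rule group
  # like the first sketch), assuming the same metric is available. The alert name
  # is hypothetical.
  #
  # - alert: AuctioneerLockLost
  #   expr: max_over_time(LockHeld{source_id="auctioneer"}[5m]) != 1
  #   for: 5m
  #   labels:
  #     severity: critical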