-
Notifications
You must be signed in to change notification settings - Fork 214
/
Copy pathindicators.yml.erb
189 lines (157 loc) · 9.73 KB
/
indicators.yml.erb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
---
apiVersion: indicatorprotocol.io/v1
kind: IndicatorDocument
metadata:
labels:
deployment: <%= spec.deployment %>
component: bbs
spec:
product:
name: diego
version: latest
indicators:
- name: convergence_lrp_duration
promql: max_over_time(ConvergenceLRPDuration{source_id="bbs"}[15m]) / 1000000000
documentation:
title: BBS - Time to Run LRP Convergence
description: |
Time in ns that the BBS took to run its LRP convergence pass.
Use: If the convergence run begins taking too long, apps or Tasks may be crashing without restarting. This symptom can also indicate loss of connectivity to the BBS database.
Origin: Firehose
Type: Gauge (Integer in ns)
Frequency: 30 s
recommended_alert_thresholds: |
Yellow warning: ≥ 10 s
Red critical: ≥ 20 s
recommended_response: |
1. Check BBS logs for errors.
2. Try vertically scaling the BBS VM resources up. For example, add more CPUs or memory depending on its system.cpu/system.memory metrics.
3. Consider vertically scaling the backing database, if system.cpu and system.memory metrics for the database instances are high.
- name: request_latency
promql: avg_over_time(RequestLatency{source_id="bbs"}[15m]) / 1000000000
documentation:
title: BBS - Time to Handle Requests
description: |
The maximum observed latency time over the past 60 seconds that the BBS took to handle requests across all its API endpoints.
Diego is now aggregating this metric to emit the max value observed over 60 seconds.
Use: If this metric rises, the BBS API is slowing. Response to certain operations is slow if request latency is high.
Origin: Firehose
Type: Gauge (Integer in ns)
Frequency: 60 s
recommended_alert_thresholds: |
Yellow warning: ≥ 5s
Red critical: ≥ 10s
recommended_response: |
1. Check CPU and memory statistics.
2. Check BBS logs for faults and errors that can indicate issues with BBS.
3. Try scaling the BBS VM resources up. For example, add more CPUs/memory depending on its system.cpu/system.memory metrics.
4. Consider vertically scaling the backing database, if system.cpu and system.memory metrics for the database instances are high.
- name: lrps_extra
promql: avg_over_time(LRPsExtra{source_id="bbs"}[5m])
documentation:
title: BBS - More App Instances Than Expected
description: |
Total number of LRP instances that are no longer desired but still have a BBS record. When Diego wants to add more apps, the BBS sends a request to the auctioneer to spin up additional LRPs. LRPsExtra is the total number of LRP instances that are no longer desired but still have a BBS record.
Use: If Diego has more LRPs running than expected, there may be problems with the BBS.
Deleting an app with many instances can temporarily spike this metric. However, a sustained spike in bbs.LRPsExtra is unusual and should be investigated.
Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
recommended_alert_thresholds: |
Yellow warning: ≥ 5
Red critical: ≥ 10
recommended_response: |
1. Review the BBS logs for proper operation or errors, looking for detailed error messages.
2. Check the Domain freshness.
- name: lrps_missing
promql: avg_over_time(LRPsMissing{source_id="bbs"}[5m])
documentation:
title: BBS - Fewer App Instances Than Expected
description: |
Total number of LRP instances that are desired but have no record in the BBS. When Diego wants to add more apps, the BBS sends a request to the auctioneer to spin up additional LRPs. LRPsMissing is the total number of LRP instances that are desired but have no BBS record.
Use: If Diego has less LRP running than expected, there may be problems with the BBS.
An app push with many instances can temporarily spike this metric. However, a sustained spike in bbs.LRPsMissing is unusual and should be investigated.
Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
recommended_alert_thresholds: |
Yellow warning: ≥ 5
Red critical: ≥ 10
recommended_response: |
1. Review the BBS logs for proper operation or errors, looking for detailed error messages.
2. Check the Domain freshness.
- name: crashed_actual_lrps
promql: avg_over_time(CrashedActualLRPs{source_id="bbs"}[5m])
documentation:
title: BBS - Crashed App Instances
description: |
Total number of LRP instances that have crashed.
Use: Indicates how many instances in the deployment are in a crashed state. An increase in bbs.CrashedActualLRPs can indicate several problems, from a bad app with many instances associated, to a platform issue that is resulting in app crashes. Use this metric to help create a baseline for your deployment. After you have a baseline, you can create a deployment-specific alert to notify of a spike in crashes above the trend line. Tune alert values to your deployment.
Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
recommended_alert_thresholds: Dynamic
recommended_response: |
1. Look at the BBS logs for apps that are crashing and at the cell logs to see if the problem is with the apps themselves, rather than a platform issue.
1. Alerting thresholds for this metric will vary depending on your deployment’s size and utilization. Monitoring crashed LRPs and investigating their cause (a troublesome application or a problem with the deployment itself) can help determine what levels to set for alerts.
- name: lrps_running
promql: avg_over_time(LRPsRunning{source_id="bbs"}[1h]) - avg_over_time(LRPsRunning{source_id="bbs"}[1h] offset 1h)
documentation:
title: BBS - Running App Instances, Rate of Change
description: |
Rate of change in app instances being started or stopped on the platform. It is derived from bbs.LRPsRunning and represents the total number of LRP instances that are running on Diego cells.
Use: Delta reflects upward or downward trend for app instances started or stopped. Helps to provide a picture of the overall growth trend of the environment for capacity planning. You may want to alert on delta values outside of the expected range.
Origin: Firehose
Type: Gauge (Float)
Frequency: During event, emission should be constant on a running deployment.
recommended_alert_thresholds: Dynamic
recommended_response: |
1. Scale components as necessary.
- name: bbs_lock_held
promql: max_over_time(LockHeld{source_id="bbs"}[5m])
documentation:
title: BBS - Lock Held
description: |
Whether a BBS instance holds the expected BBS lock (in Locket). 1 means the active BBS server holds the lock, and 0 means the lock was lost.
Use: This metric is complimentary to Active Locks, and it offers a BBS-level version of the Locket metrics. Although it is emitted per BBS instance, only 1 active lock is held by BBS. Therefore, the expected value is 1. The metric may occasionally be 0 when the BBS instances are performing a leader transition, but a prolonged value of 0 indicates an issue with BBS.
Origin: Firehose
Type: Gauge
Frequency: Periodically
recommended_alert_thresholds: |
Red critical: any value other than 1
recommended_response: |
1. Run monit status on the instance group that the BBS job is running on to check for failing processes.
2. If there are no failing processes, then review the logs for BBS.
- A healthy BBS shows obvious activity around starting or claiming LRPs.
- An unhealthy BBS leads to the Auctioneer showing minimal or no activity. The BBS sends work to the Auctioneer.
- name: domain_cf_apps
promql: max_over_time(Domain_cf_apps{source_id="bbs"}[5m])
documentation:
title: BBS - Cloud Controller and Diego in Sync
description: |
Indicates if the cf-apps Domain is up-to-date, meaning that CF App requests from Cloud Controller are synchronized to bbs.LRPsDesired (Diego-desired AIs) for execution.
- 1 means cf-apps Domain is up-to-date
- No data received means cf-apps Domain is not up-to-date
Use: If the cf-apps Domain does not stay up-to-date, changes requested in the Cloud Controller are not guaranteed to propagate throughout the system. If the Cloud Controller and Diego are out of sync, then apps running could vary from those desired.
Origin: Firehose
Type: Gauge (Float)
Frequency: 30 s
recommended_response: |
1. Check the BBS and Clock Global (Cloud Controller clock) logs.
- name: bbs_master_elected
promql: BBSMasterElected{source_id="bbs"}
presentation:
chartType: bar
documentation:
title: BBS - Master Elected
description: |
Indicates when there is a BBS master election, meaning that a BBS instance has taken over as the active instance. A value of 1 is emitted when the election takes place.
Use: This metric will be emitted when a redeployment of the BBS occurs. If this metric is emitted frequently outside of a deployment, this may be a signal of underlying problems that should be investigated. If the active BBS is continually changing, this can cause application push downtime.
Origin: Firehose
Type: Gauge (Float)
Frequency: On event
recommended_alert_thresholds: Dynamic
recommended_response: |
1. Check the BBS logs.
1. Check the BBS VM load average and instance group size.
1. Check the health of the connection to the backing SQL database and network latency.