---
title: Monitoring and KPIs for VMware RabbitMQ for Tanzu Application Service
owner: London Services
---
This topic explains how to monitor the health of the <%= vars.product_full %>
service using the logs, metrics, and Key Performance Indicators (KPIs) generated by
<%= vars.product_short %> component VMs.
<p class="note">
<strong>Note:</strong> As of <%= vars.product_short %> v2.0, the <code>rabbitmq_prometheus</code>
plug-in provides RabbitMQ Server metrics.
Consequently, many metric names change after upgrading to <%= vars.product_short %> v2.0.
Both on-demand and pre-provisioned <%= vars.product_short %> are affected.
For a list of the changes made in v2.0 to metric names, see
<a href="./migrate-2-0-metrics.html">Migrating Metrics from <%= vars.product_short %> v1.x to v2.0</a>.
</p>
## <a id="metrics"></a> Metrics
Metrics are regularly collected log entries that report measured component states.
You can consume metrics either through the Loggregator subsystem or by configuring a Prometheus server
or the Healthwatch tile.
The Loggregator subsystem collects metrics automatically based on the metrics polling interval.
Prometheus servers and the Healthwatch tile directly scrape the VMs deployed by <%= vars.product_short %>.
The RabbitMQ servers expose the same information in each case.
For a full list of all metrics exposed in pre-provisioned and on-demand service
instances of <%= vars.product_short %>, see the [Component Metrics Reference](#reference)
later in this topic.
<p class="note">
<strong>Note:</strong> As of <%= vars.product_short %> v2.0, the format of the metrics has changed.
For a list of the changes to metric names in <%= vars.product_short %> v2.0, see
<a href="./migrate-2-0-metrics.html">Migrating Metrics from <%= vars.product_short %> v1.x to v2.0</a>.
</p>
### <a id="loggregator"></a> Collecting Metrics with the Loggregator System
Loggregator-collected metrics are long, single lines of text that follow the format:
```
origin:"p-rabbitmq" eventType:ValueMetric timestamp:1616427704616569016 deployment:"cf-rabbitmq" job:"rabbitmq-broker" index:"0" ip:"10.0.4.101" tags:<key:"instance_id" value:"d4b4fd51-50de-4227-a96f-8ce636960f0b" > tags:<key:"source_id" value:"rabbitmq-broker" > valueMetric:<name:"_p_rabbitmq_service_broker_heartbeat" value:1 unit:"boolean" >
```
If the `rabbitmq_prometheus` plug-in is enabled, <%= vars.product_short %> automatically
collects these metrics and forwards them to the Loggregator system.
For general information about logging and metrics in <%= vars.app_runtime_full %>
and how to consume the metrics from the Loggregator system, see
[Overview of Logging and Metrics](https://docs.pivotal.io/application-service/loggregator/data-sources.html).
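As a rough illustration of the line format shown above, the following Python sketch extracts the metric name, value, unit, and tags from one Loggregator line. The regular expressions and the `parse_value_metric` helper are assumptions made for this example, based only on the sample line in this topic. They are not part of any Loggregator client library.

```python
import re

# Sample line copied from the format shown earlier in this section.
SAMPLE_LINE = (
    'origin:"p-rabbitmq" eventType:ValueMetric timestamp:1616427704616569016 '
    'deployment:"cf-rabbitmq" job:"rabbitmq-broker" index:"0" ip:"10.0.4.101" '
    'tags:<key:"instance_id" value:"d4b4fd51-50de-4227-a96f-8ce636960f0b" > '
    'tags:<key:"source_id" value:"rabbitmq-broker" > '
    'valueMetric:<name:"_p_rabbitmq_service_broker_heartbeat" value:1 unit:"boolean" >'
)

# Matches the valueMetric:<name:"..." value:N unit:"..." > group.
VALUE_METRIC_RE = re.compile(
    r'valueMetric:<name:"(?P<name>[^"]+)" value:(?P<value>[-+0-9.eE]+) unit:"(?P<unit>[^"]*)"'
)
# Matches each tags:<key:"..." value:"..." > group.
TAG_RE = re.compile(r'tags:<key:"(?P<key>[^"]+)" value:"(?P<value>[^"]*)"')
# Matches the quoted top-level fields, such as job:"rabbitmq-broker".
FIELD_RE = re.compile(r'\b(origin|deployment|job|index|ip):"([^"]*)"')


def parse_value_metric(line):
    """Return a dict describing one ValueMetric line, or None if the line does not match."""
    metric = VALUE_METRIC_RE.search(line)
    if metric is None:
        return None
    return {
        "fields": dict(FIELD_RE.findall(line)),
        "tags": {m.group("key"): m.group("value") for m in TAG_RE.finditer(line)},
        "name": metric.group("name"),
        "value": float(metric.group("value")),
        "unit": metric.group("unit"),
    }


parsed = parse_value_metric(SAMPLE_LINE)
print(parsed["name"], parsed["value"], parsed["unit"])
```

The sketch treats every value as a float, so a Boolean heartbeat of `1` is returned as `1.0`.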
#### <a id="metrics-polling-interval"></a> Configure the Metrics Polling Interval
The default metrics polling interval for Loggregator is 30 seconds.
The **metrics polling interval** is a configuration option on the <%= vars.product_short %> tile
(**Settings** > **Metrics**). Setting this interval to -1 deactivates metrics.
The interval setting applies to all components deployed by the tile.
To configure the metrics polling interval:
1. From the <%= vars.ops_manager %> Installation Dashboard, click the <%= vars.product_short %> tile.
1. In the <%= vars.product_short %> tile, click the **Settings** tab.
1. Click **Metrics**.
![Screenshot of the RabbitMQ tile with header
'Metrics settings for both Pre-Provisioned and On-Demand service offerings'.
The fields shown are described in the table in the step.](images/metrics-configuration.png)
1. Configure the fields on the **Metrics** pane as follows:
<table class="nice">
<tr>
<th>Field</th>
<th>Description</th>
</tr>
<tr>
<td><strong>Metrics polling interval</strong></td>
<td>
The default setting is 30 seconds for all deployed components.
VMware recommends that you do not change this interval.
To avoid overwhelming components, do not set this below 10 seconds.
Set this to -1 to deactivate metrics.
Changing this setting affects all deployed instances.
</td>
</tr>
</table>
1. Click **Save**.
1. Return to the <%= vars.ops_manager %> Installation Dashboard.
1. Click **Review Pending Changes**.
For more information about this <%= vars.ops_manager %> page,
see [Reviewing Pending Product Changes](https://docs.pivotal.io/ops-manager/install/review-pending-changes.html).
1. Click **Apply Changes** to redeploy with the changes.
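Before entering a value in the tile, it can help to check it against the guidance in the table above. The following Python sketch is one way to do that. The function name and the returned labels are hypothetical, and treating `0` and other negative values as invalid is an assumption of this example, because this topic does not describe how the tile validates the field.

```python
def describe_polling_interval(seconds):
    """Classify a metrics polling interval using the guidance in this topic."""
    if seconds == -1:
        return "deactivated"  # -1 deactivates metrics
    if seconds <= 0:
        return "invalid"  # assumption: no other non-positive value is meaningful
    if seconds < 10:
        return "below-recommended-minimum"  # may overwhelm components
    if seconds == 30:
        return "default"
    return "custom"


for candidate in (30, -1, 5, 60):
    print(candidate, describe_polling_interval(candidate))
```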
#### <a id="detailed-metrics"></a> Gathering Additional Metrics
As of <%= vars.product_short %> v2.0.11, in addition to the standard RabbitMQ server
metrics gathered by <%= vars.product_short %>, you can gather additional, detailed metrics for your system.
For more information about the additional metrics, see
[rabbitmq-server](https://github.com/rabbitmq/rabbitmq-server/tree/master/deps/rabbitmq_prometheus#selective-querying-of-per-object-metrics) in GitHub.
To limit the performance impact of gathering more data, you can gather
additional metrics only for specific vhosts, or generate only a subset of these metrics.
The process to configure additional metrics differs for the different service offerings:
- **For the on-demand offering:** You configure additional metrics when creating
or updating a service instance.
For more information, see [Collect Additional RabbitMQ Metrics in Loggregator (on-demand instances)](use.html#detailed-metrics).
- **For the pre-provisioned offering:** You configure additional metrics in <%= vars.ops_manager %>.
For more information, see [Collect Additional RabbitMQ Metrics in Loggregator (pre-provisioned instances)](install-config-pp.html#detailed-metrics).
### <a id="prometheus"></a> Collecting Metrics with Prometheus
Prometheus-style metrics are available at `SERVICE-INSTANCE-ID:15692/metrics`.
To pull these metrics from the service instances, you must deploy and configure a Prometheus instance.
For more information about the plug-in and about monitoring RabbitMQ using Prometheus and Grafana, see the
[RabbitMQ documentation](https://www.rabbitmq.com/prometheus.html).
The following Prometheus scrape config dynamically discovers RabbitMQ instances:
```
job_name: rabbitmq
metrics_path: "/metrics"
scheme: http
dns_sd_configs:
- names:
  - q-s4.rabbitmq-server.*.*.bosh.
  type: A
  port: 15692
```
<p class="note">
<strong>Note:</strong> If you are using TLS in the on-demand service offering,
your port will be <code>15691</code>.
</p>
The wildcards in the DNS name in the scrape config ensure that Prometheus also discovers
service instances that are created in the future.
If Prometheus is deployed with the Healthwatch v2 tile, then the preceding configuration is automatically applied.
<p class="note">
<strong>Note:</strong> By default, metrics are aggregated.
This results in a lower performance overhead at the cost of lower data fidelity.
For more information, see the
<a href="https://www.rabbitmq.com/prometheus.html#metric-aggregation">RabbitMQ documentation</a>.
</p>
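To inspect the endpoint without a full Prometheus deployment, you can fetch and read the text output directly. The following Python sketch is a minimal illustration: `fetch_metrics` and `parse_prometheus_text` are hypothetical helper names, the host is a placeholder you must supply, and the sample values are invented for the example. The parser assumes that each sample line ends with its value and carries no trailing timestamp.

```python
import urllib.request


def fetch_metrics(host, port=15692, path="/metrics"):
    """Fetch the raw metrics text from one RabbitMQ node over HTTP."""
    url = "http://{}:{}{}".format(host, port, path)
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def parse_prometheus_text(text):
    """Map each sample (name plus any labels) to its float value."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumption: no timestamp, so the value follows the last space.
        key, _, value = line.rpartition(" ")
        samples[key] = float(value)
    return samples


# Illustrative text only; the values are made up.
SAMPLE_TEXT = """\
# TYPE rabbitmq_process_open_fds gauge
rabbitmq_process_open_fds 120
# TYPE erlang_vm_process_count gauge
erlang_vm_process_count 512
"""

print(parse_prometheus_text(SAMPLE_TEXT))
```

For anything beyond a quick check, a maintained Prometheus client or parser library is a safer choice than this hand-written parser.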
#### <a id="per-object"></a> Scrape Per-Object Metrics
To collect metrics on a per-object scope, such as per-queue, do one of the following:
- Enable per-object metrics by setting `prometheus.return_per_object_metrics = true`.
For instructions, see [Expert Mode: Overriding RabbitMQ Server Configuration](./expert-override-config.html).
- Scrape the dedicated per-object metrics endpoint, for example:
```
job_name: rabbitmq
metrics_path: "/metrics/per-object"
scheme: http
dns_sd_configs:
- names:
  - q-s4.rabbitmq-server.*.*.bosh.
  type: A
  port: 15692
```
<p class="note">
<strong>Note:</strong> Collecting per-object metrics on a system with many objects,
such as queues or connections, is very slow.
Ensure you understand the impact on your system and its load before enabling
this on a production cluster.
</p>
#### <a id="filter-per-object"></a> Filter the Per-Object Metrics
As of <%= vars.product_short %> v2.0.7, you can limit per-object collection to
specific groups of metrics.
This decreases the performance overhead, while retaining data fidelity for metrics that you are interested in.
For more information, see [Selective querying of per-object metrics](https://github.com/rabbitmq/rabbitmq-server/tree/master/deps/rabbitmq_prometheus#selective-querying-of-per-object-metrics).
For example, the following scrape config collects only the per-object metrics that allow you to see how
many messages sit in every queue and how many consumers each of these queues has:
```
job_name: rabbitmq
metrics_path: "/metrics/detailed?family=queue_coarse_metrics&family=queue_consumer_count"
scheme: http
dns_sd_configs:
- names:
  - q-s4.rabbitmq-server.*.*.bosh.
  type: A
  port: 15692
```
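The path in the preceding example repeats the `family` query parameter once per metric group. If you generate scrape configs programmatically, a small helper can build that path. The following Python sketch uses a hypothetical function name, `detailed_metrics_path`, and the two family names from the example above.

```python
from urllib.parse import urlencode


def detailed_metrics_path(families):
    """Build a /metrics/detailed path with one family parameter per metric group."""
    query = urlencode([("family", family) for family in families])
    return "/metrics/detailed?" + query


print(detailed_metrics_path(["queue_coarse_metrics", "queue_consumer_count"]))
```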
### <a id="Grafana"></a> Grafana Dashboards
The RabbitMQ team has written dashboards that you can import into Grafana.
These dashboards include documentation for each metric.
* **[RabbitMQ-Overview](https://grafana.com/grafana/dashboards/10991):**
Dashboard for an overview of the RabbitMQ system
* **[Erlang-Distribution](https://grafana.com/grafana/dashboards/11352):**
Dashboard for the underlying Erlang distribution
For more information about these dashboards, see the
[RabbitMQ documentation](https://www.rabbitmq.com/prometheus.html).
If Grafana is deployed using the Healthwatch v2 tile, you can load these dashboards by selecting the
**Enable RabbitMQ dashboards** checkbox in the Healthwatch tile.
### <a id="heartbeats"></a> Component Heartbeats
Some components periodically emit Boolean heartbeat metrics to the Loggregator system.
A value of <code>1</code> means the component is available. A value of <code>0</code>, or the absence of a
heartbeat metric, means the component is not responding and you must investigate the issue.
#### <a id="broker-heartbeat"></a> Service Broker Heartbeat
<table>
<tr><th colspan="2" style="text-align: center;"><br>_p_rabbitmq_service_broker_heartbeat<br><br></th></tr>
<tr>
<th width="25%">Description</th>
<td>
RabbitMQ service broker <code>is alive</code> poll that indicates if the component is
available and can respond to requests.
<br><br>
<strong>Use</strong>: If the service broker does not emit heartbeats, this indicates that it
is offline.
The service broker is required to create, update, and delete service instances, which are
critical for dependent tiles such as Spring Cloud Services and Spring Cloud Data Flow.
<br><br>
<strong>Origin</strong>: Doppler/Firehose<br>
<strong>Type</strong>: Boolean<br>
<strong>Frequency</strong>: 30 seconds (default), 10 seconds (configurable minimum)
</td>
</tr>
<tr>
<th>Recommended measurement</th>
<td>Average over last 5 minutes</td>
</tr>
<tr>
<th>Recommended alert thresholds</th>
<td>
<strong>Yellow warning</strong>: N/A<br>
<strong>Red critical</strong>: < 1
</td>
</tr>
<tr>
<th>Recommended response</th>
<td>
Search the RabbitMQ service broker logs for errors.
You can find this VM by targeting your <%= vars.product_short %> deployment
with BOSH, and running one of these commands:
<ul>
<li><strong>For on-demand:</strong> <pre class="terminal">bosh -d service-instance_GUID vms</pre></li>
<li><strong>For pre-provisioned:</strong> <pre class="terminal">bosh -d p-rabbitmq-GUID vms</pre></li>
</ul>
</td>
</tr>
</table>
#### <a id="haproxy-heartbeat"></a> HAProxy Heartbeat
<p class="note">
<strong>Note:</strong> HAProxy is used only in the pre-provisioned service
offering, so HAProxy heartbeats are present only if this service offering is enabled.
</p>
<table>
<tr><th colspan="2" style="text-align: center;"><br> _p_rabbitmq_haproxy_heartbeat<br><br></th></tr>
<tr>
<th width="25%">Description</th>
<td>
RabbitMQ HAProxy <code>is alive</code> poll, which indicates if the
component is available and can respond to requests.
<br><br>
<strong>Use</strong>: If HAProxy does not emit heartbeats, this indicates
that it is offline. To be functional, pre-provisioned service instances require HAProxy.
<br><br>
<strong>Origin</strong>: Doppler/Firehose<br>
<strong>Type</strong>: Boolean<br>
<strong>Frequency</strong>: 30 seconds (default), 10 seconds (configurable minimum)
</td>
</tr>
<tr>
<th>Recommended measurement</th>
<td>Average over last 5 minutes</td>
</tr>
<tr>
<th>Recommended alert thresholds</th>
<td>
<strong>Yellow warning</strong>: N/A<br>
<strong>Red critical</strong>: < 1
</td>
</tr>
<tr>
<th>Recommended response</th>
<td>
Search the RabbitMQ HAProxy logs for errors.
You can find the VM by targeting your <%= vars.product_short %> deployment
with BOSH and running the following command, which lists <code>HAProxy_GUID</code>:
<pre class="terminal">bosh -d p-rabbitmq-GUID vms</pre>
</td>
</tr>
</table>
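The heartbeat alert rule in the preceding tables (average over the last 5 minutes, red when the average is below 1 or when no heartbeat arrives) can be sketched in Python as follows. The `heartbeat_status` function and its return labels are hypothetical. The sketch assumes that the caller has already collected the heartbeat values received in the last 5 minutes.

```python
def heartbeat_status(samples):
    """Return "red" or "ok" for the heartbeat values seen in the last 5 minutes."""
    if not samples:
        return "red"  # no heartbeat at all means the component is not responding
    average = sum(samples) / len(samples)
    return "red" if average < 1 else "ok"


print(heartbeat_status([1, 1, 1, 1]))
print(heartbeat_status([1, 0, 1, 1]))
print(heartbeat_status([]))
```

Because the red threshold is any average below 1, a single missed or zero heartbeat in the window is enough to raise the alert.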
### <a id="kpi"></a> Key Performance Indicators
The following sections describe the metrics used as Key Performance Indicators (KPIs)
and other useful metrics for monitoring the <%= vars.product_short %> service.
KPIs for <%= vars.product_short %> are metrics that operators find most
useful for monitoring their <%= vars.product_short %> service to ensure smooth operation.
KPIs are high-signal-value metrics that can indicate emerging issues.
KPIs can be raw component metrics or derived metrics generated by applying formulas to raw metrics.
VMware provides the following KPIs as general alerting and response guidance for typical
<%= vars.product_short %> installations.
VMware recommends the following to operators:
- Continue to fine-tune the alert measures to your installation by observing historical trends.
- Expand beyond the guidance and create new, installation-specific monitoring metrics,
thresholds, and alerts based on learning from your own installation.
For a list of all <%= vars.product_short %> raw component metrics, see
[Component Metrics Reference](#reference) later in this topic.
#### <a id="kpi-heartbeat"></a> Component Heartbeats
If you collect metrics using Loggregator, several components in <%= vars.product_short %> emit heartbeat
metrics. For more information, see [Component Heartbeats](#heartbeats) earlier in this topic.
#### <a id="file-descriptors"></a> RabbitMQ Server File Descriptors
<table>
<tr><th colspan="2" style="text-align: center;"><br> rabbitmq_process_open_fds<br><br></th></tr>
<tr>
<th width="25%">Description</th>
<td>
The number of file descriptors consumed.
<br><br>
<strong>Use</strong>: If the number of file descriptors consumed becomes too large,
the VM might lose the ability to perform disk I/O, which can cause data loss.
<p class="note">
<strong>Note:</strong> Producers handle nonpersistent messages through retries or some other
logic.
</p>
<strong>Origin</strong>: Doppler/Firehose<br>
<strong>Type</strong>: Count<br>
<strong>Frequency</strong>: 30 seconds (default), 10 seconds (configurable minimum)
</td>
</tr>
<tr>
<th>Recommended measurement</th>
<td>Average over last 10 minutes</td>
</tr>
<tr>
<th>Recommended alert thresholds</th>
<td><strong>Yellow warning</strong>: > 250000 <br>
<strong>Red critical</strong>: > 280000</td>
</tr>
<tr>
<th>Recommended response</th>
<td>
The default <code>ulimit</code> for <%= vars.product_short %> is 300,000.
If this metric meets or exceeds the recommended thresholds for extended
periods of time, consider reducing the load on the server.
</td>
</tr>
</table>
#### <a id="erlang-processes"></a> Erlang Processes
<table>
<tr><th colspan="2" style="text-align: center;"><br> erlang_vm_process_count<br><br></th></tr>
<tr>
<th width="25%">Description</th>
<td>
The number of Erlang processes that RabbitMQ consumes. RabbitMQ runs on an Erlang VM.
For more information, see the <a href="https://www.erlang.org/docs">Erlang Documentation</a>.
<br><br>
<strong>Use</strong>: This is the key indicator of the processing capability of a node.
<br><br>
<strong>Origin</strong>: Doppler/Firehose<br>
<strong>Type</strong>: Count<br>
<strong>Frequency</strong>: 30 seconds (default), 10 seconds (configurable minimum)
</td>
</tr>
<tr>
<th>Recommended measurement</th>
<td>Average over last 10 minutes</td>
</tr>
<tr>
<th>Recommended alert thresholds</th>
<td>
<strong>Yellow warning</strong>: > 900000<br>
<strong>Red critical</strong>: > 950000
</td>
</tr>
<tr>
<th>Recommended response</th>
<td>
The default Erlang process limit in <%= vars.product_short %> v1.6 and later is 1,048,816.
If this metric meets or exceeds the recommended thresholds for extended
periods of time, consider scaling the RabbitMQ nodes in the tile <strong>Resource Config</strong> pane.
</td>
</tr>
</table>
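As a sketch of how the recommended thresholds for these two KPIs could be applied, the following Python example averages a set of samples and compares the result with the yellow and red values from the preceding tables. The `kpi_status` function and the status labels are hypothetical. The sketch assumes that the caller passes the samples collected over the last 10 minutes.

```python
from statistics import mean

# (yellow, red) thresholds taken from the tables in this section.
KPI_THRESHOLDS = {
    "rabbitmq_process_open_fds": (250000, 280000),
    "erlang_vm_process_count": (900000, 950000),
}


def kpi_status(metric_name, samples):
    """Classify the average of the samples as "ok", "yellow", or "red"."""
    yellow, red = KPI_THRESHOLDS[metric_name]
    average = mean(samples)  # raises StatisticsError if samples is empty
    if average > red:
        return "red"
    if average > yellow:
        return "yellow"
    return "ok"


print(kpi_status("rabbitmq_process_open_fds", [240000, 270000, 275000]))
```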
### <a id="bosh"></a> BOSH System Health Metrics
<%# The below partial is in https://github.com/pivotal-cf/docs-partials %>
<%= partial vars.path_to_partials + '/services/bosh_health_metrics_pcf2' %>
All BOSH-deployed components generate the system health metrics listed in this section.
These component metrics are from <%= vars.product_short %> components, and serve as KPIs for
the <%= vars.product_short %> service.
#### <a id="ram"></a> RAM
<table>
<tr><th colspan="2" style="text-align: center;"><br> system_mem_percent <br><br></th></tr>
<tr>
<th width="25%">Description</th>
<td>
RAM being consumed by the <code>p.rabbitmq</code> VM.
<br><br>
<strong>Use</strong>: RabbitMQ is considered to be in a good state when it has few or no messages.
In other words, "an empty rabbit is a happy rabbit."
An alert on this metric can indicate that there are too few consumers or apps that
read messages from the queue.
<br><br>
Healthmonitor reports when RabbitMQ has used more than 40% of its RAM for the past ten minutes.
<br><br>
<strong>Origin</strong>: BOSH HM<br>
<strong>Type</strong>: Percent<br>
<strong>Frequency</strong>: 30 seconds (default), 10 seconds (configurable minimum)
</td>
</tr>
<tr>
<th>Recommended measurement</th>
<td>Average over last 10 minutes</td>
</tr>
<tr>
<th>Recommended alert thresholds</th>
<td>
<strong>Yellow warning</strong>: > 40 <br>
<strong>Red critical</strong>: > 50
</td>
</tr>
<tr>
<th>Recommended response</th>
<td>Add more consumers to drain the queue as fast as possible.</td>
</tr>
</table>
#### <a id="cpu"></a> CPU
<table>
<tr><th colspan="2" style="text-align: center;"><br> system_cpu_user<br><br></th></tr>
<tr>
<th width="25%">Description</th>
<td>CPU being consumed by user processes on the <code>p.rabbitmq</code> VM.<br><br>
<strong>Use</strong>: A node that experiences context switching or high CPU usage becomes unresponsive.
This also affects the ability of the node to report metrics.
<br><br>
Healthmonitor reports when RabbitMQ has used more than 40% of its CPU for the past ten minutes.
<br><br>
<strong>Origin</strong>: BOSH HM<br>
<strong>Type</strong>: Percent<br>
<strong>Frequency</strong>: 30 seconds (default), 10 seconds (configurable minimum)
</td>
</tr>
<tr>
<th>Recommended measurement</th>
<td>Average over last 10 minutes</td>
</tr>
<tr>
<th>Recommended alert thresholds</th>
<td>
<strong>Yellow warning</strong>: > 60 <br>
<strong>Red critical</strong>: > 75
</td>
</tr>
<tr>
<th>Recommended response</th>
<td>
Remember that "an empty rabbit is a happy rabbit". Add more consumers to drain the queue as fast as possible.
</td>
</tr>
</table>
#### <a id="ephemeral-disk"></a> Ephemeral Disk
<table>
<tr><th colspan="2" style="text-align: center;"><br> system_disk_ephemeral_percent<br><br></th></tr>
<tr>
<th width="25%">Description</th>
<td>
Ephemeral Disk being consumed by the <code>p.rabbitmq</code> VM.
<br><br>
<strong>Use</strong>: If the ephemeral disk fills up, this can indicate that there are too few consumers.
<br><br>
Healthmonitor reports when RabbitMQ has used more than 50% of its Ephemeral Disk for the past ten minutes.
<br><br>
<strong>Origin</strong>: BOSH HM<br>
<strong>Type</strong>: Percent<br>
<strong>Frequency</strong>: 30 seconds (default), 10 seconds (configurable minimum)
</td>
</tr>
<tr>
<th>Recommended measurement</th>
<td>Average over last 10 minutes</td>
</tr>
<tr>
<th>Recommended alert thresholds</th>
<td>
<strong>Yellow warning</strong>: > 50 <br>
<strong>Red critical</strong>: > 75
</td>
</tr>
<tr>
<th>Recommended response</th>
<td>
Remember that "an empty rabbit is a happy rabbit". Add more consumers to drain the queue as
fast as possible. Insufficient disk space leads to node failures and might result in data
loss due to all disk writes failing.
</td>
</tr>
</table>
#### <a id="persistent-disk"></a> Persistent Disk
<table>
<tr><th colspan="2" style="text-align: center;"><br> system_disk_persistent_percent<br><br></th></tr>
<tr>
<th width="25%">Description</th>
<td>
Persistent Disk being consumed by the <code>p.rabbitmq</code> VM.<br><br>
<strong>Use</strong>: If the persistent disk fills up, this can indicate that there are too few consumers.
<br><br>
Healthmonitor reports when RabbitMQ uses more than 50% of its Persistent Disk.
<br><br>
<strong>Origin</strong>: BOSH HM<br>
<strong>Type</strong>: Percent<br>
<strong>Frequency</strong>: 30 seconds (default), 10 seconds (configurable minimum)
</td>
</tr>
<tr>
<th>Recommended measurement</th>
<td>Average over last 10 minutes</td>
</tr>
<tr>
<th>Recommended alert thresholds</th>
<td>
<strong>Yellow warning</strong>: > 50 <br>
<strong>Red critical</strong>: > 75
</td>
</tr>
<tr>
<th>Recommended response</th>
<td>
Remember that "an empty rabbit is a happy rabbit". Add more consumers to drain the queue as fast as possible. Insufficient disk space leads to node failures and might result in data loss due to all disk writes failing.
</td>
</tr>
</table>
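The recommended measurement for each BOSH system health metric is an average over the last 10 minutes. The following Python sketch shows one way to keep such a rolling average and compare it with the thresholds from the preceding tables. The `RollingAverage` class and `bosh_status` function are hypothetical names. The sketch assumes that samples arrive in timestamp order, with timestamps in seconds.

```python
from collections import deque

# (yellow, red) percent thresholds taken from the tables in this section.
BOSH_THRESHOLDS = {
    "system_mem_percent": (40, 50),
    "system_cpu_user": (60, 75),
    "system_disk_ephemeral_percent": (50, 75),
    "system_disk_persistent_percent": (50, 75),
}


class RollingAverage:
    """Average of the samples received within the last window_seconds."""

    def __init__(self, window_seconds=600):
        self.window_seconds = window_seconds
        self._samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self._samples.append((timestamp, value))
        cutoff = timestamp - self.window_seconds
        # Drop samples that are window_seconds old or older.
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()

    def average(self):
        if not self._samples:
            return None
        return sum(value for _, value in self._samples) / len(self._samples)


def bosh_status(metric_name, average):
    """Classify a 10-minute average as "ok", "yellow", or "red"."""
    yellow, red = BOSH_THRESHOLDS[metric_name]
    if average > red:
        return "red"
    if average > yellow:
        return "yellow"
    return "ok"


window = RollingAverage()
for timestamp, value in [(0, 100.0), (600, 40.0), (630, 60.0)]:
    window.add(timestamp, value)
print(window.average(), bosh_status("system_mem_percent", window.average()))
```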
## <a id="logging"></a> Logging
You can configure <%= vars.product_short %> to forward logs to an external syslog server and customize the format of
the log output.
### <a id="syslog-forwarding"></a> Configure Syslog Forwarding
Syslog forwarding is preconfigured and enabled by default.
VMware recommends that you keep the default setting because it is good operational practice.
However, you can opt out by selecting **No** for **Do you want to configure syslog?** in the
<%= vars.ops_manager %> **Settings** tab.
To enable monitoring for <%= vars.product_short %>, operators designate an external syslog endpoint
for <%= vars.product_short %> component log entries.
The endpoint serves as the input to a monitoring platform such as Datadog, Papertrail, or SumoLogic.
To specify the destination for <%= vars.product_short %> log entries:
1. From the <%= vars.ops_manager %> Installation Dashboard, click the <%= vars.product_short %> tile.
1. In the <%= vars.product_short %> tile, click the **Settings** tab.
1. Click **Syslog**.
![Screenshot of RabbitMQ tile settings with header called 'Syslog'.
The fields shown are described in the table in the next step.](images/syslog-config.png)
1. Configure the fields on the **Syslog** pane as follows:
<table class="nice">
<tr>
<th>Field</th>
<th>Description</th>
</tr>
<tr>
<td><strong>Syslog Address</strong></td>
<td>Enter the IP or DNS address of the syslog server.</td>
</tr>
<tr>
<td><strong>Syslog Port</strong></td>
<td>Enter the port of the syslog server.</td>
</tr>
<tr>
<td><strong>Transport Protocol</strong></td>
<td>Select the transport protocol of the syslog server. The options are <strong>TCP</strong>,
<strong>UDP</strong>, or <strong>RELP</strong>.</td>
</tr>
<tr>
<td><strong>Enable TLS</strong></td>
<td>Enable TLS to the syslog server.</td>
</tr>
<tr>
<td><strong>Permitted Peer</strong></td>
<td>If there are several peer servers that can respond to remote syslog connections,
enter a wildcard in the domain, such as <code>*.example.com</code>.</td>
</tr>
<tr>
<td><strong>SSL Certificate</strong></td>
<td>If the server certificate is not signed by a known authority, for example, when you use an internal syslog
server, enter the CA certificate of the log management service endpoint.</td>
</tr>
<tr>
<td><strong>Queue Size</strong></td>
<td>The number of log entries the buffer holds before dropping messages.
A larger buffer size might overload the system. The default is 100000.</td>
</tr>
<tr>
<td><strong>Forward Debug Logs</strong></td>
<td>Some components produce very long debug logs. This option prevents them from being
forwarded.
These logs are still written to local disk.</td>
</tr>
<tr>
<td><strong>Custom Rules</strong></td>
<td>
The custom rsyslog rules are written in
<a href="https://www.rsyslog.com/doc/v8-stable/rainerscript/index.html">RainerScript</a>
and are inserted before the rule that forwards logs.
For the list of custom rules you can add in this field, see
<a href="#rabbitmq-syslog-custom-rules">RabbitMQ Syslog Custom Rules</a> later in this topic.
For more information about the program names you can use in the custom rules, see
<a href="#program-names">RabbitMQ Program Names</a> later in this topic.
</td>
</tr>
</table>
1. Click **Save**.
1. Return to the <%= vars.ops_manager %> Installation Dashboard.
1. Click **Review Pending Changes**.
For more information about this <%= vars.ops_manager %> page,
see [Reviewing Pending Product Changes](https://docs.pivotal.io/ops-manager/install/review-pending-changes.html).
1. Click **Apply Changes** to redeploy with the changes.
### <a id="log-format"></a> Logging Format
With <%= vars.product_short %> logging configured, several types of components generate logs:
the RabbitMQ message server nodes, the service brokers, and (if present) HAProxy.
* The logs for RabbitMQ server nodes follow the format:
```
[job:"rabbitmq-server" ip:"192.0.2.0"]
```
* The logs for the pre-provisioned RabbitMQ service broker follow the format:
```
[job:"rabbitmq-broker" ip:"192.0.2.1"]
```
* The logs for the on-demand RabbitMQ service broker follow the format:
```
[job:"on-demand-broker" ip:"192.0.2.2"]
```
* The logs for HAProxy nodes follow the format:
```
[job:"rabbitmq-haproxy" ip:"192.0.2.3"]
```
RabbitMQ and HAProxy servers log at the <code>info</code> level and capture errors,
warnings, and informational messages.
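When filtering forwarded logs, the `job` value in the fragments above identifies which component produced an entry. The following Python sketch maps that value to a component description. It looks only at the bracketed `[job:"..." ip:"..."]` fragment shown above and makes no assumption about the rest of the log line. The `identify_component` helper and the description strings are hypothetical.

```python
import re

# Matches the [job:"..." ip:"..."] fragment shown in the formats above.
LOG_PREFIX_RE = re.compile(r'\[job:"(?P<job>[^"]+)" ip:"(?P<ip>[^"]+)"\]')

# Job names taken from the formats listed in this section.
COMPONENTS = {
    "rabbitmq-server": "RabbitMQ server node",
    "rabbitmq-broker": "pre-provisioned service broker",
    "on-demand-broker": "on-demand service broker",
    "rabbitmq-haproxy": "HAProxy node",
}


def identify_component(log_line):
    """Return (component description, ip), or None if the fragment is absent."""
    match = LOG_PREFIX_RE.search(log_line)
    if match is None:
        return None
    description = COMPONENTS.get(match.group("job"), "unknown component")
    return description, match.group("ip")


print(identify_component('[job:"rabbitmq-haproxy" ip:"192.0.2.3"] example message'))
```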
<%= partial vars.path_to_partials + '/rabbitmq/log-formats' %>
## <a id="reference"></a> Component Metrics Reference
<%= vars.product_short %> component VMs emit the following raw metrics.
<p class="note">
<strong>Note:</strong> As of <%= vars.product_short %> v2.0, the format of the metrics has changed.
For a list of the changes to metric names in <%= vars.product_short %> v2.0, see
<a href="./migrate-2-0-metrics.html">Migrating Metrics from <%= vars.product_short %> v1.x to v2.0</a>.
</p>
### <a id="rabbitmq-metrics"></a> RabbitMQ Server Metrics
RabbitMQ server metrics are emitted by the `rabbitmq_prometheus` plug-in.
The list of metrics provided is extensive, and allows full observability of your messages, VM health, and more.
For the full list of metrics emitted, see the
[rabbitmq-server](https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbitmq_prometheus/metrics.md)
repository in GitHub.
### <a id="haproxy-metrics"></a>HAProxy Metrics (Pre-Provisioned Only)
<%= vars.product_short %> HAProxy components emit the following metrics.
<table>
<tr>
<th>Metric Name</th>
<th>Unit</th>
<th>Description</th>
</tr>
<tr>
<td><code>_p_rabbitmq_haproxy_heartbeat</code></td>
<td>Boolean</td>
<td>Indicates whether the RabbitMQ HAProxy component is available and can respond to requests</td>
</tr>
<tr>
<td><code>_p_rabbitmq_haproxy_health_connections</code></td>
<td>Count</td>
<td>The total number of concurrent front-end connections to the server</td>
</tr>
<tr>
<td><code>_p_rabbitmq_haproxy_backend_qsize_amqp</code></td>
<td>Size</td>
<td>The total size of the AMQP queue on the server</td>
</tr>
<tr>
<td><code>_p_rabbitmq_haproxy_backend_retries_amqp</code></td>
<td>Count</td>
<td>The number of AMQP retries to the server</td>
</tr>
<tr>
<td><code>_p_rabbitmq_haproxy_backend_ctime_amqp</code></td>
<td>Time</td>
<td>The total time to establish the TCP AMQP connection to the server</td>
</tr>
</table>
### <a id="odb-metrics"></a>On-Demand Broker Metrics
The <%= vars.product_short %> on-demand broker emits the following metrics.
<table>
<tr>
<th>Metric Name</th>
<th>Unit</th>
<th>Description</th>
</tr>
<tr>
<td><code>_on_demand_broker_p_rabbitmq_quota_remaining</code></td>
<td>Count</td>
<td>The remaining quota of on-demand service instances for this broker</td>
</tr>
<tr>
<td><code>_on_demand_broker_p_rabbitmq_total_instances</code></td>
<td>Count</td>
<td>The total count of on-demand service instances created by this broker</td>
</tr>
<tr>
<td><code>_on_demand_broker_p_rabbitmq_{PLAN_NAME}_quota_remaining</code></td>
<td>Count</td>
<td>The remaining quota of on-demand service instances for this broker for a specific plan</td>
</tr>
<tr>
<td><code>_on_demand_broker_p_rabbitmq_{PLAN_NAME}_total_instances</code></td>
<td>Count</td>
<td>The total count of on-demand service instances created by this broker for a specific plan</td>
</tr>
</table>
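Because the per-plan metrics embed the plan name in the metric name, a consumer usually needs to separate the two. The following Python sketch shows one possible approach. The `parse_odb_metric` helper is hypothetical, and the plan name `single_node` is an invented example. How plan names are normalized inside metric names is not described in this topic, so treat the pattern as an assumption.

```python
import re

# Optional "{PLAN_NAME}_" segment followed by one of the two metric kinds.
ODB_METRIC_RE = re.compile(
    r"^_on_demand_broker_p_rabbitmq_(?:(?P<plan>.+)_)?"
    r"(?P<kind>quota_remaining|total_instances)$"
)


def parse_odb_metric(name):
    """Return (plan, kind); plan is None for broker-wide metrics."""
    match = ODB_METRIC_RE.match(name)
    if match is None:
        return None
    return match.group("plan"), match.group("kind")


print(parse_odb_metric("_on_demand_broker_p_rabbitmq_quota_remaining"))
print(parse_odb_metric("_on_demand_broker_p_rabbitmq_single_node_total_instances"))
```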