Add Prometheus Metrics #675

benyanke · 2021-09-20T13:18:48Z

Would you be open to either a PR or documentation on how to use sanoid's --monitor-* flags with Prometheus? Prometheus is an increasingly popular monitoring tool that many sysadmins, myself included, are using.

I'm thinking a potential implementation could be either adding three new flags like --monitor-capacity-prometheus, or documenting a third-party tool that converts Nagios-format metrics into Prometheus format, like this (nag2prom doesn't exist yet, it's just an example):

sanoid --monitor-health | nag2prom > /var/lib/prometheus/node-exporter/sanoid-metrics.prom

Would you be open to a code or docs PR like this?
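
The hypothetical nag2prom filter above could be sketched roughly as follows. This assumes the status line begins with one of the conventional Nagios state words (OK, WARNING, CRITICAL, UNKNOWN); the metric name sanoid_check_state and the check label are invented for illustration and are not part of sanoid.

```python
import re

# Conventional Nagios plugin states mapped to their numeric codes.
NAGIOS_STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def nag2prom(line: str, check: str) -> str:
    """Convert one Nagios-style status line to Prometheus text format.

    Assumption: the line starts with the state word. The metric name
    and the "check" label are hypothetical, chosen for this sketch.
    """
    match = re.match(r"\s*(OK|WARNING|CRITICAL|UNKNOWN)\b", line)
    # Anything that does not start with a known state is reported as UNKNOWN.
    state = NAGIOS_STATES[match.group(1)] if match else NAGIOS_STATES["UNKNOWN"]
    return (
        "# HELP sanoid_check_state Nagios state (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).\n"
        "# TYPE sanoid_check_state gauge\n"
        f'sanoid_check_state{{check="{check}"}} {state}\n'
    )
```

To use it as a pipe filter, one would feed it the first line of sys.stdin and write the result to sys.stdout. Only the state is captured here; any detail text after the state word is discarded.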


jimsalterjrs · 2021-09-20T14:04:54Z

Yes. I'd probably prefer your "nag2prom" idea +docs, to avoid cluttering the code--at least at first.


benyanke · 2021-09-20T14:47:17Z

ok - I'll play with that.

phreaker0 · 2021-09-20T14:50:43Z

@benyanke you should look into zpool_influxdb (included with recent zfs versions), which can be scraped by prometheus via telegraf

equinox0815 · 2022-01-09T15:03:20Z

@benyanke have you made any progress on this?

benyanke · 2022-01-09T16:29:41Z

Sadly not. My Perl is not great, so although the task is fairly simple, I haven't made much progress.

Hooloovoo · 2022-01-19T12:38:31Z

@jimsalterjrs would you be open to a PR that just added a --monitoring-metrics-json flag, which would output all the relevant metrics included in the --monitor-health and --monitor-capacity output as machine-readable JSON?

I can see the value in keeping the code uncluttered, so I'm happy to keep the Prometheus-specific stuff in a separate script, but I would prefer a structured way to extract the variables over parsing command-line output.

jimsalterjrs · 2022-01-20T19:18:50Z

@Hooloovoo that sounds fine, as long as it's implemented cleanly.

benyanke · 2022-01-21T03:45:50Z

I like that idea; it allows far more flexible integration into any monitoring stack, not just Prometheus.

Hooloovoo · 2022-01-30T21:58:16Z

I have started to put something together in:
https://github.com/Hooloovoo/sanoid/tree/add_metrics_json

This is my first time ever coding in Perl, so I have likely made silly mistakes. So far I have only done the --monitor-snapshots information, as this is the information that is hardest to just replace with other ZFS prometheus monitoring.

I have the bones of a simple Python script that uses the Prometheus Python library to write the metrics in text format so that it can be picked up by the textfile collector I already have running on the nodes.
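
As a rough illustration of that JSON-to-textfile step (not the script itself), here is a minimal sketch. It uses only the standard library rather than the Prometheus Python library, to stay self-contained. The JSON layout is not shown in this thread, so the input shape (a mapping of dataset name to numeric metrics) and the sanoid_snapshot_ metric prefix are assumptions.

```python
import json
import os
import tempfile


def render_metrics(datasets: dict) -> str:
    """Render {dataset: {metric: number}} as Prometheus text format.

    Assumption: this input shape is hypothetical, as is the
    "sanoid_snapshot_" prefix. Label values are not escaped here.
    """
    lines = []
    metric_names = sorted({name for metrics in datasets.values() for name in metrics})
    for name in metric_names:
        lines.append(f"# TYPE sanoid_snapshot_{name} gauge")
        for dataset in sorted(datasets):
            if name in datasets[dataset]:
                value = datasets[dataset][name]
                lines.append(f'sanoid_snapshot_{name}{{dataset="{dataset}"}} {value}')
    return "\n".join(lines) + "\n"


def write_textfile(path: str, text: str) -> None:
    """Write via a temp file and rename, so a reader never sees a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp_path, path)


def convert(json_text: str, path: str) -> None:
    """Parse JSON text and write the rendered metrics to path."""
    write_textfile(path, render_metrics(json.loads(json_text)))
```

The write-then-rename step matters because the textfile collector may read the directory at any moment, and os.replace swaps the file in atomically on the same filesystem.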

I was planning to hold off on submitting an MP until I have actually made this all work, but I wanted to mention it to avoid anyone else duplicating the work.

At the moment I have taken the approach of exposing only the metrics and the Sanoid configuration. I have not exported Sanoid's calculations of whether those metrics should result in a warning or critical, as I think this should be possible within, e.g., a Prometheus alert (so the alert would trigger a critical if has_snapshots = 0 or newest_age_seconds > crit_age_seconds, unless monitor_dont_crit is set). I could easily be convinced I'm wrong with that approach, though. There probably isn't any harm in adding them, even if they are redundant. I suppose this could be either a couple more metrics, e.g. snapshot_critical and snapshot_warn, or a single metric snapshot_health that is 0 for OK, 1 for warn and 2 for critical, or something similar?
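
The single-metric option could be computed along these lines. This is a sketch, not Sanoid's actual logic: has_snapshots, newest_age_seconds, crit_age_seconds and monitor_dont_crit come from the description above, while warn_age_seconds and monitor_dont_warn are assumed counterparts, and the fall-through to "warn" when criticals are suppressed is a design guess.

```python
def snapshot_health(m: dict) -> int:
    """Return 0 (OK), 1 (warn) or 2 (critical) for one dataset's metrics.

    Assumption: "warn_age_seconds" and "monitor_dont_warn" are
    hypothetical counterparts of the critical-side names.
    """
    no_snapshots = m["has_snapshots"] == 0
    age = m.get("newest_age_seconds", 0)
    # Critical: no snapshots, or newest snapshot older than the critical age.
    if not m.get("monitor_dont_crit", False):
        if no_snapshots or age > m["crit_age_seconds"]:
            return 2
    # Warn: same test against the warning age, also reached when
    # criticals are suppressed.
    if not m.get("monitor_dont_warn", False):
        if no_snapshots or age > m["warn_age_seconds"]:
            return 1
    return 0
```

The same comparison could equally live in an alert rule on the raw metrics, which is the approach described above; exporting snapshot_health as well would just precompute it.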

Hooloovoo · 2022-02-20T21:52:36Z

I have uploaded my simple Python script here:
https://gitlab.com/aaron-w/sanoid_prometheus
along with some instructions on how I am using it. I have it running on a couple of machines and it seems to be doing what it is supposed to be doing. I would appreciate any views. Otherwise I'll try to get these initial changes merged (currently only dealing with the snapshot side; I'll hopefully add the ZFS health/capacity parts in the future).

Hooloovoo · 2022-08-17T19:50:56Z

I have put up a merge proposal that outputs JSON for snapshot information:
#761
There is an example in there showing what is included in the output. I have been using this successfully since February (as mentioned above).

So far this MP only deals with the snapshot information, as this is what really needs to come from Sanoid, so it was the most urgent part for me. There are other ways to extract the zpool health and capacity: I use a free Grafana Cloud account, and recent versions of grafana-agent export metrics like node_zfs_zpool_state{state="online"}.

I have designed the output format and code to accommodate someone (maybe me) adding the output of --monitor-health and --monitor-capacity into the same JSON output in the future.

Hooloovoo mentioned this issue Mar 28, 2022

Add tests for current behaviour of --monitor-snapshots #729

Open

Hooloovoo mentioned this issue Aug 16, 2022

Add metrics json #761

Open
