From 4d8bf246a953514c1ac6750818fe986d1e970bf8 Mon Sep 17 00:00:00 2001
From: Areg Vrtanesyan
Date: Tue, 15 May 2018 10:35:30 +0100
Subject: [PATCH 1/2] Initial attempt for merging https://github.com/robbrucks/libzbxpgsql-streaming to main stream of https://github.com/cavaliercoder/libzbxpgsql

---
 README-STREAMING.md                           |   69 +
 README.md                                     |   20 +
 conf/libzbxpgsql-streaming.conf               |  144 +
 conf/libzbxpgsql.conf                         |    1 +
 sql/replication_pump_func.sql                 |   29 +
 ...mplate_PostgreSQL_Server_3.0_Secondary.xml | 6015 +++++++++++++++++
 ...mplate_PostgreSQL_Server_3.0_Streaming.xml |  498 ++
 ...tgreSQL_Server_3.0_Streaming_Secondary.xml |  498 ++
 8 files changed, 7274 insertions(+)
 create mode 100644 README-STREAMING.md
 create mode 100644 conf/libzbxpgsql-streaming.conf
 create mode 100644 sql/replication_pump_func.sql
 create mode 100644 templates/Template_PostgreSQL_Server_3.0_Secondary.xml
 create mode 100644 templates/Template_PostgreSQL_Server_3.0_Streaming.xml
 create mode 100644 templates/Template_PostgreSQL_Server_3.0_Streaming_Secondary.xml

diff --git a/README-STREAMING.md b/README-STREAMING.md
new file mode 100644
index 0000000..cd9849f
--- /dev/null
+++ b/README-STREAMING.md
@@ -0,0 +1,69 @@
+# libzbxpgsql-streaming
+Monitoring Add-On for libzbxpgsql v1.1 to monitor PostgreSQL Streaming Replication on Zabbix
+
+Ref: https://github.com/robbrucks/libzbxpgsql-streaming
+
+## Setup
+
+*This Document assumes you have the libzbxpgsql v1.1 module installed, configured, and already successfully monitoring your postgres database clusters (instances)*
+
+If you have set up Streaming Replication as in https://wiki.postgresql.org/wiki/Streaming_Replication #37 you can also add the following Templates and configuration to the host.
+
+ * Main Template called `Template_PostgreSQL_Server_3.0_Streaming.xml`
+ * Secondary Template called `Template_PostgreSQL_Server_3.0_Streaming_Secondary.xml`
+
+ In order to use the Secondary one it is necessary to prepare it:
+
+ * Variable `@Secondary@` is for UI Names
+ * Variable `@SECONDARY@` is for separating `PG_CONN` from main template - #112, #107
+
+ sed -e 's/@Secondary@/SomeNiceName/g; s/@SECONDARY@/INSTANCENAME/g;' Template_PostgreSQL_Server_3.0_Streaming_Secondary.xml > Template_PostgreSQL_Server_3.0_Streaming_SomeNiceName.xml
+
+This will distinguish instances running on same host but different ports.
+
+1. Copy the `libzbxpgsql-streaming.conf` file as `libzbxpgsql.conf` into the `/etc/zabbix` directory on your master and slave DB servers
+1. Execute the SQL script `sql/replication_pump_func.sql` on each *master* DB cluster against the same database as defined in your {$PG\_DB} macro in Zabbix (the `postgres` database by default)
+ ```
+ psql -f replication_pump_func.sql -U postgres -d postgres
+ ```
+1. If you will be using a DB user other than `postgres` to connect to the DB from the Zabbix agent, you will need to grant execute on the function to that user:
+ ```
+ psql -c 'GRANT EXECUTE ON FUNCTION replication_pump() TO your_zabbix_user;' -U postgres postgres
+ ```
+1. Link the `Template App PostgreSQL Streaming` template to your master and each of your slave DB hosts via the Zabbix UI
+1. Restart the zabbix-agent on your DB servers
+
+## What is monitored?
+* Count of WAL log bytes waiting to be applied on each _connected_ slave ("lag bytes"; measured on the master)
+* Number of seconds a slave is behind the master ("lag seconds"; measured on each slave)
+* Whether or not replication has been paused on a slave
+
+## What does it alert on?
+* If "lag bytes" exceeds the value of Zabbix macro variable "{$PG\_ALRT\_SLAVE\_LAG\_BYTES}" (default 100mb)
+* If "lag seconds" exceeds the value of Zabbix macro variable "{$PG\_ALRT\_SLAVE\_LAG\_SECS}" (default 300 seconds)
+* If replication is manually paused on a slave
+
+## What's up with the "Replication Master Log Pump" thingy?
+Get ready for a lengthy explanation...
+
+On a master/slave setup using streaming replication, selecting from the `pg_last_xact_replay_timestamp()` function on the slaves will report the timestamp of the last update replayed. This works fine when the master has constant update activity, but if there is a period of time where there are no updates on the master then the replay timestamp will not get updated on the slaves (since there are no changes to stream). This can cause the slave to _appear_ to be significantly behind the master despite actually being up to date.
+
+I discovered that issuing a simple `NOTIFY` command on the master will cause the notification to be streamed to the slaves. The PostgreSQL documentation indicates that the notification is discarded if there are no corresponding listeners, so this appears to be a relatively harmless and lightweight method to force the replay timestamp to be updated on the slaves.
+
+The `NOTIFY` command does not make any changes to data or schemas in the database and it does not require any special permissions to execute.
+
+Unfortunately libzbxpgsql cannot issue a `NOTIFY` command directly, so I had to implement it using a function.
+
+So the `PostgreSQL Streaming Replication Master Log Pump` item runs this function every 30 seconds to execute a `NOTIFY` and "pumps" the WAL log stream.
+
+By issuing the `NOTIFY` every 30 seconds I can ensure that the replay timestamp on the slaves is updated at least that frequently, even if a master goes "quiet". Then if the timestamp fails to get updated for longer than 30 seconds I can alert that there is truly a problem with replication.
+
+This seemed a far more elegant solution than creating a single-row table with a timestamp, regularly updating it, and watching for the timestamp update on the slaves. It eliminates the need for a table, permissions, and frequent vacuums of the table.
+
+## But I can alert on lag bytes...
+Yes, this template also measures the lag bytes as reported by the master in the `pg_stat_replication` view, and it will alert if lag bytes becomes high. But the `pg_stat_replication` view has a critical weakness: if communication with the slave is lost, the corresponding row in `pg_stat_replication` for that slave is immediately deleted and you will not know how far behind replication is. Since that row is gone, Zabbix can't measure any lag and can't alert that the slave is falling behind.
+
+## What am I trying to solve here?
+I'm trying to solve one of the more commonly encountered problems with replication: How can I tell that a communication issue has stalled streaming replication?
+
+I think I've solved it - but please let me know if I've got something wrong...
diff --git a/README.md b/README.md
index a5a02ef..b0f31e0 100644
--- a/README.md
+++ b/README.md
@@ -45,6 +45,26 @@ To build the RPM package on a RHEL6+ family system with `rpm-build` installed:

     make rpm

+## Templates
+
+ For Zabbix 3.0 there are 2 templates:
+
+ * Main Template called `Template_PostgreSQL_Server_3.0.xml`
+ * Secondary Template called `Template_PostgreSQL_Server_3.0_Secondary.xml`
+
+ In order to use the Secondary one it is necessary to prepare it:
+
+ * Variable `@Secondary@` is for UI Names
+ * Variable `@SECONDARY@` is for separating `PG_CONN` from main template - #112, #107
+
+ sed -e 's/@Secondary@/SomeNiceName/g; s/@SECONDARY@/INSTANCENAME/g;' Template_PostgreSQL_Server_3.0_Secondary.xml > Template_PostgreSQL_Server_3.0_SomeNiceName.xml
+
+This will distinguish instances running on same host but different ports.
+
+## Streaming Monitoring
+
+Please follow instructions as per README-STREAMING.md
+
 ## License

diff --git a/conf/libzbxpgsql-streaming.conf b/conf/libzbxpgsql-streaming.conf
new file mode 100644
index 0000000..8d319eb
--- /dev/null
+++ b/conf/libzbxpgsql-streaming.conf
@@ -0,0 +1,144 @@
+# File: /etc/zabbix/libzbxpgsql.conf
+#
+# This file contains configuration for all pg.* keys.
+#
+# By default, this file is loaded from /etc/zabbix/libzbxpgsql.conf, unless
+# the PGCONFIGFILE environment variable is set to a different path.
+#
+# The config file is only read at startup of Zabbix agent. If you modify the
+# config file, you will need to restart the Zabbix agent for it to take effect.
+#
+# Syntax errors in the config file will prevent Zabbix from starting.
+#
+# The config files are parsed by the C libconfig module:
+# http://www.hyperrealm.com/main.php?s=libconfig
+#
+# Comment lines begin with a hash '#'.
+#
+# The format for defining named SQL queries is:
+# queries = {
+# SQLkey = "SQL statement";
+# };
+#
+# Requirements:
+# - The SQL key must be alphanumeric and can contain dashes and underscores
+# (-DO NOT- use asterisks or spaces in the key name).
+# - The entire SQL statement must be enclosed in double quotes.
+# - If your SQL statement needs to utilize double-quotes, then they MUST be
+# escaped by a backslash:
+# "SELECT \"UPPERCASECOLUMN\" from table;";
+# - A semicolon is required at the end of each config entry.
+#
+# Example Query Setup (with substitution variables):
+# * Zabbix agent key, including a named query:
+# pg_query.integer[,,myquery,45,200]
+#
+# * Matching query from the config file:
+# myquery = "Select $1::int + $2::int;";
+#
+# * The agent will return the integer: 245
+#
+# SQL statements can span multiple lines, and may optionally contain extra
+# begin/end quotes on each line. The following two examples are both valid:
+#
+# GoodSQL1 = "select count(*)
+# from pg_stat_activity;";
+#
+# AlsoGood = "select count(*) "
+# " from pg_stat_activity;";
+#
+
+# Example Queries
+queries = {
+ teststr = "SELECT $1::text || $2::text;";
+ testint = "SELECT $1::int;";
+ testdbl = "SELECT $1::decimal;";
+ testdsc = "SELECT * FROM pg_database;";
+
+
+######################################################
+#  _____            _ _           _   _
+# |  __ \          | (_)         | | (_)
+# | |__) |___ _ __ | |_  ___ __ _| |_ _  ___  _ __
+# |  _  // _ \ '_ \| | |/ __/ _` | __| |/ _ \| '_ \
+# | | \ \  __/ |_) | | | (_| (_| | |_| | (_) | | | |
+# |_|  \_\___| .__/|_|_|\___\__,_|\__|_|\___/|_| |_|
+#            | |
+#            |_|
+######################################################
+# NOTES!
+#
+# 1. The following SQL:
+# "now() > pg_last_xact_replay_timestamp()"
+# is required in certain corner cases where
+# time drift between two servers causes a
+# negative time lag which converts into a huge
+# unsigned number shown for lag in Zabbix.
+#
+# 2. The "replpump" method appears to be a good
+# way to force streaming to occur on "quiet"
+# databases without performing updates, but
+# I do not know if there are any potential
+# long-term problems with constantly issuing
+# a notification with no listeners. Based on
+# emails to pg-general it *should not* be a
+# problem.
+#
+
+# Replication Master Discovery
+dscrepmstr = "select case when client_hostname is not null "
+ "then client_hostname "
+ "else case when client_addr is not null "
+ "then host(client_addr) "
+ "else 'localhost'::text "
+ "end "
+ "end as \"PGSLAVE\" "
+ ", case when client_addr is not null "
+ "then host(client_addr) "
+ "else 'localhost'::text "
+ " end as \"PGSLAVEIP\" "
+ "from pg_stat_replication"
+ ";";
+
+# Replication Slave Discovery
+dscrepslave = "select 'Slave' as \"PGSLV\" "
+ ", 'repllag' as \"PGSLVLAG\" "
+ ", 'replpaused' as \"PGSLVPAUSED\" "
+ "where pg_is_in_recovery() = TRUE; "
+ ";";
+
+# Force replication to stream when master is idle (EXPERIMENTAL!)
+# (executed on master)
+replpump = "select replication_pump();";
+
+# Replication lag seconds (measured on slave)
+repllag = "select case when pg_is_in_recovery() = TRUE "
+ "then case when now() > pg_last_xact_replay_timestamp() "
+ "then extract(epoch from now() - pg_last_xact_replay_timestamp())::int "
+ "else 0::int "
+ "end "
+ "else 0::int "
+ "end "
+ ";";
+
+# Replication lag bytes (measured on master)
+repllagbytes = "select pg_xlog_location_diff(pg_current_xlog_location(),replay_location) "
+ "from pg_stat_replication "
+ "where ( ( $1::text != 'localhost' "
+ "and client_addr = $1::inet) "
+ "or ( $1::text = 'localhost' "
+ "and client_addr is null) ) "
+ ";";
+
+# Replication paused (measured on slave)
+replpaused = "select case when pg_is_in_recovery() = TRUE "
+ "then case when pg_is_xlog_replay_paused() = TRUE "
+ "then 1::int "
+ "else 0::int "
+ "end "
+ "else 0::int "
+ "end"
+ ";";
+
+};
+
diff --git a/conf/libzbxpgsql.conf b/conf/libzbxpgsql.conf
index 6d5a46e..70dfe9a 100644
--- a/conf/libzbxpgsql.conf
+++ b/conf/libzbxpgsql.conf
@@ -55,3 +55,4 @@ queries = {
  testdbl = "SELECT $1::decimal;";
  testdsc = "SELECT * FROM pg_database;";
 };
+
diff --git a/sql/replication_pump_func.sql b/sql/replication_pump_func.sql
new file mode 100644
index 0000000..919e0e8
--- /dev/null
+++ b/sql/replication_pump_func.sql
@@ -0,0 +1,29 @@
+SET ROLE postgres;
+
+-- If there are slaves and this DB is NOT in recovery,
+-- then issue a fake notify command to force the log
+-- to stream. This will update pg_last_xact_replay_timestamp
+-- on all slaves.
+
+BEGIN;
+
+ CREATE OR REPLACE FUNCTION replication_pump()
+ RETURNS void
+ AS $$
+ DECLARE slavect int;
+ BEGIN
+ SELECT count(*) INTO slavect FROM pg_stat_replication;
+ IF slavect > 0 AND pg_is_in_recovery() = FALSE THEN
+ NOTIFY libzbxpgsql_fake_notify;
+ END IF;
+ END;
+ $$
+ LANGUAGE plpgsql
+ VOLATILE
+ ;
+
+ REVOKE ALL ON FUNCTION replication_pump() FROM public;
+ GRANT EXECUTE ON FUNCTION replication_pump() TO postgres;
+
+COMMIT;
+
diff --git a/templates/Template_PostgreSQL_Server_3.0_Secondary.xml b/templates/Template_PostgreSQL_Server_3.0_Secondary.xml
new file mode 100644
index 0000000..aa6712c
--- /dev/null
+++ b/templates/Template_PostgreSQL_Server_3.0_Secondary.xml
@@ -0,0 +1,6015 @@
+ + + 3.0 + 2018-05-13T22:11:39Z + + + Templates + + + + + + + + {Template App PostgreSQL @Secondary@:pg.backends.free[{$PG_CONN_@SECONDARY@},{$PG_DB}].last()}<{$PG_BACKENDS_CRIT} + 0 + + PostgreSQL (@Secondary@) Backend connections are exhausted on {HOST.NAME} + 0 + + https://www.postgresql.org/docs/current/static/runtime-config-connection.html#GUC-MAX-CONNECTIONS + 0 + 4 + Less than {$PG_BACKENDS_CRIT} backends connections are available. + +Investigate the issue immediately and consider increasing max_connections.
+ 0 + 0 + + + + + ({TRIGGER.VALUE}=0 and {Template App PostgreSQL @Secondary@:pg.prepared_xacts_ratio[{$PG_CONN_@SECONDARY@},{$PG_DB}].last()}>{$PG_PXACT_WARN} and {Template App PostgreSQL @Secondary@:pg.prepared_xacts_ratio[{$PG_CONN_@SECONDARY@},{$PG_DB}].last()}<{$PG_PXACT_CRIT}) or ({TRIGGER.VALUE}=1 and {Template App PostgreSQL @Secondary@:pg.prepared_xacts_ratio[{$PG_CONN_@SECONDARY@},{$PG_DB}].last()}>{$PG_PXACT_WARN}) + 0 + + PostgreSQL (@Secondary@) Prepared transactions are near exhaustion on {HOST.NAME} + 0 + + https://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-MAX-PREPARED-TRANSACTIONS + 0 + 2 + 80% of the maximum configured prepared transactions are in use. + +Investigate the issue and consider increasing max_prepared_transactions. + 0 + 0 + + + + + {Template App PostgreSQL @Secondary@:pg.uptime[{$PG_CONN_@SECONDARY@},{$PG_DB}].change(0)}<0 + 0 + + PostgreSQL (@Secondary@) Server on {HOST.NAME} has just been restarted + 0 + + + 0 + 1 + + 0 + 0 + + + + + {Template App PostgreSQL @Secondary@:pg.connect[{$PG_CONN_@SECONDARY@},{$PG_DB}].max(#3)}<1 + 0 + + PostgreSQL (@Secondary@) Server on {HOST.NAME} is unreachable for the last 3 polls + 0 + + + 0 + 4 + + 0 + 0 + + + + + {Template App PostgreSQL @Secondary@:pg.version[{$PG_CONN_@SECONDARY@},{$PG_DB}].diff(0)}>0 + 0 + + PostgreSQL (@Secondary@) Server version was changed on {HOST.NAME} + 0 + + + 0 + 1 + + 0 + 0 + + + + + {Template App PostgreSQL @Secondary@:pg.db.xid_age[{$PG_CONN_@SECONDARY@},{$PG_DB}].last()}>{$PG_XID_CRIT} + 0 + + PostgreSQL (@Secondary@) Transaction IDs are exhausted on {HOST.NAME} + 0 + + https://www.postgresql.org/docs/current/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND + 0 + 4 + Less than 1 million (0.0005%) Transactions IDs are remaining for allocation. + +Perform a VACUUM immediately to reset the available Transaction IDs and review your auto-vacuum policy. 
+ 0 + 0 + + + + + ({TRIGGER.VALUE}=0 and {Template App PostgreSQL @Secondary@:pg.db.xid_age[{$PG_CONN_@SECONDARY@},{$PG_DB}].last()}>{$PG_XID_WARN} and {Template App PostgreSQL @Secondary@:pg.db.xid_age[{$PG_CONN_@SECONDARY@},{$PG_DB}].last()}<{$PG_XID_CRIT}) or ({TRIGGER.VALUE}=1 and {Template App PostgreSQL @Secondary@:pg.db.xid_age[{$PG_CONN_@SECONDARY@},{$PG_DB}].last()}>{$PG_XID_WARN}) + 0 + + PostgreSQL (@Secondary@) Transaction IDs are near exhaustion on {HOST.NAME} + 0 + + https://www.postgresql.org/docs/current/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND + 0 + 2 + Less than 10 million (0.04%) Transactions IDs are remaining for allocation. + +Perform a VACUUM as soon a possible to reset the available Transaction IDs and review your auto-vacuum policy. + 0 + 0 + + + + + + + PostgreSQL (@Secondary@) Backend Connections + 900 + 200 + 0.0000 + 100.0000 + 1 + 1 + 0 + 1 + 0 + 0.0000 + 0.0000 + 0 + 0 + 0 + 0 + + + 0 + 0 + 000000 + 0 + 2 + 0 + + Template App PostgreSQL @Secondary@ + pg.backends.count[{$PG_CONN_@SECONDARY@},{$PG_DB}] + + + + + + PostgreSQL (@Secondary@) Buffers + 900 + 200 + 0.0000 + 100.0000 + 1 + 1 + 0 + 1 + 0 + 0.0000 + 0.0000 + 0 + 0 + 0 + 0 + + + 0 + 0 + 00C800 + 0 + 2 + 0 + + Template App PostgreSQL @Secondary@ + pg.buffers_alloc[{$PG_CONN_@SECONDARY@},{$PG_DB}] + + + + 1 + 0 + C80000 + 0 + 2 + 0 + + Template App PostgreSQL @Secondary@ + pg.buffers_clean[{$PG_CONN_@SECONDARY@},{$PG_DB}] + + + + 2 + 0 + 0000C8 + 0 + 2 + 0 + + Template App PostgreSQL @Secondary@ + pg.buffers_backend[{$PG_CONN_@SECONDARY@},{$PG_DB}] + + + + 3 + 0 + C800C8 + 0 + 2 + 0 + + Template App PostgreSQL @Secondary@ + pg.buffers_checkpoint[{$PG_CONN_@SECONDARY@},{$PG_DB}] + + + + + + PostgreSQL (@Secondary@) Checkpoints + 900 + 200 + 0.0000 + 100.0000 + 1 + 1 + 0 + 1 + 0 + 0.0000 + 0.0000 + 0 + 0 + 0 + 0 + + + 0 + 0 + 00C800 + 0 + 2 + 0 + + Template App PostgreSQL @Secondary@ + pg.checkpoint_sync_time[{$PG_CONN_@SECONDARY@},{$PG_DB}] + + + + 1 + 0 + 0000C8 + 0 + 2 
+ 0 + + Template App PostgreSQL @Secondary@ + pg.checkpoint_write_time[{$PG_CONN_@SECONDARY@},{$PG_DB}] + + + + + + PostgreSQL (@Secondary@) Configuration thresholds + 900 + 200 + 0.0000 + 100.0000 + 1 + 0 + 0 + 1 + 0 + 0.0000 + 0.0000 + 1 + 1 + 0 + 0 + + + 0 + 5 + C80000 + 0 + 2 + 0 + + Template App PostgreSQL @Secondary@ + pg.prepared_xacts_ratio[{$PG_CONN_@SECONDARY@},{$PG_DB}] + + + + 1 + 5 + F63100 + 0 + 2 + 0 + + Template App PostgreSQL @Secondary@ + pg.backends.ratio[{$PG_CONN_@SECONDARY@},{$PG_DB}] + + + + + +
diff --git a/templates/Template_PostgreSQL_Server_3.0_Streaming.xml b/templates/Template_PostgreSQL_Server_3.0_Streaming.xml
new file mode 100644
index 0000000..6c5414c
--- /dev/null
+++ b/templates/Template_PostgreSQL_Server_3.0_Streaming.xml
@@ -0,0 +1,498 @@
+ + + 3.4 + 2018-05-15T08:47:55Z + + + Templates + + + + + +
diff --git a/templates/Template_PostgreSQL_Server_3.0_Streaming_Secondary.xml b/templates/Template_PostgreSQL_Server_3.0_Streaming_Secondary.xml
new file mode 100644
index 0000000..eab558c
--- /dev/null
+++ b/templates/Template_PostgreSQL_Server_3.0_Streaming_Secondary.xml
@@ -0,0 +1,498 @@
+ + + 3.4 + 2018-05-15T08:48:14Z + + + Templates + + + + + +

From 72950a914006d1396a3d3d1f7855b04c43cc6054 Mon Sep 17 00:00:00 2001
From: Areg
Date: Tue, 15 May 2018 09:41:28 +0000
Subject: [PATCH 2/2] Update README.md

Small fixes
---
 README.md | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index b0f31e0..d81c995 100644
--- a/README.md
+++ b/README.md
@@ -47,17 +47,19 @@ To build the RPM package on a RHEL6+ family system with `rpm-build` installed:

 ## Templates

- For Zabbix 3.0 there are 2 templates:
+For Zabbix 3.0 there are 2 templates:

- * Main Template called `Template_PostgreSQL_Server_3.0.xml`
- * Secondary Template called `Template_PostgreSQL_Server_3.0_Secondary.xml`
+* Main Template called `Template_PostgreSQL_Server_3.0.xml`
+* Secondary Template called `Template_PostgreSQL_Server_3.0_Secondary.xml`

- In order to use the Secondary one it is necessary to prepare it:
+In order to use the Secondary one it is necessary to prepare it:

- * Variable `@Secondary@` is for UI Names
- * Variable `@SECONDARY@` is for separating `PG_CONN` from main template - #112, #107
+* Variable `@Secondary@` is for UI Names
+* Variable `@SECONDARY@` is for separating `PG_CONN` from main template - #112, #107

- sed -e 's/@Secondary@/SomeNiceName/g; s/@SECONDARY@/INSTANCENAME/g;' Template_PostgreSQL_Server_3.0_Secondary.xml > Template_PostgreSQL_Server_3.0_SomeNiceName.xml
+```
+sed -e 's/@Secondary@/SomeNiceName/g; s/@SECONDARY@/INSTANCENAME/g;' Template_PostgreSQL_Server_3.0_Secondary.xml > Template_PostgreSQL_Server_3.0_SomeNiceName.xml
+```

 This will distinguish instances running on same host but different ports.
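
Note on the template preparation step: the `sed` one-liner in the Templates section could be wrapped roughly as follows. This is only a sketch, not part of the patch. The helper name `prepare_secondary` and the values `Reporting` / `REPORTING` are made up for illustration, and it assumes that neither value contains `/`, `&` or a backslash, which `sed` treats specially in replacement text.

```shell
#!/bin/sh
# Hypothetical helper (not in the patch): fill in both placeholders of a
# Secondary template.
#   $1 = display name substituted for @Secondary@ (used in UI names)
#   $2 = suffix substituted for @SECONDARY@ (ends up in {$PG_CONN_...})
# Reads the template on stdin and writes the prepared copy to stdout.
# Assumption: $1 and $2 contain no '/', '&' or backslash characters.
prepare_secondary() {
    sed -e "s/@Secondary@/$1/g" -e "s/@SECONDARY@/$2/g"
}

# Demo on a one-line stand-in for the real XML template file.
printf '%s\n' 'PostgreSQL (@Secondary@) uses {$PG_CONN_@SECONDARY@}' |
    prepare_secondary Reporting REPORTING
```

With the real file this would be called as `prepare_secondary SomeNiceName INSTANCENAME < Template_PostgreSQL_Server_3.0_Secondary.xml > Template_PostgreSQL_Server_3.0_SomeNiceName.xml`. The two placeholders differ only in case, so each gets its own `-e` expression, which keeps them easy to tell apart.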