Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conf to monitor PostgreSQL Streaming replication #119

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
69 changes: 69 additions & 0 deletions README-STREAMING.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
# libzbxpgsql-streaming
Monitoring Add-On for libzbxpgsql v1.1 to monitor PostgreSQL Streaming Replication on Zabbix

Ref: https://github.com/robbrucks/libzbxpgsql-streaming

## Setup

*This Document assumes you have the libzbxpgsql v1.1 module installed, configured, and already succesfully monitoring your postgres database clusters (instances)*

If you have set up Streaming Replication as in https://wiki.postgresql.org/wiki/Streaming_Replication #37 you can also add following Templates and configuration to the host.

* Main Template called `Template_PostgreSQL_Server_3.0_Streaming.xml`
* Secondary Template called `Template_PostgreSQL_Server_3.0_Streaming_Secondary.xml`

In order to use the Secondary one it is necessary to prepare it:

* Variable `@Secondary@` is for UI Names
* Variable `@SECONDARY@` is for separating `PG_CONN` from main template - #112, #107

sed -e 's/@Secondary@/SomeNiceName/g; s/@SECONDARY@/INSTANCENAME/g;' Template_PostgreSQL_Server_3.0_Streaming_Secondary.xml > Template_PostgreSQL_Server_3.0_Streaming_SomeNiceName.xml

This will distinguish instances running on same host but different ports.

1. Copy the `libzbxpgsql-streaming.conf` file as `libzbxpgsql.conf` into the `/etc/zabbix` directory on your master and slave DB servers
1. Execute the SQL script `sql/replication_pump_func.sql` on each *master* DB cluster against the same database as defined in your {$PG\_DB} macro in Zabbix (the `postgres` database by default)
```
psql -f replication_pump_func.sql -U postgres -d postgres
```
1. If you will be using a DB user other than `postgres` to connect to the DB from the Zabbix agent, you will need to grant execute on the function to that user:
```
psql -c 'GRANT EXECUTE ON FUNCTION replication_pump() TO your_zabbix_user;' -U postgres postgres
```
1. Link the `Template App PostgreSQL Streaming` template to your master and each of your slave DB hosts via the Zabbix UI
1. Restart the zabbix-agent on your DB servers

## What is monitored?
* Count of WAL log bytes waiting to be applied on each _connected_ slave ("lag bytes"; measured on the master)
* Number of seconds a slave is behind the master ("lag seconds"; measured on each slave)
* Whether or not replication has been paused on a slave

## What does it alert on?
* If "lag bytes" exceeds the value of Zabbix macro variable "{$PG\_ALRT\_SLAVE\_LAG\_BYTES}" (default 100mb)
* If "lag seconds" exceeds the value of Zabbix macro variable "{$PG\_ALRT\_SLAVE\_LAG\_SECS}" (default 300 seconds)
* If replication is manually paused on a slave

## What's up with the "Replication Master Log Pump" thingy?
Get ready for a lengthy explanation...

On a master/slave setup using streaming replication, selecting from the `pg_last_xact_replay_timestamp()` function on the slaves will report the timestamp of the last update replayed. This works fine when the master has constant update activity, but if there is a period of time where there are no updates on the master then the replay timestamp will not get updated on the slaves (since there are no changes to stream). This can cause the slave to _appear_ to be significantly behind the master despite actually being up to date.

I discovered that issuing a simple `NOTIFY` command on the master will cause the notification to be streamed to the slaves. The PostgreSQL documentation indicates that the notification is discarded if there are no corresponding listeners, so this appears to be a relatively harmless and lightweight method to force the replay timestamp to be updated on the slaves.

The `NOTIFY` command does not make any changes to data or schemas in the database and it does not require any special permissions to execute.

Unfortunately libzbxpgsql cannot issue a `NOTIFY` command directly, so I had to implement it using a function.

So the `PostgreSQL Streaming Replication Master Log Pump` item runs this function every 30 seconds to execute a `NOTIFY` and "pumps" the WAL log stream.

By issuing the `NOTIFY` every 30 seconds I can ensure that the replay timestamp on the slaves is updated at least that frequently, even if a master goes "quiet". Then if the timestamp fails to get updated for longer than 30 seconds I can alert that there is truly a problem with replication.

This seemed a far more elegant solution than creating a single-row table with a timestamp, regularly updating it, and watching for the timestamp update on the slaves. It eliminates the need for a table, permissions, and frequent vacuums of the table.

## But I can alert on lag bytes...
Yes, this template also measures the lag bytes as reported by the master in the `pg_stat_replication` view, and it will alert if lag bytes becomes high. But the `pg_stat_replication` view has a critical weakness: if communication with the slave is lost, the corresponding row in `pg_stat_replication` for that slave is immediately deleted and you will not know how far behind replication is. Since that row is gone, Zabbix can't measure any lag and can't alert that the slave is falling behind.

## What am I trying to solve here?
I'm trying to solve one of the more commonly encountered problems with replication: How can I tell that a communication issue has stalled streaming replication?

I think I've solved it - but please let me know if I've got something wrong...
22 changes: 22 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,28 @@ To build the RPM package on a RHEL6+ family system with `rpm-build` installed:

make rpm

## Templates

For Zabbix 3.0 there are 2 templates:

* Main Template called `Template_PostgreSQL_Server_3.0.xml`
* Secondary Template called `Template_PostgreSQL_Server_3.0_Secondary.xml`

In order to use the Secondary one it is necessary to prepare it:

* Variable `@Secondary@` is for UI Names
* Variable `@SECONDARY@` is for separating `PG_CONN` from main template - #112, #107

```
sed -e 's/@Secondary@/SomeNiceName/g; s/@SECONDARY@/INSTANCENAME/g;' Template_PostgreSQL_Server_3.0_Secondary.xml > Template_PostgreSQL_Server_3.0_SomeNiceName.xml
```

This will distinguish instances running on same host but different ports.

## Streaming Monitoring

Please follow instructions as per README-STREAMING.md


## License

Expand Down
144 changes: 144 additions & 0 deletions conf/libzbxpgsql-streaming.conf
Original file line number Diff line number Diff line change
@@ -0,0 +1,144 @@
# File: /etc/zabbix/libzbxpgsql.conf
#
# This file contains configuration for all pg.* keys.
#
# By default, this file is loaded from /etc/zabbix/libzbxpgsql.conf, unless
# the PGCONFIGFILE environment variable is set to a different path.
#
# The config file is only read at startup of Zabbix agent. If you modify the
# config file, you will need to restart the Zabbix agent for it to take effect.
#
# Syntax errors in the config file will prevent Zabbix from starting.
#
# The config files are parsed by the C libconfig module:
# http://www.hyperrealm.com/main.php?s=libconfig
#
# Comment lines begin with a hash '#'.
#
# The format for defining named SQL queries is:
# queries = {
# SQLkey = "SQL statement";
# };
#
# Requirements:
# - The SQL key must be alphanumeric and can contain dashes and underscores
# (-DO NOT- use asterisks or spaces in the key name).
# - The entire SQL statement must be enclosed in double quotes.
# - If your SQL statement needs to utilize double-quotes, then they MUST be
# escaped by a backslash:
# "SELECT \"UPPERCASECOLUMN\" from table;";
# - A semicolon is required at the end of each config entry.
#
# Example Query Setup (with substitution variables):
# * Zabbix agent key, including a named query:
# pg_query.integer[,,myquery,45,200]
#
# * Matching query from the config file:
# myquery = "Select $1::int + $2::int;";
#
# * The agent will return the integer: 245
#
# SQL statements can span multiple lines, and may optionally contain extra
# begin/end quotes on each line. The following two examples are both valid:
#
# GoodSQL1 = "select count(*)
# from pg_stat_activity;";
#
# AlsoGood = "select count(*) "
# " from pg_stat_activity;";
#

# Example Queries
queries = {
teststr = "SELECT $1::text || $2::text;";
testint = "SELECT $1::int;";
testdbl = "SELECT $1::decimal;";
testdsc = "SELECT * FROM pg_database;";


######################################################
# _____ _ _ _ _
# | __ \ | (_) | | (_)
# | |__) |___ _ __ | |_ ___ __ _| |_ _ ___ _ __
# | _ // _ \ '_ \| | |/ __/ _` | __| |/ _ \| '_ \
# | | \ \ __/ |_) | | | (_| (_| | |_| | (_) | | | |
# |_| \_\___| .__/|_|_|\___\__,_|\__|_|\___/|_| |_|
# | |
# |_|
######################################################
# NOTES!
#
# 1. The following SQL:
# "now() > pg_last_xact_replay_timestamp()"
# is required in certain corner cases where
# time drift between two servers causes a
# negative time lag which converts into a huge
# unsigned number shown for lag in Zabbix.
#
# 2. The "replpump" method appears to be a good
# way to force streaming to occur on "quiet"
# databases without performing updates, but
# I do not know if there are any potential
# long-term problems with constantly issuing
# a notification with no listeners. Based on
# emails to pg-general it *should not* be a
# problem.
#

# Replication Master Discovery
dscrepmstr = "select case when client_hostname is not null "
"then client_hostname "
"else case when client_addr is not null "
"then host(client_addr) "
"else 'localhost'::text "
"end "
"end as \"PGSLAVE\" "
", case when client_addr is not null "
"then host(client_addr) "
"else 'localhost'::text "
" end as \"PGSLAVEIP\" "
"from pg_stat_replication"
";";

# Replication Slave Discovery
dscrepslave = "select 'Slave' as \"PGSLV\" "
", 'repllag' as \"PGSLVLAG\" "
", 'replpaused' as \"PGSLVPAUSED\" "
"where pg_is_in_recovery() = TRUE; "
";";

# Force replication to stream when master is idle (EXPERIMENTAL!)
# (executed on master)
replpump = "select replication_pump();";

# Replication lag seconds (measured on slave)
repllag = "select case when pg_is_in_recovery() = TRUE "
"then case when now() > pg_last_xact_replay_timestamp() "
"then extract(epoch from now() - pg_last_xact_replay_timestamp())::int "
"else 0::int "
"end "
"else 0::int "
"end "
";";

# Replication lag bytes (measured on master)
repllagbytes = "select pg_xlog_location_diff(pg_current_xlog_location(),replay_location) "
"from pg_stat_replication "
"where ( ( $1::text != 'localhost' "
"and client_addr = $1::inet) "
"or ( $1::text = 'localhost' "
"and client_addr is null) ) "
";";

# Replication paused (measured on slave)
replpaused = "select case when pg_is_in_recovery() = TRUE "
"then case when pg_is_xlog_replay_paused() = TRUE "
"then 1::int "
"else 0::int "
"end "
"else 0::int "
"end"
";";

};

1 change: 1 addition & 0 deletions conf/libzbxpgsql.conf
Original file line number Diff line number Diff line change
Expand Up @@ -55,3 +55,4 @@ queries = {
testdbl = "SELECT $1::decimal;";
testdsc = "SELECT * FROM pg_database;";
};

29 changes: 29 additions & 0 deletions sql/replication_pump_func.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
SET ROLE postgres;

-- If there are slaves and this DB is NOT in recovery,
-- then issue a fake notify command to force the log
-- to stream. This will update pg_last_xact_replay_timestamp
-- on all slaves.

BEGIN;

CREATE OR REPLACE FUNCTION replication_pump()
RETURNS void
AS $$
DECLARE slavect int;
BEGIN
SELECT count(*) INTO slavect FROM pg_stat_replication;
IF slavect > 0 AND pg_is_in_recovery() = FALSE THEN
NOTIFY libzbxpgsql_fake_notify;
END IF;
END;
$$
LANGUAGE plpgsql
VOLATILE
;

REVOKE ALL ON FUNCTION replication_pump() FROM public;
GRANT EXECUTE ON FUNCTION replication_pump() TO postgres;

COMMIT;

Loading