Structure
Replication is carried out from
infdb
(''master'') to
drinfdb
(''slave'').
This is ''asynchronous streaming replication without log shipping'': it ensures a consistent point-in-time copy of the master database. Replication can be made synchronous/asynchronous with a single command.
Full generalised details about
PostgreSQL replication are noted at
DICE postgresql replication but it is important that you understand the linked
PostgreSQL documents before making changes.
Common Operations
Check replication status
On the master, check replication status thus:
$ /afs/inf.ed.ac.uk/group/cos/utils/pg-activity r
or
postgres=> SELECT
application_name AS app, usename, state, sync_state
FROM pg_stat_replication;
You should see output along the lines of:
app | usename | state | sync_state
------+---------+-----------+------------
cake | repl | streaming | async
Adding / removing a slave
New asynchronous recovery servers can be brought up entirely automatically as the postgresql component will perform an automatic init from the ''basebackup'' it retrieves on first start.
'''On the master:'''
Add the following LCFG configuration:
INFDB_REPL_ACCESS(<node>,<ip>)
Increment this counter if required:
PG_OPTION(max_wal_senders,3) /* set to #slaves + 1 */
'''On the slave''':
Add the following LCFG configuration to make a server into "an infdb":
#undef INFORMATICS_DATABASE_SERVER
#define DICE_INFDB_STATEMENT_LOGGING
#include <dice/options/infdb-server.h>
Also add the following to make it a replication slave:
#define PSQL_MASTER_HOSTNAME infdb.inf.ed.ac.uk
#define PSQL_REPLICATION_INIT
#define PSQL_REPLICATION_STREAMING
#include <dice/options/postgresql-replication-slave.h>
You might also wish to add other niceties to stop the slave behaving like a full infdb server:
#define DICE_OPTIONS_INFDB_NO_BACKUP
!cron.objects mREMOVE(pgld) /* pgluser fails on ro servers */
!cron.additions mREMOVE(sync2) /* remove cron job for prometheus */
It is sometimes necessary to set some server configuration variables to match the master (e.g.
max_connections
). If this is the case, server startup will fail but recovery simply requires setting the appropriate
PG_OPTION
before restarting the component.
Repairing a slave
Recovery slaves (or existing servers repurposed to slave duties) can be repaired by the following procedure '''on the slave''':
$ om postgresql stop
$ nsu
# export `qxprof postgresql.pgdata` && mv "${pgdata}" "${pgdata}_old"
# exit
$ om postgresql start
Remove the
_old
database copy when it is safe to do so!
Enable / disable synchronous replication
It's worth remembering that asynchronous replication will not result in slave servers which are any less ''consistent'', nor will it in practice write its updates any later, than a synchronously replicated slave. The fundamental difference between the two modes is in the point at which the master's
COMMIT
completes. A master writes synchronously to the first available listed slave; if none of the listed synchronous slaves are available the master will '''hang indefinitely''' on writes.
'''On the master''':
Add or remove the hostname (note this is a voluntary string on the part of the slave: really it's the
application_name
but the postgresql component equates the two) in the master's profile:
PG_OPTION(synchronous_standby_names,'<node>')
In case of master failure / unavailability
This section ongoing: see #1345.
Temporary read-only switchover
In case of master failure, it is trivial to switch over to a ''read-only'' slave (not true failover): simply do one of:
- alter the destination server of clients (e.g. portal, UI) to
drinfdb
, or
- take the more drastic step of altering DNS for
infdb
to point to the new server
Failover / master switchover
There is no single procedure for r/w promotion, and no configuration for automatic failover (yet), partly because some thought will be required depending on the type and nature of the failure; context will guide the best course of recovery action. Read
PostgresRestoration before proceeding.
Manual Promotion
Manual promotion is achieved with just a few steps:
This server is now writable and 'live' from a data point of view. It cannot now be used as a slave again (unless reinitialised). However it's still considered to be a recovery server, and future configuration will result in errors from both the server and its LCFG component.
To '''abort the promotion''', you will need a new working master. Having created one, simply repair the slave (changing the
PSQL_MASTER_HOSTNAME
define if necessary, then following the
Repairing a Slave procedure above).
To '''make the promotion permanent''', and define the server as a new master:
- remove all of the following from the machine profile:
-
PSQL_MASTER_HOSTNAME
, PSQL_REPLICATION_*
#defines
-
postgresql-replication-slave.h
#include.
- any setting of
default_transaction_read_only
except '=off='
- restart postgresql
- Follow advice in PostgresInstallation for making this '''the''' infdb server.
- Add replication master configuration to this server (see above).
See also the general advice in the official
PostgreSQL failover documentation.
Having configured the new server as a master, it is a very good idea to commission new slave(s) immediately (see above), hardware performance permitting (and if it doesn't, think again about promoting at all...).
--
TimColles - 27 Nov 2018