Hypatia PostgreSQL Replication


Replication is carried out from infdb (''master'') to drinfdb (''slave'').

This is ''asynchronous streaming replication without log shipping'': it ensures a consistent point-in-time copy of the master database. Replication can be made synchronous/asynchronous with a single command.

Full generalised details about PostgreSQL replication are noted at DICE postgresql replication but it is important that you understand the linked PostgreSQL documents before making changes.

Common Operations

Check replication status

On the master, check replication status thus:

$ /afs/inf.ed.ac.uk/group/cos/utils/pg-activity r


postgres=> SELECT
  application_name AS app, usename, state, sync_state
  FROM pg_stat_replication;

You should see output along the lines of:

 app  | usename |   state   | sync_state 
 cake | repl    | streaming | async

Adding / removing a slave

New asynchronous recovery servers can be brought up entirely automatically as the postgresql component will perform an automatic init from the ''basebackup'' it retrieves on first start.

'''On the master:''' Add the following LCFG configuration:

Increment this counter if required:
PG_OPTION(max_wal_senders,3) /* set to #slaves + 1 */

'''On the slave''': Add the following LCFG configuration to make a server into "an infdb":

#include <dice/options/infdb-server.h>

Also add the following to make it a replication slave:

#define PSQL_MASTER_HOSTNAME infdb.inf.ed.ac.uk
#include <dice/options/postgresql-replication-slave.h>

You might also wish to add other niceties to stop the slave behaving like a full infdb server:

!cron.objects           mREMOVE(pgld)    /* pgluser fails on ro servers */
!cron.additions         mREMOVE(sync2)   /* remove cron job for prometheus */

It is sometimes necessary to set some server configuration variables to match the master (e.g. max_connections). If this is the case, server startup will fail but recovery simply requires setting the appropriate PG_OPTION before restarting the component.

Repairing a slave

Recovery slaves (or existing servers repurposed to slave duties) can be repaired by the following procedure '''on the slave''':

 $ om postgresql stop
 $ nsu
 # export `qxprof postgresql.pgdata` && mv "${pgdata}" "${pgdata}_old"
 # exit
 $ om postgresql start

Remove the _old database copy when it is safe to do so!

Enable / disable synchronous replication

It's worth remembering that asynchronous replication will not result in slave servers which are any less ''consistent'', nor will it in practice write its updates any later, than a synchronously replicated slave. The fundamental difference between the two modes is in the point at which the master's COMMIT completes. A master writes synchronously to the first available listed slave; if none of the listed synchronous slaves are available the master will '''hang indefinitely''' on writes.

'''On the master''': Add or remove the hostname (note this is a voluntary string on the part of the slave: really it's the application_name but the postgresql component equates the two) in the master's profile:


In case of master failure / unavailability

This section ongoing: see #1345.

Temporary read-only switchover

In case of master failure, it is trivial to switch over to a ''read-only'' slave (not true failover): simply do one of:

  • alter the destination server of clients (e.g. portal, UI) to drinfdb, or
  • take the more drastic step of altering DNS for infdb to point to the new server

Failover / master switchover

There is no single procedure for r/w promotion, and no configuration for automatic failover (yet), partly because some thought will be required depending on the type and nature of the failure; context will guide the best course of recovery action. Read PostgresRestoration before proceeding.

Manual Promotion

Manual promotion is achieved with just a few steps:

  • Determine values of pgowner, pgctlbin and pgdata from the postgresql component.
  • Run the promote command
    nsu <pgowner> -c '<pgctlbin> -D <pgdata> promote'
    (a preconfigured promote component method is planned.)
  • Remove any setting of default_transaction_read_only except '=off=' from the machine profile.

This server is now writable and 'live' from a data point of view. It cannot now be used as a slave again (unless reinitialised). However it's still considered to be a recovery server, and future configuration will result in errors from both the server and its LCFG component.

To '''abort the promotion''', you will need a new working master. Having created one, simply repair the slave (changing the PSQL_MASTER_HOSTNAME define if necessary, then following the Repairing a Slave procedure above).

To '''make the promotion permanent''', and define the server as a new master:

  • remove all of the following from the machine profile:
    • postgresql-replication-slave.h #include.
    • any setting of default_transaction_read_only except '=off='
  • restart postgresql
  • Follow advice in PostgresInstallation for making this '''the''' infdb server.
  • Add replication master configuration to this server (see above).

See also the general advice in the official PostgreSQL failover documentation.

Having configured the new server as a master, it is a very good idea to commission new slave(s) immediately (see above), hardware performance permitting (and if it doesn't, think again about promoting at all...).

-- TimColles - 27 Nov 2018

Topic revision: r1 - 27 Nov 2018 - 14:03:51 - TimColles
This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback
This Wiki uses Cookies