Pandemic Top Five for Hypatia (School Database)

There is a lot of general documentation (including disaster recovery stuff like server re-installation and upgrade procedures). This should all be up to date.

The best way to test the procedures is to fire up any or all of the above as parallel services; all of the services below can be duplicated and promoted to the live service with just a few changes.

Database Server

There is nothing special about the Hypatia PostgreSQL service; it is just a standard PostgreSQL service. We have yet to see any failure on it, so there is little to cover here. The only likely scenario is one where another server needs to be brought online with live data; for that case see the links above. A raw check of the database service itself (bypassing the TheonUI and/or Portal layer) can be done simply by using "psql" to connect to the appropriate host and the "infdb" database (psql -h infdb).
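The raw check above could be wrapped in a small script so that it can be run unattended. The following is only a sketch: the function names are made up for illustration, and it assumes that "psql" is on the PATH and that the host and database are both called "infdb". It uses the standard psql options -h (host), -d (database), -w (never prompt for a password) and -c (run one command).

```python
import subprocess


def build_psql_check(host="infdb", database="infdb"):
    """Return the psql command line for a trivial connectivity check.

    The default host and database names are assumptions; adjust as needed.
    """
    # -w stops psql from waiting on a password prompt when run unattended.
    return ["psql", "-h", host, "-d", database, "-w", "-c", "SELECT 1;"]


def database_answers(host="infdb", database="infdb", timeout=15):
    """Return True if psql connects and runs the query, False otherwise."""
    try:
        result = subprocess.run(
            build_psql_check(host, database),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # psql is not installed, or the server did not respond in time.
        return False
    # psql exits non-zero if the connection or the query fails.
    return result.returncode == 0
```

Calling database_answers() with no arguments checks the default host; passing another host name checks a parallel or replacement server in the same way.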

If this works then the service is probably fine. If not, check the status of the post* processes on the host, try an "om postgresql restart", and check the database logs. On infdb these are configured to go to postgresql.option_log_directory, at the time of writing /disk/backup/pglogs. The only thing that typically prevents a PostgreSQL server from starting is a configuration error, and the log should help find this in most cases.
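When the log directory holds many files, a short script can pull out the serious messages from the most recent ones. This is a sketch only: the function name is hypothetical, the default directory is the one quoted above and may have changed, and it assumes the logs are plain text using PostgreSQL's standard English severity labels (FATAL and PANIC).

```python
from pathlib import Path

# Severity labels as PostgreSQL writes them in its server log.
SEVERITIES = ("PANIC:", "FATAL:")


def recent_problems(log_dir="/disk/backup/pglogs", max_files=3,
                    severities=SEVERITIES):
    """Return (file name, line) pairs for serious messages in the newest logs."""
    directory = Path(log_dir)
    # Newest files first, judged by modification time.
    files = sorted(
        (p for p in directory.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    hits = []
    for path in files[:max_files]:
        # errors="replace" keeps the scan going over any odd bytes.
        with path.open(errors="replace") as handle:
            for line in handle:
                if any(tag in line for tag in severities):
                    hits.append((path.name, line.rstrip("\n")))
    return hits
```

A configuration error at start-up is normally reported as a FATAL line, so an empty result after a failed restart suggests looking elsewhere (for example at the post* processes or the system logs).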

There are important issues to consider prior to restoration, which are covered in the HypatiaPostgresRestoration document.

Portal Server

This is effectively just a standard Apache web service hosting static pages. The Portal aspect is automatic page re-generation, which is run from the "apache" user's crontab. So again there is not much to say here, as we have yet to see any failure on the web service itself. Sometimes page re-generation fails for various reasons. The best approach is to investigate the log files in "/disk/data/portal/logs" and try a re-run (the commands are in the crontab). Wholesale failures are unlikely unless a bad package update has been deployed or there is a database permission/connection problem; either will be obvious in the logs. Individual page generation failure is more likely to be down to broken data or a broken conduit update. In the latter case revert to an earlier package release. There is no master data on the Portal server so a reboot and/or re-install is harmless (though a reinstall without the /disk/data/portal cache will require a full conduit run before it is useful).

See dice/options/hypatia-portal-server.h (and included headers) for specific details.

In the absence of any infdb server the portal server will continue to serve its static data, but not generated reports. In the absence of the real infdb server it is possible to configure the portal services to use any other infdb host, including the read-only replicated slave: see any portal machine profile for details.

UI Server

This is effectively just a standard Apache web service with static web pages and using WSGI for the UI server. There is not much to say here as we have yet to see any failure on the service itself. There is no master data on the UI server so a reboot and/or re-install is harmless.

See dice/options/hypatia-theonui-server.h (and included headers) for specific details.

In the absence of the real infdb server it is possible to move UI services to use any other infdb host. However, the service should be made read-only unless a full replacement infdb with up-to-date data is online, since data reconciliation is significantly harder than recovery. See any UI server profile for details.

Incoming Server

The "incoming" server runs on the database (infdb) server and takes feed data and pushes it into the database. This all happens under the "postgres" user account via received email (postfix->procmail) or by crontab schedule. Incoming mail needs first to be handled by the daidb user account on the mail server, so a full trace will require investigation there, but this is rare.

While the service itself has yet to suffer any real failure, individual feed processes often fail for a variety of reasons. The best starting point is to look at the logs in /disk/data/incoming/logs.

If there are SQL errors then something in the feed data itself will be causing the problem. The pandemic-level fix is usually to isolate (manually remove) that data from the raw feed (the file(s) in holding) and re-run, as this requires the least knowledge to correct.
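The isolation step can also be scripted rather than done by hand in an editor. The sketch below makes several assumptions that may not hold for a given feed: that the feed file is line-oriented text, that the offending records can be recognised one line at a time, and that copies should be kept in a separate directory so that a re-run does not pick them up. The function and its parameters are hypothetical and not part of the Hypatia tooling.

```python
import shutil
from pathlib import Path


def isolate_lines(feed_path, aside_dir, is_bad):
    """Move lines matching is_bad out of a feed file; return how many moved.

    An untouched copy of the feed and the rejected lines are written to
    aside_dir, which is assumed to lie outside the holding area.
    """
    feed = Path(feed_path)
    aside = Path(aside_dir)
    aside.mkdir(parents=True, exist_ok=True)
    shutil.copy2(feed, aside / (feed.name + ".orig"))

    # newline="" and surrogateescape keep line endings and odd bytes intact.
    opts = dict(encoding="utf-8", errors="surrogateescape", newline="")
    with feed.open(**opts) as handle:
        lines = handle.readlines()

    good = [line for line in lines if not is_bad(line)]
    bad = [line for line in lines if is_bad(line)]

    with (aside / (feed.name + ".rejected")).open("w", **opts) as handle:
        handle.writelines(bad)
    with feed.open("w", **opts) as handle:
        handle.writelines(good)
    return len(bad)
```

For example, isolate_lines(path, aside, lambda line: "s1234567" in line) would set aside every record mentioning that (made-up) identifier, after which the feed can be re-run with the commands in the crontab.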

See dice/options/hypatia-incoming-server.h (and its live header) for specific details of each feed process.

In the absence of the real infdb server there is no reason to run these services.

Basecamp Server

The SVN/Trac server (aka svn.theon) is for internal development purposes. Recovery under a pandemic situation is not required but, like other services, a simple reinstall, restore-from-backup and possibly a restart is all that's required.

Incoming mail directed to the Trac server will be held at the relay until a server is operational on the appropriate address. Delivery uses a pipe command within the alias file.

Complete restoration is covered in BasecampInstallation.

-- TimColles - 27 Nov 2018

Topic revision: r2 - 27 Nov 2018 - 13:28:22 - TimColles
 