School Database Top Five(ish)

There is a lot of general documentation (including disaster recovery material such as server re-installation and upgrade procedures). Most of it is up to date.

For other information, refer to the LCFG header dice/options/infdb-server.h; the general service is managed entirely through this header. Everything else (conduits, genrep and the upload/sync scripts) is in the conduits sub-directory of the local daidb user account. The README in that directory has a reasonable amount of structural information.

User support know about a number of the more day-to-day operational processes, such as account creation, genrep account setup and database permissions control.

Database Server

The database server sometimes crashes. The usual fix is simply to restart it, after checking that it really is broken. To check, run nsu ingres and then sql infdb; you should get a prompt. If you instead get an error about not being able to connect, the engine is probably not running or is in a broken state. Restart it as yourself with om ingres restart. If that fails, try om ingres restart -- -f. If that also fails, kill any ingres-owned processes (using kill -9 only as a last resort) and then run om ingres restart again. After a successful restart you also need to restart the event daemon with om eventd restart.
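The restart escalation above could be wrapped up roughly as follows. This is only a sketch: the function names restart_ingres and restart_database are made up for illustration, the sql infdb check is still done by hand first, and killing ingres-owned processes is deliberately left as a manual step.

```shell
#!/bin/sh
# Sketch of the restart escalation; function names are hypothetical.

restart_ingres() {
    # Try a plain restart first, then the forced variant.
    om ingres restart && return 0
    om ingres restart -- -f && return 0
    echo "Both restarts failed: kill ingres-owned processes by hand, then retry." >&2
    return 1
}

restart_database() {
    # The event daemon is only restarted after the engine restart succeeds.
    restart_ingres && om eventd restart
}
```

Calling restart_database runs the steps in order and stops with a non-zero status if neither restart works.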

If sql infdb works but users still can't do things, check for a stuck lock. Run nsu ingres and then RunLock; this shouldn't return anything. If it does, you may have a hung session holding locks on tables, preventing others from updating or querying them (to the user it will look as if the interface or reporter has hung). You can attempt to kill the offending session, but in practice this rarely works, so just follow the database restart procedure above. Often in this scenario some users can carry on using the database quite happily (if they aren't accessing the locked tables), in which case it is courteous to warn users about the restart. You can use the db-announce mailing list for that.
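A minimal sketch of the lock check, to be run as the ingres user. It relies only on the behaviour described above (RunLock prints nothing when no locks are held); the function name check_locks is hypothetical, and the exact format of RunLock's output is not assumed.

```shell
#!/bin/sh
# Treat any output from RunLock as a possible stuck lock.
check_locks() {
    output=$(RunLock)
    if [ -n "$output" ]; then
        echo "Possible stuck lock:"
        echo "$output"
        return 1
    fi
    echo "No locks reported."
}
```

A non-zero return is a prompt to warn users on db-announce and then follow the restart procedure, not a signal to restart automatically.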

Otherwise, establish what is broken for the user: is it the user interface or the account management tools? In practice we very rarely have failures in either. For the former, axnet must be running on the database server; if it is not, try om axnet restart. For the latter, dbiproxy must be running on the database server; if it is not, try om dbiproxy restart.
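The two service checks follow the same pattern, which might be sketched as below. Note the assumption: the process name seen by pgrep is taken to be the same as the om component name (axnet, dbiproxy), which may not hold on the real server. The function name ensure_running is hypothetical.

```shell
#!/bin/sh
# $1 is used both as the (assumed) process name and as the om component name.
ensure_running() {
    if pgrep -x "$1" >/dev/null; then
        echo "$1 is running"
    else
        echo "$1 is not running, restarting"
        om "$1" restart
    fi
}
```

For example, ensure_running axnet for interface problems and ensure_running dbiproxy for account management problems.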

Web Page Generation Failures

These can be quite common. Errors are mailed to rat-unit. To fix one, look at the offending error at the end of the log file (also available in /home/daidb/conduits/history). Common problems are as follows.

  • CVS server down/broken - not your problem; it will sort itself out on the next run.
  • CVS protocol errors - you are unlikely to get these outside the session rollover period. They are caused by a bug in the CVS server, which cannot handle the number of changes being requested in a single commit. The fix is to manually commit the changes in the checked out CVS tree (conduits/publish/*) in smaller groups.
  • HTML validation error - usually a user has entered some data that hasn't been properly trapped and causes a validation error (a common one is multiple staff entries for the same person). In general, fix the data. Every generated page has the name of the conduit that generated it at the top, which you can use to help track down the problem. If that looks difficult, revert the CVS tree and stop the offending conduit from running (comment it out in conduits/EXECUTOR.sh).

Any fixes above should be done as the daidb user (nsu daidb). You will also need to set the CVS login with export CVS_PASSFILE=/home/daidb/conduits/.cvspass.
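For the CVS protocol error case, committing in smaller groups could be done along these lines. This is a sketch only: the function name commit_in_batches, the default batch size of 50 and the commit message are all made up, and file names containing spaces are not handled.

```shell
#!/bin/sh
# Read file names on stdin and commit them $1 at a time (default 50, an arbitrary choice).
commit_in_batches() {
    xargs -n "${1:-50}" cvs commit -m "Manual batched commit"
}
```

One way to feed it, run as daidb from inside the affected checkout with CVS_PASSFILE set, is cvs -q -n update | awk '$1 == "M" {print $2}' | commit_in_batches 50, which lists locally modified files without changing anything and passes them on in groups. Added or removed files (A and R in the update output) would need including too if there are any.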

Upload/Sync Failures

The sync processes are below.

  • PAVD - this is new and hasn't failed yet (see PAVDSync for more details)!
  • PGT - this is new and hasn't failed yet (see PGTSync for more details)!
  • SMS - the base load doesn't generally fail if nothing changes. It has failed in the past because the SMS mail was not sent at all, or was sent with incorrect data/columns. The contact for the SMS source data itself is Scott Larnach; if the data doesn't look right, contact him. The source data from each run is held in inputs/smsdata.txt. The incoming mail runs etc/smsfeed, which creates this file and a lock file, inputs/smsdata.lck. The crontab-triggered etc/smsload loads the file into the database and removes the lock file. If it doesn't look like the load is happening, check these files and check that there isn't a stray lock file. After the load, a crontab-triggered conduit, update_itodatabase, runs and syncs the database tables to the fresh SMS data. This can often go wrong, although in general problems only hit during the session rollover period.
  • IRS* - this loads Informatics reports and hasn't broken in a while. It works similarly to SMS, so follow it through from the procmail and the crontab for daidb.
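The stray lock file check for the SMS load might look something like this. It is a sketch under assumptions: the directory containing inputs/ has to be passed in explicitly, the function name check_sms_lock is hypothetical, and the 120 minute threshold for calling a lock stale is an arbitrary guess that should be set from the real crontab interval. It uses find -mmin, which is available in GNU and BSD find.

```shell
#!/bin/sh
# $1 = directory containing inputs/ (must be supplied)
# $2 = age in minutes after which a lock is considered stale (guessed default: 120)
check_sms_lock() {
    lock="$1/inputs/smsdata.lck"
    max_age_min=${2:-120}
    if [ ! -f "$lock" ]; then
        echo "No lock file"
        return 0
    fi
    # find prints the path only if the lock was last modified more than max_age_min minutes ago.
    if [ -n "$(find "$lock" -mmin +"$max_age_min")" ]; then
        echo "Stale lock file: $lock"
        return 1
    fi
    echo "Lock present, load probably pending: $lock"
}
```

A recent lock simply means smsfeed has delivered data that smsload has not yet picked up; only an old one suggests the load is stuck.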

-- TimColles - 25 Sep 2009

Topic revision: r1 - 25 Sep 2009 - 16:16:32 - TimColles
 