RT Guy

I thought we had this written down somewhere, but can't find it!

Whoever is RT Guy should carry out the following routine tasks:

  • Check and respond to Services Unit RT tickets, even if just to say "I'll pass this to ...".
  • Respond to services-unit email.
  • Check the mirror machine logs at http://groups.inf.ed.ac.uk/cos/mirrors/ (access currently restricted to inf.ed.ac.uk). Note that the page takes a wee while to generate.
    • Warnings of "tape status wrong" can mean that the TiBS backup of the mirror is more than 48 hours old.
  • If any new mirrors have appeared in the status page, configure the appropriate mirror server so that the mirrors actually take place.
  • Check SAN machines (Nexsan and Evolution), see ATABoxesInfo.
    Recent events:
    Device   | Event                                    | Noticed  | Action?
    sataboy1 | Intermittent PSU blowers 0 & 1 failure   | 2017     | none - due to be decommissioned
    kbe3     | disk failure, degraded (2 disks removed) | May 2018 | shutdown, powered off (and all disks wiped)

  • Mail things
    • Check the postmaster mailbox, imap:Other Users/infsys/postmaster, and deal with what's there (i.e. delete non-relevant stuff). Don't worry about things going back to 2012.
    • mail.inf and virtualrelay.inf now both send a daily status email showing the number of messages in the various queues (regular, lost, quarantine). Large amounts of mail in the regular queue should be investigated. Messages in the lost queue (mailq -qL) should also be looked at and dealt with; /disk/mailraid/scripts/(processlostmail|movelostmail) may help.
      • mail queue blockages can be reduced by moving messages to a holding queue (quarantined) with /usr/sbin/sendmail -qS"<>" -Q"info message here". Here -qS"<>" selects the set of messages to be quarantined by Sender substring; the null sender "<>" is the one used for error reporting, so selecting on it helps with detecting mail loops.
      • processlostmail will display each message in the lost queue, and prompt for deletion.
        (This type of mail is usually generated because of a spam filter update delay, and so can be deleted - but the occasional false positive does creep through, which is why a quick visual check is a good idea.)
      • movelostmail just moves the lost queue to a named holding location. (This is only really used in disaster recovery.)
    • Remember that mailman list reminders are sent out on the first of each month, which may generate a significant number of failure messages (for old list members who no longer have local accounts). Most of these will, eventually, be automagically disabled by mailman.
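
Since quarantining the wrong set of messages is easy to do in a hurry, here is a minimal sketch of a dry-run helper for the sendmail quarantine command described above. The function name quarantine_cmd and the example reason text are made up for illustration; the helper only prints the command for review and never touches the queue.

```shell
#!/bin/sh
# Sketch only: quarantine_cmd is a hypothetical helper, not an existing script.
# It prints the quarantine command so it can be eyeballed before being run
# by hand on the mail server. Nothing is executed against the queue here.
quarantine_cmd() {
    sender=$1   # sender substring to select on, e.g. "<>" for the null sender
    reason=$2   # free-text reason recorded with the quarantined messages
    printf '/usr/sbin/sendmail -qS"%s" -Q"%s"\n' "$sender" "$reason"
}

# Example: build the command for a suspected mail loop.
quarantine_cmd '<>' 'suspected mail loop'
# prints: /usr/sbin/sendmail -qS"<>" -Q"suspected mail loop"
```

The helper assumes the reason text contains no double quotes; if it does, the printed command will need fixing up by hand.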

  • TiBS backups: there'll be email reports - check for warnings, errors, and failures:
    • WARNING: "...orphaned objects in /tibscache..." just refers to transient files - can be ignored
    • WARNING: "...No extended meta data objects found..." may refer to a request for additional info that's not available for some reason - can be ignored
    • If any of the /tibscacheN partitions fill up, and there are no current write-to-tape processes that may reduce this, then feel free to clear space (see ServicesUnitFullTibsCache).
  • For DRIVE_IDENT errors, see TiBS page.
  • For failed client mirrors (not fully configured):
    If mirroring has been requested in the client profile, but a server has not been configured:
    • choose mirror server at the non-local site (make sure there's enough disk space)
    • add MIRROR_INF_CLIENT(client host name, client data tag, mirror disk path) entry to server profile
      (MIRROR_CLIENT macros may exist for non-inf aliased clients)
  • For failed client mirrors (broken connection):
    If you see "FAILED: error in rsync protocol data stream" (usually the result of the client machine disappearing for whatever reason) and want to retry that particular mirror by hand, use
    om rmirror run <tag>
    where <tag> is the rmirror tag from rmirror.disklist on the rmirror server.
  • For failed client mirrors (deletions over safety limit):
    If too much data has been removed (> 70%), automatic mirroring will fail.
    If data is backed up to tape (use /usr/tibs/bin/hostaudit -l -n <hostname> on pergamon to check mirror partition), then a run can be forced (data can be restored if this was incorrect). If data is not backed up to tape, then find someone to confirm data deletion is OK before forcing a run.
    Once confirmed, run:
    om rmirror run -- -f <data-name>
    on mirror-host (the machine that mirrors the data, not the client itself), where "<data-name>" is the name of the data location to be mirrored (as listed by qxprof rmirror.disklist on the same mirror-host).
  • Remember to mail the next person in the rota when your time is up.
  • Disk usage reports may well throw up full partitions. Some of these we know about - see the FullPartitionsList.
    • Logwatch may report that AFS /vice partitions are filling up. If these partitions are /vicepa, /vicepb, or /vicepc on any server, then they are likely to hold user home directories - any filling-up should be investigated as a matter of urgency. AFS partitions that are not in this range are likely to be group space, and not quite so urgent.
    • Be aware that over-subscribed quotas are a potential problem, and one that is not automatically monitored. See Neil's script.
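
Two of the rules in the list above are mechanical enough to sketch as shell helpers: which TiBS report warnings can be ignored, and which AFS /vice partitions are urgent. The function names are hypothetical, and the filter assumes the report has been saved as plain text with one message per line, which may not match the real email layout.

```shell
#!/bin/sh
# Sketch only: both function names are made up for illustration.

# Read a TiBS report on stdin and drop the two warnings that the list
# above says can be ignored; every other line is passed through.
# Assumption: each warning sits on a single line of the report.
filter_tibs_report() {
    grep -v -e 'orphaned objects in /tibscache' \
            -e 'No extended meta data objects found'
}

# Classify an AFS server partition using the rule above:
# /vicepa, /vicepb and /vicepc are likely to hold user home directories.
vice_urgency() {
    case "$1" in
        /vicepa|/vicepb|/vicepc) echo "urgent: likely user home directories" ;;
        /vicep*)                 echo "less urgent: likely group space" ;;
        *)                       echo "not an AFS /vice partition" ;;
    esac
}
```

Usage would be along the lines of `filter_tibs_report < report.txt` (with report.txt being a saved copy of the email) and `vice_urgency /vicepb`. Anything the filter lets through still needs a human look, and it does not replace reading the report.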

-- RogerBurroughes - 01 Dec 2015

Topic revision: r37 - 16 May 2018 - 10:04:29 - RogerBurroughes
 