Mail Services "Top 5"

The Basics

We have 2 mail servers (both VMs hosted in the AT server room) that do different jobs, but both will accept internal SMTP relay requests. Only these two machines have firewall holes to allow SMTP (port 25) traffic OUT to the world. However, inbound SMTP on port 25 is blocked to all machines, including the mail servers, except from EdLAN. This is so that incoming mail falls back to the secondary MX records (the IS mail relays) for anti-spam/virus checking, before being relayed to us.
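
The fallback works because of ordinary MX behaviour: a sending server tries the MX hosts for a domain in ascending preference order and moves on to the next one when a connection fails. As a rough sketch (the function name is made up, and so are the host names and preference values in the usage line, which are not our real records), this orders `dig +short MX` style output the way a sender would try it:

```shell
# Read "preference host" lines on stdin (the format printed by
# `dig +short MX <domain>`) and print the hosts in the order a sending
# server would try them: lowest preference value first.
mx_try_order() {
    sort -n -k1,1 | awk '{ print $2 }'
}
```

For example, `printf '20 relay.example.\n10 primary.example.\n' | mx_try_order` lists primary.example. first and relay.example. second. An outside sender that cannot reach the first host on port 25 ends up delivering to the second, which is the situation the firewall block creates on purpose.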

crunchie.inf
Aliases - mail.inf, lists.inf, postbox.inf. Our main mail server that deals with @inf addresses including the mailman lists. No longer provides any IMAP access.

beeknow.inf
Aliases - virtualrelay.inf, smtp.inf. Deals with the other mail domains we host, eg legacy ones such as @cogsci and research ones such as @lcfg.org. Also provides an externally accessible Authenticated SMTP service.

All the other DICE clients are configured via the mail component to send all mail to "mail.inf.ed.ac.uk" for delivery.

Remember that user mail is now handled by staffmail.ed and shortly Office365 (exseed.ed.ac.uk). All that happens on our side is that the /opt/sendmail/aliases-staffmail file on mail.inf maps each inf UUN to the @staffmail address to forward to (or to an alternative address for visitors and the like). Support know what to do with this file.

If things are not working

  • Check sendmail is running on the servers. Use 'om mail restart' to stop and start it. The log file is /var/lcfg/log/mail.log
  • If mailman lists aren't working, look for mailman processes on lists.inf, and use systemctl to stop and restart mailman, eg systemctl stop mailman. Apache is needed for the lists.inf web interface.

More detail

crunchie.inf (mail.inf, postbox.inf, lists.inf)

Hardware - none, it's a VM

Currently a KVM guest hosted on banjo.

Most important bits of config/data are symlinks to somewhere under /disk/mailraid/, a hangover from when the mail server was a physical machine with a RAID partition for the important stuff.

Software

Runs:

  • sendmail - mail component is configured to keep the sendmail config in /opt/sendmail (symlink to mailraid)
  • procmailrc - /etc/procmailrc (symlink)
  • mailman - no component, uses systemd. Most data/config is in /opt/mailman (symlink)
  • apache - apacheconf component configured for /opt/apache (symlink), now only mailman web interface (lists.inf) uses it. No more IMP/Horde.
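
After a reinstall or restore it is worth confirming that these paths really do resolve onto the restored data. A minimal sketch, assuming GNU `readlink -f` is available; the function name `check_under` is made up, and the idea that all four paths should end up under /disk/mailraid is an assumption based on the "(symlink)" notes above:

```shell
# Usage: check_under PREFIX PATH...
# Prints one line per PATH and returns 0 only if every PATH resolves
# (following symlinks) to somewhere under PREFIX.
check_under() {
    prefix=$(readlink -f "$1") || return 1
    shift
    rc=0
    for p in "$@"; do
        # Empty if the path cannot be resolved at all.
        target=$(readlink -f "$p" 2>/dev/null)
        case "$target" in
            "$prefix"/*) echo "ok   $p -> $target" ;;
            *)           echo "BAD  $p -> ${target:-unresolved}"; rc=1 ;;
        esac
    done
    return $rc
}
```

On crunchie this might be run as `check_under /disk/mailraid /opt/sendmail /etc/procmailrc /opt/mailman /opt/apache`. Any BAD line means a symlink is missing or points somewhere unexpected.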

Crons:

  • Various mailman crons as required by mailman
  • local mailman crons to synchronise DB-generated memberships into lists.
  • enumaddresses generates a list of all valid @inf addresses and leaves the result in an rsync target for the EUCS mail relays to fetch so they can do their anti-spam stuff

Disaster recovery

Assuming complete destruction of hardware/service, you have 5 days to replace the service before mail sent to users is returned to the sender with a permanent failure. If you don't think you'll have the service back within that time, speak to IS postie (postie@ed) and explain your problem; they may be able to tweak things so mail is not bounced. They'll probably be in touch anyway when they spot the mail backing up on their servers. They would normally contact "postie@inf", which is an alias for the services-unit list.

If the usual members of the services-unit are all incapacitated, then you probably want to get yourself onto the services-unit list, or somehow receive mail sent to that address.

To restore the service you basically need to replicate the current setup, which shouldn't be much more than taking the relevant bits out of crunchie's profile.

Important: While you are reinstalling, you'll want to alter the #ifdef NOTDEF_quiescent line so that the enclosed resources are modified to stop various things running. Otherwise you might start accepting mail before the machine is properly configured to deal with it. Mailman also needs to be completely restored/configured before it is restarted.

Once the replacement machine is installed, you'll need to restore the data, namely the contents of /disk/mailraid/ from mirrors (at the time of writing stanier:rmirror33/crunchie-mail) or backups (which are taken from the mirror). Note that /var/spool/mqueue/ may have contained some ephemeral mail that was passing through the mailserver when it died. Unless there was no such mail, or you can somehow recover that directory, that mail will have been lost, most likely without a record of who the sender or recipient was.

Talking of mirrors, you may want to make sure that your last good mirror doesn't get zapped by a partial recovery, so either take an extra copy or disable the mirrors temporarily.

Note that /disk/mailraid/home/ has been mostly cleared of home directories (mail) following the move to staffmail. There are still some sysman dirs and system stuff though.

If you've changed IP address for the new machine, make sure you update the various DNS entries: mail, postbox, lists, www.lists (and the .informatics versions).
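
To avoid missing one of those names, something like the following could be used to list them all and see what each currently resolves to. This is only a sketch: the function names are made up, it assumes the ".informatics versions" live under informatics.ed.ac.uk, and it assumes `getent` is available (as on a normal Linux install).

```shell
# Print every DNS name that should point at the mail server.
# ASSUMPTION: the ".informatics versions" are under informatics.ed.ac.uk.
mail_dns_names() {
    for domain in inf.ed.ac.uk informatics.ed.ac.uk; do
        for name in mail postbox lists www.lists; do
            echo "$name.$domain"
        done
    done
}

# Resolve each name with the system resolver and print "name address",
# or "name UNRESOLVED" when nothing comes back.
check_mail_dns() {
    mail_dns_names | while read -r fqdn; do
        addr=$(getent hosts "$fqdn" | awk '{ print $1; exit }')
        echo "$fqdn ${addr:-UNRESOLVED}"
    done
}
```

Run `check_mail_dns` and compare the addresses against the new machine's IP. Resolver caching means a recent DNS change may take a while to show up.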

You'll also need to tell postie about a new host name or IP so they can update their rsync to fetch our list of addresses from the new hardware.

Once you are all reinstalled and restored, check that sendmail (mail component) is configured correctly and dealing with mail as expected, eg:

[crunchie]root: sendmail -bv neilb sys-announce rt
"|/var/mailman/mail/mailman post sys-announce"... deliverable: mailer prog, user "|/var/mailman/mail/mailman post sys-announce"
rt... deliverable: mailer local, user rt
neilb@sNOSPAMtaffmail.ed.ac.uk... deliverable: mailer esmtp, host [smtp.staffmail.ed.ac.uk], user neilb@sNOSPAMtaffmail.ed.ac.uk

The "NOSPAM" in the addresses above is not part of the real output; it is there to stop address harvesting.
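
When checking more than a handful of addresses, it can be easier to boil the `sendmail -bv` output down to just the mailer chosen for each one. A small sketch (the function name `bv_mailers` is made up; it relies only on the "... deliverable: mailer NAME," layout shown in the output above):

```shell
# Read `sendmail -bv` output on stdin and print "MAILER ADDRESS" for
# every line reported as deliverable. Other lines are dropped.
bv_mailers() {
    sed -n 's/^\(.*\)\.\.\. deliverable: mailer \([^,]*\),.*$/\2 \1/p'
}
```

Used as `sendmail -bv neilb sys-announce rt | bv_mailers`, the example above would reduce to one line each for the prog, local and esmtp mailers. An address missing from the summary was not reported as deliverable and needs a closer look.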

You could also start apache by hand ("om apacheconf start") to see that lists.inf.ed.ac.uk looks OK and that the lists are there and configured, with members, archives etc.

If that looks good, then start mailman (systemctl start mailman) and sendmail (om mail start), then watch the log /var/lcfg/log/mail.log and try sending some mail to see if it gets delivered as expected. If not, stop sendmail (om mail stop) ASAP and investigate.
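
While watching the log, a quick tally of delivery outcomes can show at a glance whether things are going wrong. The sketch below ASSUMES that the log contains sendmail's usual syslog-style delivery lines with a "stat=" field (eg "stat=Sent (...)" or "stat=Deferred: ..."); if /var/lcfg/log/mail.log does not carry those lines, point it at whichever log does. The function name is made up.

```shell
# Usage: stat_summary LOGFILE...
# Count the delivery outcomes found in "stat=..." fields, most common
# first. Text after the first ":" or "(" is dropped so that, for
# example, every "Deferred: <reason>" line is counted as "Deferred".
stat_summary() {
    sed -n 's/.*stat=\([^:(]*\).*/\1/p' "$@" |
        sed 's/ *$//' |
        awk '{ count[$0]++ } END { for (s in count) print count[s], s }' |
        sort -rn
}
```

For example `stat_summary /var/lcfg/log/mail.log`. A growing count of anything other than Sent shortly after a restore is a good reason to stop sendmail and look more closely.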

Check that ~mailman/.ssh is a symlink to /opt/mailman/DOTssh. The file component should have taken care of that. This is so a mailman cron can fetch the updated DB mailing list memberships files from cvs.

beeknow.inf (smtp.inf, virtualrelay.inf)

Hardware - none, it's a VM

When it was a physical machine, /var/spool/mqueue/ was symlinked to /disk/data/mqueue/, which used to be a RAID1 device, because the queue can hold transient, unique data.

Now that it is virtual, this symlink still survives, but we are relying on the underlying resilience of the VM infrastructure to save us from disk failures.

Software

Runs:

  • sendmail - mail component is configured to have the sendmail config in /opt/mail (a symlink to /disk/data/mail)
  • cron - to generate a list of valid email addresses so that ...
  • rsync - the EUCS can rsync the list to do anti-spam things at their end

Though there are two header files for the two mail-related jobs that beeknow does, they are mainly for setting up x509 certs (in the case of auth-smtp), firewall holes etc. The real sendmail work is done in the hand-maintained sendmail config /opt/mail/virtrelay.inf.mc, from which sendmail.cf is generated.

Disaster Recovery

Again, assuming complete destruction of the hardware/service, you have 5 days to replace the virtualrelay.inf half of the service before mail starts being returned to sender with a delivery failure, eg mail to user@aiaiREMOVE_THIS.ed. If you don't think you'll have the service back within that time, speak to IS postie (postie@ed) and explain your problem; they may be able to tweak things so mail is not bounced. They'll probably be in touch anyway when they spot the mail backing up on their servers.

The lack of smtp.inf following a disaster will frustrate some people, but it isn't the end of the world. They could use smtp.staffmail instead.

If it is possible to retrieve data off the virtual disk, then you should do this to retrieve any transient /var/spool/mqueue/ contents. This can be done after your replacement service is up and running, but again within the 5 days.

To restore the service you basically need to replicate the current machine's setup on some new hardware, and then restore /disk/data/ (or its new equivalent location) from the mirrors (bulleid:rmirror30/beeknow-vmr) or backups (which are taken from the mirror).

IMPORTANT: Again, you should make sure that mail isn't running until you think it is fully configured as it was before, otherwise you may end up incorrectly rejecting mail. So while setting up the new machine, and before it starts answering as virtualrelay.inf, add:

!systemd.units          mREMOVE(lcfgmail)
!systemd.wanted_units_multiusertarget   mREMOVE(lcfg-mail.service)

to the machine's profile, and do an om mail stop if it is already running. Then, as you restore the configuration etc, you can test that mail will be diverted as expected before restarting mail (sendmail), eg as below. Note that, for anti-spam reasons, @ has been written as (AT).

root: sendmail -bv neilb
neilb... deliverable: mailer relay, host mail.inf.ed.ac.uk, user neilb(AT)mail.inf.ed.ac.uk

root: sendmail -bv neilb(AT)inf.ed.ac.uk
neilb(AT)inf.ed.ac.uk... deliverable: mailer esmtp, host [mail.inf.ed.ac.uk], user neilb(AT)inf.ed.ac.uk

root: sendmail -bv neilb(AT)aiai.ed.ac.uk
neilb(AT)aiai.ed.ac.uk... No such user
# You should check a deliberate non-valid address to make sure we're
# not accepting ALL mail for some reason.

root: sendmail -bv bugs(AT)lcfg.org
bugs(AT)lcfg.org... deliverable: mailer esmtp, host [mail.inf.ed.ac.uk], user mp-unit(AT)inf.ed.ac.uk
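
These checks can be turned into pass/fail tests so that a bad result is hard to overlook, in particular the "deliberately invalid address" case. A minimal sketch: the function name `bv_expect` is made up, it only looks at the output layout shown above, and it treats any verdict that is not "deliverable" as a rejection.

```shell
# Usage: sendmail -bv ADDRESS | bv_expect ADDRESS MAILER
#        sendmail -bv ADDRESS | bv_expect ADDRESS reject
# Returns 0 if the verdict for ADDRESS matches, 1 if it does not,
# and 2 if the output says nothing about ADDRESS.
bv_expect() {
    addr=$1
    want=$2
    # Keep only the first line that reports on this address.
    line=$(grep -F -- "$addr... " | head -n 1)
    if [ -z "$line" ]; then
        echo "no verdict for $addr" >&2
        return 2
    fi
    case "$want:$line" in
        reject:*"... deliverable:"*)
            echo "UNEXPECTEDLY ACCEPTED: $line" >&2
            return 1 ;;
        reject:*)
            return 0 ;;
    esac
    case "$line" in
        "$addr... deliverable: mailer $want,"*)
            return 0 ;;
        *)
            echo "MISMATCH (wanted $want): $line" >&2
            return 1 ;;
    esac
}
```

Following the examples above (and the same (AT) convention), `sendmail -bv 'bugs(AT)lcfg.org' | bv_expect 'bugs(AT)lcfg.org' esmtp` should succeed, and so should `sendmail -bv 'neilb(AT)aiai.ed.ac.uk' | bv_expect 'neilb(AT)aiai.ed.ac.uk' reject`. If the reject test fails, do not start mail.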

One other gotcha might be the SSL certs for smtp.inf: the x509 certificate server might need to be prodded to accept that your replacement machine is the new smtp.inf.

That's it.

Other docs

For existing docs on the service see:

  • the "Mail Related" items in ServicesUnitNuggets for more day-to-day stuff
  • the services unit docs. Note these may still talk about old things like IMP and Horde which we don't do any more, and old hardware (mandy) which used to host smtp.inf and the virtual relay - mostly for historical interest.

-- NeilBrown - 11 Mar 2019

Topic revision: r10 - 11 Mar 2019 - 12:39:13 - NeilBrown
 