Mail Services "Top 5"
The Basics
We have 2 mail servers (both VMs hosted in the AT server room) that do
different jobs, but both will accept internal SMTP relay
requests. Only these two machines have firewall holes to allow SMTP
(port 25) traffic OUT to the world. However, inbound SMTP on port 25
is blocked to all machines, including the mail servers, except from
EdLAN. This is so that incoming mail falls back to the secondary MX
records (the IS mail relays) for anti-spam/virus checking, before
then relaying to us.
- crunchie.inf
- Aliases - mail.inf, lists.inf, postbox.inf. Our main mail server that deals with @inf addresses including the mailman lists. No longer provides any IMAP access.
- beeknow.inf
- Aliases - virtualrelay.inf, smtp.inf. Deals with the other mail domains we host, eg the legacy ones @cogsci, and research ones eg @lcfg.org. Also provides an externally accessible Authenticated SMTP service.
All the other DICE clients are configured via the mail component
to send all mail to "mail.inf.ed.ac.uk" for delivery.
Remember that user mail is now handled by staffmail.ed and shortly
Office365 (exseed.ed.ac.uk). All that happens is that we have an
/opt/sendmail/aliases-staffmail file on mail.inf that defines the
@staffmail address each inf UUN forwards to (or an alternative for
visitors and the like). Support know what to do with this file.
If things are not working
- Check sendmail is running on the servers. Use 'om mail restart' to stop and start it; the log file is /var/lcfg/log/mail.log
- If mailman lists aren't working, look for mailman processes on lists.inf and use systemctl to stop and restart mailman, eg systemctl stop mailman. Apache is needed for the lists.inf web interface.
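The first checks above can be wrapped in a couple of small shell helpers. This is only a sketch: the function names are made up for illustration, and any process pattern other than "sendmail" (for example "mailmanctl" for Mailman) is an assumption that should be checked against what actually runs on the server.

```shell
# Sketch only: report whether any process matches a pattern.
# pgrep -f matches against the full command line, so pick a pattern
# that is specific enough (the patterns used are assumptions).
check_running() {
    local pattern="$1"
    if pgrep -f "$pattern" >/dev/null 2>&1; then
        echo "$pattern: running"
    else
        echo "$pattern: NOT running"
        return 1
    fi
}

# Show the last N lines (default 20) of the mail component log.
# The log path is the one given above; a different file can be
# passed as the second argument.
recent_mail_log() {
    local lines="${1:-20}"
    local log="${2:-/var/lcfg/log/mail.log}"
    tail -n "$lines" "$log"
}
```

For example, `check_running sendmail; recent_mail_log 50` gives a quick picture before deciding whether an 'om mail restart' is needed.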
More detail
crunchie.inf (mail.inf, postbox.inf, lists.inf)
Hardware - none, it's a VM
Currently a KVM host on banjo.
Most important bits of config/data are symlinks to somewhere under
/disk/mailraid/, which is a hangover from when the mail server was a
physical machine with a RAID partition for the important stuff.
Software
Runs:
- sendmail - mail component is configured to keep the sendmail config in /opt/sendmail (symlink to mailraid)
- procmailrc - /etc/procmailrc (symlink)
- mailman - no component, uses systemd. Most data/config is in /opt/mailman (symlink)
- apache - apacheconf component configured for /opt/apache (symlink), now only mailman web interface (lists.inf) uses it. No more IMP/Horde.
Crons:
- Various mailman crons as required by mailman
- local mailman crons to synchronise DB generated memberships into lists.
- enumaddresses generates a list of all valid @inf addresses and leaves the result in an rsync target for the EUCS mail relays to fetch so they can do their anti-spam stuff
Disaster recovery
Assuming complete destruction of hardware/service, then you have 5
days to replace the service before mail sent to users is returned to
the sender with a permanent failure. If you don't think you'll have
the service back within that time, then speak to IS postie (postie@ed)
and explain your problem; they may be able to tweak things so mail is
not bounced. They'll probably be in touch anyway when they spot the
mail backing up on their servers. They would normally contact "postie@inf", which is an alias for the services-unit list.
If the usual members of the services-unit are all incapacitated, then you probably want to get yourself onto the services-unit list, or somehow receive mail sent to that address.
To restore the service you basically need to replicate the current
setup, which shouldn't be much more than taking the relevant bits out
of crunchie's profile.
Important: While you are reinstalling, you'll want to alter the
#ifdef NOTDEF_quiescent line so that the enclosed resources are
modified to stop various things running. Otherwise you might start
accepting mail before it is properly configured to deal with
it. Mailman also needs to be completely restored/configured before it
is restarted.
Once the replacement machine is installed, you'll need to restore the
data, namely the contents of /disk/mailraid/ from mirrors (at the time
of writing stanier:rmirror33/crunchie-mail) or backups (which are taken from the
mirror). Note that /var/spool/mqueue/ may have contained some
ephemeral mail that was passing through the mailserver when it died;
unless there was no such mail, or you can somehow recover that
directory, then mail will have been lost, and most likely without a
record of who the sender or recipient were.
Talking of mirrors you may want to make sure that your last good
mirror doesn't get zapped by a partial recovery, so either take an
extra copy or disable the mirrors temporarily.
I should also say that /disk/mailraid/home/ has been mostly cleared of
home directories (mail) following the move to staffmail. There are
still some sysman dirs and system stuff there, though.
If you've changed IP address for the new machine, make sure you update
the various DNS entries: mail, postbox, lists, www.lists (and the
.informatics versions).
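A quick way to confirm that all of those names now point at the replacement machine is sketched below. The function names are hypothetical, and the two domain suffixes (inf.ed.ac.uk and informatics.ed.ac.uk) are assumptions based on the "inf" and ".informatics" forms mentioned above; adjust them if the real zones differ.

```shell
# Sketch only: print every mail-related name that should be checked.
# The two domain suffixes are assumptions, see the text above.
list_mail_names() {
    local host domain
    for domain in inf.ed.ac.uk informatics.ed.ac.uk; do
        for host in mail postbox lists www.lists; do
            echo "$host.$domain"
        done
    done
}

# Compare what each name resolves to against the new server's IP.
# Uses getent so the answer is the one the local resolver gives.
check_mail_dns() {
    local want="$1" name got rc=0
    for name in $(list_mail_names); do
        got=$(getent hosts "$name" | awk '{print $1; exit}')
        if [ "$got" = "$want" ]; then
            echo "OK   $name -> $got"
        else
            echo "FAIL $name -> ${got:-unresolved} (wanted $want)"
            rc=1
        fi
    done
    return $rc
}
```

Run it as `check_mail_dns <new IP address>`. Remember that cached DNS answers may make a recent change look wrong for a while.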
You'll also need to tell postie about a new host name or IP so they
can update their rsync to fetch our list of addresses from the new hardware.
Once you are all reinstalled and restored, then check that sendmail
(mail component) is configured correctly and dealing with mail as expected: eg
[crunchie]root: sendmail -bv neilb sys-announce rt
"|/var/mailman/mail/mailman post sys-announce"... deliverable: mailer prog, user "|/var/mailman/mail/mailman post sys-announce"
rt... deliverable: mailer local, user rt
neilb@sNOSPAMtaffmail.ed.ac.uk... deliverable: mailer esmtp, host [smtp.staffmail.ed.ac.uk], user neilb@sNOSPAMtaffmail.ed.ac.uk
("NOSPAM" has been inserted into the addresses above to stop harvesting; it is not part of the real output.)
You could also start apache by hand "om apacheconf start" to see that
lists.inf.ed.ac.uk looks OK and lists are there and configured/have
members/archives etc.
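If you want to repeat the sendmail -bv checks shown above after each configuration change, a small wrapper helps. This is a sketch only: expect_mailer is a made-up name, and it simply looks for the "deliverable: mailer NAME," text seen in the example output. sendmail -bv only verifies addresses, it does not send anything.

```shell
# Sketch only: check that "sendmail -bv" resolves an address to the
# expected mailer (prog, local, esmtp, relay ...).
expect_mailer() {
    local addr="$1" mailer="$2" out
    out=$(sendmail -bv "$addr" 2>&1)
    if printf '%s\n' "$out" | grep -q "deliverable: mailer $mailer,"; then
        echo "OK   $addr -> $mailer"
    else
        echo "FAIL $addr (wanted mailer $mailer)"
        printf '%s\n' "$out"
        return 1
    fi
}
```

Going by the example output above, `expect_mailer sys-announce prog`, `expect_mailer rt local` and `expect_mailer neilb esmtp` should all report OK on a correctly restored crunchie.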
If that looks good, then start mailman (systemctl start mailman)
and sendmail (om mail start), and then watch the log
/var/lcfg/log/mail.log and try sending some mail to see if it gets
delivered as expected. If not, stop sendmail (mail) ASAP, and
investigate.
Check that ~mailman/.ssh is a symlink to /opt/mailman/DOTssh. The
file component should have taken care of that. This is so a mailman
cron can fetch the updated DB mailing list membership files from cvs.
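The symlink check can be done by eye with ls -l, or with a helper like the one sketched here. check_symlink is a hypothetical name, and the comparison is a literal one against whatever readlink prints, so a relative link target would need the expected value adjusted.

```shell
# Sketch only: verify that a path is a symlink pointing at the
# expected target (literal string comparison, no path resolution).
check_symlink() {
    local link="$1" want="$2" got
    if [ ! -L "$link" ]; then
        echo "FAIL $link is not a symlink"
        return 1
    fi
    got=$(readlink "$link")
    if [ "$got" = "$want" ]; then
        echo "OK   $link -> $got"
    else
        echo "FAIL $link -> $got (wanted $want)"
        return 1
    fi
}
```

For the case above that would be `check_symlink ~mailman/.ssh /opt/mailman/DOTssh`.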
beeknow.inf (smtp.inf, virtualrelay.inf)
Hardware - none, it's a VM
When it was a physical machine, /var/spool/mqueue/ could hold
transient, unique data, so it is symlinked to /disk/data/mqueue/,
which used to be a RAID1 device.
Now that it is virtual, this symlink still survives, but we are
relying on the underlying resilience of the VM infrastructure to save
us from disk failures.
Software
Runs:
- sendmail - mail component is configured to have the sendmail config in /opt/mail (a symlink to /disk/data/mail)
- cron - to generate a list of valid email addresses so that ...
- rsync - the EUCS can rsync the list to do anti-spam things at their end
Though there are two header files for the two mail related jobs that
beeknow does, they are mainly for setting up x509 certs (in the case of
auth-smtp), firewall holes etc. The real sendmail work is done in the
hand-maintained /opt/mail/virtrelay.inf.mc, from which the sendmail.cf
is generated.
Disaster Recovery
Again, assuming complete destruction of the hardware/service, then you
have 5 days to replace the virtualrelay.inf half of the service,
before mail starts being returned to sender with a delivery failure,
eg mail to user@aiaiREMOVE_THIS.ed. If you don't think you'll have the
service back within that time, then speak to IS postie (postie@ed) to
explain your problem; they may be able to tweak things so mail is not
bounced. They'll probably be in touch anyway when they spot the mail
backing up on their servers.
The lack of smtp.inf following a disaster will frustrate some people,
but it isn't the end of the world. They could use smtp.staffmail
instead.
If it is possible to retrieve data off the virtual disk, then you
should do this to retrieve any transient /var/spool/mqueue/
contents. This can be done after your replacement service is up and
running, but again within the 5 days.
To restore the service you basically need to replicate the current
machine's setup on some new hardware, and then restore /disk/data/
(or its new equivalent location) from the mirrors
(bulleid:rmirror30/beeknow-vmr) or backups (which are taken from the mirror).
IMPORTANT Again you should make sure that mail isn't running
until you think it is fully configured as it was before, otherwise you
may end up incorrectly rejecting mail. So while setting up the new
machine, and before it starts answering as virtualrelay.inf, then add:
!systemd.units mREMOVE(lcfgmail)
!systemd.wanted_units_multiusertarget mREMOVE(lcfg-mail.service)
to the machine's profile, and do an om mail stop if it is already
running. Then, as you restore the configuration etc, you can test that
mail will be diverted as expected before restarting mail (sendmail), eg
as follows. Note that for anti-spam purposes @ has become (AT).
root: sendmail -bv neilb
neilb... deliverable: mailer relay, host mail.inf.ed.ac.uk, user neilb(AT)mail.inf.ed.ac.uk
root: sendmail -bv neilb(AT)inf.ed.ac.uk
neilb(AT)inf.ed.ac.uk... deliverable: mailer esmtp, host [mail.inf.ed.ac.uk], user neilb(AT)inf.ed.ac.uk
root: sendmail -bv neilb(AT)aiai.ed.ac.uk
neilb(AT)aiai.ed.ac.uk... No such user
# You should check a deliberate non-valid address to make sure we're
# not accepting ALL mail for some reason.
root: sendmail -bv bugs(AT)lcfg.org
bugs(AT)lcfg.org... deliverable: mailer esmtp, host [mail.inf.ed.ac.uk], user mp-unit(AT)inf.ed.ac.uk
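The deliberate bad-address test is the one most easily forgotten, so it may be worth scripting. The helper below is a sketch with a made-up name; it treats any output containing "... deliverable:" as the address being accepted, which matches the example output above but is an assumption about every possible sendmail response.

```shell
# Sketch only: confirm that an address which should NOT exist is
# refused by "sendmail -bv" (nothing is actually sent).
expect_rejected() {
    local addr="$1" out
    out=$(sendmail -bv "$addr" 2>&1)
    case "$out" in
        *"... deliverable:"*)
            echo "FAIL $addr is being accepted"
            printf '%s\n' "$out"
            return 1
            ;;
        *)
            echo "OK   $addr refused"
            ;;
    esac
}
```

For example `expect_rejected 'neilb(AT)aiai.ed.ac.uk'`, with (AT) replaced by a real @ when you run it.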
One other gotcha might be the SSL certs for smtp.inf: the x509
certificate server might need to be prodded to accept that your
replacement machine is the new smtp.inf.
That's it.
Other docs
For existing docs on the service see:
- the "Mail Related" items in ServicesUnitNuggets for more day-to-day stuff
- the services unit docs. Note these may still talk about old things like IMP and Horde which we don't do any more, and old hardware (mandy) which used to host smtp.inf and the virtual relay - mostly for historical interest.
--
NeilBrown - 11 Mar 2019