Architecture

Each local package "bucket" lives in a dedicated AFS volume. For example, the binary packages for the LCFG layer of the SL7 platform live in the volume bin.el7.lcfg and the associated source packages in the volume src.el7.lcfg. The volumes all live in one partition (the code does not currently support distributing them across multiple partitions), with two read-only copies. Each volume is mounted, read-only, twice under the /afs/inf.ed.ac.uk/pkgs/rpms tree: once under the "layers" branch and once under the "os" branch. So the volume bin.el7.lcfg is mounted as /afs/inf.ed.ac.uk/pkgs/rpms/os/el7/x86_64/lcfg and as /afs/inf.ed.ac.uk/pkgs/rpms/layers/lcfg/el7. The volumes are also mounted, read-write, under /afs/.inf.ed.ac.uk/pkgs/rpms. The reason for the two "views" is that we want package submission to be organised by "os" and package access by "layer", which makes access control easier.
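
As a quick check of the two views, fs lsmount should report that both of these mount points refer to the bin.el7.lcfg volume (a sketch; adjust the paths for other buckets):

fs lsmount /afs/inf.ed.ac.uk/pkgs/rpms/os/el7/x86_64/lcfg
fs lsmount /afs/inf.ed.ac.uk/pkgs/rpms/layers/lcfg/el7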

Package submission can be done manually using "pkgsubmit" or automatically by Package Forge. AFS access rights control who can submit packages to the repository. Currently only members of the "system:administrators" group have write access; all other authenticated users have read access. A few system accounts also have write access: "pkgforge_builder" and "refreshpkgs".
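
To see who currently has access to a particular bucket, inspect its AFS ACL with fs listacl, for example (using the el7 lcfg bucket):

fs listacl /afs/.inf.ed.ac.uk/pkgs/rpms/os/el7/x86_64/lcfg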

One machine is nominated the "package master". The package master runs a daemon, "refreshpkgs", which periodically checks each local package bucket: once per minute for binary buckets, once per hour for source buckets. If any changes have been made to a bucket since the previous check, the daemon calls the "freshenrpms" script to update the updaterpms package list "rpmlist" and the "yum" metadata (using "createrepo"). The daemon then releases the associated AFS volume. The daemon is managed by the lcfg-refreshpkgs component, which includes a nagios passive monitoring script so that any failure in the "refreshpkgs" process is reported. The header <dice/options/pkg-master.h> contains all the configuration required for the "package master".
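
Conceptually, one refresh cycle for a binary bucket boils down to something like the following (a simplified sketch; the real daemon also handles locking, per-bucket timestamps and error reporting):

# update the rpmlist and yum metadata in the read-write volume
/usr/sbin/freshenrpms bin.el7.lcfg el7 lcfg rpms /afs/.inf.ed.ac.uk/pkgs
# push the changes out to the read-only copies
vos release bin.el7.lcfg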

Upstream software, e.g. SL and EPEL, lives in a package repository site mirror - see PackagesSiteMirror for further details.

A number of machines act as "package slaves". These serve the binary packages from the AFS filesystem to clients over HTTP. As only authenticated users can access the package buckets over AFS, each package slave must have an AFS admin uid. The Apache waklog module is used to authenticate the package slave to the AFS service. The header <dice/options/pkg-slave.h> contains all the configuration required for "package slaves". Note that a package slave must have a large (minimum 70GB) AFS or Squid cache so that it can serve the packages out of local disk.
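
You can check the size and current usage of a slave's AFS cache with fs getcacheparms (the figures are reported in 1K blocks):

fs getcacheparms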

The "package master" can be one of the "package slaves".

To improve performance, squid-based "package accelerators" are placed between the AFS-based "package slaves" and the end clients.

| Machine | Location | Type | Role | CNAME | Header |
| deneb | Forum | R330 | master + slave | http.pkgs.inf.ed.ac.uk | <dice/options/pkg-master.h>, <dice/options/pkg-slave.h> |
| jornets | KB | vm on amarela | slave (export) + rsync (export) | exporthttp.pkgs.inf.ed.ac.uk, rsync.pkgs.inf.ed.ac.uk | <dice/options/pkg-slave.h>, <dice/options/pkg-rsync.h> |
| maia | AT | R210-II | squid accelerator | cache.pkgs.inf.ed.ac.uk | <dice/options/rpmaccel.h> |
| regulus | Forum | R210-II | squid accelerator | cache.pkgs.inf.ed.ac.uk | <dice/options/rpmaccel.h> |
| salamanca | KB | R320 | DR (disaster recovery) pkg server | dr.pkgs.inf.ed.ac.uk | <dice/options/lcfg-dr-server.h> |

Machines use the round-robin cache.pkgs.inf.ed.ac.uk name to access the package repository. Normally there are at least two machines responding to this address. In extremis, should both of these machines be dead or inaccessible, this name can be configured to point directly to http.pkgs.inf.ed.ac.uk, which is deneb, the package master. Note, however, that this will give reduced performance and have a noticeable impact on machine install time. The following LCFG config will direct a machine to use the package master:

!updaterpms.rpmpath    mSUBST(cache.pkgs,http.pkgs)

Disaster Recovery

A DR server, salamanca, based at KB, takes a nightly snapshot of the packages for the current server platforms. This is so that, in a disaster situation, the package service is not reliant on AFS. The DR server exports its copy of the packages at the URL http://dr.pkgs.inf.ed.ac.uk. The following LCFG config will direct a machine to use the DR copy of the packages:

!updaterpms.rpmpath    mSUBST(cache.pkgs,dr.pkgs)

Tasks

Adding new package buckets

  • Edit the <live/pkg-master.h> header, adding an entry for each layer in the new platform to the refreshpkgs.buckets resource. See the man page for refreshpkgs (only installed on the package master server) for information on the fields. Remember to set the age field to 60 for SRPMs.
  • Check that /etc/buckets.conf on the package master server has entries for your new platform.
  • In an AFS admin pagsh (i.e. using asu) on the package master server, run /usr/sbin/updatepkgsvolumes. This should create the required AFS volumes and filesystem mount points and set permissions appropriately.
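
Once updatepkgsvolumes has run, it is worth sanity-checking that the volume and both mount points exist, for example (using a hypothetical el9 inf bucket, following the usual naming pattern):

vos examine bin.el9.inf
fs lsmount /afs/.inf.ed.ac.uk/pkgs/rpms/os/el9/x86_64/inf
fs lsmount /afs/.inf.ed.ac.uk/pkgs/rpms/layers/inf/el9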

Adding a new layer

  • TO BE COMPLETED
  • Until /usr/sbin/updatepkgsvolumes is modified to do this for you ....
  • Need to create a directory for the new layer in /afs/.inf.ed.ac.uk/pkgs/{rpms,srpms}/layers

Creating a new OS

  • mkdir /afs/.inf.ed.ac.uk/pkgs/rpms/os/{osname}
  • mkdir /afs/.inf.ed.ac.uk/pkgs/srpms/os/{osname}

Dropping an OS

  • For all buckets in /afs/.inf.ed.ac.uk/pkgs/rpms/os/{osname}: fs rmmount {osname}
  • For all buckets in /afs/.inf.ed.ac.uk/pkgs/rpms/layers/*/{osname}: fs rmmount {osname}
  • For all buckets in /afs/.inf.ed.ac.uk/pkgs/srpms/os/{osname}: fs rmmount {osname}
  • For all buckets in /afs/.inf.ed.ac.uk/pkgs/srpms/layers/*/{osname}: fs rmmount {osname}
  • For all {osname} volumes listed by vos listvol -server {readonlyserver} -partition {readonlyserverpartition}: vos remove
  • For all {osname} volumes listed by vos listvol -server {readwriteserver} -partition {readwriteserverpartition}: vos remove
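
As a sketch (assuming el6 is the OS being dropped, and that you are authenticated as an AFS administrator), the mount point removal for the binary tree can be scripted along the following lines; repeat for the srpms tree, adjusting the globs to match its layout, and then remove the volumes with vos remove using the relevant server and partition names:

# remove the bucket mount points under the binary "os" tree
for mp in /afs/.inf.ed.ac.uk/pkgs/rpms/os/el6/*/*; do
    fs rmmount "$mp"
done
# remove the per-OS mount points under the binary "layers" tree
for mp in /afs/.inf.ed.ac.uk/pkgs/rpms/layers/*/el6; do
    fs rmmount "$mp"
done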

Installing a new package slave (cache server)

  • For the sake of performance, use a physical machine.
  • Your machine must have sufficient disk space, on the partition containing /var/cache/afs, to cache the current distributions. This should be at least 70GB, preferably more. By default, /var/cache/afs lives on /dev/sda3, so use !fstab.size_sda3 mSET(free)
  • Give the new slave read access to the AFS package repository - see "Give a machine access to the AFS packages repository"
  • Add the <dice/options/pkg-slave.h> header, before or after install.
  • Install the machine

e.g.

#include <dice/options/pkg-slave.h>

/* Use all available space for /var/cache/afs */
!fstab.size_sda3        mSET(free)

Adding package slave (cache server) functionality to an existing machine

  • This is basically the same procedure as for installing a new package slave (see above).
  • After adding the <dice/options/pkg-slave.h> header, you will need to run updaterpms to install the apacheconf component, and then start the apacheconf component.
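
On a DICE machine this typically amounts to something like the following, once the profile containing the new header has arrived:

om updaterpms run
om apacheconf start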

Decommissioning a package slave (cache server)

Installing a new package accelerator (squid)

  • For the sake of performance, use a physical machine.
  • Your machine must have sufficient spare disk space, on the partition holding /var/spool/squid, to hold the squid cache. The default cache size is 70GB.
  • If you can't dedicate a partition for /var/spool/squid, you can symlink it to a directory on a partition with sufficient space.
  • Include the header <dice/options/rpmaccel.h>.
  • If you have more than 70GB spare space (on the partition holding /var/spool/squid), you can increase the cache size (in megabytes) by defining the RPMACCEL_CACHEDIRSIZE macro (see the example after this list).
  • Install the server with bonded ethernet.
  • Check that squid is working using, e.g. wget http://{machine-name}/rpms/layers/lcfg/sl6/rpmlist
  • Add an appropriate #verbatim entry for cache.pkgs using rfe dns/inf.
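
For example, to increase the squid cache mentioned in the cache-size step above to roughly 100GB, define the macro (value in megabytes; 102400 is just an illustrative figure) before including the header:

#define RPMACCEL_CACHEDIRSIZE 102400
#include <dice/options/rpmaccel.h>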

Adding package accelerator (squid) functionality to an existing machine

  • Your machine must have sufficient spare disk space, on the partition holding /var/spool/squid, to hold the squid cache. The default cache size is 70GB.
  • Include the header <dice/options/rpmaccel.h>.
  • If you can't dedicate a partition for /var/spool/squid, you can symlink it to a directory on a partition with sufficient space.
  • If you have more than 70GB spare space, you can increase the cache size (in megabytes) by defining the RPMACCEL_CACHEDIRSIZE macro.
  • Add bonded ethernet, if not already present.
  • Run the updaterpms component to install the rpmaccel component, then start the rpmaccel component (see below).
  • Check that squid is working using, e.g. wget http://{machine-name}/rpms/layers/lcfg/sl6/rpmlist
  • Add an appropriate #verbatim entry for cache.pkgs using rfe dns/inf.
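
As with the package slaves, installing and then starting the component is typically something like:

om updaterpms run
om rpmaccel start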

Moving the package master function to a different machine

  • Add the <dice/options/pkg-master.h> header to an existing machine, and run updaterpms to install the refreshpkgs component.
  • Copy the refreshpkgs.keytab file (see refreshpkgs.keytab resource) from bruegel to the new machine; if this isn't possible, see CreateNewRefreshPkgsKey on how to create a new package master key.
  • Start the refreshpkgs component.
  • To test, create an arbitrary file in one of the package buckets (R/W branch) and check in /var/lcfg/log/refreshpkgs that the refreshpkgs daemon notices the change and performs a volume release. Please remember to remove the test file.
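
A minimal test might look like this (the test file name is arbitrary; remember to remove it afterwards):

touch /afs/.inf.ed.ac.uk/pkgs/rpms/os/el7/x86_64/lcfg/refreshpkgs-test
tail -f /var/lcfg/log/refreshpkgs
rm /afs/.inf.ed.ac.uk/pkgs/rpms/os/el7/x86_64/lcfg/refreshpkgs-test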

Give a machine access to the AFS packages repository

If you want a machine to have access to the AFS packages repository, you should really consider using something like k5start or waklog to provide proper authenticated access. However, as an aid to transitioning existing services which access the packages repository, IP address ACLs can be used to allow access from specific machines.

  • Allocate an AFS admin UID from AFSAdminUids. Remember to record your allocation in the table. Best let the services unit know you're doing this.
  • Create a PTS user entry for your required IP address using pts createuser -name {IP address} -id {AFS admin UID}
  • Add the IP address to the AFS group pkgaccess using pts adduser {IP address} pkgaccess. Be aware that IP ACL propagation is significantly slower (several hours) than for user ACLs.
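
For example (the IP address and AFS admin UID below are purely illustrative; substitute the values you allocated):

pts createuser -name 192.168.0.10 -id 1234567
pts adduser 192.168.0.10 pkgaccess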

Removing a machine's access to the AFS packages repository

If you wish to remove access to the AFS packages repository, for example when decommissioning a package slave :-

  • Remove the machine's IP address from the AFS group pkgaccess using pts removeuser {IP address} pkgaccess. You can use pts membership pkgaccess to list the current members of that group.
  • Destroy the PTS user entry for the IP address using pts delete {IP address}
  • Free up the AFS admin UID from AFSAdminUids.

How do I forcibly refresh the yum repodata for a package bucket?

If you wish to forcibly refresh the yum repodata for a package bucket, use freshenrpms with the --fullrebuild option :-

/usr/sbin/freshenrpms --fullrebuild bin.el7.lcfg el7 lcfg rpms /afs/.inf.ed.ac.uk/pkgs

Monitoring and setting package bucket quotas

Each AFS package volume has a quota, effectively a maximum size which it can't exceed. Since the sizes of package volumes vary, so do the quotas. An hourly cron job on the package master compares the sizes of the package volumes with their quotas, and mails the MPU if it spots a problem.

The cron job runs a script called buckets. This warns if a volume's size is more than 80% of its quota, or if there is less than 5GB of free space left in a volume or in the host partition. It can also be run by hand on the packages master:

[spider]cc: ssh sites.pkgs.inf.ed.ac.uk
X11 forwarding request failed on channel 0
Last login: Wed Sep 20 13:29:15 2017 from spider.inf.ed.ac.uk
[deneb]cc: buckets
bin.sl6.lcfg          15%      1605904     10485760 
src.sl6.lcfg           2%       160403     10485760 
bin.sl6.world         56%     11692119     20971520 
src.sl6.world         49%     10282414     20971520 
bin.sl6.devel         32%      5050966     15728640 
    etc.

There are a few options:
[deneb]cc: buckets --help

buckets - how full are the package volumes relative to their quotas?

Options:
    --help, --usage  Print this text.
    --partition      Also examine how full the host partition is.
    --quiet          No output; a non-zero return value indicates that 
                     a volume or partition is close to being full.
    --units          'k' or 'K', use kilobytes (the default); 
                     'm' or 'M', megabytes; 
                     'g' or 'G', gigabytes.
e.g.
      buckets --partition --units=g

Note: it only works on the master packages server.

The quotas for the current platform are set to make the volumes less than 50% full. For old platforms we allow the volumes to fill beyond 50% of their quota. We have a minimum volume quota of 20GB, or 10GB on old platforms.

To set the quota for a package bucket use fs setquota like this:

fs setquota -path /afs/.inf.ed.ac.uk/pkgs/rpms/os/el7/x86_64/uoe -max 30G
fs setquota -path /afs/.inf.ed.ac.uk/pkgs/rpms/os/el7/x86_64/inf -max 80G
fs setquota -path /afs/.inf.ed.ac.uk/pkgs/rpms/os/el7/x86_64/lcfg -max 20G
See man fs_setquota for details.
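
You can check the current quota and usage for a bucket with fs listquota, e.g.:

fs listquota /afs/.inf.ed.ac.uk/pkgs/rpms/os/el7/x86_64/lcfg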

Questions

What does "refreshpkgs: *FAIL* : flock failed : failed to lock ourselves - must already be another running copy" in /var/lcfg/log/refreshpkgs mean ?

The "refreshpkgs" daemon runs every minute to check for changes to the repository. If there have been a lot of changes since the last check, the AFS volume release can take more than a minute to complete. When this happens, the next call of refreshpkgs will result in the above error message. This can also happen during the night when the AFS servers are busy doing backups. You can safely ignore this message if it's reported for a small number of consecutive minutes. If it continues for more than 10 mins, the first thing to check is whether there are any underlying AFS issues.

How do I purge files from the rpmaccel caches?

To remove a file you just need to do something like this as root:

squidclient -p 80 -m PURGE /afs/inf.ed.ac.uk/pkgs/rpms/layers/inf/el7/VirtualBox-4.3-4.3.20_96996_el7-1.x86_64.rpm

Note that the -p flag must be used to set the correct port number. If it worked, you will get a 200 response:

HTTP/1.0 200 OK
Server: squid/2.6.STABLE17
Date: Tue, 18 Oct 2011 13:23:28 GMT
Content-Length: 0

If there is no match in the cache you will get a 404 response:

HTTP/1.0 404 Not Found
Server: squid/2.6.STABLE17
Date: Tue, 18 Oct 2011 13:23:36 GMT
Content-Length: 0

Freshenrpms appears to be taking ages and reporting "No free memory"

This appears to be caused by a corrupt createrepo database. The simplest solution is to remove the top-level "repodata" directory of the affected bucket and then force a rebuild (see "How do I forcibly refresh the yum repodata for a package bucket?" above).
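
Concretely, for the el7 lcfg bucket this would be something along the following lines (adjust the paths and bucket name for the affected bucket):

rm -rf /afs/.inf.ed.ac.uk/pkgs/rpms/os/el7/x86_64/lcfg/repodata
/usr/sbin/freshenrpms --fullrebuild bin.el7.lcfg el7 lcfg rpms /afs/.inf.ed.ac.uk/pkgs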

Why do recently submitted packages occasionally not appear for use on end machines?

There is a potential race condition: if a package is submitted while the "freshenrpms" script is running, it will not trigger a volume release. This is because "refreshpkgs" uses the AFS volume "last modified" value to decide whether or not there have been updates to a bucket. After running the "freshenrpms" script it records a per-bucket timestamp of when it last checked the associated AFS volume. The timestamp must be taken after the "freshenrpms" script has run, otherwise the "freshenrpms" script would itself trigger "refreshpkgs" into refreshing that bucket. This is normally not a significant problem as "freshenrpms" usually takes only a few seconds to run, but if it takes longer (see the questions above), there is a risk that a new package will not be noticed. See below for how to force a run of "freshenrpms" for a particular bucket.

How do I trigger a "freshenrpms" for a specific bucket?

Change directory to the writeable version of the bucket (note the dot before the "inf") then touch any file there - e.g.

cd /afs/.inf.ed.ac.uk/pkgs/rpms/os/el7/x86_64/lcfg/
touch rpmlist

Then wait a minute or two for "refreshpkgs" to notice that something has changed, and process the bucket.

-- AlastairScobie - 20 Jul 2009 (updated 19 Dec 2017)

-- StephenQuinney - 24 Jan 2019 (reviewed)
