---+ !!SL6 Upgrade for MPU Services

This is the final report for the SL6 Upgrade for MPU Services, which is <a href="https://devproj.inf.ed.ac.uk/show/203">devproj.inf project number 203</a>. The aim of the project was to upgrade all MPU-run services from SL5 to SL6. The project was planned and tracked at SL6ServerUpgradeList.

%TOC%

---++ Service By Service

---+++ Packages Master

We waited a long time before upgrading the packages master because the SL5 version used perl-AFS, which for a long time did not support 64 bit. When perl-AFS 2.6.3 became available we decided to use it on a 64 bit machine, but with !OpenAFS 1.4.14. In the longer term we will rewrite the refreshpkgs code to use the AFS::Command module, but the priority was to get away from SL5.

   * Before the upgrade we noted these tips:
      * Use AFS 1.4, not 1.6.
      * Use Toby's newer version of the Perl AFS module package. Make sure it's in the =world= bucket rather than in =devel=.
      * Test everything on an SL6 box first.
      * Use a copy of the keytab from _brendel_ rather than generating a new one, as a new one would automatically invalidate the one currently in use!
   * When we tested refreshpkgs on an SL6 VM we did this:
      * Touch an =rpmlist= to trigger =refreshpkgs=.
      * If that worked, try a package submission.
      * Check the contents of =rpmlist=.
      * Check =yum=. =yum clean all= should be run first to clear the local machine's cache.
      * There may be a missing dependency on =createrepo=.
      * Expect failures involving perl-AFS.
      * Create a new test bucket and check that it works.
   * We also found that the testing VM hadn't been given enough memory to handle large runs of =refreshpkgs=.
   * We found an issue with the ==createrepo --checkts== option used by =refreshpkgs=. We're running a locally fixed version until the fix appears upstream.
   * Soon after the upgrade was completed an issue appeared wherein there was a delay of up to 25 minutes between package submission and availability. This was caused by a software upgrade which happened shortly after _brendel_ was upgraded to SL6, so it had no chance of showing up in our testing beforehand. It was fixed by upgrading the !OpenAFS version on the package buckets' AFS server to 1.6.2-0.pre3.

---+++ Packages Slaves and Export

The packages export server provides [[http://exporthttp.pkgs.inf.ed.ac.uk/rpms][exporthttp.pkgs.inf.ed.ac.uk]] and [[rsync://rsync.pkgs.inf.ed.ac.uk][rsync.pkgs.inf.ed.ac.uk]], both of which are used for exporting packages to other LCFG users in the University. =mod_waklog= on SL6 was a significant requirement of this and a number of other services. When it came to the upgrade itself there were no real problems; the service was already in its current form and virtualised, so it just needed an OS upgrade.

---+++ Package Cache and PXE

The package cache servers were among the first MPU machines to be upgraded to SL6. Before the upgrade we identified a need to use =disk-layout.h= for SL6. Since that introduces a default size of 40GB for the root filesystem - far too much for many servers - we also introduced =small-server.h= which has a far smaller default size. At the same time we introduced an automatic swap size algorithm similar to that used upstream (sketched below). SL6 also introduced changes to squid, e.g. to netmasks.
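For illustration only, here is a minimal sketch of the kind of sizing rule the upstream SL6 installer recommends. This is not the actual LCFG header or component code, and the exact thresholds we use may differ:

<verbatim>
def swap_size_mb(ram_mb):
    """Suggest a swap size in MB, tapering off as RAM grows
    (roughly the RHEL/SL6 installer recommendation)."""
    gb = 1024
    if ram_mb <= 2 * gb:
        return 2 * ram_mb                 # small machines: twice RAM
    elif ram_mb <= 8 * gb:
        return ram_mb                     # medium machines: equal to RAM
    elif ram_mb <= 64 * gb:
        return max(4 * gb, ram_mb // 2)   # large machines: half RAM, at least 4GB
    else:
        return 4 * gb                     # very large machines: cap at 4GB

if __name__ == "__main__":
    for ram in (1024, 4096, 16384, 131072):
        print("%6d MB RAM -> %6d MB swap" % (ram, swap_size_mb(ram)))
</verbatim>

The point of the taper is to avoid dedicating tens of gigabytes of disk to swap on large-memory servers.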
---+++ Packages Mirrors AFS

This upgrade was held back for a long time whilst we resolved various issues with Nexsan storage arrays and multipath fibre-channel support on SL6. This was the first AFS fileserver to be upgraded to SL6 and run the !OpenAFS 1.6 release, and it served as a good test case for the upgrade of the main Informatics fileservers. There were concerns that the lack of a perl-AFS module would be an issue, but the LCFG openafs component still worked fine. The important thing to remember when upgrading this server is that the value of the =updaterpms.rpmpath= resource has to be altered to use _dr.pkgs_ instead of _cache.pkgs_, otherwise it doesn't work. This clearly demonstrated the usefulness of having the DR server available for more than just disaster situations.

---+++ LCFG Master

Since almost all aspects of our configuration are done using LCFG, upgrading the LCFG Master itself had to be done very carefully. The MPUpgradingMasterLCFGServerSL664 page lists details of the entire upgrade procedure, with some discussion of what went wrong and what could have been done better. We rehearsed this upgrade well in advance by building up a parallel LCFG Master on a VM and having it master a parallel LCFG system with a test LCFG slave of its own. This was a good way of showing up problems which might not otherwise have been spotted in advance of the real upgrade.

The LCFG Master is also the _ordershost_. This functionality was moved to an SL6 VM before the upgrade, then moved to the LCFG master after the upgrade. The manual steps for bootstrapping the ordershost are in =dice/options/ordershost.h= (at the bottom).

Problems encountered _before_ the upgrade included:
   * an issue involving web-svn.
   * a lot of LCFG components did not have sl6_64 versions of their defaults RPMs; it took some months to have the missing RPMs produced or to establish that components without them could be dropped.
   * a certificate error involving rfe servers.

The upgrade to SL6 was done on the then LCFG master _tobermory_. The service's move from _tobermory_ to _schiff_ was done at a later date.

---+++ LCFG Slaves

We upgraded our slaves _mousa_ and _trondra_ one at a time, looking for problems after the first of the upgrades. A problem with _rsync_ was discovered this way.

The move to the virtual machines _bol_ and _metsu_ came later in the year. It was prompted by the unacceptable time taken by _mousa_ and _trondra_ to do complete LCFG profile rebuilds following the upgrades. Moving them to KVM gave us slaves with faster CPUs and much faster disks, which prevented the extreme slowdown experienced with full profile rebuilds.

The inf level LCFG slave was first duplicated on an SL6 VM. This allowed us to compare SL6-generated and SL5-generated profiles (a sketch of that kind of comparison follows the Disaster Recovery Service section below). When the results were satisfactory the SL5 (virtual) server was decommissioned.

---+++ DIY DICE

DIY DICE was the first LCFG service to be hosted on SL6. Before we did the upgrade we tested the LCFG server code on SL6.

---+++ Disaster Recovery Service

The DR server _sauce_ was upgraded to SL6 after the other LCFG servers. We noted these points before the upgrade:
   * It's important to remember to preserve the current contents of the sauce disk space, as it would take a long time to regenerate.
   * Test a slave using sauce as a master server.
   * Test a client using sauce as a slave server.
   * Test a client using sauce as a package server.

After the upgrade these tests showed up a couple of problems, as noted at the [[MPunitMeeting20120529#Server_Upgrades][29 May MPU meeting]]. These were subsequently fixed.
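On the profile comparison mentioned under LCFG Slaves above: it amounts to diffing the trees of XML profiles generated by the SL5 and SL6 slaves. The sketch below illustrates the idea only; the paths are hypothetical and this is not the actual procedure we used.

<verbatim>
import difflib
import os

# Hypothetical locations of profiles generated by the old and new slaves.
OLD = "/tmp/profiles-sl5"
NEW = "/tmp/profiles-sl6"

def compare_trees(old_root, new_root):
    """Report profiles that differ, or that exist on only one side."""
    old_files = {f for f in os.listdir(old_root) if f.endswith(".xml")}
    new_files = {f for f in os.listdir(new_root) if f.endswith(".xml")}
    for name in sorted(old_files ^ new_files):
        print("only on one side: " + name)
    for name in sorted(old_files & new_files):
        with open(os.path.join(old_root, name)) as f:
            old = f.readlines()
        with open(os.path.join(new_root, name)) as f:
            new = f.readlines()
        diff = list(difflib.unified_diff(old, new, "sl5/" + name, "sl6/" + name))
        if diff:
            # Show just the start of each difference to keep the output readable.
            print("".join(diff[:20]))

if __name__ == "__main__":
    compare_trees(OLD, NEW)
</verbatim>

In practice some expected differences (for example generation timestamps) would probably need to be filtered out before the comparison is useful.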
---+++ !PkgForge master

This upgrade was fairly straightforward. Although !PkgForge has a large set of package dependencies, these had already been prepared when the client tools were upgraded to SL6 some time before. The only issue was a dependency on =mod_waklog=, which needed patching before it could be used on SL6 with !OpenAFS 1.6.

---+++ LCFG Web Services

This upgrade was held back for a long time whilst we resolved various issues with Nexsan storage arrays and multipath fibre-channel support on SL6. The original plan was to move this service from _budapest_ to _bakerloo_, but in the end we decided that hosting it on real hardware was not justified and instead we moved it to KVM (_polecat_). This was a useful process since it gave us a chance to learn how to configure extra disks for VMs. The locally-written software used to generate the main LCFG website all worked fine on SL6.

As part of the upgrade we decided to switch to the latest version of the TWiki software. This presented a few problems since it was a major version change (4 to 5). In particular we had problems with theming until Alastair worked out the magic to make version 5 use the same skin as version 4.

---++++ LCFG Bug Tracking

On SL5 all of the LCFG web services were hosted on the same machine. To simplify configuration we moved the LCFG Bugzilla service to a separate KVM for SL6. At the same time Bugzilla itself had to be upgraded from the outdated 3.0 series to the new 4.2 series. This was the latest stable release at the time of the upgrade and is likely to be maintained and receive security fixes for some time to come. The main points to note were how to get working RPMs for Bugzilla 4.2 and the steps involved in upgrading the Bugzilla setup from the old version to the new. Both of these points are covered on the ManagingBugzilla page.

---+++ !VMware hosting

Since the free version of !VMware we used ceased to be supported some time ago, there was no possibility of upgrading it to SL6 - indeed we had to maintain an outdated version of SL5 specially to support the !VMware service. Instead !VMware is being replaced by a KVM hosting service.

---+++ KVM Hosting

This service started on SL6 so didn't need an upgrade.

---+++ Storage Array Monitoring

The old VM _zig_ used for monitoring the IBM storage array previously had to be a self-managed machine because it required a specific version of mysql which conflicted with the one installed as part of DICE SL5. With the much smaller SL6 server installations, and the resulting reduced dependencies, it is now simple to get the monitoring software installed. The new VM _giz_ runs DICE SL6 so now benefits from standard DICE nagios monitoring.

---++ Summary of Machine Moves

The MPU decided to combine the SL6 upgrades with hardware upgrades and moves, and a number of the upgrades were done by preparing a replacement SL6 service on a separate machine then switching over to it, so many MPU services moved host during this project. Excluding the new KVM servers, our physical host count reduced by 3.

   $ Packages Master : This service stayed on _brendel_, a Dell R200 from 2009.
   $ Packages Slaves and Export : This moved from its SL5 KVM _cochin_ to a new SL6 KVM _porto_.
   $ Package Cache and PXE : The package cache service, and PXE service, moved from _split_ and _schiff_ to two new Dell R210 II machines, _hare_ and _wildcat_. _split_ was retired and _schiff_ redeployed as our new LCFG master server.
   $ Packages Mirrors AFS : The packages mirrors AFS service stayed on _telford_, a Dell 1950 from 2008, though the data itself is on the MPU SAN.
   $ LCFG Master : This moved to _schiff_ from our elderly Dell 860 _tobermory_, which was retired.
   $ LCFG Slaves : These moved from the elderly and slow Dell 860s _mousa_ and _trondra_ (now retired) to KVM guests _metsu_ and _bol_. The test slave moved from KVM _alboran_ (SL5) to _vole_ (SL6) and the inf level slave moved from _ashkenazy_ (SL5) to KVM _barents_ (SL6).
   $ DIY DICE : This service was already virtualised. It moved from KVM _madurai_ (SL5) to KVM _sernabatim_ (SL6).
   $ Disaster Recovery Service : The LCFG and Packages DR service stayed on _sauce_, an HP DL180 from 2010.
   $ !PackageForge master : This moved from !VMware _ardbeg_ (SL5) to KVM _pinemarten_ (SL6).
   $ LCFG Web Services : These moved from _budapest_, a Dell 1950 from 2007, to two KVM virtual machines. The Bugzilla service moved to _heda_ and the other services moved to _polecat_.
   $ !VMware hosting : This service could not be upgraded to SL6. It has been deprecated and at the time of writing has one guest VM left. As !VMware host servers were no longer needed they were redeployed as additional KVM host servers.
   $ KVM hosting : This service started on SL6 so needed no upgrading. The number of KVM host servers has grown thanks both to purchases of new hardware and to the redeployment of former !VMware host servers.
   $ Storage Array Monitoring : This was already virtualised; it moved from KVM domain _zig_ to _giz_ for the SL6 upgrade.

---++ Hours Taken

| *Period* | *Hours* |
| 2011 T3 | 27 |
| 2012 T1 | 168 |
| 2012 T2 | 82 |
| 2012 T3 | 125 |
| 2013 T1 | 14 |
| Total | %CALC{"$SUM( $ABOVE() )"}% |
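For reference, the Total row above works out at 27 + 168 + 82 + 125 + 14 = 416 hours.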