Renumbering the KB switches

Thoughts under construction...

At the moment (March 2014) all the KB switches are on 129.215.216/24. We would really prefer them to be on their own VLAN/subnet, most likely 192.168.95/24. This page sets out how we might go about it.
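
As a rough aid for keeping the subnets straight during the move, here is a minimal sketch using Python's standard ipaddress module. It expands the abbreviated prefixes used on this page and says which one an address falls in. The function names and the "old"/"new"/"forum" labels are ours for illustration only, and the sample addresses in the usage note are made up.

```python
import ipaddress

def expand_prefix(short):
    """Expand an abbreviated prefix such as '129.215.216/24' to a network object."""
    addr, _, length = short.partition("/")
    octets = addr.split(".")
    # Pad missing trailing octets with zeros: '192.168.95' -> '192.168.95.0'
    octets += ["0"] * (4 - len(octets))
    return ipaddress.ip_network(".".join(octets) + "/" + length)

# Prefixes exactly as written on this page
OLD_SUBNET = expand_prefix("129.215.216/24")   # where the KB switches are now
NEW_SUBNET = expand_prefix("192.168.95/24")    # proposed switch subnet
FORUM_SUBNET = expand_prefix("192.168.64/23")  # cs1's address range while in the Forum

def classify(address):
    """Return an illustrative label for the subnet an address belongs to."""
    ip = ipaddress.ip_address(address)
    if ip in NEW_SUBNET:
        return "new"
    if ip in OLD_SUBNET:
        return "old"
    if ip in FORUM_SUBNET:
        return "forum"
    return "other"
```

For example, classify("192.168.95.10") gives "new" and classify("129.215.216.1") gives "old"; both host addresses are hypothetical.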

One possible plan:

  1. Don't do anything with the "new" PoE switch cs2 at this stage. Ship it out to KB with cs1, and install it in the rack there at the same time, but don't try to patch it in, power it up or configure it until the rest of this renumbering process has been completed.
  2. Set up the new switch cs1 in advance in the Forum, but don't enable OSPF on vlan 216 yet. It should be safe to do everything else. The switch at this point will have a 192.168.64/23 address, but that's OK for now. In the .pc file be sure to use the wire-B address, so that we can still configure it regardless of what vlan 1 is doing. Note that we'll be trying to advertise 129.215.216/24 as a connected subnet, but that should be overridden by the KB switches. Note that all non-infrastructure ports will likely want to be set to "S" rather than "default", so that they'll be on the correct VLAN for later.
  3. Change the address for cs0 in its .pc file to its wire-B one. This will allow us to speak to it later.
  4. Set up a wire-B port on cs1 for use out at KB. (This will be port 45.) Also set one up on cs0. (Again, this will be port 45.)
  5. Turn cs1 off again, to minimise the potential for breakage!
  6. Take cs1 (and cs2) out to KB. RT#66127
  7. We're all done to here... Note that although cs1.ports isn't yet pushed automatically, changes can, and indeed should, be made to it to keep it fully up-to-date. Any missing updates will be pushed to the switch as part of this process.
  8. Plug the screen and keyboard into elder. Log in and nsu as root. Do this now so that the root shell is there and ready for when you need it.
  9. Remove mh1 from the rack. At this point all servers should fail over their bonding if they're not already using cs0.
  10. Install cs1 but don't turn it on. Patch everything in ready to go apart from the DA cable to cs0.
  11. With the laptop connected to the wire-B port on cs0, log in to each of wallace-b, slatkin-b and elder-b in turn. For each: om stop routing, then om start routing static 129.215.64.64. They should now still be able to reach the lcfg slaves, but won't be participating in any routing exchanges.
  12. At this point OSPF has been stopped on all the network servers, so external traffic will have failed over to go via one of the Forum external routers. Internal traffic will be routed as normal.
  13. ssh in to cs0 and turn off OSPF on vlan 1. 129.215.216/24 should now be advertised as a "connected" subnet rather than as an OSPF "link".
  14. Turn off cs0. There will now be a short break in service for all machines. Re-patch the EdLAN fibre to cs1. Turn on cs1. Once it has come up, all servers should pick up where they left off, only now using their other bonded interface and cs1 as their router.
  15. On the root shell on the screen/keyboard on elder, cd to /disk/home/elder/KBnet/ and make cs1.timestamp. This will have the effect of flushing out any final configuration changes to cs1.
  16. Re-patch the laptop to the spare wire-B port on cs1. Now would be a good time to check that connectivity is as expected! It should also be possible to ssh to the network servers, and in particular to rfe the switch configurations on elder. It shouldn't be necessary to change anything there yet though.
  17. ssh in to the wire-B address for cs1. Enable OSPF on vlan 216. Set the vlan 1 IP address to its correct (192.168.95/24) value.
  18. We now need to renumber cs0. Unpatch elder, slatkin and wallace from it so that their bonding remains with cs1. Patch in the DA cable and turn cs0 back on. All the server links will come up again now, of course, so we hope that bonding isn't set to prefer the cs0 links. For those which are, there will now be another short break in service!
  19. ssh in to the wire-B address for cs0. Remove the 129.215.216/24 address from vlan 1. Assign the new 192.168.95/24 address for vlan 1. Assign the 129.215.216/24 address to vlan 216. Turn on OSPF on vlan 216.
  20. ssh in to elder-b. rfe kbnet/Makefile to add cs1 and remove mh1 throughout, so that your next changes actually take effect! rfe kbnet/cs0.ports and change all "default" ports to use "S" instead. Be careful to change those ports which default to "default"! rfe kbnet/cs0.pc and remove the temporary rdisc-offset so that cs0 is the preferred router again.
  21. At this point renumbering of the switches should be complete, so it would be a good time to do some tests again.
  22. Re-patch the EdLAN fibre back from cs1 to cs0 and test again. (Given the 10Gbps link between the two switches, this would actually be less of an issue than before, but it's as well to be consistent with the documentation, and it does mean that "primary_reselect always" bonding works as expected.)
  23. Taking a deep breath, it's now time to renumber the network servers. First, edit live/include/live/netinf-routing.h and look for the ROUTING_FOR_KB #ifdef block. Enable the #ifdef notyet branch and remove the now-obsolete "not notyet" branch. Check in and wait for the changes to propagate.
  24. Enable the 192.168.95/24 addresses for elder, slatkin and wallace in dns/inf, and wait for the changes to propagate. Kick the servers as necessary.
  25. rfe the lcfg profile for elder. Change "s" to be vlan 216. Add a new definition for vlan 1. Close the edit and wait for the changes to propagate.
  26. Reboot elder. It should now come up with the correct IP addresses on vlans 1 and 216. OSPF should be running. ssh back in and check.
  27. Repeat the above for slatkin and wallace.
  28. Re-patch elder, slatkin and wallace to their cs0 ports.
  29. For both cs0 and cs1: tidy up any remaining configuration things. In particular, timeserver and traphost addresses will likely be wrong. Also set the addresses in the .pc files to use the 192.168.95/24 ones, and check the spanning-tree priorities. Remove the old mh1 trunk from cs0.
  30. Update live/include/live/netinf-KB.h so that we now poll cs1 instead of mh1 (and add cs2 while you're there). Make a corresponding change to the snmp/scripts/genIndex script on elder so that we index the new switches.
  31. Consider rebooting cs1 to fail back all the servers' bonds.
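
Several of the steps above depend on which bonded interface each server is currently using, and step 31 only matters if some bonds are still sitting on their cs1 links. One way to check might be to read the Linux bonding driver's status file on each server. The sketch below parses that text for the "Primary Slave", "Currently Active Slave" and "Slave Interface" lines. The bond name (bond0), the interface names (eth0, eth1) and the sample text are all assumptions for illustration, not taken from the real servers.

```python
# Hand-written sample in the style of /proc/net/bonding/bond0.
# It is illustrative only and was not captured from any real server.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth0 (primary_reselect always)
Currently Active Slave: eth1
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: up
"""

def parse_bond_status(text):
    """Pick out the primary, active and member interfaces from bonding status text."""
    info = {"primary": None, "active": None, "slaves": []}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Primary Slave" and value:
            # The value may carry a trailing "(primary_reselect ...)" note
            info["primary"] = value.split()[0]
        elif key == "Currently Active Slave":
            info["active"] = value
        elif key == "Slave Interface":
            info["slaves"].append(value)
    return info

def needs_failback(info):
    """True if a primary is configured and the bond is not currently using it."""
    return info["primary"] not in (None, "None") and info["active"] != info["primary"]
```

On a server this would be fed with the contents of /proc/net/bonding/bond0 (assuming that is the bond's name). With the sample above, needs_failback(parse_bond_status(SAMPLE)) is true, because the bond is on eth1 while eth0 is the primary.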

That should be it! It's now time to set up cs2. At least that one will be a clean-from-factory-defaults configuration (and if it's not already, use the usual two pointy objects to clear things down).
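
Before moving on, it may be worth repeating the connectivity tests called for in steps 16, 21 and 22. A minimal sketch of how they could be scripted from the laptop follows. It assumes that the short names below resolve from wherever it is run and that an ssh port (22) answering is a good enough sign of life; neither assumption comes from this page, and the function names are hypothetical.

```python
import socket

# Device names as used on this page; whether they resolve from the
# laptop (or need a "-b" wire-B suffix) is an assumption.
HOSTS = ["cs0", "cs1", "elder", "slatkin", "wallace"]

def tcp_probe(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers name-resolution failures, refusals and timeouts
        return False

def check_hosts(hosts, probe=tcp_probe):
    """Probe each host and return a {host: reachable} mapping."""
    return {host: probe(host) for host in hosts}

def unreachable(results):
    """Return the sorted list of hosts that failed the probe."""
    return sorted(host for host, ok in results.items() if not ok)
```

Calling unreachable(check_hosts(HOSTS)) would list anything that did not answer. The probe is passed in as a parameter so that a different check (a ping, say) can be swapped in without touching the rest.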

-- GeorgeRoss - 11 Mar 2014

Topic revision: r5 - 18 Mar 2014 - 12:16:31 - GeorgeRoss
 