Emergency Shutdown procedures for GPU servers

This will evolve into definitive docs for shutting down GPU servers in the event of aircon failures or other issues. At the moment it's just chunks of bash that should shut stuff down in the correct order if copy/pasted into a shell. Do not copy/paste this just to test it: if it works, bad things will happen; if it doesn't, equally bad things might happen.

Note that in an emergency it's probably better to push the power button than to do nothing.
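If you can't get to the machine room, the remote equivalent of the power button is an IPMI power-off via each server's BMC. A minimal sketch only, assuming the BMCs are reachable; the ${H}-bmc naming, user and password below are placeholders, not our real management setup:

#HYPOTHETICAL: bmc hostname convention and credentials are stand-ins
for H in landonia18 landonia19
do
    ipmitool -I lanplus -H ${H}-bmc -U admin -P changeme chassis power soft &
done

"chassis power soft" asks the host OS for an orderly ACPI shutdown; "chassis power off" cuts power immediately, which is the last resort.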

Panic Mode: Forum

Don't worry about cluster nodes here; just shut them down as normal machines.
ssh lcfg-master
cd /var/rfedata/lcfg/profiles
#kill all the gpu nodes; ostrom exports a filesystem so we want to leave that for now
#(the grep picks out the LCFG profiles that mention both gpu-devel and wire_s3, minus ostrom)
for H in `grep -l gpu-devel * | xargs grep -l wire_s3 | grep -v ostrom`
do
    ssh -x $H nsu -c \" /sbin/poweroff  \" &
done
for H in 18 19 20 21 22 23 24 25
do
    ssh -x landonia$H nsu -c \" /sbin/poweroff  \" &
done
for H in daisy1 daisy2
do
    ssh -x $H sudo /sbin/poweroff &
done
sleep 20
ssh ostrom nsu -c \" /sbin/poweroff  \" &
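A quick way to confirm the machines really have gone away is to ping them; a sketch, with an illustrative host list (any hosts from the loops above will do):

for H in landonia18 landonia19 daisy1 daisy2 ostrom
do
    ping -c1 -W1 $H >/dev/null 2>&1 && echo "$H still up"
done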

Panic Mode: Appleton Tower

The CDT cluster needs a clean shutdown: slurm, then the nodes, then the filesystem. The teaching/research cluster needs the nodes, then the scheduler, then the filesystem. Ideally we shut down glorious last, as it has the scheduler VMs on it.

CDT Cluster

#shutdown the nodes (there aren't 21 charles nodes, but never mind, you're panicking at this point)
for H in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21
do
    ssh -x james$H nsu -c \" /sbin/poweroff  \" &
    ssh -x charles$H nsu -c \" /sbin/poweroff  \" &
done
for H in ann mary apollo1 apollo2
do
    ssh -x $H nsu -c \" /sbin/poweroff  \" &
done
This will have shut down all the nodes. If you want the infrastructure to go as well, then:
#kill the scheduler
ssh cdtscheduler
nsu
#shutdown clustering software
scontrol shutdown
poweroff
#shutdown the filesystem
ssh malcolm01
nsu
gluster volume stop cdtcluster_home
poweroff
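Before powering off the remaining bricks it's worth checking the volume really has stopped; gluster volume stop asks for interactive confirmation, so a blind paste may not actually have done anything. A sketch, run as root on one of the surviving servers:

ssh malcolm02
nsu
gluster volume info cdtcluster_home | grep '^Status'
#expect "Status: Stopped" before continuing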
for H in malcolm02 malcolm03 malcolm04
do
ssh -x $H nsu -c \" /sbin/poweroff  \" &
done

Teaching/research/mlp cluster of many names

ssh atconsoles
#shutdown the nodes, damnii first
for H in 01 02 03 04 05 06 07 08 09 10 11 12
do
    ssh -x damnii$H nsu -c \" /sbin/poweroff  \" &
done
#then the letha nodes and meme
for H in letha01 letha02 letha03 letha04 letha05 letha06 meme
do
    ssh -x $H nsu -c \" /sbin/poweroff  \" &
done
#and the landonia nodes
for H in 01 02 03 04 05 06 07 08 09
do
    ssh -x landonia$H nsu -c \" /sbin/poweroff  \" &
done

Filesystem next, if we have to:

ssh reekie01
nsu
gluster volume stop teaching-home
gluster volume stop pgr-home
poweroff
for H in 02 03 04 05 06 07 08
do
ssh -x reekie$H nsu -c \" /sbin/poweroff  \" &
done
# we currently (at time of writing) have an archive filesystem, which might or might not exist at the point of panic
ssh 823nas
nsu
gluster volume stop teaching-archive
poweroff
for H in 824nas 825nas
do
ssh -x $H nsu -c \" /sbin/poweroff  \" &
done
#Now for the infrastructure
ssh atconsoles
for H in uhtred agneda glorious forthea taggart
do
ssh -x $H nsu -c \" /sbin/poweroff  \" &
done

Cluster nodes

Ideally you should use the slurm scontrol command to down cluster nodes, as this means everything comes back cleanly; however, if you are in a hurry it will not damage the cluster to just hit the power buttons. There is an
scontrol shutdown
command which should shut down all the cluster daemons, but given that currently only the CDT and ILCC clusters are on one site it probably doesn't make sense to use it. Also, it's likely that whilst we want to shut down the nodes in the event of an aircon failure, the scheduler and other nodes can probably stay up.
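For the record, taking nodes out of service with scontrol looks something like this (a sketch; the node range and reason are illustrative):

#DRAIN lets running jobs finish; DOWN takes the nodes out immediately
scontrol update NodeName=damnii[01-12] State=DRAIN Reason="aircon failure"
scontrol update NodeName=damnii[01-12] State=DOWN Reason="aircon failure"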

Powering it all back on

Much, much more complicated.
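Nothing definitive yet, but presumably it's roughly the shutdown in reverse: gluster servers first, then the scheduler and infrastructure, then the nodes. A sketch of the gluster and slurm ends of that, assuming the volume and node names used above:

#once the brick servers are booted, restart the volumes (as root)
gluster volume start teaching-home
gluster volume start cdtcluster_home
#once slurmctld is back, return downed nodes to service
scontrol update NodeName=damnii[01-12] State=RESUME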

-- IainRae - 04 Mar 2020
