Thoughts on enhancing the Simple KVM Service

We could enhance our virtualisation facilities in any number of ways, but before we get too ambitious it might be wise to first address the main shortcoming of the existing Simple KVM Service. This service has made it fairly easy and quick for computing staff to fire up a new DICE virtual machine as needed. To that extent it has been a big success. However, from the MPU point of view it needs improvement, because maintenance can be far more time-consuming than it needs to be.

It's a pain in the neck to free up a KVM server for maintenance work (reboots, updating firmware, etc.).

Each server runs a number of VMs, each of which can support multiple services used by various overlapping groups of users around the world. Time has to be taken to compile a careful list of every service running on every VM on the KVM server in question, together with the financial, political and technical importance of each; to assess what action would be appropriate for each (migration to another server, user announcement, temporary shutdown); and to plan and execute the migrations and issue the announcements, often several days beforehand.
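
The bookkeeping described above could be captured in a small script rather than compiled by hand each time. The following is only a sketch under stated assumptions: the 1 to 5 importance score, the thresholds, and all VM and service names are hypothetical, and in practice the choice of action would remain a human judgement.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    MIGRATE = "migrate to another server"
    ANNOUNCE = "announce downtime to users"
    SHUTDOWN = "temporary shutdown"


@dataclass
class Service:
    name: str
    importance: int  # hypothetical score: 1 (low) to 5 (critical)


@dataclass
class VM:
    name: str
    services: list[Service] = field(default_factory=list)


def plan_action(vm: VM) -> Action:
    """Choose an action based on the most important service on the VM."""
    top = max((s.importance for s in vm.services), default=0)
    if top >= 4:   # assumed threshold: too important to interrupt
        return Action.MIGRATE
    if top >= 2:   # assumed threshold: downtime tolerable with notice
        return Action.ANNOUNCE
    return Action.SHUTDOWN


def maintenance_plan(vms: list[VM]) -> dict[str, Action]:
    """Map each VM name on a server to its proposed action."""
    return {vm.name: plan_action(vm) for vm in vms}


# Hypothetical inventory for one KVM server.
server = [
    VM("vm-web", [Service("website", 5), Service("wiki", 2)]),
    VM("vm-test", [Service("scratch", 1)]),
    VM("vm-idle"),
]
for name, action in maintenance_plan(server).items():
    print(f"{name}: {action.value}")
```

A VM inherits the action of its most important service, which matches the point that a single VM can carry several services of differing importance.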

Migrating VMs from one KVM server to another can itself be fraught with problems.

  • Separate storage for each server. Each KVM server uses its own storage pool(s), so each VM's storage has to be copied from the origin storage pool to the destination storage pool. The copying process takes a long time (tens of minutes for each VM).
  • Uniquely named storage pools. Storage pools have different names, so the name of the pool used by the VM has to be changed or fudged somehow for the migration to work.
  • KVM/QEMU software problems. Software bugs and incompatibilities have led to failed migrations. (The version of software in use when a VM was last started is the one at issue so updates to VM software don't immediately help.)
  • Network differences. VMs use different network subnets at each site. This makes it difficult to preserve a VM's network connection when migrating from one site to another, effectively making cross-site migration impossible.
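
Two of the problems above, mismatched pool names and mismatched subnets, can be detected before a migration is attempted. The sketch below shows one way such a pre-flight check might look. It is not how the Simple KVM Service currently works: the `Host` and `Guest` records, the host and pool names, and the addresses (taken from the documentation ranges) are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    pools: set[str]  # names of the storage pools defined on this server
    subnet: str      # subnet available to guests, in CIDR notation


@dataclass
class Guest:
    name: str
    pool: str        # storage pool holding the guest's disk
    address: str     # the guest's IP address


def migration_obstacles(guest: Guest, dest: Host) -> list[str]:
    """List the reasons a migration of guest to dest would need extra work."""
    problems = []
    if guest.pool not in dest.pools:
        problems.append(
            f"pool '{guest.pool}' is not defined on {dest.name}; "
            "storage must be copied and the pool name changed"
        )
    if ipaddress.ip_address(guest.address) not in ipaddress.ip_network(dest.subnet):
        problems.append(
            f"{guest.address} is outside {dest.subnet}; "
            "the network connection would not survive"
        )
    return problems


# Hypothetical servers: kvm-b is at the same site as the guest, kvm-c is not.
kvm_b = Host("kvm-b", {"pool-b"}, "192.0.2.0/24")
kvm_c = Host("kvm-c", {"pool-c"}, "198.51.100.0/24")
vm = Guest("vm1", "pool-a", "192.0.2.10")

for dest in (kvm_b, kvm_c):
    print(dest.name, migration_obstacles(vm, dest))
```

With per-server pools every destination reports at least the pool problem. If both servers listed the same pool name, an intra-site migration would report no obstacles.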

Now that we have access to a variety of distributed parallel fault-tolerant file systems, it should be possible to share one single storage pool across several KVM servers. This would have advantages:

  • The sudden loss of a server would not result in data loss from the storage pool, as the data would be redundantly shared across all participating servers.
  • All of the pool's data would be available to all of the servers. Migration would therefore be dramatically quicker (taking seconds rather than tens of minutes) as it would only involve the copying of the running state of the VM from one server to another; the VM's storage would not have to move.

Migration problems would then be limited to the network issue and to bugs in the migration process itself. The migration bugs that have affected us in the past are believed to have been fixed. For the time being we could limit migration to intra-site only (as at present) to avoid VM network problems.
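
The difference between minutes and seconds comes down to how much data has to cross the network. A back-of-envelope estimate, using purely illustrative figures (a 40 GiB disk image, 4 GiB of guest memory, and assumed sustained throughputs of 50 MB/s for a disk copy and 100 MB/s for a memory copy), might look like this:

```python
def transfer_seconds(size_gib: float, throughput_mb_per_s: float) -> float:
    """Time to move size_gib GiB at a sustained rate of throughput_mb_per_s MB/s."""
    size_bytes = size_gib * 1024**3
    return size_bytes / (throughput_mb_per_s * 1_000_000)


# Assumed figures for illustration only; none are measurements.
disk_copy = transfer_seconds(size_gib=40, throughput_mb_per_s=50)
state_copy = transfer_seconds(size_gib=4, throughput_mb_per_s=100)

print(f"copying storage:       {disk_copy / 60:.1f} minutes")
print(f"copying running state: {state_copy:.0f} seconds")
```

With these assumed numbers the storage copy takes about 14 minutes and the memory copy about 43 seconds. The memory figure is a lower bound, since a live migration has to resend any pages the guest modifies while the copy is in progress.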

Courses of action:

  • Briefly check the current bugginess or otherwise of KVM migration. This shouldn't take long.
  • Inquire into the possibility of intersite networking solutions, either to share subnets across sites or to make it more straightforward for a running VM to shift from one subnet to another. This might bear fruit, but probably will not, as the infrastructure unit has already indicated that cross-site shared subnets are generally a really bad idea.
  • Set up a shared fault-tolerant filesystem between test KVM servers and play about with it to test configurations, reliability etc. Other Schools, notably SEE, already use such a shared filesystem successfully for their virtualisation services. It would be a good idea for us to compare notes with them and share technology and solutions where we can. In addition, Iain Rae and Graham Dutton have had an initial look at this, with hopeful results.

-- ChrisCooke - 27 Feb, 2 Apr 2015

Topic revision: r3 - 02 Apr 2015 - 08:51:57 - ChrisCooke
 