Currently there seem to be three choices when it comes to where and how to store your virtual machine images:
- Local storage, either raw images or a cooked format (e.g. QCOW2)
- Remote storage, typically a shared and/or replicated system like NFS or Gluster
- Shared storage over dedicated hardware
There are many issues with each of these options in terms of latency, performance, cost and resilience – there is no 'ideal' solution. After facing this problem over and over again, we've come up with a fourth option.
Cache your storage on a local SSD, but hold your working copy on a remote server, or indeed servers. Using this mechanism, we've managed to eradicate all of the negatives we experienced historically with other options.

Features

Which means ...
- Virtual machines run against SSD image caches local to the hypervisor
- Images are stored remotely and accessed via TCP/IP
- The cache is LFU (*not* LRU), which makes it relatively 'intelligent'
- Bandwidth related operations are typically 'shaped' to reduce spikes
- Cache analysis (1 command) will give you an optimal cache size for your VM usage
- The storage server supports sparse storage, inline compression and snapshots
- The system supports TRIM end-to-end, VM deletes are reflected in backend usage
- All reads/writes are checksummed
- The database is log-structured and takes sequential writes [which is very robust and very quick]
- Database writing is “near” wire-speed in terms of storage hardware performance
- Live migration is supported
- The cache handles replicas, writing in parallel and striping reads (RAID 10)
- Snapshot operations are hot and “instant” with almost zero performance overhead
- Snapshots can be mounted RO on temporary caches
- Cache presents as a standard Linux block device
- Raw images are supported to make importing pre-existing VMs easier
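The LFU eviction policy mentioned in the list above can be sketched roughly as follows. This is a hypothetical illustration, not the project's actual code: unlike LRU, which evicts the block touched longest ago, LFU keeps whichever blocks are accessed most often – a good fit for VM images, where hot blocks (kernel, shared libraries) are hit constantly while one-off sequential scans shouldn't flush the cache.

```python
from collections import defaultdict

class LFUCache:
    """Minimal LFU cache sketch: evicts the least-frequently-used entry.

    Hypothetical illustration only -- the real cache operates on
    block-device extents, not Python dictionaries.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}                 # key -> cached value
        self.freq = defaultdict(int)   # key -> access count

    def get(self, key):
        if key not in self.data:
            return None                # cache miss
        self.freq[key] += 1
        return self.data[key]

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            # Evict the entry with the lowest access frequency,
            # not the one least recently touched (as LRU would).
            victim = min(self.data, key=lambda k: self.freq[k])
            del self.data[victim]
            del self.freq[victim]
        self.data[key] = value
        self.freq[key] += 1
```

A one-shot scan through cold data leaves frequently-read blocks untouched, which is why LFU behaves more 'intelligently' than LRU for this workload.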
In terms of how these features compare to traditional mechanisms, network bottlenecks are greatly reduced because the vast majority of read operations are serviced locally. Indeed, if you aim for a cache hit rate of 90%, then you should be able to run 10x the number of VMs as an NFS-based solution on the same hardware (from an IO perspective). Write operations are buffered, and you can set an average and peak rate for writing (per instance), so write peaks are levelled out with the local SSD acting as a huge [persistent] write buffer. (This write buffer survives shutdowns and will continue to flush on reboot.)
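Shaping writes to an average rate with a permitted peak is classically done with a token bucket; the sketch below shows the general idea. It is a hypothetical illustration – the parameter names and interface are ours, not the product's.

```python
import time

class TokenBucket:
    """Token-bucket write shaper: a sustained 'rate' in bytes/s,
    with bursts allowed up to 'burst' bytes.

    Hypothetical sketch of average/peak rate shaping -- not the
    actual flush logic used by the cache.
    """

    def __init__(self, rate, burst):
        self.rate = rate               # average bytes per second
        self.burst = burst             # peak burst size in bytes
        self.tokens = burst            # start with a full bucket
        self.last = time.monotonic()

    def consume(self, nbytes):
        """Return seconds to wait before nbytes may be flushed upstream."""
        now = time.monotonic()
        # Refill tokens for the elapsed interval, capped at the burst size.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return 0.0
        # Not enough tokens: the caller should delay this flush.
        return (nbytes - self.tokens) / self.rate
```

Short bursts go straight through at peak rate; sustained writing is throttled down to the average rate, which is what flattens the spikes the remote storage would otherwise see.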
If you assume a 90% hit rate, then 90% of your requests will be subject to a latency of 0.1ms (SSD) rather than 10ms (HD), so the responsiveness of instances running on cache when compared (for example) to NFS is fairly staggering. Take a VM running Ubuntu Server 12.04, type “shutdown -r now”, and time from hitting the return key to when it comes back with a login prompt: my test kit takes under 4 seconds, as opposed to 30-60 seconds on traditional NFS-based kit.
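The arithmetic behind that claim is simple expected-value maths; a quick sanity check using the figures quoted above:

```python
# Expected per-request latency for a given cache hit rate, using the
# illustrative figures from the text: 0.1 ms for a local SSD hit,
# 10 ms for a miss serviced from remote spinning disk.
def expected_latency_ms(hit_rate, hit_ms=0.1, miss_ms=10.0):
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

# At a 90% hit rate the blended latency is 0.9*0.1 + 0.1*10 = 1.09 ms,
# roughly 9x better than paying the remote-disk cost on every request.
```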
And when it comes to cost, this software has been designed to run on commodity hardware – that means desktop motherboards and SSDs on 1G NICs – although I'm sure it'll be more than happy to see server hardware should anyone feel that way inclined.
The software is still at the beta stage, but we now have a working interface for OpenNebula. Although it's not complete, it can be used to create, run and maintain both persistent and non-persistent images. Note that although this should run with any Linux-based hypervisor, every system has its quirks – for now we're working with KVM only, using Ubuntu 13.10 as a host. (13.04 should also be OK, but there are issues with older kernels, so 12.xx doesn't currently fly [as a host].)
As of today we have a public rack-based testbed, so we should be able to provide a demonstration within the next few weeks. If you're interested in helping or testing, please do get in touch.