Sunday, April 6, 2008

Virtualization Madness

I now have all of my hardware for the virtualization project I've been working on and have been doing final testing and setup configuration lately. I've had the awesome opportunity to really torture the VM setup on a test cluster in the lab. I started out with testing storage solutions on a 10x10 (physical x virtual) cluster of Dell 1950's with 8GB of RAM and two Core 2 Duo's. I still have the demo NetApp 3070 that proved out NFS storage for VM images, and now I have all of the upgrades in the 1950's to push them to 32GB of RAM and hardware RAID controllers. I've learned quite a bit in the process of all of this testing and thought I'd share some tips here.

Many of the VM's I'll be running in production will come from a P2V migration, but because my predecessors were smart enough to concentrate all of our custom content under a single mountpoint, most of the systems can be migrated to new OS images (and thus upgrade ancient OS's in the process).

Tip #1: when building servers (physical or virtual), ignore the FHS for your local content. The common place these days is /srv, so put anything that does not come in a distribution package here. Backups are as easy as "tar -czvf /someplace/`hostname -s`-srv.tar.gz /srv". Migrations and cluster scaleouts are similarly easy: "rsync -ave ssh /srv/./ username@newbox:/srv". When a package really wants you to conform to FHS, work around it with symlinks.

Edit: it was pointed out to me that FHS actually recommends /srv. To be honest, I haven't looked at it in years. In any case, my point remains valid, since most distros interpret FHS in their own way and still put things like web content and database files under /var.
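
The backup one-liner from Tip #1 wraps up nicely into a small function. This is only a sketch: backup_tree and its two arguments are names I made up for illustration, and /someplace is the same placeholder as above. It falls back to "uname -n" on boxes where "hostname -s" is not available.

```shell
#!/bin/sh
# Sketch: archive one content tree into a host-stamped tarball.
# backup_tree and its arguments are illustrative names, not from any package.
backup_tree() {
    src=${1:-/srv}          # tree to archive
    dest=${2:-/someplace}   # directory that receives the tarball
    host=$(hostname -s 2>/dev/null || uname -n)
    name=$(basename "$src")
    archive="$dest/$host-$name.tar.gz"
    # -C keeps the paths inside the archive relative (srv/...) rather than absolute
    tar -czf "$archive" -C "$(dirname "$src")" "$name" && echo "$archive"
}
```

Running "backup_tree /srv /someplace" prints the path of the tarball it wrote. Because the paths inside are relative, restoring on a new box is "tar -xzf thefile.tar.gz -C /".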

Ok, so that wasn't so much about virtualization. A somewhat little-known fact about almost all of the virtualization players out there, be it Xen, KVM, VMWare, or even Microsoft, is that the VM's themselves are not actually that hard to migrate between them. You only have to figure it out once; then, especially for Linux VM's, you can script it and do them in bulk. Probably the two most valuable tools for this are kpartx and qemu-img. Xen doesn't seem to install qemu-img with its Qemu stuff, but it's well worth keeping around on your dom0's.

Tip #2: learn to use kpartx and qemu-img, even if you're using LVM or individual LUN's for your VM's. qemu-img can read and write raw, vmdk, qcow2, and a few other formats and is pretty deft at enabling sparse file support, which is pretty nifty. For instance, if you download a VMWare appliance and want to run it under Xen, it's trivial to convert to a raw image with "qemu-img convert vmappliance.img -O raw vmappliance-raw.img". kpartx is nice because it will map out partitions within an image or LVM volume using device mapper. So once that image is created, do "kpartx -va vmappliance-raw.img", then you can mount the partitions from /dev/mapper without messing around with weird offset options to losetup.
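
If you want to see whether a converted raw image actually came out sparse, compare its apparent size with what is really allocated on disk. A minimal sketch, assuming GNU stat from coreutils; image_sizes is a name I made up for illustration.

```shell
#!/bin/sh
# Sketch: print "<apparent bytes> <allocated bytes>" for an image file.
# Assumes GNU stat (coreutils); image_sizes is an illustrative name.
image_sizes() {
    apparent=$(stat -c '%s' "$1") || return 1    # file length in bytes
    blocks=$(stat -c '%b' "$1") || return 1      # blocks actually allocated
    blocksize=$(stat -c '%B' "$1") || return 1   # size of each of those blocks
    echo "$apparent $((blocks * blocksize))"
}
```

Run it as "image_sizes vmappliance-raw.img". When the second number is much smaller than the first, the image is sparse, and copying it with a tool that is not sparse-aware will blow it up to full size.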

One of the problems I've run into quite a lot over the last couple of years of playing with Xen & co. is that most initramfs scripts produce environments that are far too fragile and stupid. With the availability of busybox and gobs of RAM these days, there is absolutely no reason I should have to screw around for hours rebooting a box because these filesystems are not smart enough to drop into a debug shell when things go wrong. I have published a simplistic script that I occasionally use to build initrd's at http://tobert.org/unix/index.html. But often, for support reasons, it's not practical to run a custom generation script. With the 2.6 Linux kernel, it's actually way easier to edit these buggers than it was back in the day, since now they're simply compressed cpio archives.

Tip #3: learn how to hack initrd's and get those years of your life back. Here's how to tear it apart:
mkdir /tmp/initrd
cd /tmp/initrd
gunzip -c /boot/initrd-`uname -r`.img |cpio -ivdm
The first thing to look at is the "init" script. For instance, when CLVM locking is stopping you from getting to single-user, simply crack that file open and comment out all the LVM initialization code. It's mostly a simple shell script. Another trick is to copy busybox into the bin directory, symlink lash to it, then add a "/bin/lash -i" to the init script right before root gets mounted. To put everything together again, you have to use the "newc" cpio format, so the command is (from the top of the initrd):
find . |cpio -oH newc |gzip -c > /boot/initrd-`uname -r`.img
To save yourself a lot of frustration, I highly recommend playing around with initramfs hacking in VM's first, since the hack/reboot/fail/reboot/hack/reboot cycle is so much faster.
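
The unpack and repack commands from Tip #3 can be kept around as a pair of functions. This is a sketch, not a polished tool: initrd_unpack and initrd_repack are names I made up, and it assumes a gzip-compressed cpio image like the ones described above.

```shell
#!/bin/sh
# Sketch: unpack and repack a gzip-compressed cpio initrd.
# Function names are illustrative; run as root on real images so that
# file ownership and device nodes survive the round trip.

# initrd_unpack IMAGE DIR: extract IMAGE into DIR
initrd_unpack() {
    mkdir -p "$2" || return 1
    # gunzip runs in the current directory, so a relative IMAGE path still works
    gunzip -c "$1" | ( cd "$2" && cpio -idm )
}

# initrd_repack DIR IMAGE: build IMAGE from the tree in DIR
initrd_repack() {
    # the kernel expects the "newc" cpio format
    ( cd "$1" && find . | cpio -o -H newc | gzip -c ) > "$2"
}
```

I'd suggest repacking to a new filename, something like "initrd_repack /tmp/initrd /boot/initrd-test.img", and pointing a separate bootloader entry at it, so the untouched original is still there when the hacked one fails to boot.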

Xen is pretty neat and it's nice how it's integrated with EL5 so I can just use yum to keep up to date. While I'm deploying Xen for my production stuff in the coming weeks, I'm watching closely for KVM to reach a level of maturity where I can start migrating over to it. I expect this to happen this year, but I won't go anywhere near it for production until it starts surviving my torture tests (another post, another day). For some more eloquent writing about why KVM can be better, check out Ulrich Drepper's LiveJournal, specifically here and here. So, what can you do to keep your VM's easy to migrate when something better comes along? Tip #1 takes you a long ways, since even if you have to reinstall the OS, it's a pretty trivial operation (especially if you use Cobbler).

Tip #4: don't tie your VM's too tightly to one solution. Obviously, the first step is to use libvirt rather than XenAPI. Once I figured out all of the bits & pieces, it only took about an hour - mostly waiting for the damned computers - to get all my test VM's converted from Xen to KVM paravirt. kpartx was invaluable since it let me mount the VM filesystems from the host. All of my VM's are on NetApp NFS, so a simple shell loop made quick work of mounting all 100 filesystems in my test cluster.
cd /net/vm-disks
for vmdisk in *.img
do
    mkdir -p /mnt/$vmdisk
    # run kpartx and grab partition #1 all at once
    DEVICE=`kpartx -v -a /net/vm-disks/$vmdisk |head -n 1 \
        |awk '{print $3}'`
    mount /dev/mapper/$DEVICE /mnt/$vmdisk
done
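
The head/awk pipeline in that loop trusts that the first line of kpartx output is the first partition. A slightly more defensive sketch is below. It assumes "kpartx -av" prints lines of the form "add map loop0p1 (253:2): ...", which is the same layout the awk '{print $3}' above relies on; first_partition_dev is a name I made up.

```shell
#!/bin/sh
# Sketch: read `kpartx -av` output on stdin and print the device-mapper
# path of the first partition it added.
# Assumes lines shaped like: add map loop0p1 (253:2): 0 208782 linear ...
first_partition_dev() {
    awk '$1 == "add" && $2 == "map" { print "/dev/mapper/" $3; exit }'
}
```

Inside the loop that becomes DEVICE=`kpartx -av /net/vm-disks/$vmdisk | first_partition_dev`, followed by "mount $DEVICE /mnt/$vmdisk" since the function already prints the full path. When you're finished, umount the filesystem and run "kpartx -d" on the image to remove the mappings.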
Once they're all mounted, it's pretty easy to loop through all of them and make a change, such as copy in a new /etc/modprobe.conf or an updated initramfs with virtio network/block drivers. I'm especially excited about KVM virtio-net with NFS root, since virtio-net is shaping up to be quite a bit faster than xennet.
# install a normal kernel
cd /mnt
for vmdisk in *.img
do
    chroot /mnt/$vmdisk yum -y install kernel
done
# and so on ...
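
Since the same loop keeps coming back with a different body, it can be factored into a helper that runs one command per mounted image. A rough sketch, assuming the images are mounted under directories named *.img as in the loops above; for_each_vm_root and install_modprobe are names I made up.

```shell
#!/bin/sh
# Sketch: run a command once per mounted VM root under a base directory.
# for_each_vm_root BASEDIR COMMAND [ARGS...]
# The VM root path is appended as the last argument to COMMAND.
for_each_vm_root() {
    base=$1
    shift
    for root in "$base"/*.img
    do
        # skip anything that is not a directory (including an unmatched glob)
        [ -d "$root" ] || continue
        # keep going if one VM fails, but say which one
        "$@" "$root" || echo "failed: $root" >&2
    done
}
```

For example, define install_modprobe() { cp /srv/vm-fixes/modprobe.conf "$1/etc/modprobe.conf"; } and run "for_each_vm_root /mnt install_modprobe" (the /srv/vm-fixes path is just a placeholder). The directory test does not prove that anything is actually mounted there, so a "mountpoint -q" check would be a reasonable addition.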
You might even get away with some of these tricks on Windows VM's using ntfstools and ntfs-3g.

Tip #5: when searching for best practices and tuning information, there is a lot of excellent documentation available for us Xen and KVM users in the form of VMWare documentation. For example, I've had really good luck with reading NetApp's docs for NetApp + VMWare (the block alignment and Oracle RAC on NFS docs are particularly good). When vendors say "we really don't do much with Xen," I ask them for VMWare whitepapers instead. Most of the concepts are the same regardless of the hypervisor, so learn both sets of terminology and make the best of all the great documentation out there.

As always, remember to make backups ...