15 April 2009

leveraging Linux for virtualization: the dark side

I work on KVM, which is a relatively small kernel module that transforms the Linux kernel into a hypervisor. A hypervisor really is a kernel: it contains a scheduler (at least the good ones do ;), device drivers (at least interrupt controller, probably console, maybe more), memory management, interrupt handlers, bootstrap code, etc.

This is the key observation behind KVM's design. "Hmm, we need a kernel... and hey, we've already got one!" We just need to add some code to make it schedule kernels instead of userspace tasks. In fact, one of the major technical faults of the Xen project was that it needed to duplicate — often copy outright — Linux code, for features such as power management, NUMA support, an ACPI interpreter, PIC drivers, etc. By integrating with Linux, KVM gets all that for free.

There is a drawback to leveraging Linux, though.

pro                              | con
use Linux's scheduler            | stuck with Linux's scheduler
use Linux's large page support   | stuck with Linux's large page support
get lots of fancy Linux features | stuck with the footprint of Linux's fancy features

Seeing a theme here? Let me share a little anecdote:

My team had been doing early development on KVM for PowerPC 440, and we were scheduled to do a demo at the Power.org Developer's Conference back in 2007. Unfortunately we weren't able to get Linux booting as a guest in time, but we had a simple standalone application we used instead. So when I say "early development" I mean "barely working."

A friend of mine walked up to the demo station and asked "Does nice work?" Now remember, basic functionality was missing. We couldn't even boot Linux. The only IO was a serial console. We had never touched a line of scheduler code, and certainly hadn't tested scheduling priorities. Despite all that, nice just worked because we were leveraging the Linux scheduler.
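To make the "nice just worked" point concrete, here is a minimal sketch, assuming the guest is launched as an ordinary host process. The helper name run_niced is hypothetical, and the demo runs a tiny Python child process as a stand-in for whatever launcher actually starts a guest. Nothing in it is virtualization-specific, which is the point: the host scheduler treats the guest like any other task.

```python
import os
import subprocess
import sys


def run_niced(cmd, increment):
    """Run `cmd` with its niceness raised by `increment`, much like `nice -n`.

    Hypothetical helper for illustration: `cmd` stands in for the command
    that would start a guest as a normal host process.
    """
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.nice(increment),  # runs in the child before exec
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in "guest": a child process that reports its own niceness.
    report = [sys.executable, "-c", "import os; print(os.nice(0))"]
    result = run_niced(report, 5)
    print("child niceness:", result.stdout.strip())
```

Threads the child creates later inherit its nice value, so a launcher started this way would run at the adjusted priority without any hypervisor code knowing about it.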

There's a downside, though. The Linux scheduler is famously tricky, and almost nobody wants to touch it because even slight tweaks can cause disastrous regressions for other workloads. One thing it does not support is gang scheduling, where all threads of a particular task must be scheduled at once (or not at all).

Gang scheduling is very interesting for SMP guests using spinlocks. One virtual CPU could take a spinlock and then be de-scheduled by the host. Unaware that the lock holder is no longer running, all the other virtual CPUs could keep spinning while they wait for the lock to be released, resulting in a lot of wasted CPU time. Gang scheduling is one way to avoid this problem: all virtual CPUs of a guest are scheduled at once, so a waiter never spins while the holder is off the CPU.
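The spinlock problem can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how KVM or the Linux scheduler works: time is divided into ticks, a schedule lists which virtual CPUs the host runs in each tick, vCPU 0 holds the lock, and every other running vCPU is assumed to be spinning on it. The function name and the schedules are made up for illustration.

```python
def wasted_spin_ticks(schedule, hold_ticks, holder=0):
    """Count vCPU ticks burned spinning while the lock holder is descheduled.

    schedule   -- list of sets; each set holds the vCPU ids running in that tick
    hold_ticks -- ticks of run time the holder needs before releasing the lock
    holder     -- id of the vCPU that holds the lock
    """
    remaining = hold_ticks
    wasted = 0
    for running in schedule:
        if remaining == 0:
            break  # lock released; nobody spins any more
        if holder in running:
            remaining -= 1  # holder makes progress towards releasing
        else:
            # Holder is off the CPU, so every running vCPU spins uselessly.
            wasted += len(running)
    return wasted


# Independent scheduling: vCPU 0 is preempted for three ticks while vCPU 1 runs.
independent = [{0, 1}, {1}, {1}, {1}, {0, 1}, {0, 1}]

# Gang scheduling: both vCPUs run together or not at all.
gang = [{0, 1}, set(), set(), set(), {0, 1}, {0, 1}]

if __name__ == "__main__":
    print("independent:", wasted_spin_ticks(independent, hold_ticks=3))
    print("gang:", wasted_spin_ticks(gang, hold_ticks=3))
```

In this model the independent schedule burns three ticks of spinning and the gang schedule burns none, at the cost of leaving the CPUs to other work (or idle) while the whole guest is descheduled.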

Since Linux doesn't support gang scheduling, and only a handful of people in the world have the technical skill and reputation to change that, that's basically a closed door.

This is just one example, but I think you can see that re-purposing Linux for virtualization is a tradeoff between functionality and control. If one were to write a new scheduler for a hypervisor, they'd need to implement nice themselves... but they would also be free to implement gang scheduling.
