This is the key observation behind KVM's design. "Hmm, we need a kernel... and hey, we've already got one!" We just need to add some code to make it schedule kernels instead of userspace tasks. In fact, one of the major technical faults of the Xen project was that it needed to duplicate — often copy outright — Linux code, for features such as power management, NUMA support, an ACPI interpreter, PIC drivers, etc. By integrating with Linux, KVM gets all that for free.
There are drawbacks to leveraging Linux, though:
| pro | con |
|---|---|
| use Linux's scheduler | stuck with Linux's scheduler |
| use Linux's large page support | stuck with Linux's large page support |
| get lots of fancy Linux features | stuck with the footprint of Linux's fancy features |
Seeing a theme here? Let me share a little anecdote:
My team had been doing early development on KVM for PowerPC 440, and we were scheduled to do a demo at the Power.org Developer's Conference back in 2007. Unfortunately we weren't able to get Linux booting as a guest in time, but we had a simple standalone application we used instead. So when I say "early development" I mean "barely working."
A friend of mine walked up to the demo station and asked "Does nice work?" Now remember, basic functionality was missing. We couldn't even boot Linux. The only IO was a serial console. We had never touched a line of scheduler code, and certainly hadn't tested scheduling priorities. Despite all that, nice just worked because we were leveraging the Linux scheduler.
There's a downside, though. The Linux scheduler is famously tricky, and almost nobody wants to touch it because even slight tweaks can cause disastrous regressions for other workloads. For example, the Linux scheduler does not support gang scheduling, where all threads of a particular task must be scheduled at once (or not at all).
Gang scheduling is very interesting for SMP guests using spinlocks. One virtual CPU could take a spinlock and then be de-scheduled by the host. Unaware of this important information, all the other virtual CPUs could spin waiting for the lock to be released, resulting in a lot of wasted CPU time. Gang scheduling is one way to avoid this problem by scheduling all virtual CPUs at once.
Since Linux doesn't support gang scheduling, and only a handful of people in the world have the technical skill and reputation to change that, that's basically a closed door.
This is just one example, but I think you can see that re-purposing Linux for virtualization is a tradeoff between functionality and control. If one were to write a new scheduler for a hypervisor, they'd need to implement nice themselves... but they would also be free to implement gang scheduling.