By Jorge Castro (mailto:jorge@whiprush.org)
Just a few weeks ago the
Linux.Ars (http://arstechnica.com/2004/linux.ars-20040331-1.html) crew started
delving (http://arstechnica.com/etc/linux/2003/linux.ars-12242003.html) into the world of the new Linux Kernel, version 2.6. Since that time
they have received a number of questions about other parts of the kernel,
particularly the work done with preemption. Rather than attempt to answer these
questions themselves, they decided to ask one of the most prominent Linux kernel
hackers of today to answer a few questions about the kernel, and boy, did he
ever answer them.
Recently hired by
Ximian (http://www.ximian.com/) (now a subsidiary of
Novell (http://www.novell.com/)) in order to further improve the Linux kernel, Mr. Love has another,
more interesting task ahead of him: integration of all this low-level work into
the Linux desktop, specifically the
GNOME Desktop and Developer
Platform (http://www.gnome.org/). The work is already coming to fruition as developer releases of
"Project Utopia" (as it has been dubbed) have already been released.
So what exactly does Project Utopia bring to the Linux desktop? As
an example of its benefits, all sorts of devices like cameras, mp3 players, and
memory sticks will not only work out of the box when plugged in, but will be
fully integrated into the desktop to provide the user with a transparent
experience. No manual mounting, no driver disc from a third party, and no arcane
knowledge of Linux is required. So sit back and let's see how Robert Love plans
to make the Linux Desktop "Just Work".
Those of you who have tried the new 2.6 Linux kernels will undoubtedly
have noticed how much more responsive the system feels under interactive use
than earlier kernels. Others who have tried the
kernel preemption patches (ftp://ftp.kernel.org/pub/linux/kernel/people/rml/preempt-kernel/v2.4/) or Con Kolivas'
patches for interactive use (http://www.plumlocosoft.com/kernel/) will appreciate the difference as well. A large part of the credit for
this work goes to Robert M. Love.
The Linux Desktop continues to evolve at a rapid pace. Now that kernel
2.6 has been released with its many improvements in latency, other
integration work has begun. Some of the most interesting work toward these goals
(improved latency and integration of the kernel and the rest of the desktop) is
Project Utopia (http://primates.ximian.com/~rml/project_utopia/), a project to improve the way the Linux desktop deals with device
management and event notification from the kernel. Major components of it are
HAL (http://hal.freedesktop.org/) (an abstraction layer for hardware that provides a unified model of the
devices in the system to interested applications, along with notification of any
hardware changes),
D-BUS (http://dbus.freedesktop.org/) (a means for applications to communicate with one another; it's used by
HAL and e.g. desktop environments to talk to each other about things like device
discovery and changes),
udev (http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev-FAQ) by
Greg Kroah-Hartman (http://www.kroah.com/) (a means to maintain the special files in /dev based on devices and
drivers present in the system) and the
GNOME volume manager (http://primates.ximian.com/~rml/blog/archives/000315.html) (automatically mounts hot-plugged storage volumes).
We asked Mr. Love some questions, and he answered them comprehensively.
If you find the content going over your head, try hitting the footnotes (linked
in superscript).
Ars Technica:
[1] (http://arstechnica.com/#rml1) People have been voicing some concerns
on the
linux-kernel mailing list (http://vger.kernel.org/vger-lists.html#linux-kernel) about how the new thread scheduler tends to make the effects of setting
static priorities (e.g. using nice or renice) less predictable (so, for
instance, a thread with a nice level of -10 may not get all that much more
attention than a thread with a nice level of 0 if they demand CPU time in
certain patterns, even if these usage patterns change frequently). Is this true,
and if so, are there any workarounds that would make things more predictable? Is
Nick Piggin's work (http://www.kerneltrap.org/~npiggin/) on this going to figure in future kernels, and if so, how?
Robert Love:
The 2.6 process scheduler intentionally dynamically modifies the priority of
processes to better optimize the system for I/O and interactive use. This is
done via an "interactivity estimator" that gives a small priority bonus to
I/O-bound processes and a small priority punishment to CPU-bound processes.
Processes can receive as much as ±5 nice levels in either direction from their
given static priority. Processes at some theoretical medium of I/O-vs-CPU usage
receive zero points and thus remain at their given static priority.
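To make the arithmetic concrete, here is a minimal Python sketch of a bonus of up to 5 nice levels applied to a static nice value. The function names, the `sleep_fraction` input, and the linear mapping are assumptions made for illustration; the kernel's actual estimator is more involved.

```python
# Illustrative sketch only: names and the linear mapping are assumptions,
# not the kernel's actual code.
MAX_BONUS = 5                 # at most 5 nice levels in either direction
NICE_MIN, NICE_MAX = -20, 19  # valid range of nice values


def interactivity_bonus(sleep_fraction: float) -> int:
    """Map the fraction of time a task spends sleeping (0.0 to 1.0)
    to a bonus between -MAX_BONUS and +MAX_BONUS."""
    if not 0.0 <= sleep_fraction <= 1.0:
        raise ValueError("sleep_fraction must be between 0.0 and 1.0")
    # 0.5 is the "theoretical medium": it yields a bonus of zero.
    return round((sleep_fraction - 0.5) * 2 * MAX_BONUS)


def effective_nice(static_nice: int, sleep_fraction: float) -> int:
    """Apply the bonus to a static nice value and clamp to the valid range."""
    bonus = interactivity_bonus(sleep_fraction)
    # A lower nice value means higher priority, so a positive bonus
    # (I/O-bound task) lowers the nice value.
    return max(NICE_MIN, min(NICE_MAX, static_nice - bonus))
```

Under this model, a fully CPU-bound task at nice -10 and a fully I/O-bound task at nice 0 both end up at an effective level of -5, which is the kind of overlap the question describes.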
The intention behind this is twofold. First, optimizing for I/O is
usually a good thing to do. I/O-bound processes, by definition, spend much of
their time sleeping and waiting on I/O (whether it be disk I/O, keyboard
activity, sound buffers, etc.). Giving preference to an I/O-bound process allows
it to quickly run, dispatch more I/O, and continue to wait. This enhances the
overall performance of the system.
Second, favoring I/O-bound processes implies favoring interactive
processes (such as your text editor, mailer, web browser, and so on) since
interactive processes are I/O-bound (usually blocked on keyboard or mouse
input). Favoring interactive processes improves the smoothness and "feel" of the
desktop, improving the user experience.
Other operating systems accomplish these goals in other ways. For
example, last I checked, the default timeslice in Windows was ridiculously small
— like 10ms. This favors I/O-bound processes. Both Windows and Solaris also give
a priority bonus to the process that has window focus in the GUI. The kernel
developers, myself included, feel these approaches have shortcomings and that
our approach is more robust.
I have not heard complaints that the interactivity estimator is
"unpredictable" in the sense that it does all sorts of wild things. Since it can
only reward/punish a task ±5 nice levels, the effect should never be too
dramatic a departure from what the user intended. The interactivity estimator
can incorrectly estimate a task's interactivity, however, and that could result
in diminished system performance. A lot of work went into tuning the estimator
late in the 2.6 pre-kernel and these issues are hopefully resolved.
As far as Nick Piggin's work goes, I watch it closely. He is a great
hacker and definitely has some good ideas in his policy changes. If it tends to
work out for the better, then I certainly think his work may end up in 2.6
proper.
Ars: What's going
on with explicit hyperthreading
[2] (http://arstechnica.com/#rml2) support for Pentium 4? As we understand
it, the 2.6 scheduler treats logical processor pairs as independent entities
with independent caches and independent functional units. There's a
batch scheduler (http://kerneltrap.org/node/view/1877) in the works that promises to schedule things with an awareness that
resources are shared, as well as scheduling similar-priority threads together.
What's planned, ultimately, for this work?
Love: The batch
scheduler is altogether unrelated (it implements the equivalent of a SCHED_IDLE
class with batch-scheduling-like behavior).
Optimizing for HT — or, actually, SMT in general — is a different
problem, although not a huge one. The issue is that the scheduler treats each
logical processor as a separate processor and thus gives each logical processor
its own runqueue. This is what one would expect, actually.
In a multiprocessor system, however, with multiple physical processors
each with SMT, load balancing among the processors is not perfect with this
layout. For example, consider the load-balancing situation in a dual P4 (which
has four virtual processors, total) where three virtual processors are free and
the other one has two processes. The goal of the load balancer is to "balance"
the load, evening out the distribution of processes. Ideally, we would want one
of the processes moved to a virtual processor on a different physical
processor. Moving it to the free virtual processor on the same physical package
provides only a small performance increase, since the HT units share so many
chip resources.
The easiest way to solve this is to just stick some logic in the load
balancer to understand SMT and try to load balance across different physical
processors more readily than other local virtual processors. But this is a hack.
The better solution is to introduce the concept of shared runqueues,
where the SMT units in a given physical package can all share a runqueue. This
means that the load balancer automatically only balances between physical
processors and that we can get a better understanding of the cost of balancing,
since cache is shared among the local virtual processors.
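The placement preference described above can be sketched in Python as follows. This is only an illustration of the idea of preferring an idle logical processor on a less-loaded physical package; the `LogicalCPU` type and both function names are hypothetical and do not correspond to kernel code.

```python
from dataclasses import dataclass


@dataclass
class LogicalCPU:
    cpu_id: int
    package: int      # physical package this logical CPU belongs to
    nr_running: int   # runnable tasks currently on this logical CPU


def package_load(cpus, package):
    """Total runnable tasks across all logical CPUs of one physical package."""
    return sum(c.nr_running for c in cpus if c.package == package)


def pick_target_naive(cpus, source):
    """SMT-unaware choice: the first idle logical CPU by id."""
    idle = [c for c in cpus if c.nr_running == 0 and c.cpu_id != source.cpu_id]
    return min(idle, key=lambda c: c.cpu_id) if idle else None


def pick_target_smt_aware(cpus, source):
    """Prefer the idle logical CPU whose physical package carries the
    least load, breaking ties by CPU id."""
    idle = [c for c in cpus if c.nr_running == 0 and c.cpu_id != source.cpu_id]
    if not idle:
        return None
    return min(idle, key=lambda c: (package_load(cpus, c.package), c.cpu_id))


# The dual P4 example: two packages, two logical CPUs each,
# with both runnable processes sitting on logical CPU 0.
cpus = [
    LogicalCPU(0, package=0, nr_running=2),
    LogicalCPU(1, package=0, nr_running=0),
    LogicalCPU(2, package=1, nr_running=0),
    LogicalCPU(3, package=1, nr_running=0),
]
```

With this layout the naive choice is logical CPU 1, the sibling on the same package, while the SMT-aware choice is logical CPU 2 on the other physical processor.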
I think that this work, too, will eventually find its way into the
kernel.
Footnotes
1 The Linux
2.5/2.6 series kernels added in huge changes to thread scheduling. We went over
the 2.6 scheduler in
December 2003 (http://www.arstechnica.com/etc/linux/2003/linux.ars-12242003.html). There were a number of enhancements in the scheduler that were meant to
dole out processor time to certain tasks that were deemed to be interactive (I/O
bound) over those that were deemed to be CPU hogs (processor-bound or memory
bandwidth–bound). This makes programs like media players and UI components whose
behavior is largely I/O-bound — waiting for user input — receive CPU time much
sooner than, say, a
Folding@Home (http://fah.stanford.edu/)
task, which spends most of its time doing calculations. This works really well,
but a few people (e.g.
this individual (http://lkml.org/lkml/boring/2004/1/4/66), among a number of others) found that the prioritizing behavior wasn't
to their tastes. This prompted Nick Piggin, author of the anticipatory I/O
scheduler that is now default in Linux 2.6, to
make some changes to scheduler policy (http://kerneltrap.org/node/view/754). These changes seem to be appreciated by the critics.
2 While
enumerating and enabling the logical processors on a Pentium 4 Hyperthreading
processor is supported in Linux 2.6, the current scheduler in the stock kernel
doesn't do anything special to distinguish the shared resources on logical
processors from resources on different physical processors. The result is that
there is a greater demand on things like caches, execution units and the like
than would be absolutely necessary. A number of people are working on a
Hyperthreading-aware scheduler (notably Ingo Molnar and Nick Piggin). Con
Kolivas produced a patch that added Hyperthreading support to the aforementioned
batch scheduler to deal with the fact that the Pentium 4 and Xeon cannot honor
priorities in Hyperthreading.
Ars: How will the
new
CFQ (http://lwn.net/Articles/22429/) [3] I/O scheduler work? In what cases does
it improve upon the anticipatory and deadline I/O schedulers currently in 2.6?
Is it intended to go into a future 2.6 kernel?
Love: The CFQ
(Completely Fair Queuing) I/O scheduler is something that I am very interested in.
It is going to be part of the desktop kernel package I am putting together at
Ximian. I think it is very well suited to desktop systems.
The idea behind it is to round robin I/O requests from each process,
evenly distributing the disk's bandwidth among processes on the system, thus
being "fair" on a per-process scale. This ensures that no one process can hog
the disk's bandwidth. Thus disk latency is greatly improved at the potential
cost to overall throughput. The CFQ I/O scheduler is best suited when disk
response is the primary concern, such as with desktop and multimedia workloads.
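As a rough illustration of the round-robin idea, the Python sketch below keeps one queue of pending requests per process and serves the processes in rotation. The class and method names are invented for this example, and it ignores everything a real I/O scheduler must also handle, such as request merging and sorting by disk position.

```python
from collections import OrderedDict, deque


class RoundRobinIOScheduler:
    """Toy model: one FIFO of requests per process, served in rotation."""

    def __init__(self):
        self._queues = OrderedDict()  # pid -> deque of pending requests

    def submit(self, pid, request):
        """Queue a request on behalf of process `pid`."""
        self._queues.setdefault(pid, deque()).append(request)

    def dispatch(self):
        """Return the next (pid, request) pair, or None if nothing is pending."""
        if not self._queues:
            return None
        pid, queue = next(iter(self._queues.items()))
        request = queue.popleft()
        if queue:
            # This process still has work: send it to the back of the rotation.
            self._queues.move_to_end(pid)
        else:
            del self._queues[pid]
        return pid, request
```

If process 100 submits three requests and process 200 then submits one, process 200 is served second rather than last, so a single heavy process cannot monopolize the disk.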
I think the CFQ I/O scheduler will definitely make it into 2.6 very soon.
There is no reason not to as I/O schedulers are now pluggable components in the
2.6 kernel.
I/O schedulers are a complicated subject. I wrote a primer to I/O
schedulers for
this month's Linux Journal (http://www.linuxjournal.com/modules.php?op=modload&name=NS-lj-issues/issue118&file=index). My book,
Linux Kernel Development (http://www.amazon.com/exec/obidos/tg/detail/-/0672325128/qid=1074703009/sr=1-1/ref=sr_1_1/002-5805941-7672855?v=glance&s=books), also discusses this topic.
Ars [4] (http://arstechnica.com/#rml4): A lot of work is going into Project
Utopia to provide a user-space framework for dynamic device management (things
like device detection, automatic driver loading, even things like filesystem
mounts and notification via D-BUS or similar to subscribed apps so they can do
something about it), and you're involved in this deeply. How much is this work
tied into the GNOME desktop, or for that matter, any desktop environment? For
instance, if we wanted to have a network daemon running on a headless machine
(no desktop environment installed) deal with something like additional storage
attached via FireWire, could the daemon use this framework to deal with
detection, driver loading, mounting, etc. easily, even if there was no trace of
a
Freedesktop.org (http://www.freedesktop.org/)-compliant desktop environment on the computer? Would distributions be
able to pick up these pieces and integrate them in their base system and their
initscripts in place of (or along with) things like
kudzu (http://rhlinux.redhat.com/kudzu/),
mdetect (http://packages.debian.org/unstable/utils/mdetect) and
hotplug (http://linux-hotplug.sourceforge.net/)?
Love: Project
Utopia's goal is to fully integrate the Linux system, from the kernel on up the
stack, through the GNOME desktop, its applications, and finally to the user.
Therefore, Project Utopia is very GNOME-specific.
But Project Utopia is composed of many small components, and each
component is intentionally being developed separately and abstractly. Thus, a
GNOME desktop (or any desktop) is not required for much of the functionality and
another desktop environment could (and should!) provide the missing pieces.
The system is architected in such a way that the only components actually
at the desktop layer are policy mechanisms, such as gnome-volume-manager, and
glue layers/libraries, such as any forthcoming notification system.
Components such as udev and hotplug are obviously entirely agnostic to
the rest of the system, as they are (or will be) required pieces of nearly any
Linux system. Other components, such as D-BUS and HAL, can likewise fit into any
system. I very much hope that both of those projects find wide adoption.
In response to your example, I think that a server with no desktop
environment would still benefit from this work. In fact, it would just use
Project Utopia as far up the stack as needed, definitely making use of udev,
D-BUS, and HAL.
Ars: Regarding system status changes — you've indicated that this is all
done in userspace, without polling. This suggests that a filesystem change
notification framework such as dnotify
[5] (http://arstechnica.com/#rml5) is being used as the basis. People are
considering replacing dnotify, though, since it's a bit clunky (not to mention
inefficient) to monitor entire directories when the app is only interested in
one or a few files. What is intended as a replacement, if anything, and will the
projects you're working on be adapted to use it?
Love: Indeed,
dnotify is one of the mechanisms used to avoid polling. Others include good ol'
blocking on read and the forthcoming kernel events layer that I am working on.
I also agree that — let's be honest here — dnotify sucks. Calling it
clunky is nice. It is cumbersome and awkward to use, although at the end of the
day it does get the job done.
I think a better person to ask about a replacement for dnotify is someone
who uses the API extensively. I know firsthand that the Nautilus maintainers
could readily describe a more ideal interface. Unfortunately there is no
replacement under development, although I am sure people would be happy to use a
sane replacement.
Footnotes
3 Jens Axboe's
"Completely Fair Queuing" I/O scheduler is a recent development, and one that is
much-anticipated for desktop systems. An I/O scheduler is essentially a
policy on the pattern in which the kernel should order its requests from disk
devices and the like, in order to maximize aspects of I/O performance. The
primary I/O schedulers in the 2.6 Linux kernel are the anticipatory scheduler
and the deadline scheduler.
4 Project Utopia
is slated to be a big part of future GNOME desktops, providing the ability for
the GUI to do things like automatically mounting filesystems off iPods, USB
keychain drives and the like, to automatically start a media player to play a
DVD inserted in the drive, and so on. We were wondering if the framework could
be used for things other than the desktop.
5 dnotify is a
mechanism to let applications monitor directories on the filesystem for changes,
including file insertion, deletion, rename, update, etc. Readers who are more
interested can find information in the file Documentation/dnotify.txt in their
copy of the Linux kernel source tree.
Ars: The 2.6 kernel was made fully preemptible
[6] (http://arstechnica.com/#rml6) thanks largely to your
efforts (http://www.tech9.net/rml/linux/). However, there still remain some bits of preemption-unfriendly code. Do
you think that preemption is progressing well, and in general are you happy with
the state of the 2.6 scheduler? Specifically, are you satisfied with the
interactivity of the new scheduler and the preemptible kernel? What changes do
you think need to be made, if any?
Love: Yes, I am very happy with the state of scheduling latency in the
2.6 kernel.
A lot of tuning is still needed, but also a lot of bad areas have been
fixed and the kernel is overall much more fair than before. Some specific areas
of tuning are in filesystem code and RCU
[7] (http://arstechnica.com/#rml7). People are working on the RCU issues
now.
Ars: The push for
device management in user space is now in full swing it would seem; udev, D-BUS
and HAL are all progressing. What advantages does pushing the device management
into user space actually bring? Do you see a complete transition from
kernel-based solutions in the future? What implications for Linux on the desktop
does user space device management carry, and specifically what implications for
GNOME?
Love: A user-space device naming solution, in particular udev, offers six
main benefits.
First, and namely, it provides a mechanism for persistent device naming.
Using the logic in udev and a simple configuration file, a given disk partition
can always be "hda5" and your favorite joystick can always be named "snake_eyes,"
regardless of where, when, or in what order the device was connected to the
system.
Second, and also important, we no longer have to worry about minor/major
numbers anymore. They no longer matter one bit, whatsoever. They simply become
an arbitrary cookie that user-space uses to communicate with the kernel. We can,
and will, randomly generate them.
Third, we no longer have to manually maintain /dev with hacks like the
MAKEDEV script, which have to be updated whenever a new type of device is added
to the kernel. And we get a /dev tree that only contains the valid devices on
the system, and not a big pile of stink that is the current /dev.
Fourth, udev
can do neat things via its configuration script and the fact it emits a D-BUS
signal. This means HAL can listen to it and know all about device node additions
and removals.
Fifth, udev
is a small and simple binary, in user-space. Unlike kernel memory, user-space
memory is abundant, swappable, and protected. Also, since udev is in user-space,
policy is entirely up to the user — if there is not a good reason for something
to be in the kernel, then it should not be.
Finally,
this is all done elegantly and without any hacks, simply by leveraging
information and mechanisms that already exist, today, in the form of hotplug and
sysfs
[8] (http://arstechnica.com/#rml8).
It is just the Right Thing to do.
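The persistent-naming benefit can be pictured with a small Python sketch. The rule table, the attribute keys, and the serial number below are all made up for illustration; this is not udev's configuration syntax or its matching logic, only the general idea of naming a device by its own attributes rather than by the order in which it was discovered.

```python
# Hypothetical rules: each entry pairs the attributes a device must have
# with the name it should always receive. None of these values are real.
RULES = [
    ({"bus": "usb", "serial": "JOY-0001"}, "snake_eyes"),
    ({"bus": "ide", "partition": 5}, "hda5"),
]


def name_for(device, rules=RULES):
    """Return the persistent name for a device, falling back to the
    name the kernel assigned when no rule matches."""
    for match, name in rules:
        if all(device.get(key) == value for key, value in match.items()):
            return name
    return device["kernel_name"]
```

In this model the joystick is called "snake_eyes" whether the kernel enumerated it as js0 or js1, because the match is on its serial number and not on its position.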
Ars:
The addition of gnome-volume-manager will certainly make GNOME more media
friendly, and should prove that udev is what it claims to be. Have you been
satisfied with D-BUS and udev while working on this addition to GNOME? You
mentioned in your
blog (http://primates.ximian.com/~rml/blog/)
that parts of gnome-volume-manager had to use the kernel events interface. Do
you think in the near future gnome-volume-manager and other applications that
monitor hardware events will be able to be pulled completely away from the
kernel events layer
[9] (http://arstechnica.com/#rml9)
or that it is here for at least a while more?
Love:
I am very satisfied with all levels of the Project Utopia stack, including udev
and D-BUS. I am most satisfied, however, with HAL. HAL is definitely the shining
centerpiece of Project Utopia. HAL made gnome-volume-manager a simple policy
engine, implementable as a finite-state machine, which simply listens for
certain HAL events and reacts with user-configured policy. Any information that
gnome-volume-manager needs it gets from HAL. In fact, it keeps no internal state
whatsoever, aside from its configuration settings. HAL seriously rocks.
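A stateless policy engine of the kind described here might look like the Python sketch below. The event fields, policy keys, and action names are hypothetical and are not HAL's or gnome-volume-manager's real interfaces; the point is only that each event is mapped to a user-configured action with no state kept between events.

```python
def make_policy_engine(policy, actions):
    """Build an event handler from user configuration.

    `policy` maps (event type, media kind) to an action name, and
    `actions` maps action names to callables. The handler keeps no
    state of its own between events.
    """
    def handle(event):
        action_name = policy.get((event["type"], event.get("media")))
        if action_name is None:
            return None  # no policy configured for this event: do nothing
        return actions[action_name](event)

    return handle


# Hypothetical user configuration.
policy = {
    ("volume_added", "storage"): "mount",
    ("volume_added", "dvd"): "play",
}
actions = {
    "mount": lambda event: "mount " + event["device"],
    "play": lambda event: "play " + event["device"],
}
handle = make_policy_engine(policy, actions)
```

Calling `handle` with a storage-volume event returns the mount action for that device, while an event with no configured policy is ignored.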
I do not
think that needing the kernel events layer is a bad thing. It is a good thing,
that is why I am writing it. It allows us to asynchronously send event-related
messages to user-space, via D-BUS signals. I hope more things move to it, not
away from it!
Ars: There
is obviously a great deal of work going into Linux at the moment to get it more
usable on the desktop; however, the overwhelming majority of Linux users are
either in the server or embedded market. Do you think that any of the changes
that have been occurring to both the Linux kernel and the user-space device
management could benefit the server market or are they specifically targeted for
the desktop user?
Love: No,
this stuff is very important for both of these markets, too.
Things like
udev are needed both in the embedded space and the server space. Removal of
major/minor limitations, persistent device naming, and so on are crucial to many
facets of Linux. HAL, too, will greatly simplify and improve both end-user
management of Linux and application development under Linux.
Footnotes
6
In kernels 2.4 and earlier, if a task performed a system call that required the
kernel to do something that took a while to do, the kernel would keep processing
the system call on behalf of the task on the processor where it was running
until it was done with that portion of processing. Mr. Love's work enables even
kernel system calls to be preempted in favor of other tasks, and then continued
later. This causes tasks to spend less time waiting to run while the kernel is
doing something, and makes the system feel more responsive. Also, a lot of work
was done to cause long-running system calls to themselves yield the processor
for other tasks.
7
See this Linux Journal article for a nice overview of the whats, hows and whys
of read-copy updates (RCUs).
8
For those who are not familiar with sysfs, a lot of system device and bus
information was moved from /proc into a new filesystem called sysfs, typically
mounted at /sys, in the Linux 2.5/2.6 kernels.
9
Mr. Love is working on a new kernel interface that programs can use to be
notified of system events, particularly device change/enumeration events. Parts
of Project Utopia are based on this.