The xv6 kernel uses a basic round robin scheduler. To understand scheduling more deeply, I replaced it with a stride scheduler. This post compares round robin and stride scheduling, explains how I added stride scheduling to xv6, and covers what I learned along the way.

TL;DR / Key Takeaways

  • xv6 uses a simple round robin scheduler that treats all runnable processes equally.
  • I replaced it with a stride scheduler to explore proportional-share scheduling.
  • Stride scheduling assigns each process a “stride” value that determines how often it runs; smaller strides mean higher priority.
  • In relatively few lines of code, I got a stride scheduler whose observed behavior closely matched the theoretical expectations.
  • Along the way, I uncovered an interesting startup lockup in init worth looking into further.

What’s a Scheduler, Again?

One of the most important responsibilities of an operating system is to select which processes (think: running programs) to run and when. If there were, say, one process running and one CPU on the system, there would be no problem - just let that process run. But on real systems, there are many processes running at a time and relatively few processors on which to run them. To provide the illusion that every process is running, the OS lets each process run for a very short burst of time, then lets another run for a short time, and so on. Choosing which process to run is the role of the OS scheduler.

An aside on terminology: Scheduling theory comes not just from the world of computers, but from the world of manufacturing. So, there are some terms that can sometimes be used interchangeably or which can be a bit confusing. For instance, processes are sometimes called “jobs.” Scheduling policies aren’t just called algorithms: they’re also called “disciplines.” And, a set of jobs is a “workload.” OK, let’s get back to the post.

How many ways could there be to schedule jobs? Well, a lot, it turns out - different scheduling disciplines work better or worse for different workloads. We’re just going to talk about two, though: round robin and stride scheduling.

Round Robin

The xv6 kernel uses a pretty straightforward scheduling discipline by default, known as round robin scheduling. Round robin loops continuously over a list of all the processes on the system, choosing the “next” one that could run and then running it for a short period of time. It then picks up the loop where it left off, finding whichever process is next in line that is ready to run and then running it. This cycle continues ad infinitum. (Or at least until the machine is turned off.)
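
For reference, the core of xv6-riscv's default scheduler() looks roughly like this (lightly condensed and commented; treat it as a sketch rather than a verbatim copy):

// The heart of xv6-riscv's round robin scheduler(): each CPU loops over the
// process table forever, running whichever RUNNABLE process it finds next.
void
scheduler(void)
{
  struct proc *p;
  struct cpu *c = mycpu();

  c->proc = 0;
  for(;;){
    intr_on();            // ensure devices can interrupt and wake processes

    for(p = proc; p < &proc[NPROC]; p++){
      acquire(&p->lock);
      if(p->state == RUNNABLE){
        // Run this process for a slice; it switches back here when it
        // yields, blocks, or exits.
        p->state = RUNNING;
        c->proc = p;
        swtch(&c->context, &p->context);
        c->proc = 0;
      }
      release(&p->lock);
    }
  }
}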

Round Robin in Theory

Let’s consider how this works, theoretically. The table below shows two hypothetical processes, both ready to run at the same starting time, “time 0.” Because process 1 is first in the process list, the scheduler chooses it first. After that process runs for a while, the scheduler picks back up and resumes where it left off in the list.

Time  Process 1 (times scheduled)  Process 2 (times scheduled)  Choice
0     0                            0                            Process 1
1     1                            0                            Process 2
2     1                            1                            Process 1
3     2                            1                            Process 2
4     2                            2                            Process 1
5     3                            2                            Process 2
6     3                            3                            Process 1
7     4                            3                            Process 2
8     4                            4                            Process 1
9     5                            4                            Process 2

The corresponding diagram, below, visualizes the number of times each process is scheduled, according to the table above. Process 1 goes, then process 2, repeating over and over. The “diamond” pattern would continue repeating as long as both processes still had work to do and as long as no other processes started.

Theoretical Schedule Allocation by Time (Round Robin)

Not surprisingly, based on the table above, the split between these two theoretical processes is 50-50.

Theoretical Schedule Allocation (Round Robin)

Round Robin in Practice

Now that we know how round robin ought to behave, let’s see how xv6 compares. I wrote a test program that spawns two child processes. Each loops for a while, periodically printing to the console a total of 1000 times. This gives a rough approximation of scheduling behavior.
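
The test program itself is nothing fancy. Here's a hedged sketch of its shape; names like spin() and REPORTS are illustrative rather than the exact code I used:

// Illustrative test program: fork two children, each of which burns CPU
// and periodically reports progress to the console.
#include "kernel/types.h"
#include "user/user.h"

#define REPORTS 1000      // each child prints this many progress lines

void
spin(int id)
{
  for(int i = 0; i < REPORTS; i++){
    for(volatile int j = 0; j < 1000000; j++)
      ;                   // busy work between reports
    printf("child %d: %d\n", id, i + 1);
  }
  exit(0);
}

int
main(void)
{
  for(int id = 0; id < 2; id++)
    if(fork() == 0)
      spin(id);           // child never returns; it exits inside spin()
  wait(0);
  wait(0);
  exit(0);
}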

I’ll spare you the table, but here’s the timeline from one test execution.

Observed Schedule Allocation by Time (Round Robin)

Not bad! Looking at the numbers, this particular test run had short bursts where one process got a couple more cycles than the other, but concurrency is a tricky thing, and it's pretty evident that the two processes are being treated roughly equally. Process 0 was the first to finish; when it reached 1000, Process 1 had run 995 times. The split is pretty even here.

Observed Schedule Allocation (Round Robin)

So, now we have a handle on how round robin should work and have a pretty good feeling that it works as intended in xv6. Let’s turn our attention to stride scheduling.

Stride Scheduling

I mentioned earlier that different scheduling disciplines are better or worse for different workload characteristics. Round robin is a very fair algorithm, treating all processes exactly the same. But, not all processes are the same. Some may be more important than others, and you might want to give them more CPU cycles than less important ones. This scenario is one where round robin falls short.

Other disciplines support proportional-share scheduling. In proportional sharing, you get to specify the relative importance of different processes. For example, you could say, “I want my web browser to get twice the CPU cycles of my email client.” Some of these disciplines rely on probability and randomness, giving more important jobs a higher chance of being picked but still choosing at random (this is known as a lottery scheduler). Stride scheduling, on the other hand, is deterministic, yet still fairly easy to implement and effective.

Stride Scheduling Theory

Stride scheduling relies on two critical bits of information about each process: the stride and the pass.

Every process starts out with some stride value that is greater than zero, as well as a pass set to 0. Each time the job is scheduled, the scheduler increases the pass by the value of the stride. When the scheduler runs, it checks the current pass value on all runnable processes, choosing to schedule the one with the lowest pass value. (In the event of a tie, it just chooses one, either at random or simply the first runnable process among those tied.)

In pseudocode, this looks something like:

scheduler: // runs in a continuous loop
  best process = none
  for each runnable process:
    if the current best process is none:
      this is now the best process
    else, if this process's pass value < the best process's pass:
      this is now the best process
    else, we already have a better option:
      move on to the next runnable process
  
  if the best process is still none:
    wait a short time and then start again at the top
  else:
    schedule the best process
    (process runs for a short time)
    increase this process's pass by its stride value
    start over again at the top
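
Translated into xv6-flavored C, this comes out looking something like the sketch below. It mirrors the shape of my stride() function, using the pass and stride fields I describe adding later, but the details (especially the locking) are one reasonable approach rather than the verbatim code from the commit:

// Sketch of a stride scheduler for xv6-riscv: scan the process table,
// keep the RUNNABLE process with the lowest pass (holding its lock),
// run it for a slice, then charge its stride to its pass.
void
stride(void)
{
  struct proc *p, *best;
  struct cpu *c = mycpu();

  c->proc = 0;
  for(;;){
    intr_on();          // let device interrupts make processes runnable

    best = 0;
    for(p = proc; p < &proc[NPROC]; p++){
      acquire(&p->lock);
      if(p->state == RUNNABLE && (best == 0 || p->pass < best->pass)){
        if(best)
          release(&best->lock);
        best = p;       // keep this candidate's lock held
      } else {
        release(&p->lock);
      }
    }

    if(best == 0)
      continue;         // nothing runnable; sweep the table again

    best->state = RUNNING;
    c->proc = best;
    swtch(&c->context, &best->context);

    // The process ran for a slice (or blocked); charge its stride to its pass.
    best->pass += best->stride;
    c->proc = 0;
    release(&best->lock);
  }
}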

The net effect of this is that processes with lower stride values get scheduled more frequently than those with higher stride values. As an analogy, consider two people walking together: one with very long legs and the other with very short legs (like me). These people have different strides. Intuitively, you can probably imagine that to cover the same distance, the person with the shorter stride will need to take more steps than the person with the longer stride. If each “step” corresponds to being scheduled on the CPU, then this is exactly what happens in stride scheduling.

How does the stride get set? A simple approach is to assign a priority to each process, then divide a fixed, larger number by that priority. For example, if the numerator were fixed at 10,000, a job with a priority of 50 would get a stride of 10000 / 50 = 200, whereas a higher-priority job with priority 100 would get a stride of 10000 / 100 = 100. The higher-priority job therefore gets the lower stride. If no priority is set, the system should choose a default stride.
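
In code, that calculation is just an integer division. The function below is illustrative only (uint64 is xv6's unsigned 64-bit typedef):

// Illustrative only: derive a stride from a priority by dividing a fixed
// numerator by the priority. Higher priority -> smaller stride -> runs more often.
uint64
stride_for_priority(int priority)
{
  return 10000 / priority;   // priority 50 -> 200, priority 100 -> 100
}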

Let’s consider, graphically, how this looks in theory. The table below shows a similar situation as we used in our round robin discussion, where two jobs are ready at the same time. But, now, process 1 has a higher priority, with a stride of 100, and process 2 has a lower priority, with a stride of 200.

Time  Process 1 pass (stride 100)  Process 2 pass (stride 200)  Choice
0     0                            0                            Process 1
1     100                          0                            Process 2
2     100                          200                          Process 1
3     200                          200                          Process 1
4     300                          200                          Process 2
5     300                          400                          Process 1
6     400                          400                          Process 1
7     500                          400                          Process 2
8     500                          600                          Process 1
9     600                          600

Before we look at the schedule allocation, let’s take a look at how the pass values change over this timespan.

Theoretical Pass Values over Time (Stride Scheduling)

In the diagram, every time a process’s line goes up, it means it was scheduled. Note that, generally, the process with stride 100 is scheduled twice as often as the one with stride 200. And, this is the goal, right? If the priority of one job is twice as high as another, we might expect that it ought to get scheduled twice as often as the other.

The following diagrams show this in terms of the number of times each gets scheduled in the example above.

Theoretical Schedule Allocation by Time (Stride Scheduling)

Theoretical Schedule Allocation (Stride Scheduling)

With the theory in place, let’s look at how I went about adding stride scheduling to xv6.

Adding Stride Scheduling to xv6

I approached adding stride scheduling to xv6 in two major steps:

  1. Add the necessary “infrastructure” for stride scheduling, but with all processes having equal strides.
  2. Add the ability to set the priority on different processes, and test this behavior.

Implementing Basic Stride Scheduling

The basic stride scheduling mechanisms can be found in this git commit.

I added stride and pass fields to the process data structure, struct proc in kernel/proc.h. I also defined a numerator (10000) and a default priority level (50) for stride settings in kernel/param.h. Using these defaults, I updated the process allocator, allocproc in kernel/proc.c, so that every new process starts with a pass of 0 and a stride of 200 (10000 / 50).
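
Roughly, those changes have the following shape (a sketch of the diff, with illustrative macro names, rather than the exact code):

// kernel/param.h (sketch): constants for stride scheduling.
#define STRIDE_NUMERATOR  10000
#define DEFAULT_PRIORITY  50

// kernel/proc.h (sketch): two new fields added to the existing struct proc.
//   uint64 stride;   // pass grows by this amount each time the process runs
//   uint64 pass;     // the scheduler picks the runnable process with the lowest pass

// kernel/proc.c, inside allocproc() (sketch): defaults for every new process.
p->pass = 0;
p->stride = STRIDE_NUMERATOR / DEFAULT_PRIORITY;   // 10000 / 50 = 200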

Next, it was time to create the actual scheduler. Rather than modifying the existing scheduler function, simply named scheduler, I created an alternative, which I called stride. The final step in the kernel's main function is to start the scheduler, so the stride scheduler can be selected just by updating that call in main.
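
Concretely, the swap is a one-line change at the end of kernel/main.c, something like:

// kernel/main.c (sketch): the boot sequence ends by entering the scheduler
// loop, so swapping schedulers is a one-line change.
void
main()
{
  // ... device, memory, and process initialization ...
  stride();     // was: scheduler();
}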

As for the mechanisms of the stride scheduler, it loops through all processes (which xv6 keeps in a fixed-size array), following the pseudocode shown above quite closely. In the commit linked above, the relevant bits can be found in kernel/proc.c from lines 504-543.

Setting Process Priority

The next step in the implementation involved adding a system call to allow setting the priority on a process, along with adding a user-space test program. These changes are available in this git commit.

Adding system calls to xv6 is relatively straightforward, although there's enough plumbing involved that I'm going to gloss over it for now; maybe I'll write about adding system calls another time. The important thing is that the setpriority system call winds up calling a new function, setpriority(pid, priority), in kernel/proc.c. The function is pretty simple: it first does some basic bounds checking (priority must be between 1 and 100), then finds the target process and updates its stride. If validation fails or the process can't be found, it returns an error.
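
A sketch of what that kernel-side function might look like, matching the description above (the exact code is in the commit):

// kernel/proc.c (sketch): set a process's priority by recomputing its stride.
// Returns 0 on success, -1 on bad input or if no such process exists.
int
setpriority(int pid, int priority)
{
  struct proc *p;

  if(priority < 1 || priority > 100)
    return -1;

  for(p = proc; p < &proc[NPROC]; p++){
    acquire(&p->lock);
    if(p->pid == pid){
      p->stride = 10000 / priority;   // numerator from kernel/param.h
      release(&p->lock);
      return 0;
    }
    release(&p->lock);
  }
  return -1;
}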

And, this is really about it for setting priorities. Next up, it was time to see how it performed.

Stride Scheduling in Practice

Using the same test program that I used to generate the observed reports for round robin, above, I experimented with different numbers of child processes and different priority levels. Each process runs a tight CPU loop with periodic printfs, so the scheduling behavior primarily reflects CPU allocation rather than I/O blocking.
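
The only real change to the test program is assigning priorities after forking; something like the fragment below, which reuses the hypothetical spin() helper from earlier and assumes a user-space setpriority(pid, priority) wrapper generated by the new system call plumbing:

// Inside the test program's main() (sketch): fork two children and raise
// the second one's priority so its stride is half the default.
if(fork() == 0)
  spin(0);                        // process 0: default priority 50 -> stride 200

if(fork() == 0){
  setpriority(getpid(), 100);     // process 1: priority 100 -> stride 100
  spin(1);
}

wait(0);
wait(0);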

In the diagram below - similar to the theoretical example earlier - I have one process with stride 100 and one with stride 200. If it works like the theoretical example, we’d expect the higher priority job to be scheduled about twice as much as the lower priority one, and therefore, it should progress roughly twice as fast. Let’s see how it did.

Observed Schedule Allocation by Time (Stride Scheduling)

Observed Schedule Allocation (Stride Scheduling)

VERY COOL! The allocation split was nearly exact (66.8% to 33.2%, compared to the theoretical 66.7% to 33.3%). And in the timeline, notice that when the lower-priority process, process 0, reaches one of the “round numbers” on the vertical axis, the higher-priority process is at roughly twice that value (for example, 50 to 100, 100 to 200, 400 to 800, etc.). I did cut the chart short after the high-priority job finished; once it completed, there was no more contention, so process 0 had as much CPU as it needed and finished quickly thereafter.

Future Considerations

Hacking on the xv6 scheduler was fun and helped me solidify my understanding of round robin and stride scheduling disciplines. I was proper chuffed to find that the behavior of my stride scheduler so closely matched the expected behavior of a theoretical scheduler in, frankly, not that many lines of code.

There were some things I left as loose ends during this process that I want to come back to later, though.

Pass Value Resets

First, there is something missing from my implementation that would matter in a real one. If a process were to run for a long time, its pass would grow and grow. As new processes started, even low-priority ones, their pass values would be so drastically lower than those of the long-running processes that the long-running ones would periodically “starve” until the newcomers completed or finally caught up.

So, an alternative implementation would address this behavior. One idea is to periodically reset all pass values to 0. This would keep the behavior more consistent, and it would also eliminate the possibility of integer overflow in pass values for very long-running processes. (The pass value is a 64-bit unsigned integer, so overflow is exceptionally unlikely, but it would be better to be correct.)
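
As a rough illustration of what such a reset could look like (this is not in my implementation; it just sketches the idea):

// Sketch only: zero every pass value so long-running processes aren't
// penalized relative to newcomers. Could be called every N ticks.
static void
reset_pass_values(void)
{
  struct proc *p;

  for(p = proc; p < &proc[NPROC]; p++){
    acquire(&p->lock);
    p->pass = 0;
    release(&p->lock);
  }
}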

Unexplained Init Process Lockup

The second, more interesting challenge involves some odd behavior I noticed while testing. In order to test the scheduler, I set the number of CPUs in my qemu machine to 1. (One processor and multiple processes makes behavior more deterministic and ensures there is contention, which forces the scheduling behavior you want to observe.)

When I first started up the machine with my stride scheduler, it just hung. The init process started, but it never finished starting up the shell. After some debugging, I found that the init process was going to sleep while trying to open the console (common for system calls) but was never waking up. The process context at the moment of the wakeup attempt was still the init process itself; however, there is a guard in the wakeup() function that prevents a process from waking itself. (Keen observers may have noticed that I removed this guard in the second commit, here.)
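
For reference, xv6-riscv's wakeup() looks roughly like this; the p != myproc() check is the guard in question:

// Wake every process sleeping on chan. The p != myproc() check skips the
// calling process, which is the guard I ended up removing.
void
wakeup(void *chan)
{
  struct proc *p;

  for(p = proc; p < &proc[NPROC]; p++){
    if(p != myproc()){
      acquire(&p->lock);
      if(p->state == SLEEPING && p->chan == chan)
        p->state = RUNNABLE;
      release(&p->lock);
    }
  }
}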

Very strangely, this behavior did not occur when starting with the default scheduler on one CPU. I don't think this is actually attributable to the stride scheduler itself, but rather to differences in scheduling timing. I want to trace this more deeply and determine why my stride scheduler exposed this deadlock while the round robin scheduler did not. For the time being, though, I skipped the guard check, and all appears to work properly.

Conclusion

If you’ve made it this far, thanks for following along with my notes from what was a fun lab experiment. I found this lab to be a great way to turn theoretical knowledge of schedulers into something concrete.

As I explore the future threads mentioned above, I plan to document those as well, especially the init process lockup. I also think at some point, I may try implementing some other scheduler types in xv6, like multi-level feedback queues or lottery schedulers. I’ll document those when I do, too.

I hope you’ve found this post useful. I’ll share notes from future experiments here as I go!