A few months ago, I decided to add a stride scheduler to the xv6 operating system, to get a better
understanding of OS schedulers. I wrote a blog post about the experience, including
my code and how it performed. Toward the end of the post, I mentioned some
odd behavior when running my scheduler on a single-core VM. In
particular, at bootup the OS would hang when starting the initial process. I was able to determine
that the process had gone to sleep, but wakeups were failing because when they fired, init was
still considered an active process. I worked around the issue by removing one of the wakeup
guard conditions and decided to “punt” on a deeper investigation.
Finally, I did take the time to figure out the issue! Spoiler alert: It was my fault (who could have guessed?), and the resolution was a one-liner fix. But, to troubleshoot, I was able to spend some quality time with GDB and explore tracing in QEMU, while also deepening my understanding of concurrency and scheduling.
In this post, I thought it would be worthwhile to document the process I followed, touch on some threads I pulled along the way, and capture both the pathology and the resolution. Further, I’ve recently realized that although I have spent a lot of time in my career troubleshooting, I have let the mental models I built along the way stay “intrinsic.” Several times, coworkers have asked me how I figured some random issue out - or, how I knew where to start looking - and all I really know is that it matches patterns I’ve seen in the past. But, that’s not really transferable, and I think I can become a better engineer by deliberately documenting interesting analyses. So, the post is also an attempt at just that.
Note: If you are interested in either following along or exploring on your own, I have added two commits to my xv6 repo. The first one re-introduces the lost wakeup condition. The second one fixes it.
Problem Presentation and Initial Debugging
As I developed the stride scheduler, I booted the OS several times along the way to ensure there were no panics or other unexpected errors. I was looking for crashes and did not adjust the configuration of the VM at that point. When it was time to actually test the behavior of the stride scheduler, though, I did need to make some VM changes. By default, the xv6 Makefile starts a VM with three CPUs. But, my goal for performance testing was to see how the stride scheduler behaved in the face of contention (more processes ready to run than there were processors). To simplify inducing contention, I switched to a single CPU so that I could execute my test program.
Note: If you want to follow along, you can easily change the number of CPUs by adjusting the environment variable the Makefile looks for. In other words, in your shell,
export CPUS=1 and then make qemu to start the system.
Excited to see how my scheduler performed, I ran make qemu and saw the familiar spew of build
logs followed by:
qemu-system-riscv64 -machine virt -bios none -kernel kernel/kernel -m 128M -smp 1 -nographic -global virtio-mmio.force-legacy=false -drive file=fs.img,if=none,format=raw,id=x0 -device virtio-blk-device,drive=x0,bus=virtio-mmio-bus.0
xv6 kernel is booting
… and then: nothing. Typically, you would quickly see a shell prompt ($) on the console at
this point, but the system appeared to be locked up.
Initial Debugging and Workaround
Admittedly, it’s been a few months now since I did this, and my analysis notes are mostly limited
to what is in the earlier blog post. Using printf and gdb, I identified that the init process
was sleeping and failing to wake up. Since init is responsible for starting all other userspace
processes, if it fails, then essentially the system has nothing to do and will hang. It
behaves a bit like an endless for loop of no-ops, as the scheduler cycles, looking for some
process to run.
At the time, I had an incorrect understanding of where init was getting stuck. Reverse engineering
from the blog post, I think I thought it was trying to open up a console file descriptor, which
is one of the first things that happens in user/init.c. What I know for certain, though, is that
I found that a guard condition in wakeup was preventing the process from being woken. Specifically,
when wakeup scans for sleeping processes, it will not select any process that it identifies as
currently active, but init was still considered active for some reason even though it was asleep.
Thinking this was to be expected for init at the time, I believed I had found some race condition
that was exposed by my scheduler due to how its behavior differed from the default round-robin one.
To add to my hypothesis, incorrect as it was, I looked at the git history on the wakeup function
and found something interesting. Here is how it looks currently:
void
wakeup(void *chan)
{
  struct proc *p;
  for(p = proc; p < &proc[NPROC]; p++) {
    if(p != myproc()){
      acquire(&p->lock);
      if(p->state == SLEEPING && p->chan == chan) {
        p->state = RUNNABLE;
      }
      release(&p->lock);
    }
  }
}
That if(p != myproc()) is where it failed because p did, in fact, equal myproc().
However, it once looked like this:
void
wakeup(void *chan)
{
  struct proc *p;
  for(p = proc; p < &proc[NPROC]; p++) {
    acquire(&p->lock);
    if(p->state == SLEEPING && p->chan == chan) {
      p->state = RUNNABLE;
    }
    release(&p->lock);
  }
}
(Here’s the commit that changed it.)
Wanting to move on to checking my scheduler’s performance, and seeing that the current-process
check wasn’t always there, I decided to revert wakeup to the way it once was. And,
lo and behold, the system booted, and I was able to move on. It didn’t feel like a satisfactory
solution, but I made a judgment call to stick with the workaround for the time being and to
come back at a later time to sort out the issue.
Getting Reacclimated
A few days ago, I finally took the time to begin a proper investigation of the issue. Since it had been several months, I needed to reacclimate myself to the situation. I read my notes, of course, but decided to start, not quite from square one, but maybe square two or three.
I reverted my previous change to wakeup and booted up with make qemu and confirmed the
system hang behavior was present. Then, it was time to start debugging.
Attaching the Debugger
The xv6 OS runs in QEMU, which offers nice integration with the GDB debugger. Furthermore, the xv6
authors even have a target in the project’s Makefile to start the system up for debugging -
make qemu-gdb. In essence, this target starts up the VM with QEMU listening for GDB connections
on a pseudo-random port (the -gdb tcp::<a port calculated in the Makefile>) and halts VM execution
until GDB is attached (the -S flag). By waiting for GDB to connect, you get an opportunity to
watch the OS boot up from the very beginning - quite useful!
Once QEMU is started, it’s as simple as running gdb from another terminal window in the project
directory, to get connected. (To be fair, it’s this simple because the Makefile produces a
.gdbinit file, which GDB will load automatically. But that file is pretty straightforward.)
I wanted to first confirm the hangup behavior while the debugger was attached, so I simply ran
continue from GDB to start the machine up. Again, the observed behavior was consistent - a message
saying the kernel was booting and then nothing. Hitting Ctrl+C to pause execution, then using
backtrace to see where it was after a few seconds, I saw that it was in the stride scheduler
loop. Continuing on and pausing again a few times, this was always the case. It wasn’t scientific,
but I wanted to get a feel for what was happening.
An Aside on xv6 Sleep and Wakeup
In xv6, when a process does a slow (relatively speaking) operation, such as interacting with
external devices like disks or UARTs, the kernel will set the interaction in motion and then put
the process to sleep. The process will be marked as SLEEPING, with a reference to what xv6 calls
a channel (a pointer to some struct of interest) to capture what the process is waiting
on. At this point, another process can be switched in and get some time to execute while the
external device is doing whatever it’s supposed to do. When the external device finishes doing what
it was asked, it sends an interrupt to the CPU, which ultimately winds up in a wakeup function,
which wakes any process that was waiting on that particular operation.
The function that puts a process to sleep is called sleep, and the function that wakes it up
is called wakeup.
Breaking Wakeup
Because I knew wakeup was what I had patched before, I decided that it would be good to investigate
it a bit. I restarted the machine and GDB, and did a break wakeup before continuing.
Note: I learned along the way that you can also do
monitor system_reset from GDB to restart the VM. This allows you to keep your GDB session, but to go back to the bootup process, which is very convenient! In fact, anything you can do from the QEMU monitor, you can do from GDB, because GDB will pass anything after monitor along to QEMU to execute.
The first break was on a timer interrupt. I restarted the machine several times throughout troubleshooting, breaking on wakeup each time, and found this was pretty consistently the first wakeup call.
Running continue again, the next wakeup was a bit different, and the backtrace looked
intriguing:
#0 wakeup (chan=0x80015ca8 <bcache+24>) at kernel/proc.c:646
#1 0x000000008000579c in virtio_disk_intr () at kernel/virtio_disk.c:321
#2 0x0000000080002642 in devintr () at kernel/trap.c:198
#3 0x0000000080002750 in kerneltrap () at kernel/trap.c:147
#4 0x000000008000522c in kernelvec ()
There are a couple of interesting things here. First, GDB tells us that the channel is an offset
from bcache, which is an in-memory cache of filesystem blocks. Additionally, the wakeup call is
coming from virtio_disk_intr; this is the interrupt handler for the disk driver. So, pretty
clearly, we’re trying to wake up a process that had been waiting on disk access.
Knowing that init would be the first (and at this point only) process in the proc list, I decided
to confirm it was sleeping with print proc[0]->state and, seeing that it was, what channel it
was waiting on using print proc[0]->chan. Indeed, GDB showed it was sleeping on
(void *) 0x80015ca8 <bcache+24>. So, this wakeup was, in fact, intended for init.
xv6 processes also have names, to aid in debugging. You can do
print proc[0]->name and see that this is "initcode\000\000\000\000\000\000\000".
At this point, I watched several more wakeup calls until it seemed as though all of them would
be timer interrupts. I wanted to make sure there wasn’t potentially some other wakeup to watch out
for as well. So, the next step was to dig into what was putting init to sleep.
Tip: Setting Calling Function Conditional Breakpoints
There are a lot of reasons why a function like wakeup could be called. For instance,
every clock interrupt will call it. But, clock interrupts aren’t particularly interesting for this
investigation. Using GDB, if you know a potential caller you either do or do not care about,
you can set conditional breakpoints to ease your debugging, such as in the following.
break wakeup
# GDB will print the breakpoint number
condition <the breakpoint #> !$_caller_is("clockintr")
Here, since I know the clock interrupt’s name is clockintr, I can ignore any times where that is
the function that called wakeup.
Watching QEMU Disk Operations
One of the personal goals I had for working on this problem was to also get a little more familiar
with tooling QEMU provides for observing VMs. So, around this same time - and before I started
looking at sleep - I decided to try tracing some disk operations in the QEMU monitor, to try
and get a little more information on what this disk interrupt was notifying the OS of.
In the QEMU monitor, I landed on the following as potentially interesting:
trace-event virtio_notify on
trace-event virtio_blk_rw_complete on
trace-event virtio_blk_handle_read on
trace-event virtio_blk_req_complete on
(To be fair, I found these in the QEMU monitor with trace-event virtio<TAB>. I’d like to spend
some time looking into other tracing capabilities in QEMU so I can observe my systems work more
deeply.)
Then, restarting the system, I saw the following in the QEMU console just as the GDB wakeup breakpoint hit:
virtio_blk_handle_read vdev 0x7fb5d1f72800 req 0x7fb5d2858980 sector 2 nsectors 2
virtio_blk_rw_complete vdev 0x7fb5d1f72800 req 0x7fb5d2858980 ret 0
virtio_blk_req_complete vdev 0x7fb5d1f72800 req 0x7fb5d2858980 status 0
virtio_notify vdev 0x7fb5d1f72800 vq 0x7fb5d2f6e000
I admit this wasn’t immediately obvious to me, but this is why I wanted to check out QEMU tracing:
to see how it works. If you read Anatomy of a Filesystem: xv6,
you may remember that the superblock is block number 1, beginning at byte 1024 of the xv6 filesystem.
QEMU virtio disks have sector sizes of 512B, so assuming they start numbering at 0, sector 2 would
begin at byte 1024. And, if reading two sectors (nsectors), that would suggest reading an entire
FS block. So, the QEMU trace suggests that it was asked to read the superblock from disk. We’ll
confirm this soon.
Breaking Sleep
OK, so next up was to see what was going on with sleep. I did a break sleep, reset the system,
and quickly saw this backtrace:
#0 sleep (chan=chan@entry=0x80015ca8 <bcache+24>,
lk=lk@entry=0x80021040 <disk+296>) at kernel/proc.c:615
#1 0x00000000800056c6 in virtio_disk_rw (b=b@entry=0x80015ca8 <bcache+24>,
write=write@entry=0) at kernel/virtio_disk.c:285
#2 0x0000000080002ca6 in bread (dev=dev@entry=1, blockno=blockno@entry=1)
at kernel/bio.c:99
#3 0x00000000800030d8 in readsb (sb=0x8001e350 <sb>, dev=1) at kernel/fs.c:35
#4 fsinit (dev=dev@entry=1) at kernel/fs.c:43
#5 0x00000000800018dc in forkret () at kernel/proc.c:601
#6 0x00000000800018b0 in myproc () at kernel/proc.c:89
Backtrace stopped: frame did not save the PC
Now, the “backtrace stopped” message seemed odd, and I confirmed myproc does not call forkret.
I believe these are artifacts of GDB not fully understanding how xv6 context switches happen, and
so these are spurious. The rest of the backtrace is legit, though.
So, what do we see here? Well, first off, the channel is the same as the wakeup channel we saw
earlier, so this certainly seems to be the sleep that we later try, unsuccessfully, to wake. We’re
going to come back to sleep, but also notice in the call stack virtio_disk_rw (there’s our
disk operation), bread (a buffered read - it will put data read from disk into the bcache
in-memory cache), and readsb (this function reads the superblock - that seems to match up with
that QEMU trace above).
I think it’s worth quickly digging into fsinit and forkret a little more deeply. forkret is
the first function called when a process is scheduled for the first time. The mechanism for this
is interesting. Whenever a new process is created, one of the key functions involved is allocproc.
This function looks for an open spot in xv6’s process table and allocates it, as the name suggests.
Just before returning, allocproc includes this bit of code:
p->context.ra = (uint64)forkret;
This line sets the return address for the process’s context to the address of forkret.
This means that when the scheduler first schedules the process, and it switches contexts, this is
where the process is set to begin.
forkret is a pretty small function. Essentially it ensures that, exactly one time - when the very
first process runs - the filesystem is initialized. Primarily, it initializes the transaction
log for the filesystem. As you might imagine, this also requires first reading the superblock, so
that it can figure out where the log blocks are located. Since init is the first process in
all cases, it will be the process to set up the filesystem. Once the filesystem has been
initialized, execution passes along to the userspace code of the application via usertrapret.
So, to put our sleep in context, init is trying to read the superblock in order to initialize
the filesystem. This is the sleep it never wakes up from.
Differential - Comparing to Round Robin
With this context in mind, we now know what init is trying to do. Since we know that the default
scheduler does not cause a system hang on a single CPU, it seems like a reasonable idea to compare
how things look when its corresponding wakeup is called. Now, knowing that the result of myproc is
what was causing the guard to trip, it seemed like this would be a particularly worthwhile
comparison. So, first, on the stride scheduler, I reset the system and broke on wakeup. Then, I
was able to get the process from GDB by having it evaluate myproc for me:
call myproc()
$9 = (struct proc *) 0x8000fe78 <proc>
At this point, I swapped out the stride scheduler with the default scheduler by updating kernel/main.c
to call scheduler instead of stride. I restarted the system; this one required an actual stop
and start because the code had to recompile. Then, I did the same thing:
call myproc()
$1 = (struct proc *) 0x0
Interesting! I wouldn’t expect the address to necessarily be identical, but 0 definitely
implies there’s something different going on. myproc returns a pointer to a struct proc, so
0 is a null pointer here. A null pointer in this context is meant to signify that there is no
process running. And, since this does not point to a real process, the guard will never hit in
this case.
And, this makes sense. When the OS is scheduling processes or choosing which process to wake up, there, by definition, is no process running. So, if I could figure out why the stride scheduler was behaving differently here, I reckoned the system hang would be resolved.
Exploring Context Switches
When the kernel needs to switch which process is running, it uses a function called swtch to
perform a context switch. swtch saves the registers of whatever kernel thread is running into a
block of memory and copies data out of a similar block of memory into the CPU registers. For
example, let’s take a look at the following bit of code from the round robin scheduler:
if(p->state == RUNNABLE) {
  p->state = RUNNING;
  c->proc = p;
  swtch(&c->context, &p->context);
We see that the scheduler has found a runnable process here, adjusts its state, makes a note of
which process is running on the CPU (this is ultimately how myproc can identify the running
process, by the way), and then it does a context switch. Whatever register values are being used
by the processor right at the time of the call to swtch will be saved to the context field of
the CPU struct, and the register values will be replaced by the data in the process’s context. At
this point, execution flow switches over to whatever the process had been doing before the last time
it got descheduled. Execution has left the scheduler, and the processor is doing something totally
different at this point.
In the case of the very first call to init, the next “thing” to do is forkret (this is what that
p->context.ra assignment set up earlier), so this is what the processor will do.
But, at some point, we have to get back to the scheduler. There is more code after swtch after
all. There are a few ways for this to happen in xv6, but it probably warrants finally looking at
sleep as an example of the general pattern:
void
sleep(void *chan, struct spinlock *lk)
{
  struct proc *p = myproc();
  acquire(&p->lock);
  release(lk);
  // Go to sleep.
  p->chan = chan;
  p->state = SLEEPING;
  sched();
  // Tidy up.
  p->chan = 0;
  // Reacquire original lock.
  release(&p->lock);
  acquire(lk);
}
There’s some lock coordination and then updating of state. Note that after the call to sched,
the next step is to clear the sleeping channel. Like swtch (actually, because of it), we’re
going to jump around a little bit before we get to that point, though. Let’s also look at sched:
void
sched(void)
{
  int intena;
  struct proc *p = myproc();
  if(!holding(&p->lock))
    panic("sched p->lock");
  if(mycpu()->noff != 1)
    panic("sched locks");
  if(p->state == RUNNING)
    panic("sched running");
  if(intr_get())
    panic("sched interruptible");
  intena = mycpu()->intena;
  swtch(&p->context, &mycpu()->context);
  mycpu()->intena = intena;
}
Note that sched also calls swtch, but this time, the arguments are reversed. It will store
the current register values on the process struct, replacing them with the values from the CPU
struct. And, when the processor moves on, we’re “teleported” back to where the scheduler left off.
With this understanding of how context switches work, let’s look at what the stride scheduler does at this point.
The Fix is In
All right, so in the scheduler, the swtch call starts up init, which ultimately winds up kicking
off a disk read, which calls sleep, which ultimately gets us back to the point in code right after
that first swtch call. If I had forgotten to clear the process on the CPU struct after this
context switch, we would see the system hang. Here’s what came just before and after the context
switch in the stride scheduler:
best->state = RUNNING;
c->proc = best;
swtch(&c->context, &best->context);
best->pass += best->stride;
release(&best->lock);
c->proc = 0;
This seems reasonable. The pass value is updated, at which point we’re done with the process struct, so we release its lock and then update the CPU to indicate there are no running processes.
To be fair, in the default scheduler, c is updated before the process lock is released. The shape
of that code is a little different from how the stride scheduler works, but it is a difference nonetheless.
The process lock shouldn’t impact updating c at all, but there’s no harm in holding the lock while
updating it either. I figured I’d test whether the release order was the culprit. So, I moved
c->proc = 0 to between swtch and release, started the machine up again, and…
… it worked! Yay, but also, boo. How could holding a lock on a process have any bearing on updating a cpu struct?
I Didn’t Mean to Interrupt
I figured it was time to look a little closer at release:
void
release(struct spinlock *lk)
{
  if(!holding(lk))
    panic("release");
  lk->cpu = 0;
  __sync_synchronize();
  __sync_lock_release(&lk->locked);
  pop_off();
}
I’ve removed some comments, but most of this relates to synchronization between CPUs and clearing
the lock atomically. But, what is pop_off doing?
void
pop_off(void)
{
  struct cpu *c = mycpu();
  if(intr_get())
    panic("pop_off - interruptible");
  if(c->noff < 1)
    panic("pop_off");
  c->noff -= 1;
  if(c->noff == 0 && c->intena)
    intr_on();
}
pop_off (and its counterpart, push_off) allow xv6 to track nested interrupt-disable requests.
When a process needs to temporarily disable interrupts, rather than toggling them at the global level,
it can use push_off. If interrupts were already disabled, there would be no perceptible difference in
behavior; however, the cpu struct would increment its noff. If, while interrupts were disabled by an
earlier push_off, the process needed to do some other action that required interrupts to be off, it
could use a second push_off, which would increment noff again. The next time pop_off was called,
noff would be decremented, but we wouldn’t want to suddenly re-enable interrupts, because they
weren’t enabled to begin with. Instead, another pop_off is required, which decrements noff again,
so that we can safely know whether to enable interrupts.
The state of interrupts prior to the first push_off is tracked in the CPU’s intena field
(presumably “interrupts enabled”). Only if that is not 0 when the final pop_off happens will
interrupts be enabled.
We’ve seen that release does a pop and, as you might imagine, its corresponding function,
acquire, does a push. Assuming that noff remained >= 1 from the time the process lock was
acquired to the time it was released, no interrupts could fire in the meantime.
So, my hypothesis became: since release calls pop_off, which re-enables interrupts, the disk
interrupt could fire between release and c->proc=0. At that point, c->proc would still
point at init, so wakeup’s guard condition would skip it.
Let’s use GDB to see if this is what’s happening! Because there are a few settings here, I added
the following to my .gdbinit file, so GDB would auto-load them. I’ve added comments inline to
describe each line’s purpose.
# Display whether the CPU will enable interrupts
# when noff becomes 0.
display cpus[0]->intena
# Display the nesting level of push_off/pop_off.
display cpus[0]->noff
# Watch any time the proc field changes on the first
# (only) CPU.
watch cpus[0]->proc
# Break directly after `swtch` in the stride scheduler.
break kernel/proc.c:539
# Break when the disk interrupt fires.
break virtio_disk_intr
Using the “broken” commit, I started up the system and traced execution with GDB.
(gdb) c
Continuing.
Hardware watchpoint 1: cpus[0]->proc
Old value = (struct proc *) 0x0
New value = (struct proc *) 0x8000fe78 <proc>
stride () at kernel/proc.c:538
538 swtch(&c->context, &best->context);
1: cpus[0]->intena = 1
2: cpus[0]->noff = 1
(gdb) c
Continuing.
Breakpoint 2, stride () at kernel/proc.c:539
539 best->pass += best->stride;
1: cpus[0]->intena = 1
2: cpus[0]->noff = 1
(gdb) n
543 release(&best->lock);
1: cpus[0]->intena = 1
2: cpus[0]->noff = 1
(gdb) n
Breakpoint 3, virtio_disk_intr () at kernel/virtio_disk.c:296
296 {
1: cpus[0]->intena = 1
2: cpus[0]->noff = 0
(gdb) c
Continuing.
Hardware watchpoint 1: cpus[0]->proc
Old value = (struct proc *) 0x8000fe78 <proc>
New value = (struct proc *) 0x0
stride () at kernel/proc.c:502
502 intr_on();
1: cpus[0]->intena = 0
2: cpus[0]->noff = 0
Pay special attention to when I “next”-ed over the release call. Ordinarily, the next line would
be whatever came after the previous one. But here, instead, execution has switched over to the
interrupt handler. Only after continuing past that do we see the proc value switch from init to 0.
Since intena was 1, the pop_off in the release call stack enabled interrupts, allowing the
disk to fire.
OK, that seems to support the theory. Now, for good measure, I flipped over to the fixed commit to see how it behaved.
(gdb) c
Continuing.
Hardware watchpoint 1: cpus[0]->proc
Old value = (struct proc *) 0x0
New value = (struct proc *) 0x8000fe78 <proc>
stride () at kernel/proc.c:538
538 swtch(&c->context, &best->context);
1: cpus[0]->intena = 1
2: cpus[0]->noff = 1
(gdb) n
Breakpoint 2, stride () at kernel/proc.c:539
539 c->proc = 0;
1: cpus[0]->intena = 1
2: cpus[0]->noff = 1
(gdb) n
Hardware watchpoint 1: cpus[0]->proc
Old value = (struct proc *) 0x8000fe78 <proc>
New value = (struct proc *) 0x0
stride () at kernel/proc.c:540
540 best->pass += best->stride;
1: cpus[0]->intena = 1
2: cpus[0]->noff = 1
(gdb) n
544 release(&best->lock);
1: cpus[0]->intena = 1
2: cpus[0]->noff = 1
(gdb) n
Breakpoint 3, virtio_disk_intr () at kernel/virtio_disk.c:296
296 {
1: cpus[0]->intena = 1
2: cpus[0]->noff = 0
And, there we have it! By moving the c->proc = 0 assignment above the release, we have ensured
that this happens before the disk interrupt can fire.
The best kind of bug fix is one that makes sense!
Lessons Learned
One of the interesting things about observing OS behavior is that execution doesn’t flow in the
same way that application code does. For instance, forkret acts as the initial application
entrypoint only because it is injected into the return address register for all new processes.
Similarly, context switches also rely on swapping state between registers and memory directly,
causing execution to jump in ways that are much different from synchronous application code. In the
case of swtch, I’ve come to think of it as a function that doesn’t return to the caller, like exec
(which, as I’m writing this, I realize also doesn’t return because it replaces process memory, similar to
swtch in a sense). Unlike exec, though, you can return to the prior execution state by calling
swtch a second time. You have to evaluate both “ends” of such processes to understand them,
rather than relying on, for instance, a simple backtrace.
Additionally, I tried to document some of the neat things I found with GDB and QEMU monitors in the post, and I learned that they both have some very interesting capabilities I wasn’t previously aware of. GDB’s conditional breakpoints are more nuanced than I expected, and I also learned how to define commands to run when hitting breakpoints. This is useful, for instance, to automatically print backtraces every time a breakpoint is hit, as a small quality-of-life improvement. And, QEMU traces offer a nice way to see events happening at the (virtual) hardware layer, which I suspect will come in very handy. At the very least, I know it’s an option I can reach for in the future.
Lastly, I knew I wanted to document this debugging effort, so I made some notes along the way. But, I feel like I can improve here. I’d like to get more rigorous in how I investigate issues. I have developed some lab guides I use for experimentation. I think it would be worth expanding on these a bit, with a template for debugging sessions. I’ve begun looking at processes that professionals in other domains (science and medicine, in particular) use to track such efforts. I plan to try some of those techniques and will build something based on what I find works for me. Much of the work I do is proprietary, so I can’t really write much publicly, but I do want to use this for personal learning efforts, and these lab notes will help me write up additional summaries like this blog post in the future.
Conclusion
All in all, this turned out to be an “easy” fix, but the exploration was really fascinating. I tried to describe the approximate journey I took in figuring this out, but of course, there were little detours along the way, for instance, to find GDB commands to do exactly what I wanted. I mention this only because it’s not like I knew exactly the path from start to finish beforehand. Debugging is always a bit exploratory in nature. But I hope this has been an interesting story! I plan on documenting more of these in the future when I encounter them.