We said in the last post that system calls are the primary means for user processes to request some action by the kernel; system calls mediate processes' access to hardware resources.
If a user process wants to make a system call, it triggers a trap with the
trap number reserved for system calls, identifies which of the various xv6
system calls it wants to run, and passes any required arguments. The processor will then
handle the trap instruction using the code we saw in the last post. Eventually,
it'll get to the trap()
function, which will recognize the trap number as a
system call and pass it on to the syscall()
function.
syscall()
is itself a routing function like trap()
; it'll figure out which
system call the process requested and redirect it to the appropriate kernel
code.
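As a quick reminder from the last post, the check at the top of trap() that hands system calls off to syscall() looks roughly like this:
void trap(struct trapframe *tf)
{
  if (tf->trapno == T_SYSCALL) {
    if (myproc()->killed)
      exit();
    myproc()->tf = tf;   // stash the trap frame so syscall() can get at the saved registers
    syscall();
    if (myproc()->killed)
      exit();
    return;
  }
  // ... handle all the other trap numbers ...
}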
All system calls use the same trap number: 64, or T_SYSCALL
, but xv6 has
multiple system calls, so we need another number for a process to identify which
system call it wants to run. The convention on x86 is to use a system call
number which the calling process should put in the %eax
register, which
usually holds return values. Then the kernel's handler function (here,
syscall()
) can just check %eax
to figure out which system call to run. The
system call numbers are defined in
syscall.h. There
you can see that, e.g. SYS_fork
is defined as 1, SYS_exit
is 2, and so on.
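For reference, the top of syscall.h looks something like this, with the numbers just counting up from 1 in the same order we'll see in the function pointer array below:
// System call numbers
#define SYS_fork    1
#define SYS_exit    2
#define SYS_wait    3
// ... and so on, up through SYS_close at 21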
All the system call functions are defined in other files, so we'll have to
import their declarations with the extern
keyword:
// ...
extern int sys_chdir(void);
extern int sys_close(void);
extern int sys_dup(void);
extern int sys_exec(void);
extern int sys_exit(void);
extern int sys_fork(void);
extern int sys_fstat(void);
extern int sys_getpid(void);
extern int sys_kill(void);
extern int sys_link(void);
extern int sys_mkdir(void);
extern int sys_mknod(void);
extern int sys_open(void);
extern int sys_pipe(void);
extern int sys_read(void);
extern int sys_sbrk(void);
extern int sys_sleep(void);
extern int sys_unlink(void);
extern int sys_wait(void);
extern int sys_write(void);
extern int sys_uptime(void);
// ...
Now we've got the numbers and the functions. Note that the numbers start with
uppercase SYS_
and the functions start with lowercase sys_
, so make sure
your kernel hacking adventures don't do anything like SYS_fork()
; use
sys_fork()
instead.
We'll also need a way to map the numbers to those system call functions so that
syscall()
can call the right one depending on the number. We could use another
switch
statement like we did in trap()
, but there are 21 system calls here,
so that would get pretty long; also, each number will just call the specific
function, unlike the different trap numbers which required different responses
(e.g., the timer interrupt trap number didn't call any function at all). xv6
does something else this time that's much simpler and more elegant, but it uses
some slightly-obscure C features, so we'll go over it carefully.
Remember function pointers from way back in the boot loader? Functions are just
a set of instructions in order, loaded somewhere in the kernel's code segment,
so C lets us use the function's name as a pointer to the beginning of its code
in memory. So if we have a C function like int func(char c)
, then func
is
its function pointer. We could even assign it to a variable; that variable's
type would be a pointer to a function of argument type char
and return type
int
; then we could call the function using the new pointer too. Here's an
example that would print "Match!" to the screen:
int m = func('a');
int (*func_ptr)(char) = &func;
int n = (*func_ptr)('a');
if (m == n) {
printf("Match!\n");
}
So instead of a big old switch
statement, the syscall()
function will use a
static, global array of pointers to all the system call functions we just
imported above. (Here the static keyword on a file-scope variable means the
array is only visible inside this file; and like any global, it occupies a
fixed place in memory.) It'll work because all the
functions have the same argument type (void
) and return type (int
), so their
pointers all have the same type and can fit inside a single array. Then we can
get the right function by just using the system call number to index into the
array of function pointers.
Now, we'd have to be super careful to add the function pointers into the array
in the right order so that the indices match up. Even worse, there is no system
call with number zero, so we'd have to skip that entry of the array. This could
get complicated. Luckily, even though humans are bad at this kind of thing,
computers are really good at it. So instead of trying to line them up by hand,
we can use the array notation we saw in procdump() back in the post on
processes, where we specify the value of each entry of an array by putting its
index in square brackets, like this:
int arr[] = { [2] 5, [0] 1, [4] -2 };
The C compiler will use the indices we wrote there to figure out that the array needs 5 entries (indices 0 to 4), and entry 0 is 1, entry 2 is 5, and entry 4 is -2. Entries 1 and 3 will just be initialized to zero.
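Here's a tiny standalone example (made-up names, not xv6 code) that puts the two ideas together: a dispatch table of function pointers built with designated initializers and indexed by a number. The only cosmetic difference is that xv6 uses the older GNU spelling without the = sign, like in the arr example above.
#include <stdio.h>

static int say_hi(void)  { printf("hi\n");  return 0; }
static int say_bye(void) { printf("bye\n"); return 0; }

// Entry 0 is skipped, so it gets initialized to a null pointer
static int (*handlers[])(void) = {
  [1] = say_hi,
  [2] = say_bye,
};

int main(void)
{
  int num = 2;
  int nelem = sizeof(handlers) / sizeof(handlers[0]);
  if (num > 0 && num < nelem && handlers[num]) {
    handlers[num]();   // calls say_bye(), which prints "bye"
  }
  return 0;
}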
So at the end of the day, our array of pointers to system call functions looks like this:
// ...
static int (*syscalls[])(void) = {
  [SYS_fork]    sys_fork,
  [SYS_exit]    sys_exit,
  [SYS_wait]    sys_wait,
  [SYS_pipe]    sys_pipe,
  [SYS_read]    sys_read,
  [SYS_kill]    sys_kill,
  [SYS_exec]    sys_exec,
  [SYS_fstat]   sys_fstat,
  [SYS_chdir]   sys_chdir,
  [SYS_dup]     sys_dup,
  [SYS_getpid]  sys_getpid,
  [SYS_sbrk]    sys_sbrk,
  [SYS_sleep]   sys_sleep,
  [SYS_uptime]  sys_uptime,
  [SYS_open]    sys_open,
  [SYS_write]   sys_write,
  [SYS_mknod]   sys_mknod,
  [SYS_unlink]  sys_unlink,
  [SYS_link]    sys_link,
  [SYS_mkdir]   sys_mkdir,
  [SYS_close]   sys_close,
};
// ...
Okay great, now we're ready to route system calls to the right function.
The first thing we need to do is get the system call number so we can figure out
which function to call. We said above that the x86 convention is to store it in
the %eax
register, but we might have a problem: by the time we get to syscall(), the
processor has already run the vector for trap number T_SYSCALL, which jumped to
alltraps(), which in turn called trap(); all that kernel code has been using
the registers for its own purposes, so the system call number is probably long
gone from %eax.
But wait, all is not lost! alltraps()
saved all the registers in a
struct trapframe
for the current process. So we can just read the value of
%eax
from there. Whew, that was some good forward-thinking.
void syscall(void)
{
  struct proc *curproc = myproc();
  int num = curproc->tf->eax;
  // ...
}
Now we just need to do one more thing: call the function that corresponds to that number. We're gonna use the array of function pointers above, but we have to be careful: this number was given to us by a user process. A malicious user process might pass in an invalid number in the hopes of getting the kernel to carry out some undefined behavior which might lead to an easy exploit. So in order to keep up good security practices, the kernel should always distrust anything originating from user code and handle it carefully, preferably with three-inch-thick lead-lined gloves. So let's think about it: what might go wrong?
First of all, any entries that weren't explicitly initialized above (including the 0 entry) will have been automatically initialized to zero, i.e. a null pointer. Also, a number that's bigger than the highest system call number will make us do an out-of-bounds read from the array; we'd end up treating whatever kernel data happens to sit after the array in memory as a function pointer and jumping to it, possibly executing arbitrary kernel code. So we should check that (1) the number is greater than 0, (2) it's smaller than the number of elements in the array, and (3) the entry it points to is not a null pointer.
Finally, the %eax
register is usually used in x86 to store return values, so
we'll put the return value of the system call function there. If any of the
above checks failed, we'll just print a message to the console and return -1 to
indicate failure.
void syscall(void)
{
  // ...
  if (num > 0 && num < NELEM(syscalls) && syscalls[num]) {
    curproc->tf->eax = syscalls[num]();
  } else {
    // Invalid system call number
    cprintf("%d %s: unknown sys call %d\n", curproc->pid, curproc->name, num);
    curproc->tf->eax = -1;
  }
}
The system call handler function's return value gets stored in the trap frame's
copy of %eax, which will be loaded back into the real register when the trap
frame is restored; after that, syscall() will return to the line below where it
was called in trap().
After executing the rest of the code there, trap()
will return into trapret()
,
which ends with an iret
(interrupt return) instruction to tell the processor
to switch to user mode and resume executing the process's code.
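As a reminder, trapret looks roughly like this in trapasm.S: it pops all the saved registers (including that %eax we just set) back off the trap frame before the iret.
.globl trapret
trapret:
  popal
  popl %gs
  popl %fs
  popl %es
  popl %ds
  addl $0x8, %esp  # skip the trap number and error code
  iret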
Take a look at the sys_
functions we imported above: they all have argument
type void
. But if you think about it, many system calls need an argument: for
example, open()
needs to know which file to open, chdir()
needs to know
which directory to open, kill()
needs a PID to know which process to kill,
etc. So why did we make them all have argument type void
?
The trouble is that until we get the system call number in syscall()
above, we
have no way of knowing which function we'll need. And each function takes
arguments with different types, e.g. open()
might need a string for the file
to open but kill()
might need an integer for the PID. So there's no way for
the kernel to know which arguments to expect in syscall()
, even though the
arguments were already pushed on the stack. The task of recovering the arguments
from the stack will have to fall to each of the sys_
functions. But let's go
ahead and make their lives a little easier by setting up some nice helper
functions now.
The system call functions might take integers, strings, or pointers, so we'll
need functions to fetch each of those types. fetchint()
is one example; it
takes a user virtual address (an integer argument's location in memory) and a
pointer to an integer where we can store the integer we find. Then it returns 0
if it was able to find it, or -1 if it failed.
Just like syscall()
above, we need to treat anything passed from user space
with extreme caution. A user process that tries to read or write memory outside
its address space will cause a segmentation fault or page fault and be killed,
but the kernel has free rein over memory, so a malicious process might try to
trick the kernel into doing that for it by putting its "argument" outside of
the user's address space. So we have to start by checking that all 4 bytes of
the integer are inside the process's address space.
int fetchint(uint addr, int *ip)
{
  struct proc *curproc = myproc();
  if (addr >= curproc->sz || addr + 4 > curproc->sz) {
    return -1;
  }
  // ...
}
Now we can just cast the address to a pointer, dereference it, and store the
value in *ip
.
int fetchint(uint addr, int *ip)
{
  // ...
  *ip = *(int *)(addr);
  return 0;
}
Note that we can dereference an address like addr directly, even though it's a
user virtual address in the lower half of memory, because traps don't perform a
full context switch; we're still using the process's page directory even though
we're in kernel mode (ring 0). If we
had switched to a kernel page directory, we'd have to call walkpgdir()
or
uva2ka()
to figure out the corresponding kernel virtual address for addr
.
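Just to make the contrast concrete, here's a rough sketch (mine, not xv6's) of what the read might look like if we did have to translate addr with uva2ka() first; it ignores the case of an int that straddles a page boundary.
int fetchint(uint addr, int *ip)
{
  struct proc *curproc = myproc();
  if (addr >= curproc->sz || addr + 4 > curproc->sz)
    return -1;
  // uva2ka() takes a page-aligned user virtual address and returns the kernel
  // virtual address of that page (or 0 if it's not mapped as a user page), so
  // we round down to the page, translate, then add the offset back in
  uint base = PGROUNDDOWN(addr);
  char *ka = uva2ka(curproc->pgdir, (char *)base);
  if (ka == 0)
    return -1;
  *ip = *(int *)(ka + (addr - base));
  return 0;
}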
Now hopefully, if you've taken anything away from my past rants about undefined behavior in C, you noticed something wrong with this function. If you didn't, take another look; I'll wait.
Did you see it? We're dereferencing addr
without checking that it's not null,
so if the user passed in a null address, we'd dereference a null pointer! We
also dereference ip
without a similar check, but at the very least ip
is
passed in by the kernel.
This could be very dangerous -- in general, it's undefined behavior in C, but now that we've seen the code for handling traps, we're actually at a point where we can figure out what would happen in xv6 if a null pointer gets dereferenced, so let's take the opportunity to think about it for a bit.
First, what would happen if the kernel dereferenced a null pointer? Well, if the
kernel is currently using kpgdir as a page directory, the address 0 isn't
mapped to anything, so when the paging hardware goes to figure out which
physical address corresponds to virtual address 0, it would fail and generate a
page fault (trap number 14, or T_PGFLT). That would start running the trap
handler code, which would eventually get to the switch statement in trap() (see
the last post). Trap number 14 would fall under the default case, and the if
statement there would recognize that it originated in the kernel. So it would
print an error message to the console, then panic.
Okay, what if we're using a process's page directory, e.g. during a system call?
Address 0 is in the lower half of memory, so it's a user virtual address. The
result will depend on whether that page and its page table are mapped in the
process's page directory. If they are, then dereferencing a null pointer might
be fine after all. But if they're not mapped, dereferencing a null pointer will
cause a page fault. This time, trap() would print an error message to the
console, then mark the process to be killed.
Now, killing a process or causing a kernel panic might not sound like a huge deal. In fact, xv6 does a great job here by killing a process that might have dereferenced a null pointer or caused the kernel to do so. A kernel panic would be much worse -- think about how annoying it would be if that PDF you downloaded from that one sketchy website installed some malware that made your kernel panic all the time -- the OS would become unusable. In fact, this is an example of a "denial of service" vulnerability -- a malicious process might not be able to read or write arbitrary memory or execute arbitrary code, but it can still keep you from using your machine the way you expect to.
Just like uva2ka()
, this function will only get called by one other function
(we'll see it soon), so it just so happens that under the current xv6 code,
it'll all be okay because it should never get passed a null pointer. But
everything from my rant about uva2ka()
applies here: if you add any kernel
code that calls this function, be VERY careful and add your own null checks.
Okay, deep breath now. /rant.
Fetching a string argument is tricky too; strings in C are just pointers to an
array of characters that ends in nul, i.e. '\0'
, so this time we have to make
sure that both the pointer and the entire string are in the user's address
space; otherwise, we could unwittingly read from some arbitrary memory location
and pass the data back to the user process.
So we'll start by making sure the pointer itself is in a valid address:
int fetchstr(uint addr, char **pp)
{
  struct proc *curproc = myproc();
  if (addr >= curproc->sz) {
    return -1;
  }
  // ...
}
Now we'll store the string pointer in *pp
. We'll also get a pointer to the end
of the process's virtual address space so we can make sure the entire string is
inside its bounds.
int fetchstr(uint addr, char **pp)
{
  // ...
  *pp = (char *) addr;
  char *ep = (char *) curproc->sz;
  // ...
}
How can we check if the entire string is inside user memory? Well, a string ends
with a nul byte, '\0'
, so we just have to start scanning the memory starting
from *pp
up to ep
until we find a zero byte. If we find one in that range,
then the entire string is in user memory and we can return its length to
indicate success; otherwise the string overflows past the end of the process's
virtual address space, so we should return -1 to indicate failure.
int fetchstr(uint addr, char **pp)
{
  // ...
  // Scan for a nul byte inside process's address space
  for (char *s = *pp; s < ep; s++) {
    if (*s == 0) {
      // If nul byte found, return the length
      return s - *pp;
    }
  }
  // String is not nul-terminated inside process's memory, so report failure
  return -1;
}
Note that again, we're dereferencing pp
and addr
without any null checks
(where addr
is definitely the bigger concern, since it's user-generated), and
again, it's gonna work out okay (a misbehaving process will just get killed),
but once more: be careful if you use this function for your own kernel hacks.
argint() is the main function that the sys_ system call functions will use to
recover an integer argument; it's basically just a wrapper for fetchint(). The
arguments are an integer n
to say we want the nth integer argument, and a
pointer ip
to store the recovered argument in. We have to call fetchint()
with an address argument, so the main task now is to figure out where in memory
the nth integer argument should be.
We're gonna have to use the x86 function call conventions again. Remember how
whenever we call a function in x86, its arguments get pushed onto the stack in
reverse order (i.e., from right to left), so that the first argument ends up at
the top of the stack (i.e., at the lowest memory address)? Then the call
instruction pushes a return address (%eip) on top of that. (A regular C function
would go on to push the old base pointer %ebp too, but the functions user
programs actually call for system calls are tiny assembly stubs -- we'll meet
them at the end of this post -- which trap right away without touching the
stack.) Normally the callee would just keep using the stack from there, but in
this case the code in alltraps() saved all the registers (including the stack
pointer %esp) in a struct trapframe before calling trap() or syscall(). That
means we can recover the old value of %esp from the trap frame and look one
slot past it on the stack (i.e., 4 bytes higher in memory, since ints are 4
bytes): skipping over the return address lands us right on the first (n = 0)
argument. The second argument (n = 1) would be 8 bytes higher than %esp, and so
on. Pretty neat.
Okay, now that we've got that down, the code for this function is pretty straightforward.
int argint(int n, int *ip)
{
  // The saved user %esp points at the return address pushed by the call to the
  // system call stub, so the first argument lives 4 bytes above it, the second
  // 8 bytes above, and so on.
  return fetchint((myproc()->tf->esp) + 4 + 4*n, ip);
}
Some of the system call functions will have pointer arguments, so argptr()
recovers them. Pointers are 4 bytes in x86, so we can use argint()
to get the
pointer itself before performing some additional checks to make sure the pointer
and the address it points to are valid.
The arguments are n
(to retrieve the nth function argument), a pointer pp
to
an address where we can store the retrieved pointer, and the size of the block
of memory that the retrieved pointer points to.
Let's start off by just retrieving the value of the pointer as an integer using
argint()
; that'll make sure that the number n
is valid.
int argptr(int n, char **pp, int size)
{
  struct proc *curproc = myproc();
  int i;
  if (argint(n, &i) < 0) {
    return -1;
  }
  // ...
}
Now we have to make sure that the pointer we just retrieved is itself valid, i.e. that the size is nonnegative and the beginning and end of the memory block it points to are both within the process's address space.
int argptr(int n, char **pp, int size)
{
  // ...
  if (size < 0 || (uint) i >= curproc->sz || (uint) i + size > curproc->sz) {
    return -1;
  }
  // ...
}
Finally, we can store the pointer in *pp
and return 0.
int argptr(int n, char **pp, int size)
{
  // ...
  *pp = (char *) i;
  return 0;
}
A string is just a pointer in C, so we can recover the pointer's value using
argint()
again, then pass it to fetchstr()
. The former will make sure n
is
valid, and the latter will make sure the string is nul-terminated and resides
entirely in the process's address space.
int argstr(int n, char **pp)
{
  int addr;
  if (argint(0, &addr) < 0 && 0) {
    return -1;
  }
  if (argint(n, &addr) < 0) {
    return -1;
  }
  return fetchstr(addr, pp);
}
So now we know how syscall()
will route a system call trap to the right sys_
function, and we've seen how those functions can recover arguments from the
process's stack. Let's see some examples in action; most of these will be simple
wrapper functions.
All the hard work here is gonna be done by fork()
, which will create a new
child process by cloning the parent process's virtual address space. We don't
need any arguments for this, so we'll just call fork()
.
int sys_fork(void)
{
  return fork();
}
exit()
closes out a process, but it puts it in the ZOMBIE
state so that the
parent process can call wait()
to find out it's done running. exit()
should
never return, so we'll add a return value here to make the compiler happy, but
it should never get executed.
int sys_exit(void)
{
  exit();
  return 0;
}
This system call is the parent process's counterpart to exit()
; it'll do as
its name says and wait until the child process exits.
int sys_wait(void)
{
  return wait();
}
The kill()
system call sounds like a more aggressive version of exit()
:
after all, we're killing another process against its will, right? But in reality
it would be way too complicated to do that: the process might be running on
another CPU, midway through updating some kernel data structure, or about to
wake up another process that's asleep. Killing it by force might screw up a lot
of other things.
So instead kill() just sets the killed field in the process's struct proc;
eventually either the process will call exit()
on its own, or it'll generate
another trap, at which point the code in trap()
will call exit()
on it.
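The code in trap() that deals the final blow runs just before returning to user mode and looks roughly like this:
// Near the end of trap(): if this process has been marked as killed and
// we're headed back to user mode, finish it off now
if (myproc() && myproc()->killed && (tf->cs & 3) == DPL_USER)
  exit();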
kill()
needs an integer argument: the process ID for the process we wish to
kill. So now we can see the payoff of writing those functions above.
int sys_kill(void)
{
  int pid;
  if (argint(0, &pid) < 0) {
    return -1;
  }
  return kill(pid);
}
The getpid()
system call is so simple that it doesn't even have another
function for this sys_getpid()
to call. We'll just return the PID for the
current process.
int sys_getpid(void)
{
  return myproc()->pid;
}
If you're not familiar with system calls like brk()
and sbrk()
on Unix
systems, here's what they do: they grow or shrink the virtual address space of a
process. brk()
sets its new size to a specific maximum address; sbrk()
grows
or shrinks the process by a certain size in bytes and returns its old size.
They're mostly used to implement higher-level memory management functions like
malloc()
. Heh, "high-level" probably isn't high on your mind when you think of
adjectives for malloc()
, right? Anyway, xv6 only has sbrk()
, so let's check
out its sys_
wrapper function.
We'll need an integer argument (the number of bytes to grow or shrink by), so let's grab that.
int sys_sbrk(void)
{
  int n;
  if (argint(0, &n) < 0) {
    return -1;
  }
  // ...
}
Now we can use growproc()
from our posts on paging to grow (or shrink) the process by n
bytes. But we want to return the old size, so we'll have to grab that before we
change it with the call to growproc()
.
int sys_sbrk(void)
{
  // ...
  int addr = myproc()->sz;
  if (growproc(n) < 0) {
    return -1;
  }
  return addr;
}
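From the user side, that return value is handy: since the process's address space starts at 0, the old size is also the address where the new memory begins, so a program can start using it right away. Here's a sketch of what that might look like in an xv6 user program (user.h declares sbrk() as returning a char *):
#include "types.h"
#include "stat.h"
#include "user.h"

int main(void)
{
  // Ask the kernel for 4096 more bytes; sbrk() returns the old size,
  // which is also the start of the newly allocated region
  char *p = sbrk(4096);
  if (p == (char *)-1) {
    printf(1, "sbrk failed\n");   // xv6's user printf takes a file descriptor
    exit();
  }
  p[0] = 'A';   // the new memory is ours to scribble on
  printf(1, "new memory starts at %d\n", (int)p);
  exit();
}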
The sleep()
function is pretty interesting; we'll get to the implementation
details later, but let's talk about the broad strokes now. You might be familiar
with the sleep()
system call in Unix systems; you pass it an integer (a number of seconds there;
in xv6 it'll be clock ticks) and it puts your process to sleep (i.e., leaves it
inactive or not running) for that amount of time.
However, sleep()
plays a dual role in xv6: the kernel will call sleep()
for
processes that need to wait while something else happens, e.g. waiting for a
disk to read or write data. That way the processes don't end up idly spinning in
a loop or something and wasting valuable CPU time.
Implementing that is tricky; there's no way to know how long it would take for
whatever condition the process is waiting on to be satisfied, so it's not like
we can just stick in a random amount of time in the call to sleep()
and hope
the condition is satisfied by then. So instead the sleep()
function will just
"put a process to sleep" (read: make its state SLEEPING
so it can't be run by
the scheduler) on a channel, which is just an arbitrary integer. Then later on
the kernel can wake up any processes sleeping on that channel. So for example,
the kernel can put a process waiting on the disk to sleep using a specific
channel that's assigned to the disk; then when the next disk interrupt occurs it
can wake up any processes that might be sleeping on the disk channel.
Okay so that's all well and good for the kernel's use of sleep()
. But what
about the regular old sleep()
system call? The argument is an integer that
represents the number of ticks to sleep for; how are we gonna turn that into a
channel to sleep on?
The answer is pretty neat (at least I think so): we'll set the channel to the
address of the ticks
counter. Remember, ticks
is a global variable that gets
incremented with every timer interrupt. Go check out the code in trap()
again:
each timer interrupt sends a wakeup call to any processes that might be sleeping
on the &ticks
channel. That should wake the process at every timer interrupt.
Then we'll just stick that inside a while loop so the process keeps going back
to sleep until the right number of ticks have passed.
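Here's roughly what that wakeup looks like in trap()'s switch statement, as a reminder:
case T_IRQ0 + IRQ_TIMER:
  if (cpuid() == 0) {
    acquire(&tickslock);
    ticks++;
    wakeup(&ticks);   // wake any process sleeping on the &ticks channel
    release(&tickslock);
  }
  lapiceoi();
  break;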
Let's start by retrieving the integer argument, which is the number of ticks to sleep for.
int sys_sleep(void)
{
  int n;
  if (argint(0, &n) < 0) {
    return -1;
  }
  // ...
}
That argument n
is a relative count, since a user process won't necessarily
know how many ticks have already gone by. So let's get the current tick count
before we put the process to sleep.
int sys_sleep(void)
{
  // ...
  acquire(&tickslock);
  uint ticks0 = ticks;
  // ...
}
Now we just have to write that while loop I mentioned above to put the process
to sleep until n
ticks have passed. Since we started counting at ticks0
, the
condition should be satisfied when ticks - ticks0 == n
.
Two more details: first, we'll add a check inside the while loop to see if the
current process has been marked to be killed; if so, we'll just return -1 right
away, which actually hastens the process's death: it gets back to running code
sooner, so the kernel will call exit() on it at its next trap. Second, the
function sleep() takes another argument in addition to the channel: a lock.
It'll release that lock for us while the process sleeps and reacquire it before
returning, so that a sleeping process doesn't hog a lock it doesn't need.
int sys_sleep(void)
{
  // ...
  while (ticks - ticks0 < n) {
    if (myproc()->killed) {
      release(&tickslock);
      return -1;
    }
    sleep(&ticks, &tickslock);
  }
  release(&tickslock);
  return 0;
}
The uptime()
system call just returns the number of ticks that have passed
since the system started. This is another one that's so simple it doesn't need
another function, so we'll take care of it all here.
We just acquire the lock for ticks
, get its current value, release the lock,
and return the value we got.
int sys_uptime(void)
{
  acquire(&tickslock);
  uint xticks = ticks;
  release(&tickslock);
  return xticks;
}
We have system calls now! Well, not quite -- we still have to check out the
actual functions like exit()
, sleep()
, kill()
, etc. Plus, we only saw the
sys_
wrapper functions for some of the system calls here; the rest are in
sysfile.c, which
we'll get to after we understand the xv6 file system.
But let's pause for a second and think about how a user process will send a
system call. Like let's say you're writing some C code for a user program that
will run on xv6 and you want to create a child process with fork()
. What
should you do?
Well, if you were coding for a Unix system like Linux or macOS, you'd just write
a call to fork()
in your code. But that can't be right in xv6, can it? After
all, fork()
is a kernel function, to be run in kernel mode with a current
privilege level of 0. Plus, isn't it supposed to be called by sys_fork()
? So
should we call that?
None of these options will work. Well, yes, you do end up just calling fork()
,
but it's not the kernel function fork()
, so if you're expecting that one,
you'll be surprised when it doesn't behave the way you want it to. You won't be
able to use any kernel code at all in your user program for xv6. This is a
mistake I've seen a lot of people make in their xv6 OSTEP projects, so bear
with me for a second while I explain why you can't do it; feel free to skip the
next section on the Makefile if you already know why.
To see why, let's check out the xv6
Makefile to see
how xv6 is actually compiled, built, and run. There's a ton of stuff in there,
but take a second to think about this: how do you usually run xv6? I bet it's
a command like make qemu
or make qemu-nox
, right?
If you're not familiar with Makefiles, here's a quick primer: each command like
make qemu
, make clean
, etc. is specified in the Makefile with a rule that
looks like this:
mycmd: dependency1 dependency2 ...
    build_cmd1
    build_cmd2
    # ...
So if I run make mycmd
, the make
program will check that dependency1
,
dependency2
, etc. are up to date; if they're not, it'll update them by looking
up their rules and executing those to update them. Then it'll execute
build_cmd1
on the shell, followed by build_cmd2
, etc.
Okay, I know that might be confusing, so let me simplify the make qemu
command a
bit to make it more readable (note that I cut a lot of stuff out here, so don't
try to run xv6 with what I wrote below).
# ...
qemu: fs.img xv6.img
    qemu -drive file=fs.img,index=1 -drive file=xv6.img,index=0
# ...
This just says that in order to run make qemu
when you type it on the
terminal, the make
program first has to make sure that both fs.img
and
xv6.img
are fully up to date. Then once they are, it can just run the shell
command qemu
with the options -drive file=fs.img,index=1
and
-drive file=xv6.img,index=0
. Those options are just regular flags like the
ones you're probably used to with stuff like ls -a
or rm -rf
. In this case,
they tell qemu
to use the files fs.img
and xv6.img
as virtual hard drives,
with xv6.img
as disk number 0 and fs.img
as disk number 1.
Okay, let's check out the make
command for xv6.img
next.
# ...
xv6.img: bootblock kernel
    # some dd commands here
# ...
Hey, that's interesting, we already saw bootblock
in a prior post. That's the
one we get when we compile the boot loader. kernel
is, well, all the kernel
code. The dd
command is often used in Unix systems to format and set up disks;
the details aren't important here, so I left them out for now. The point is that
the boot loader got compiled separately from the kernel code, remember? But
their machine code files get smushed into the same (virtual) disk together as
xv6.img
, which will be disk 0 when we run in qemu
.
Now let's check out the (slightly simplified) make
command for fs.img
.
# ...
UPROGS = cat echo forktest grep init kill ln ls # ...
fs.img: mkfs README $(UPROGS)
    ./mkfs fs.img README $(UPROGS)
# ...
Okay, so UPROGS
is just a list of all the user programs. Each of those gets
compiled separately; e.g. if you look in their source code, you'll see each one
has its own main()
function. Then the shell command says to run mkfs
to
create a file system called fs.img
with README
and all the user programs as
files.
The point of this detour is this: the boot loader gets compiled as a single
unit, as does the entire kernel code. But the user programs are compiled one at
a time. So if you write a user program for xv6, you should add it to the list in
UPROGS
(as well as in EXTRA
) and expect it to get compiled individually and
stuck onto the fs.img
disk.
That means there's no way for a user program to call into any kernel code; the
linker wouldn't even be able to match up the call to the right function. So no
user program will ever be able to call functions like (the kernel's) fork()
.
Think about it: if you write a program in C and compile it to run on Linux, do
you expect to have to recompile the entire Linux kernel just to run your one
little program? No, right?
But certainly we can't just expect every single program ever to be totally self-
contained. You also don't have to rewrite and recompile all of malloc()
every
time you write a C program. So operating systems provide libraries for users to
include and call in their programs. Aha! So all we need to do in order for
user processes to execute system calls is to provide a library. That library is
usys.S.
Let's trace back to the beginning of a trap. In order to execute a system call,
we're supposed to send the processor an int
instruction with a specific trap
number; that would be int 64
for system calls on xv6. We're also supposed to
stick the system call number in the %eax
register. Let's say we want to call
fork()
. According to
syscall.h, the
system call number for fork is SYS_fork
, or 1. In order to send a specific x86
instruction and manipulate individual registers, we'll have to write our system
call library in assembly. Here's what it would look like for the fork()
system
call:
.globl fork
fork:
  movl $1, %eax
  int $64
  ret
Okay, that's easy enough, but we have 21 of these to write, and it would be
pretty easy to make a mistake or write the wrong system call number. Let's
automate it instead with a C preprocessor macro. We've seen plenty of examples
of defining simple constants with #define
directives for the preprocessor, but
we haven't looked at them too closely until now.
The C preprocessor is a piece of software that edits C (or assembly) code before
it's compiled. Preprocessor directives like #define A 5
create macros that are
expanded to replace every instance of A
in the code with the number 5;
directives like #include "header.h"
expand such that they essentially copy-
paste all the code in the file header.h
. We can also create function-like
macros like the P2V()
and V2P()
macros we've used often by adding a
parameter inside parentheses; unlike functions, these will be expanded before
compilation to paste the code into every instance of its use, thus avoiding the
usual overhead associated with a function call. Function-like macros are also
generic, in a sense, since they don't require specifying parameter types or
return types (as long as it works within the places where the macro will be
used). Note that there are some drawbacks: macros aren't type-checked, they can
evaluate their arguments more than once, we can't use pointers to them like we
can with functions, and they can result in larger code.
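If you haven't played with function-like macros much, here's a quick toy example (not xv6 code) of both the convenience and one of those pitfalls:
#include <stdio.h>

// The preprocessor pastes the argument straight into the body, so
// SQUARE(i++) would expand to ((i++) * (i++)) and modify i twice --
// that's the "evaluate their arguments more than once" drawback
#define SQUARE(x) ((x) * (x))

int main(void)
{
  int a = 3;
  printf("%d\n", SQUARE(a + 1));   // expands to ((a + 1) * (a + 1)); prints 16
  return 0;
}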
We're gonna use a function-like macro here to create the assembly code for each
system call function so that it gets expanded before the code is assembled.
We'll use T_SYSCALL
instead of 64 in the code above, and SYS_fork
(or its
equivalent for each system call) for the system call number. We'll have to
replace the part after the underscore in SYS_
with the name of the system call
function; we can do that with the token-pasting operator ##
, which glues two
tokens together to form a single token. Also, macros must be defined on a single
line, so we'll escape the newline characters with \
and end each assembly line
with a semicolon.
#include "syscall.h"
#include "traps.h"
#define SYSCALL(name) \
  .globl name; \
  name: \
    movl $SYS_##name, %eax; \
    int $T_SYSCALL; \
    ret
Now we can just invoke the macro on the name of each function we want to create:
# ...
SYSCALL(fork)
SYSCALL(exit)
SYSCALL(wait)
# and so on ...
After the preprocessor runs on the file, the result will look like this:
.globl fork
fork:
  movl $1, %eax
  int $64
  ret
.globl exit
exit:
  movl $2, %eax
  int $64
  ret
.globl wait
wait:
  movl $3, %eax
  int $64
  ret
# and so on...
Great! Now we have 21 functions for the system calls, all written in assembly.
All user programs for xv6 will be compiled together with the code for these
functions: see ULIB
in the Makefile. So now, a user program can execute a
system call by calling these functions, e.g. fork()
.
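So a user program just calls fork(), wait(), and friends like ordinary C functions, and the stubs take care of trapping into the kernel. For example, something like this would work as an xv6 user program (with the usual includes that xv6's user programs pull in):
#include "types.h"
#include "stat.h"
#include "user.h"

int main(void)
{
  int pid = fork();   // this is the usys.S stub, not the kernel's fork()
  if (pid == 0) {
    printf(1, "hello from the child\n");   // user printf takes a file descriptor
    exit();
  }
  wait();   // another stub that traps into the kernel
  printf(1, "child %d is done\n", pid);
  exit();
}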
After all the preparations are handled by the trap handler functions in the IDT,
alltraps()
, and trap()
, system calls get routed to the syscall()
function,
which uses a system call number to pick the right function out of an array. That
function will have to recover any arguments to the system call before passing it
on to the real system call function later on.
Next up, we'll take a look at some of those system calls; we'll leave the rest until after we go over xv6's file system.