Random thoughts as I go
This page will serve as a design sketchbook for support of demand paging in Minix 2.0.4.
Understandably, given its design as a teaching tool, Minix 2.0 does not include support for demand paging. Instead, processes are allocated text and data segments consisting of blocks of contiguous physical memory. This simplification works well, and makes the code dealing with fork() and exec() very easy to understand. When a process fork()s, parent and child can share the same physical memory containing the code (text), and the child is allocated its own data segment, which gets a copy of the parent's data. When a process exec()s, the target executable file is opened and validated. Then, the process's existing code and data segments are freed. New segments are allocated and are initialized with code and data from the executable file.
Memory allocation and de-allocation are relatively simple and allow students to explore techniques to manage free lists and holes. Given the speed of modern x86 processors and the low cost of memory, this situation is perfectly adequate for mainstream Minix uses.
However, this scheme does have some inherent inefficiencies. Command shells under Minix (and other Unix-like systems) create new processes using fork() followed immediately by an exec(). The act of copying the parent's entire data segment to the child is almost entirely useless - the child will generally only use a tiny amount of the parent's data before exec()'ing. An exec() will cause all of that previously copied data to just be discarded.
Further, exec() itself also represents a significant inefficiency. Minix will load a new program's entire code and data from disk into the newly allocated physical memory segments before execution begins. There are two problems with this. First, most programs will only ever execute a smallish amount of their code. This means that code that will never actually be needed is loaded from disk into memory. Secondly, the act of ensuring that all code and data is loaded before execution begins represents an unnecessary degradation of interactive response time. If only the code and data that is needed immediately is loaded, execution can begin sooner.
Now, these inefficiencies exist in x86 Minix - but the speed of modern systems is so great that they aren't noticeable. For Magic-1, though, these useless actions can literally add seconds to the execution response time of interactive applications. In other words, they make the system feel much slower than it might otherwise be.
To deal with the issues for Magic-1, we will be using a pair of related techniques: demand paging and copy-on-write. In general, copy-on-write addresses the fork() problem of useless copying of parent data, and demand paging deals with improving exec() response time.
Before getting into the details of demand paging and copy-on-write, we must first discuss paging in general. Key to both of these techniques is the ability to break up a process's virtual address space into chunks (pages) and associate those chunks with arbitrary chunks (pages) of physical memory. Modern computers have special hardware (Memory Mapping Units, or MMUs) that allow a process to believe that it "owns" a contiguous range of addressable memory, while hiding from it the details of the actual memory that it uses. The process's view of memory is the "virtual address space", while the real memory in the system is "physical memory". When the process executes an instruction to load the contents of a memory cell at a virtual address, the MMU converts that virtual address to the address of a physical memory cell. Because MMU hardware is somewhat costly, this mapping of virtual to physical address spaces is done in chunks of addresses - also known as pages. The size of a page is system dependent. For Magic-1, the page size is 2048 bytes. All of the virtual addresses in a page of virtual memory are translated to the same physical page.
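The arithmetic behind this translation is worth making concrete. Here's a small sketch (the helper names are mine, not Minix's) showing how a 16-bit virtual address splits into a page number and an offset when the page size is 2048 (2^11) bytes, and how only the page number is remapped:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 11
#define PAGE_SIZE  (1 << PAGE_SHIFT)   /* 2048 bytes per page */

/* Split a virtual address into its page number and in-page offset. */
static unsigned vpage_of(uint16_t vaddr)  { return vaddr >> PAGE_SHIFT; }
static unsigned offset_of(uint16_t vaddr) { return vaddr & (PAGE_SIZE - 1); }

/* Translate: look up the physical page for the virtual page; the
 * offset passes through unchanged.  'page_table' maps virtual page
 * number -> physical page number. */
static uint32_t translate(const uint16_t *page_table, uint16_t vaddr)
{
    return ((uint32_t)page_table[vpage_of(vaddr)] << PAGE_SHIFT)
           | offset_of(vaddr);
}

/* Demo using the mapping discussed later (virtual page 3 -> physical
 * page 241): byte 7 of virtual page 3 lands at byte 7 of physical
 * page 241. */
static uint32_t demo(void)
{
    uint16_t pt[32] = {0};
    pt[3] = 241;
    pt[4] = 112;
    return translate(pt, (uint16_t)(3 * PAGE_SIZE + 7));
}
```

Note that the offset bits are never touched - only the upper bits (the page number) are rewritten by the MMU.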
Key to further discussions is the fact that a process "sees" its virtual memory space as a contiguous range, or set of contiguous pages. For example, a Magic-1 process has a 16-bit data address space, and so sees it as an unbroken range of memory cells from address 0x0000 through 0xFFFF, or 32 2K-byte pages. The MMU's mapping of virtual to physical pages, however, does not need to be contiguous. From its virtual memory point of view, page 3 is immediately followed by page 4. However, the MMU might associate virtual page 3 with physical page 241, and virtual page 4 with physical page 112. Further, it is possible (and often quite desirable) for the MMU to map virtual pages from different processes to the same physical memory page.
Another important concept is that the total size of all running processes' virtual memory spaces need not be the same as the total amount of physical memory in the system (and generally isn't). For most systems, the cumulative total of virtual address spaces is much larger than available physical memory. It's the operating system's job to efficiently allocate physical memory to meet process needs, and that turns out to be quite a complex job.
Magic-1 is a little odd in that it has much more physical memory (4MB) than required virtual address space (max 128K bytes per process). Supporting demand paging in most other computer systems would involve a lot more focus on fairly spreading available physical memory to meet virtual memory needs, and in particular how to decide when to steal physical memory from one process to give to another (and how to save and restore the contents of those stolen pages). Though I may include this capability in my implementation of paging for Magic-1, it is unlikely to be needed much and therefore doesn't need a sophisticated implementation.
Now, let's briefly go back to Minix's notion of memory allocation. In Minix, each process has three memory segments - Text (code), data and stack. Each of these segments is described by a mem_map structure (see include/minix/types.h) which gives a starting page (click) in virtual address space, a size in pages and a physical starting page. If a data segment starts at virtual page 0, is 5 pages long and starts at physical page 123, we know that the virtual to physical page mapping is 0->123, 1->124, 2->125, 3->126 and 4->127. This greatly simplifies converting between virtual and physical addresses, but note that it has a built-in assumption that the physical pages for a virtual address range are contiguous.
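In code, that contiguity assumption makes translation pure arithmetic. The sketch below uses a simplified stand-in for mem_map (field names roughly follow the real structure, but this is not the actual header):

```c
#include <assert.h>

/* Simplified stand-in for Minix's mem_map segment descriptor. */
struct mem_map_sketch {
    unsigned mem_vir;   /* starting virtual page  */
    unsigned mem_len;   /* length in pages        */
    unsigned mem_phys;  /* starting physical page */
};

/* With contiguous physical pages, virtual page -> physical page is
 * just an offset from the base: phys = mem_phys + (vpage - mem_vir). */
static long vir2phys_page(const struct mem_map_sketch *m, unsigned vpage)
{
    if (vpage < m->mem_vir || vpage >= m->mem_vir + m->mem_len)
        return -1;  /* outside the segment */
    return (long)(m->mem_phys + (vpage - m->mem_vir));
}
```

With the example segment above (virtual page 0, length 5, physical page 123), page 4 translates to 127 and page 5 is out of range - exactly the 0->123 ... 4->127 mapping described in the text.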
The first step in a more flexible memory management system is to break this assumption. We need to change the mem_map structure to allow arbitrary mappings between virtual and physical memory pages. For Magic-1, the solution is quite simple. Because process virtual address spaces are relatively small (64K bytes, or 32 pages), we can simply replace the single physical page base field with a 32-element array of physical page descriptors. Note that if we were dealing with a 32-bit or 64-bit processor, we would need to use a more space-efficient data structure.
The details of what these page descriptors are will change slightly as the implementation progresses, but in short they will consist of a physical page number plus a number of flag bits to describe attributes of the page (e.g. read-only, writeable, present, unmapped, file-backed).
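A possible shape for the revised structure (descriptor layout, flag names and field names are my own guesses at this stage, not settled design):

```c
#include <assert.h>
#include <stdint.h>

#define PG_PRESENT   0x01   /* physical page is mapped        */
#define PG_WRITEABLE 0x02
#define PG_COW       0x04   /* copy-on-write                  */

#define NR_PAGES 32         /* 64K-byte space / 2K-byte pages */

/* One descriptor per virtual page: a physical page number plus
 * attribute flag bits. */
struct page_desc {
    uint16_t phys;
    uint8_t  flags;
};

/* The revised segment map: a 32-element array of descriptors replaces
 * the single physical base field, allowing arbitrary mappings. */
struct seg_map {
    struct page_desc pages[NR_PAGES];
};

/* Each virtual page is now translated independently - no contiguity. */
static long lookup_phys(const struct seg_map *s, unsigned vpage)
{
    if (vpage >= NR_PAGES || !(s->pages[vpage].flags & PG_PRESENT))
        return -1;  /* would raise a Not-Present fault */
    return s->pages[vpage].phys;
}
```

Note how virtual pages 3 and 4 can now map to entirely unrelated physical pages (e.g. 241 and 112, as in the earlier example), and an unmapped entry naturally models a page that should fault.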
In addition to changing the mem_map structure, we also need to identify all of the places in Minix which directly or indirectly rely on the assumption that physical memory is contiguous. The primary offenders in the kernel proper are phys_copy(), umap(), numap() and the various vir2phys() macros. I've replaced these routines with equivalents which understand that data in a virtual address space that crosses a page boundary may span widely different physical memory pages.
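The essential change to a routine like phys_copy() is that it can no longer copy one flat run of bytes; it must split the copy at every page boundary and re-translate. A sketch of that loop (simulated here with a flat byte array standing in for physical memory; all names are mine):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT 11
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* Page-aware copy between two virtual ranges.  Each iteration copies
 * at most up to the next page boundary of either the source or the
 * destination, translating each chunk separately instead of assuming
 * the underlying physical pages are contiguous. */
static void page_copy(uint8_t *mem, const uint16_t *src_pt,
                      const uint16_t *dst_pt,
                      uint16_t src, uint16_t dst, size_t len)
{
    while (len > 0) {
        /* Bytes remaining before each address crosses a page boundary. */
        size_t s_room = PAGE_SIZE - (src & (PAGE_SIZE - 1));
        size_t d_room = PAGE_SIZE - (dst & (PAGE_SIZE - 1));
        size_t n = len;
        if (n > s_room) n = s_room;
        if (n > d_room) n = d_room;

        uint32_t s_phys = ((uint32_t)src_pt[src >> PAGE_SHIFT] << PAGE_SHIFT)
                          | (src & (PAGE_SIZE - 1));
        uint32_t d_phys = ((uint32_t)dst_pt[dst >> PAGE_SHIFT] << PAGE_SHIFT)
                          | (dst & (PAGE_SIZE - 1));
        memmove(mem + d_phys, mem + s_phys, n);

        src += n; dst += n; len -= n;
    }
}
```

A two-byte copy that straddles a virtual page boundary may thus touch two physical pages that are nowhere near each other, on both the source and destination sides.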
More significant changes also need to be made in the Memory Manager. These will be discussed later.
Assume now that Minix generally understands that virtual code and data segments consist of virtual pages which can be arbitrarily associated with physical memory pages, and further that the MMU enforces memory protection. In other words, if a process attempts to read or write a virtual page that has no physical memory mapping, a Not-Present page fault will be generated. Further, if a process attempts to write to a virtual page that has a physical page mapped to it, but mapped read-only, a Not-Writeable page fault will be generated.
We now have in place the elements for an elegant and efficient fork() mechanism. The current Minix fork() code in MM's forkexit.c is moderately long. It has to allocate memory for the child and copy the parent's data (and sometimes code) to the child. With copy-on-write, the implementation of fork() becomes trivial. It essentially amounts to simply copying the parent's page table to the child and marking all data page entries in both parent and child read-only. No actual page data is copied. Both parent and child are marked as ready to run and fork() is done.
The interesting part is what happens next. The first time either the parent or the child attempts to write to its data segment, a Not-Writeable page fault is generated. The trap handler will examine the mem_map structure for the offending process, and see that the fault happened on a "copy-on-write" page. It will then do just that - allocate a fresh data page and copy the source page to the fresh page. That fresh page will be entered into the process's page table, but this time with write permission enabled. The process will then be made ready to run and will eventually attempt to redo the write that previously failed. This time, though, it will succeed.
The key here is that only pages that are written to are copied. If, as is typical, a child only writes a small amount of data before exec()'ing, only 1 or 2 pages will be copied.
Now, I've skipped over some important implementation details. We need to keep a reference count on these shared pages. When we make a copy during a copy-on-write fault, we'll decrement the reference count on the original page. For example, if both the parent and child write to the same page, the first write will trigger the copy-on-write action, which will drop the original page's reference count to 1. When the other process writes to that page, it will also get a trap, but because the reference count is now 1 it can simply regain sole ownership of the page by re-enabling write permission - no copy needed.
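The fork-and-fault sequence can be sketched in miniature, outside the real Minix message machinery (all names here are mine, and the "allocator" is a toy): cow_fork shares every data page read-only and bumps reference counts; cow_fault copies only when the faulting process is not the last referent.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_PAGES   32
#define PG_PRESENT 0x01
#define PG_WRITE   0x02

struct pdesc { uint16_t phys; uint8_t flags; };
struct proc_sketch { struct pdesc data[NR_PAGES]; };

static int refcount[2048];      /* one count per physical page       */
static int next_free = 100;     /* toy allocator: hand out new pages */

static int alloc_page(void) { int p = next_free++; refcount[p] = 1; return p; }

/* fork(): duplicate the page table, mark both copies read-only, and
 * bump each shared page's reference count.  No page data is copied. */
static void cow_fork(struct proc_sketch *parent, struct proc_sketch *child)
{
    memcpy(child->data, parent->data, sizeof(child->data));
    for (int i = 0; i < NR_PAGES; i++) {
        if (parent->data[i].flags & PG_PRESENT) {
            parent->data[i].flags &= ~PG_WRITE;
            child->data[i].flags  &= ~PG_WRITE;
            refcount[parent->data[i].phys]++;
        }
    }
}

/* Not-Writeable fault on virtual page 'vp': if we are the last
 * referent, just re-enable write; otherwise allocate a fresh page
 * (a real handler would also copy the old page's contents into it)
 * and drop the shared page's count. */
static void cow_fault(struct proc_sketch *p, unsigned vp)
{
    uint16_t old = p->data[vp].phys;
    if (refcount[old] > 1) {
        refcount[old]--;
        p->data[vp].phys = (uint16_t)alloc_page();
        /* ...copy contents of page 'old' to the fresh page here... */
    }
    p->data[vp].flags |= PG_WRITE;
}
```

Tracing the example from the text: after cow_fork the shared page's count is 2; the first writer copies and drops it to 1; the second writer sees a count of 1 and simply reclaims the page.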
There is a problem here, though. Consider in more detail the actions that might have to take place following a page fault. When a page fault happens, control is vectored through the normal trap handling mechanism and eventually winds up in exception(). This routine needs to decide whether this is a fault or signal that can be handled, and if so it will construct an appropriate message. Let's assume that we have a new kernel task, PAGER, that will receive page fault messages. So, following a page fault the faulting process will be blocked and a message will be sent to PAGER.
PAGER, then, will need to examine the mem_map structure of the faulting process, possibly allocate a fresh page of memory and update page reference counts. It seems natural that the Memory Manager server (MM) would be responsible for handling these tasks, so you'd expect PAGER to accomplish them by sending appropriate messages to MM. However, what would happen if the page fault occurred while MM was executing a system call on behalf of a process, and that fault needed copy-on-write servicing? The answer is deadlock. One of Minix's key simplifications is that user-mode servers like MM and the file system (FS) are single-threaded. They can only handle a single message at a time. If either were blocked while waiting for a page fault to be handled, it would go deaf - not accepting any new messages.
There are several possible solutions here. We could force the user-mode servers to make sure any possible page faults are handled before "accepting" a new message. Similarly, we could rework the servers to be able to save partially completed faulting operations and then later restart them following the servicing of the page fault. Both of these methods, however, are ugly and error-prone, as you have to be sure to identify and prevent all possible faulting situations before they happen.
The solution I'll be using is to break the deadlock cycle by creating a new task/server called MAPPER. MAPPER will have a very simple job - basically storing information about physical memory pages and mappings between virtual and physical memory. To ensure that it cannot be part of a deadlock cycle, it will never send a message on its own behalf. It will only receive messages, carry them out and send a response. We will move the memory allocation functionality from MM to MAPPER. So, when it comes time to handle a page fault, PAGER can avoid having to send a message to MM or FS. Instead, it need only talk with MAPPER, and thus avoid the deadlock situation.
Interestingly, some of the memory allocation capabilities that we are moving from MM to MAPPER become much simpler in a paging environment. We no longer have to worry about finding a contiguous block of physical memory (as MM does now), but can just keep a list of free blocks and allocate an appropriate number of them without worrying about their physical relationships with one another. This mechanism also opens up the possibility of a more comprehensive caching mechanism (but more on that when we discuss demand paging).
MAPPER will have some odd quirks. To meet its requirement that it never initiate a message, we must require that all callers handle parameter passing. If a caller needs to pass in or retrieve more information than can fit in a standard message body, it must first send a message to MAPPER to request private parameter storage. The caller is responsible for either filling that parameter storage prior to sending the "active" message request, or retrieving the needed data from that area later. To support copy-on-write, a handful of new system calls will be handled by MAPPER.
Demand paging support builds on the infrastructure created for copy-on-write, relying heavily on MAPPER and PAGER to avoid deadlocks and handle page fault requests. The new feature is that unlike copy-on-write (which is triggered by Not-Writeable faults on existing pages), demand paging allows us to defer loading a page in the first place and is triggered by the Not-Present page fault.
At exec() time, instead of allocating new physical memory pages for code and data and then pre-loading them with the contents of the executable, we instead simply establish a mapping between the code and data virtual address space in the new process and the locations in the executable file that hold the code and data. No physical memory pages are allocated at this time. When the process begins running, it will trigger Not-Present page faults when it attempts to access that non-existent code and data. These faults will be converted to messages sent to the PAGER task, which will then allocate physical memory and initialize it with the appropriate data from the executable file.
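The load-on-first-touch behavior can be sketched as follows (a toy model, with a byte array standing in for the executable file and all names being assumptions of mine): nothing is loaded at exec() time, and the Not-Present fault handler fills a frame from the backing store only when a page is first touched.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE  2048
#define NR_PAGES   32
#define PG_PRESENT 0x01
#define PG_FILEMAP 0x02   /* backed by a region of the executable */

/* Per-page state: where the page lives on "disk", and the physical
 * frame it occupies once resident. */
struct dpage {
    uint8_t  flags;
    uint32_t file_off;
    uint8_t  *frame;
};

static int fault_count;   /* counts Not-Present faults, for the demo */

/* Not-Present fault on virtual page vp: grab a frame and fill it from
 * the backing store.  (In the real design this is PAGER allocating a
 * page via MAPPER and reading blocks from the device driver.) */
static void demand_fault(struct dpage *pt, unsigned vp,
                         const uint8_t *file, uint8_t *frame)
{
    fault_count++;
    memcpy(frame, file + pt[vp].file_off, PAGE_SIZE);
    pt[vp].frame = frame;
    pt[vp].flags |= PG_PRESENT;
}

/* A read access: faults only if the page is not yet resident. */
static uint8_t dp_read(struct dpage *pt, unsigned vp, unsigned off,
                       const uint8_t *file, uint8_t *spare_frame)
{
    if (!(pt[vp].flags & PG_PRESENT))
        demand_fault(pt, vp, file, spare_frame);
    return pt[vp].frame[off];
}
```

The second access to an already-resident page generates no fault - which is exactly why code that is never executed never costs a disk read.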
The necessary capabilities for establishing the virtual <-> disk file mappings can be generalized into a pair of new system calls, mmap() and munmap(). These new system calls will be implemented in MM and will be used during exec() handling, as well as be available for general user-mode program use (mmap() is commonly used to support inter-process communication via shared memory regions). Refer to a standard POSIX man page for mmap() for details (and note that my Magic-1 implementation will likely only support a subset of standard mmap() capabilities).
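For reference, here is the shared-memory IPC idiom as it looks on an ordinary POSIX system today (this is standard POSIX/Linux usage, not Magic-1 code - whether my subset will include MAP_ANONYMOUS is an open question): parent and child share one mapping, so the child's write is visible to the parent.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map one shared, anonymous page; fork; let the child store a value;
 * have the parent observe it.  With MAP_SHARED the page is the same
 * physical memory in both processes, so no copying occurs. */
static int shared_counter_demo(void)
{
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid_t pid = fork();
    if (pid == 0) {             /* child: write to the shared page */
        *shared = 42;
        _exit(0);
    }
    waitpid(pid, NULL, 0);      /* parent: wait, then read the value */
    int seen = *shared;
    munmap(shared, sizeof(int));
    return seen;
}
```

Had the mapping been MAP_PRIVATE instead, the child's store would have triggered copy-on-write and the parent would still see 0 - the two mmap() flags correspond directly to the sharing-vs-copying distinction running through this design.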
The first thing to consider here is that demand paging suffers from the same kind of deadlock issues that we discussed in the copy-on-write section - only worse. During the handling of copy-on-write, the PAGER needed only to allocate fresh memory pages. For demand paging, though, it will additionally need to be able to locate disk blocks on the file system and read/write them - actions normally performed by the File Server (FS). And again, it is quite possible for a Not-Present page fault to occur while FS is handling a system call on behalf of a user process, causing a deadlock situation: FS would be unable to receive any messages from the PAGER task looking to load/store the necessary disk blocks.
The solution will be the same as before - we will move the limited FS capabilities needed by PAGER into the MAPPER task. In this case, what we really need is the FS's knowledge of which disk blocks are mapped to which virtual pages (i.e. - the mapping created by the mmap() system call). So, at the time we establish the mapping we will ask FS to provide a complete description of where the data can be located. We will need the major and minor device, the inode number, the inode's modification timestamp and an ordered list of the physical disk blocks that make up the mapped region. This will be done via a new FS system call: FMAP. MM will do an FMAP call as part of mmap() processing, and will then send the returned info to MAPPER - which will make it available to PAGER on demand.
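The FMAP reply might carry something along these lines (every field name, limit and the blocks-per-page ratio here is a guess of mine, not settled design) - just enough for PAGER to turn a faulting virtual page into a device block request without ever consulting FS:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MAP_BLOCKS  64
#define BLOCKS_PER_PAGE 2    /* assumed: 1K FS blocks, 2K pages */

/* Sketch of the data FMAP returns for one mapped region. */
struct fmap_info {
    uint16_t dev;                        /* major/minor device       */
    uint32_t inode;
    uint32_t mtime;                      /* modification timestamp   */
    uint16_t nblocks;
    uint32_t blocks[MAX_MAP_BLOCKS];     /* ordered physical blocks  */
};

/* First disk block backing virtual page 'vp' of the mapped region;
 * PAGER would read BLOCKS_PER_PAGE consecutive entries from here. */
static long block_for_page(const struct fmap_info *f, unsigned vp)
{
    unsigned idx = vp * BLOCKS_PER_PAGE;
    if (idx >= f->nblocks)
        return -1;  /* past the end of the mapped region */
    return (long)f->blocks[idx];
}
```

Note that the block list makes no contiguity assumption on disk, either: page 1 of the region can live in blocks far away from page 0's.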
PAGER will use this mapping info to create block read/write request messages that will be sent directly to the appropriate device driver tasks to service the faults, bypassing FS and avoiding deadlock.
As an example, let's follow an exec() of a new process.
Roles and Responsibilities
By moving some functionality out of MM and FS, I'm jumbling roles and responsibilities up a bit. Here, I'll try to re-impose some order in the world by clarifying the new roles and responsibilities of FS, MM, MAPPER and PAGER.
[todo: describe in detail the lookup, reference counting and caching mechanisms that MAPPER will need. In short:
- Fixed array of 2048 page descriptors to describe all physical memory.
- Hash lookup for range identifiers and for cached pages, using the dev/inode/timestamp triplet as the hash key.
- Free list threaded through the descriptor table.
- When a reference count goes to zero, don't immediately free the page but rather keep it in a victim cache. If we run out of completely free pages, use a simple mechanism to cull a group of pages from the victim cache. If truly out of pages (i.e. both free list and victim cache empty), probably just kill the process rather than flush to a page file for now.
- Describe mmap() support for MAP_PRIVATE & MAP_SHARED, as well as flushing dirty pages back to disk. Timing issues. What to do when a mapped file is altered by another process? Do we flush and reload? Need FS to keep track of which files are mapped and signal PAGER when an inode changes?
- Describe how heap growth is supported via the brk system call (i.e. map copy-on-write zero pages).
- Describe how we can keep the gap between the end of the heap and the stack unmapped to catch bogus memory references, and how normal stack growth is distinguished from an invalid data reference in the unmapped gap (i.e. examine the current value of SP at the time of the fault, and either map in fresh memory pages if stack growth or send a seg-fault message if a bad gap reference).
- Describe how the a.out format requires that code and data be aligned within the executable file to page boundaries; side discussion of an enhancement to reduce wasted disk space by including the a.out header in code space.
- Describe null pointer dereference detection via use of an unmapped data page 0.
- Brief description of issues around mapping a file whose size is not a page multiple.
- Discuss how Windows sidesteps the issues around modification of a mapped executable by disallowing write access to the file.
- Describe implementation of ptrace() for debuggers by using mmap(MAP_PRIVATE) and copy-on-write on code pages to easily support setting breakpoint instructions in code. mmap() tricks to allow processes to communicate via shared memory.]