Random thoughts as I go
This page will serve as a design sketchbook for support of demand paging in Minix 2.0.4.
Understandably, given its design as a teaching tool, Minix 2.0 does not include support for demand paging. Instead, processes are allocated text and data segments consisting of blocks of contiguous physical memory. This simplification works well, and makes the code dealing with fork() and exec() very easy to understand. When a process fork()s, parent and child can share the same physical memory containing the code (text), and the child is allocated its own data segment, which gets a copy of the parent's data. When a process exec()s, the target executable file is opened and validated. Then, the process's existing code and data segments are freed. New segments are allocated and are initialized with code and data from the executable file.
Memory allocation and de-allocation are relatively simple and allow students to explore techniques to manage free lists and holes. Given the speed of modern x86 processors and the low cost of memory, this situation is perfectly adequate for mainstream Minix uses.
However, this scheme does have some inherent inefficiencies. Command shells under Minix (and other Unix-like systems) create new processes using fork() followed immediately by an exec(). The act of copying the parent's entire data segment to the child is almost entirely useless - the child will generally only use a tiny amount of the parent's data before exec()'ing. An exec() will cause all of that previously copied data to just be discarded.
Further, exec() itself also represents a significant inefficiency. Minix will load a new program's entire code and data from disk into the newly allocated physical memory segments before execution begins. There are two problems with this. First, most programs will only ever execute a smallish amount of their code. This means that code that will never actually be needed is loaded from disk into memory. Secondly, the act of ensuring that all code and data is loaded before execution begins represents an unnecessary degradation of interactive response time. If only the code and data that is needed immediately is loaded, execution can begin sooner.
Now, these inefficiencies exist in x86 Minix - but the speed of modern systems is so great that they aren't noticeable. For Magic-1, though, these useless actions can literally add seconds to the execution response time of interactive applications. In other words, they make the system feel much slower than it might otherwise be.
To deal with the issues for Magic-1, we will be using a pair of related techniques: demand paging and copy-on-write. In general, copy-on-write addresses the fork() problem of useless copying of parent data, and demand paging deals with improving exec() response time.
Before getting into the details of demand paging and copy-on-write, we must first discuss paging in general. Key to both of these techniques is the ability to break up a process's virtual address space into chunks (pages) and associate those chunks with arbitrary chunks (pages) of physical memory. Modern computers have special hardware (Memory Mapping Units, or MMUs) that allow a process to believe that it "owns" a contiguous range of addressable memory, while hiding from it the details of the actual memory that it uses. The process's view of memory is the "virtual address space", while the real memory in the system is "physical memory". When the process executes an instruction to load the contents of a memory cell at a virtual address, the MMU converts that virtual address to the address of a physical memory cell. Because MMU hardware is somewhat costly, this mapping of virtual to physical address spaces is done in chunks of addresses - also known as pages. The size of a page is system dependent. For Magic-1, the page size is 2048 bytes. All of the virtual addresses in a page of virtual memory are translated to the same physical page.
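The arithmetic behind this translation is worth making concrete. Here's a small sketch (the helper names are mine, not Minix's) showing how a 16-bit virtual address splits into a page number and an offset when the page size is 2048 (2^11) bytes, and how only the page number is remapped:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 11
#define PAGE_SIZE  (1 << PAGE_SHIFT)   /* 2048 bytes per page */

/* Split a virtual address into its page number and in-page offset. */
static unsigned vpage_of(uint16_t vaddr)  { return vaddr >> PAGE_SHIFT; }
static unsigned offset_of(uint16_t vaddr) { return vaddr & (PAGE_SIZE - 1); }

/* Translate: look up the physical page for the virtual page; the
 * offset passes through unchanged.  'page_table' maps virtual page
 * number -> physical page number. */
static uint32_t translate(const uint16_t *page_table, uint16_t vaddr)
{
    return ((uint32_t)page_table[vpage_of(vaddr)] << PAGE_SHIFT)
           | offset_of(vaddr);
}

/* Demo using the mapping discussed later (virtual page 3 -> physical
 * page 241): byte 7 of virtual page 3 lands at byte 7 of physical
 * page 241. */
static uint32_t demo(void)
{
    uint16_t pt[32] = {0};
    pt[3] = 241;
    pt[4] = 112;
    return translate(pt, (uint16_t)(3 * PAGE_SIZE + 7));
}
```

Note that the offset bits are never touched - only the upper bits (the page number) are rewritten by the MMU.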
Key to further discussions is the fact that a process "sees" its virtual memory space as a contiguous range, or set of contiguous pages. For example, a Magic-1 process has a 16-bit data address space, and so sees it as an unbroken range of memory cells from address 0x0000 through 0xFFFF, or 32 2K-byte pages. The MMU's mapping of virtual to physical pages, however, does not need to be contiguous. From its virtual memory point of view, page 3 is immediately followed by page 4. However, the MMU might associate virtual page 3 with physical page 241, and virtual page 4 with physical page 112. Further, it is possible (and often quite desirable) for the MMU to map virtual pages from different processes to the same physical memory page.
Another important concept is that the total size of all running processes' virtual memory spaces need not be the same as the total amount of physical memory in the system (and generally isn't). For most systems, the cumulative total of virtual address spaces is much larger than available physical memory. It's the operating system's job to efficiently allocate physical memory to meet process needs, and that turns out to be quite a complex job.
Magic-1 is a little odd in that it has much more physical memory (4MB) than required virtual address space (max 128K bytes per process). Supporting demand paging in most other computer systems would involve a lot more focus on fairly spreading available physical memory to meet virtual memory needs, and in particular how to decide when to steal physical memory from one process to give to another (and how to save and restore the contents of those stolen pages). Though I may include this capability in my implementation of paging for Magic-1, it is unlikely to be needed much and therefore doesn't need a sophisticated implementation.
Now, let's briefly go back to Minix's notion of memory allocation. In Minix, each process has three memory segments - Text (code), data and stack. Each of these segments is described by a mem_map structure (see include/minix/types.h) which gives a starting page (click) in virtual address space, a size in pages and a physical starting page. If a data segment starts at virtual page 0, is 5 pages long and starts at physical page 123, we know that the virtual to physical page mapping is 0->123, 1->124, 2->125, 3->126 and 4->127. This greatly simplifies converting between virtual and physical addresses, but note that it has a built-in assumption that the physical pages for a virtual address range are contiguous.
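In code, that contiguity assumption makes translation pure arithmetic. The sketch below uses a simplified stand-in for mem_map (field names roughly follow the real structure, but this is not the actual header):

```c
#include <assert.h>

/* Simplified stand-in for Minix's mem_map segment descriptor. */
struct mem_map_sketch {
    unsigned mem_vir;   /* starting virtual page  */
    unsigned mem_len;   /* length in pages        */
    unsigned mem_phys;  /* starting physical page */
};

/* With contiguous physical pages, virtual page -> physical page is
 * just an offset from the base: phys = mem_phys + (vpage - mem_vir). */
static long vir2phys_page(const struct mem_map_sketch *m, unsigned vpage)
{
    if (vpage < m->mem_vir || vpage >= m->mem_vir + m->mem_len)
        return -1;  /* outside the segment */
    return (long)(m->mem_phys + (vpage - m->mem_vir));
}
```

With the example segment above (virtual page 0, length 5, physical page 123), page 4 translates to 127 and page 5 is out of range - exactly the 0->123 ... 4->127 mapping described in the text.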
The first step in a more flexible memory management system is to break this assumption. We need to change the mem_map structure to allow arbitrary mappings between virtual and physical memory pages. For Magic-1, the solution is quite simple. Because process virtual address spaces are relatively small (64K bytes, or 32 pages), we can simply replace the single physical page base field with a 32-element array of physical page descriptors. Note that if we were dealing with a 32-bit or 64-bit processor, we would need to use a more space-efficient data structure.
The details of what these page descriptors are will change slightly as the implementation progresses, but in short they will consist of a physical page number plus a number of flag bits to describe attributes of the page (e.g. read-only, writeable, present, unmapped, file-backed).
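A possible shape for the revised structure (descriptor layout, flag names and field names are my own guesses at this stage, not settled design):

```c
#include <assert.h>
#include <stdint.h>

#define PG_PRESENT   0x01   /* physical page is mapped        */
#define PG_WRITEABLE 0x02
#define PG_COW       0x04   /* copy-on-write                  */

#define NR_PAGES 32         /* 64K-byte space / 2K-byte pages */

/* One descriptor per virtual page: a physical page number plus
 * attribute flag bits. */
struct page_desc {
    uint16_t phys;
    uint8_t  flags;
};

/* The revised segment map: a 32-element array of descriptors replaces
 * the single physical base field, allowing arbitrary mappings. */
struct seg_map {
    struct page_desc pages[NR_PAGES];
};

/* Each virtual page is now translated independently - no contiguity. */
static long lookup_phys(const struct seg_map *s, unsigned vpage)
{
    if (vpage >= NR_PAGES || !(s->pages[vpage].flags & PG_PRESENT))
        return -1;  /* would raise a Not-Present fault */
    return s->pages[vpage].phys;
}
```

Note how virtual pages 3 and 4 can now map to entirely unrelated physical pages (e.g. 241 and 112, as in the earlier example), and an unmapped entry naturally models a page that should fault.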
In addition to changing the mem_map structure, we also need to identify all of the places in Minix which directly or indirectly rely on the assumption that physical memory is contiguous. The primary offenders in the kernel proper are phys_copy(), umap(), numap() and the various vir2phys() macros. I've replaced these routines with equivalents which understand that data in a virtual address space that crosses a page boundary may span widely different physical memory pages.
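The essential change to a routine like phys_copy() is that it can no longer copy one flat run of bytes; it must split the copy at every page boundary and re-translate. A sketch of that loop (simulated here with a flat byte array standing in for physical memory; all names are mine):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT 11
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* Page-aware copy between two virtual ranges.  Each iteration copies
 * at most up to the next page boundary of either the source or the
 * destination, translating each chunk separately instead of assuming
 * the underlying physical pages are contiguous. */
static void page_copy(uint8_t *mem, const uint16_t *src_pt,
                      const uint16_t *dst_pt,
                      uint16_t src, uint16_t dst, size_t len)
{
    while (len > 0) {
        /* Bytes remaining before each address crosses a page boundary. */
        size_t s_room = PAGE_SIZE - (src & (PAGE_SIZE - 1));
        size_t d_room = PAGE_SIZE - (dst & (PAGE_SIZE - 1));
        size_t n = len;
        if (n > s_room) n = s_room;
        if (n > d_room) n = d_room;

        uint32_t s_phys = ((uint32_t)src_pt[src >> PAGE_SHIFT] << PAGE_SHIFT)
                          | (src & (PAGE_SIZE - 1));
        uint32_t d_phys = ((uint32_t)dst_pt[dst >> PAGE_SHIFT] << PAGE_SHIFT)
                          | (dst & (PAGE_SIZE - 1));
        memmove(mem + d_phys, mem + s_phys, n);

        src += n; dst += n; len -= n;
    }
}
```

A two-byte copy that straddles a virtual page boundary may thus touch two physical pages that are nowhere near each other, on both the source and destination sides.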
More significant changes also need to be made in the Memory Manager. These will be discussed later.
Assume now that Minix generally understands that virtual code and data segments consist of virtual pages which can be arbitrarily associated with physical memory pages, and further that the MMU enforces memory protection. In other words, if a process attempts to read or write a virtual page that has no physical memory mapping, a Not-Present page fault will be generated. Further, if a process attempts to write to a virtual page that has a physical page mapped to it, but mapped read-only, a Not-Writeable page fault will be generated.
We now have in place the elements for an elegant and efficient fork() mechanism. The current Minix fork() code in MM's forkexit.c is moderately long. It has to allocate memory for the child and copy the parent's data (and sometimes code) to the child. With copy-on-write, the implementation of fork() becomes trivial. It essentially amounts to simply copying the parent's page table to the child and marking all data page entries in both parent and child read-only. No actual page data is copied. Both parent and child are marked as ready to run and fork() is done.
The interesting part is what happens next. The first time either the parent or the child attempts to write to its data segment, a Not-Writeable page fault is generated. The trap handler will examine the mem_map structure for the offending process, and see that the fault happened on a "copy-on-write" page. It will then do just that - allocate a fresh data page and copy the source page to the fresh page. That fresh page will be entered into the process's page table, but this time with write permission enabled. The process will then be made ready to run and will eventually attempt to redo the write that previously failed. This time, though, it will succeed.
The key here is that only pages that are written to are copied. If, as is typical, a child only writes a small amount of data before exec()'ing, only 1 or 2 pages will be copied.
Now, I've skipped over some important implementation details. We need to keep a reference count on these shared pages. When we make a copy during a copy-on-write fault, we'll decrement the reference count on the original page. For example, if both the parent and child write to the same page, the first write will trigger the copy-on-write action, which will drop the original page's reference count to 1. When the other process writes to that page, it will also get a trap, but because the reference count is now 1 it can simply regain sole ownership of the page by re-enabling write permission - no copy needed.
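The fork-and-fault sequence can be sketched in miniature, outside the real Minix message machinery (all names here are mine, and the "allocator" is a toy): cow_fork shares every data page read-only and bumps reference counts; cow_fault copies only when the faulting process is not the last referent.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_PAGES   32
#define PG_PRESENT 0x01
#define PG_WRITE   0x02

struct pdesc { uint16_t phys; uint8_t flags; };
struct proc_sketch { struct pdesc data[NR_PAGES]; };

static int refcount[2048];      /* one count per physical page       */
static int next_free = 100;     /* toy allocator: hand out new pages */

static int alloc_page(void) { int p = next_free++; refcount[p] = 1; return p; }

/* fork(): duplicate the page table, mark both copies read-only, and
 * bump each shared page's reference count.  No page data is copied. */
static void cow_fork(struct proc_sketch *parent, struct proc_sketch *child)
{
    memcpy(child->data, parent->data, sizeof(child->data));
    for (int i = 0; i < NR_PAGES; i++) {
        if (parent->data[i].flags & PG_PRESENT) {
            parent->data[i].flags &= ~PG_WRITE;
            child->data[i].flags  &= ~PG_WRITE;
            refcount[parent->data[i].phys]++;
        }
    }
}

/* Not-Writeable fault on virtual page 'vp': if we are the last
 * referent, just re-enable write; otherwise allocate a fresh page
 * (a real handler would also copy the old page's contents into it)
 * and drop the shared page's count. */
static void cow_fault(struct proc_sketch *p, unsigned vp)
{
    uint16_t old = p->data[vp].phys;
    if (refcount[old] > 1) {
        refcount[old]--;
        p->data[vp].phys = (uint16_t)alloc_page();
        /* ...copy contents of page 'old' to the fresh page here... */
    }
    p->data[vp].flags |= PG_WRITE;
}
```

Tracing the example from the text: after cow_fork the shared page's count is 2; the first writer copies and drops it to 1; the second writer sees a count of 1 and simply reclaims the page.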
There is a problem here, though. Consider in more detail the actions that might have to take place following a page fault. When a page fault happens, control is vectored through the normal trap handling mechanism and eventually winds up in exception(). This routine needs to decide whether this is a fault or signal that can be handled, and if so it will construct an appropriate message. Let's assume that we have a new kernel task, PAGER, that will receive page fault messages. So, following a page fault the faulting process will be blocked and a message will be sent to PAGER.
PAGER, then, will need to examine the mem_map structure of the faulting process, possibly allocate a fresh page of memory and update page reference counts. It seems natural that the Memory Manager server (MM) would be responsible for handling these tasks, so you'd expect PAGER to accomplish them by sending appropriate messages to MM. However, what would happen if the page fault occurred while MM was executing a system call on behalf of a process, and that fault needed copy-on-write servicing? The answer is deadlock. One of Minix's key simplifications is that user-mode servers like MM and the file system (FS) are single-threaded. They can only handle a single message at a time. If either were blocked while waiting for a page fault to be handled, it would go deaf - not accepting any new messages.
There are several possible solutions here. We could force the user-mode servers to make sure any possible page faults are handled before "accepting" a new message. Similarly, we could rework the servers to be able to save partially completed faulting operations and then later restart them following the servicing of the page fault. Both of these methods, however, are ugly and error-prone, as you have to be sure to identify and prevent all possible faulting situations before they happen.
The solution I'll be using is to break the deadlock cycle by creating a new task/server called MAPPER. MAPPER will have a very simple job - basically storing information about physical memory pages and mappings between virtual and physical memory. To ensure that it cannot be part of a deadlock cycle, it will never send a message on its own behalf. It will only receive messages, carry them out and send a response. We will move the memory allocation functionality from MM to MAPPER. So, when it comes time to handle a page fault, PAGER can avoid having to send a message to MM or FS. Instead, it need only talk with MAPPER, and thus avoid the deadlock situation.
Interestingly, some of the memory allocation capabilities that we are moving from MM to MAPPER become much simpler in a paging environment. We no longer have to worry about finding a contiguous block of physical memory (as MM does now), but can just keep a list of free blocks and allocate an appropriate number of them without worrying about their physical relationships with one another. This mechanism also opens up the possibility of a more comprehensive caching mechanism (but more on that when we discuss demand paging).
MAPPER will have some odd quirks. To meet its requirement that it never initiate a message, we must require that all callers handle parameter passing. If a caller needs to pass in or retrieve more information than can fit in a standard message body, it must first send a message to MAPPER to request private parameter storage. The caller is responsible for either filling that parameter storage prior to sending the "active" message request, or retrieving the needed data from that area later. To support copy-on-write, a handful of new system calls will be handled by MAPPER.
Demand paging support builds on the infrastructure created for copy-on-write, relying heavily on MAPPER and PAGER to avoid deadlocks and handle page fault requests. The new feature is that unlike copy-on-write (which is triggered by Not-Writeable faults on existing pages), demand paging allows us to defer loading a page in the first place and is triggered by the Not-Present page fault.
At exec() time, instead of allocating new physical memory pages for code and data and then pre-loading them with the contents of the executable, we instead simply establish a mapping between the code and data virtual address space in the new process and the locations in the executable file that hold the code and data. No physical memory pages are allocated at this time. When the process begins running, it will trigger Not-Present page faults when it attempts to access that non-existent code and data. These faults will be converted to messages sent to the PAGER task, which will then allocate physical memory and initialize it with the appropriate data from the executable file.
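The load-on-first-touch behavior can be sketched as follows (a toy model, with a byte array standing in for the executable file and all names being assumptions of mine): nothing is loaded at exec() time, and the Not-Present fault handler fills a frame from the backing store only when a page is first touched.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE  2048
#define NR_PAGES   32
#define PG_PRESENT 0x01
#define PG_FILEMAP 0x02   /* backed by a region of the executable */

/* Per-page state: where the page lives on "disk", and the physical
 * frame it occupies once resident. */
struct dpage {
    uint8_t  flags;
    uint32_t file_off;
    uint8_t  *frame;
};

static int fault_count;   /* counts Not-Present faults, for the demo */

/* Not-Present fault on virtual page vp: grab a frame and fill it from
 * the backing store.  (In the real design this is PAGER allocating a
 * page via MAPPER and reading blocks from the device driver.) */
static void demand_fault(struct dpage *pt, unsigned vp,
                         const uint8_t *file, uint8_t *frame)
{
    fault_count++;
    memcpy(frame, file + pt[vp].file_off, PAGE_SIZE);
    pt[vp].frame = frame;
    pt[vp].flags |= PG_PRESENT;
}

/* A read access: faults only if the page is not yet resident. */
static uint8_t dp_read(struct dpage *pt, unsigned vp, unsigned off,
                       const uint8_t *file, uint8_t *spare_frame)
{
    if (!(pt[vp].flags & PG_PRESENT))
        demand_fault(pt, vp, file, spare_frame);
    return pt[vp].frame[off];
}
```

The second access to an already-resident page generates no fault - which is exactly why code that is never executed never costs a disk read.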
The necessary capabilities for establishing the virtual <-> disk file mappings can be generalized into a pair of new system calls, mmap() and munmap(). These new system calls will be implemented in MM and will be used during exec() handling, as well as be available for general user-mode program use (mmap() is commonly used to support inter-process communication via shared memory regions). Refer to a standard POSIX man page for mmap() for details (and note that my Magic-1 implementation will likely only support a subset of standard mmap() capabilities).
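For reference, here is the shared-memory IPC idiom as it looks on an ordinary POSIX system today (this is standard POSIX/Linux usage, not Magic-1 code - whether my subset will include MAP_ANONYMOUS is an open question): parent and child share one mapping, so the child's write is visible to the parent.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map one shared, anonymous page; fork; let the child store a value;
 * have the parent observe it.  With MAP_SHARED the page is the same
 * physical memory in both processes, so no copying occurs. */
static int shared_counter_demo(void)
{
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid_t pid = fork();
    if (pid == 0) {             /* child: write to the shared page */
        *shared = 42;
        _exit(0);
    }
    waitpid(pid, NULL, 0);      /* parent: wait, then read the value */
    int seen = *shared;
    munmap(shared, sizeof(int));
    return seen;
}
```

Had the mapping been MAP_PRIVATE instead, the child's store would have triggered copy-on-write and the parent would still see 0 - the two mmap() flags correspond directly to the sharing-vs-copying distinction running through this design.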
The first thing to consider here is that demand paging suffers from the same kind of deadlock issues that we discussed in the copy-on-write section - only worse. During the handling of copy-on-write, the PAGER needed only to allocate fresh memory pages. For demand paging, though, it will additionally need to be able to locate disk blocks on the file system and read/write them - actions normally performed by the File Server (FS). And again, it is quite possible for a Not-Present page fault to occur while FS is handling a system call on behalf of a user process, causing a deadlock situation: FS would be unable to receive any messages from the PAGER task looking to load/store the necessary disk blocks.
The solution will be the same as before - we will move the limited FS capabilities needed by PAGER into the MAPPER task. In this case, what we really need is the FS's knowledge of which disk blocks are mapped to which virtual pages (i.e. - the mapping created by the mmap() system call). So, at the time we establish the mapping we will ask FS to provide a complete description of where the data can be located. We will need the major and minor device, the inode number, the inode's modification timestamp and an ordered list of the physical disk blocks that make up the mapped region. This will be done via a new FS system call: FMAP. MM will do an FMAP call as part of mmap() processing, and will then send the returned info to MAPPER - which will make it available to PAGER on demand.
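The FMAP reply might carry something along these lines (every field name, limit and the blocks-per-page ratio here is a guess of mine, not settled design) - just enough for PAGER to turn a faulting virtual page into a device block request without ever consulting FS:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MAP_BLOCKS  64
#define BLOCKS_PER_PAGE 2    /* assumed: 1K FS blocks, 2K pages */

/* Sketch of the data FMAP returns for one mapped region. */
struct fmap_info {
    uint16_t dev;                        /* major/minor device       */
    uint32_t inode;
    uint32_t mtime;                      /* modification timestamp   */
    uint16_t nblocks;
    uint32_t blocks[MAX_MAP_BLOCKS];     /* ordered physical blocks  */
};

/* First disk block backing virtual page 'vp' of the mapped region;
 * PAGER would read BLOCKS_PER_PAGE consecutive entries from here. */
static long block_for_page(const struct fmap_info *f, unsigned vp)
{
    unsigned idx = vp * BLOCKS_PER_PAGE;
    if (idx >= f->nblocks)
        return -1;  /* past the end of the mapped region */
    return (long)f->blocks[idx];
}
```

Note that the block list makes no contiguity assumption on disk, either: page 1 of the region can live in blocks far away from page 0's.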
PAGER will use this mapping info to create block read/write request messages that will be sent directly to the appropriate device driver tasks to service the faults, bypassing FS and avoiding deadlock.
As an example, let's follow an exec() of a new process.
Roles and Responsibilities
By moving some functionality out of MM and FS, I'm jumbling roles and responsibilities up a bit. Here, I'll try to re-impose some order in the world by clarifying the new roles and responsibilities of FS, MM, MAPPER and PAGER.
[todo: describe in detail the lookup, reference counting and caching mechanisms that MAPPER will need. In short:
- Fixed array of 2048 page descriptors to describe all physical memory.
- Hash lookup for range identifiers and for cached pages, using the dev/inode/timestamp triplet as the hash key.
- Free list threaded through the descriptor table.
- When a reference count goes to zero, don't immediately free the page but rather keep it in a victim cache. If we run out of completely free pages, use a simple mechanism to cull a group of pages from the victim cache. If truly out of pages (i.e. both free list and victim cache empty), probably just kill the process rather than flush to a page file for now.
- Describe mmap() support for MAP_PRIVATE & MAP_SHARED, as well as flushing dirty pages back to disk. Timing issues. What to do when a mapped file is altered by another process? Do we flush and reload? Need FS to keep track of which files are mapped and signal PAGER when an inode changes?
- Describe how heap growth is supported via the brk system call (i.e. map copy-on-write zero pages).
- Describe how we can keep the gap between the end of the heap and the stack unmapped to catch bogus memory references, and how normal stack growth is distinguished from an invalid data reference in the unmapped gap (i.e. examine the current value of SP at the time of the fault, and either map in fresh memory pages if stack growth or send a seg-fault message if a bad gap reference).
- Describe how the a.out format requires that code and data be aligned within the executable file to page boundaries; side discussion of an enhancement to reduce wasted disk space by including the a.out header in code space.
- Describe null pointer dereference detection via use of an unmapped data page 0.
- Brief description of issues around mapping a file whose size is not a page multiple.
- Discuss how Windows sidesteps the issues around modification of a mapped executable by disallowing write access to the file.
- Describe implementation of ptrace() for debuggers by using mmap(MAP_PRIVATE) and copy-on-write on code pages to easily support setting breakpoint instructions in code. mmap() tricks to allow processes to communicate via shared memory.]