Diary 2002

[ Home ] [ Up ] [ New Stuff ] [ Minix Port ] [ Magic-2? ] [ Overview ] [ Photo Gallery ] [ Construction ] [ Technical Info ] [ My Other Projects ] [ Links ]

Diary entries for 2002

9/22/2002

Lots of work since my last entry. I've made good progress on the schematics - and in the process have been forced to give a lot of thought to backplane signal assignment and how to break my implementation over the cards. At most, I can cram 70 20-pin dip parts on a card. My first cut at the card for the ALU and registers was clearly going over budget. To make things fit, I've had to move some control signal generation that I'd expected to do local to that card off-card. This means more active backplane signals, but I don't have much of a choice. On the bright side, anything out on the backplane will be easier to access for debugging - and will be available for bringing out to the front panel.

While thinking about this stuff, I noticed yet another card cage on eBay. This was nicer than the first one I had, so I snagged it for $7.95 (plus $20 or so for shipping). Besides the cage, it also had a full heavy case and power supply. I won't be using either of those, I think. I still want to do a Plexiglas case to show off all my work. Anyway, the cage arrived and I went ahead and built the backplane using wirewrap. It took nearly 1,000 wires to do it - and I can say that I an a lot more comfortable with wrapping than I was beforehand.

Next step is to complete the schematics for at least the register card, and then move on to clock and control generation. These, I believe, I can actually start building before the rest is complete. I'm anxious to get construction going.

8/13/2002

More simplifications as I work through the schematics. I'm dropping support for rotate - it just make things a bit too complicated. To handle rotate, I was going to build Register A out of 4 4-bit universal shift registers. However, handling shift/rotate in both 8 and 16 bits got a bit too messy. I would have ended up with about 10 devices to handle routing and glue logic. So, the simpler method is to just support shifts - and do that by having two sets of bus drivers on the output of the alu. I was going to have to have one set anyway to buffer the signals to send over the backplane. So, I'll have one set of 74244s that are wired straight through, and another that are pre-wired to do a 1-bit right shift. To handle left shift, I'll just add the operand to itself (turning off flag generation). Much simpler, and it will also give me the opcode space to allow shifts of B as well as A.

I will have to decide whether I want to bring A out to the front panel. Perhaps I should just latch the current contents of A and B into shadow registers on the front panel. This way I don't have to carry their signals - just Z and the latch signals.

Also, after a discussion with Ken, I've decided to rename T to MDR (Memory Data Register). That's really what it is. The interesting bit here is that when I first stared this project, the diagrams of similar projects always had an MAR and MDR. I didn't immediately see the need for a MDR, so I didn't put it in. Instead, I was using an weird split-bus arrangement to overlap memory operations with address generation. However, that got unwieldy and my design mutated to putting my T (temp) register between memory and the other registers. Now I see why everyone was using an MDR.

8/02/2002

Starting to work on the schematic fragments, starting with the registers. Some additional simplification: I moved PC from being a register with both two and three-state connections to just three-state. To do this, TPC now has both input and output on the L bus, and it latched when MISC(INIT_INST) shows. This saves two 74244 buffers, but also means that PC won't be shown on the front panel. This is okay, as I will show MAR and the expanded address bus. All I need to do is make sure I also have an LED showing the fetch microcode cycle. When that is showing, it tells you what PC is. Also, at any instruction boundary, PC and MAR will be the same.

7/12/2002

Microcode rewire complete, and work begun on bringing the simulator up to date. At the moment, I appear to be correctly generating the PROM images for the new microcode bits. I think I'm going to go ahead and do a complete rewrite of the simulator. The general plan is to have a handler for each register or functional unit which mimics the behavior of the 74xxx devices making it up. I'll then invoke each function any time one of the clocks changes state. I'm assuming 4 states - CLK_S rising edge, CLK_S high, CLK_S falling edge, CLK_S low. This will make the simulator quite slow, but should help me get the circuitry correct. Later on I can do a fast simulator, if necessary. In other software, I need to do another lcc retargeting pass, as well as updating the assembler. At some point I'll need to do a linker - but haven't decided yet on object file format. I'd like to use an existing linker. Linkers are more difficult than they might seem, and there are lots of things I'd rather spend my time on. For the time being, I figure I can "link" by concatenating the assembly files and assembling absolute executables.

7/7/2002

Lots of progress on the microcode rewrite. It looks like I'm going to end up with about half the microcode as the previous version. This is due to my more aggressive use of field decoding in the compacted microcode structure.

7/6/2002

Working through the microcode, and did some detail work on the fault/interrupt code. It occurs to me that my IVEC_base (interrupt vector base) register isn't all that useful. In practice, I'd initialize it at startup and never change it again. To same myself some wiring, I think I'll eliminate it. Instead, I'll have fixed vector locations starting at address 0x0004 in the supervisor mode data segment. When we're running with address translation off, this will still enable us to have the starting instruction (which would be a 3-byte absolute branch) at address 0x0000.

Note to myself: handle DMA as an unmaskable interrupt (i.e. allow current instruction to complete before recognizing). It can be of low priority, but won't be delayed for more than 1 instruction.

7/2/2002

Managed to spend a couple of hours trying to write the new microcode, and turned up quite a few problems. This new ISA revision is breaking lots of stuff. Here's my to-do list:

	Eliminate the pre-latch conditional branches. In my previous microarchitecture, I had an extra temporary register and could speed branches by a cycle by looking at the comparison condition in the same cycle as the CMP. That's no longer the case, so I might as well always do branches on the latched flag bits. This will actually make the branch stuff a bit cleaner.
	I realized that I had no microcode bit to select whether we needed to do a branch (only the condition for it). I have no free bits, but rather than go from 40 to 48 bits, I'll move remove the priv_inst bit, and make it a decoded output of the misc field. This may slow some privileged instructions a cycle. In its place, I'll add a do_branch bit.
	Specifying the T register's inputs and outputs was a little screwy. Clean it up by using the lo_byte bit for not only memory writes but T writes. Make separate L(T_MEM) and L(T_Z) functions.
	I find that I frequently need to latch T, PC and MAR in the same cycle, but I didn't have a way to express that in the microcode. There's a separate MAR latch control bit, but T and PC shared the L() function. Add a new function case, T_MEM_PC, which latches either a high or lo byte into T from the data bus (based on lo_byte) and also latches the contents of the Z bus into PC.
	There were a few IR-based function cases I needed to add to L() and EL().

Note to myself: Are a carry-in of 0 and MSW[c] sufficient? Will I also need ~MSW[c]? If so, will have to find another bit somewhere.

7/1/2002

Starting a forced week off from work, and it looks like I probably won't be able to spend much time on the project (too many house projects). However, I did make another refinement pass at the microcode instruction format. It's getting a bit more complex as a result of my desire to keep it to 40 bits wide (down from 56 in the old version). Also, I believe I'll be adding two bits to MSW to report whether a memory fault happened using user/supervisor page table, and whether it was in the code or data section. Previously, I only had user/supervisor to deal with - and handled that by using distinct trap vectors for user vs. supervisor faults. I could still do the latter, but I think I'll just go ahead and put the bits in MSW (which will also free up a couple of slots for extra interrupt request lines. These bits will be read-only, and will only latch on a fault.

Anyway, here's the latest and greatest microcode structure definition:

/* Define for micro instruction word.  Assume I'll be using 512x8 bipolar
 * PROMs.  This version is quite a bit more compact than previous ones,
 * but at the cost of having addition field decoding logic.  Initial plan
 * is to send these signals across the backplane and do decoding on the
 * appropriate card.
 *
 * Note that the encoding here is getting pretty ugly.  I'm trying hard to
 * keep the microcode store down to 5 PROMS - 16 bits for enable signals, 
 * 16 bits for latch signals and 8 bits for the next field.  Some of the latch
 * signals also imply enable signals.  In particular, T_[BOTH|LOW|HIGH] drives
 * not only T's latch, but the mux selecting input from the Z and data busses.
 * Similarly, rw and l_byte control not only mem/device latching, but dbus direction
 * and the mux selecting either the high or low byte of T on a write.
 */typedef struct {
	// Latch bits - activate on rising edge of CLK_S
	// ** 16 bits used -> two PROMs
	unsigned latch:4;	// Register latch signal. Value:
				// 0x0 : none
				// 0x1 : MSW (flag nibble only, from Z)
				// 0x2 : C
				// 0x3 : PC
				// 0x4 : DP
				// 0x5 : SP
				// 0x6 : A
				// 0x7 : B
				// 0x8 : PTB
				// 0x9 : IVEC
				// 0xa : SSP
				// 0xb : T_BOTH (from Z)
				// 0xc : T_LOW (from dbus)
				// 0xd : T_HIGH (from dbus)
				// 0xe : 
				// 0xf : use IR[5..7]
	unsigned lmar:1;	// Latch MAR
	unsigned rw:1;		// 0-> dbus read, 1-> dbus write
	unsigned lo_byte:1;	// 0-> T's MSB to dbus, 1-> T's LSB to dbus
	unsigned lm:1;		// Latch MSW[Mode]
	unsigned lp:1;		// Latch MSW[P] (paging enable)
	unsigned priv_inst:1;	// This inst is privilidged
	unsigned shift_code:2;	// Control for A's shift functionality
				// 0x0 : Shift Left
				// 0x1 : Shift Right
				// 0x2 : Rotate Left
				// 0x3 : Rotate Right
	unsigned misc:4;	// Controls signals which never occur at the
				// same time:
				// 0x0 : none
				// 0x1 : bkpt
				// 0x2 : halt
				// 0x3 : syscall
				// 0x4 : trap on overflow
				// 0x5 : latch PTE
				// 0x6 : latch MSW (flag nibble only, from alu)
				// 0x7 : INIT_INST (clear T, PC->TPC, latch IR)
				// 0x8 : do shift op
				// 0xa : DMA acknowledge
				// 0xb : latch MSW[IE] (Interrupt Enable)
				// 0xc :
				// 0xd :
				// 0xe :
				// 0xf :

	// Transparent latch bits - activite on falling edge of CLK_S
	// ** 16 bits used -> two PROMs	
	unsigned e_l:4;		// Enable L bus
				// 0x0 : MAR
				// 0x1 : MSW
				// 0x2 : C
				// 0x3 : PC
				// 0x4 : DP
				// 0x5 : SP
				// 0x6 : A
				// 0x7 : B
				// 0x8 : T
				// 0x9 : PTB
				// 0xa : IVEC
				// 0xb : SSP
				// 0xc : TSP
				// 0xd : TPC
				// 0xe :
				// 0xf :
	unsigned e_r:2;		// Enable R bus
				// 0x0 : T
				// 0x1 : Immediate
				// 0x2 : Fault code/encoder
				// 0x3 :
	unsigned immval:2	// Immediate value
				// 0x0 : 0
				// 0x1 : 1
				// 0x2 : -2
				// 0x3 : -1
	unsigned op_size:1;	// 0x0 -> 8 bits, 0x1 -> 16 bits
	unsigned aluop:2;	// Which alu operation to perform
				// 0x0 : IR[1..3]
				// 0x1 : AND
				// 0x2 : SUB
				// 0x3 : ADD
	unsigned carry_in:1;	// 0x0 -> 0, 0x1 -> MSW[c]
	unsigned branch_op:1;	// Select branch condition using
				//	0x0 : IR[1..3]
				// 	0x1 : IR[5..7]
				// Non-negated conditions are:
				// 	0x0 : latched eq
				// 	0x1 : eq (same cycle cmpb)
				// 	0x2 : lt (same cycle cmpb)
				// 	0x3 : le (same cycle cmpb)
				// 	0x4 : latched ltu
				// 	0x5 : latched leu
				// 	0x6 : latched lt
				// 	0x7 : latched le
	unsigned branch_dir:1;	// 0x0 -> don't negate, 0x1 -> negate
	unsigned user_ptb:1;	// Use user's page table base
	unsigned data_ptb:1;	// Data region of page table vs. code region

	// Note sure whether to use transparent for this...
	// ** 8 bits used -> one PROM
	unsigned next:8;	// Next micro-op to exec. 0x00 means
				// use output of priority encoder, 0xff
				// means use IR[0..7].  Also a significant
				// bit of !(IR[0..7]==0xff) to give the full
				// 9-bit microcode address.
} micro_inst_t;

6/27/2002

Haven't had much time to think about SP/SSP during user-kernel mode switch, but it does occur to me that if I require word (16-bit) alignment of SP, I can eliminate TSP. TSP and TPC are used to hold the starting value of SP and PC so that I can roll back in case of a trap. For all other machine state, I just make sure that I don't change it until we've passed any possible trap points. If I require 2-byte alignment of SP, I believe I can treat SP like the others and won't need TSP to roll back to. The problem with byte alignment is that I might take a fault on the 2nd byte, after updating SP after the first byte load. With word alignment, I'll either trap on the first byte or neither.

This convention also simplifies SSP update. I can now just wire SSP's latch input to a simple function of SP's latch input and the MSW mode bit. Actually, it will probably be a bit more complicated - but I haven't yet had time to sit down a map out the sequence of events during user<->supervisor mode transitions.

I consider it bad form to require something like word-aligned SP and then allow SP to ever be byte-aligned. I can easily deal with this. All of my register transfers are now done via an ALU operation. I'd planned on AND'ng the source with 0xffff and latching the result in the target. My microcode immediate generator can do 1,0,-1 and -2. By AND'ng with -1 I can do a simple copy. For SP, I can AND with -2 and auto-clear the low bit. Alternately, I could just hard-wire the LSB of SP to 0. I guess I'd prefer to handle this in software. I might change my mind later and it will be easier to handle in mirocode rather than rewiring.

6/25/2002

Finished a superficial reading of Tanenbaum's Operating Systems Design & Implementation, the Minix book. Although porting Minix to M-1 may be unrealistic given time and address space limits, my working plan is to do so. I just had a thought, and figured I'd better write it down:

Why not have SSP exactly track SP while in supervisor mode, but not be changed while in user mode? The idea here is that when we fault or take an interrupt, M-1 will spill state onto SSP. For a user-mode process to kernel transition, this spill can be into the process' save stack (ala Minix). Then, we switch SP to the kernel stack (and SSP magically tracks). If we then get a new interrupt, we'll spill normally onto the kernel stack - and only need to check whether we're already handling an interrupt to know not to switch again to the kernel stack base. Maybe then we can always RFI from the kernel stack. If we were handling a nested interrupt, we resume with the previous handler. If not, perhaps I can have a canned return below the kernel stack start which will send us to the code to schedule the next process by pointing SP (and SSP) to its stack save area in the process table. Implementation-wise, this should be very simple. I can make SSP's latch a simple logical function of SP's latch and the mode bit in the MSW. This means I can also eliminate any explicit SSP-setting instructions. If I'm in kernel mode (which I must be to set SSP), I can set SSP by setting SP. I need to think this through, but so far I like it. Oh - here's another thought: should SSP update happen at fetch time? In other words, do I even need TSP? Well, yes, I guess I do. However, perhaps the best thing here is to set SSP's latch bit to a function of the INIT_TSP_TPC_T_FETCH microcode bit and MSP's mode bit.

6/22/2002

Definately decided to go with split code/data address spaces. Haven't yet worked out all the details. Also considering instructions to assist virtual to physical address conversions.

6/21/2002

Considering yet another significant change to the architecture: move to split code/data regions. Right now, I limit a process to a 64Kbyte address space with combined code and data, using 64 1K pages mapped into the 22-bit address space. Might it be better to split code and data, giving each process a maximum of 128Kbytes? One way to do this would be to have separate page tables for PC accesses and regular data accesses. I could still use a single page table base, but would OR in a new microcode bit at page table entry selection time to hit either a data or code portion of the table. I'd probably want to increase the page size to 2K to keep the page table setup time reasonable. I'll need to add some new instruction to copy between memory regions, and would likely require those to be supervisor-mode only.

I'll also have to make sure I can easily represent which table was in use at the time of a memory access fault. This will require some thought. I'd previously solved this problem for deciding whether the fault happened when using the user vs. supervisor page table by having separate fault vectors for the two flavors. I don't think I can double again - as that will really mess up my microcode sequencing scheme that uses the output of the priority encoder to select which of the first sixteen microcode ops are used to transition either to fetch or a fault/interrupt. There just isn't enough room to accommodate user-code/ user-data/ supervisor-code/ supervisor-data, etc. This needs thought.

On anther topic, I think I can eliminate a microcode bit from the IMMVAL construction. Right now I'm using 3 bits, but I think I can get away with 4 values selected by two bits: 0, 1, -2 and page_size. This bit would be the one needed to be freed up to select code/data page table. Here's the current cut of the micro-instruction word:

/* Define for micro instruction word.  Assume I'll be using 512x8 bipolar
 * PROMs.  This version is quite a bit more compact than previous ones,
 * but at the cost of having addition field decoding logic.  Initial plan
 * is to send these signals across the backplane and do decoding on the
 * appropriate card.
 */

typedef struct {
	// Latch bits - activate on rising edge of CLK_S
	// ** 16 bits used -> two PROMs
	unsigned latch:4;	// Register latch signal. Value:
				// 0x0 : none
				// 0x1 : MSW
				// 0x2 : C
				// 0x3 : PC
				// 0x4 : DP
				// 0x5 : SP
				// 0x6 : A
				// 0x7 : B
				// 0x8 : T
				// 0x9 : PTB
				// 0xa : IVEC
				// 0xb : SSP
				// 0xc :
				// 0xd :
				// 0xe :
				// 0xf :
	unsigned lmar:1;	// Latch MAR [REVIEW - need separate?]
	unsigned lm:1;		// Latch Mode bit within MSW
	unsigned lei:1;		// Latch interrupt enable (EI) bit in MSW
	unsigned lp:1;		// Latch (P)aging enable bit in MSW
	unsigned priv_inst:1;	// This inst is privilidged
	unsigned dma_ack:1;	// DMA acknowledge bit (bus is now free)
	unsigned shift_code:3;	// Control for A's shift functionalityj [NEEDS REVIEW]
	unsigned misc:3;	// Controls signals which never occur at the
				// same time:
				// 0x0 : none
				// 0x1 : bkpt
				// 0x2 : halt
				// 0x3 : syscall
				// 0x4 : trap on overflow
				// 0x5 : latch page table entry
				// 0x6 : latch flags (from alu op)
				// 0x7 : init_inst (clear T, PC->TPC, SP->TSP, latch IR)

	// Transparent latch bits - activite on falling edge of CLK_S
	// ** 16 bits used -> two PROMs

	unsigned e_l:4;		// Enable L bus
				// 0x0 : MAR
				// 0x1 : MSW
				// 0x2 : C
				// 0x3 : PC
				// 0x4 : DP
				// 0x5 : SP
				// 0x6 : A
				// 0x7 : B
				// 0x8 : T
				// 0x9 : PTB
				// 0xa : IVEC
				// 0xb : SSP
				// 0xc : TSP
				// 0xd : TPC
				// 0xe :
				// 0xf :
	unsigned e_r:2;		// Enable R bus
				// 0x0 : T
				// 0x1 : Immediate
				// 0x2 : Fault code/encoder
				// 0x3 :
	unsigned immval:2	// Immediate value
				// 0x0 : 0
				// 0x1 : 1
				// 0x2 : -1
				// 0x3 : page size (2048?)
	unsigned aluop_size:1;	// 0x0 -> 8 bits, 0x1 -> 16 bits
	unsigned aluop:2;	// Which alu operation to perform
				// 0x0 : IR[1..3]
				// 0x1 : AND
				// 0x2 : SUB
				// 0x3 :
	unsigned carry_in:1;	// 0x0 -> 0, 0x1 -> MSW[c]
	unsigned branch_op:1;	// Select branch condition using
				//	0x0 : IR[1..3]
				// 	0x1 : IR[5..7]
				// Non-negated conditions are:
				// 	0x0 : latched eq
				// 	0x1 : eq (same cycle cmpb)
				// 	0x2 : lt (same cycle cmpb)
				// 	0x3 : le (same cycle cmpb)
				// 	0x4 : latched ltu
				// 	0x5 : latched leu
				// 	0x6 : latched lt
				// 	0x7 : latched le
	unsigned branch_dir:1;	// 0x0 -> don't negate, 0x1 -> negate
	unsigned user_ptb:1;	// Use user's page table base
	unsigned data_ptb:1;	// Data region of page table vs. code region

	// Note sure whether to use transparent for this...
	// ** 8 bits used -> one PROM
	unsigned next:8;	// Next micro-op to exec. 0x00 means
				// use output of priority encoder, 0xff
				// means use IR[0..7].  Also a significant
				// bit of !(IR[0..7]==0xff) to give the full
				// 9-bit microcode address.
} micro_inst_t;

6/13/2002

Reworked the microarchitecture block diagram and posted it to the web site.

6/12/2002

The ISA redesign is going well. I've finished the first pass, and am much happier with the tightness of the encoding than I was with the old instruction set. I decided to do a complete microcode rewrite, in part because the better encoding will allow me to use much less microcode - but also because it's clear to me now that I'll be reworking the internal busses. My early sketches provide a significant simplification - and may even allow me to use six rather than seven microcode PROMs. I'll eliminate the separate address unit and force all arithmetic ops and bus transfers through the main ALU. This will cause an extra cycle to be added to many instructions (particularly loads). That's bad, but I think the simplification is worth it. Now that I'm getting close to starting construction, a change that can save hundreds of wire-wrap connections seems a lot more important than an extra cycle.

6/5/2002

Starting to change the architecture write-up to reflect the new register model. I haven't yet decided how much (if at all) I'm going to change the basic internal bus structure. With AD0 becoming B and an aluop source and memop target, it might make sense to move that register into the integer unit. In fact, if I want to allow 8-bit ops, I'll have to because the address unit busses are unified 16-bits wide. However, such a change would cause a lot of rippled changes in everything. If I leave the bus structure as-is, I believe I can provide the functionality via microcode, though operations using B as an aluop would be a at least a cycle slower than would be possible if A were in the integer unit. Decisions, decisions, decision....

6/4/2002

Here's a fibonacci program compiled with my retargeted lcc (using the new a/b register conventions). I still some tuning to do to improve the code generation, but it's pretty close:

; Program start
    .global _fib
    .cseg
_fib:
    enter 10
;int fib(int n) {
; if (n < 2) {
    ld.16 a,4+12(sp)
    ldi.16 b,2
    cmpb.sge a,b,L2
; return n;
    ld.16 a,4+12(sp)
    br L1
L2:
; return (fib(n-1) + fib(n-2));
    ld.16 a,4+12(sp)
    sub.16 a,1
    st.16 (sp+0),a
    call _fib
    st.16 -2+12(sp),a
    ld.16 a,4+12(sp)
    sub.16 a,2
    st.16 (sp+0),a
    call _fib
    st.16 -4+12(sp),a
    ld.16 b,-2+12(sp)
    copy a,b
    ld.16 b,-4+12(sp)
    add.16 a,b
L1:
    leave
    ret
    .global _main
_main:
    enter 14
;int main() {
; int num = 10;
    ldi.16 a,10
    st.16 -2+16(sp),a
; int result = fib(num);
    ld.16 a,-2+16(sp)
    st.16 (sp+0),a
    call _fib
    st.16 -4+16(sp),a
; printf("Fib %d -> %d\n",num,result);
    lea a,L5(dp)
    st.16 (sp+0),a
    ld.16 a,-2+16(sp)
    st.16 (sp+2),a
    ld.16 a,-4+16(sp)
    st.16 (sp+4),a
    call _printf
; return(0);
    ldi.16 a,0
L4:
    leave
       ret
L5:
    .defb 70
    .defb 105
    .defb 98
    .defb 32
    .defb 37
    .defb 100
    .defb 32
    .defb 45
    .defb 62
    .defb 32
    .defb 37
    .defb 100
    .defb 10
    .defb 0
; End of program
    .end

6/3/2002

Some more thought on the ISA rework. I think it's going to be much better. I'll now have the following registers:

	a - general purpose load/store plus operand and result of alu ops
	b - general purpose load/store plus operand of alu ops
	c - special register for block copy byte count
	sp - stack pointer
	dp - global data pointer
	pc - program counter

The difficulty now is how to handle address generation for a and b. My previous arrangement was working out quite well allowing simultaneous alu operations and address generation. If a and b are both in the integer unit, that presents a bit of a problem. I need to rethink my internal busses, and I'll likely need to find some additional microcode bits. Right now all are being used, but there are 4 mutually-exclusive trap condition bits that I can compress to a 2-bit field if I add a field decoder.

6/2/2002

I'm nearing completion of the lcc retargeting, but it's causing me to seriously rethink a good chunk of my instruction set. I had sacrificed load/store/aluop orthogonality (is that a word?). However, I'm realizing now that the oddities that I though could fairly easily be dealt with by a smart compiler are going to be much more difficult to handle than I first thought. In truth, I am simply not ever going to have the time to do a sufficiently intelligent compiler on my own. I'm thinking now that I need to do some serious instruction set modifications. In short, instead of an accumulator and a set of address registers, I think I need to have two more general-purpose registers. Perhaps change AD0 to "B", and allow indirect load/store from both A and B. I'd keep SP, DP and AD1 (which would become a special purpose for block moves). Or, perhaps I can get rid of AD1 altogether, and do a "move (a++),(b++)" type of instruction? No, that's ugly - where would I keep the move count? Oh, why name make AD1 "c" - for loop count. I could use the adder in the address register to count down, though I'd need something to allow me to compare it to zero. Here's an idea: terminate when (A+C)==A. I can use the regular ALU for the comparison, I think.

This is going to require quite a bit of thought, as it means serious microcode rewriting. I think I'll proceed with an lcc retarget which assumes the changes, and see how much better the generated code is.

Right now I'm supporting (AD0+u16) and (SP+u16) addressing. Most of the time an 8-bit offset will be sufficient, so perhaps I'll remove these two addressing modes and add (A+u8) as a general address mode. I'd also need to allow AD0 (now "B") to be a load and store target. That will be a bit tricky - not much opcode space. I won't have B be a general arithmetic operand - still "aluop A,<addr>", but I will allow additions to B using lea and the adder in the address unit. Hmm, this may work out. Need to sleep on it.

6/1/2002

The lcc retargeting is going two-steps-forward, one-step-back. I'm definitely going to need a post peephole pass to clean up my usage of memory-mapped pseudo-regs. Still, though, I think it's going to work. It will be interesting to see if I can cross-compile Minix with my retargeted lcc. Of course, I'll need to dredge up standard C libraries. I'm hoping that won't be too difficult to find, though GNU stuff probably relies on gcc extensions. The exercise continues to be useful for validating my instruction set. So far, I've decided to eliminate the generic lea support for AD1, and substitute A instead. This will leave AD1 pretty much crippled (only useful for indirect load/store with no offset, and as a block copy source). I think that's okay, though. I'm not going to have an optimization pass sophisticated enough to really take advantage of two general-purpose address registers, and by using A instead I can more easily materialize and store/pass addresses. I haven't yet redone the microcode. I think I'll wait until the lcc retargeting is complete and I have the chance to look at a bigger range of the generated code. Also, I need to change my calling conventions a bit. I had previously planned on passing back return values in the caller's frame. That's dumb. I'll pass 1 & 2-byte values in A, and will pass larger structures via a hidden pointer passed as a first parameter to a function. The caller will be responsible for allocating the space and passing the pointer, and the function will do the block copy before returning.

5/31/2002

I'm making good progress on retargeting lcc to Magic-1, and am now convinced that it's going to work (though as a cross-compiler, not native). lcc is really expecting a machine with lots of registers, so I did have some problems starting out. My solution was to pretend I have some extra registers to keep lcc happy, but map those regs into frame memory. The code lcc produces for this is initially horrible (as register-to-register copies turn into memory copies), but with a bit of work I am arranging the copies to show up in patterns easily detected (and cleaned-up) by a planned post-processing peephole optimization pass. This, I believe, will end up yielding quite decent code. I've been experimenting with a throw-away machine description file, and will shortly start work on the real one. With any luck, I should have a complete ASCII C compiler for Magic-1 by Monday. I'll be using 16-bit ints, but will support 32-bit longs and 32-bit floats. Long long and double will each be 32-bit also to simplify things.

05/22/2002

Starting to play around a bit with lcc, an ANSI C retargetable compiler. It looks like it will be quite a bit of work to get this targeted to M-1, but it would sure be nice to have a real C compiler. It would necessarily be a cross-compiler, though. Too big to run in my 64k-byte process model. That's okay for now.

05/11/2002

Ah, what timing. I've landed at the new job just in time for a merger (and associated layoffs). The bad news is that I'm at significant risk of being laid off. The good news is that if I do, I'll have more time to work on this project. I've been able to do very little with the project lately, but my big news is that I found a fantastic card cage on Ebay. I'll need to redrill some holes to mount my four wire-wrap boards, but it is just what I need. In other news, my old laptop bit the dust so I've moved to a Dell Inspiron 8100, running the SuSE 8.0 Linux distribution. While doing this, I discovered a bug in perl script that generates the microcode image from the microcode web page. There was a slight formatting difference between the Lynx browsers in the old and new systems that tripped up my code. Should be a bit more robust now.

03/29/2002

The new job is keeping me busy, so not a lot of progress. After more discussions with Gil, I've decided to go with a backplane using dual 96-pin sockets that match the wire-wrap prototype boards I've got. The clock circuitry is pretty much nailed down, and I've been sketching out the rest of the control signals. Next up is fleshing out the front-panel design and layout, followed by signal assignment for the backplane. I hope to begin construction soon in the following order:

Backplane

Front panel

Clock generation (first blinky lights!)

Memory board (testing front-panel memory examine/modify mechanism)

Enough of microcode/control/memory to fetch and execute HALT instruction.

03/09/2002

Had some good discussions with Gil Smith and have made some changes. First, I'll be using wire-wrap prototype boards rather than mucking around with wire-wrap sockets and copper tape for ground and power planes. I found some cheap surplus boards (about 30 years old!) that should work well - though they are just a bit smaller than would be perfect. Also, Gil suggested that I look at transparent latches for some parts of the design. Good idea - by using transparent latches for key portions of the microcode sequencer and the instruction register I believe I can significantly improve cycle time. And finally, the discussions spurred me into a long-overdue revision of the working schematics. So far I've got the clock/reset generation, address register and microcode field decoding circuits complete.

Once I get this done, I'll also need to make a pass over the microcode and simulator. I need to flip the order of the starting and continuing microcode halves (direct index will be 0x000 to 0x0ff and continuation will be 0x100 to 0x1ff). Also, I'm splitting the microcode fields into the enable signals vs. the latch signals. The enable signals will stored in transparent latches to make them available a little earlier. The latch signals, though, must not appear until they are stable (at the rising edge of the microcode clock).

In other news, Ken Sumrall located the references to the Spider homebrew computer mentioned on the links page. He's going to be photocopying the articles, and has pointed me to the web page of the guy who did it. I'll be contacting him shortly to see if there's any more info I can add or link to.

02/28/2002

Woo Hoo! I haven't had much of a chance to work on the project lately (lots to do in between jobs), but I found an hour and worked on paging support in the simulator. It seems to work - I have a test program that takes a page fault and also throws a fault on non-writeable page.

02/21/2002

Reworked the microcode trap vectors to support both user and supervisor versions of page not present and page not writeable faults. Also, pretty much decided to allocate that extra bit in the page table entry to expand the physical address space to 22 bits. We won't need anywhere near that much space for regular processes, but it could come in handy as a RAM-disk, if I ever get that far. Also did a bit of fiddling with the simulator and it is now very close to supporting the address translation mechanism.

02/20/2002

Fiddling some with traps. When I take a page fault or page permission fault, the fault handler needs to know whether the access was using the page table described by the current PTB, or whether I'm using the supervisor page table. I think the simplest thing to do here is simply to add a couple of different vectors. So, if we take a page fault using the user base, we'll vector to fault_np_u. If we're using the supervisor table, we'll use fault_np_sys. Similarly for not writeable.

02/19/2002

Changing jobs - gave notice yesterday. I'll be taking some time off between and hope to make significant progress.

02/10/2002

Eliminated the half-carry and replaced it with an 8-bit overflow. Also eliminated branch on S bit, replacing it with signed less-than (S xor V) for both 8 and 16 bits. If anyone needs to branch on S, they can always just do a branch on bit of the result.

02/09/2002

Added a few block copy operations: bcopy, tosys and fromsys. I'm pleased with the way these came out. I originally had various flavors of single-item copy, but it occurred to me that it would be very simple to make them iterative. The "sys" flavors utilize the page table base override to do block copies to and from system memory. These guys iterate by conditionally repointing pc to the beginning of the instruction. This also has the nice benefit of allowing them to be interrupted and resumed. Functionally, the look like:

while (A) {
AD1[A] = AD0[A];
A--;
}

Starting to think a bit about what kind of rudimentary operating system to do. My first thought is to port a Forth system, which should be quick and easy. A more interesting thought is to try a severely hacked version of Minix. This may be completely unrealistic, though, given my 16-bit virtual address space limitation. However, depending on how successful my shared library scheme is, there might be some possibilities. Because shared libraries will function somewhat as overlays, I can have much more than 64K bytes of code in a process, and perhaps I can come up with a memory-mapped file mechanism (with pseudo-segmentation) to alleviate pressure for processes that need a lot of data space. This needs a lot more thought.

02/05/2002

The assembler is shaping up and is basically functional. So, I've been starting to write tests and have run into a few more microcode bugs. The most significant relates to flag generation. I presumed that I could determine whether to generate flags based on the 8-bit vs. 16-bit results by looking at whether the ALU result was driving the X bus as 8 or 16 bits. However, this doesn't work for compare operations because we discard the ALU result. So, I've now allocated by last microcode bit to signify 8 vs. 16-bit alu operation size. Hope I don't need any more new microcode bits...

01/30/2002

Updated the microcode to reflect the ISA changes. I continue to take the approach that so long as I've got space in the microcode ROM that I'll just duplicate sequences rather than add logic. The result is that I've got a lot of almost identical microcode sequences - and I'm finally running out of microcode ROM space. It would be very easy to radically reduce the amount of microcode needed by adding some additional field decode functions based on bit patterns in the instruction byte.

01/29/2002

I've been thinking quite a bit about the C compiler I'll be writing and the instruction set. As a result, I've decided on some fairly significant changes in the ISA. In short, there is a fair bit of the opcode space devoted to addressing modes and instructions that I don't think I'd use much, as well as some awkwardness in code patterns I expect to be frequent.

Let's consider the former. My standard addressing modes includes (AD0) and (AD1). However, in practice I expect that my simple compiler will typically load one operand into A and point to the other via AD0. I'm not sure there's much value in having a great deal of expressability off of AD1. Much better, I think, is to have (AD0 + offset) addressing. Ideally, though, I'd have 8- and 16-bit offset forms. And, in fact, this is what we're going to do. Also included in the standard addressing modes is (DP+#u8) and (DP+#u16). In practice, unless I do global optimization or require attributes, my compiler won't generate (DP+#u8) references (because the actual offset won't be known until link time). So, I'm going to eliminate (AD1) and (DP+#u8) addressing and replace them with (AD0+#u8) and (AD0+#u16).

Given the radically fewer usages of AD1 this represents, I considered dropping the register altogether. However, memory copies were really painful without it, and I think I can also get tighter array access code if I keep it.

Next on the hit list is the XOR group. I've had my eyes on this for awhile. I need xor, but I expect it's rare enough in use that it doesn't really warrant occupying 16 opcode slots. I've left it in so far because I didn't have a good replacement in mind. Now I think I do. I have a rich set of compare and branch instructions, but they don't cover all possibilities. In considering my expected simple code generation scheme, I expect to generate worst-case patterns and then use a peephole pass to select tighter patterns when possible. For example, consider:

if (foo==2)
foo = 3;

When generating the "if" pattern, I won't know the size of the "then" clause, so I can't assume that I can use the short compare-and-branch form (because it has only an 8-bit branch displacement). So, I'll generate:

    ld.8    a,(sp+foo_offset)
    cmp.8   a,#2
    jr.ne    L_then_clause
    ld.8    a,#3
    st.8    (sp+foo_offset),a
L_then_clause:

Assuming foo_offset is short, this would be an 11-byte sequence. I'd then expect peephole to convert this to:

    ld.8    a,(sp+foo_offset)
    cmpb.ne8   a,#2,L_then_clause
    ld.8    a,#3
    st.8    (sp+foo_offset),a
L_then_clause:

This works out to 9 bytes. The problem here is that as of now, I don't have a cmp instruction, nor do I have jr.nc, jr.nz, jr.nv or jr.ns (only the positive versions). In a perfect world, this isn't a problem. I can use sub instead of cmp, and just flip operands or otherwise rearrange the branch pattern to only conditionally branch on positive flags. However, from a compiler point of view it would be much nicer my worst-case branch patterns were orthogonal, and I could punt to peephole to select the special cases. So, the other big changes are to eliminate XOR as a full-featured opcode and replace it with CMP. I'll keep a single XOR, as "xor.16 a,(ad0)" - which is the same form as I have subtract with borrow and add with carry. Also, I'll add the 4 jump relative on negative flag and make all of those guys use a 16-bit branch displacement rather than 8-bit (the compare-and-branch will still use 8-bit signed displacements).

Some other minor tweaks are to delete a couple of funny lea's which use ad1 in a strange way and to recode the magic immediate value in position 2 of the 8-bit standard addressing modes to use either -1, 1 or 0 depending on opcode instead of always 1. ADD and SUB will use 1, AND and OR will use 0xff and CMP will use 0. Also, I'll replace ld.8 a,#1 with clr.16 a.

These changes completely fill out my opcode space, which makes me unhappy. I would still like to work in some store immediate instructions - at least for (sp+#u8) and (sp+#u16) in both 8 and 16-bit forms. However, I may be short on temp registers for these.

01/19/2002

Little progress lately (too much going on). Still need to finish up the assembler and do a test for each instruction to validate the microcode. It will probably be a week or two before I can find the time.

01/18/2002

When I first wrote the microcode, I wasn't thinking about byte order and ended up doing things little-endian. After confessing this to a friend of mine I was suitably shamed and promptly redid the microcode as big-endian. Back on the path of truth and harmony now, my soul is at peace.

01/13/2002

Busy weekend (Elizabeth's 10th birthday), so little progress. I'm still thinking through the new assembler. What I have so far is a perl script which auto generates the yacc grammar from the instruction descriptions on the Microcode web page, along with a simple lex file and some processing code. Qas currently handles basic assembly. The tricky bit I'm thinking about is handling the cases in which an immediate operand can be described in a shorter form if it fits (i.e., add.8 a,1 takes 1 bytes, add.8 a,32 takes two bytes). The tricky bit is that I am going to allow complex immediate expressions which may not be fully defined until all labels have been assigned an address. However, I can't assign a final address to a label until I know how long each instruction is. Rather than mess with an iterative solution here, my current thinking is to perform an allocation pass which assumes worst-case instruction length for unresolved expressions. This pass will assign provisional values to all labels. The next pass will then assemble the instructions using the provisional label values (but not filling in values which depend on labels). Finally, we'll do a backpatch pass to complete assembly with the final label values. Because code can only shrink in the assembly pass, the assembled code will be safe - though there may be borderline cases in which I could use a short-form but don't.

01/10/2002

Not much progress - some playing around with the quick and dirty assembler (qas). I'm using lex and yacc - something I've done infrequently in the past so I have had to read up on them a bit to remember how to use them.

01/08/2002

The microcode is firming up - I hand-coded a fibonnaci function and simulated it successfully. To do so I also had to think about my calling conventions a bit. Here's a quick ASCII sketch. The stack grows down towards decreasing addresses.

|                 |
| arg ..N         |
| arg 1           |
| arg 0           |
| ret value       |
|-----------------|
| prev. frame ptr |
|-----------------| } = current frame
| return address | }
|-----------------| }
|       .         | }
|       .         | }
| compiler spill | }
|-----------------| }
|       .         | }
|       .         | }
| local variables | }
|-----------------| }
|(parameter space)| }
|                 | }
| arg ..N         | }
| arg 1           | }
| arg 0           | }
| ret value       | }
|-----------------| }
| prev. frame ptr | }
|-----------------| << SP of current procedure }

What this picture is intending to show is that I intend to emphasize a fixed-frame calling convention in which arguments are passed in the caller's frame. The callee references all of his incoming parameters SP relative (and will, in fact, reference all locals and spill variables similarly). Functions copy their result into the first slot in the caller's parameter space area before returning. Callers must allocate a parameter space large enough to hold all of the parameters and return values for the largest of its calls.

Items within the frame are accesses as follows:

	Incoming arguments: (SP + frame_size + 2 + argument_offset)
	Passed arguments (SP + 2 + argument_offset)
	Returned return values: (SP + frame_size + 2)
	Received return values: (SP + 2)
	Local variables: (SP + 2 + parameter_space_size + local_offset)
	Spill variables: (SP + 2 + parameter_space_size + locals_size + spill_var_offset)

Here's an example. Consider the following program:

int foo( int n ) {
    return (n + 1);
}

void bar() {
    int result;
    result = foo(10);
}

"bar" has one local variable (result, 2 bytes), passes one parameter( the immediate "10", size 2 bytes) and accepts a 2-byte function result. Thus it's frame would be:

|-----------------| } = bar's frame
| return address | }
|-----------------| }
|-----------------| }
| result | }
|-----------------| }
| argument for fib| }
| fib result | }
|-----------------| }
| prev. frame ptr | }
|-----------------| << SP of current procedure }

The size of the frame is thus 10 bytes, and is created in two step. The return address is pushed to the stack during the original call to bar. Immediately on entry to bar, we execute a "entry -6" instruction, which allocates 6 bytes (2 for result, 4 for argument area) and the pushes the previous stack pointer.

To call foo, bar would store a 10 in the argument area and then execute the call instruction (which pushes a return address). Let's assume that the compiler isn't very clever, and foo allocates a spill variable to save the result of the (n + 1) expression before returning it. So, after bar's entry, the stack would look like:

|-----------------|
| return address |
|-----------------|
|-----------------|
| result          |
|-----------------|
| 10              |
| fib result      |
|-----------------|
| prev. frame ptr |
|-----------------|
| bar ret address | } = foo's frame
|-----------------| }
| n+1 spill temp | }
|-----------------| }
|-----------------| }
|-----------------| }
| prev. frame ptr | }
|-----------------| << SP of current procedure }

Note that foo has an empty outgoing argument region (since it doesn't make any calls) and an empty local variable region. Now, foo find it's data as follows:

load n [the 10] from bar's frame at (SP + 12)

add 1 to n, giving 11

store temporary result in foo's spill area at (SP + 2)

Copy result to return area by loading from (SP+2) and then storing into bar's frame at (SP+10)

Execute a leave instruction to strip the frame (this is actually just a "pop sp"

Execute a return instruction (actually "pop pc").

01/06/2002

Various microcode bug fixes as the simulator get more functional. The biggest snag I've run into so far is the timing of branches. I had written the microcode assuming that I could branch on a condition code in the same T-cycle that it was generated. Actually, I can - but there's a bit of a snag in that I also want to be able to branch on the existing value of a condition code. For doing the same-cycle flag-gen/branch, I want to look at the unlatched flag output of the ALU as of the rising edge of the system clock. For doing the other flavor of branch, I want to look at the latched values. I think my choices are:

Treat the pre and post-latch values as different flags and sample on the rising edge of the system clock. So, I'd have pre_z, latched_z, pre_c, latched_c, etc. The compare-and-branch guys would look at the pre flag values, and the branch on condition-code would look at the latched. This option would yield the fastest operation, but I would have to go from 3 to 4 bits in the microcode for the CBR field to describe the "new" condition flags, and some extra circuitry would have to be added.
Disallow branching on condition codes set in the same T-cycle. In this model, I always use the latched values as of the beginning of the rising edge of the system clock. Hardware-wise, this is the simplest but it would force me to rewrite a large chunk of the branch microcode and lengthen the execution time of most branches by 1 T-cycle.

I'm inclined to go with option #1, but want to think about it a bit before ripping stuff up. Note that branches are handled via the CBR microcode function. The way this works is that if the condition is met the microcode sequences to the address in the next field as usual. Otherwise, you branch to fetch. The branch to fetch is handled by tugging on the asynchronous reset line of the latch holding the current next field. We have to be sure we perform this operation with stable values, so it must happen on the system clock rising edge.

UPDATE: Okay, I've decided to go with option #1. The CBR microcode field will move to 4 bits. When its encoded condition is met, microcode control will continue with the address in the next field. If the condition is not met, control will go directly to fetch. CBR will be encoded as follows:

	0 : TRUE (normal case - always use next field)
	1: B_L_Z (branch on latched value of Z)
	2: B_L_C (branch on latched value of C)
	3: B_L_S (branch on latched value of S)
	4: B_L_V (branch on latched value of H/V)
	5: unused
	6: unused
	7: unused
	8: unused
	9: B_NZ (branch on current NZ output of ALU)
	A: B_LT (branch of current C output of ALU)
	B: B_SLT (branch on current S output of ALU)
	C: B_Z (branch on current Z output of ALU)
	D: unused
	E: unused
	F: use ir[0..3] value to select above mode

Update #2: Things going pretty well. I've just simulated a call/return. See the simulator output on the Software page.

01/05/2002

WooHoo! The simulator is beginning to take shape, and is capable of handling most non-branching or trapping instructions. I'll be fixing microcode bugs and increasing simulator coverage for a while now.

01/03/2002

Recovering slowly, but managed to play with this stuff a little. Created a Source Code page along with the Perl script that generates the microcode images. Next comes the utility to generate the input for the PROM programmer, followed by the quick-n-dirty simulator.

Decided to add a halt instruction. It will take an unused slot in the interrupt logic. Essentially, it will continuously request a "halt" interrupt (which will request a halt interrupt, and so on). However, it will be tied to the lowest priority interrupt line so we can have an external device break us out of halt. Also decided that I don't like the source code page I just added, and will nuke it shortly. Instead, I'll do a general software page which describes the simulators I'm writing and will have the source available for download as a gzip'd tar file.

01/01/2002

Yuck - rotten way to start a new year. Hit with a nasty CD flare-up two days ago and have been mostly out of action. Had hoped to get quite a bit done, but it will have to wait. Did some tweaking with the microcode structure, and realized that I didn't need an explicit control bit to handle steering data between the 16-bit X bus and the 8-bit data bus. Also, no need for an explicit "write" control bit either. Both of these functions can be derived from the ELO, EHI, LLO, LHI and L functions. This saved two microcode bits, which takes us to 53. If I'm physically up to it, will push on getting a quick-n-dirty microcode-level simulator running.

Diary entries for 2001

Hit Counter