After taking a bit of a break from the project, I've started up again. The big push of the moment is to transfer Magic-1's software build environment back to Linux. I started the Magic-1 project using Linux years ago, but moved to Windows XP to reduce the hassle of constantly rebooting my dual-boot laptop between Windows (for general web surfing and office tools) and Linux (for Magic-1 development). I now have a dedicated Linux computer - and a rather special one at that.
My new build machine is a Transmeta Crusoe TM5800 running at 800 MHz. It's not exactly a speed burner these days, but it has personal significance for me. I worked at Transmeta for nearly 4 years on the Code Morphing software for both the Crusoe and Efficeon microprocessors. My new computer is actually a Transmeta Micro-ATX evaluation board which was produced at just about the time I left Transmeta to return to HP (2002). Deep inside of all Crusoe (and Efficeon) microprocessors is some of my code.
So far, I've gotten my retargeted lcc working under Linux (SuSE 9.2), as well as my old quick and dirty assembler (qas) and the utility to generate the microcode PROM images. Still to go is getting the simulator to build, including the nasty part of my build environment which fetches and parses the microcode definitions from the web pages on this website. If I were redoing things today, I'd use XML rather than HTML. My current scheme of stripping out the tags and then parsing is a bit gross. Oh well, it does work.
A few notes to myself in case I have to repeat this process:
It is so nice to be working on Linux/Unix again. I'm not enough of a fanatic to go completely Windows-free (too lazy, I guess), but it's great to be able to pop a window (putty/ssh) on my Windows XP laptop over to my Linux box to get actual work done rather than screwing around with the damn mouse all the time.
Oh, one other nice thing: a while back I bought a new EPROM programmer, a Needhams EMP-10. After getting it, I discovered that it didn't really work that well with my Windows XP laptop (and has since been discontinued by Needhams). I had to jump through hoops to get it to work at all, and programming EPROMs with it was a challenge. It wanted to run in DOS mode, and had some trouble working with XP. On my new Linux box, I was able to get it to work great via dosemu - the Linux DOS emulator. [note to self: use /dev/null as the device name in dosemu.conf when configuring the parallel port for direct hardware access]. I can also keep it hooked up all of the time.
My goal with all of this activity is to prepare for the attempt to port Minix to Magic-1. For that kind of work, I really need my Unix tools - Windows & GUIs would just get in my way. The Minix port will be a huge undertaking. I'd expect at least a year of off-and-on fiddling. As a more short-term project goal, I'm thinking it's about time to turn Magic-1 into a web server. The idea here is to use Adam Dunkels' uIP, which Magic-1 should be able to easily handle. I have two serial ports on Magic-1. One would be used to accept telnet connections just as it does today through the Lantronix device server. The other port would be wired to my Linux box's serial port via SLIP/PPP. My Linux box would forward packets out to the world. Besides getting the uIP code working, I'd also need to enhance Magic-1's bios/monitor to support multi-tasking (so I could accept telnet sessions and serve web pages simultaneously).
But, first things first. Got to get the rest of the build system transferred and working. Also need to post a picture of Magic-1's new home next to my Transmeta box. Next time.
I take some comfort in that it didn't die a wimpy death, just silently slipping away. No, the hard drive in my laptop fought all the way, squealing, crunching and thunking into that long dark night. No data recovery is possible.
The good news is that Magic-1's source code, build system and documentation are safe. This website serves as the official image. I also have been burning backup CDs periodically with my key data. I did one on Jan. 14th. I also did a full disk image backup four months ago, so the recovery hasn't been all that bad. However, I did miss one thing in my backup strategy: email. All of the email I've gotten since 10/1/2004 is gone.
Fiddled a bit with the IDE drivers while watching TV tonight. Previously, when I was trying to run Magic-1 at 3.5 MHz I had the drivers drop the clock speed in half while accessing the 8255. This was necessary at 3.5 MHz, but now that I'm running at 3 MHz I thought I'd try full speed. So far, it seems to work. Disk I/O is now running at 13,000 bytes a second when copying from the HP Kittyhawk drive to a compact flash card, and about 10,000 bytes a second when copying from the compact flash to the Kittyhawk.
Not too much new in the project. I did a bit of cleanup on the IDE support code in the bios/monitor. On start-up it now searches for attached IDE drives, queries their capabilities and, if they support LBA addressing, accesses them via LBA. Otherwise, we fall back to CHS addressing. I also cleaned up the ROM-based bootstrap loader to go ahead and boot a default image if no key is pressed within a minute of reset or power-on. I put this change in because Magic-1's power plug kept getting pulled out of the wall. Now it will completely boot when power returns.
Over the last few days I've been experimenting with hard drives. I have several of the HP Kittyhawk 1.3" microdrives and discovered that one supports LBA addressing, while the other requires the old CHS. The CHS drive is a revision A, while the LBA is revision B. I had written the bios/monitor assuming that I had to support CHS, so all of my IDE accesses go through the trouble of converting logical block addresses to cylinder/head/sector addresses. I'm thinking of just going LBA throughout. It will make life simpler and disk accesses a bit faster.
My standard configuration now is the Kittyhawk as the master IDE device, and a compact flash card as the slave. The slave is pretty much only used for backup. Incidentally, my disk I/O runs about 8K bytes a second when copying from master to slave (doesn't change much whether I'm using compact flash, Kittyhawk or a newer drive). That's pretty slow in modern terms, but fast enough for me.
One unsolved problem, though, is that I haven't yet figured out how to run the Kittyhawk drive as a slave. It seems to only work as master. I tried using all 4 possible jumper configurations, but it either responded to nothing or showed up as master. I haven't yet tried cable select. Perhaps that will do the trick.
I did solve a couple of problems. First, when I first started fiddling with the Kittyhawk I had some machine crashes. Doing a drive copy (8 Mbytes) takes about 15 minutes. My deadman timer was buggy and caused Magic-1 to commit suicide about 10 minutes into the copy. The fact that it crashed at about the same distance into the copy made that one pretty simple to detect and fix.
Somewhat more difficult to find was another crashing problem. This time it was on power-up. It has been months since I had any non-software crashes, but during my fiddling with the hard drives the machine occasionally hung on power-up. Usually, after a few power cycles it came up - but on a couple of occasions died within a minute or so. It seemed to be the case that if it stayed up for more than a couple of minutes, it was solid. I'll make a long story short: intermittent connection on the dedicated power supply for the temporary front panel. The plug had worked its way partially off. The front panel can also obtain power through the signal ribbon cable (and with the HP hex displays and LEDs it draws a lot). When the dedicated supply broke, it drew too much current through the ribbon cable and starved the logic on the device card. Since I firmly reseated the connector, I haven't experienced the problem again.
The last problem solved also solved a mystery for me. A while back I had a software bug which wiped out the drive sector that I'd been using as a file directory. It seemed a bit odd to me at the time that only that sector would get trashed - especially since the operation in progress at the time was a file read - not a write. Tonight I just happened to take a look at a monitor option I hadn't used for a while - kernel verbose mode (command 'V'). This command causes the kernel to display lots of debugging and trace info while performing system calls on behalf of user processes. I ran the guestbook program and listed the guestbook while in verbose mode and was surprised to see a kernel message saying that it was writing the directory block. For a read, this should not happen. I finally traced the bug to my file close code. I was unconditionally flushing any partial file buffer to disk on close. This should only happen if the buffer was dirty (i.e. - I'd done a write). Fixed now, and also cleaned up some of the kernel messages.
The next thing to do is to finish signal and switch assignments for the new front panel layout. I've got a reasonable assignment, but am not happy with my crude attempts at background graphics, font choices, etc. I'll post what I come up with soon.
It took me a while, but Magic-1 is now using an ancient (~1991) HP Kittyhawk microdrive as its primary hard drive. The Kittyhawk is a really slick drive - a real technological marvel at the time of its introduction. The one I'm using is 20 Megabytes, which is plenty for Magic-1.
The key to getting it working was finally getting the partition copy code to work. What took so long is that I still had lots of places in both the monitor and boot loader which used hard-coded drive geometries for my original hard drive. Now the startup code does the right thing and searches for all attached IDE drives. It then queries each one to determine its CHS geometry and uses that when converting logical sector addresses to physical ones.
One somewhat unsettling thing, though. I have experienced a few unexplained crashes. In particular, when doing a partition copy between master and slave drives (8 megabytes at a time), the machine has crashed. For the last couple of copies, I turned off the heartbeat interrupt timer and they succeeded without error. That makes me suspect that I've got an interrupt window somewhere, but I'll have to do many more experiments to make sure.
Anyway, now I've got the ability to do a quick backup of the hard drive image to compact flash card.
Okay, finished restoring the hard drive. It took me a while to get the original adventure back again - mostly because I couldn't remember the build process. So, in case I need to do this again I'll write the steps here:
Very busy Christmas break on the home front, followed by catch-up at work hasn't left me with much time to work on the project. Unfortunately, I found a little bit of time tonight and managed to trash the file directory sector on Magic-1's hard drive. Naturally, I killed it while testing code that was supposed to back up the drive image to the slave drive. I could fairly easily rebuild the data - except for the guestbook. I think the data may still be there on the drive, so I'm going to carefully try to recover it (but not tonight). Meanwhile, I've disabled the guestbook program as well as original adventure.
The big news is massive progress on the enclosure and front panel front - thanks to Alistair. More on this shortly.
[update] - The thought of losing the guestbook data bugged me enough that I decided to try the recovery tonight. Good news: I found and extracted the data intact. I still have a bit of work to do to rebuild the file system and restore the data files needed by Original Adventure. Think I'll wait until after I get my backup code working.
Twenty-seven days of continuous operation at 3.0 MHz -- Magic-1 is rock solid. I finally rebooted tonight to try out something that I'd been putting off: running with real bi-polar PROMs. The bad news is that it didn't work. The good news is that I get to debug some hardware now. I've been doing too much software lately (at the paying job). It will be nice to play with wires and logic probes again. It's probably just a miswiring (but I also need to verify that the PROMs programmed correctly). I'm using MB7124 PROMs, which are supposed to be equivalent to 74S472s (512x8 bits). My friend Ken programmed them at work using an expensive TopMax universal programmer.
When I pulled the control board out to replace the EPROM daughter card with the real PROMs, I did get a bit of a bad surprise. Magic-1 is already clogged with dust. Here's a picture of the card edge nearest the fan:
I'll clean everything out with compressed air soon, but need to stop running in the open cage. It appears, though, that some very exciting news on the front panel and enclosure front may be coming soon.
I updated the monitor program with a fix for the "sector not found" error, but haven't really done much else. The Buzbee household has been a bit crazed with kids, holidays, school homework and the paying job. I'm hoping to get back into the project over the Christmas break.
It's been a bit slow on the Magic-1 front lately, what with the Thanksgiving holiday and a particularly busy time at the paying job. I continue to be pleased with how solid the machine is. As of this writing, it's been up for a full week (and it would be more than two weeks if daughter #1 hadn't accidentally unplugged it).
I have identified a bug in the monitor program. If you do command 3 (identify IDE drives) followed by running a program such as the guestbook ("x" then 11), you'll get a "sector not found" drive error. However, it's a benign problem. I was neglecting to select the master IDE drive prior to running the program, and command 3 left the slave drive selected. So, the first page fault tried to load from the slave rather than the master. Luckily, I retry drive failures and correctly select the master IDE drive in that code so the retry succeeds.
There's a simple fix, but I won't put it in yet because that would mean a reboot and I'd like to see if the machine will stay up another week or so.
I've also started working on a couple of quick side projects for variety. Last weekend I rebuilt the cassette drive capstan roller from an old HP-85 desktop computer a friend of mine gave me. Took a while to locate a tape that hadn't deteriorated, but it's working great now. Not sure what I'll do with it, though.
I've also started gathering requirements for a new version of my teletype control program, HeavyMetal. Gil Smith has designed a fantastic new teletype interface box, and I'm going to rev my control program to take advantage of it. Finally, the same friend who gave me the HP-85 also gave me a slightly broken 1 GHz digitizing oscilloscope he salvaged from a junk bin. I think it works well enough that I should be able to try measuring the speed of light with a laser pointer, a couple of mirrors and some photodiodes. I never did science fair when I was younger - guess I'm trying to catch up.
Anyway, back to Magic-1. My plan of the moment is to write a functional simulator in C, but using a global machine state record that can be used by C# code. I'll start with the core as a command-line simulator, and later add a whizzy UI in C#. I probably won't go the "autogenerated from spec" route - I just don't have the proper software tools (FrameMaker). Time is also an issue. My momentum on the project is starting to drop off a little. Probably best to push ahead.
First, thanks to all who telnetted in to Magic-1 and signed the guestbook. It's fun to see notes from folks truly all over the world. Incidentally, to answer Henk's question, Magic-1's schematics can be found on the microarchitecture page - and to make it easier I've also added a link on the construction page (oops - just checked. It appears that I accidentally overwrote the microarchitecture web page with a copy of old microcode. Fixing that now....). As far as a kit goes, I have no plans to do anything like that for Magic-1. In my less sane moments, though, I have toyed with the idea of writing a textbook based on a simpler CPU design expressible in either TTL or FPGA. It would include a kit/plans, and come with a full software stack (simulator, microcode editor, assembler, LCC retargeting, simple OS, etc.). The target audience would be hobbyists and undergrads looking for a hands-on experience, and I would put special emphasis on the ability for builders to customize the instruction set (and then automatically regenerate the assembler and monitor). It will probably never happen, though. Too many interesting projects to tackle, way too little time.
Back to Magic-1, I've been giving more thought to what to do next, and I think I'm going to write a new simulator before the new macro assembler. My old simulator is a bit out of date, and was designed to help me bring up the hardware. It simulated at somewhere between the gate and package level, and was extraordinarily useful during design validation and writing of the microcode. It's also pretty slow.
As I move towards the Minix port, I'm going to need good debugging tools. The front panel and clock single-step mode was adequate for debugging single instructions, but for an operating system bring-up I'd like something a bit higher up the food chain. The answer here is to develop the software in a simulated environment. I probably ought to write this one in Java or C#. I'm an old dog, and prefer C and assembly language (with side trips in Forth and Lisp). It's probably time to be assimilated, though. Besides, I'd like to make this new simulator mimic the (not yet designed) front panel, including the blinky lights. That will be much easier in Java or C#. On the other hand, I am a huge fan of simulators being (at least partly) automatically derived from an architectural reference document. This was done by HP and Transmeta for their processors, and was very useful. For me, it would be much easier to do this if I stuck with C.
Perhaps I'll compromise - the emulation engine in autogenerated C, with the UI in C#. We'll see.
A few minor additions: a simple guestbook program, and a deadman timer to exit any user processes that have been idle for more than 10 minutes or so.
I've been giving some thought to what comes next. With the completion of Adventure, there really isn't much more I feel like doing with the bios/monitor. It's pretty functional - besides handling significant terminal and file I/O, it supports demand paging (simplified). I think it's time to start the Minix port.
However, to do this, I first need to fix up my tool chain. Step one is writing a new macro assembler that outputs relocatable object files. I've decided to go with COFF. For the assembler, I think I'll go with something similar to Macro-11 (the old PDP-11 macro assembler) syntax and macro capabilities. The key issues, though, have to do with memory handling. My C compiler, LCC, is simply too big to ever run natively on Magic-1. However, my assembler should do so. Designing an assembler which can generate programs big enough to fill all available memory - while operating in that same address space - is quite a challenge. I am counting on a RAM-disk being available, so I expect to have fast intermediate file I/O. Still, to really be capable, I'd have to keep the symbol table and just about everything else in separate disk files.
My inclination at the moment is to not go that route. Instead, I'll keep the symbol table resident in memory, and use separate temporary disk files for data, text and fixup sections. This will put an artificially low limit on the size of files I can assemble, but should be sufficient for most things. The workaround for files that are too big is simply to split them up and then link them afterwards. My current assembler is built around an auto-generated yacc grammar. One possible approach is to take that grammar as a starting point and then hand-code around it. That would be my preference, but I haven't yet checked how big a yacc-generated parser would be.
Huge milestone this weekend: the original "Colossal Cave Adventure" is up and running. This has been a tough one to get going, as it required a file system and random file I/O (none of which I had). To achieve this goal, I have been using, as much as possible, Minix libc sources. On the operating system side, I've had to write a lot of code for creating, opening and closing files, seeking, reading and writing buffered files. Along the way I discovered (after many frustrating hours) that my recent pushing of Magic-1's clock to 3.14 MHz was too much. At that speed some of my block copy operators occasionally fail to properly read memory. I've reduced the clock to 2.4 MHz for now, and will later see how high I can push it back.
More on this milestone here.
Decided to make some more microcode changes after all. These, though, are likely temporary. I've been working on getting my C library support code in better shape. The first versions of things like getchar(), putc(), printf(), etc. were just hacks I wrote myself. I'm now trying to move over to the Minix libc code in preparation for my attempted port.
As mentioned before, one problem I have is that my assembler does not create relocatable object files. So, my temporary solution is to have the ".h" files include the actual code as well as the declarations. It works, but means that even the smallest program will include all of the library code.
Anyway, as I push more code through my tool chain I continue to run into problems due to my hacked-up assembler. In particular, Magic-1 has a lot of instructions which support an unsigned 8-bit displacement plus base register addressing mode. The problem is that this instruction form can only be used when the displacement fits in a single byte. If it's bigger, I need to emit a two-instruction sequence instead.
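As a purely illustrative sketch (the operand syntax and mnemonics here are invented, not Magic-1's exact assembly), the two cases look something like:

```asm
; Short form: displacement fits in one unsigned byte
        ld.8    a,12(sp)        ; load byte at sp+12

; Displacement too wide: compute the address first, then
; load with a zero displacement
        lea     b,0x1234(sp)    ; b = sp + 16-bit displacement
        ld.8    a,0(b)
```

Choosing between the two forms automatically is exactly the smarts my assembler and LCC retargeting currently lack.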
Neither my assembler nor my LCC retargeting are smart enough to automatically do this yet. I could stop right now and work on them (and probably should), but I'm anxious to push a little bit further before I begin writing the new macro assembler.
So, today I just decided to temporarily change Magic-1's instruction set to always use 16-bit displacements. Interestingly, it was very easy to do. The software build system (assembler, simulator, boot ROM image generator, etc.) is quite complex, but very useful. The official source for Magic-1's ISA is the set of microcode web pages on this web site. The build system will actually fetch the web page and then convert it into C source files for the rest of the system, as well as generate the actual microcode ROM images. So, to make this change all I had to do was edit the web page in an HTML editor and rebuild (and then burn new EPROMs, recompile the Magic-1 loader and all programs, etc.).
The build system is messy, but it works. It uses lynx, a text-mode browser, to fetch the web page and strip out all of the HTML tags. The output is then passed to a complicated perl script which parses it and generates about a half-dozen source code and include files. It's been a couple of years since I originally put this all together, and I don't think I really understand how it works anymore. Hope I don't need to change anything or try to move it to another machine.
The downside of the change to 16-bit displacements is that performance is degraded somewhat and programs are bigger. My Dhrystone scores have dropped down to 367.
It turns out I overtuned a bit - my fast strcpy() was fast in large part because it was incorrect. The fix was simple, but dropped me back to 330 or so Dhrystones. I tried various speed-up tricks, but one of the weaknesses of 1-address machines is that memory copies can be very inefficient. For standard byte copy, you need four registers: source address, target address, byte count and a temporary register to store the data being copied.
I considered this while designing Magic-1, and hand-coded a series of memcpy() routines to see how well it would perform with various register configurations. Generally, the code was poor - I just didn't have enough registers. I finally decided to add a special MEMCOPY opcode. This solved the problem because it allowed me to use internal temporary registers to get a very efficient memory move.
However, that instruction just works with counted strings (i.e. move a specific number of bytes). For C's strcpy() functionality, you are moving a null-terminated string of bytes. You don't know how long it is before you start - you just move bytes until you have moved a zero byte. Coding this in Magic-1 assembly, I just didn't have enough registers. Registers A and B can be used to load and store, but I'd need to also use either A or B as the intermediate register to hold the byte to move.
To make this long story short, I decided to add a special STRCOPY instruction. It uses an internal temporary register for the data transfer, and cheaply does the zero test. Very fast. Also bumped the score up a bit by tuning strcmp().
Magic-1's real Dhrystone is now 384.
While I was messing with the microcode, I also eliminated a couple of redundant instructions - sh0add a,b,a and sh0add b,b,a (which are equivalent to sh0add a,a,b and sh0add b,a,b respectively). I had put them in earlier due to limitations in my assembler. Since then, I've had to add some awk and sed post-processing on the assembly output, and was able to use that to transform the "b,b,a" version to "b,a,b" and "a,b,a" to "a,a,b".
I don't expect to make many more changes to the instruction set. I have four nops, which gives me room for three new instructions. My plan has always been to add some special purpose Forth instructions, so I'll keep those free for that.
Further improvements are possible, but I think I've pretty much picked all of the low-hanging performance fruit for Dhrystone. In addition to the hand-optimized mul/div code, I created an extremely fast strlen() function, and greatly improved strcpy() performance. Also, I bumped up Magic-1's clock to 3.14 MHz. The result:
Magic-1 is now sporting a Dhrystone score of 418 with a 3.14 MHz clock, making it a 0.27 MIPS machine.
There's still a bit more than I can do, but probably not a lot without more effort than I'm interested in expending right now. Lcc is a nice C compiler, but it is not an optimizing compiler. It is smart enough to find common subexpressions, but unfortunately for Magic-1, that can actually degrade performance (given that I don't have lots of registers to store them in, so they get flushed to the stack). I looked over the generated code, and it isn't that bad. There are a few places where I'm getting useless sign extends, and a lot of places where redundant loads could be eliminated. Overall, though, unless I write a peephole optimizer to massage the output, I don't expect to see dramatic improvements. I've got way too much other software to write before I even think about that.
Speaking of MIPS (Millions of Instructions Per Second, a.k.a. Meaningless Indication of Processor Speed), I recall that before the SPEC benchmarks arrived, computer manufacturers boasted of their machines' performance in terms of MIPS. At first, I believe they used Dhrystone MIPS as I did above (i.e. normalized against a Vax 11/780), but soon devolved into "marketing MIPS", or simply the theoretical peak instruction issue rate. By that latter measure, I could advertise Magic-1 as a 1.6 MIPS machine given that M-1's fastest instruction is 2 clock cycles long (register copies). WooHoo! - I can feel Andy Grove trembling in fear from here...
Benchmark games are pretty much pointless, but fun nevertheless. I've spent the last couple of sessions cleaning up some of the runtime support code - in particular rewriting the 16 and 32-bit multiply and divide routines in assembly (they were in C). I haven't tried to squeeze the last cycle out of them, but they are all dramatically faster. And for the dhrystone benchmark, it shows.
Magic-1 is now sporting a Dhrystone score of 277 with a 3 MHz clock.
The Vax score of 1560 is what was used to normalize to 1 MIPS, so Magic-1 is now up to a 0.18 MIPS computer. Note that Magic-1 has a much higher dhrystone-per-megahertz score than the other early micros. This is almost certainly due to the modern, high-speed SRAM that I'm using. I haven't done any significant compiler tuning yet and might be able to coax the clock up 10%, so I expect there's still a bit of headroom for Magic-1's dhrystone score.
Fixed - just needed to add the OR gate to make sure the 2nd half of the microcode is selected when a fault occurs during an opcode fetch. I'm having fun with it all today - it's seeming like a real operating system. Programs are now launched by simply creating a page table for the new process with all pages marked as not present. On the first access to a "not present" page, a page fault is generated and the fault handler goes out to disk and reads that (and only that) page into memory. The page table entry for that page is then marked as present, and execution resumes. Only pages that are touched are read from disk now, and then only as needed. You can see this all work by turning on the kernel "verbose" option, and then executing a program (via the "X" command).
Okay, found the problem - it was in the microinstruction sequencer. Each Magic-1 instruction is implemented by a sequence of microcode instructions. Each microinstruction has a NEXT field which says which microinstruction should be executed next. However, I have two special values for the NEXT field: zero, and minus 1. If the next field is zero, it means to use the output of the priority encoder (which selects faults, interrupts or regular instruction fetch as the next micro-op). If the next field is -1, it means to use the contents of the instruction register (IR) as a direct index into the microcode.
The design bug dealt with the -1 case: using the IR as a direct index into the microcode. When the machine executes the "fetch" microinstruction to fetch the opcode, its NEXT field is -1 (i.e. grab the opcode and use it as the microinstruction address). In the failing "not present page fault" case, the fault is correctly detected and the microinstruction sequencer correctly decides to use the output of the priority encoder as the next microinstruction (value 0xe - the NP fault case). However, I had neglected to think about the 2nd half of the problem. My microcode is broken into two halves - 256 microinstructions used as the first microinstruction for each of the 256 possible opcodes, and then 256 continuation microcode slots. This means I've got 9 bits worth of microcode addresses. The lower 8 bits are determined by the NEXT logic, but the most significant bit is supplied by the logic that determines whether we are in the (NEXT==-1) case.
And that was the bug.
The fault microcode lives in the 2nd half of the microcode address space. However, the (NEXT==-1) logic activated and caused the machine to select the 1st half of the microcode. So, instead of vectoring to microinstruction 0x10E on the NP fault, it branched off to 0x00E.
I believe there is a trivial fix - just adding a single OR gate, and I think I've got a free one. I'll want to think about it a bit before I start rewiring. Perhaps tomorrow.
10/15/2004 (early evening)
Update - it is a hardware problem of some sort. Haven't figured out exactly what yet, but just did a test program to confirm. The test sets up two code pages with a 3-byte instruction that spans the two pages. The first byte of the instruction is in the last byte of the first page, and the rest of the instruction is in the first two bytes of the next page. (The full sequence is: "ld.16 a,1234; ret"). If both pages are marked as present, I can call the sequence and return normally. If the first page is marked present, and the 2nd page is marked not present, then it still works. The machine fetches the first byte of the instruction from the present page, and then takes a not-present page fault when attempting to fetch the rest of the instruction on the following page. The fault handler is invoked, page permissions are added, retry and succeed. That works as it should.
However, if I mark the page with the 1st byte as not present and then call the sequence, the machine flies off into the weeds. This is a clear indication that I have a problem with handing a fault on the first clock tick following an instruction boundary. I can trigger a not-present page fault with an instruction fetch, but not if the first byte of the instruction is on a not-present page.
I'll sit down with the schematics later tonight (or tomorrow).
I think I've got a hardware problem. I've been trying to get "not present" page faults working. Everything seems to work fine (i.e. fault properly) if I access the "not present" page with a load or a store, but I seem to fly off into the weeds when attempting an instruction fetch on a not-present page. I probably shouldn't be surprised. There's a lot of bizarre logic related to correctly dealing with instruction boundaries in the trap/interrupt circuits. I'm guessing that a fetch (which happens as the first thing in a new instruction cycle) may be causing some problems here. Too tired to think about it now. Will review the schematics and think about it tomorrow or over the weekend.
Cleaning up some C library routines, and added "time.h" from the Minix sources, along with most of the time-related functions. I haven't updated the programs yet, so the benchmark times are likely to be a bit wacky for a while.
I still have a bit of work to do on time. I don't think I'm handling timezones and daylight savings time correctly yet. Also, computing the time from the universal "seconds since 00:00:00 UTC 1/1/1970" is stunningly slow. The reason for this is that it does a lot of 32-bit computation, and my runtime support for 32-bit multiply and divide is incredibly bad. The current versions are written in C, and my C compiler doesn't do 32-bits very well. Not only that, but Magic-1 doesn't have good support for doing 32-bit right shifts - particularly variable right shifts, so my current compiler code just emits subroutine calls for 32-bit shifts. Multiply and divide are shift-intensive, so I'd guess that's the reason behind the very, very bad performance.
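To see why the 32-bit runtime helpers are so shift-heavy, here's a minimal sketch of a classic shift-and-subtract unsigned divide (my own illustration, not the actual runtime code - the name `udiv32` is invented). Every one of the 32 iterations does multiple 32-bit shifts, and when each variable 32-bit shift turns into a subroutine call, something like converting seconds-since-1970 into a date gets slow in a hurry:

```c
#include <stdint.h>

/* Shift-and-subtract 32-bit unsigned divide: a sketch of the kind of
 * helper the compiler's runtime calls.  Each iteration shifts the
 * remainder and probes one bit of the numerator -- on a machine where
 * 32-bit shifts are themselves subroutine calls, this crawls. */
uint32_t udiv32(uint32_t num, uint32_t den, uint32_t *rem)
{
    uint32_t quot = 0, r = 0;
    int i;
    for (i = 31; i >= 0; i--) {
        r = (r << 1) | ((num >> i) & 1);  /* two 32-bit shifts per bit */
        if (r >= den) {
            r -= den;
            quot |= (uint32_t)1 << i;     /* and another one here      */
        }
    }
    if (rem)
        *rem = r;
    return quot;
}
```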
Also starting to work some on adding page not present page fault handling. Once this works, I can launch a process whose image is stored on disk in a very elegant manner. More on this later (when it works..).
Brutal software debugging over the last couple of days trying to get the split code/data model working. I was getting very strange behavior, and was almost convinced that I had a bad wire somewhere in the page table board. Fortunately, the problem was much simpler. My crappy assembler wasn't properly generating the Intel hex files when I requested code and data spaces to be disjoint. Structurally they were correct - but the data in the segments was slightly garbled.
Ugh. I'd done a cursory check of the assembler output at the very beginning and thought it correct - so I've spent the last couple of sessions poring over monitor code that looked (and in fact was) correct.
Oh well, good progress. Magic-1 can now load, store and execute programs with disjoint code and data address spaces. This raises addressability to 128K bytes per process. A good thing about the last couple of days is that I did quite a bit of cleanup on the monitor while looking for the bug. The kernel verbose mode (toggle 'V' from the command line) is now a bit more clear about what is happening on page faults and forking.
Next, I think I'll spend some time cleaning up my C library routines. I've only implemented a handful, and some are not standard. I think I'll try importing some of the Minix source code for this. One problem I continue to face is that my assembler does not produce linkable object modules. So, my ".h" files at the moment contain the actual code as well as the declarations for the standard routines. Makes even small programs big.
Laptop is back and repaired - pretty impressive service, actually. Shipper picked it up on Tuesday, and it was returned on Friday.
I've spent much of the weekend rewriting chunks of the monitor/bios, particularly code dealing with process and address space management. To this point, except for occasional test programs, all of the code I've been running on Magic-1 has used a shared code/data address space model. As I prepare to start porting and developing more sophisticated software, though, I want to move to a split code/data model. Getting this working is turning out to be a bit trickier than I'd hoped. Very difficult to debug.
Things are almost working for the split model as well as they were for the unified model. However, I do have one nasty bug I haven't yet tracked down. I'm pretty sure I've got a wild pointer somewhere. I've run into a situation in which adding an innocuous statement will break the world. I'm assuming that some memory location is getting trashed, and adding that statement shifted things around enough to make it deadly. Other than that, I've temporarily introduced a significant slowdown in I/O. In the previous version, I was using some special byte move instructions to copy data between supervisor data pages and the data pages of another process. To better support the operations needed for running split code/data processes, I wrote a generic (and horribly slow) function that performs byte copies of arbitrary length between any locations in physical memory. To flush out any problem, I'm using it everywhere - which really slows things down. After I'm confident it is solid, I'll optimize and use my special instructions where possible.
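In miniature, that generic physical copy might look something like the following. This is a sketch under my own assumptions - the names (`phys_copy`, `map_window`), the 2KB page size, and the mapping mechanism (modeled here as a flat array instead of real page-table writes) are all illustrative, not the actual monitor code:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 2048                  /* assumed page size */

static uint8_t physmem[16 * PAGE_SIZE]; /* stand-in for physical RAM */

/* Stand-in for remapping a reserved supervisor window onto a physical
 * page.  On the real machine this would be a page-table update. */
static uint8_t *map_window(uint32_t phys_page)
{
    return &physmem[phys_page * PAGE_SIZE];
}

/* Generic (and deliberately slow) physical-to-physical byte copy:
 * recompute the mapping on every single byte.  Correct everywhere,
 * fast nowhere -- which is the point while flushing out bugs. */
void phys_copy(uint32_t dst, uint32_t src, size_t len)
{
    while (len--) {
        uint8_t *s = map_window(src / PAGE_SIZE) + src % PAGE_SIZE;
        uint8_t *d = map_window(dst / PAGE_SIZE) + dst % PAGE_SIZE;
        *d = *s;
        src++;
        dst++;
    }
}
```

Once it's trusted, the obvious optimization is to remap only at page crossings and use the special block-move instructions for the inner runs.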
My next big goal is to tackle the old "Original Adventure" program (which will require a split code/data model - it's too big for unified). However, it also requires random file access and a file system of some sort - all of which I'll have to write before I can get it going.
As far as the hardware goes, things are seeming really, really solid. While my laptop was in the shop last week, I didn' t have the ability to rev the monitor/bios, so I just let it run. It stayed up for 5 days, before hanging in a user-mode program. I'd earlier added the ability to Ctl-C out of a hung user-mode program, but there was a bug in the way I did it, so I had to reboot the machine to unhang it. As far as I can tell, I haven't had any hardware-related crashes since I added the muffin fan to the card cage a couple of weeks ago.
Brief update - the laptop computer which has served as the build machine (and master for this website) is ill. The battery charger is faulty, and as I'm typing this I've only got a few minutes of power left. Backups are complete, and I'll see about getting it to the shop for repair tomorrow. So, probably won't have any updates for a while.
Good progress this weekend. Completed the clock circuitry rework, and have pretty much convinced myself that Magic-1's normal operating speed is going to be 3 Mhz. It runs for a while at 3.5 Mhz, but not perfectly. Good enough.
Lots of work on porting programs. Besides Eliza, I've got "Hunt the Wumpus" now. Tried a few others, but my current lack of floating point support and some other compiler issues kept them from working. I really need to write a new macro assembler, as well as do some more cleanup of 32-bit arithmetic. As far as floating point goes, the Minix code I'm playing with appears to have some emulation code. Perhaps I'll use that.
Better sign off - power light is flashing. But first, here's a picture of Magic-1 in its new home: the corner of an old desk. Next to it is my logic analyzer (I was verifying timing of the IOCLK and clock speed slowdown circuit - you can see the clock slow by half on the trace on the screen when the SLOW signal is asserted).
For a while now, I've been aware of a speed problem with Magic-1. Whereas before it was running solidly at 3.68 Mhz, for the last month or two I haven't been able to get it past 2.6 Mhz. I've been focusing on the software lately, so hadn't gotten around to looking for the problem.
Anyway, for the last two evenings I've been trying to figure out what's going on. Tonight I discovered part of the problem: my clock fast/slow mechanism was faulty. In particular, the way I was using a JK flip-flop as a clock frequency divider was bogus (left my slow clock in the wrong phase), and to compound matters I was not synchronizing the SLOW signal which selected between the fast and slow clocks. That clearly explains my speed problem: my bad circuit would - about half the time - produce a puny pulse when switching between fast and slow clocks. For speeds below 2.6 Mhz, the puny pulse would be long enough to keep the machine going. Any faster (and thus shorter puny pulse), and Magic-1 would fly off into the weeds.
I reworked the circuit, and things are looking much better now. Magic-1 even survived a few seconds of running at 4.0 Mhz. Things still aren't perfect, though. I am getting garbled serial I/O at the higher speeds (above 3 Mhz). I suspect my slow clock isn't slow enough (and serial I/O is also complicated by the shortened IOCLK pulse that strobes reads after the address lines have settled). I'm guessing the read pulse has gotten a bit too short.
I was able to run Dhrystone at 3.68 Mhz. The output was a bit garbled, but the elapsed time and score came through fine: 224 Dhrystones (compared to 259 for a 4.77 Mhz 8088).
I'm going to need to take some time to rework the clock circuitry. I've done way too many little modifications to it (on the front panel logic board). It's in bad need of cleanup. To address the garbled serial I/O, I think I'll make the slow clock slower. Rather than being 1/2 the speed of the main clock, I'll divide by 2 again and make it 1/4th the speed. The way things are supposed to work is that in software I slow the clock down before accessing slow devices, and then set it back to normal speed afterwards. A cleaner solution would be to use a bit in the page table to do this automatically, but I wasn't able to figure out a sufficiently simple circuit.
I'm back in business on the EPROM front. Picked up a Needham EMP-10 programmer. It uses DOS software, which on my Windows XP laptop is a little awkward, but it does the job. Of course, now that I can program EPROMs again, I no longer need to.
My boot loader is complete. Yeah!
I stripped down a copy of my bios/monitor and converted it into a boot image loader. Instead of loading and storing process images on the hard drive, it loads/stores bios/monitor images. I can now update the bios/monitor without burning a new EPROM.
It took me a bit longer to get this working than I had expected. Probably should have expected it - bootstrapping can be somewhat tricky. Magic-1 starts off executing from EPROM in device space. The boot loader needs to transition the machine into paged-memory mode in order to get access to the full physical address space. It needs to do that in order to load new boot images and copy them to the appropriate place in physical memory. Once that's done, we have to turn paging off and remap the supervisor mode page table to refer to the new boot image in physical memory. Finally, to boot the new image we need to mock up a return from interrupt (reti) that will atomically load up the new boot process state and turn paging back on. Whew! Debugging was a challenge, but it seems to work just fine now.
Since I can't burn any new eproms, I spent some time porting an old classic. It was originally done in Fortran, but the version I remember was done in BASIC. It's been way too long for me to remember BASIC well enough, but I did find a Pascal version that I converted to C without too much difficulty.
Telnet in and try it out. Slot 6 in the process table.
Good news: fork() implemented using "copy on write" is working now. It was a trivial change to deassert _WR when FAULT_PENDING is asserted. Bad news: my eBay-special EPROM burner died, so I won't be able to make much progress for a while. Need to start shopping around for an inexpensive replacement - just something that can deal with 27c256s.
Lots of really interesting (well, at least to me) stuff going on. Dave C.'s 3-d tic-tac-toe game has started me on a quest to get Unix-like "fork()" capability going.
The problem with the way I'm doing processes now is that I'm just preserving an image of the process in memory. That's fine the first time you run a process, but when the process exits its memory is left in whatever state it was at exit. So, the next time you run the program you aren't starting with a clean slate. This doesn't always cause problems, but if your program relies on initialized statics, it will break. The 3-d tic-tac-toe did rely on initialized statics, so my quick hack was to change the program to reinitialize each time.
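The initialized-statics problem in miniature (a toy illustration of my own, not from the game):

```c
/* This counter lives in the initialized data segment.  If a process
 * image is reused in place rather than reloaded, `calls` starts at
 * whatever value the previous run left behind instead of 0 -- the
 * second run of the "same" program sees stale state. */
static int calls = 0;

int count_call(void)
{
    return ++calls;
}
```

A fresh image sees `count_call()` return 1 first; a reused, un-reloaded image picks up wherever the last run stopped.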
My first thought was to keep the original program image around as a master, and then copy everything into a new process prior to running. It didn't take long for me to realize that this is pretty much what Unix's "fork()" does. However, at least in the implementations I'm familiar with, forking is done a bit more elegantly. Instead of doing a complete copy of the original process into the forked child, games are played with virtual memory. The child process starts off by simply referencing the parent's memory image via its page table. So long as it only does memory reads, it can safely share the parent's pages. However, if the child needs to write to memory, it needs its own copy of the data (so as not to affect the parent). Similarly, if the parent writes to memory, it needs to do this in its own private copy as well.
The mechanism used to accomplish this is known as "copy on write." It's simple, really. The shared pages are marked in the page tables of the parent and child as read-only. If either tries to do a write, a page fault occurs. The page fault handler will then recognize what's going on, allocate a new page, copy over an image of the original page, and then give the faulting process exclusive read/write permissions on the page.
Very elegant - and very do-able. Magic-1 has the ability to remap pages, as well as grant and revoke write permission on a per-process basis. So, even though I'm not yet close to being able to time-share among processes, I thought I'd implement process creation via fork() and copy-on-write.
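The fault-handler half of copy-on-write can be sketched like this. To be clear, this is my own hedged illustration: the structure layout, flag bits, and names (`pte_t`, `cow_fault`) are invented, and page frames are plain malloc'd buffers here rather than real physical pages:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE    2048   /* assumed page size   */
#define PTE_PRESENT  0x1
#define PTE_WRITABLE 0x2

/* Toy page-table entry: a frame pointer, permission flags, and a
 * shared count of how many processes reference the frame. */
typedef struct {
    uint8_t *frame;
    int      flags;
    int     *refcount;
} pte_t;

/* Called on a write fault to a read-only shared page. */
void cow_fault(pte_t *pte)
{
    if (*pte->refcount > 1) {
        /* Frame is shared: give the faulting process a private copy. */
        uint8_t *copy = malloc(PAGE_SIZE);
        memcpy(copy, pte->frame, PAGE_SIZE);
        (*pte->refcount)--;
        pte->frame = copy;
        pte->refcount = malloc(sizeof(int));
        *pte->refcount = 1;
    }
    /* Sole owner now: restore write permission; the faulting
     * instruction is then rolled back and retried. */
    pte->flags |= PTE_WRITABLE;
}
```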
The first bug I ran into was a microcode error. One of the issues with faulting instructions is that you have to be able to roll the machine state back to the way it was before the faulting instruction started executing. I put a lot of care into writing the microcode to ensure this. In general, it means not updating register values until after you've completed all memory references (or other activities which could fault). This can be tricky, especially given I don't have much in the way of temporary registers in Magic-1. In fact, the internal register TPC (temporary program counter) exists specifically for the purpose of rolling back during a fault.
Anyway, I screwed up three instructions: memcopy, tosys and fromsys - the block copy operators. All of them updated either register A or register B prior to doing a potentially faulting memory operation. I remember now that I rewrote these not too long ago and no doubt introduced the problem then. It was fairly easy to fix them, but with a caveat: even when faulting they will still change condition code flags. The reason is that I want those instructions to check for zero before attempting to copy bytes. That check will trash the flags. I could move the check after the memory references, but then I'd end up having to always precede those instructions with extra code to check for a zero-byte move. It's a bit gross, but at least for the moment I'd rather have that zero pre-test.
The other screw-up is slightly nastier, as it will require a wiring change. As I mentioned before, when a fault happens you want to roll back the machine state to the point it was at before the faulting instruction began. To accomplish this, I have a lot of logic sprinkled around that suppresses register clock/latch signals whenever a faulting condition is present. However, I missed one - and once again it was caused by a late design change. To eliminate some duplicate logic, I created a new signal, _WR - memory write. This signal should be suppressed when the FAULT_PENDING signal is asserted, but it is not. As a result, when a process attempts to write to a write-protected page a fault is correctly generated, but the 1st byte of the write succeeds anyway.
Other than that, though, fork() and copy-on-write works just fine.
This should be an easy fix - no more than a couple of wires assuming I can find an appropriate empty gate on the control card. Maybe tomorrow.
Of course, the main purpose of computers is to play games. Now Magic-1 is a real computer: it can play tic-tac-toe. Thanks to Dave C. for his old 3-d tic-tac-toe program, which came up very easily. I also added a 1-d version. Both can be run (and played) via the 'X' command.
Put in a few hours of work on the project today, most of which dealt with speeding up reads and writes to the hard drive. The interface I'm using isn't especially fast, which is fine considering Magic-1 isn't all that fast either. However, I was a bit disappointed with how slow disk I/O was. It was taking a full 15 seconds to read or write 64K bytes. The slowdown isn't the drive itself. The problem was a huge amount of overhead pushing control and data bytes through my 82c55-based IDE interface. My first cut at IDE drivers focused (correctly) on getting things right, and was very inefficient.
The way things work is that the IDE interface is built around an 82c55 - a programmable I/O buffer. To write data to the hard drive's control registers, you have to tell the 82c55 which direction the bytes are going, load an 8255 register with the data, and then (with additional writes) cause the IDE read/write control lines to be strobed. It's similar for reading.
So, to read from the drive, you load up the IDE control registers with the cylinder, head and sector of the target, issue an IDE read command, and then generate 256 16-bit reads from the IDE data register. In my code, each of these reads/writes would require three or four procedure calls.
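To make the per-word overhead concrete, here's a hedged sketch of one 16-bit read of the IDE data register through the 82c55. The register numbers, the strobe bit, and the helper names are all invented for illustration, and the 82c55 bus accesses are mocked so the sketch stands alone - the real interface's wiring differs:

```c
#include <stdint.h>

enum { PPI_A, PPI_B, PPI_C };         /* 82c55 ports (illustrative)    */
#define IDE_DIOR 0x20                 /* read strobe bit (assumed)     */
#define IDE_DATA 0x00                 /* data register select (assumed)*/

static uint16_t drive_word = 0xBEEF;  /* word the mock drive supplies  */
static uint8_t  latch_a, latch_b;     /* mock 82c55 input latches      */

/* Mock bus write: pulsing the read strobe on port C makes the "drive"
 * latch the current data word into ports A (low) and B (high). */
static void ppi_write(int reg, uint8_t val)
{
    if (reg == PPI_C && (val & IDE_DIOR)) {
        latch_a = drive_word & 0xFF;
        latch_b = drive_word >> 8;
    }
}

static uint8_t ppi_read(int reg)
{
    return reg == PPI_A ? latch_a : latch_b;
}

/* One 16-bit read of the IDE data register: five PPI accesses per
 * word.  Multiply by 256 words per sector, then wrap each access in
 * three or four procedure calls, and 15-second 64K reads follow. */
uint16_t ide_read_data(void)
{
    ppi_write(PPI_C, IDE_DATA);             /* select the register  */
    ppi_write(PPI_C, IDE_DATA | IDE_DIOR);  /* assert read strobe   */
    uint8_t lo = ppi_read(PPI_A);
    uint8_t hi = ppi_read(PPI_B);
    ppi_write(PPI_C, IDE_DATA);             /* deassert strobe      */
    return (uint16_t)lo | ((uint16_t)hi << 8);
}
```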
Anyway, long story short: I left the slow and explicit code in place for most operations, but wrote some streamlined routines to read and write the sector data. The results: a 64Kbyte read/write now takes five or six seconds. To see the reads in action, telnet in and do a "R" command (which reads in the saved 64Kbyte process images).
One other note about the IDE code: I'm using cylinder/head/sector (CHS) addressing instead of the more modern, and easier, LBA mode. The reason is that I intend to use an ancient HP 1.2" KittyHawk hard drive for Magic-1, and it only supports the old CHS addressing.
On the speed front, it appears I have a problem with my slow/fast clock scheme. A while back I added a circuit to allow Magic-1's main clock to be slowed by half programmatically (by writing a bit into a flip-flop at physical device space address 0xFF90). Anyway, it appears that the clock is stuck at slow, which in this case is 1.3 Mhz. Didn't have time to do much debugging, but it's kinda nice to know that all of the benchmark runs are based on a very slow clock. When I figure out the problem, I should be able to at least double Magic-1's performance.
Lots of progress over the last couple of days. The main new feature is that I can now save loaded processes to the hard drive. I don't have a real file system - I'm just dumping 64Kbyte process images to the raw drive. So far, it's just user-mode processes. There's a bit of extra work needed before I can load a bios/monitor image, but once that works I'll convert the current monitor into a boot loader and I can stop having to burn EPROMs.
On the user-mode program front, I've added the old Byte magazine Sieve benchmark. It computes primes, and is currently running 10 iterations in 6 seconds. Don't know yet how that compares. Also, I've been poking around for simple C programs to port over to Magic-1. One that went easily was the old Unix banner program. What helped a lot was that I finally got around to getting variadic functions working. That enabled me to cobble together a basic printf().
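The variadic-function support is what unlocks printf()-style code. Here's a minimal sketch of the idea (my own illustration, not the actual Magic-1 library code - `mini_printf` and its buffer-based interface are invented): a cut-down formatter handling only %c, %s and %d via `<stdarg.h>`:

```c
#include <stdarg.h>

static void put_str(char **out, const char *s)
{
    while (*s)
        *(*out)++ = *s++;
}

/* Tiny printf-alike: formats into a caller-supplied buffer.  The whole
 * point is va_start/va_arg/va_end -- the compiler feature that had to
 * work before any printf() could be cobbled together. */
void mini_printf(char *buf, const char *fmt, ...)
{
    va_list ap;
    char *out = buf;
    va_start(ap, fmt);
    for (; *fmt; fmt++) {
        if (*fmt != '%') { *out++ = *fmt; continue; }
        fmt++;
        if (!*fmt)                       /* stray trailing '%' */
            break;
        switch (*fmt) {
        case 'c': *out++ = (char)va_arg(ap, int); break;
        case 's': put_str(&out, va_arg(ap, char *)); break;
        case 'd': {
            int v = va_arg(ap, int);
            char tmp[12], *t = tmp;
            unsigned u = v < 0 ? (*out++ = '-', (unsigned)-v) : (unsigned)v;
            do { *t++ = '0' + u % 10; u /= 10; } while (u);
            while (t > tmp)              /* digits come out reversed */
                *out++ = *--t;
            break;
        }
        default: *out++ = *fmt; break;   /* pass unknown specifiers  */
        }
    }
    va_end(ap);
    *out = '\0';
}
```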
The user-mode programs can be accessed via the 'X' command after telnetting in.
I've run into some stability problems. Magic-1 has been crashing occasionally - generally after it's been running for more than five or six hours. It's probably wishful thinking, but my chief suspect at this point is temperature sensitivity. The machine still consists of an open card cage, and does get warm after a while. I have a room fan that I try to remember to keep pointed at it, and so far all of the crashes have happened when the fan was either off or pointed away. I plan on rigging a muffin fan and will see if that puts a stop to the crashes.
First "MIPS" rating is in: I just got the old Dhrystone (1.1) benchmark running, and it reported Magic-1 at 178 dhrystones/second, which computes to 0.11 MIPS (normalized against a Vax 11/780's score of 1560). One oddity is that when I ran it with both fast and slow clocks, I got the same score. This doesn't really surprise me - my bios code is constantly flipping between fast and slow clock speeds, and I bet somewhere I'm not setting it back properly. So, I'm not sure whether that score corresponds to a 1.2 Mhz clock or a 2.4 Mhz clock (probably the latter). For what it's worth, here's Magic-1's score in context:
Note that it's totally unfair to compare against these old machines for several reasons. First, although my C compiler isn't optimizing, it likely does better than the ones used to compile dhrystone for the early machines. Second, my SRAM is much faster (70ns) than what would have been commonly available on the Apple, CMP and PC/XT in the early 80's. As a result, M-1's instructions don't have any need for wait states. And finally, I designed M-1's ISA with a C compiler in mind.
On the other hand, I haven't even begun serious compiler tuning. I'll bet I can at least double Magic-1's score by tweaking the compiler and pushing up the clock a bit.
I've left the dhrystone image in Magic1's process table, so it can be run by telnetting to Magic-1, selecting the "x" (eXecute) command and then process table slot 0. It takes about half a minute to complete 5000 iterations.
Did some cleanup today, and added a few new system calls. Neither of the new system calls worked on first attempt, so I've got a bit of debugging to do. I suspect a problem with interrupt masking. Some of the bios code is pretty sloppy when it comes to turning interrupts on and off. I got away with it when everything was a single supervisor-mode program, but now that I'm switching between user and supervisor mode I think I need to be a bit more precise.
I also password protected the reboot command and disabled the automatic reboot feature so I can get a better feel for how long the machine stays running. As far as I can tell, Magic-1 is pretty stable.
Sometime soon I'll have to do a major restructuring of the bios code. I've mostly been "designing at the keyboard", and it is starting to show. I don't regret doing this - at least for myself I need to just play around with things before I really understand the right (or better) code architecture.
It still amazes me this thing actually works. I had lunch with a friend of mine today (hi Ken!) and told him about the new telnet capability. He had a Sidekick mobile phone with internet access, and telnetted to Magic-1 right in the restaurant. Amazing.
Big milestone today - loaded and launched a user-mode process which made use of supervisor-mode system calls. This is tricky stuff, and I'm developing much more of an appreciation for the folks who develop operating systems. In order to get this to work, I had to do a lot of handling of the user vs. supervisor memory address spaces. My first user-mode program is a standard "Hello World". However, because it is user mode, it cannot directly access the serial ports. Instead, it uses a system call trap (SYSCALL 0x02 - aka "kprint()"). This trap transitions flow to the supervisor mode syscall trap handler.
The syscall trap handler extracts the system call number (2 in this case) and vectors to the kprint routine to display the string passed as a parameter out to the serial port. However, the "Hello World" string is actually data that lives in the address space of the user process, and is not directly accessible in supervisor mode. To get the string, I have to use a special instruction, "TOSYS", which copies data bytes from user space into supervisor space. This copying has to happen for all system call parameters and return values.
Another complication is how to launch the user-mode program in the first place, and how to return gracefully to supervisor mode (and the monitor) when it is complete. The basic mechanism for doing both is to use the "return from interrupt" instruction (RETI). When a fault or interrupt happens, the microcode dumps the machine state onto the supervisor (kernel) stack. RETI does the reverse - reloading the state from the stack and resuming execution. To invoke a user-mode process from supervisor mode, I create a dummy machine state which includes the proper MSW bits for user mode execution as well as the starting PC value. Then, I save away the current value of the kernel stack pointer, point SP to the dummy machine state and RETI. I use a similar mechanism to return to the monitor when the user-mode process completes.
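The dummy-machine-state trick can be sketched as building, by hand, exactly the frame an interrupt would have pushed. Everything here is illustrative - the field names, frame layout, and MSW bit assignments are my inventions, not Magic-1's real encoding:

```c
#include <stdint.h>

#define MSW_USER_MODE 0x8000   /* user/supervisor bit (assumed) */
#define MSW_PAGING_ON 0x4000   /* paging-enable bit (assumed)   */

/* What the microcode spills on a trap, and what RETI reloads. */
struct int_frame {
    uint16_t msw;              /* machine status word           */
    uint16_t pc;               /* where RETI will resume        */
    uint16_t a, b;             /* general registers             */
    uint16_t sp;               /* user stack pointer            */
};

/* Fill in a frame that, when "returned through" with RETI, starts a
 * fresh user-mode process at entry_pc with paging enabled. */
void build_launch_frame(struct int_frame *f, uint16_t entry_pc,
                        uint16_t user_sp)
{
    f->msw = MSW_USER_MODE | MSW_PAGING_ON;  /* user mode, paging on */
    f->pc  = entry_pc;
    f->a   = 0;                              /* clean register state */
    f->b   = 0;
    f->sp  = user_sp;
}
```

The launch itself is then: save the kernel stack pointer, point SP at this frame, and execute RETI - the mode switch, paging enable, and jump to the entry point all happen atomically.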
Amazingly, it works. I can now load a user-mode program over the serial port (in Intel hex format), execute it and return to the monitor program. Very cool.
In other news, the telnet feature also seems to work well - I've heard from a dozen or so folks from around the world who have telnetted in to try it out. Thanks to all who took Magic-1 for a test drive.
I'd forgotten that I still had some microcode rework to do. When I was building M-1, I discovered that I had neglected to put one signal out on the backplane - the IN_TRAP signal. After giving it some thought, I realized that I could eliminate the signal altogether - but only if I rewrote the microcode dealing with traps/interrupts and the RETI (return from interrupt) instruction. The rewrite was a bit tricky. The issue was that when I'm handling an interrupt, I need to spill the register state onto the system (kernel) stack. Which stack pointer (sp vs. system (kernel) stack pointer) is active is controlled by the mode bit in the MSW. To spill, I need that bit to be 0, but I also need to save the original contents of the MSW.
Anyway, juggling the status word and stack pointers with only one temporary register was a bit involved, but it's done now (after about 6 attempts at getting it correct).
I still have a bit of work left before I can launch user-mode processes, but I decided that things are stable enough that I can go ahead and put telnet access to the machine outside of my firewall.
Making rapid progress. I now have most of the infrastructure in place to load executables, create new user-mode processes and launch them. As a first step I've created a process table which can store up to 4 named processes. Each process will have 128K bytes of memory, and will run in user mode. That also means I need to add a system call front-end to the basic I/O routines. I expect I'll get a user-mode "Hello World!" running within a couple of days. A potential hitch is that I haven't done a lot of user-mode testing yet, so there may yet be a hardware problem lurking.
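A process table like that might be sketched as below - a fixed array of named slots, each recording where its split 64K code and 64K data spaces live. This is my own hedged illustration (the struct fields and `proc_alloc` are invented names, not the monitor's actual code):

```c
#include <string.h>

#define NPROC    4          /* 4 named process slots */
#define NAME_LEN 16

struct proc {
    char name[NAME_LEN];
    int  in_use;
    unsigned code_base;     /* physical base of 64K code space (assumed) */
    unsigned data_base;     /* physical base of 64K data space (assumed) */
};

static struct proc ptable[NPROC];

/* Claim a free slot for a newly loaded executable, or return -1
 * if all 4 slots are taken. */
int proc_alloc(const char *name)
{
    int i;
    for (i = 0; i < NPROC; i++) {
        if (!ptable[i].in_use) {
            ptable[i].in_use = 1;
            strncpy(ptable[i].name, name, NAME_LEN - 1);
            ptable[i].name[NAME_LEN - 1] = '\0';
            return i;
        }
    }
    return -1;              /* table full */
}
```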
After that works, I next plan to add support to load and store those saved process images onto the hard drive. I'm not going to attempt any real file system - just set aside a few sectors at a fixed location for an index, and then just access the drive raw.
Then, I want to change the current EPROM-based bios into something I load from disk - I really want to stop having to erase and burn EPROMs. Not only is it a pain, but I am getting concerned about physical wear on the card cage connectors. The boot EPROM is mounted in a ZIF socket, but I have to pull the board to get to it. I'm sure I've done this a few hundred times by now. Not good.
In other news, my web hosting service seems to have stopped responding to trouble tickets. I may have to switch to another company, so the site may be down for a few days.
Whew - finally! I've been struggling with the serial ports for weeks. For the last week or two, I've had a problem with dropped characters. I assumed my driver code was bogus, and tried many approaches without success. The problem would show up when simultaneously doing serial input and serial output. There would never be any dropped output chars, but I'd lose 1 in 50 or so incoming chars. Further, changing the baud rate seemed to have little (or no) effect on the problem.
While driving to the grocery store today, the problem (and solution) finally dawned on me. It was a hardware design error, related to my earlier design error. To recap, I had designed the interface to the UARTs very similarly to the way I interfaced memory devices (because the UART has control registers to read and write). In a nutshell, I had been sloppy with my device read signals on the theory that there isn't a problem reading a device more often than necessary, so long as you handle writes precisely. For the UART, however, reading a control register changes state - so reads must be handled similarly to writes. To fix this, a few weeks ago I added a new clock signal (IOCLK) which I combined with the read signal to ensure that I only tried to read the UART registers when a valid address appeared on the address lines. This was the right thing to do - but it didn't go far enough.
With the earlier fix, I only asserted READ when the UART was properly selected based on the current address in the MAR. *However* it didn't occur to me at the time that my sloppy design permitted devices (such as memory and the UARTS) to be selected during periods in which they were not actually being accessed. My address setup is based on placing an address in the memory address register (MAR). The device decode logic was previously driven *only* by the MAR - without regard to whether a memory/device access was going to be made.
So, what was happening was that during the serial writes, I would set up the MAR with the address of the data register of the UART. It would correctly write the character to output, but in subsequent microcode cycles the MAR would still hold the UART data register's address, and so the UART would be selected. Then, when doing a register operation, the READ signal (actually the R/W signal which defaults to READ) would combine with IOCLK and the UART would believe that I had requested a data read. Because I wasn't really reading the char, though, it would just be discarded.
Fixing this was trivial. I have a signal MEMREF (memory reference) which tells whether a machine cycle represents a memory/device read or write. All I had to do was factor this signal into the decode logic so that the UART is now enabled only when its address is present in the MAR *and* a memory/device reference is in progress.
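The fix, expressed as logic (a sketch of my own - the UART's actual address range is invented for illustration, only the signal names come from the text):

```c
#define UART_BASE 0xFF80u   /* illustrative device address, not the real one */
#define UART_SIZE 8u

/* Device select, before and after the fix.  Before: decode driven only
 * by the MAR, so a stale UART address kept the chip selected.  After:
 * MEMREF gates the decode, so the UART is enabled only while an actual
 * memory/device reference is in progress. */
int uart_selected(unsigned mar, int memref)
{
    int addr_match = (mar >= UART_BASE) && (mar < UART_BASE + UART_SIZE);
    return addr_match && memref;
}
```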
It works, as does hardware flow control. I'm cruising now at 19.2K baud. WooHoo! Next up is finishing up the code to allow an executable image in Intel hex format to be loaded over the serial port.
Still trying to get the serial port drivers to work right. The problem is dropped input characters. The code is pretty simple, and looks right to me. I've even tried combining polling and interrupt-driven input. I lose about 1 of 100 chars, particularly when simultaneously doing output - but I'm not getting any overrun, framing or other errors (at least no line status interrupts). Next step is to write some tests to make sure my flow control is working.
Back home from a week-long vacation with the family. Now I need a vacation to recover.
Doing a little fiddling with the serial port driver code. I'm running with FIFOs and receiver interrupts, and it all seems to work okay. Not working perfectly yet is my flow control. I'm trying to use XON/XOFF, but am dropping characters. Think I'll try hardware handshaking (RTS/CTS) and see if that works better.
WooHoo! - the UART problem was, in fact, spurious reads due to signal bounce, and is now solved. Thanks (again!) to Andrew Holme, who suggested a cleaner fix than my hack of delaying CLKS by running it through a bunch of driver gates. Instead, I'm using a previously unused flip-flop in the clock generation circuit to create a copy of CLKS shifted by 90 degrees of phase. Much cleaner - and it may come in handy later. I've updated and uploaded the schematics to reflect the changes.
I'm getting the IIR and LSR signals that I expect now. Next up is to complete the interrupt-driven serial port driver.
In looking over the UART datasheet and my schematics, I think the easiest way to deal with the (presumed) problem of spurious UART control register reads due to bouncing address lines is not to use the ALE (address latch enable) signal, but rather to hold off on asserting the "read" signal until things have settled down. To do this, I need a delayed copy of the system clock (CLKS). My plan is a bit of a hack. I will run CLKS through a series of 74LS244 bus driver gates. Each will cause a 12-nanosecond or so delay, and I believe that 5 gates, or 60 ns of delay, should be plenty. This delayed signal will then be combined with R/w (0 on read) and the UART's chip enable signal in a 74x27 3-input NOR gate. When all three are low, the 74x27 output will be high - and will drive the UART's "read" register pin. The signal will stay asserted until 60 or so nanoseconds beyond the rising edge of the true CLKS - plenty of time for any UART data to be clocked into a register.
Yuck. It would have been much cleaner to have a multi-phase clock.
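For the record, the 74x27 combination described above boils down to a simple truth table - the output (driving the UART's read pin) goes high only when R/w, chip enable, and the delayed clock are all low:

```c
#include <stdbool.h>

/* Model of the 74x27 3-input NOR used to gate the UART's "read" pin:
   output is high only while R/w (0 = read), the chip enable, and the
   delayed copy of CLKS are all low. */
static bool uart_rd(bool rw, bool ce, bool clks_delayed) {
    return !(rw || ce || clks_delayed);
}
```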
In other news, a new DLINK DWL-810+ wireless bridge arrived in the mail today. Within a few minutes I was able to configure it to talk to my home wireless network, and then plug in the Lantronix DS-200 device server connected to M-1's serial ports. The result: I can now talk to Magic-1 via my laptop computer wirelessly. Following is a window snapshot of me talking to M-1 wirelessly using the Kermit terminal program. The session starts off with me asking for a cold restart, and I get the start-up message. "Help" shows the supported commands, and I then ask for IDE drive identification and the "Hello" program.
Sometime soon (after I fix the UART interrupt problem), I'll move M-1 off the kitchen table. I also plan on using some old X-10 gear I've got lying around to allow me to remotely cycle power when M-1 gets hung.
Problems on the interrupt-driven UART front, but I have a theory.
I've spent the last few sessions struggling with converting my polled serial port driver into an interrupt-driven one. Bizarre behavior - when the UART receives a character, an interrupt is fired and my handler is invoked. Further, if I read the UART's data register, the proper character is present. However, I don't get the status flags I am expecting. When an interrupt is fired, the UART's Interrupt Identification Register (IIR) is supposed to tell me which interrupt fired (data ready, transmitter ready, error, etc.). But, when I read the IIR in the handler, it says no pending interrupt. Not only that, but the line status register (LSR) says that there is no new data to read either.
After many attempts to tweak initialization, verify proper register read order, and so on, I think I know what may be happening. It's only a theory, but it seems reasonable to me so far.
I think it's a hardware design error on my part.
In a sense, the UARTs (16650s) look like memory devices. They have control registers that are addressed by pins that you hook up to the address lines, and bi-directional data pins that you hook up to the data bus. When the UART is addressed, there is a chip select line just like memory - as well as Read/Write pins to say whether you are reading or writing the control registers.
This all seemed to me just like a memory chip, so I designed my interface circuits (at least the register address and read/write) the same. For my memory devices, the key signals I worry about are chip select and the R/w pin. For chip select, I make sure that only one device can drive the data bus at a time. The R/w pin says whether we are reading or writing. R/w always starts off selecting a read, and is combined with the system clock to go high after all of the data and address lines have had a chance to settle. This ensures that I don't write junk values into memory during the time in which the address lines may be settling.
As far as a memory read, though, I didn't think there was any problem with spurious reads from the wrong memory address during the address line settling time. Any device that wanted the read value wouldn't need it until the rise of the system clock. Even if I temporarily read the wrong address from a memory chip while the address lines are settling, no harm is done.
And that's what I think the problem is. The UART is *not* like a memory device when it comes to reading register values. A read can change state. When an interrupt happens and the interrupt code is set in the IIR register, it is *cleared* when the IIR is read. Similarly, when the UART receives a character, it sets a flag in the line status register. That flag is reset when the LSR is read.
My best guess is that after the character arrives and the interrupt is fired, the IIR and LSR *do* have the proper values. But, before I can read those values out, bouncing address lines "read" the registers first, clearing them.
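The read-to-clear behavior is easy to model. A toy sketch (the register value here is illustrative, not the UART's actual IIR encoding) shows how a spurious read destroys the latched state before the handler ever sees it:

```c
#include <stdint.h>

/* Toy model of a read-to-clear status register: reading the IIR clears
   the latched interrupt cause, just as it does on the real UART. */
typedef struct {
    uint8_t iir_pending;   /* nonzero when an interrupt cause is latched */
} uart_t;

static uint8_t read_iir(uart_t *u) {
    uint8_t v = u->iir_pending;
    u->iir_pending = 0;    /* the act of reading clears the latched cause */
    return v;
}
```

If bouncing address lines cause one phantom read, the handler's subsequent read returns "nothing pending" - exactly the symptom described above.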
The 16650 has a pin that I didn't really understand at first: ALE, or address latch enable. Its purpose is to signal to the UART when the values on the address lines are valid. I didn't think I needed it, and tied it inactive. Now, I think it may be the solution to my problem. I believe I need to use this pin to hold off on addressing the UART registers until all of the address signals have had a chance to settle. Once again, I regret the simplicity of my clock scheme. If I had several possible clocks with various offsets, it would be a simple matter to clock the ALE line to allow for the settle time. I may have to run my main clock through a few inverters to delay it enough to allow settling before asserting ALE.
Anyway, that's my current theory. I'll want to think about it a bit more before doing any rewiring.
Solved an interesting problem tonight. Last night I put in the first of the code needed to run the UARTs in interrupt mode. This involved something new - relying on a C static variable. Previously, I had only been using frame variables and a few hard-coded statics from assembly. The reason for this was that I needed my virtual memory remapping to be in place before statics would work (as they are originally written in ROM, but must be copied to SRAM before I can write to them).
Anyway, it didn't work - something went wrong. I only had time to try a few variants, which did not pin down the precise problem but strongly suggested that my virtual memory mapping wasn't working. This seemed odd, as I had previous tests which seemed to work. I worried about this - perhaps I had misread my earlier test results. If virtual memory didn't work, I could have serious problems.
So, tonight's first test program confirmed a virtual memory issue. Oddly, the copying of ROM and device SRAM to primary SRAM clearly did work - I was able to verify this via the front panel. However, when executing the code, the behavior was as if I were still executing from ROM. I was ready to break out the logic analyzer when the answer hit me. The way I was turning paging mode on and off was completely bogus.
Here's what was happening: I'm attempting to write the bios in C. However, my C compiler doesn't support inline assembly code, so there are a few things I have to do in assembly - such as turn interrupts on and off, and write page table entries. I had been providing these functions as simple assembly language subroutines that I have my C compiler prepend to every program I compile. When I wrote the code to do the virtual memory remapping the other day, I just copied the routines to turn interrupts on and off and modified them to turn paging on and off.
On reset, M-1 starts off with paging and interrupts off. My remapping code sets up the page table to point to ROM and device SRAM and then turns paging on by calling the paging_on() subroutine. Turning paging on at this point is essentially a NOP, as virtual and physical memory maps are the same. Then I do some remapping to copy the ROM and device SRAM to primary SRAM, and then turn off paging. The next step is to remap the page table such that when I turn paging back on, I will resume execution in primary SRAM. For this to work, the SRAM must be an exact image of the ROM/device SRAM, because the next instruction fetched will come from SRAM instead of ROM.
But, it wasn't an exact image. Almost, but not exact. After doing the copying and remapping it was exact, but when I did the subroutine call to the "paging_on()" routine it went out of sync. As part of the call instruction, the return address was pushed onto the stack located in device SRAM. The paging_on() code correctly turned paging on, switching things over to SRAM. However, when it came time for the return from subroutine, the stack was pointing to primary SRAM, which did not have the proper return address. Whatever previous value was on the stack was popped off and treated as the return address. It likely was some old return address, and bizarre execution happened from that point.
Bios code can be tricky.
Anyway, the solution here is to only switch paging on and off via inline assembly. I'll just rewrite the memory remapping code in assembly. Actually, I'd planned on this anyway in order to use my special user<=>system space copy instructions.
Started writing the code to load a program from the serial port, but realized I'll have to take a step back and fix up the infrastructure a bit. Right now, I'm accessing the UARTs in polled mode - which has been sufficient to this point. Whenever I want a new character, I just go into a busy-wait loop on the UART control register until one shows up. Similarly, when I want to transmit a character, I loop on the UART control register until it is ready to transmit. I implemented no flow control, which was fine for initial bring-up.
Now, however, I need flow control. When I ran the new code to load a program in Intel hex format (a text-based, hex-encoded format), the first line was processed correctly, but while I was displaying some status, the subsequent line was lost because my code didn't check the UART for incoming characters fast enough.
What needs to happen is for me to start accessing the UART in interrupt mode, and add flow control. I designed this capability in from the start, and have two interrupt request lines allocated for the UARTs. I need to change my UART initialization code to set interrupt mode, and then add some interrupt handlers. Basically, whenever a new character arrives, the UART will generate an interrupt. The handler will then be invoked and grab the new character and store it in an incoming char buffer. If the buffer gets too full, I'll use a flow control scheme to tell the sender not to send me any more characters until my buffer drains. I'll probably use XON/XOFF protocol. There is similar handling on output, and as an added feature my UART maintains a little buffer of its own.
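Here's roughly what I have in mind for the receive side - a minimal sketch, with buffer size and watermark thresholds chosen arbitrarily. In the real driver, rx_enqueue would be called from the interrupt handler with the byte read from the UART data register, and the XON/XOFF bytes it hands back would be transmitted to the sender:

```c
#include <stdint.h>
#include <stdbool.h>

#define BUF_SIZE   64
#define HIGH_WATER 48        /* send XOFF once the buffer fills this far */
#define LOW_WATER  16        /* send XON once it drains below this */
#define XON  0x11
#define XOFF 0x13

static uint8_t rx_buf[BUF_SIZE];
static int rx_head, rx_tail, rx_count;
static bool sent_xoff;

/* Interrupt side: store the incoming byte; return XOFF if the sender
   should be throttled, 0 otherwise. */
static int rx_enqueue(uint8_t c) {
    if (rx_count < BUF_SIZE) {
        rx_buf[rx_head] = c;
        rx_head = (rx_head + 1) % BUF_SIZE;
        rx_count++;
    }
    if (rx_count >= HIGH_WATER && !sent_xoff) {
        sent_xoff = true;
        return XOFF;
    }
    return 0;
}

/* Foreground side: fetch a byte (-1 if empty); *ctl gets XON when the
   buffer has drained enough to let the sender resume. */
static int rx_dequeue(int *ctl) {
    *ctl = 0;
    if (rx_count == 0)
        return -1;
    uint8_t c = rx_buf[rx_tail];
    rx_tail = (rx_tail + 1) % BUF_SIZE;
    rx_count--;
    if (sent_xoff && rx_count < LOW_WATER) {
        sent_xoff = false;
        *ctl = XON;
    }
    return c;
}
```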
This will probably be one of those things that is very simple to do - once I get it right. But, I expect to go quite a few iterations before I get the setup correct.
Found the switch statement problem - it was a bug in my assembler. I spent a lot of time chasing down EPROM problems, and ended up discarding 3 27C256s after they went through two erase/program cycles with errors. I'd pretty much convinced myself that I had a bad microcode EPROM, but that wasn't it. The problem was that I had one path in my assembler in which I was not detecting a branch displacement overflow. The "cmpb.lt.16 a,b,br_tgt" instruction worked just fine, but the displacement to the branch target didn't fit in 1 byte. The bug was that my assembler didn't flag this as a problem, and just silently truncated the displacement.
At the moment, my assembler will attempt to use short forms of the conditional branches, and isn't smart enough to replace the short form with the long one if the target displacement doesn't fit in one byte. I'd been avoiding fixing that, intending to have that work in the new macro assembler that I hope to write.
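The missing check is just a range test on the computed displacement. A sketch, assuming a signed 8-bit displacement measured from the address following the instruction (the actual encoding details may differ):

```c
#include <stdint.h>
#include <stdbool.h>

/* The check the assembler should have made before emitting a short
   conditional branch: does the PC-relative displacement fit in a
   signed byte? If not, flag an error (or fall back to the long form)
   instead of silently truncating. */
static bool disp_fits_in_byte(int32_t target, int32_t pc_after_insn) {
    int32_t disp = target - pc_after_insn;
    return disp >= -128 && disp <= 127;
}
```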
Anyway, this is good news. The hardware continues to look solid. Next up is writing that program loader.
Haven't attempted to debug the switch statement problem, but I have hit a milestone of sorts. I can now talk to M-1 wirelessly from my laptop. A few days ago I ordered a Lantronix UDS-200 device server, and it arrived today. I hooked up M-1's serial ports to the UDS-200, which is plugged into my wireless router. Lantronix supplies some Windows software that redirects serial port traffic from my laptop over my local intranet to the UDS-200, which then passes it along to M-1. In short, the setup serves as a replacement for the serial port cable I was using.
Now all I have to do is write that program loader, and I can stop having to burn so many EPROMs. I'll be able to write new programs, load them into M-1 and run them without even having to get out of my chair. Also, I will be able to move M-1 off the kitchen table into a safer, more permanent location. That'll make Monica happy.
Once things are a little more solid, I can also move the UDS-200 out from behind my firewall. That way anyone can download the Lantronix redirector software and talk with M-1.
Doing some debugging. I added the command loop using 1-character command names via a C switch statement. It mostly works, but when the switch statement routes to the default case, I get strange behavior. Since I have to burn a new EPROM for each test, it's a bit slow going, but my current suspicion is a problem in the "cmpb.lt.8 a,b,br_target" instruction. Should be fairly easy to confirm, whenever I can find a free hour or two.
Debugging at this point is a challenge. *Anything* could be the culprit. I can't really rely on the generated C code, or my assembler, or the microcode, or my EPROMs, or the hardware. When a problem shows up, I sort of have to be like one of those old philosophers and start at first principles: "what do I know is true?", rather than "what is not true?". Fortunately for me, I'm twisted enough to think this is fun.
Anyway, it's nice to have an interactive command loop. I may have to go ahead and buy a device server (for example: http://www.lantronix.com/products/ds/uds10/index.html). With one of these guys I can set up Magic-1 on my local network and talk to it wirelessly via my laptop. That would make Monica happy (getting the machine off the kitchen table). To really make it work well, I can also use one of my X10 thingees to remotely cycle power if it gets hung.
Some progress on the bios/monitor front: it is now running out of SRAM with hardware address translation enabled. It turned out to be a bit tricky to get right, but it works. I first set up the page table to do a 1-to-1 mapping of the virtual address space to the boot ROM, and then turn paging on. Next, map the first 16K of SRAM to addresses 0x8000 through 0xbfff. Then, I copy the first 16K of the boot ROM to addresses 0x8000 through 0xbfff - which in effect copies the boot rom to the first 8 pages of physical SRAM. I do this again for the 2nd 16K bytes of the boot rom (actually device SRAM) and then turn paging off. To complete the transition, I rewrite the page table to now map to SRAM instead of the boot ROM and finally turn paging back on. At this point, we're running in SRAM rather than the boot ROM.
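The whole dance can be modeled with a toy page table. The page size and physical layout here are simplified for illustration (one 16K region per table entry, which is not Magic-1's real page granularity) - but it shows the trick: copy ROM into SRAM through a window at 0x8000, then rewrite the table so the low addresses point at the copy:

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 0x4000               /* one 16K region per entry (toy value) */
#define NPAGES    4                    /* 64K virtual space / 16K */
enum { ROM0, ROM1, SRAM0, SRAM1, NPHYS };

static uint8_t phys[NPHYS][PAGE_SIZE]; /* physical memory regions */
static int     ptab[NPAGES];           /* page table: virtual page -> region */

/* Translate a virtual address through the page table. */
static uint8_t *vaddr(uint16_t va) {
    return &phys[ptab[va / PAGE_SIZE]][va % PAGE_SIZE];
}
```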
All of the above code is written in C, but shortly I'll rewrite it in assembly for speed - that way I can use some of my special block copy operators.
Along the way I discovered that some of my EPROMs appear to be bad or marginal - or perhaps my EPROM burner is flakey. In any event, I'm beginning to wonder whether this accounted for some of the strange problems I encountered recently.
Next up is to prepare a monitor command loop. Right now the bios/monitor runs through all of its functionality and then drops into an echo loop. What I want is for it to do minimal initialization and virtual space remapping and then drop into a command loop. I'll convert much of what I already have to command subroutines.
The next big piece of functionality will be the ability to load and execute a program over the serial port (Intel hex format). My plan is to do this as a separate user-mode process. I'll allocate a process id and associated page table and then load the new program using my FROMSYS byte copy instruction (copies from system to user space). I will have to transition to the new program using a return from interrupt instruction, and when it exits I'll do a syscall trap to return to supervisor mode. To make this really clean, I'll also want to support system calls for basic I/O.
Among the first test programs to run as separate processes will be the dhrystone benchmark. Looking forward to that. It will be fun to compare M-1's MIP rating to some old (and new!) machines.
Decided to tackle the bios/monitor code first. Need to think about the structure a little, but here are my current thoughts:
At start (address 0x0000), first instruction is a branch around the interrupt vector. We then:
Or something like that. Basically, I want to start executing out of SRAM rather than EPROM. Copying will let me better support initialized C statics in my bios code. I'll also want to do some quick memory testing.
Still need to think a bit about a cleaner partitioning of bios vs. runtime compiler support code. Right now my C compiler is automatically inserting a big chunk of the bios and runtime support. What I want to have is a permanent bios with program loading capabilities. The bios would include the interrupt vector and basic device support, and I think I'd want the C compiler to include libc-like routines (printf, memcpy, etc) as well as the runtime support (multiply, divide, etc). In some ways it would be nice to have the support routines in the bios, but until I have a linker that would complicate program generation.
As far as the bios/monitor's command loop, I'd expect the following commands:
I also might want to have the Forth implementation (whenever I get around to it) embedded as part of the bios.
The next phase of the project is bringing up the software stack. This will take a while. The goals for this phase will be the following:
Note that for this effort, I'm going to try to get by without much in the way of an operating system. C programs will pretty much be stand-alone, and will not include file system support (just I/O via the serial port). The BIOS will include a simple monitor. The Forth environment may count as a real operating system, but I'm not sure how far I want to take it. My end-goal continues to be a port of Minix - but it will be a long time before I attempt that.
Over the next couple of days, I'll think about the order in which I'm going to attack this new phase. Here's the current state of the project:
I haven't yet started on this project. I'm currently using "qas", a quick and dirty assembler I wrote, along with the m4 macro preprocessor. Qas has worked well to this point, but its lifespan is coming to an end. The main feature of qas is that it automatically tracks changes to the ISA. It is built using lex and yacc, and the input grammars are automatically generated (via perl scripts) from the microcode web page on this site. This has worked out very nicely. When I add or change an opcode, all I have to do is rebuild, and a new qas is generated which knows about the new instruction.
However, the price I've paid for this is a lack of flexibility. Qas currently can't properly select whether to use long vs short forms of some conditional branches, and doesn't handle pseudo-ops very well. Finally, and most importantly, it generates an absolute load module - not a relocatable object file. That is becoming a serious problem as I try to build more complex programs.
So, my plan is to write a new macro assembler from scratch, which will generate either COFF or ELF object files (probably COFF - haven't decided yet). I won't attempt to continue the automatic grammar generation, on the assumption that my ISA is pretty solid now. I'll probably continue to use yacc and lex, and will mostly follow Macro-11 styles of macros.
I'm in pretty good shape here. The basic retargeting of lcc is complete, though I'm sure I'll run into both code quality and correctness problems as I run more code through it. The output of the compiler is assembly code which must be pre-processed by the m4 macro preprocessor before being assembled with my quick and dirty assembler, qas. As I replace qas with a full macro assembler, I'll have to make corresponding changes to my lcc retargeting - but that should be fairly simple to do.
My plan is to use GNU code for the linker. No need to do anything custom here. The only question is whether to go with COFF or ELF.
As part of my current BIOS, I have support for 16-bit integer mul/div/mod, as well as 32-bit add/sub and some compares. I need to add 32-bit left and right shift, as well as 32-bit mul/div/mod. I haven't done anything with floating point yet. I believe that Minix and Linux both have some floating point emulation code, and I'm hoping I can just reuse that.
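A 32-bit shift helper on a 16-bit machine splits into a few cases depending on the shift count. Here's a sketch of the left shift, using a hypothetical two-halfword representation (shift counts of 0 through 31 assumed; the real runtime routine would be assembly operating on register pairs):

```c
#include <stdint.h>

/* A 32-bit value as the 16-bit machine sees it: two halfwords. */
typedef struct { uint16_t lo, hi; } u32_t;

/* 32-bit left shift built from 16-bit operations. */
static u32_t shl32(u32_t a, int n) {
    u32_t r;
    if (n >= 16) {                     /* low half shifts entirely into high */
        r.hi = (uint16_t)(a.lo << (n - 16));
        r.lo = 0;
    } else if (n > 0) {                /* bits carry from low half into high */
        r.hi = (uint16_t)((a.hi << n) | (a.lo >> (16 - n)));
        r.lo = (uint16_t)(a.lo << n);
    } else {                           /* shift by zero: unchanged */
        r = a;
    }
    return r;
}
```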
I really don't have much of anything here yet. At the moment, all test programs are burned into an EPROM and include the BIOS code. Also, everything is pretty much running now in supervisor mode out of EPROM. What I need to do is create a monitor which copies itself from EPROM to SRAM, turns on paging and then falls into a command loop. This command loop would then include an option to load a new program over the serial port (probably in Intel hex format). Right now, I have some difficulty running programs from EPROM because I don't currently have the ability to have statically initialized C global variables. The monitor/program loader capability should make that much easier.
Basic libc capabilities
I have code for strlen, strcmp, strcpy - as well as some basic printf and time support. As a first step, I'll need to add malloc (with null free) and flesh out printf, scanf, getch, putch etc. Until I get a linker going, I'll continue to include these capabilities in the BIOS.
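A malloc with null free can be as simple as a bump allocator over a static heap - plenty for stand-alone test programs that never return memory. A sketch, with the heap size and alignment chosen arbitrarily:

```c
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE 4096                 /* arbitrary for this sketch */

static uint8_t heap[HEAP_SIZE];
static size_t  heap_top;

/* Bump allocator: hand out sequential chunks, 2-byte aligned. */
static void *bios_malloc(size_t n) {
    n = (n + 1) & ~(size_t)1;          /* round up for 2-byte alignment */
    if (heap_top + n > HEAP_SIZE)
        return 0;                      /* out of heap */
    void *p = &heap[heap_top];
    heap_top += n;
    return p;
}

/* Null free: memory is never reclaimed. */
static void bios_free(void *p) {
    (void)p;
}
```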
My microcode needs a bit of cleanup, some of which has to wait until I have a more capable assembler. First, I need to redo the trap/interrupt microcode to reflect some late hardware changes I made. The key issue is that the flags register must be the first thing saved and the last thing restored. In brief, interrupt/trap save state is flushed to the system mode stack pointer. When a trap/interrupt happens, I need to grab a copy of the machine status bits (which includes the bit that says system vs. user mode execution), turn on supervisor mode bit in the MSR and then save the original MSR flags to the system stack. Given my lack of internal temporary registers, this is a little tricky (but doable).
Additionally, I have a few redundant instructions in the ISA. For example, SH0ADD a,b,a is functionally the same as SH0ADD a,a,b. However, my current assembler isn't smart enough to alias the two, so I went ahead and assigned each a separate opcode. When the assembler is smarter, I can reclaim those opcodes.
The reason I need to reclaim opcode space is to support some Forth primitives. Once all that is done, I can call the microcode complete and lock it down.
BIOS & monitor
The BIOS is reasonably complete now, but could stand some restructuring. Mostly what I need is to add a monitor command loop, complete with the ability to load a new program over the serial port. This will be one of the first things I attack.
I like Forth a lot, and for small machines it's hard to beat. I plan on doing a Forth implementation for M-1. My expectation is that all I have to do is implement a few primitives, and then I can just import an existing Forth implementation (which is written in Forth). The big question here is how I decide to map Forth stack/code/etc. pointers to M1's architecture.
High-speed simulation environment
My current M-1 simulator is a bit out of date, and not all that useful anymore. It is structured at the package/gate level, and was enormously useful for validating the hardware implementation and microcode. However, it is very slow. To aid in the software bringup, it would be useful to have a fast functional simulator. Should be pretty easy to write the core, but simulating devices will be a little more difficult. Might try to get fancy here and simulate the front panel & blinky lights.
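The core of a functional simulator is just a fetch/decode/dispatch loop over architectural state, with no gate-level modeling. A skeleton over a made-up two-instruction toy ISA (the opcodes and register set here are invented for illustration, not Magic-1's):

```c
#include <stdint.h>

/* Toy ISA for the sketch: increment the accumulator, or halt. */
enum { OP_HALT = 0, OP_INC = 1 };

typedef struct {
    uint16_t pc;       /* program counter */
    uint16_t a;        /* accumulator */
    int      halted;
} cpu_t;

/* One fetch/decode/execute step - the whole simulator core is just
   this loop body applied until halt. */
static void step(cpu_t *c, const uint8_t *mem) {
    switch (mem[c->pc++]) {
    case OP_INC:  c->a++;        break;
    case OP_HALT: c->halted = 1; break;
    }
}
```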
M-1 stand-alone web/telnet server
Using Adam Dunkels' uIP, I shouldn't have too much trouble turning M-1 into a web server and permitting remote telnet access. In fact, I probably am not too far away right now from getting it all to work. However, I think I'll wait until I have a bit more capability in M-1 before I attempt this.