A Simple CPU

Magic

When I was a child, I marveled at our family computer. Occasionally, when I was bored of shooting things in Duke Nukem 3D or Doom, I'd lean my face in really close to the monitor and start nudging the arrow keys. As the game camera moved about the virtual space, the pixelated rectangles on screen would shift and distort in complex patterns. I was convinced that if I just stared at it long enough, I'd figure out how they made the colors on screen change to look like a 3D room.

In high school, I finally learned enough math to figure it out. I even built a little ray-casting 3D engine. (I later learned neither game actually used ray-casting. Que sera.) It was one thing to work out on paper how to use wall distances to shrink or expand a pixel-wide column of texture. It was another thing to walk around a room made of your own math.

In my childhood, I'd been shown a magic trick. Hocus pocus, now you see a room! Later, I'd learned the trick myself.

In college, I learned computer architecture, compilers, network stacks, algorithms, and all the trappings of software. I took a digital circuit design class and made sequential logic circuits on breadboards. A deeper magick had been revealed: computing itself.

I learned how these tricks were done, but I couldn't really do them myself. I wanted to make a CPU to embody what I'd learned, but every path towards this goal felt either trivial or tedious. I could program it all into an FPGA, but that felt just like weirder programming. I could hand-wire relays into flip-flops and half adders, but that would take forever. I could wire together a bunch of store-bought ALUs and registers, but that felt like assembling really complicated LEGOs. So I put the project on the shelf for a long time.

Simplicity

The project popped back into my mind a while ago with a twist. Even toy CPUs usually have instruction decoding, data path switching, an ALU, etc. But these toys are simplified models of production CPUs, and production CPUs need to be efficient, performant, and practical.

My CPU wouldn't need to be efficient, performant, or particularly practical. I'd just want it to compute the Fibonacci sequence in under a day or so. This opened up the enormous space of less-than-practical designs.

Any Turing-complete computer architecture can be reduced to any other via software rewriting. So long as the transformation wouldn't make a Fibonacci program completely infeasible, any would work. I didn't want to do a lot of wiring, so I wanted one I could build from very few electronic parts.

One instruction to rule them all

A dazzling variety of Turing-complete abstract machines have only one instruction type. These are appropriately named one instruction set computers (OISCs). Much of a production CPU's hardware exists to switch the processor's behavior when different types of instructions are read. With only one instruction type, I could avoid needing any hardware for this.

SUBLEQ (subtract and branch if result <= 0) is probably the most famous OISC. I explored it thoroughly, but implementing it seemed hardly any easier than a normal CPU. SUBLEQ requires an ALU for subtraction with a status bit for the comparison. The CPU would also need to have different behavior depending on the value of the status bit.
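
For concreteness, here's a minimal sketch of one SUBLEQ step in Python; the subtraction and the sign test are exactly the hardware I wanted to avoid:

    # One SUBLEQ step: subtract mem[a] from mem[b]; branch to c if the
    # result is <= 0. The subtraction needs an ALU, and the branch needs
    # a status bit that changes the processor's behavior.
    def subleq_step(mem, ip):
        a, b, c = mem[ip], mem[ip + 1], mem[ip + 2]
        mem[b] -= mem[a]
        return c if mem[b] <= 0 else ip + 3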

I consulted various lists of OISCs until I found the perfect one: byte-byte-jump. Any CPU must, at the very least, accept input and deliver output. This instruction requires little else.

Byte-byte-jump takes three addresses as arguments: source, destination, and jump. It reads a byte from the source address, writes it to the destination address, and changes the instruction pointer to the jump address.
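In simulator form, one step looks like the following (the six-byte, big-endian instruction packing is my assumption, not something byte-byte-jump itself dictates):

    # One byte-byte-jump step over a 64 KiB memory. Each instruction is six
    # bytes: [src_hi, src_lo, dst_hi, dst_lo, jmp_hi, jmp_lo].
    mem = bytearray(65536)

    def step(ip):
        src = (mem[ip] << 8) | mem[ip + 1]
        dst = (mem[ip + 2] << 8) | mem[ip + 3]
        jmp = (mem[ip + 4] << 8) | mem[ip + 5]
        mem[dst] = mem[src]  # the one and only data operation
        return jmp           # the jump address becomes the new instruction pointer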

Can't add, doesn't even try

It seemed a bit magical to me that such a simple operation could implement arithmetic, comparison, and conditional branching. As with all magic tricks, it's a bit less impressive once you know how.

Tables. A program can perform any arithmetic operation by including a table from all possible inputs to the corresponding output. Production programs use lookup tables for DES S-boxes, hash functions, etc. This is no different, except that usually lookup tables make a program more efficient. Here they make a program possible.

Self-modifying code allows byte-byte-jump to perform table lookups. Say a program includes a byte-byte-jump with the table as the source address. At runtime, before the instruction executes, the program can use another byte-byte-jump to overwrite the low byte of the source address (in the program itself) with the table index. When the modified instruction later runs, it then reads from the correct location inside the table. Overwriting the jump address in this fashion yields conditional branching.
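
Using the simulator sketch above, the trick looks like this (all addresses are illustrative, and the table is 256-byte aligned so that overwriting the low source byte indexes into it):

    # Instruction 0 copies the index byte over the low source-address byte of
    # instruction 1 (at 0x0006 + 1 = 0x0007); instruction 1 then reads
    # TABLE + index instead of TABLE.
    TABLE, INDEX, RESULT = 0x0100, 0x0020, 0x0021

    def poke_insn(at, src, dst, jmp):
        mem[at:at + 6] = bytes([src >> 8, src & 0xFF, dst >> 8, dst & 0xFF,
                                jmp >> 8, jmp & 0xFF])

    poke_insn(0x0000, INDEX, 0x0007, 0x0006)
    poke_insn(0x0006, TABLE, RESULT, 0x000C)

    mem[TABLE:TABLE + 256] = bytes((i * i) & 0xFF for i in range(256))  # a sample table
    mem[INDEX] = 7
    ip = step(step(0))         # run instruction 0, then the modified instruction 1
    assert mem[RESULT] == 49   # the table entry for index 7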

One snag. The naive table for even an 8-bit binary operation (like addition) is 64 KiB. That plus a program would require addresses larger than 16 bits. Once the addresses get past 16 bits, my nostalgia goggles fall right off.

No matter. 4-bit operations chained together can do any 8-bit operation, and the table for a 4-bit binary operation is only 256 bytes. This allows my CPU to have 8-bit bytes and 16-bit addresses, as God intended.
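
Here's a sketch of the chaining, with plain Python indexing standing in for the self-modifying lookups:

    # A 256-byte table for 4-bit addition: the index is (a << 4) | b, the
    # value is the full 5-bit sum (carry in bit 4, sum in bits 0-3).
    ADD4 = bytes(a + b for a in range(16) for b in range(16))

    def add8(x, y):
        lo = ADD4[((x & 0x0F) << 4) | (y & 0x0F)]   # add the low nibbles
        hi = ADD4[((x >> 4) << 4) | (y >> 4)]       # add the high nibbles
        hi = ADD4[((hi & 0x0F) << 4) | (lo >> 4)]   # fold in the carry from the low half
        return ((hi & 0x0F) << 4) | (lo & 0x0F)     # 8-bit result; final carry dropped

    assert add8(0x17, 0x28) == 0x3F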

The cheapest configuration of the commercially sold IBM 1620 replaced expensive logic with arithmetic tables in cheaper program storage. As a result, users took its internal code name, CADET, to stand for "Can't Add, Doesn't Even Try."

Engine

Let's talk about hardware. We can start by analyzing byte-byte-jump. Byte-byte-jump involves two reads and one write to main memory. Simple main memories can only read or write one address at a time, so these reads and writes must happen at different times. In between, the CPU will need to remember some data.

Specifically, the CPU must remember the source and destination addresses, the instruction pointer, and the byte to be written. The CPU also needs a way to access or overwrite each of these values without disturbing the others.

There are two easy ways to differentiate things in hardware: space and time. Parallel systems include multiple copies of a thing with different wires and pins for each. Serial systems use the same pins and wires to handle different items at different times, with the timing coordinated by some protocol. Parallel takes more work to physically wire up, so serial is my jam.

So, I need a way to remember some bits with a serial interface. That means either a serial register bank or a serial RAM. I couldn't find any cheap serial registers, so I started shopping for serial SRAMs. Eventually, I selected the 64 KiB Microchip 23LC512. The chip uses SPI to communicate data, addresses, and read/write operations. SPI is literally the simplest serial communications protocol I can imagine, and the chip only costs around $1.

I'll revel for a moment in the absurdity of using 64 KiB of RAM to remember 48 bits. This one chip has more switching elements than the first ten of mankind's computers combined. Yet, the smallest SPI SRAM I could find was still 8 KiB, and the 64 KiB part was the smallest with the sequential mode I needed.

The SRAM also has an instruction set, address pointer, operating modes, registers, and a communications protocol. Is this SRAM really just a CPU masquerading as a SRAM? Sort of, but not really. It doesn't have an instruction pointer or control flow. This means it cannot request instructions for itself, and the outside system must spoon-feed them in. It's a powerful engine, but without a driver.

Driver

Something needs to drive the SRAM's pins. The SRAM has eight, but only three are interesting: chip select (CS), serial input (SI), and serial output (SO).

The chip can't actually input and output at the same time. When SI is in use, SO is always electrically disconnected, and vice versa. Thus, instead of treating them separately, I wired them together into one "SIO" pin. This reduces the number of electrical connections needed with the rest of the system.

The CS pin controls the timing of the operations sent over the SIO line. Once CS is activated, the operation type (read/write) and address are sent over SIO. Data then flows over SIO from/to the given address (and successive addresses) until CS is deactivated.
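
In host-side pseudocode, a read might look like the following; set_pin, get_pin, and pulse_clock are hypothetical GPIO helpers, and the opcode is the standard SPI-SRAM READ (0x03), so check the 23LC512 datasheet rather than trusting this sketch:

    READ = 0x03  # assumed: the standard SPI-SRAM read opcode

    def send_bits(value, nbits):
        for i in reversed(range(nbits)):   # MSB first
            set_pin("SIO", (value >> i) & 1)
            pulse_clock()

    def sram_read(addr, count):
        set_pin("CS", 0)                   # activate: CS is active-low
        send_bits(READ, 8)                 # operation type
        send_bits(addr, 16)                # 16-bit starting address
        data = []
        for _ in range(count):             # bytes keep flowing from successive
            byte = 0                       # addresses until CS deactivates
            for _ in range(8):
                pulse_clock()
                byte = (byte << 1) | get_pin("SIO")
            data.append(byte)
        set_pin("CS", 1)                   # deactivate
        return data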

The main memory also needs to communicate with the CPU. For this, I used a slightly tweaked version of the SRAM's protocol over SPI. The main memory gets its own CS line, but it shares SIO with the SRAM, allowing bidirectional communications between the two.

With main memory and the SRAM connected together in this fashion, the stage is set. Enter the players. Four more identical SRAMs can be hooked together with the SRAM and main memory. Together, the five SRAMs can operate as a byte-byte-jump CPU.

How does adding more SRAMs help? Well, performing byte-byte-jump requires issuing various instructions and addresses to the SRAM and main memory. These take the form of specific binary signals for SRAM CS, main memory CS, and SIO. An SRAM can generate any binary signal.

To generate a signal, fill the SRAM with repeated copies of the signal and issue a read to address zero. The signal will then come out of SO indefinitely, since the SRAM reads each address successively, and the internal address pointer wraps back around to zero once it reaches the end.
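
Here's a sketch of building such an image (illustrative names, not the author's actual script). The signal's period in bits must divide the total bit count evenly, or the wraparound at address zero would glitch:

    TOTAL_BITS = 65536 * 8  # a 64 KiB chip, streamed out one bit per clock

    def signal_image(period):
        """period: one cycle of the control signal, as a list of 0/1 bits."""
        assert TOTAL_BITS % len(period) == 0, "period must divide the bit count"
        stream = period * (TOTAL_BITS // len(period))
        image = bytearray()
        for i in range(0, TOTAL_BITS, 8):
            byte = 0
            for bit in stream[i:i + 8]:
                byte = (byte << 1) | bit  # pack MSB first, matching SPI shift order
            image.append(byte)
        return bytes(image)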

For the two CS lines, these signals are just a sequence of high and low voltages. They thus require one SRAM each. SIO is a little different, since it should only be driven when neither the SRAM nor main memory is sending data across it. This means it needs three states: high, low, and disconnected.

Luckily, the SRAMs have HOLD pins that disconnect their outputs. The third SRAM drives SIO, and the fourth SRAM drives the third SRAM's HOLD pin. This allows the third SRAM to only drive SIO sometimes, completing the setup.

While I would have loved to be able to say I built a CPU out of five identical SRAM chips, reality was not so kind. I'd have to program four of them with four different signals every single time the system was powered on. So instead, I used four Microchip EEPROMs. These have nearly identical interfaces to the SRAMs, but once programmed, they retain their behavior without power.

Build

I needed:

- One Microchip 23LC512 SPI SRAM
- Four Microchip SPI EEPROMs
- A breadboard and jumper wires

I bought all this from Mouser and waited for it to arrive. It cost me about $15, and $7 of that was shipping.

The part of main memory was to be played by a Raspberry Pi and its GPIO pins. I wrote a program to accept read and write operations over the SIO and main memory CS lines, using a 64 KiB array as backing storage. Whenever the CPU writes a byte to the address 0xFFFF, the program prints it to the terminal. This gives programs for the CPU a simple output device.
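
The inner loop might look roughly like this; the GPIO helpers (wait_until_low, recv_bits, send_byte, get_pin) are hypothetical, and the opcodes are my assumption, since the protocol is only described as a tweaked version of the SRAM's:

    READ, WRITE = 0x03, 0x02  # assumed opcodes for the tweaked protocol
    mem = bytearray(65536)

    while True:
        wait_until_low("MEM_CS")           # CS activates: a transaction begins
        op = recv_bits(8)                  # operation type
        addr = recv_bits(16)               # starting address
        while get_pin("MEM_CS") == 0:      # sequential until CS deactivates
            if op == READ:
                send_byte(mem[addr])       # drive the byte onto SIO
            else:
                mem[addr] = recv_bits(8)
                if addr == 0xFFFF:         # the memory-mapped output device
                    print(mem[addr])
            addr = (addr + 1) & 0xFFFF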

I wrote a script to program each of the EEPROMs with its control signal, then verified that each EEPROM would output that signal indefinitely. Finally, I wired the four EEPROMs, the SRAM, and the Raspberry Pi together.

I turned it on for the first time. It immediately started executing byte-byte-jumps, instead of the expected behavior, catching fire. It took around 200 clock cycles to execute a single byte-byte-jump.

I incrementally built up a library of routines to write a Fibonacci-calculating program into the memory array. All things considered, it took around 200 byte-byte-jump instructions to execute an 8-bit addition.

You can see it in action at various clock rates below. The second-rightmost LED is SIO, and all of the others are control lines. The Turing magic happens on the SIO LED; the others are perfectly cyclical.

Discussion

Compared to the time it took to conceive, design, and program this CPU, building it was incredibly straightforward. It operated reliably at all the clock rates I could get Python to generate, up to 80 kHz. Admittedly, that's not very fast, but it only takes around 40,000 cycles to do an addition. That is, it can do around 2 additions per second, which is more than I can do, and I bet it can do them all night!

The final design resembles early computers far more than modern ones. Many early computers were bit-serial, and very limited instruction sets supplemented with lookup tables were not unheard of. In both cases, simple implementation is king. I wanted it to be simple because I'm lazy. They needed it to be simple, since technology placed tight limitations on what even major world governments and universities could build.

I can safely say that running this CPU for the first time was exactly as satisfying as I'd hoped. I still can't believe the damn thing actually computes, and faster than I can! Given how many things could have broken along the way, the fact that it does what I wanted is, well, pretty magical.

I've collected a fair amount of additional details from my notes. Besides the schematics in that document, the canonical control codes and software are all defined in the two Python files I linked to above. If there's anything I left out, shoot me an email.

Daniel Thornburgh (mysterymath@gmail.com)