I'm looking forward to seeing how this gets built out. I think you are doing a great job at explaining the current architecture, so I'm really looking forward to getting to the out-of-order execution parts, even if it's going to be a couple of years away. Going from Ben Eater, to James Sharman, to this channel has been really eye-opening at how simple logic can be used to build very complex machinery.
I really love seeing all the breadboard and related projects on YouTube over the last few years. Videos like this inspired me to finally start a project that I had in my mind for ages: a Z80-based computer on an ATX mainboard, with ports for modern peripherals, but compatible with old software, CP/M and MSX in particular. And of course, take full advantage of the cheap prices for SRAM these days and have a ton of memory, for an 8-bit system.
I totally agree! They have been good years for breadboard homebrew computing. Really exciting to see all the projects people come up with. Z80 on an ATX mainboard is a fantastic crossover! 🥳
I'm super curious how you plan to go superscalar OoO with the currently designed parts - especially the register file. Even the most basic meaningful dual-issue would require a 2W4R setup. Double pumping? Doubling them and using LVTs? Something more exotic like an XOR-based multi-porting scheme? Of course, I'm getting way ahead of myself. Keep up the awesome work, I love your videos! The editing/graphics/production are fantastic.
If I understood things right, I think we'll end up seeing this become a scalar processor first, and then slowly go back to the breadboards to add functionality.
Yes exactly, my plan is to build this out to a scalar processor first, while also building in the provisions to go superscalar afterwards. For example, once we have an ALU and operations that take multiple cycles, things like out-of-order writeback and scoreboarding/hazards already need to be addressed. My hope is that we can carefully make a simple initial CPU gradually more complex and powerful by loosening constraints, allowing more and more out-of-order operation, etc. I expect that the registers will see one or two overhauls: first for a content-addressed register scheme to support reservation stations, and second to double the access bandwidth to 2W4R for true dual issue.
@@denilsonbitme1715 Let's say you want to fetch and execute 2 instructions per clock, and in the worst case they both read 2 registers and write back their results: the register file needs to read 4 different values at once and write back two of them - the shorthand being 2W4R (right now the implementation is 1W3R). It's a really fascinating architectural problem with a huge variety of possible solutions, such as the Live Value Tables I mentioned. If you google "Hennessy and Patterson", one of the first links will be a PDF of their computer architecture book, which is pretty much the gold standard, and Onur Mutlu has several full computer architecture course playlists on YouTube.
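The Live Value Table idea mentioned above can be sketched in a few lines: duplicate the register file once per write port, and keep a small table that remembers which copy holds the newest value of each register. This is just an illustrative Python model (class and method names are made up, not from the video):

```python
# Hypothetical sketch of a 2W4R register file built from two 1W-per-port
# banks plus a Live Value Table (LVT). Each write port owns one bank;
# reads consult the LVT to find the bank with the freshest value.

class LVTRegisterFile:
    def __init__(self, num_regs=16):
        self.banks = [[0] * num_regs, [0] * num_regs]  # one bank per write port
        self.lvt = [0] * num_regs  # which bank last wrote each register

    def write(self, port, reg, value):
        # A write only touches its own bank; the LVT records who wrote last.
        self.banks[port][reg] = value
        self.lvt[reg] = port

    def read(self, reg):
        # Any number of read ports can do this lookup independently.
        return self.banks[self.lvt[reg]][reg]

rf = LVTRegisterFile()
rf.write(0, 3, 42)  # write port 0 updates r3
rf.write(1, 3, 99)  # write port 1 updates r3 later
print(rf.read(3))   # 99 -- the LVT points at bank 1
```

The neat part is that read bandwidth comes almost for free (each bank already has all the read ports in hardware); the cost is the duplicated storage and the LVT itself, which needs as many write ports as the whole file.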
For those who want a simpler option there is also "customasm", in which you just have a few files that define your processor's ISA, and it will take assembly code and spit out a binary file for you.
I was thinking that it's a very simple CPU... too simple... Then it hit me! Everything is done in 1 cycle!! Bl**dy h*ell! That's efficient!!! Sounds like your RISC background is in this project. Love it. :):):)
Yeah, pipelining is an interesting technique, and it's somewhat counterintuitive: you don't need pipeline stages to build a CPU. You can build arbitrarily complex machines that do all that they have to do in a single clock cycle. The cost is in the physical time it takes for the signals to run through all the gates you line up in between registers. So you'll always be able to build a CPU that executes instructions in a single cycle -- it just won't run very fast in terms of clock frequency (physical speed), but you don't see that when you just look at the cycle count (architectural speed).

Architecturally speaking, considering the number of cycles your CPU takes to run a program, adding pipeline registers *always* makes your CPU slower. The fastest you can go is without a pipeline register, such that an instruction runs in 1 cycle. Adding a pipeline register makes it run in 2 cycles. You can often hide that by having two instructions overlap in the pipeline, but you're likely to have parts of your program where an instruction depends on the result of the instruction just before it. This is often unavoidable, since there are times when you don't have any other instructions that you could squeeze in between to spend the wait doing something useful. That waiting introduces bubbles in the pipeline where the CPU doesn't do anything. So in the best case the pipelined CPU takes as many cycles as the unpipelined one, if you can always have instructions overlap, but as soon as there's one data dependency between instructions the pipeline will immediately take more cycles.

The pipelining is often worth the effort from a physical perspective, since the pipeline registers cut the long chains of logic gates, which allows you to clock the CPU faster. But this is only beneficial up to a point; pretty quickly it becomes almost impossible to keep the pipeline filled if it has *a lot* of pipeline stages, so the physical speedup starts to lose against the architectural slowdown.
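The cycle-count argument above can be made concrete with a toy model: an unpipelined CPU takes one cycle per instruction, while a pipeline pays a fill cost plus one bubble per back-to-back dependency (assuming no forwarding). The functions and numbers here are purely illustrative, not from the video:

```python
# Toy cycle-count model of the architectural-speed argument:
# pipelining never reduces cycle count, and dependencies add bubbles.

def cycles_unpipelined(n_instructions):
    # Everything happens in one (long) cycle per instruction.
    return n_instructions

def cycles_pipelined(deps, stages=2):
    # deps[i] is True if instruction i reads the result of instruction i-1.
    # Fill/drain cost is (stages - 1); assuming no forwarding, every
    # back-to-back dependency stalls the pipeline for one bubble.
    n = len(deps)
    return n + (stages - 1) + sum(1 for d in deps if d)

program = [False, True, False, True]   # two back-to-back dependencies
print(cycles_unpipelined(4))           # 4
print(cycles_pipelined(program))       # 4 + 1 + 2 = 7 cycles
```

Of course the pipelined cycles are much shorter in wall-clock time, which is the whole point; the model only shows why the *count* can't go down.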
Another interesting perspective is that, physically speaking, adding pipeline registers makes instructions take more physical time to execute (besides the number of clock cycles). The reason being that your ADD or MUL instruction has a fixed set of logic gates it has to pass through to do its computation. Say that takes 100ns. Adding pipeline registers doesn't change the fact that the instruction has to flow through those gates, but now it also needs to pass through the pipeline registers, which have a propagation delay like any other gate as well. If your registers have a propagation delay of 5ns, an ADD on a four-stage pipeline will take 120ns, whereas the same ADD on a single-stage pipeline only takes 100ns. Again, you can often mask that slowdown by allowing multiple instructions to overlap in the pipeline, which is beneficial in general.

A fun thing I realized while working on processors: in modern silicon manufacturing processes (around 28nm and smaller), you can build a fully-fledged single-stage processor that runs at 1 GHz. And since we can't really make our CPUs faster than 2-4 GHz for various physical reasons, you'll only ever need 2-4 pipeline stages; everything beyond that starts to push into diminishing-returns territory.
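The 100ns/120ns arithmetic above is worth writing out explicitly. This is just a back-of-the-envelope helper with the example's numbers (the single-stage baseline folds its one output register into the 100ns figure, matching the example):

```python
# Latency arithmetic for the ADD example: fixed logic time, plus one
# register propagation delay per pipeline register the signal crosses.

LOGIC_NS = 100.0  # time through the ADD's logic gates (fixed by the computation)
REG_NS = 5.0      # propagation delay of one pipeline register

def added_latency_ns(pipeline_registers):
    # Each pipeline register inserted into the path adds its own delay.
    return pipeline_registers * REG_NS

print(LOGIC_NS)                         # 100.0 ns: single-stage baseline
print(LOGIC_NS + added_latency_ns(4))   # 120.0 ns: four-stage pipeline
```

Throughput goes up anyway, because the clock period only needs to cover one stage's slice of the logic, not the whole chain, which is why the per-instruction latency hit is usually acceptable.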
@@fabianschuiki In most RISC CPUs, despite the pipeline, the next instruction can use the results of the previous instruction. You mean branches and multi-cycle instructions like LOAD and co-processor ops. In the case of vectors (MMX, SSE2, 3DNow!, N64 RSP), branches make no sense.
Yes I agree, although that only holds for simpler RISC CPUs with a fairly shallow pipeline. More specifically, you would need execution stages like the ALU, FPU, etc. to take around a single cycle. In that case the instruction fetching and decoding can happen in different pipeline steps, and they would configure a forwarding/bypassing network to send an instruction's result directly to the instruction immediately after it, without going through the register file. However, this only saves you the latency of writing to the registers. In general, with more complex multi-cycle operations, the forwarding doesn't buy you too much. It's just a hack to save you one cycle of latency on common instructions. That's actually a pretty neat example of what pipelines do: they allow the core to run faster, but architecturally they only introduce delays. In some cases (forwarding in simple RISC cores) you can work around those delays.
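The forwarding/bypassing network described above boils down to a mux in front of each operand: if the previous instruction (still in flight) writes the register we're about to read, take its result off the bypass wire instead of the stale register-file copy. A rough Python sketch, with made-up names:

```python
# Illustrative model of one operand's bypass mux: the in-flight EX-stage
# result is offered directly to the next instruction, skipping the
# register-file write latency.

def read_operand(reg, regfile, ex_dest, ex_result):
    # ex_dest/ex_result model the previous instruction still in the
    # pipeline: its destination register and not-yet-written-back value.
    if ex_dest == reg:
        return ex_result      # forwarded off the bypass network
    return regfile[reg]       # normal register-file read

regfile = {1: 10, 2: 20}
# Previous instruction computed r2 = 99 but hasn't written it back yet.
print(read_operand(2, regfile, ex_dest=2, ex_result=99))  # 99 (forwarded)
print(read_operand(1, regfile, ex_dest=2, ex_result=99))  # 10 (from regfile)
```

In hardware this comparison and mux sit on every read port, one pair per pipeline stage that can hold an unretired result, which is why deep pipelines make the bypass network grow quickly.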
@@fabianschuiki Reduced means simple. The FPU was a separate chip on the 8086. Intel engineers then sold a coprocessor for RISC CPUs under the name Weitek. Really expensive SPARC CPUs did not have MUL. Tell me how a pipeline for RISC or 6502 ALU operations makes sense? Instruction Pointer++ is single-cycle on the 6502, as is the VIC-II sprite x position. Subtract and add (for the division and MUL circuitry) are half a cycle on the Jaguar.
For some reason I'd gotten it into my head that to get the next register we'd need an adder and another decoder, but just tying the read-enable lines to the previous register is such a simple solution. Woods and trees. And where did you find the SST memory chips in the DIP package? Everywhere I've looked doesn't have them in stock, so I have to choose between either PLCC or TSOP (although with the latter I could fit two onto a 40-pin 0.6" DIP-outline carrier board and wire them up to provide a 16-bit word...)
I have also seen them pop up in the usual suppliers' stocks occasionally. AliExpress is also a fairly reliable source for these parts. The parts you get may be salvage or production rejects, but for these breadboard builds that's usually good enough; you'll get 10-20 chips for little money, and if some don't work you'll still come out ahead. I recently bought a stack of 10 SST memory chips there and they all ended up working as expected -- at least for reading/writing with minipro.
@Jamie Foster I still think that we'll eventually want that adder and another decoder: as soon as the CPU needs to do any form of register renaming, or once we switch to content-addressed registers, we'll want three fully independent ports on the register file, and only use `ra3 = ra2 + 1` in the instruction decoder as a way to save encoding space -- as far as the rest of the CPU is concerned, there's nothing special about the third port.
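The decode trick above is tiny when written out: the instruction word only encodes two register fields, and the decoder derives the third port address itself, so the register file keeps three ordinary, independent ports. A hypothetical sketch (field names and register count are assumptions, not the actual ISA):

```python
# Illustrative decoder fragment: derive the implicit third register-port
# address as ra2 + 1, saving the bits a third register field would cost.

def decode_ports(ra1, ra2, num_regs=16):
    ra3 = (ra2 + 1) % num_regs  # implicit third operand, wraps at the top
    return ra1, ra2, ra3

print(decode_ports(5, 7))  # (5, 7, 8) -- port 3 reads the register after ra2
```

The register file never knows the third address was derived; only the encoding does, which is exactly what keeps the door open for renaming later.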