I'm looking forward to seeing how this gets built out. I think you are doing a great job at explaining the current architecture, so I'm really looking forward to getting to the out-of-order execution parts, even if it's going to be a couple of years away. Going from Ben Eater, to James Sharman, to this channel has been really eye-opening at how simple logic can be used to build very complex machinery.
I really love seeing all the breadboard and related projects on YouTube over the last few years. Videos like this inspired me to finally start a project that I had in my mind for ages: a Z80-based computer on an ATX mainboard, with ports for modern peripherals, but compatible with old software, CP/M and MSX in particular. And of course, take full advantage of the cheap prices for SRAM these days and have a ton of memory, for an 8-bit system.
I totally agree! They have been good years for breadboard homebrew computing. Really exciting to see all the projects people come up with. Z80 on an ATX mainboard is a fantastic crossover! 🥳
I'm super curious how you plan to go superscalar OoO with the currently designed parts - especially the register file. Even the most basic meaningful dual-issue would require a 2W4R setup. Double pumping? Doubling them and using LVTs? Something more exotic like an XOR-based multi-porting scheme? Of course, I'm getting way ahead of myself. Keep up the awesome work, I love your videos! The editing/graphics/production are fantastic.
If I understood things right, I think we'll end up seeing this become a scalar processor first, and then slowly go back to the breadboards to add functionality.
Yes exactly, my plan is to build this out to a scalar processor first, while also building in the provisions to go superscalar afterwards. For example, once we have an ALU and operations that take multiple cycles, things like out-of-order writeback and scoreboarding/hazards already need to be addressed. My hope is that we can carefully make a simple initial CPU gradually more complex and powerful by loosening constraints, allowing more and more out-of-order operation, etc. I expect that the registers will see one or two overhauls: first for a content-addressed register scheme to support reservation stations, and second to double the access bandwidth to 2W4R for true dual issue.
@@denilsonbitme1715 Let's say you want to fetch and execute 2 instructions per clock, and in the worst case they both read 2 registers and write back their results: the register file needs to read 4 different values at once and write back two of them - the shorthand being 2W4R (right now the implementation is 1W3R). It's a really fascinating architectural problem with a huge variety of possible solutions, such as the Live Value Tables I mentioned. If you google "Hennessy and Patterson", one of the first links will be a PDF of their computer architecture book, which is pretty much the gold standard, and Onur Mutlu has several full computer architecture course playlists on YouTube.
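The Live Value Table idea mentioned above can be sketched in a few lines: duplicate the register file once per write port, and keep a small table that remembers which copy holds the newest value of each register. This is just an illustrative Python model (class and method names are made up, not from the video):

```python
# Hypothetical sketch of a 2W4R register file built from two 1W-per-port
# banks plus a Live Value Table (LVT). Each write port owns one bank;
# reads consult the LVT to find the bank with the freshest value.

class LVTRegisterFile:
    def __init__(self, num_regs=16):
        self.banks = [[0] * num_regs, [0] * num_regs]  # one bank per write port
        self.lvt = [0] * num_regs  # which bank last wrote each register

    def write(self, port, reg, value):
        # A write only touches its own bank; the LVT records who wrote last.
        self.banks[port][reg] = value
        self.lvt[reg] = port

    def read(self, reg):
        # Any number of read ports can do this lookup independently.
        return self.banks[self.lvt[reg]][reg]

rf = LVTRegisterFile()
rf.write(0, 3, 42)  # write port 0 updates r3
rf.write(1, 3, 99)  # write port 1 updates r3 later
print(rf.read(3))   # 99 -- the LVT points at bank 1
```

The neat part is that read bandwidth comes almost for free (each bank already has all the read ports in hardware); the cost is the duplicated storage and the LVT itself, which needs as many write ports as the whole file.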
For those who want a simpler option there is also "customasm", in which you just have a few files that define your processor's ISA, and it will take assembly code and spit out a binary file for you.
I was thinking that it's a very simple CPU... too simple... Then it hit me! Everything is done in 1 cycle!! Bl**dy h*ell! That's efficient!!! Sounds like your RISC background is in this project. Love it. :):):)
Yeah, pipelining is an interesting technique, and it's somewhat counterintuitive: you don't need pipeline stages to build a CPU. You can build arbitrarily complex machines that do all that they have to do in a single clock cycle. The cost is in the physical time it takes for the signals to run through all the gates you line up in between registers. So you'll always be able to build a CPU that executes instructions in a single cycle -- it just won't run very fast in terms of clock frequency (physical speed), but you don't see that when you just look at the cycle count (architectural speed).

Architecturally speaking, considering the number of cycles your CPU takes to run a program, adding pipeline registers *always* makes your CPU slower. The fastest you can go is without a pipeline register, such that an instruction runs in 1 cycle. Adding a pipeline register makes it run in 2 cycles. You can often hide that by having two instructions overlap in the pipeline, but you're likely to have parts of your program where an instruction depends on the result of the instruction just before it. This is often unavoidable, since there are times when you don't have any other instructions that you could squeeze in between to spend the wait doing something useful. That waiting introduces bubbles in the pipeline where the CPU doesn't do anything. So in the best case the pipelined CPU takes as many cycles as the unpipelined one, if you can always have instructions overlap, but as soon as there's one data dependency between instructions the pipeline will immediately take more cycles.

The pipelining is often worth the effort from a physical perspective, since the pipeline registers cut the long chains of logic gates, which allows you to clock the CPU faster. But this is only beneficial up to a point; pretty quickly it becomes almost impossible to keep the pipeline filled if it has *a lot* of pipeline stages, so the physical speedup starts to lose against the architectural slowdown.
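The cycle-count argument above can be made concrete with a toy model: an unpipelined CPU takes one cycle per instruction, while a pipeline pays a fill cost plus one bubble per back-to-back dependency (assuming no forwarding). The functions and numbers here are purely illustrative, not from the video:

```python
# Toy cycle-count model of the architectural-speed argument:
# pipelining never reduces cycle count, and dependencies add bubbles.

def cycles_unpipelined(n_instructions):
    # Everything happens in one (long) cycle per instruction.
    return n_instructions

def cycles_pipelined(deps, stages=2):
    # deps[i] is True if instruction i reads the result of instruction i-1.
    # Fill/drain cost is (stages - 1); assuming no forwarding, every
    # back-to-back dependency stalls the pipeline for one bubble.
    n = len(deps)
    return n + (stages - 1) + sum(1 for d in deps if d)

program = [False, True, False, True]   # two back-to-back dependencies
print(cycles_unpipelined(4))           # 4
print(cycles_pipelined(program))       # 4 + 1 + 2 = 7 cycles
```

Of course the pipelined cycles are much shorter in wall-clock time, which is the whole point; the model only shows why the *count* can't go down.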
Another interesting perspective is that, physically speaking, adding pipeline registers makes instructions take more physical time to execute (besides the number of clock cycles). The reason being that your ADD or MUL instruction has a fixed set of logic gates it has to pass through to do its computation. Say that takes 100ns. Adding pipeline registers doesn't change the fact that the instruction has to flow through those gates, but now it also needs to pass through the pipeline registers, which have a propagation delay like any other gate as well. If your registers have a propagation delay of 5ns, an ADD on a four-stage pipeline will take 120ns, whereas the same ADD on a single-stage pipeline only takes 100ns. Again, you can often mask that slowdown by allowing multiple instructions to overlap in the pipeline, which is beneficial in general.

A fun thing I realized while working on processors: in modern silicon manufacturing processes (around 28nm and smaller), you can build a fully-fledged single-stage processor that runs at 1 GHz. And since we can't really make our CPUs faster than 2-4 GHz for various physical reasons, you'll only ever need 2-4 pipeline stages; everything beyond that starts to push into diminishing-returns territory.
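The 100ns/120ns arithmetic above is worth writing out explicitly. This is just a back-of-the-envelope helper with the example's numbers (the single-stage baseline folds its one output register into the 100ns figure, matching the example):

```python
# Latency arithmetic for the ADD example: fixed logic time, plus one
# register propagation delay per pipeline register the signal crosses.

LOGIC_NS = 100.0  # time through the ADD's logic gates (fixed by the computation)
REG_NS = 5.0      # propagation delay of one pipeline register

def added_latency_ns(pipeline_registers):
    # Each pipeline register inserted into the path adds its own delay.
    return pipeline_registers * REG_NS

print(LOGIC_NS)                         # 100.0 ns: single-stage baseline
print(LOGIC_NS + added_latency_ns(4))   # 120.0 ns: four-stage pipeline
```

Throughput goes up anyway, because the clock period only needs to cover one stage's slice of the logic, not the whole chain, which is why the per-instruction latency hit is usually acceptable.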
@@fabianschuiki In most RISC CPUs, despite the pipeline, the next instruction can use the results of the previous instruction. You mean branches and multi-cycle instructions like LOAD and co-processor ops. In the case of vectors (MMX, SSE2, 3DNow!, N64 RSP), branches make no sense.
Yes I agree, although that only holds for simpler RISC CPUs with a fairly shallow pipeline. More specifically, you would need execution stages like the ALU, FPU, etc. to take around a single cycle. In that case the instruction fetching and decoding can happen in different pipeline steps, and they would configure a forwarding/bypassing network to send an instruction's result directly to the instruction immediately after it, without going through the register file. However, this only saves you the latency of writing to the registers. In general, with more complex multi-cycle operations, the forwarding doesn't buy you too much. It's just a hack to save you one cycle of latency on common instructions. That's actually a pretty neat example of what pipelines do: they allow the core to run faster, but architecturally they only introduce delays. In some cases (forwarding in simple RISC cores) you can work around those delays.
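The forwarding/bypassing network described above boils down to a mux in front of each operand: if the previous instruction (still in flight) writes the register we're about to read, take its result off the bypass wire instead of the stale register-file copy. A rough Python sketch, with made-up names:

```python
# Illustrative model of one operand's bypass mux: the in-flight EX-stage
# result is offered directly to the next instruction, skipping the
# register-file write latency.

def read_operand(reg, regfile, ex_dest, ex_result):
    # ex_dest/ex_result model the previous instruction still in the
    # pipeline: its destination register and not-yet-written-back value.
    if ex_dest == reg:
        return ex_result      # forwarded off the bypass network
    return regfile[reg]       # normal register-file read

regfile = {1: 10, 2: 20}
# Previous instruction computed r2 = 99 but hasn't written it back yet.
print(read_operand(2, regfile, ex_dest=2, ex_result=99))  # 99 (forwarded)
print(read_operand(1, regfile, ex_dest=2, ex_result=99))  # 10 (from regfile)
```

In hardware this comparison and mux sit on every read port, one pair per pipeline stage that can hold an unretired result, which is why deep pipelines make the bypass network grow quickly.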
@@fabianschuiki Reduced means simple. The FPU was a separate chip on the 8086. Intel engineers then sold a coprocessor for RISC CPUs under the name Weitek. Really expensive SPARC CPUs did not have MUL. Tell me how a pipeline for RISC or 6502 ALU operations makes sense? Instruction Pointer++ is single-cycle on the 6502, as is the VIC-II sprite x position. Subtract and add (for the division and MUL circuitry) are half a cycle on the Jaguar.
For some reason I'd gotten it into my head that to get the next register we'd need an adder and another decoder, but just tying the read-enable lines to the previous register is such a simple solution. Woods and trees. And where did you find the SST memory chips in the DIP package? Everywhere I've looked doesn't have them in stock, so I have to choose between either PLCC or TSOP (although with the latter I could fit two onto a 40-pin 0.6" DIP-outline carrier board and wire them up to provide a 16-bit word...)
I have also seen them pop up in the usual suppliers' stocks occasionally. AliExpress is also a fairly reliable source for these parts. The parts you get may be salvage or production rejects, but for these breadboard builds that's usually good enough; you'll get 10-20 chips for little money, and if some don't work you'll still come out ahead. I recently bought a stack of 10 SST memory chips there and they all ended up working as expected -- at least for reading/writing with minipro.
@Jamie Foster I still think that we'll eventually want that adder and another decoder: as soon as the CPU needs to do any form of register renaming, or once we switch to content-addressed registers, we'll want three fully independent ports on the register file, and only use `ra3 = ra2 + 1` in the instruction decoder as a way to save encoding space -- as far as the rest of the CPU is concerned, there's nothing special about the third port.
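The decode trick above is tiny when written out: the instruction word only encodes two register fields, and the decoder derives the third port address itself, so the register file keeps three ordinary, independent ports. A hypothetical sketch (field names and register count are assumptions, not the actual ISA):

```python
# Illustrative decoder fragment: derive the implicit third register-port
# address as ra2 + 1, saving the bits a third register field would cost.

def decode_ports(ra1, ra2, num_regs=16):
    ra3 = (ra2 + 1) % num_regs  # implicit third operand, wraps at the top
    return ra1, ra2, ra3

print(decode_ports(5, 7))  # (5, 7, 8) -- port 3 reads the register after ra2
```

The register file never knows the third address was derived; only the encoding does, which is exactly what keeps the door open for renaming later.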