I felt bad five minutes in that I hadn't yet liked, subscribed, rung that bell, and queued up the rest of the playlist, so I got on that. OH WAIT, THERE'S 34 VIDEOS, WHOA
@@fabianschuiki I'm so excited, it's going great so far. I always viewed superscalar execution as the edge of understandability for people who don't work at Intel or something, and out-of-order execution as just beyond that: you can get the principle, but as a normal person I'd never be able to implement it. Hopefully seeing you do so won't make me go insane and either throw all my money at random ICs, breadboards, and an overpriced oscilloscope, or (more likely and less destructively) just steal my time and make me write an emulator. But hopefully it will just help me understand.
@Rowlesisgay 😃 I hope it will all work out. It should in theory, but you never know if some breadboard or PCB will just randomly decide to catch fire. But it should be possible to get some decent OoO and superscalar execution going even with a very simple 8-bit design, focusing more on how it works rather than the complexity of doing it for 64 bits at 5 GHz.
@@fabianschuiki I always have trouble believing my CPU can run at multiple GHz. That's Wi-Fi frequency (mine boosts past 4); it's so high that microwaves should be shooting out of the PCB. How can normal PCB material carry the signals, and how can logic circuits make anything digital out of the analog wobbling that everything really is? I have, however, learned that when it comes to computer science, if there's an idea where it makes no sense how it could be more practical than sticking with what you've got, even though it's better under intensive load, the industry obviously switched to it in the 90s, or the 00s for outliers like multi-core CPUs.
Holy shit, I've been looking for a video like this forever. I have the concepts for a basic CPU down, but finding information on modern superscalar concepts is a mess of navigating Wikipedia pages and random sources. Having such an exemplary series is worthy of a university course! Thanks for making this. Rang da bell, commented, liked, hoping this finds more people looking for this.
I am soooo excited about this journey. I can hardly wait to see the next videos: #25, #26, ... #70 ☺ 👍👍👍👍
Amazing. I just got recommended one of your videos. As others have commented, at the moment, your views are criminally low, especially considering their high quality. Hopefully the recommendation I just got is a sign that’s about to change. Thank you for creating this!
Thank you for the kind words 🙂! It's an interesting journey. But I also appreciate the videos as just documentation of the build itself. So it's worthwhile either way 😉
Just found this series after watching a few others. Really looking forward to it. I'm in the same process, building my own processor to power an over-the-top chicken coop. I was initially thinking about using the ESP32 platform, but this seems a lot more fun and challenging, and I have an entire wall to display all the boards that will be running it. Thanks for documenting this.
Watched bits and pieces so far and this is just so great. Ben Eater eat your heart out! Seriously I loved Ben’s work and learned so much and this is the next level.
Can’t believe I haven’t found your videos before! Very very cool. 5:29 surprised not to see basic pipelining (one instruction issue per clock, multiple clocks to complete an instruction) here, since it’s the logical next step after concluding your clock rate won’t scale with every instruction needing to execute in a single clock cycle. Anyway, great stuff!
The content is very good and the videos are neat and polished! I am amazed at both your work and the extreme lack of views for such high-quality content. Expressing my gratitude for your work! This is gonna blow up some day! Best of luck, mate!
Subbed, as I kinda wanted to tackle a VLIW processor based on RISC-V, and I wanted to know exactly how the out-of-order execution scheduler is put together. That's the most important black box to me, so I can figure out where and how to push it; the VLIW fetch and compaction stage would be separately responsible for superscalar issue internally, of course. Thanks for putting together the playlist for this processor, as it's very useful and kind of hard-to-find information.
@@fabianschuiki The extensions are built on floating point instructions, specifically used to enable HPC Monte Carlo Simulations, with these instructions I was able to take advantage of pseudo-dual issue 🙂
Your projects are very interesting and educational. Congratulations. Do you know the 4-bit hex display? It could be very interesting for your projects. IBM used it in its 90s mainframes. For example, the old HP 5082-7340 or the current HDSP-0772 (very expensive).
In terms of YouTube videos and tutorials, I'd recommend:
- Ben Eater's 8-bit breadboard CPU build (minimalist, simple-as-possible): ru-vid.com/group/PLowKtXNTBypGqImE405J2565dvjafglHU
- James Sharman's 8-bit breadboard CPU build (pipelined, a bit more involved): ru-vid.com/group/PLFhc0MFC8MiCDOh3cGFji3qQfXziB9yOw
If you're looking for the full CPU design treatment:
- Computer Architecture: A Quantitative Approach, by Hennessy & Patterson: www.google.com/books/edition/Computer_Architecture/MBQFuAEACAAJ?hl=en
For computer architecture in general, the RISC-V instruction set is a very good starting point to get your hands dirty. It's very clean and elegantly designed, and it lacks some of the baggage that has accumulated in other ISAs over the years.
I don’t get why “out of order” involves a “look ahead”. I thought we just have our parallel decode and fetch circuit, which is simply a vector unit because instructions have a fixed length, so this won’t stall. But execution may need to wait on the scoreboard. So out of order simply means that we don’t stall for these: we just mark their result register on the scoreboard as well and go on. The trick is to fetch 4 instructions at once and do the scoreboard, but only have 2 ALUs, so fetch runs ahead automatically. There is no extra look-ahead circuitry. And why do we care about register renaming? 32 names are enough. Let’s have a separate stack pointer and instruction pointer. The only use I see is that we could eliminate the write-back register for reg-reg.
Yes you're totally right. The look-ahead isn't something you'd add explicitly, but it's a side effect of being able to decode and issue instructions even if their inputs aren't available yet. As you say, if you fetch faster than you can execute, or if registers have to wait for results, this effectively looks like the processor is looking ahead of currently stalled instructions to find work it can already do. No additional circuitry needed indeed! 👍
I’d like to add that vector fetch and decode will need to be followed by a scoreboard stage, similar to sprite priority on the C64. Instructions (via their sources) on the “right-hand” side of the vector will be blocked by target register names on the “left”.
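The issue logic discussed here can be sketched in a few lines of Python. This is just my own toy illustration of the idea (fetch a group of 4, scoreboard the destination registers, issue at most 2 per cycle); the names and data layout are invented, not from the video:

```python
# Toy scoreboard sketch: issue up to 2 instructions per cycle out of a
# fetch group of 4, blocking any instruction whose source register is
# the target of an earlier in-flight or stalled instruction (RAW hazard).
from dataclasses import dataclass

@dataclass
class Instr:
    dst: int         # destination register index
    srcs: tuple      # source register indices

def issue_group(group, busy, max_alus=2):
    """Return the instructions from `group` that may issue this cycle.

    `busy` is the scoreboard: register indices whose values are still
    being produced by in-flight instructions. The group is scanned left
    to right; a "right-hand" instruction is blocked if one of its
    sources is the target of a "left-hand" one, as in the sprite
    analogy above.
    """
    issued = []
    pending = set(busy)
    for instr in group:
        blocked = any(s in pending for s in instr.srcs)
        if not blocked and len(issued) < max_alus:
            issued.append(instr)
        # Issued or stalled, this destination now blocks later readers.
        pending.add(instr.dst)
    return issued

# r2 = r0+r1; r3 = r2+r0 (blocked on r2); r5 = r4+r4 (independent)
group = [Instr(2, (0, 1)), Instr(3, (2, 0)), Instr(5, (4, 4))]
ready = issue_group(group, busy=set())   # issues the 1st and 3rd
```

Note how the second instruction stalls while the third one still issues, which is exactly the "looking ahead of stalled instructions" effect described above without any dedicated look-ahead circuitry.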
The Hennessy & Patterson book on computer architecture should cover that in a fair amount of detail, if I recall correctly:
- Computer Architecture: A Quantitative Approach, by Hennessy & Patterson: www.google.com/books/edition/Computer_Architecture/MBQFuAEACAAJ?hl=en
Also, in general you should be able to dig up a lot of interesting resources if you Google "Tomasulo's algorithm". That was one of the initial implementations of out-of-order and superscalar execution. Most modern architectures use modified flavors of this original idea.
I tried to look up reservation stations on Wikipedia. I don’t get it. I expect a matrix where every producing unit (ALU, barrel shifter, DIV, Load) outputs its result on its row together with the name of the destination register, while on the columns every consuming unit (Store, CMP, Test) waits for input. The cells compare register names.
They come in different forms. The earlier ones used to store the operation to be executed, with additional fields for the operands. If a result wasn't available when the instruction was put in the reservation station, that field would be set to an ID that identifies the result that still has to be produced by another functional unit. When a result has been computed, it would be sent over a bus alongside its ID. The result gets stored in the register file, but all reservation stations also observe the ID and check if it matches one of the entries (like your cells). If it does, that entry would be replaced by the result. Once all entries have all their operands present, the instruction would be executed. Some newer variants of the technique put all of the information into the reorder buffer, in a more centralized manner.
@@fabianschuiki A central buffer means access conflicts when you go superscalar, plus latency due to the indirection between the execution logic and the buffer (the opcode needs another decode). That only makes sense for a scalar RISC CPU which has to orchestrate external hardware like vector processors (GPU, Cray), large memory banks (servers, mainframes), or slow EEPROM (embedded, consoles with cartridges).
I don't have the stuff to do this in real life, but I was wondering, if I would do it digitally, should I use the game "Turing Complete", or the logic sim "Digital" to do this?
Digital should be a good platform to experiment with stuff like this! I'm pretty sure there are also some videos on YouTube that discuss the topic specifically with Digital as the underlying sim 🙂
There are a few very nice series here on YouTube where people build custom CPUs. (Check out Ben Eater's 8-bit CPU series and James Sharman's 8-bit pipelined CPU series, for example.) Ben Eater also has component lists and kits that let you build his 8-bit CPU and follow along with his 6502 build.