This Critical Path exercise reminds me so much of my FPGA endeavours, where the design simulates perfectly but fails to run the same on an actual FPGA because of incorrect clocking constraints and critical paths. It seems that no matter what design process you follow for digital circuits, or how modern your tools are, you are always bound to the physics of electrical circuits 🙂 Keep up the good work!
James, nice video - brings back memories. I put in a lot of effort searching for critical paths in my Magic-1 HomebrewCPU back in 2004-2006. Like you, I was mostly interested in validating my design. It didn't really matter if Magic-1 peaked at 3 MHz or 4 MHz or 5 MHz, but it did matter to me that I wasn't leaving performance on the table because of a needlessly long critical path. In my case, I found that memory accesses were the first critical path - but realized that with a little redesign I could move some of the address formation into the previous half clock. I had to redo my memory board and some of the microcode, but those changes enabled Magic-1 to go from about 3.5 MHz to the current 4.09 MHz. Once there, though, I identified 3 or 4 new closely occurring paths such that even if I improved the new critical path another would immediately crop up and I wouldn't really be able to speed up the machine much. So, 4.09 MHz was good enough. As far as methodology, I started with my schematics and data sheets. I guessed at what likely critical paths would be and then added up the worst-case gate delays from the data sheets. Once key paths were identified, I first looked for redesign possibilities to simplify them. Next, I shortened them by substituting key 74LS parts with faster 74F parts. At one point, I thought I'd just make everything 74F - but that was a bad idea. The "F" parts are much noisier and there is no point in using them where they are not needed. Also like you, I found that real-world performance of the devices was generally quite a bit better than the worst-case specs from the data sheets. This exercise was also another in which my logic analyzer was super useful. I wrote some special programs to test paths with the timing module of the analyzer, and for one particular path I even whipped up some special test microcode. Anyway, interesting stuff!
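Bill's add-up-the-datasheet-delays method can be sketched in a few lines of Python. The part names, path compositions, and nanosecond figures below are illustrative placeholders, not actual Magic-1 numbers:

```python
# Worst-case critical-path estimate summed from datasheet figures.
# Part names and delays are illustrative placeholders, not real Magic-1 data.
DATASHEET_MAX_NS = {
    "74LS574": 20.0,   # octal D-type: clock-to-Q plus setup (assumed)
    "74LS283": 24.0,   # 4-bit adder carry path (assumed)
    "74LS245": 12.0,   # bus transceiver (assumed)
    "EEPROM":  150.0,  # lookup ROM address-to-output time (assumed)
}

def path_delay(parts):
    """Sum the worst-case delay of every part along one candidate path."""
    return sum(DATASHEET_MAX_NS[p] for p in parts)

candidate_paths = {
    "fetch":     ["74LS574", "EEPROM", "74LS245"],
    "alu-carry": ["74LS574", "74LS283", "74LS245"],
}

critical = max(candidate_paths, key=lambda n: path_delay(candidate_paths[n]))
for name, parts in candidate_paths.items():
    print(f"{name}: {path_delay(parts):.0f} ns")
print(f"critical path: {critical}")
```

The longest sum is the candidate to redesign or to move onto faster parts, exactly as described above.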
Thanks Bill! Your project is, as always, an inspiration! I have a small set of changes I could make that I think would get me about 50% faster, but one of the changes would add an extra cycle of latency to my conditional jumps. I'm satisfying myself by adding all my ideas to a future project plan so I can focus on getting this machine finished.
It's taken over two weeks on and off, but I've made it through all the CPU build videos, plus the sound and VGA ones. I hate to think how many hours that must be, but I've enjoyed every minute. Now to go back to the first one and start again, as I've realised I haven't been giving the early ones a 👍🏻 Thank you James, fantastic content. I'm now tempted to get over my fear and start using SMD components as well.
That's quite alright James, we all have work lives as well. I've never been one of those people who moan at creators about getting videos out, as you just end up watching videos made for the sake of it (fillers), and I would rather wait for the good content to drop.
Over and above the tracing of the timing of the critical path, it was a great reminder of how all the various components of the CPU link together. Very interesting.
That's really interesting. Unfortunately the purple line was nearly impossible to see some of the time; I think it could have done with being logarithmically boosted in brightness (so the dim bits are increased a lot while the bright bits are increased only a little). It was also interesting to see the address lines going up and down on the same spot, something you normally only see in those really confusing timing diagrams.
Yeah, I appreciate that wasn't always clear. I did what I could when editing, but there was only so much I could do. I wish I had thought to swap the yellow / purple traces around, as the yellow was far more visible.
I'm actually having trouble seeing the red lines, the purple ones are quite visible for me. Depending on the video, I just boost the brightness and they are easier to see.
Long-time lurker here. Just came in to say thanks for this interesting series, and kudos on being potentially the only YouTuber with Penrose's Road to Reality in almost every video 🤣
Really interesting, James. Nice to be back on the CPU for a bit :) You'll be at 4 GHz in no time! :) On my FPGA implementation, the ROMs are the issue there, too. There are several race conditions where a ROM output feeds a ROM input through some logic, which takes time to settle. That causes some issues. I ended up having to use a faster clock and register the ROM outputs with the fast clock to remove the metastability caused by the feedback.
Thanks George, glad you are still enjoying it! Not sure about your FPGA issue, there are no places in my build where a ROM data output directly feeds a ROM address input. It's always got a 574 d-type in the way.
@@weirdboyjim Sorry, yes, I have the same now! I had a bug in an early design that didn't have the latches quite right. It was a task to port the design to Verilog! Working on a V2 now.
@@m1geo Any updates? I'm curious, if you got your hands on the mandelbrot program weirdboyjim wrote to benchmark his system, exactly how much faster is the FPGA version? Also, what clock speeds can you get it to?
Yeah, most of my projects that used glue logic have been simple and didn't require a fast clock, so it's great to see the effects of propagation delay and how you evaluated them. Seriously, thanks for this.
You're very welcome! It's great to have got the clock rate as fast as this. The VGA circuit actually goes faster, but it has the advantage that errors are visual artifacts rather than crashes.
Very interesting. Another side to CPU and related circuitry I have never thought about. I have seen some 8 bit systems clocked at 3.7 MHz and wondered why. This may be part of that reason.
Yeah lots of the 8-bit systems fell into that general clock rate range and part of that was the general speed of the logic parts available at the time!
Hmm, super cool! I can think of a couple of paths I might go down were I in your shoes. 1) Elimination of the zero flag completely - not the best option as you lose the jump-if-equal instruction, but certainly possible. 2) Decoupling the program counter assertion from the fetch, either by having it pre-saturate the address bus at the tail end of the previous instruction or by splitting it into two clock cycles. 3) Adding a pre-fetch stage to the pipeline that uses a hole in the previous instruction to preemptively fetch the next instruction and hold it in a register, only re-fetching it in the event of a jump. It seems like deliberately delaying the bus control to account for timings is going to be a trip-up at higher speeds… I'd have to give it a lot more thought, but my instincts tell me that giving more cycles to some operations might bear fruit… instructions per cycle might not be the best thing to optimize? I'm not sure. Regardless, I love this thing and I only wish I had the time/resources/creativity to do similar projects. Awesome stuff
You are thinking about the right kind of stuff. I'd latch the flags at the end of the cycle so the decode sees them a cycle behind (but they get eliminated from the critical path).
I'm not saying I haven't learned anything from this channel but most things you've talked about and done, I've at least had an idea beforehand... but I've never even considered what the phrase "critical path analysis" might mean before watching this (something "real" EEs do, that I don't need to bother my pretty little head with?)... thanks for a superb introduction.
That's good to hear. Worrying about what is on/off the critical path is something we do a lot when optimizing game engine code. It's very easy for a less experienced programmer to put a lot of work into optimizing a piece of code and then discover it makes no difference to the frame rate!
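The frame-rate point can be shown with a toy model; the task names and millisecond figures here are invented for illustration:

```python
# Toy model of the frame-rate point: two tasks overlap, so only the longer
# (critical-path) task sets the frame time. All numbers are invented.
def frame_time(render_ms, audio_ms):
    # Tasks run concurrently; the frame finishes when the slowest one does.
    return max(render_ms, audio_ms)

before = frame_time(render_ms=16.0, audio_ms=4.0)
after = frame_time(render_ms=16.0, audio_ms=2.0)  # audio made 2x faster...
print(before, after)  # ...yet the frame time is unchanged, because audio
                      # was never on the critical path
```

Halving the off-path task changes nothing; only shortening the longest path moves the number that matters.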
I'm taking this advice for my software microservices architecture, thank you for sharing, I love to watch your videos on the second screen while coding :D
From my understanding (possibly oversimplified) of your CPU design, one way to dramatically improve the critical path would be to generate the pipelined control signals all at once, concurrently with the following instruction fetch, and then just latch them to the cycle they are meant to be used, instead of generating those signals out of the ROMS within their own cycle. I would expect an improvement of maybe one third of the critical path. But of course I may be misunderstanding or overlooking things (?)
No, but you are thinking about the right type of thing. The biggest change I'll make in a future build is designing the pipeline such that it gets the entire cycle for a decode. Having layers of logic form part of the pipeline ROM address input makes for a simpler circuit but adds up badly for high speed. Separating the ROMs into the stages didn't add to the problem (although I did end up with things in stage 2 that would have gone in stage 1 if it wasn't necessary to balance the line counts).
Might have said it before, but in the old days, working with the 8086, the EPROMs and the RAM were too slow and gave us reliability issues. The workaround was to break the program into odd and even bytes, program two EPROMs, and have two RAM chips, so they could then run at ½ the clock speed. Also, in your case the clock does seem to have quite a long transmission line of undefined impedance, so it may have a few reflections bouncing around at points. This can be where you put the scope probe on to see the issue, the impedance changes, and the fault hides until the probe is removed.
I've been doing some planning for future builds surrounding the ROMs; having the flag chain as an address input is a nasty limitation. Modifying the design so all the address inputs are resolved at the start of the clock, giving the outputs an entire cycle to settle, is the first big step.
@@weirdboyjim I do see as well that you might get lots of other issues while going outside the datasheet spec limits on timings etc., having always had to work to about 80% of the spec limits myself, as even going close to them is seen as a bad design idea. I think maybe you should move away from EPROMs in the logic chain and have a look at PALs - these are of the same epoch.
Jolly interesting analysis. You do indeed need to be very careful giving too much (or any?) weight to observed timings. It's the datasheets that matter. It would be interesting to see manufacturers' data on the shape of the distribution curves; they might be wider than we think. Also, temperature plays a hand.
It's been suggested that the EEPROMs may be tested and sold based on performance buckets. So the ultra-expensive 70ns parts are the ones that passed the test, and the 150ns parts get a bigger window to keep the yield rate high.
@@weirdboyjim Yup. "Binning" I think it's called. It would be interesting to buy job lots of 150ns parts, from the same batch and wildly different batches, and see how close to 70ns they get. None of that helps with mass production though when, unless you want to play games like Uncle Clive did, you have to stick to the sheets out of fear that the manufacturer will change some manufacturing parameter, knowing that it will still meet his published spec, as he's entitled to do.
It is interesting to see that many of the timing issues are down to using discrete logic (entirely expected) - things like the 8-input NOR gate not being available aren't really an issue when you bake your own silicon. But I wondered if some modules push the actual output signals through LEDs? I have a vague recollection that this happened early in the build. Sure, the propagation delay will be dwarfed by chip settling time, but there are an awful lot of blinky lights. It is always a good idea to check your assumptions, especially when writing code; today we use so much code written by others, and we just assume it is correct and performant.
Interesting, would you mind expanding a bit for a "happy amateur" on the negative effect driving LEDs has on the quality/delay of the signals? Is it negligible or major at these speeds?
@@kvadratbitter Since everything in electronics has a propagation delay, and an LED has an associated resistor in series, we do get "some" delay. It is probably in the sub-nanosecond range, so it is probably negligible on its own, but stack up a few and it starts to matter. And remember we are looking at 250 ns slices, so negligible is relative. Granted, passive components don't have much of a delay (most datasheets don't list it), so it is more about the length of the data path, but again that is probably too small to matter. Stack up 100 LEDs in series and I bet you would start to see some propagation delays - again, probably not the issue here, and I think James quickly changed to tapping off the signal and using an LED driver to drive the LEDs, keeping the LEDs out of the signal path. But some of the boards in the build are rev 1.0 and fairly old, so the idea popped into my head. As for signal quality, I have no idea how noisy LEDs are, but there seems to be some interference for radio operators. Again, a single one is probably fine, but 100 will probably really mess with the S/N ratio. If I had a scope, I would check. You need a square wave, as many LEDs as you can jam into a breadboard, and a scope to check the signal.
@@weirdboyjim Thinking back, I seem to remember an early video where the LEDs were causing issues, and that was when you started using driver chips, but I could be entirely wrong. Since you built it, your knowledge of it supersedes anyone else's. But as kvadratbitter pointed out, the delay would be very small indeed if the LEDs were in the data path, but they aren't, so I was wrong. But at least we fed the algorithm some interaction.
A very interesting video! I think 4MHz is already quite an achievement in and of itself, but a bit of optimization never hurts. Two questions popped up in my head while watching, one of which you already answered I think, namely the one about the speed of the ROM and RAM chips. The other one intrigues me. Watching all this and trying to follow your train of thought made me wonder if there is actually something in the microcode itself that could be optimized. Since pipeline stage 1 is smack dab in the middle of the critical path, is there logic in the microcode that could be sped up, or exchanged with logic in pipeline stage 2 to take things out of the critical path that do not need to be there?
Thanks! Glad you are finding it interesting! The logic in the pipeline stages is handled by pretty much using the ROMs as a big lookup table to replace what would be a complex set of logic. I don't think there is anything you could change there unless you had some extremely specialized knowledge about the internal layout of the ROM chips and which address changes would be handled quicker. I would very much be looking at removing any inputs that aren't resolved at the start of the cycle.
Lots of interesting points in there, especially surprising to see the Zero line delayed so much. I guess a 74HC4078 or similar is going to solve that. I do wonder if the physical and electrical size of the main bus is causing slow slew rates because of fan-out (drive current) limitations and line capacitance / inductance + load resistance. Rather than use a fixed time delay, you might find it more reliable to derive the main clock from a higher frequency, and use a divider to generate a second phase delayed by 90-degrees, like the 6502 uses.
The zero calculation is only about 20ns; sure, you can make it faster, but you will never make it zero. It would be better to move the dependencies around so it was no longer a constraint.
One thing I found when I went looking for multiple-input NOR gates was that they seemed to be difficult to find. Worse, they seemed painfully slow. My plan for my design is to use some diodes and a single inverter. For the pipeline ROMs (and maybe this is considered cheating by your design goals), I considered using some PLAs. The other idea I had seen was to use some RAM chips that are preloaded with the contents of your lookup ROMs during the reset phase of your CPU. I look forward to seeing how you handle these challenges.
Using RAM for the lookup would be very interesting, especially if it could be self-modified. Could make a crazy encryption algorithm that modifies what each instruction _does_ in addition to modifying its own code... it goes past "self-modifying code" and becomes "self-modifying code on a self-modifying architecture". Edit: it could also be used to implement new instructions for super specific purposes. If you're trying to run a game or something and you need fast transfers between a specific set of registers, for instance, you could make your program overwrite the instruction table and replace an instruction you don't need with the very specific one you do
Lots of people have been talking about replacing the ROM chips with RAM chips, but it would take a bunch of extra components to do, and it's worth asking yourself if you can make an improvement elsewhere with fewer parts. The point of analyzing the critical path is to make it shorter, which can be much more than just asking "Can I make this one thing faster?".
@@weirdboyjim Makes sense and I understand that that would increase the project scope. However, I wonder how hard it would be to create a daughter board that would plug into the 4 ROM slots, have RAM chips on it and somewhere to put the original ROMs and connect to the main data/address bus. It would probably be similar to the temporary ROM-in-RAM board you built for program code while testing it, except for 4 chips at once. Maybe if/when you're running out of ideas for how to keep the series going that could be another option ;) I think there's actually a chance that it could be useful to have a self-modifying instruction set, not to mention the fact that this would be (AFAIK) the only architecture with the ability for a program to not only modify itself at runtime, but modify the actual CPU architecture.
@@aaronjamt Could be, but as James points out, there's a trade off with the complexity of the circuit. This is one of the reasons why I was considering PLAs for my design. Which I think most purists would shun; but they are fast and almost identical to how ROMs behave.
The Zero flag delay is always a surprise when you design a CPU like this. If you are aware of this problem from the very start of your design, you can optimise the instruction set in a way that moves the condition to be tested into the compare instruction instead of the conditional jump instruction. This ends up with one single flag that can be computed early in the NEXT (pipelined) instruction cycle, thus completely avoiding the overhead of the Zero flag, which otherwise must be computed at the end of the current cycle.
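A minimal sketch of the single-flag scheme described above, with an invented three-condition mini-ISA (the instruction names and address values are illustrative, not from any real machine):

```python
# Sketch of the single-flag scheme: the compare instruction resolves the
# whole condition, so a later conditional jump only tests one bit that has
# been stable since the previous cycle. Instruction names are invented.
def compare(a, b, cond):
    """Evaluate the condition at compare time, yielding the single flag."""
    return {"eq": a == b, "lt": a < b, "ge": a >= b}[cond]

def jump_if(flag, target, fallthrough):
    # No flag computation here - nothing added to this cycle's critical path.
    return target if flag else fallthrough

flag = compare(5, 5, "eq")       # condition resolved a cycle early
next_pc = jump_if(flag, target=0x100, fallthrough=0x102)
print(hex(next_pc))
```

The jump itself becomes a trivial mux on one precomputed bit, which is the whole point: the slow condition evaluation is moved off the jump's cycle.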
When I started this build I wasn't thinking about it running at anything like this clock rate. The main difference I would make to this bit of circuit is to put the ALU->Pipeline flags connection through a d-type so they only change at the start of the next cycle.
@@weirdboyjim Not sure if I understand what you mean... but you would still have to compute the Zero Flag after the ALU result in the same cycle, right? (I do not mean to suggest a radical redesign of your processor, just pointing out a solution that worked for me)
@@RelayComputer The idea is to only use it (outside the ALU) a cycle behind so we don't need to speed up the generation but the decode will always have a stable value from the start of a cycle.
As the clock speed rises, we also have to be on the lookout for weird antenna effects in the traces (standing wave ratios, reflections and such). The microwave signal engineers point out that it's less the clock speed itself, but rather the rise/fall time within the signal that becomes the basis for a virtual frequency that needs to be considered as well. You could justify a collaboration with Phil's Lab for a "Design in Review Playlist", either for this build or for the next. If you were to have this or the next build translated to IC (whether that's realistic or not), would you want it to be a system on a chip, or a classic chipset? Just looking into it could be a great excuse for an episode, even if you don't actually do it. Are interrupts actually needed for peripherals, or is it possible to do I/O exclusively through sets of dedicated registers and code? Also, would that prevent anything ubiquitous from being accessed by such a system?
Interrupts are never needed, but polling status slowly adds complexity to code, so the desire for interrupts develops as a way to manage that complexity, rather than from "interrupts are the one true way to handle X", the way people often talk about them these days.
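A toy illustration of how polling spreads through code; the device names and flags are invented:

```python
# Toy illustration of why polling grows hairy: the main loop has to keep
# re-checking every device's status flag by hand. Device names are invented.
def run_loop(devices, steps):
    handled = 0
    for _ in range(steps):
        for dev in devices:           # poll every device, every iteration
            if dev["ready"]:
                dev["ready"] = False  # service the device, clear its flag
                handled += 1
        # ...the program's real work has to be interleaved here...
    return handled

uart = {"ready": True}
timer = {"ready": False}
print(run_loop([uart, timer], steps=3))
```

Every new peripheral adds another check to every loop that must stay responsive, which is exactly the creeping complexity that makes interrupts attractive.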
It's fascinating to see how it could be made to clock higher - but actually doing so wouldn't be as interesting as adding to the capabilities. Personally? I'd be delighted to see some form of demodulation from audio - preferably an old tape deck ;)
Glad you are finding it interesting. I'm very focused on finishing this build, but I'm gathering learnings and ideas for a future build rather than getting distracted. Ahh man, audio tape storage - I could touch on that for curiosity's sake, but it would never be my main storage mechanism.
@@weirdboyjim - I look forward to anything you care to share with us! It's an intellectual treat to follow the thought processes of a genius :P And yes - certainly not preferred storage, but given tape's prevalence historically (I'm thinking more home-micros here in the UK) it does feel rather fitting - not to mention interesting! If a 56k modem is near one end and a ZX Spectrum near the other, what's the simplest possible circuit, and how much further did the average home-micro take it? Happy New Year! :)
Very interesting. I wonder if exchanging the 28xx EEPROMs for 27xx EPROMs would give you enough headroom for the bottleneck to be moved to the zero flag / bus control 🤔. I think the 2732 is available with 50ns access time.
I should have some 2732 EPROMs I can send. We use them in old arcade machines, will have to check after Christmas but if I do I don’t mind sending you some
I spent most of the video wondering if changing the clock could help the problem. After watching your analysis of the fetch cycle my idea seems superfluous, but what if you had a clock of, say, 150ns high / 100ns low, or something like that? Also, do you think a "more square" signal might help?
Yeah, NOR flash really is the modern tech for this kind of thing. You used to be able to get really fast UV EPROMs that would have been a drop-in here, but they are tough to get now.
When we programmers are looking for what limits our project's performance, we are "profiling" our code and its execution to figure out where the slowest parts are that serve as performance "bottlenecks". It looks like you're profiling your CPU's component parts to find its performance bottleneck! :D
I'm just finishing up my 8-bit CPU build and have discovered that the EPROMs are the biggest hindrance to CPU speed. Typical EEPROMs have an address access time (tAA) of about 150ns - this is the time until the data output is valid after the address inputs change. So if that's the case, how are you able to achieve 4 MHz with EEPROMs? Then comes the issue that most 74-series chips have a propagation delay of on average 25ns for LS (some logic chips like the ACT series are as fast as 10ns but have horrendous ringing in these types of builds).
Ok, this is a really tough comment to reply to as I'm not sure how much of my content you have watched. My pipeline ROMs are indeed 150ns EEPROMs. For 4 MHz we have 250ns to play with in a cycle, which gives us 100ns over the raw lookup time. The simplest overview of why it works I can give is that I don't do the lookup and then use the outputs in the same cycle, which would create a long critical path; I do the lookup and then store the data in a 574 for use at the start of the next cycle. This is the essence of pipelining.
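The timing argument can be put numerically. Only the 250ns cycle and the 150ns ROM lookup come from the comment above; the register and downstream-logic delays are illustrative guesses:

```python
# Back-of-envelope version of the pipelining argument. Only the 250ns cycle
# and the 150ns ROM lookup are from the comment; REG_NS and USE_NS are
# illustrative guesses.
CYCLE_NS = 250
ROM_NS = 150    # EEPROM address-to-output time
REG_NS = 20     # assumed '574 clock-to-Q plus setup
USE_NS = 120    # assumed logic that consumes the looked-up control word

# Unpipelined: lookup and use chained inside one cycle.
naive = ROM_NS + USE_NS + REG_NS
# Pipelined: the 574 splits the chain, so a cycle only needs the worse leg.
pipelined = max(ROM_NS + REG_NS, USE_NS + REG_NS)

print(f"naive {naive} ns, pipelined {pipelined} ns, budget {CYCLE_NS} ns")
```

With these assumed numbers the chained path blows the 250ns budget while the pipelined one fits comfortably, which is why the 150ns parts can keep up at 4 MHz.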
Very cool, but it left me thinking just how fast a processor it would be if you built the bulk of it in an FPGA. I realize that isn't the purpose of this build, and I love how you built out circuits using basic logic ICs. I have no experience with FPGAs - they are well above my skill level - but I know that some people have replicated entire CPUs in them. Maybe a future project for you??? Not that it would be as interesting to watch as this build. I tried watching a video that was an introduction to programming FPGAs and I did not get very far before falling asleep.
Absolutely, on an FPGA it would be possible to approximate this circuit at a vastly higher clock rate. This build is very much about research / education / learning. Can't have LEDs on the control lines inside an FPGA!
@@weirdboyjim from experience, my non pipelined CPU that I built in an intro course hit 10 MHz first try with no effort so I'm sure that's one of the lowest bounds on performance haha
Have you thought about using static RAM for the pipeline stage lookups? Yes, you need to add a circuit to copy from EEPROM on reset or power-on, but it might give a huge boost in latency. I assume modern CPUs, with their microcode updates while running, need to do something similar; EEPROM/flash access is too slow. No matter what tech you use, RAM will be faster.
It would indeed be much faster, but the copy logic would add quite a bit to the pipeline stage circuits. You would probably want to merge those stages up so you could share the counters etc. Personally, I'm more interested in maximizing the utilization of the components I have. If you solve all the other issues, fairly cheap flash chips would work at several times the current clock rate, so solving the challenges to make that happen is interesting. Especially since the same changes would let you get still more performance out of faster memory technologies.
@@weirdboyjim It's all about the purpose you choose and the battles you pick. I'm just throwing ideas into the room, keyboard-warrior style. (OK, I've got at least some experience to back it up...) Using your critical path approach tells you your weak points. If the EEPROMs become your limit and this can be solved by faster chips: go for it :) Btw, I'm curious if there are off-the-shelf hybrid RAM/ROM chips that do the copy on power-up or writing. Or if they leave it to the system designer to do this when needed, like in the early BIOS days when they reserved precious RAM for this.
There are hybrids of course, like flash or NVRAM (battery-backed SRAM). Flash seems to be suitable, but I agree with James that the overall critical paths are also important. I liked the idea suggested in these threads to pre-saturate the address bus for instruction/data loads at the end of the former cycle. That could cut a chunk out of the load cycle. The Zero flag could well be done with the diodes and an inverter, yes - might be a lot faster and easier than cascading NORs with an AND.
You could put a shadow RAM system in, but it would be a whole stack of extra circuitry. You would need to be able to cycle all the addresses to write the RAM on power-up, which would either need multiplexers or some clever overriding of the input circuitry upstream.
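The reset-time copy that extra circuitry would have to perform can be sketched as a loop (the ROM size is illustrative; in hardware this would be a counter, muxes, and write strobes rather than software):

```python
# Sketch of the reset-time shadow copy: cycle every ROM address once (as a
# hardware counter plus muxes would) and write each byte into the faster
# RAM. The size is illustrative - e.g. a 32K x 8 part.
ROM_SIZE = 1 << 15

def shadow_copy(rom_read, ram_write):
    """Walk all addresses once, copying ROM contents into shadow RAM."""
    for addr in range(ROM_SIZE):
        ram_write(addr, rom_read(addr))

rom = [addr & 0xFF for addr in range(ROM_SIZE)]  # stand-in ROM contents
ram = [0] * ROM_SIZE
shadow_copy(lambda a: rom[a], lambda a, d: ram.__setitem__(a, d))
print(ram == rom)
```

Even at a slow copy clock this finishes in milliseconds; the cost James points out is the address-cycling and bus-override hardware, not the copy time.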
Have you thought about changing the ROM chips to actual logic? Should be able to get that to work in much less than 150ns. Just wondering, now the CPU is pretty fixed.
I thought about it, and I've thought about all the things I could change that would make it easier. I don't think I will though, as I could easily get into a perpetual cycle of refining this thing. My plans for a 2nd generation, while not ROM-less, do implement a number of the changes designed to make decoding without ROMs easier. The next build will have fewer ROMs in that role.
@@m1geo If I wanted to reduce the ROM complexity I would first look at all the different assert lines; in the data they have explicit values, but most of the time they are actually "don't care" - for example, LHS and RHS select only matter when ALU Op != 0, Xfer assert only matters when Xfer load is not zero, etc. That would let you reduce the data right down, and I expect you could reorganise the instructions such that some of them can be directly derived from the incoming bits.
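The don't-care collapse can be demonstrated with an invented control-word layout (3 low bits of ALU op, 4 bits of LHS/RHS select - not the real field layout):

```python
# Demonstration of the don't-care collapse with an invented control-word
# layout: 3 low bits of ALU op, then 4 bits of LHS/RHS select that are
# irrelevant whenever the ALU op is 0.
def canonical(word):
    alu_op = word & 0x7          # assumed field layout, not the real one
    lhs_rhs = (word >> 3) & 0xF
    if alu_op == 0:
        lhs_rhs = 0              # don't-care bits forced to a fixed value
    return (lhs_rhs << 3) | alu_op

distinct = {canonical(w) for w in range(1 << 7)}
print(f"{len(distinct)} unique control words vs {1 << 7} raw encodings")
```

Every encoding whose don't-care bits are forced to one value collapses together, shrinking the table that the ROMs (or replacement logic) actually have to realise.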
The datasheets show you the speed ratings for when the chips are driving a certain capacitance, usually equal to just a few CMOS inputs. If a chip is driving a lot of inputs then it will be far slower than what is listed on the datasheet, and if it's driving fewer it will be far faster. So as you've found out, the datasheet speeds aren't even that good as a rule of thumb!
You might save a few nanoseconds that way, but this exercise shows it wouldn't make any difference. If fetch were improved it would become an issue, but rather than save a maximum of 21ns (if you made it instant) I would redesign the dependencies to remove the flag calculation from the chain completely.
Why not swap the ROMs for faster ones? I get that you don't need an increase, but the possibility of that is just a side effect. As you concluded, you are running the current parts out of spec. Why not replace them just to prevent possible instability issues later down the line? Running parts out of spec might lead to unexpected failure modes later on, maybe just from putting wear on them.
Earlier in the build I had planned to replace the 28c256's with 27c256's, which are pin-compatible and much faster (but OTP), but they are not readily available due to the chip shortage. Other ROMs are not pin-compatible, so I'd have to redo the pipeline PCBs OR make a messy adapter board.
@@weirdboyjim In that case, let's hope you are not forced to redo them later on anyway because of some failure. These ROMs work at this speed, but being outside spec, it's not guaranteed all ROMs of that type will do so. Some might come off the production line better than others - like transistors in the past having A/B/C suffixes classifying how good they were. Before I forget: happy new year, and I hope to see many more interesting videos expanding on this and other stuff.
When this is reduced to an FPGA, something like an AMD/Xilinx Spartan, it should be faster. Or have it fabricated using boron-doped diamond and you will be able to get it up to something like 5GHz.
If you go over to the main channel page you will find multiple playlists for different sections of the build. The core cpu one is here: ru-vid.com/group/PLFhc0MFC8MiCDOh3cGFji3qQfXziB9yOw
Not my goal here but some viewers have been trying to do just that. My use of ROM chips makes that goal difficult, my future build plans would be more suitable.
Surprised the ROMs are so slow - I expected 15ns or better. Why not use SRAM and fill it from a very slow ROM during reset? That way you can get < 10ns! :) The ALU carry and other flags are a surprise!
I use a shadow RAM for the ROM in the memory subsystem (that is as much for code flexibility though), but here it's a lot of extra components to do that copy. It would be cool to absolutely maximize the performance of this design, but I only have so much time, and I think it will be more interesting to explore some other bits of architecture.
@@weirdboyjim Ok, you lazy bum! :) ...then just solder a couple of tiny 3V cells with a resistor and diode etc. on top of each SRAM (assuming you can find pin-compatible ones) and "program" them as ROMs. Failing that, and this would take more time, make emulator boards as such.