Yeah, it sucks the energy right out of you by continuously peeling away your attention. Eventually the frustration builds up and I personally fall back onto unproductive social media. ADHD, family medical challenges, and the pandemic just aren't that helpful either.
Yay! Soooo happy to see this, thank you. I have just finished another course in my studies. Have to rewatch the series, I’m getting so much more out of it now.
Welcome back Robert. Very keen to see where you go with this as I have just started exploring nMigen to build stuff on a ULX3S ECP5 dev board and RISC-V is the main target.
"Ben Eater" level clarity of explanations. Subscribed and liked. Also, it's good to see an example of implementing the RISC-V architecture, maybe taking away some of the fear and loathing aspect of it.
Robert, I feel your pain when you mention that you have all of these projects on the backburner and you get depressed looking at all of the equipment to dive into in your "junk room". I feel the same way about software projects I want to do and soooo many computer books (many are 30+ years old b/c I like pre-Internet topics) I want to read. I mentioned on a Discord channel relating to Erlang that all of the people I have come across who either program in Erlang or create compilers on top of the BEAM (i.e., the VM for Erlang) are so smart. One guy replied, "I don't think it's that. I think it's because we are stubborn". THAT'S the key Robert, be stubborn, be unwavering in your resolve. Do small blocks of concentrated effort. They all add up to achieving your goals.
Wow! A Vector Graphics MZ!!! It was the 2nd computer I had access to back in 1980. A lot of fun memories from those days, incl hand typing in machine code in the built-in memory monitor to play games...
It's great to see you back on this project, and designing it in HDL first is probably a very good idea. I think you may be over-complicating parts of your architecture though. First thing to keep in mind is that the instructions require two reads and a write per instruction, not per cycle. That's the programmer state view / architectural view. The actual implementation depends on the microarchitecture, where you could have a single bi-directional data bus which first reads RS, then reads RT, then writes back RD. If you wanted to, you could even pass the flags and ALU op on that same bus. There's nothing wrong with trying to make every operation a single cycle, but it's not necessarily required. Another way to look at it is you have a large cycle (1 per instruction) broken down into multiple phases, where each phase puts something else on the bus. For your function units, you may want to look at some of the modern out-of-order CPUs, not to implement something out of order, but to see how they broke down the function units. All OoO CPUs basically have fetch logic -> decoder -> issue bus -> function units, which is very similar to your backplane idea (i.e. the issue bus is a backplane which can issue instructions to function units based on the decoded instruction). From that perspective though, you probably should not combine the sequencer with the memory access component, and instead should implement a load/store card. If you take the multi-phase idea, then you could dual-purpose the load/store card to do both data R/W and instruction reads. Also, if you compress the backplane down to fewer buses, you could run them in parallel, say X, Control, Addr, Data, Mem Control. Then your load/store card could just bridge the two lanes. Or even compress the memory bus down further so you have X, Control, AD, and Mem Control (in that case, your memory bus would look a lot like PCI, which could be an interesting architecture choice - i.e. implement your memory bus using the PCI PHY, or even using the PCI standard). To better utilize SRAM for a register file, you could multi-phase that as well. So for example, let's say you run your main phase clock at 1 MHz. Then, you can run your internal register file clock at 4 MHz, and read 8 bits at a time. That way you can have a single x8 SRAM which stores your 32x32-bit register words. Though, to make sure you don't have synchronization issues, you would probably be better off running the register clock a little higher so that you are guaranteed to have the correct output regardless of the clock phase offset. I believe that would be +1 cycle, so run the register clk at 5x the phase clock. Hopefully that was helpful.
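For anyone who wants to play with the byte-wide register file idea, here is a minimal behavioral sketch in plain Python (not nMigen). It assumes a little-endian byte layout and four sub-cycles per 32-bit access; the class name `ByteSerialRegisterFile` and its methods are made up for illustration and ignore the extra synchronization cycle mentioned above.

```python
class ByteSerialRegisterFile:
    """32 x 32-bit registers held in one byte-wide (x8) memory of 128 bytes."""

    BYTES_PER_WORD = 4

    def __init__(self):
        self.sram = bytearray(32 * self.BYTES_PER_WORD)

    def write(self, reg, value):
        """Store a 32-bit value one byte per sub-cycle (assumed little-endian)."""
        if reg == 0:
            return  # x0 is hardwired to zero in RISC-V, so writes are dropped
        for sub_cycle in range(self.BYTES_PER_WORD):
            address = reg * self.BYTES_PER_WORD + sub_cycle
            self.sram[address] = (value >> (8 * sub_cycle)) & 0xFF

    def read(self, reg):
        """Assemble a 32-bit value from four byte reads, one per sub-cycle."""
        value = 0
        for sub_cycle in range(self.BYTES_PER_WORD):
            address = reg * self.BYTES_PER_WORD + sub_cycle
            value |= self.sram[address] << (8 * sub_cycle)
        return value


rf = ByteSerialRegisterFile()
rf.write(5, 0x12345678)
print(hex(rf.read(5)))
```

Each loop iteration stands in for one tick of the faster register clock, so a full word costs four ticks of it per access.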
@KaneYork If you look at one of his newer videos, you see that he sort of ran into a problem there where he has to sequence control signals based on sub-phases. Which is essentially making each "cycle" multiple cycles long. E.g. he is using two 6-phase clocks, and may need to increase the phase count to avoid multiple bus drivers.
Yay, you're back, and LMARV is back too! This is absolutely my favorite series, very interested to see how this turns out with your new design process. Would it be worth having a "register clock" which is essentially 2x the "system clock"? The register clock gives you two cycles to do your work which translates to a single system clock cycle.
Another somewhat similar approach is how Motorola implemented a two-phase clock for the 6800 series - by having two overlapping clocks (in quadrature), so that any of four phases could be synthesized just by ANDing the two. Robert's use of an overlapping clock specifically for write pulses is probably a good solution, though, since it can be used everywhere else in the system that needs similar setup and hold times. If you use a 2x clock, then you still have to AND the clock with something in order to determine WHICH state you are in, which is the way Intel liked to do it back in the 8-bit days.
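A rough way to see the quadrature idea is to tabulate it in plain Python. This is only a sketch of two generic 50% duty clocks offset by a quarter period, not a model of any real Motorola clock circuit, and it assumes the inverted clocks are available as AND inputs too. The function name `quadrature_phases` is made up for illustration.

```python
def quadrature_phases():
    """Return (clk_a, clk_b, phases) for each quarter of one clock period."""
    table = []
    for quarter in range(4):
        clk_a = 1 if quarter in (0, 1) else 0  # high for the first half period
        clk_b = 1 if quarter in (1, 2) else 0  # same shape, a quarter period later
        not_a, not_b = 1 - clk_a, 1 - clk_b    # assumed: complements are available
        phases = (
            clk_a & not_b,  # phase 0
            clk_a & clk_b,  # phase 1
            not_a & clk_b,  # phase 2
            not_a & not_b,  # phase 3
        )
        table.append((clk_a, clk_b, phases))
    return table


for clk_a, clk_b, phases in quadrature_phases():
    print(clk_a, clk_b, phases)
```

Exactly one phase output is high in each quarter, which is what lets a sequencer enable one bus driver at a time.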
You _could_ use dual-port memories. But you need an additional MUX for the case of reading and writing the same address. One MUX input is the read output of the memory, one input is the write data. In the case of read_addr == write_addr, switch the write data to the output of the MUX, else use the read data of the memory.
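The bypass MUX described above can be sketched behaviorally in plain Python. This assumes a memory whose read port would otherwise return the old stored value during a same-address write; the class name `BypassedDualPortRam` and the `cycle` method are hypothetical.

```python
class BypassedDualPortRam:
    """One read port, one write port, with a bypass MUX on the read output."""

    def __init__(self, depth):
        self.mem = [0] * depth

    def cycle(self, read_addr, write_addr=None, write_data=0):
        """Model one clock cycle and return the value seen on the read output."""
        stored = self.mem[read_addr]  # what the memory itself outputs this cycle
        writing = write_addr is not None
        # Bypass MUX: on an address match, forward the write data instead.
        if writing and write_addr == read_addr:
            read_out = write_data
        else:
            read_out = stored
        if writing:
            self.mem[write_addr] = write_data  # the write lands at the end of the cycle
        return read_out


ram = BypassedDualPortRam(depth=32)
print(ram.cycle(read_addr=3, write_addr=3, write_data=99))  # forwarded write data
print(ram.cycle(read_addr=3))                               # now read from memory
```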
The idea to use nMigen for simulation of logic ICs is nice :D Maybe you can get an estimate of the timing constraints by synthesizing the design for some platform and use a tool like icetime.
Been cleaning up my workspace for weeks. A little here and a little there. Though it might not look it to an outsider (say my wife), I've actually made great strides, but I still have a ways to go. lol.
So it looks like you'll have "internal" buses and "external" buses. One thing I might suggest is having a simple Memory Management Unit (MMU) that bridges the "internal" and "external" bus. This could be as simple as an address latch and a data latch. You can treat it like a normal functional unit on the internal bus and have it manage the address and data buses for the "external" cards. The "sequencer" (Control Unit?) can then handle data movement to and from the MMU. You might also want to implement a "cache card" to save round trips to the "external" bus for at least instruction memory. Regarding "internal" bus movement for data, you could make your operand fetch take multiple cycles, using the same bus for bidirectional data transfers. You would need to add input latches/registers (I think you were using buffers before anyways) to the functional unit cards, but your bus can be much smaller. Instead of 3x32b you can do all of it with 1x32b. This also helps you avoid having duplicate register files just to handle simultaneous reads. You can still use your scheme of reading on the rise and writing on the fall, it just might take 2 cycles instead of 1 cycle to fetch both operands. You might leverage that unused write slot during the first cycle for updating the Program Counter (PC += 4) or some other operation.
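To make the single-bus sequencing concrete, here is a small plain-Python sketch of an ADD done over one shared 32-bit bus with two input latches on the ALU card. It assumes three bus transfers per instruction (two operand fetches and one write-back) and does not model the rise/fall split or the PC update; the function name `add_over_shared_bus` and the latch names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF


def add_over_shared_bus(regs, rs1, rs2, rd):
    """Run rd = rs1 + rs2 using one bus value per step; return the bus trace."""
    trace = []

    bus = regs[rs1]                     # step 1: register file drives the bus
    latch_x = bus                       # ALU card captures the first operand
    trace.append(("fetch rs1", bus))

    bus = regs[rs2]                     # step 2: register file drives the bus again
    latch_y = bus                       # ALU card captures the second operand
    trace.append(("fetch rs2", bus))

    bus = (latch_x + latch_y) & MASK32  # step 3: ALU drives the bus with the result
    if rd != 0:                         # x0 stays zero in RISC-V
        regs[rd] = bus
    trace.append(("write rd", bus))
    return trace


regs = [0] * 32
regs[1], regs[2] = 7, 5
for step in add_over_shared_bus(regs, rs1=1, rs2=2, rd=3):
    print(step)
```

Only one unit drives the bus in each step, which is the trade: a third of the bus wiring in exchange for more steps per instruction.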
Design choices! I did think of multiple ideas, and they all ended up being a compromise between number of cycles and amount of hardware. My line sort of settled on fewer cycles for most instructions.
Love it! That sequencer block needs more fleshing out. You know at least you need a Fetch and Decode unit in there somewhere. Are you going to do any pipelining?
Pipelining wouldn’t work on this simple bus architecture because you have simultaneous incoming and outgoing data streams in addition to the internal register transfers. For pipelining you need a matrix switch fabric that can route your incoming data/instruction to the decoder, transfer the register values to the ALU/shifter/whatever and the result back into the registers, and transfer the last operation's output into the data bus output buffer. Your bus is now an active component in the CPU, not just some lines.
Setting an uninitialized value to 0 can hide propagation of bad data errors. Better to use a crazy value like 0xDEADBEEF used by IBM in their dev tools.
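A quick plain-Python illustration of why the fill value matters in a simulation. It assumes a toy 32-entry register file and a deliberately buggy program that reads x6 before anything wrote to it; `make_registers` and `add` are hypothetical helpers.

```python
POISON = 0xDEADBEEF
MASK32 = 0xFFFFFFFF


def make_registers(fill):
    """Create 32 registers pre-filled with `fill`; x0 is always zero."""
    regs = [fill] * 32
    regs[0] = 0
    return regs


def add(regs, rd, rs1, rs2):
    """rd = rs1 + rs2, wrapped to 32 bits; writes to x0 are ignored."""
    if rd != 0:
        regs[rd] = (regs[rs1] + regs[rs2]) & MASK32


for fill in (0, POISON):
    regs = make_registers(fill)
    regs[5] = 5
    add(regs, 7, 5, 6)  # bug: x6 was never initialized
    print(f"fill={fill:#010x} -> x7={regs[7]:#010x}")
```

With a zero fill the result is a plausible-looking 5, while the poisoned run produces a value that stands out immediately in a trace.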
I have a question - do I understand correctly that you use an FPGA-like environment for a concept and proof of work and then after you make sure it all works you'll produce the cards/hardware directly without an FPGA, right?
There's no part of the RISC-V specification that requires you to do two reg-reads and one reg-write in the same cycle. With latches on the ALU, you could very well sequence the same operation over several cycles instead. You only need 2R1W registers if you plan to make a tightly pipelined, scalar architecture. I'm certainly all for making that kind of architecture :), but there's no reason that you *need* to do that if it makes the project more complicated than you need it to be.
14:02 Can someone explain why we need two banks to access two registers at the same time? It makes sense for memories due to single addressing but I don't get it for the flip-flops case.
Because otherwise you would need 32 register flipflops and 2x32 latches (x latch and y latch) to send the bits from R to X and/or Y. RS1 can be the same register as RS2. So that's 3x32 bit chips instead of 2x32 if you parallel write the registers.
@robertmenteer3462 I see. I don't think that addresses the problem, though. The problem is that he needs to read two addresses and write one, all in the same cycle. If the two addresses were just offsets within the same chip, then writing could only be done to one of these at a time, so you'd have to use a second clock pulse to write to the second copy of any given location. By using two memory chips, each write updates both chips (both copies of the same location) in the same clock cycle.
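Here is the two-chip arrangement as a plain-Python sketch. It assumes each chip has a single address port, so one chip serves the rs1 read and the other serves the rs2 read, while every write goes to both; the class name `MirroredRegisterFile` is made up for illustration.

```python
class MirroredRegisterFile:
    """Two single-address memories holding identical copies of the registers."""

    def __init__(self, count=32):
        self.bank_x = [0] * count  # serves the rs1 read port
        self.bank_y = [0] * count  # serves the rs2 read port

    def write(self, rd, value):
        """One write pulse updates the same location in both chips."""
        if rd != 0:  # x0 stays zero in RISC-V
            self.bank_x[rd] = value
            self.bank_y[rd] = value

    def read_pair(self, rs1, rs2):
        """Each chip decodes its own address, so both reads share one cycle."""
        return self.bank_x[rs1], self.bank_y[rs2]


rf = MirroredRegisterFile()
rf.write(1, 10)
rf.write(2, 20)
print(rf.read_pair(1, 2))
print(rf.read_pair(2, 2))  # rs1 and rs2 may name the same register
```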
At 12:05 you say that RV32E has 32 registers, but that's not right. The base integer instruction set, RV32I, has 32 registers, and RV32E reduces it to 16.
In case you haven't seen it yet, this project is featured in a Hackaday blog post: hackaday.com/2020/11/09/the-logic-chip-risc-v-project-reboots/#comment-6293177. I'm looking forward to the rest of the modules!
@18:24 this doesn't sound right. The ALU takes a non-zero amount of time to do its operation... wouldn't the write be on the next clock? (or the falling edge or whatever, but not at the same time)
Why wouldn't the memory run instructions like the ALU? It needs rs1 and rs2, and can write to registers too. Maybe you could put the opcode on the bus, and decode it on each card? Also, you said RV32E has 32 registers, but it has 16 registers.
Wellllll.... originally I meant LMARV-1 to be the discrete version, with FPGAs slowly replacing each piece (LMARV-2) until the whole thing was just one FPGA (LMARV-n). In reality, I'll never do that.
Like the channel! nMigen doesn't support delays? I've been building my own 8-bit CPU called SPAM-1 that has a somewhat similar multibus design, superficially perhaps: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-VJgfgP1Q89U.html and I've simulated the components in Icarus Verilog and spent a LOT of time approximating the delays from the datasheets of the 7400 devices. I found this invaluable for spotting glitches or other propagation delay side effects. How do you plan to discover at least some timing issues prior to committing to hardware? I understand "careful design" but simulation and automated tests have been invaluable for spotting the problems that my care and attention didn't avoid.