It’s been 2 months since I wrote about setting up the SPI connection between the supervisor and FPGA. That time hasn’t been idle, but I still don’t quite have a proven SPI connection. What I do have is a Z80 CPU running a program to exercise the SPI connection in a simulation. Valuable lessons were learned along the way that I hope you’ll find useful. Let’s start with a nice picture of the simulation waveforms!
Simulation waveform
This image is a screen capture from GTKWave. In this capture, you can see data being clocked out of the SPI as the ‘master’ toggles the clock line. That looks great, but it’s a long way from where I started.
I started building the SPI from a model in my head of how the SPI client should work: it would read data in real time from the MOSI line and, as it received instructions from the master, it would return its response message. That’s great in theory, but I later found that the Z80 CPU couldn’t keep up. The SPI interface runs at 40MHz, meaning 5 million bytes per second can be transmitted across the bus. The support CPU would run at 24MHz and at its best can execute 6 million instructions per second (the shortest Z80 instructions take four clock cycles), assuming there is no data fetch. There’s no way it has time to execute any sort of logic in response to received messages in real time. I had to think of a different approach. I decided on an exchange coordinated by handshake signals. Most of the requests will come from the slave device, so it would work like this:
- The master does not transmit until the slave indicates that it’s ready; it can transmit any time the slave ready signal is high (such as when the slave has prepared a message).
- If the master sets its master ready flag before the slave ready flag is set, then the slave needs to prepare to receive the message then set its slave ready flag. This could involve creating a ‘no operation’ message or storing a message for transmit.
- Messages are exchanged when both ready signals are active.
- The messages may be related, or unrelated. The protocol doesn’t make any distinctions.
- When both ready signals are active, the SPI master completes the exchange under DMA. Each side of the interface processes the received message.
- The messages are packetized for convenience and to ensure integrity with some form of CRC yet to be decided.
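A minimal Verilog sketch of the slave side of this handshake might look like the following. The signal names (`master_rdy`, `msg_valid` and so on) are illustrative, not the actual CPC2 ports:

```verilog
// Sketch of the slave-side ready handshake described above.
// All names are illustrative, not the real CPC2 signals.
module spi_handshake (
    input  wire clk,
    input  wire reset,
    input  wire master_rdy,   // master has a message pending
    input  wire msg_valid,    // CPU has staged a message (possibly a no-op)
    output reg  slave_rdy,    // tell the master we're safe to clock
    output reg  start_xfer    // kick off the DMA exchange
);
    always @(posedge clk) begin
        if (reset) begin
            slave_rdy  <= 1'b0;
            start_xfer <= 1'b0;
        end else begin
            // Raise ready only once a message is staged in the buffer
            slave_rdy  <= msg_valid;
            // The exchange begins only when both sides are ready
            start_xfer <= master_rdy & slave_rdy;
        end
    end
endmodule
```

Registering everything on one clock edge keeps the handshake signals stable for whichever edge the other side samples on.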
The core SPI mechanism worked in simulation, so I moved on to feeding the SPI module with data from a CPU. That’s when problems became apparent.
Until this point, I’d been using Verilator to simulate my HDL code. I’d observed some oddities on the SPI MISO line, as it didn’t seem to drop to a Hi-Z state. I knew from tests with iVerilog that GTKWave could show this Hi-Z state, but it was not apparent in the output from Verilator.
After adding the Z80 CPU to the mix, the simulation failed to operate correctly under Verilator and wouldn’t execute any instructions, even with a HALT (0x76) hardwired to the data bus. Something was fundamentally wrong here.
I switched over to iVerilog to test the process and, lo and behold, the CPU started to work. The root of the problem was that the CPU I used (A-Z80) has a real tristate data bus, just like a physical Z80 CPU. I read the Verilator documentation and realised that it’s a two-state simulator, which means that high-impedance signals and “inout” lines will not work. I’d spent days assuming I’d done something wrong in the deployment of the CPU and realised too late that it was the simulator I was using. I reluctantly switched over to my trusty iVerilog.
This is a shame, because Verilator provides a simpler interface for driving HDL code and would allow me to extract memory contents in real time, such as pulling out the screen buffer. It’s also about 10 times faster than interpreted iVerilog code. However, the unconventional A-Z80 CPU was necessary because it was developed by reverse engineering the real Z80 and so should be highly accurate. I’ve used TV80 before, which uses a more conventional data-in and data-out bus, but I really liked the design premise of the A-Z80.
Switching over to iVerilog solved the problems I was seeing, and the HDL behaved as expected. I then wrote a data bus interface that converts the bidirectional data lines into one input bus and one output bus, controlled by the RD signal. These unidirectional busses are more aligned with general HDL practice and are converted to selectors by the HDL compilation process. I then wrote the memory module, which provides 64KB on board, and the I/O controller, which interfaces to the SPI and up to 16 other peripherals, and connected all of these through the data bus. The bus is coordinated by the MREQ, IORQ, RD and WR signals.
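The heart of that bidirectional-to-unidirectional conversion is only a couple of continuous assignments. This is a generic sketch with illustrative port names, assuming the Z80’s active-low /RD:

```verilog
// Convert the A-Z80's tristate data bus into separate in/out busses.
// Port names are illustrative, not the project's actual identifiers.
module z80_bus_if (
    inout  wire [7:0] cpu_data,  // the CPU's tristate data bus
    input  wire       rd_n,      // Z80 /RD, active low
    input  wire [7:0] din,       // data from memory/IO toward the CPU
    output wire [7:0] dout       // data from the CPU toward memory/IO
);
    // Drive the CPU bus only during a read; otherwise release it to Hi-Z
    assign cpu_data = (!rd_n) ? din : 8'bz;

    // Whatever the CPU drives is always visible to memory and IO
    assign dout = cpu_data;
endmodule
```

Keeping the tristate logic in one small wrapper like this means everything behind it is plain two-state logic, which synthesis tools turn into selectors.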
I set up Eclipse with a Makefile project to compile my test code. The test code feeds the SPI with data for transmission, then sets the slave ready line for the test harness to start clocking the bits through. The HDL harness emulates the SPI master in the supervisor chip and clocks data out of the SPI module.
Once I got this working successfully, I decided it was time to run this code in silicon. Yes, it was time to get out my prototype CPC2 board. Here things started to get really interesting!
I’d spent a long time ensuring that all of the transition points in the simulations worked correctly. A rookie mistake I made when I first started FPGA work was relying on a signal level at its transition point, that is, expecting to find a logic high on the very clock edge at which it changes. In simulation, it will probably work OK. In an FPGA, the logic might see a high, it might see the previous low state, or it might see something completely in between, depending upon the layout of the FPGA logic in the floorplan of the chip. This behaviour can change from one compilation to the next as the floorplan changes through optimization, and it will confuse the heck out of you as you work out why you see a problem here when you only changed something there. Yep, it happened to me.
Anyway, the way to avoid this is to change your signals on one clock edge and sample them on the other clock edge. The Z80 CPU encourages this by changing its control signals on the rising edge of the input clock, but sampling the signal levels on the falling edge (generally). I’d been very careful to observe this when simulating my design, so I wasn’t expecting any problems. Murphy and his irritating law had other ideas.
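In HDL terms, the discipline looks something like this (a contrived sketch, not code from the project):

```verilog
// Drive on one clock edge, sample on the other.
// All names here are illustrative.
module edge_discipline (
    input  wire       clk,
    input  wire       next_strobe,
    input  wire [7:0] data_bus,
    output reg  [7:0] captured
);
    reg strobe;

    // Change control signals on the rising edge...
    always @(posedge clk)
        strobe <= next_strobe;

    // ...and sample them on the falling edge, half a period later,
    // when they are guaranteed to have settled.
    always @(negedge clk)
        if (strobe)
            captured <= data_bus;
endmodule
```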
Upon downloading the image to the board, it flatly refused to execute the test program. I used Altera’s SignalTap functions to review what was happening on the busses inside the logic, but they were all over the place and didn’t seem to correlate to any sort of pattern (or so I thought initially).
I was having trouble seeing the trace edges because the clock I used to capture the data was the same clock that powered the logic in the Z80. I needed a finer resolution (more samples) on each clock cycle to see what was going on. I suspected there was some sort of timing error, with the signals not reaching the CPU in time before being sampled on the falling edge of the clock. I created a clock divider that powered the CPU with a clock at 1/4 of the master capture clock. This should give me enough resolution to see the edges and when the signals actually arrive at the CPU bus.
Incredibly, the CPU began to behave properly after this change. In that lightbulb moment that happens (many times) on every project, I remembered that the TimeQuest timing analysis had suggested a maximum clock of around 50MHz to allow for signal propagation delays through the logic. The clock I was using definitely exceeded that, so slowing it to 1/4 of the original speed brought the clock period into specification. Hey presto! The CPU was running! The clock I had been using was fed directly from the internal oscillator that drives configuration and runs nominally at 100MHz. The clock was toggling before the signals were halfway to their destination! No wonder there were problems!
I left the clock divider in there because the support CPU will run at about 24/25MHz. Any faster and the memory won’t be able to keep up. Take a look at my timing analysis to see what I mean. The clock module will need to provide a variety of clocks all in sync with each other, so I’ll leave it as-is and fix it up later, when I’ve replaced the 50MHz oscillator that I soldered incorrectly during the build.
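For reference, a divide-by-4 divider of the kind described is only a few lines. This is a generic sketch rather than the project’s actual clock module; in a real FPGA design, a derived clock like this should be routed onto a global clock network (or produced by a PLL) rather than used straight out of fabric logic:

```verilog
// Generic divide-by-4 clock divider (illustrative names).
module clk_div4 (
    input  wire clk_in,   // e.g. the ~100MHz configuration oscillator
    input  wire reset,
    output wire clk_cpu   // clk_in / 4, e.g. ~25MHz
);
    reg [1:0] count;

    always @(posedge clk_in)
        if (reset) count <= 2'b00;
        else       count <= count + 2'b01;

    // The counter MSB toggles at 1/4 of the input rate
    assign clk_cpu = count[1];
endmodule
```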
So, like all good tales, here’s the moral(s) of the story:
- Always have a plan – the SPI module took so long to write because I didn’t clearly understand the way a client SPI function should work. Think through the design approach before starting and write it out in a document that will be useful later, but will also force you to think through the process, step-by-step.
- Decide if you need to see HiZ and Unknown signals in your simulation. If not, Verilator is an excellent tool and provides much better performance and access to internals than Icarus Verilog. You can even switch between the two using a Makefile if the piece you’re simulating doesn’t require 3 or 4 state logic.
- Remember that your simulation doesn’t care what speed you run the logic at. Your silicon prefers the laws of physics. Simulating with `` `timescale 1ps/1ps `` in your code header will show your logic is good for 1THz operation. Einstein prefers you simulate your logic with the correct timescale for your target silicon so that you can use GTKWave to check the interval periods. (He told me so.)
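To illustrate that last point, here’s a minimal iVerilog testbench skeleton using a realistic timescale for a 25MHz clock (names are illustrative):

```verilog
// 1ns time unit with 1ps precision: delays below map to real nanoseconds,
// so GTKWave shows true interval periods for the target silicon.
`timescale 1ns/1ps

module tb;
    reg clk = 1'b0;

    // A 25MHz clock has a 40ns period, so toggle every 20ns
    always #20 clk = ~clk;

    initial begin
        $dumpfile("tb.vcd");   // waveform output for GTKWave
        $dumpvars(0, tb);
        #1000 $finish;         // run 1us of simulated time
    end
endmodule
```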
OK, so I’ve got the CPU dumping data into a buffer, signalling the SPI master that it’s ready to go and the SPI module cranks out the data at 40Mbps. The next steps are:
- Document what I’ve done so far (very important!)
- Write the supervisor side of the SPI interface and verify the data is coming across the SPI bus correctly.
- Write the TTY/STDIO interface so that I can provide a virtual terminal to the support module.
- Start a new repository on GitHub to upload the whole image to date so you folks can look for yourselves.
I only get a few hours a week to work on this project (work and family, huh!), so bear with me. We’re getting closer to a full-blown CPC image. It’s a short hop once you’ve got the basics working, and getting the A-Z80 module working is a major step forward. Adding another Z80 and a bunch of CPC logic should be a piece of cake after this! I’ve also got a working proof of concept for the CPC logic, developed for the Terasic G5X kit, so I’ll reuse a lot of that code, especially the custom Amstrad gate array ASIC.
Stay tuned for the next installment!
(If you think that I’m going into too much detail on the whole process, let me know in the comments and I’ll try to focus more on the development of the prototype than experimenting and writing up the process on the blog).