Hello CPC fans! Has it really been 4 months since my last post? How time flies. Thanks to codepainters for the prompt to get going on my next post, and for reminding me that there are readers following the progress of this project.
I spent the first 3 months of this year refining the caching SDRAM controller and chasing timing closure issues. The SDRAM controller and its byte cache worked absolutely fine in simulation but failed to work reliably when I put the design in silicon at high speed. The sort of random and unpredictable behaviour I saw is usually indicative of timing issues (especially as it worked at lower speeds), so I turned my attention to the timing reports from Quartus. This was new ground for me. I knew that IO timing was something that I should pay attention to, but until now my designs generally ran at sub-50MHz speeds where there’s enough timing slack to not notice the issues. However, on every design I’ve ever produced, at least one or two of the TimeQuest reports were in red, meaning that there were timing violations and hidden problems waiting to rear up when I least wanted them to.
With the high-speed logic of the SDRAM controller aiming for 160MHz, I had to solve these timing issues. My first stop was the FMAX report, which analyses the logic and adds up the gate and routing delays from the input to the output to determine the approximate maximum frequency. It was showing a paltry 55MHz: that’s 34% of my target and a full 105MHz slower than needed. I had some serious work to do tuning this design. I had no idea where to start, so a few days of Googling “FPGA best practices” and “Timing Closure” led me to a few useful links, here, here, here, and eventually into the Altera Training portal. There’s also a lot of great material on YouTube in the Altera (Intel) FPGA channel. I watched every instructional video that I could, read every PDF document available and undertook the formal Altera free courseware. It finally started to make a little sense. I have to tell you straight out that there are no shortcuts to this process. If you’re experiencing timing issues in high-speed designs, you have to read this stuff to fix them. There are no quick fixes and every problem is unique because every design is unique. Even if the logic itself is not unique, some variation of platform, PCB layout, signal impedance or skew will make it so, which means you have to consider timing issues and craft an SDC (Synopsys Design Constraints) file covering each of your inputs and outputs.
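To put numbers on that gap, it helps to think in clock periods rather than frequencies. The following is a minimal sketch using only the two figures quoted above; the helper name `period_ns` is made up for illustration.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz

target_mhz = 160.0    # target SDRAM controller clock from the post
reported_mhz = 55.0   # Fmax reported by the timing analyser

target_period = period_ns(target_mhz)       # time available per cycle
reported_period = period_ns(reported_mhz)   # time the slowest path needs

# Delay that has to be removed from the slowest register-to-register path.
excess_ns = reported_period - target_period

print(f"target period   : {target_period:.2f} ns")
print(f"slowest path    : {reported_period:.2f} ns")
print(f"delay to remove : {excess_ns:.2f} ns")
print(f"Fmax is {100 * reported_mhz / target_mhz:.0f}% of target")
```

Seen this way, the job is to shave roughly 12 ns off the worst path, which is why restructuring the logic matters more than small tweaks.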
Two of the first documents that I read were the Altera Quartus Handbook Volume 1, sections 11 and 12 (recommended HDL and design practices), and AN584 Timing Closure. These taught me things such as:
- Use CASE statements rather than multiple if-then-else statements to reduce logic usage.
- Follow good synchronous design practices, like making everything activate on a clock edge and keeping cascading logic and asynchronous signals that feed synchronous inputs to a minimum.
- Use an asynchronous RESET signal. This is counter-intuitive to the point above, but the way Altera FPGAs work is that they have both a clock and a separate reset signal, so using always @(posedge clock or posedge reset) will give better glitch resistance than checking a possible asynchronous reset inside the synchronous always block.
- Avoid combinational loops. That is, avoid logic that feeds its own output back into its input. Quartus will warn you during compilation if you have coded one. You can feed the result back, but only through a clocked register.
- Avoid combinational logic outputs, and use a register to hold output values where possible to avoid output glitches, unless your external inputs are completely synchronous and glitches are ignored between clock edges.
- Avoid latches by fully specifying all conditions in a case statement. Latches lead to glitches because they’re not clocked and transitory signals can trigger a latch inadvertently.
These are just a few of the mistakes I made when coding my memory controller. The Altera documentation lists many, many, many others, but these were the few that affected my design. After fixing these, I was still only getting 75-85MHz maximum clock speed, still well off the target speed.
I started to delve into some of the timing reports. One of the most important is the ‘Report Timing’ for the SDRAM clock. This will show the top 10 or so timing violations, sorted in order of negative slack. Slack is the time between when the data is ready and when it’s needed, so negative slack means the data arrives after it’s needed, causing a timing violation. I tackled these starting with the worst timing path and going down the list fixing each in turn.
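The slack calculation itself is simple subtraction. Here is a small sketch of it, with the paths ordered worst first; the path names and timings are invented for illustration and are not taken from the real reports.

```python
def setup_slack_ns(data_required_ns: float, data_arrival_ns: float) -> float:
    """Setup slack: positive means data arrives in time, negative is a violation."""
    return data_required_ns - data_arrival_ns

# Hypothetical paths: (name, data required time, data arrival time), all in ns.
paths = [
    ("ram_q -> dq_out", 6.25, 8.90),
    ("addr_reg -> ram_addr", 6.25, 5.10),
    ("state -> cmd_out", 6.25, 7.05),
]

# Sort so the most negative slack comes first, i.e. the path to fix first.
report = sorted(
    ((name, setup_slack_ns(req, arr)) for name, req, arr in paths),
    key=lambda item: item[1],
)

for name, slack in report:
    status = "VIOLATION" if slack < 0 else "ok"
    print(f"{name:22s} slack {slack:+.2f} ns  {status}")
```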
As it happened, many of the worst timing violations were on the logic that retrieved data from on-chip block ram and presented it to the DQ data pins of the SDRAM. This didn’t make any sense to me as these chips were rated to 800MHz, so I delved a little deeper to find out why these missed timing.
Right-clicking on the failing clock, and running a timing path on the clock will show a summary of the failing paths. Right-clicking on the path in the report and selecting Locate Path->Locate in Technology Map Viewer will show in a diagrammatic form the path that the data takes from the source clocked register to the destination clocked register. In this case, the destination clocked register was off-chip on the DQ pins of the SDRAM. The signal waveform is shown and the negative slack time is clearly displayed in red as the difference between the data required time and the actual data arrival time.
This is where it gets tricky because the technology map viewer is a post-fit view and the RTL compiler changes the names of the components and registers used. Or it may be showing parts of the Altera proprietary IP, for which the source code is not easily visible.
However, in my case, it was fairly obvious that the life of a data bit pulled from block ram is complex. The 2-port RAM that I was pulling the data from was a little more complex than a bunch of registers wired in parallel to the output with an enable signal. For a start, I hadn’t considered that with port-A at 8 bits wide and port-B at 16 bits wide, there would need to be some conversion logic in there. It’s obvious when you think about it, but I fell into the trap of thinking that I’m building logic at 160MHz and it’s an 800MHz chip, so there’s sure to be plenty of slack.
It turns out that while the FPGA fabric may be able to clock a single register at the 800MHz top speed of the device, combinational logic adds a small delay of a fraction of a nanosecond for each stage. An AND gate here adds 0.4nS, an OR gate there adds another 0.6nS and it all sums up to a much lower clock speed.
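A rough sketch of how those delays add up is below. The 0.4 and 0.6 figures echo the ones above; every other delay is an assumed value chosen only to show the arithmetic.

```python
# Hypothetical per-stage delays in ns for one register-to-register path.
# Only the AND and OR figures come from the text; the rest are assumptions.
stage_delays_ns = {
    "clock-to-output of source register": 0.5,
    "AND gate": 0.4,
    "OR gate": 0.6,
    "multiplexer": 0.9,
    "routing": 1.8,
    "setup time of destination register": 0.3,
}

path_delay_ns = sum(stage_delays_ns.values())
fmax_mhz = 1000.0 / path_delay_ns  # period in ns converted to frequency in MHz

print(f"total path delay: {path_delay_ns:.1f} ns")
print(f"approximate Fmax: {fmax_mhz:.0f} MHz")
```

Even with each stage well under 2 ns, this made-up path already limits the clock to a little over 220MHz, nowhere near the 800MHz headline figure.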
I recalled that as clock speeds increase, pipelining becomes critical. Modern Intel i7/i9 chips, for example, rely on deep instruction pipelines (well over a dozen stages, as far as I can tell) to reach their 4GHz+ clock speeds.
I decided to add a pipeline stage that would hold the 16-bit DQ data ready prior to it being output onto the SDRAM bus. This isn’t as complicated as it sounds. Instead of routing the 2-port ram output through a multiplexor to the DQ pins, I would extract the data on one clock, and hold it in a “holding” register for presenting to the DQ pins on the next clock. This shortened the data path for the clock cycle and took quite a few nano-seconds off the data path. Surprisingly, there were still issues getting the data from the block ram to the holding register.
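The effect of a holding register can be sketched with a few lines of arithmetic: the clock only has to accommodate the slowest stage, not the whole path. The two delay figures are assumptions for illustration, not measurements from this design.

```python
def fmax_mhz(stage_delays_ns):
    """Fmax is set by the slowest register-to-register stage."""
    return 1000.0 / max(stage_delays_ns)

# Hypothetical delays in ns; not measured from the real design.
ram_to_holding_ns = 4.2   # block RAM read into the holding register
holding_to_pin_ns = 3.6   # holding register out to the DQ pins

# Before: one long path from the block RAM all the way to the pins.
unpipelined = [ram_to_holding_ns + holding_to_pin_ns]
# After: the holding register splits it into two shorter stages.
pipelined = [ram_to_holding_ns, holding_to_pin_ns]

print(f"unpipelined: {fmax_mhz(unpipelined):.0f} MHz")
print(f"pipelined  : {fmax_mhz(pipelined):.0f} MHz (one extra cycle of latency)")
```

The cost is one extra clock of latency, and the first stage still has to fit within the period on its own.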
After a bit of head scratching, I went back to the Cyclone V data sheet and re-read the specification for the block ram. I have a speed grade 8 device on the CPC and according to the specs the block ram logic runs at a maximum of 240MHz, so my 160MHz SDRAM controller was pushing the limits with un-optimised logic.
I’d managed to crack about 100MHz with the revised pipelining logic, but I was still a fair way off the target 160MHz. I looked back at the timing reports again, and more importantly, looked at the waveform in the path summary for the worst failing paths.
I realised with a shock that the diagram was showing my data needed to be ready at the 3nS mark. My clock speed out of the PLL was 160MHz, so this should give 6.25nS between clock ‘launch’ and ‘latch’. This would be true if all my logic were triggered on the rising edge of the clock. Unfortunately, this was not the case and I had logic clocked on both edges. I’d followed my earlier coding practice that improved reliability by splitting logic to update on the rising edge and output on the falling edge. I’d used the same technique in the SDRAM controller, but the higher speed highlighted the problem. Clocking logic on both edges effectively halves the maximum speed of the logic block. Data is requested (launched) on the rising edge and sampled for output on the falling edge (latch edge). 3.125nS between these clock edges makes the effective clock speed 320MHz. In the same way that running DDR ram at the same clock speed as SDRAM will give double the throughput, logic on both edges turns my logic into a sort of half-way DDR block.
The math didn’t add up: a 320MHz effective clock speed for the data transport logic didn’t give the block ram enough time to get its data out through the complex output logic, especially when it’s only rated to 240MHz. I removed as much logic as I could from the falling edge always statement, but there were a couple of essential buffering registers that I couldn’t move without a major rewrite of the core logic that I’d spent months testing.
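The arithmetic behind that is worth spelling out. This sketch uses only the 160MHz and 240MHz figures from above and assumes a 50% duty cycle; the variable names are illustrative.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz

clock_mhz = 160.0       # PLL output clock
block_ram_mhz = 240.0   # block RAM limit quoted for this speed grade

full_period = period_ns(clock_mhz)     # rising edge to next rising edge
half_period = full_period / 2          # rising edge to falling edge at 50% duty
effective_mhz = 1000.0 / half_period   # what a rise-to-fall path must meet

ram_min_period = period_ns(block_ram_mhz)    # fastest the block RAM can cycle
shortfall_ns = ram_min_period - half_period  # time the RAM is short by

print(f"rise-to-rise budget : {full_period:.3f} ns")
print(f"rise-to-fall budget : {half_period:.3f} ns ({effective_mhz:.0f} MHz effective)")
print(f"block RAM needs     : {ram_min_period:.3f} ns")
print(f"shortfall           : {shortfall_ns:.3f} ns")
```

On these numbers the block RAM needs about a nanosecond more than a rise-to-fall path allows, before any of the surrounding logic is counted.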
Using tricks such as asymmetric clock waveforms, that is, putting the falling edge at 80% of the cycle time, I managed to get the logic to run at 152MHz with 300 pico-seconds of slack. Quite an achievement from where I started, but I was running at the limits of what was possible. Any changes to the logic in the design, even in other modules, would move the logic blocks around and affect the timing. Longer data paths meant longer delays between registers and I was constantly going back to the SDRAM module to fix new failing paths. That’s the thing about timing closure: when you fix one timing path it uncovers the next failing path, and you’re never done.
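To see what moving the falling edge buys, here is a small sketch of the time budgets on either side of it at 152MHz. It assumes the tight paths are the ones launched on the rising edge and captured on the falling edge; the function name is hypothetical.

```python
def edge_budgets_ns(freq_mhz: float, fall_at: float):
    """Return (rise-to-fall, fall-to-rise) time budgets in ns.

    fall_at is the position of the falling edge as a fraction of the period.
    """
    period = 1000.0 / freq_mhz
    return period * fall_at, period * (1.0 - fall_at)

for fall_at in (0.5, 0.8):
    rise_to_fall, fall_to_rise = edge_budgets_ns(152.0, fall_at)
    print(
        f"falling edge at {fall_at:.0%}: "
        f"rise-to-fall {rise_to_fall:.2f} ns, fall-to-rise {fall_to_rise:.2f} ns"
    )
```

The extra time on the rise-to-fall side is taken straight from the fall-to-rise side, so this only helps if whatever is clocked on the falling edge has very little logic after it.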
160MHz should be easily possible, even on the speed grade 8 chip if the logic is written with short data paths and sensible pipelining of data. However, I needed to move on – I’ve sunk 5 months into building, testing, tweaking and timing closure on this module alone. That’s enough for me. I did a quick calculation on the back of a beer mat and decided that 120MHz would suffice. It’s only three-quarters of my original planned speed, but it will allow me to move on to other parts of the system.
I expect that there will be worst-case scenarios where the CPC CPU requires data and it won’t be available in time because the caching controller is in the middle of a read-modify then refresh cycle. My plan to address this is to suspend the CPC clock using a clock enable signal until the data is available so that timing is not missed. This would be disastrous on a real CPC as the video refresh is tied into the clock speed, but on my ‘fake’ CPC the video output is up-scaled to HD before output to the monitor, so it should be able to tolerate a bit of clock throttling, although some of the CRTC special frame effects may be spoiled. This was a really tough decision for me because I wanted the RTL to be a faithful reproduction of the original CPC so that video effects timing is perfect. Inevitably there are compromises in any complex electronic system and this is one I had to make for my sanity. I may come back to this at some future point and rewrite the caching SDRAM controller so that it can run at full speed.
I discovered a few other things during my in-depth analysis of timing closure and I’ll write these up next time. I promise I’ll do this sooner than 4 months though! Thanks for sticking with me and feel free to ask any questions about the timing topics discussed in this post.