Following the summary of timing closure challenges in my last post, here are a few more lessons learned from trying to get my SDRAM and DMA controller to run at their fastest possible speed.
A lot of my timing closure process involved changing the RTL code and checking the effect on the timing. It’s a slow and laborious process, so here’s a list of my findings to save you the hours of compilation time that it took me to test these.
In no particular order:
Dual Port Block RAM
- The documentation doesn’t appear to suggest there’s any relationship between the Port A clock and the Port B clock. However, the timing reports suggested there was some sort of relationship between the two. I started with Port A running at 48 MHz and Port B running at 160 MHz. Regardless of how I adjusted the phasing of these clocks, I couldn’t overcome the setup or hold violations. When there is an integer relationship between Ports A and B, such as 48/96 or 48/144, the timing became easier to adjust because the clock edges align every 2 or 3 cycles of the faster clock.
- If Port A has the higher clock speed and the Port B clock is an integer fraction of the Port A speed, then the logic runs faster. I gained an additional 15 MHz of FMAX by switching the connections between Ports A and B so that A had the higher speed.
- Block RAM incurs additional time penalties if the Port A width is different from the Port B width. Initially, I had used an 8-bit width on Port A and a 16-bit width on Port B to better align with the 8-bit Z80 on A and the 16-bit SDRAM on B. However, it proved near impossible to close the timing in this configuration at 160 MHz because it took longer than half a clock cycle just to retrieve the data from 2 locations in the block RAM and mux it into a 16-bit data path. Switching to an 8-bit width on both sides of the block RAM saved 1.3 ns, which was 12% of my total timing budget for this operation.
- Forcing register types to be MLAB rather than the default M10K can save a lot of time. I was transferring data from the 64K block RAM to a holding register so that it could be processed quickly in the next clock cycle. However, I was finding that access into and out of the holding register was unexpectedly slow. When I looked up the technology map for this transfer, it showed that the holding register had been fitted to another block RAM. This defeated the purpose of the holding register, which is supposed to be fast to access. Adding the synthesis attribute (* ramstyle = "mlab" *) fixed this issue by forcing the holding register into MLAB logic rather than M10K block RAM. This dramatically improved the retrieval time from the holding register to the SDRAM data pins.
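As a rough illustration of the clock-ratio point in the first bullet above, the short Python sketch below works out how often the rising edges of two clocks coincide. It assumes both clocks are derived from the same reference and start in phase, which is an idealisation, and the function name edge_alignment is made up for this example.

```python
from math import gcd


def edge_alignment(f_a_mhz: int, f_b_mhz: int) -> tuple[int, int]:
    """Cycles of clock A and clock B between coincident rising edges.

    Assumption: both clocks come from one reference and start in phase.
    """
    common = gcd(f_a_mhz, f_b_mhz)  # rate at which the edges line up, in MHz
    return f_a_mhz // common, f_b_mhz // common


# Port A / Port B clock pairs mentioned in the text
for port_a, port_b in [(48, 160), (48, 96), (48, 144)]:
    cycles_a, cycles_b = edge_alignment(port_a, port_b)
    print(f"{port_a}/{port_b} MHz: edges align every {cycles_a} cycle(s) of A "
          f"and {cycles_b} cycle(s) of B")
```

For 48/160 the edges only line up every 3 cycles of A and 10 cycles of B, so the launch and latch edges sit at many different offsets in between. For 48/96 and 48/144 they line up on every cycle of the slower clock.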
Synchronous Output Clocking
This describes the situation where the clock for an external device is generated inside the FPGA by a PLL. The FPGA generates the logic signals and synchronises them to that output clock. The PLL output should be connected straight to the output port so that the delay from the PLL to the FPGA pin is minimal. The data path through the logic is almost always longer, so there will be some skew between the clock and the data signals. Managing that skew is the focus of the timing closure work.
- The cycle time of the output clock is usually going to be the same as the clock driving the logic. The phase on the PLL clock is used to meet timing, along with the multi-cycle timing constraint. It was somewhat of a revelation to realise that it didn’t really matter which rising edge triggered the external logic, as long as the rising edge occurred when the data lines were stable. For example, the logic can look like this:
- PLL0 – 160MHz 0Deg phase – logic clock
- PLL1 – 160MHz 180Deg phase – output clock
- Using two PLL outputs in this way, the phase can be adjusted to ensure that the rising edge of the clock is right in the middle of the ‘data valid’ window.
- Linking the data signals to this generated clock in the SDC file is very simple:
- Create a generated clock
create_generated_clock -name sdram_Clk -source master_clock:master_clk|master_clock_0002:master_clock_inst|altera_pll:altera_pll_i|altera_cyclonev_pll:cyclonev_pll|divclk[6] [get_ports sdram_Clk]
- Set a multi-cycle relationship between the clock and the data because the logic data path is particularly long
set_multicycle_path -from [get_clocks {master_clock:master_clk|master_clock_0002:master_clock_inst|altera_pll:altera_pll_i|altera_cyclonev_pll:cyclonev_pll|divclk[0]}] -to [get_clocks {sdram_Clk}] -setup -start 2
- Add the setup and hold delays in relation to the clock
set_output_delay -clock [get_clocks sdram_Clk] -max 2 [get_ports sdram_*]
set_output_delay -clock [get_clocks sdram_Clk] -min -2 [get_ports sdram_*] -add_delay
- Note that the -min delay, which is the hold delay, is negative because it applies after the clock edge. I use -add_delay so that this constraint is added to any output delay already defined on these ports rather than replacing it.
- Quartus will take these settings and ensure that the data is valid 2 ns before the rising clock edge and remains stable for 2 ns after it. This ensures that the data not only arrives at the FPGA pins at the right time, but also has time to propagate to the internal gates of the SDRAM. The SDRAM’s datasheet setup time needs to be included in the -max value, and the datasheet hold time in the negative -min value.
The output delays are timed relative to the rising edge of the clock unless the -clock_fall option of set_output_delay is used.
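The -max and -min figures passed to set_output_delay can be derived from the SDRAM datasheet and the board delays. The Python sketch below uses the commonly quoted formulas (max is the longest data trace delay plus the setup time minus the shortest clock trace delay; min is the shortest data trace delay minus the hold time minus the longest clock trace delay) and also converts a PLL phase shift into nanoseconds. The 1.5 ns setup time, 0.8 ns hold time and zero trace delays are placeholder assumptions, not values taken from this project, and the function names are made up for illustration.

```python
def phase_to_ns(freq_mhz: float, phase_deg: float) -> float:
    """Convert a PLL phase shift in degrees to a time offset in ns."""
    period_ns = 1000.0 / freq_mhz
    return period_ns * phase_deg / 360.0


def output_delay(t_su_ns: float, t_h_ns: float,
                 data_trace_max_ns: float = 0.0, data_trace_min_ns: float = 0.0,
                 clk_trace_max_ns: float = 0.0, clk_trace_min_ns: float = 0.0):
    """Return the (max, min) values for set_output_delay, in ns."""
    max_delay = data_trace_max_ns + t_su_ns - clk_trace_min_ns
    min_delay = data_trace_min_ns - t_h_ns - clk_trace_max_ns
    return max_delay, min_delay


# Placeholder datasheet figures (assumed, not from a real part)
d_max, d_min = output_delay(t_su_ns=1.5, t_h_ns=0.8)
shift_ns = phase_to_ns(160.0, 180.0)

print(f"180 degrees at 160 MHz shifts the output clock by {shift_ns} ns")
print(f"set_output_delay -max {d_max} ... / -min {d_min} ...")
```

With these placeholder figures the result is -max 1.5 and -min -0.8, so the 2 and -2 used in the constraints above would leave some margin on both sides.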
The only other thing that caught me out in this timing exercise was the use of synchronizers between clock domains. I have two domains, one being the SDRAM clock and the other the CPU clock. Signals from the CPU clock domain need to get over to the DMA/SDRAM domain through a synchronizer (slow to fast domain). Quartus should detect these synchronizers and produce a metastability report, but I couldn’t get this to work. No matter how I flagged these synchronizers, they would not show up in the fitter report. I tried using the global setting that treats registers as synchronizers when their input signals are asynchronous, but this caused a heap more problems and I wanted something more targeted. I’ve since found out that synchronizers can be identified in the SDC file, so I’ll try that. Metastability is a concern for this project and I’m expecting all sorts of odd, hard-to-reproduce issues when I get the RTL to run together in silicon. I’ll look at this another time.