This post covers HyperRAM: what it is, how to interface to it, and how to improve the performance of high-speed parallel interfaces.
HyperRAM is described well by Cypress. It is essentially a double-data-rate RAM with a compact 12-line interface that masks the underlying DDR SDRAM technology. It can provide up to 333MB/s of data transfer in short bursts. Data is transferred on both edges of the clock, and the narrow bus makes it ideal for microprocessors or pin-constrained FPGAs.
Initially, I was put off using this type of memory because of the DDR nature of the interface, but I realised that these devices can easily be interfaced at a slower speed, using an internal FPGA clock that runs at four times the bus clock rate (twice the DDR data rate). This allows the signals to change between the clock transitions. For example, with a 120MHz internal clock, the signals can transition on one internal clock edge, and on the next edge the clock out can toggle to latch the data onto the HyperRAM bus. The resulting DDR clock runs at a quarter of the internal rate, 30MHz. Data is clocked on both edges, so 60MB/s is possible at this rate. A 120MHz internal clock is achievable even on the slowest FPGA devices.
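The clock-rate arithmetic above can be sanity-checked in a few lines; this is just a back-of-the-envelope model of the scheme described, not the RTL itself:

```python
# Model of the clocking scheme: a 120 MHz internal clock drives the bus
# state machine; signals change on one internal edge and the HyperRAM
# clock toggles on the next, so one full bus-clock period spans four
# internal clock cycles.

INTERNAL_CLK_HZ = 120_000_000

# One toggle every 2 internal cycles -> full period = 4 internal cycles.
hyperram_clk_hz = INTERNAL_CLK_HZ // 4
print(hyperram_clk_hz)        # 30 MHz bus clock

# DDR: one byte moves on each clock edge, i.e. two bytes per clock cycle.
throughput_mb_s = hyperram_clk_hz * 2 / 1_000_000
print(throughput_mb_s)        # 60 MB/s
```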
The only downside to this interface is that the latency can be fairly high, which means that data takes some time to appear on the bus. It takes 6 clock edges to clock in the read or write command, followed by one or possibly two latency periods. Each latency period can be up to 6 clock cycles (not edges). This latency becomes more visible as clock speeds increase, and all SDRAM memory devices suffer the same problem, so switching to 'un-managed' SDRAM wouldn't address the issue.
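Putting numbers on that latency, using the figures above at the 30MHz bus clock (a rough worst-case estimate, not a datasheet figure):

```python
# Worst-case read latency at a 30 MHz HyperRAM clock: 6 clock edges of
# command/address (3 full DDR cycles), then up to two latency periods
# of up to 6 clock cycles each.

BUS_CLK_HZ = 30_000_000
cycle_ns = 1e9 / BUS_CLK_HZ      # ~33.3 ns per bus clock cycle

command_cycles = 6 / 2           # 6 edges = 3 full cycles under DDR
latency_cycles = 2 * 6           # worst case: two 6-cycle latency periods

worst_case_ns = (command_cycles + latency_cycles) * cycle_ns
print(round(worst_case_ns))      # ~500 ns before the first data byte
```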
In my CPC2 project, I've switched from SDRAM to HyperRAM to alleviate a timing problem. The 4MHz CPC CPU cannot wait for its data to become available without affecting the performance and timing fidelity of the CPC. While 4MHz doesn't sound too demanding, once again latency can cause problems if the SDRAM controller is busy fetching video data or servicing the floppy disk emulation. Adding a second device allows the video circuitry to operate completely independently of the CPC RAM/ROM. Unfortunately, adding a second SDRAM was out of the question, as the FPGA device I had selected didn't have enough pins to support two SDRAMs plus the video and other functions. HyperRAM solves the pin-count problem: with each device needing just 12 pins, two independent devices use fewer pins than a single standard 16MB SDRAM. One HyperRAM is dedicated to servicing the CPC and the other will provide video images. In this way, no wait states are needed, because even with latency, a 120MHz core clock driving a 30MHz HyperRAM clock will deliver the memory result in less than 1.5 CPC clock cycles.
Building the RTL to operate the HyperRAM wasn't too challenging, but it required a real device to test against, because the timing of the real device differs from the simulation models provided by Cypress. Once I'd worked through the subtle issues, the device was rock steady and worked perfectly at its 30MHz clock.
There were some challenges with the FPGA logic, as the 120MHz core clock made life difficult for the fitter. However, a couple of well-placed timing constraints soon sorted that out.
The SDC set_max_skew constraint allows you to define the maximum permissible skew between a signal and its peers. Since the HyperRAM clock is not really a clock, but just another signal, I could set the maximum skew across all the signals with this constraint:
set_max_skew -to [get_ports {hyperram[*]}] 8
At 30MHz the clock period is 33ns. Since we're running double data rate (DDR), the timing window is 16ns. As the signal transitions need to land in the middle of this window, skew must be kept to about 8ns around the middle of the clock cycle. Compiling the design and then checking the timing report shows whether this was achieved, and the constraint encourages the fitter to place the signals so that this skew target is met.
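The skew budget above follows directly from the clock period; a quick worked version of the arithmetic:

```python
# Derive the ~8 ns skew budget from the 30 MHz bus clock.
BUS_CLK_HZ = 30_000_000
cycle_ns = 1e9 / BUS_CLK_HZ          # ~33 ns clock period
ddr_window_ns = cycle_ns / 2         # ~16 ns between successive DDR edges
skew_budget_ns = ddr_window_ns / 2   # ~8 ns: transitions must land mid-window
print(f"{cycle_ns:.1f} {ddr_window_ns:.1f} {skew_budget_ns:.1f}")
```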
There are two other timing constraints that help to meet timing: fast input registers and fast output registers. While I didn't need them for the HyperRAM module, they proved very helpful in streamlining the performance of my SDRAM controller, by moving the input and output registers to the edge of the chip so that place-and-route didn't try to route signals across the chip under the tight timing of a 120MHz core clock. These constraints work well if your logic is pipelined, because they get the signal on-device as simply as possible, and place-and-route can then handle the rest of the timing without further guidance. Under Quartus, fast input and output registers can be enabled in the pin-assignment GUI tool. This reduces the long delays from the external device pins, through the FPGA pins, to the capture logic. Quartus does this anyway during place-and-route, but with this option switched on you don't need to fiddle as much with the set_input_delay constraint. There's a note on the Altera forum about this.
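The same assignments can also be made directly in the Quartus settings file rather than through the GUI; a sketch, where the port names are illustrative and should be replaced with those from your own design:

```tcl
# Quartus .qsf assignments - sdram_dq / sdram_addr are example port
# names, not the names from this project.
set_instance_assignment -name FAST_INPUT_REGISTER ON -to sdram_dq[*]
set_instance_assignment -name FAST_OUTPUT_REGISTER ON -to sdram_dq[*]
set_instance_assignment -name FAST_OUTPUT_REGISTER ON -to sdram_addr[*]
```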
OSH Park Prototype HyperRAM
Since assembling the two HyperRAM chips on my new OSH Park prototype boards (above), work on the CPC2 has come on in leaps and bounds. The lack of large memory was really holding back progress. I've completed the ROM/RAM management cores, so that 64 ROMs and 4096KB of RAM are now available to the CPC2, managed by the support processor. Based on the required CPC personality, I can switch out the CPC464/664/6128 ROMs, the BASIC 1.0/1.1 ROM, the AMSDOS ROM and others, like my beloved Maxam ROM. The ROMs are stored in the FPGA configuration Flash. Storing 64 ROMs beyond the FPGA configuration image takes just a small fraction of the Flash memory, and as they never change, it's a good place to hold them. In the future, more volatile images, such as ROMs under development, will also be stored on the backing storage.
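To see why 64 ROMs are only a small fraction of the Flash, assuming the standard 16KB Amstrad CPC ROM size (an assumption, as the post doesn't state it):

```python
# Flash space needed for the ROM library, assuming 16 KB per CPC ROM.
ROM_SIZE_KB = 16
ROM_COUNT = 64
total_kb = ROM_COUNT * ROM_SIZE_KB
print(total_kb)   # 1024 KB: 1 MB of Flash for the entire ROM set
```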
I also progressed work on the floppy disk emulation. It can now manage 4 floppy disks of 82 tracks, 2 sides and 10 sectors per track, stored in the HyperRAM. This is a total of 3280KB of memory, plus the meta-data needed to manage this space. I'm using 8MB HyperRAMs, so there is plenty of space.
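The 3280KB figure works out exactly, assuming the usual 512-byte CPC sector size (my assumption; the post doesn't give the sector size):

```python
# Storage for the emulated floppies held in HyperRAM,
# assuming 512-byte sectors.
DISKS, TRACKS, SIDES, SECTORS_PER_TRACK, SECTOR_BYTES = 4, 82, 2, 10, 512
total_disk_kb = DISKS * TRACKS * SIDES * SECTORS_PER_TRACK * SECTOR_BYTES // 1024
print(total_disk_kb)   # 3280 KB, matching the figure above
```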
Take a look at the video below for a run-through. Right now, performance is pretty good, but will degrade when the video core is added because it will share the second HyperRAM bus. Fortunately, both use cases can tolerate wait states. The video will buffer image lines and will permit FDC access between these buffers, and the FDC can wait if the video is using the memory.
The next steps are to debug the FDC support software, then add backing store in the form of eMMC or uSD. As eMMC and uSD are electrically and logically the same, the RTL core and support software will be the same. I need to do some more tests on the PCB assembly of eMMC before deciding on this option.