Retro CPC Dongle – Part 35

Time for a quick update. I’ve integrated the SDRAM controller and the byte cache, and written logic to map part of the SDRAM address space to the CPC ROM enable line. I also added logic to let the support CPU push data into the SDRAM. This means the support CPU can alter the ROM configuration of the CPC2 based on a user-set configuration.

To test the set-up, I created an example ROM that, when booted by the CPC, copies itself to address 0x4000, then dumps 64 bytes of memory from 0x4000. This exercises the SDRAM controller, the cache and the cache replacement algorithm. Here’s the output.

Booting a dynamically installed ROM

Continue reading


Retro CPC Dongle – Part 34

Following a summary of the timing closure challenges in my last post, here are a few more lessons learned from the process of trying to get my SDRAM and DMA controller to run at their fastest possible speed.

A lot of my timing closure process involved changing the RTL code and checking the effect on the timing. It’s a slow and laborious process, so here’s a list of my findings to save you the hours of compilation time it took me to test them.

Continue reading

Retro CPC Dongle – Part 33

Hello CPC fans! Has it really been 4 months since my last post? How time flies. Thanks to codepainters for the prompt to get going on my next post, and for reminding me that there are readers following the progress of this project.

I spent the first 3 months of this year refining the caching SDRAM controller and chasing timing closure issues. The SDRAM controller and its byte cache worked absolutely fine in simulation, but failed to work reliably when I put the design into silicon at high speed. The sort of random, unpredictable behaviour I saw is usually indicative of timing issues (especially since the design worked at lower speeds), so I turned my attention to the timing reports from Quartus. This was new ground for me. I knew that I/O timing was something I should pay attention to, but until now my designs have generally run at sub-50MHz speeds, where there’s enough timing slack not to notice the issues. However, on every design I’ve ever produced, at least one or two of the TimeQuest reports were in red, meaning there were timing violations and hidden problems waiting to rear up when I least wanted them to.

With the high-speed logic of the SDRAM controller aiming for 160MHz, I had to solve these timing issues. My first stop was the Fmax report, which analyses the logic, adding up the gate and routing delays from input to output to determine the approximate maximum frequency. It was showing a paltry 55MHz – that’s 34% of my target and a full 105MHz slower than needed. I had some serious tuning work to do on this design. I had no idea where to start, so a few days of Googling “FPGA best practices” and “timing closure” led me to a few useful links, and eventually into the Altera training portal. There’s also a lot of great material on YouTube on the Altera (Intel) FPGA channel. I watched every instructional video I could, read every PDF document available and undertook the free formal Altera courseware. It finally started to make a little sense. I have to tell you straight out that there are no shortcuts to this process. If you’re experiencing timing issues in high-speed designs, you have to read this stuff to fix them. There are no quick fixes, and every problem is unique because every design is unique. Even when it isn’t, some variation in platform, PCB layout, signal impedance or skew will make it unique, so you have to consider timing issues and craft an SDC (Synopsys Design Constraints) file for each of your inputs and outputs.
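To give a flavour of what such a file looks like, here’s a minimal SDC sketch. The clock names, periods and delay values are placeholders I’ve invented for illustration, not the actual constraints from this design:

```tcl
# Hypothetical SDC fragment: define the clocks, then constrain the
# I/O pins relative to them so the timing analyser can check pin timing.
create_clock -name clk_50 -period 20.000 [get_ports clk_50]
derive_pll_clocks
derive_clock_uncertainty

# External SDRAM timing, relative to its clock (values invented)
set_input_delay  -clock sdram_clk -max 5.9  [get_ports sdram_dq[*]]
set_input_delay  -clock sdram_clk -min 1.0  [get_ports sdram_dq[*]]
set_output_delay -clock sdram_clk -max 1.5  [get_ports sdram_dq[*]]
set_output_delay -clock sdram_clk -min -0.8 [get_ports sdram_dq[*]]
```

The real values come from the SDRAM data sheet setup/hold figures plus the board trace delays, which is exactly why every design’s constraints end up unique.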

One of the first documents I read was the Altera Quartus Handbook Volume 1, sections 11 and 12 (recommended HDL coding styles and design practices), along with AN584 Timing Closure. These taught me things such as:

  • Use case statements rather than chains of if-then-else statements to reduce logic usage.
  • Follow good synchronous design practices: make everything activate on a clock edge, and keep cascading logic and asynchronous signals feeding synchronous inputs to a minimum.
  • Use an asynchronous reset signal. This seems counter-intuitive given the point above, but Altera FPGA registers have both a clock and a separate reset input, so using always @(posedge clock or posedge reset) gives better glitch resistance than checking a possibly asynchronous reset inside the synchronous always block.
  • Avoid combinational loops; that is, avoid logic that feeds its own output back into its input. Quartus will warn you during compilation if this is coded. You can feed the result back, but only through a clocked register.
  • Avoid combinational outputs: use a register to hold output values where possible to avoid output glitches, unless your external inputs are completely synchronous and glitches between clock edges are ignored.
  • Avoid latches by fully specifying all conditions in a case statement. Latches lead to glitches because they’re not clocked, and transitory signals can trigger a latch inadvertently.
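Several of these points can be shown in one small sketch. This is a hypothetical module of my own invention (the signal names aren’t from the CPC2 design), illustrating the asynchronous reset style, a registered output and a fully specified case with no inferred latch:

```verilog
// Illustrative only: async reset in the sensitivity list, registered
// output, and a case covering every input value so no latch is inferred.
module cmd_decode (
    input  wire       clock,
    input  wire       reset,   // asynchronous, active high
    input  wire [1:0] cmd,
    output reg  [3:0] strobe   // registered output: no glitches
);
    always @(posedge clock or posedge reset) begin
        if (reset)
            strobe <= 4'b0000;
        else begin
            case (cmd)          // all four values covered
                2'b00: strobe <= 4'b0001;
                2'b01: strobe <= 4'b0010;
                2'b10: strobe <= 4'b0100;
                2'b11: strobe <= 4'b1000;
            endcase
        end
    end
endmodule
```

Drop any one of the case arms (without a default) and the synthesiser has to remember the old value – that’s the accidental latch the last bullet warns about.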

These are just a few of the mistakes I made when coding my memory controller. The Altera documentation lists many, many, many others, but these were the few that affected my design. After fixing them, I was still only getting 75-85MHz maximum clock speed – well off the target.

I started to delve into the timing reports. One of the most important is the ‘Report Timing’ analysis for the SDRAM clock, which shows the top 10 or so timing violations sorted by negative slack. Slack is the time available between when the data is ready and when it’s needed, so negative slack means the data arrives after it’s needed, causing a timing violation. I tackled these starting with the worst timing path, going down the list and fixing each in turn.

As it happened, many of the worst timing violations were on the logic that retrieved data from on-chip block ram and presented it to the DQ data pins of the SDRAM. This didn’t make any sense to me as these chips were rated to 800MHz, so I delved a little deeper to find out why these missed timing.

Right-clicking on the failing clock and running a timing path report on it shows a summary of the failing paths. Right-clicking on a path in the report and selecting Locate Path->Locate in Technology Map Viewer shows, in diagrammatic form, the path the data takes from the source clocked register to the destination clocked register. In this case, the destination register was off-chip on the DQ pins of the SDRAM. The signal waveform is shown, and the negative slack is clearly displayed in red as the difference between the data required time and the actual data arrival time.

This is where it gets tricky, because the Technology Map Viewer is a post-fit view and the compiler renames the components and registers used. It may also be showing parts of Altera’s proprietary IP, for which the source code is not easily visible.

However, in my case it was fairly obvious: the life of a data bit pulled from block RAM is complex. The 2-port RAM I was pulling the data from was rather more than a bunch of registers wired in parallel to the output with an enable signal. For a start, I hadn’t considered that with port A at 8 bits wide and port B at 16 bits wide, there would need to be some conversion logic in there. It’s obvious when you think about it, but I fell into the trap of thinking that because I was building 160MHz logic on an 800MHz-rated chip, there was sure to be plenty of slack.
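A hypothetical sketch (not the actual CPC2 RAM) makes the hidden cost visible. With an 8-bit write port and a 16-bit read port, the synthesiser has to assemble each wide word from two narrow locations, and that assembly is extra combinational logic on the read path:

```verilog
// Illustrative mixed-width 2-port RAM: port A writes bytes,
// port B reads 16-bit words built from two byte locations.
module mixed_width_ram (
    input  wire        clock,
    input  wire        wr_en,
    input  wire [9:0]  wr_addr,  // byte address (port A)
    input  wire [7:0]  wr_data,
    input  wire [8:0]  rd_addr,  // word address (port B)
    output reg  [15:0] rd_data
);
    reg [7:0] mem [0:1023];

    always @(posedge clock) begin
        if (wr_en)
            mem[wr_addr] <= wr_data;
        // Two byte reads plus concatenation: this is the "free"
        // conversion logic that quietly eats into the timing budget.
        rd_data <= {mem[{rd_addr, 1'b1}], mem[{rd_addr, 1'b0}]};
    end
endmodule
```

Whether this maps to a true block RAM or to mux logic around one depends on the device and synthesis settings, but either way the width conversion doesn’t come for free.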

It turns out that while the FPGA fabric may be able to clock a single register at the 800MHz top speed of the device, combinational logic adds a small delay at each stage. An AND gate here adds 0.4ns, an OR gate there adds another 0.6ns, and it all sums up to a much lower achievable clock speed.

I recalled that as clock speeds increase, pipelining becomes critical: modern Intel i7/i9 chips use deep pipelines of well over a dozen stages to reach their 4GHz+ speeds.

I decided to add a pipeline stage to hold the 16-bit DQ data ready before it is output onto the SDRAM bus. This isn’t as complicated as it sounds. Instead of routing the 2-port RAM output through a multiplexer to the DQ pins, I extract the data on one clock and hold it in a “holding” register, ready to present to the DQ pins on the next clock. This shortened the data path per clock cycle and took quite a few nanoseconds off the path. Surprisingly, there were still issues getting the data from the block RAM to the holding register.
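In RTL terms the holding register looks something like this – a sketch with invented signal names, not the actual controller code:

```verilog
// Sketch of the holding-register pipeline stage: capture the block-RAM
// output into a register on one clock, drive the DQ pins from that
// register on the next, so each cycle contains only a short data path.
module dq_pipeline (
    input  wire        clock,
    input  wire [15:0] ram_q,     // block-RAM read data
    input  wire        dq_oe,     // output enable for the DQ bus
    inout  wire [15:0] sdram_dq
);
    reg [15:0] dq_hold;

    // Cycle N: RAM-to-register is now the only logic in this cycle.
    always @(posedge clock)
        dq_hold <= ram_q;

    // Cycle N+1: the pins are driven straight from a register,
    // leaving nearly the whole period for the pin path.
    assign sdram_dq = dq_oe ? dq_hold : 16'bz;
endmodule
```

The price is one clock of extra latency on writes, which the controller’s state machine has to account for.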

After a bit of head scratching, I went back to the Cyclone V data sheet and re-read the specification for the block RAM. I have a speed grade 8 device on the CPC2, and according to the specs the block RAM runs at a maximum of 240MHz, so my 160MHz SDRAM controller was pushing the limits with un-optimised logic.

I’d managed to crack about 100MHz with the revised pipelining logic, but I was still a fair way off the 160MHz target. I went back to the timing reports and, more importantly, looked at the waveform in the path summary for the worst failing paths.

I realised with a shock that the diagram showed my data needed to be ready at the 3ns mark. My clock out of the PLL was 160MHz, so I should have had 6.25ns between clock ‘launch’ and ‘latch’. That would be true if all my logic were triggered on the rising edge of the clock. Unfortunately it wasn’t: I had logic clocked on both edges. I’d followed my earlier coding practice of improving reliability by splitting logic to update on the rising edge and output on the falling edge, and I’d used the same technique in the SDRAM controller; the higher speed exposed the problem. Clocking logic on both edges effectively halves the maximum speed of the logic block. Data is requested (launched) on the rising edge and sampled for output (latched) on the falling edge, and 3.125ns between those edges makes the effective clock speed 320MHz. In the same way that running DDR RAM at the same clock speed as SDRAM gives double the throughput, clocking on both edges turned my logic into a sort of half-way DDR block.
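The trap is easy to show in a hypothetical fragment (again, invented signal names). With a 160MHz clock, the value launched in the first block has only half a period to settle before the second block samples it:

```verilog
// Illustration of the both-edges trap at 160MHz (6.25ns period).
module both_edges (
    input  wire        clock,
    input  wire [15:0] ram_q,
    output reg  [15:0] dq_out
);
    reg [15:0] fetch_data;

    // Launched on the rising edge...
    always @(posedge clock)
        fetch_data <= ram_q;

    // ...and latched on the falling edge: the path from fetch_data
    // to dq_out gets only 3.125ns, an effective 320MHz budget.
    always @(negedge clock)
        dq_out <= fetch_data;
endmodule
```

Moving the second block to the rising edge restores the full 6.25ns budget, at the cost of the output appearing half a cycle later.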

The maths didn’t add up: a 320MHz effective clock for the data transport logic didn’t give the block RAM, rated at only 240MHz, enough time to get its data out through the complex output logic. I removed as much logic as I could from the falling-edge always block, but a couple of essential buffering registers couldn’t be moved without a major rewrite of the core logic I’d spent months testing.

Using tricks such as an asymmetric clock waveform – putting the falling edge at 80% of the cycle time – I managed to get the logic to run at 152MHz with 300 picoseconds of slack. Quite an achievement from where I started, but I was running at the limits of what was possible. Any change to the logic, even in other modules, would move the logic blocks around and affect the timing. Longer data paths meant longer delays between registers, and I was constantly going back to the SDRAM module to fix new failing paths. That’s the thing about timing closure: fixing one timing path uncovers the next failing path, and you’re never done.

160MHz should be perfectly possible, even on a speed grade 8 chip, if the logic is written with short data paths and sensible pipelining of data. However, I needed to move on – I’ve sunk 5 months into building, testing, tweaking and closing timing on this module alone, and that’s enough for me. I did a quick calculation on the back of a beer mat and decided that 120MHz would suffice. It’s only three-quarters of my originally planned speed, but it lets me move on to other parts of the system.

I expect there will be worst-case scenarios where the CPC CPU requires data and it won’t be available in time, because the caching controller is in the middle of a read-modify then refresh cycle. My plan is to suspend the CPC clock using a clock enable until the data is available, so that no access is missed. This would be disastrous on a real CPC, where the video refresh is tied to the clock speed, but on my ‘fake’ CPC the video output is up-scaled to HD before going to the monitor, so it can tolerate a bit of clock throttling without upsetting the CRTC special frame effects too badly. This was a really tough decision, because I wanted the RTL to be a faithful reproduction of the original CPC so that video effect timing would be perfect. Inevitably there are compromises in any complex electronic system, and this is one I had to make for my sanity. I may come back at some future point and rewrite the caching SDRAM controller so that it can run at full speed.
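The clock-enable approach can be sketched in a few lines. This is a simplified, hypothetical version – signal names are my own, and the real controller would generate cache_ready from its state machine:

```verilog
// Sketch of clock throttling via a clock enable: rather than gating
// the clock itself (poor practice in an FPGA), the CPC core only
// advances when its tick arrives AND the cache has valid data.
module cpc_throttle (
    input  wire clock,        // fast system clock
    input  wire cpc_tick,     // nominal 4MHz CPC clock enable
    input  wire cache_ready,  // SDRAM/cache has the requested data
    output wire cpc_ce        // enable fed to the CPC core logic
);
    // If the cache is mid-refresh, the tick is simply swallowed and
    // the core stands still until data is available.
    assign cpc_ce = cpc_tick & cache_ready;
endmodule
```

Because every register in the core is qualified by cpc_ce, the whole CPU freezes and resumes coherently, which is what makes this safe compared with gating the clock net.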

I discovered a few other things during my in-depth analysis of timing closure, and I’ll write those up next time. I promise it’ll be sooner than 4 months! Thanks for sticking with me, and feel free to ask any questions about the timing topics discussed in this post.

Last Post <====> Next Post


Reliable Wake-on-LAN

Wake-on-LAN is a great tool and allows your home PC to seem almost cloud-like. You can switch on your PC without needing to touch it, and have it available when you need it without it using power constantly.

However, there are some drawbacks. It won’t work from a cold start – that is, the first time you power-cycle the PC. On some motherboards it won’t work if you issue a shutdown command, and Windows and Linux can behave differently, even on the same motherboard. It does seem to work well for the suspend modes (S1-S3), but then you’re consuming more power than in the S5 soft-off state. And of course, if you suspend and then lose AC power, not only can you not restart remotely, you may also corrupt your system.

Then, of course, it’s called wake-on-LAN, not sleep-on-LAN, so there’s no way to turn it off again. “Ah-ha”, you might say, “log on remotely and issue a shutdown command”.

Raspberry Pi Zero W


That’s all possible provided your machine hasn’t crashed or frozen with a kernel panic or just the old-fashioned BSOD. And what about a simple reset? These things are taken for granted when you’re physically near the machine.

After years of struggling with variable success with wake-on-LAN, I decided to fix the problem by fitting a second PC inside my home computer. It’s not as silly as it sounds, thanks to the internet-of-things and the truly amazing work of the Raspberry Pi Foundation.

Continue reading

“Got Your Back”

Back-ups are one of those things that everyone knows they need, but seldom puts much time or effort into setting up and maintaining properly. My previous safety net was CrashPlan, which is exiting the consumer backup space. This left me in a difficult place: finding a cloud provider that supports large server backups from Linux at a consumer price.

I looked at Amazon Cloud Storage ($60 per TB per year), Google ($240 per TB per year), and Backblaze B2 at $60 per TB per year (I didn’t consider Azure, given my Linux infrastructure). While Amazon may seem the safer bet on the surface, I found their EC2 pricing unnecessarily confusing, not transparent, and a potential “runaway” cost, as everything has a price per unit. This led me to believe their consumer cloud pricing may just be a transient offer in their quest for per-byte/second billing of computing, storage and networking cloud services. I needed something that had been around for a while, with simple pricing and a focused offering. Backblaze B2 fits those criteria.

Continue reading

Retro CPC Dongle – Part 32

Well, as promised in my last CPC2 post, I finished the next build of the CPC2 board and learned a lot during the process. Some things worked, some things didn’t, but every build gives me a wealth of knowledge about product design, fault diagnosis and rectification work. Yes, de-solder braid really was my best friend in this build!

Finally, a working board (click here for large)
Bottom side of board (click here for large)

Continue reading

Arduino ISP

As a brief reprieve from my main CPC2 project, I sidetracked into Arduino programming to solve a problem on the CPC2. I need to create an interface to a memory card with only two wires. To do this, I’m going to use an Atmel (Microchip) ATtiny841 to bridge between the memory card’s SPI interface and a two-wire serial UART. To program this device without spending heaps of money on a dedicated programmer, I’ll use an Arduino to program the Tiny. This post covers setting up an Arduino to write a bootloader into another Arduino or Atmel chip. To test the process, I used a Freetronics EtherTen to program a Freetronics Eleven that had a damaged bootloader. I’m using version 1.8.5 of the Arduino IDE.

Continue reading