As I was explaining to someone recently, bootstrapping a completely new, custom design is hard. There are no tools or pre-existing software to manage the device (Atmel/Altera tools aside). Everything has to be built from scratch. It has taken quite a while to get to this point, but I’m now close to a workable infrastructure. Once the foundation is in place it will allow rapid development of the CPC portions of this project. Here’s the UI part of the work so far:
In my last post, I talked about building the proof of concept for the technology pieces in the supervisor chip, the SAM4S, including the Flash and MRAM memory and the menu structure. In the last two weeks I’ve:
- Built a simple menu structure to allow management of a stored FPGA image (screen grab above)
- Set the stored FPGA image to run automatically at start-up
- Written a simple upload application that can be run stand-alone or within the dumb terminal program, picocom
- Optimised these programs to perform well (trickier than it sounds)
Uploading an image in optimised “release” mode gives the following upload times:
$ ./cmodem HelloWorld.rbf /dev/ttyACM0
Complete 100.000%, time 3s eta 0s
Successfully sent 765254 bytes
Considering upload times started at 180 seconds, it’s been a difficult process to get them this low. I couldn’t work out the profiling tools in the Atmel toolset, so I didn’t know where most of the time was going. I suspected that it was in the translation of USB packets (64 bytes maximum) into single bytes for consumption by the Atmel ASF STDIO library.
I changed the code in the monitor program to read these 64-byte packets directly and consolidate these into a single CMODEM packet of 354 bytes or less, so that 6 calls are made to the USB receive function, rather than 354 to the getchar function, with the associated overhead of buffer management.
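As a rough illustration of the consolidation described above, here is a minimal sketch in C. It is not the monitor program's actual code: `cmodem_fill_packet`, the `usb_read_fn` hook and the `fake_usb` stand-in are hypothetical names invented for this example, and the hook simply models "give me up to 64 bytes from the next USB packet".

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define USB_PACKET_MAX    64u   /* full-speed packet size quoted in the post */
#define CMODEM_PACKET_MAX 354u  /* largest CMODEM packet quoted in the post */

/* Hypothetical receive hook: copies up to `max` bytes into `dst`
 * and returns how many bytes it delivered (0 = nothing available). */
typedef size_t (*usb_read_fn)(uint8_t *dst, size_t max, void *ctx);

/* Gather `wanted` bytes (capped at CMODEM_PACKET_MAX) by pulling whole
 * USB packets rather than single characters. `calls`, if not NULL,
 * counts how many times the receive hook was invoked. */
size_t cmodem_fill_packet(uint8_t *packet, size_t wanted,
                          usb_read_fn read_usb, void *ctx, unsigned *calls)
{
    size_t have = 0;

    if (wanted > CMODEM_PACKET_MAX)
        wanted = CMODEM_PACKET_MAX;

    while (have < wanted) {
        size_t room  = wanted - have;
        size_t chunk = room < USB_PACKET_MAX ? room : USB_PACKET_MAX;
        size_t got   = read_usb(packet + have, chunk, ctx);

        if (calls)
            (*calls)++;
        if (got == 0)
            break;              /* source ran dry: return a short packet */
        have += got;
    }
    return have;
}

/* Host-side stand-in for the USB endpoint (an assumption, for testing
 * only): serves a byte array in pieces of at most 64 bytes. */
typedef struct {
    const uint8_t *data;
    size_t len;
    size_t pos;
} fake_usb;

size_t fake_usb_read(uint8_t *dst, size_t max, void *ctx)
{
    fake_usb *u = ctx;
    size_t left = u->len - u->pos;
    size_t n = left < max ? left : max;

    if (n > USB_PACKET_MAX)
        n = USB_PACKET_MAX;
    memcpy(dst, u->data + u->pos, n);
    u->pos += n;
    return n;
}
```

Filling a full 354-byte packet this way takes five reads of 64 bytes plus one of 34, which is where the figure of 6 calls comes from.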
This vastly improved performance from approximately 180 seconds down to about 60 seconds. Compiling the monitor program in Release mode rather than Debug mode shaved this time again, from 60 seconds down to about 30 seconds. Seemed reasonable, given the limitations of the USB full-speed protocol. I calculated that if one 64-byte packet were transmitted at each standard 1 ms polling interval, the total would be 64KB/s, making my 765K FPGA image take about 12 seconds to transfer.
However, this was faulty thinking. Further reading suggested that 1 ms is the maximum gap between transmissions, and that the bus can run at 12Mbps, or about 1.2MB/s, if kept fully loaded. I couldn’t see any more opportunity for tuning, so I put the discrepancy down to limitations of the SAM4S device.
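The two estimates above are easy to check with a few lines of C. This is only back-of-envelope arithmetic using the figures from the post (765254 bytes, 64-byte packets, roughly 1.2MB/s when the bus is kept loaded); the function names are made up for the example.

```c
#include <stdint.h>

/* Seconds needed to move `bytes` at a sustained `bytes_per_second`. */
double transfer_seconds(uint32_t bytes, double bytes_per_second)
{
    return (double)bytes / bytes_per_second;
}

/* Throughput in bytes/s if exactly one packet of `packet_bytes`
 * moves in every 1 ms interval (the "faulty thinking" model). */
double one_packet_per_frame_rate(uint32_t packet_bytes)
{
    return (double)packet_bytes * 1000.0;
}
```

With one 64-byte packet per millisecond, `transfer_seconds(765254u, one_packet_per_frame_rate(64u))` comes to just under 12 seconds, while `transfer_seconds(765254u, 1.2e6)` comes to roughly 0.64 seconds, so the 30 seconds measured at that stage was far above either bound.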
However, when I wrote the start up code to pass the FPGA image from Flash to the FPGA fast parallel port (FPP) I knew something was wrong. Start up was taking about 12 seconds to load the image from Flash and push it through to the FPGA. The FPP could handle 100MB/s, so this was not the bottleneck. Loading the flash image seemed to take an extraordinarily long time. During debugging, I used a delay_s() function to pace the messages I was printing on the debug port, and a one second delay took exactly 10 seconds to expire. I traced the code and I could see that nothing was wrong with the library or my config settings.
I eventually traced the problem to a missing sysclk_init() in my code. This initialises the PLL so that the CPU clock runs at the speed of the PLL output. Otherwise, it runs at the input clock speed: in my case, 12MHz instead of 120MHz. I have NO idea how my code actually worked without it, or how USB starts up. I guess sysclk_init doesn’t manage the USB clock driver. After including this call in the initialisation routines, my code suddenly runs ten times faster, as the clock speed jumps from the clock input rate of 12MHz to the PLL output rate of 120MHz!
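The tenfold delay is what you would expect if the delay routine is a cycle-counting loop sized for the configured 120MHz clock while the core is really running at 12MHz. That is an assumption about how the delay is implemented, not something verified here. A small C model of the effect (`actual_delay_ms` and the two constants are names invented for this sketch):

```c
#include <stdint.h>

#define XTAL_HZ 12000000u   /* input clock from the post */
#define PLL_HZ  120000000u  /* PLL output clock from the post */

/* Wall-clock milliseconds a busy-wait delay really takes when its
 * cycle count was computed for `assumed_hz` but the core runs at
 * `actual_hz`. Assumes the delay is a simple cycle-counting loop. */
uint32_t actual_delay_ms(uint32_t requested_ms,
                         uint32_t assumed_hz, uint32_t actual_hz)
{
    uint64_t cycles = (uint64_t)requested_ms * (assumed_hz / 1000u);
    return (uint32_t)(cycles / (actual_hz / 1000u));
}
```

`actual_delay_ms(1000, PLL_HZ, XTAL_HZ)` gives 10000, matching the one second delay that took ten seconds, and `actual_delay_ms(1000, PLL_HZ, PLL_HZ)` gives 1000 once the PLL is actually running. In ASF projects the fix is typically a single sysclk_init() call near the top of main(), before the rest of the board set-up.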
The time for data uploads over USB dropped from 30 seconds to about 3 seconds. However, once the speed rocketed, I started to experience problems with other components that had previously been reliable. The Flash interface stopped working and, tracing the code, I found that functions in nand_flash_raw_smc.c hung when accessing Flash. There is no apparent reason for this, but rather than waste time trying to find out why, I just worked around the problem by adding __attribute__((optimize(0))) before the affected function definitions. This turns off optimisation for these functions and fixed the Flash issue. With such a small piece of code, it didn’t really affect performance, so I was happy to leave this as-is. The MRAM also stopped working, as the SPI interface now ran at its maximum rate. This was fixed by adding a microsecond delay after chip select and, for some reason, another between sending the low and high bytes of the address. Weird, but OK, I’ll live with it for now.
I’ve seen problems like this in the Atmel ASF before: when you turn on optimisations, some of the standard library functions stop working. I guess they’re not tested against fully optimised code. If you come across this problem yourself, my advice is to change the optimisation of the “Debug” version to -O3 and trace the code until it hangs. Try adding __attribute__((optimize(0))) above the offending function to see if this fixes the problem. You can then turn optimisation off again (-O0) in the Debug version and the Release version should then work properly.
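For anyone who hasn't used it, this is roughly what the workaround looks like in GCC. The function below is a hypothetical stand-in for a driver routine that polls a status register, not the real code from nand_flash_raw_smc.c, and the `NO_OPTIMIZE` macro, the bounded poll count and the parameter names are all additions for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* The optimize attribute is GCC-specific, so it is hidden behind a
 * macro that expands to nothing on other compilers. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_OPTIMIZE __attribute__((optimize(0)))
#else
#define NO_OPTIMIZE
#endif

/* Hypothetical polling routine, compiled at -O0 even when the rest
 * of the file is built at -O3. Returns true once any bit in
 * `ready_mask` is set, or false after `max_polls` attempts. */
NO_OPTIMIZE bool wait_until_ready(const volatile uint32_t *status,
                                  uint32_t ready_mask, uint32_t max_polls)
{
    while (max_polls-- > 0u) {
        if ((*status & ready_mask) != 0u)
            return true;
    }
    return false;
}
```

The GCC manual describes this attribute as intended for debugging rather than production code, so it is best treated as a stopgap. Hangs that only show up with optimisation on often point to a hardware register accessed without volatile, or to a timing assumption that no longer holds at the higher clock speed.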
Boot from Flash on power on takes about half a second, including the time it takes for the SAM4S to start, so I was pretty pleased with the performance. Upload to Flash (option 3 in the screen grab) or direct programming (option 1) is roughly equivalent at 3-4 seconds. Good enough for prototype code. If you’re interested, my code is posted on GitHub.
Enough of this monitor program is now functional to move on to coding the SPI interface in the FPGA. The SPI interface is a key function as it handles both the debug port and the backing-store requests from the FPGA. It sits outside the CPC HDL and handles traffic between the FPGA and SAM4S at 40MHz, or 5MB/s.
This is my next target. I hope to have the SPI interface connected to the support Z80 CPU before I’m back at work in the New Year.
Update: Actually, my next post turned out to be a side project evaluating the eMMC chip.