In the last post, I laid out the hardware and architecture planned for this USB-sized 8-bit home computer. Since then, I’ve been developing the various components in Verilog and testing some of the unknowns. Specifically, I’ve been trying to work out whether the memory chip has the required bandwidth for my device.
To do this, I need to study the datasheet for the SDRAM device and understand the best and worst case operating parameters. One timing diagram for a burst read operation is shown below:
We can see from this that there is a delay between the read operation and the availability of the data. The example above shows that for every 6 clocks, data is available during 4 of those T-cycles. Combining this information with the expected operating profile is critical to determining whether we can service the expected memory requests.
Let’s review the architecture diagram from my last post, below.
Focus specifically on the Caching SDRAM controller. There’s a lot of input/output required of this controller.
On the face of it, the SDRAM device that I have selected has a reasonable specification: a 166MHz clock delivering 16-bit words, giving us 332MB/s. However, that throughput is only achievable when reading sequential memory locations; for a sequential read, a data word is available on every clock.
A random read, however, is much slower. I have used the worst-case scenario for every calculation, so that I know whether the system will still perform even if everything goes the wrong way.
To follow the calculations below you’ll need to understand how SDRAM works, so I suggest a quick review of the FPGA4FUN site that discusses SDRAM functionality.
I’m using 8-word bursts (more on that later). Following the datasheet, a random read could entail the following SDRAM commands:
- Precharge all (close all active banks, 3 clocks)
- BankActivate the chosen row (can be between 3 and 10 clocks)
- Read with auto-precharge (2 clocks of latency and 8 clocks of data)
Adding all that up gives me 3+10+10 = 23 clocks for a random read. At 166MHz this is approximately 139ns. If the byte I’m seeking is byte 8 of the burst, then I could be waiting the full 139ns for a single byte.
If every read is completely random, then the maximum number of reads I can make in one second is 1/139ns, approximately 7.2 million transfers per second. Each transfer is 16 bits, so 14.4MB/s is the maximum if every read is completely random, about 4.3% of the interface’s capacity.
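As a sanity check, these worst-case figures can be captured directly in the Verilog design. The following is a minimal sketch, runnable in any simulator; the parameter names are my own, and the values are the datasheet figures discussed above:

```verilog
// Worst-case random-read timing, in clocks at 166MHz (~6.02ns per clock).
// Values are the datasheet figures discussed above; names are illustrative.
module sdram_timing_check;
  localparam integer T_PRECHARGE_ALL = 3;   // Precharge all banks
  localparam integer T_BANK_ACTIVATE = 10;  // BankActivate, worst case
  localparam integer T_CAS_LATENCY   = 2;   // latency before the first word
  localparam integer T_BURST         = 8;   // 8-word burst, one word per clock

  localparam integer T_RANDOM_READ = T_PRECHARGE_ALL + T_BANK_ACTIVATE
                                   + T_CAS_LATENCY + T_BURST;  // 23 clocks

  initial
    // 23 clocks at 166MHz is roughly 139ns per worst-case random read
    $display("Random read: %0d clocks, %f ns",
             T_RANDOM_READ, T_RANDOM_READ * 1000.0 / 166.0);
endmodule
```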
Calculating the data transfer requirements of each of the components (tallied in the sketch after this list):
- RAM/ROM controller = Read/Write 1MB/s as data accesses are every 4th clock (approximately)
- Gate array/CRTC = Read 16K scanned 50 times per second, so 819200B/s
- The PAL renderer needs 832 x 288 pixels x 50 frames/s = Write 11.9808MB/s
- The HDMI upscaler needs to read the output of the PAL renderer 60 times per second, so Read 14.37696MB/s
- The supervisor CPU will run at 48MHz, so requires Read/Write 12MB/s (assuming mostly sequential reads)
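These estimates can be tallied in a throwaway simulation module to keep the budget honest as the design evolves. A minimal sketch; the figures are simply the estimates listed above, and the names are my own:

```verilog
// Bandwidth budget in MB/s, straight from the list above.
// Simulation-only; these are estimates, not measured values.
module bandwidth_budget;
  localparam real CPC_CPU    = 1.0;       // RAM/ROM controller, read/write
  localparam real CRTC_GA    = 0.8192;    // 16K read 50 times per second
  localparam real PAL_WRITE  = 11.9808;   // 832 x 288 x 50 frames/s
  localparam real HDMI_READ  = 14.37696;  // PAL frame read 60 times per second
  localparam real SUPERVISOR = 12.0;      // 48MHz Z80, access every 4th clock

  initial
    $display("Total: %f MB/s",            // prints 40.17696
             CPC_CPU + CRTC_GA + PAL_WRITE + HDMI_READ + SUPERVISOR);
endmodule
```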
The total for this is 40.17696MB/s. This seems fairly modest for an interface that can serve 332MB/s. However, as I noted before, 332MB/s assumes sequential access, and some of our accesses will be random for the two CPUs. The video processing units help here, as they either read or write in large sequential blocks, so we get closer to the full 332MB/s for those memory accesses.
So let’s calculate the proportion of time available for random access by taking out the time required for the video accesses. Excluding the 13MB/s for the random CPU accesses, we need approximately 27.18MB/s for the video hardware.
At 139ns for each 8-byte burst, that is about 3.40M accesses per second for the video hardware. Multiplying by 139ns per access gives roughly 472ms of data transfer in each second for the video handling. This leaves about 528ms for random access. At 139ns each, this gives us about 3.8M random accesses per second. Excluding the non-negotiable 1M for the CPC, this leaves only about 2.8M for the supervisor CPU, far short of the 12M required by the 48MHz supervisor CPU.
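The same split can be checked in simulation. This is a minimal sketch of the arithmetic above, keeping the 8-bytes-per-139ns-burst convention; the module and names are illustrative:

```verilog
// Split one second between video bursts and random accesses.
// Follows the 8-bytes-per-139ns-burst convention used above.
module time_budget;
  localparam real VIDEO_MBPS      = 27.17696; // CRTC + PAL + HDMI, from above
  localparam real BURST_NS        = 139.0;
  localparam real BYTES_PER_BURST = 8.0;

  real video_ms, random_per_sec;

  initial begin
    // ~3.40M video bursts/s x 139ns is roughly 472ms of every second
    video_ms       = (VIDEO_MBPS * 1.0e6 / BYTES_PER_BURST) * BURST_NS * 1.0e-6;
    // the remaining ~528ms allows ~3.8M random accesses per second
    random_per_sec = (1000.0 - video_ms) * 1.0e6 / BURST_NS;
    $display("video: %f ms/s, random: %f accesses/s", video_ms, random_per_sec);
  end
endmodule
```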
The solution to this is to use the WAIT signal on the Z80 supervisor CPU, so that it waits while the memory is busy doing other things. If we’re lucky, the cache will handle subsequent requests, so a further wait may not be required for another 7 T-cycles. The memory controller will serve memory accesses in the following priority order (a minimal arbiter sketch follows the list):
- CPC CPU – this cannot wait on memory access or the emulation is likely to suffer, so this will always get priority if it needs an access.
- CRTC/GA – servicing the data requirement of this component will ensure the fidelity of the resulting image. Split mode displays and clever switching techniques will only work if this is serviced in a timely manner.
- The PAL renderer will take the output of the CRTC/GA and store this as an 832x288 image updated 50 times per second, frame-switching at each fly-back.
- The HDMI upscaler will take the data stored in the PAL frame and upscale this for display through the HDMI output, 60 times per second.
- The supervisor CPU is not time-critical, so it will have the lowest priority.
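A fixed-priority scheme like this is cheap to express in Verilog. Below is a minimal sketch of the idea, not the actual controller; the request/grant interface and signal names are my own:

```verilog
// Fixed-priority arbiter for the five SDRAM requesters, highest priority first.
// A sketch of the scheme above; the interface is illustrative.
module sdram_arbiter (
    input  wire [4:0] req,   // [0]=CPC CPU, [1]=CRTC/GA, [2]=PAL, [3]=HDMI, [4]=supervisor
    output reg  [4:0] grant  // one-hot grant; highest-priority requester wins
);
  always @* begin
    grant = 5'b00000;
    if      (req[0]) grant = 5'b00001;  // CPC CPU: must never be kept waiting
    else if (req[1]) grant = 5'b00010;  // CRTC/GA: keeps the image faithful
    else if (req[2]) grant = 5'b00100;  // PAL renderer
    else if (req[3]) grant = 5'b01000;  // HDMI upscaler
    else if (req[4]) grant = 5'b10000;  // supervisor CPU: lowest priority
  end
endmodule
```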
Write access is easy to schedule, as writes can be queued and committed to the SDRAM later. It’s the read accesses that are critical.
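Posting writes like this amounts to putting a small FIFO in front of the controller: the requester’s write completes immediately, and the controller drains the queue whenever the SDRAM is free. A minimal sketch, with illustrative widths (a 21-bit address plus 16-bit data):

```verilog
// Minimal posted-write queue: writes are accepted immediately and drained
// into the SDRAM whenever the controller is idle. Sizes are illustrative.
module write_fifo #(
    parameter DEPTH_BITS = 4               // 16 entries
) (
    input  wire        clk,
    input  wire        wr_en,              // requester posts a write
    input  wire [36:0] wr_data,            // {addr[20:0], data[15:0]}, for example
    input  wire        rd_en,              // controller drains when idle
    output wire [36:0] rd_data,
    output wire        empty,
    output wire        full
);
  reg [36:0] mem [0:(1<<DEPTH_BITS)-1];
  reg [DEPTH_BITS:0] wr_ptr = 0, rd_ptr = 0;  // extra bit distinguishes full/empty

  always @(posedge clk) begin
    if (wr_en && !full) begin
      mem[wr_ptr[DEPTH_BITS-1:0]] <= wr_data;
      wr_ptr <= wr_ptr + 1;
    end
    if (rd_en && !empty)
      rd_ptr <= rd_ptr + 1;
  end

  assign rd_data = mem[rd_ptr[DEPTH_BITS-1:0]];
  assign empty   = (wr_ptr == rd_ptr);
  assign full    = (wr_ptr - rd_ptr) == (1 << DEPTH_BITS);
endmodule
```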
Let’s look at the latency of the requests and see if the components will get their data in time. As I mentioned before, the CPC CPU must not wait for its data. The Z80 datasheet shows that data is read into the CPU in the second half of the second T-cycle:
This diagram shows me that I have 1.5 T-cycles to get the data ready for the CPU. At 4MHz, this is 375ns. A single random read can take 139ns, but if we’re really unlucky the memory controller will have just started servicing another request, so we may have to wait almost a further 139ns for our memory cycle to start. It may therefore be up to 277ns before we get our random byte. However, this is within the 375ns latency requirement of the CPC CPU, so no waiting is required.
Latency for the video modules is not as critical. The requests from these modules are fairly easy to anticipate, and they’re sequential reads, something the SDRAM excels at. The CRTC/PAL module reads the top 16K of memory 50 times per second, and with the border area taking about 25% of the scan line width, we have 12% (half on each side of the frame) of the scan line in which to read the memory for the stream of bytes for the next line. The scan rate is 50Hz x 288 lines, so 14.4K lines per second, or 69µs per line. 12% of this is 8333ns, plenty of time to start reading the memory for the next block of data. Our video controller will need to be fairly smart to make sure it stays ahead of the data requests.
The PAL renderer is write-only, so latency does not apply here. The data will get written eventually; it doesn’t really matter when, so long as it happens.
The HDMI upscaler will read the PAL frame in a fairly predictable manner, and we will have plenty of notice for the next request, so I’m not expecting any latency misses here. I’ll be upscaling an image in a frame buffer, so we don’t have to worry about the data changing and can queue the data for display.
Finally, the supervisor CPU has a much greater capacity for random reads, and with each 48MHz T-cycle lasting about 20.8ns, this is far tighter than a single 139ns burst, so we’ll need to use the WAIT signal to halt the processor when the data is not ready. I’m expecting the caching mechanism of the SDRAM controller to service the other 7 bytes of the request fairly efficiently, so hopefully the 48MHz speed of the CPU will not be wasted.
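The WAIT logic itself can be very simple: hold /WAIT low whenever the supervisor CPU has a read outstanding that neither the cache nor the SDRAM has answered yet. A minimal sketch, assuming active-low Z80 strobes and a hit/ready pair from the memory controller; all signal names are my own:

```verilog
// Hold the supervisor Z80 in a wait state until its read data is ready.
// Signal names are illustrative; /WAIT, /MREQ and /RD are active low.
module supervisor_wait (
    input  wire clk,          // 48MHz supervisor clock
    input  wire mreq_n,       // Z80 memory request
    input  wire rd_n,         // Z80 read strobe
    input  wire cache_hit,    // cache can answer this access
    input  wire data_ready,   // SDRAM data has arrived
    output reg  wait_n = 1'b1 // Z80 /WAIT
);
  always @(posedge clk)
    if (!mreq_n && !rd_n && !cache_hit && !data_ready)
      wait_n <= 1'b0;         // stall the CPU while the SDRAM works
    else
      wait_n <= 1'b1;
endmodule
```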
In summary, I’m satisfied that a single SDRAM chip will be able to service all 5 of these components and meet the timing requirements. We can move ahead with the hardware design, knowing that the physical components should hang together.
The next post covers creating new components in the design tool to begin developing the schematic.