Nios II Flash Accelerator Using Max10
Introduction
As part of the initiative to improve Nios II/f "fast" core performance in real-time applications, the Nios II flash accelerator is introduced. The accelerator is optimized to fetch instructions from flash memory and cache them in registers for fast instruction access.
The flash accelerator is implemented in a Max 10 device that runs instructions directly (execute in place) from the user flash memory (UFM) of a Max 10 integrated on-chip flash. The Max 10 flash memory operates at 120 MHz, producing 32 bits of data each cycle. A read access to the UFM through a normal Nios II instruction master incurs five cycles of wait states each time because the Nios II instruction master does not support bursting at the UFM burst boundary. The wait states are the number of clock cycles required for the data to become available at the output of the on-chip flash.
The flash accelerator takes advantage of the wait states by performing the next cache line fetch (pre-fetch) during the five wait-state cycles. The on-chip flash IP latches the next read address while the read data of the previous transaction becomes available at its data registers. Valid data arriving at the flash accelerator is stored into a fully associative cache implemented in registers. This speeds up instruction execution when running from high-latency memory such as flash memory.
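As a rough illustration of why the pre-fetch helps, the cycle counts for fetching a run of sequential 32-bit words can be sketched in a small model. The constant and function names below are illustrative, not part of the IP:

```python
# Illustrative timing model of sequential fetches from the on-chip flash.
# FLASH_WAIT_STATES matches the five-cycle wait state described above;
# everything else is a simplifying assumption, not the IP's exact behavior.
FLASH_WAIT_STATES = 5

def cycles_without_prefetch(n_words):
    """Every fetch pays the full wait state; nothing overlaps."""
    return n_words * FLASH_WAIT_STATES

def cycles_with_prefetch(n_words):
    """Only the first fetch pays the wait state; later sequential words
    are pre-fetched during the wait cycles and stream at 1 cycle/word."""
    if n_words == 0:
        return 0
    return FLASH_WAIT_STATES + (n_words - 1)

# Fetching 8 sequential words: 40 cycles without overlap vs 12 with it.
print(cycles_without_prefetch(8), cycles_with_prefetch(8))
```

This matches the performance table later in this document: random access still costs the full wait state, while sequential access approaches one cycle per word.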
The use of the flash accelerator is not limited to the Max 10 on-chip flash IP; it can also be connected to other memory devices through its standard Avalon-MM master interface. It is suitable for Nios II systems that require a smaller cache and do not use any memory blocks.
Feature Description
The Nios II flash accelerator provides a small, fully associative cache whose line size and cache size are both user-configurable. The cache is constructed from registers. Standard Nios II cache management instructions such as flushi and initi can be used to flush or initialize the flash accelerator cache. The flash accelerator supports sequential instruction pre-fetch to improve performance for sequential instruction execution.
Cache Operation Block
This block detects cache misses and triggers read requests to the Address Generator. It also handles cache fill operations: incoming read data is filled into the cache lines using a Least Recently Used (LRU) replacement policy so that the most recently used cache lines remain in the cache. The block also handles Nios II cache management instructions such as flushi and initi.
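The fill and replacement behavior described above can be sketched in software. This is a minimal model, assuming flushi and initi simply discard all lines; the class and method names are illustrative, not from the IP:

```python
from collections import OrderedDict

class RegisterCacheModel:
    """Minimal fully associative cache with LRU replacement (illustrative)."""

    def __init__(self, n_lines):
        self.n_lines = n_lines
        self.lines = OrderedDict()  # tag -> line data, ordered by recency

    def lookup(self, tag):
        """Return True on a hit; a miss would trigger a read request."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True
        return False

    def fill(self, tag, data):
        """Fill a line, evicting the least recently used line when full."""
        if tag not in self.lines and len(self.lines) >= self.n_lines:
            self.lines.popitem(last=False)  # evict the LRU line
        self.lines[tag] = data
        self.lines.move_to_end(tag)

    def flush(self):
        """Model flushi/initi: discard every cached line."""
        self.lines.clear()
```

With a two-line cache, filling a third line evicts whichever of the first two was touched least recently.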
Pre-fetch Block
The Pre-fetch block contains circuitry that detects the next sequential cache line to be executed relative to the currently executed cache line. If the next line is predicted to miss and no cache operation is in progress (idle mode), the Pre-fetch block triggers a read request to the Address Generator.
Address Generator Block
Cache (Registers) Block
The two main cache components of this block, the cache tag and the cache data, are composed of registers. The cache tag stores part of the address, while the cache data stores valid read data. The cache data size is configurable through Qsys parameters.
Interfaces
The flash accelerator exposes a 32-bit, read-only Avalon-MM master interface. It supports wrapping bursts that start with the critical word first. The burst size is configurable through the Line Size parameter: the burst count equals the Line Size divided by the 32-bit data width.
Parameter
Parameter | Usage | Configurable Option |
---|---|---|
Line Size | Determines the width of each cache line in bits: Burstcount size = Line Size (bits)/32 | |
Cache Size | Determines the number of cache lines for the cache data | 2, 4 |
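The Burstcount relationship in the Line Size row can be expressed as a small helper; the function name is illustrative:

```python
def burstcount(line_size_bits):
    """Burstcount size = Line Size (bits) / 32, per the parameter table."""
    if line_size_bits % 32 != 0:
        raise ValueError("line size must be a multiple of the 32-bit data width")
    return line_size_bits // 32

# A 64-bit line bursts 2 words; a 128-bit line bursts 4 words.
print(burstcount(64), burstcount(128))
```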
Performance
Nios II Instruction Interface | Random Access (cycles) | Sequential Access (cycles) |
---|---|---|
Normal | 5 | 5 |
Flash Accelerator | 5 | 1 |
For a normal instruction master, fetching sequential addresses from the Max 10 On-Chip Flash IP counts as random access because, by default, it does not take advantage of the burst read feature of the Flash IP. Enabling the optional burstcount signal at the instruction master does not improve sequential access either, because the wrapping burst of 8 from the instruction master is translated into equivalent incremental bursts of 2 or 4 at the Flash IP. The optimal setup is to ensure that the master and the slave have the same burst count size.
Max 10 Device | Cache Line Size (bits) |
---|---|
10M08 | 64 |
10M16 | 128 |
10M25 | 128 |
10M50 | 128 |

Document Revision History
Date | Version | Changes |
---|---|---|
June 2015 | 2015.06.30 | Initial release |