AN 870: Stencil Computation Reference Design
In every version of a stencil code, a specific update rule is applied to each element of a matrix in a sweep. Newly calculated values are piped forward into the next sweep. Cells within the same sweep have no dependencies on one another and can be calculated in parallel. Sweeps can also be calculated in parallel if the results from one sweep are continuously piped into the next. FPGAs suit this pattern well because they provide:
- Large internal and external bandwidth
- A highly pipelined and connected fabric that allows the massive parallelization inherent in a stencil code to be exploited to achieve minimal execution times
Four-Point 2D Stencil / 2D Jacobi Iteration Algorithm
In the 1D representation of a 2D matrix, assuming each row has length N and there are M rows, the 4-point neighborhood/stencil of any individual point x is given by {x-N, x+N, x-1, x+1}, where all values outside of the boundary of the matrix are assumed to be 0. Assume we have matrix A with values {x0, x1, x2, x3, …}, where the index i is in the range 0 to N×M−1. The matrix initially begins at timestep T0. The values at T1 are defined as: T1(xi) = 0.25 * (T0(xi+1) + T0(xi-1) + T0(xi+N) + T0(xi-N)).
Now assume we have a 2D representation of a 2D matrix B with values {x0y0, x1y0, x0y1, x1y1, …}, where x is in the range 0 to N−1 and y is in the range 0 to M−1. The matrix initially begins at timestep T0. The values at T1 are defined as follows:
T1(xiyj) = 0.25 * (T0(xiyj-1) + T0(xiyj+1) + T0(xi-1yj) + T0(xi+1yj))
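For reference, the update above can be written as a minimal single-work-item OpenCL kernel using the 1D indexing from the first formulation. This is a sketch only; the kernel name and parameters are illustrative and are not taken from the reference design.

```c
/* One Jacobi sweep over an M-row, N-column matrix stored row-major in a
 * 1D buffer. Out-of-bounds neighbors are treated as 0, as described above. */
__kernel void jacobi_sweep(__global const float *restrict t0,
                           __global float *restrict t1,
                           int N, int M)
{
    for (int i = 0; i < N * M; ++i) {
        float left  = (i % N != 0)     ? t0[i - 1] : 0.0f; /* x-1 */
        float right = (i % N != N - 1) ? t0[i + 1] : 0.0f; /* x+1 */
        float up    = (i - N >= 0)     ? t0[i - N] : 0.0f; /* x-N */
        float down  = (i + N < N * M)  ? t0[i + N] : 0.0f; /* x+N */
        t1[i] = 0.25f * (left + right + up + down);
    }
}
```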
The main idea behind this specific stencil pattern is to iteratively update the data within a matrix until the values converge (Jacobi iteration). Performance is optimized by using caching and feed-forward techniques to reduce reads from global memory (DDR) as much as possible, ideally to a single read of each element. Data is typically retrieved from real-world measurements, but for the sake of testing, all data in the associated reference design is pseudorandom.
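A host-side convergence loop in the spirit of Jacobi iteration might look like the following sketch. Here jacobi_sweep_cpu is a hypothetical CPU port of the sweep above, and the max-delta tolerance test is one common convergence criterion, not necessarily the one used by the reference design.

```c
#include <math.h>

/* Hypothetical CPU port of the sweep kernel shown earlier. */
void jacobi_sweep_cpu(const float *t0, float *t1, int N, int M);

/* Iterate sweeps, swapping (ping-ponging) the two buffers, until the
 * largest per-element change falls below a tolerance. */
void jacobi_until_converged(float *a, float *b, int N, int M, float tol)
{
    float delta;
    do {
        jacobi_sweep_cpu(a, b, N, M);   /* writes the new timestep into b */
        delta = 0.0f;
        for (int i = 0; i < N * M; ++i)
            delta = fmaxf(delta, fabsf(b[i] - a[i]));
        float *tmp = a; a = b; b = tmp; /* ping-pong the buffers */
    } while (delta >= tol);
}
```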
OpenCL Design
Stencil computations are, in general, memory-bound applications. The optimizations included in this OpenCL* reference design seek to leverage the power of an FPGA by parallelizing as much as possible and by using channels/pipes to saturate the bandwidth from on-board DDR, maximizing GFLOPS and minimizing execution time.
Data is initially read from global memory by the compute unit (CU) named "feeder" and written back to global memory by the "writer" CU. Data is read and written in blocks of 16 floats at a time, and the net data bandwidth depends on the frequency of the kernel and the read/write rate supported by the on-board DDR and memory controller. Global memory provides:
- A source location where data from the host is sent via PCIe to be calculated
- An output location where data is read by the host after kernel execution
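The feeder/writer pair can be sketched with Intel FPGA OpenCL channels as follows. The channel names, depths, and kernel signatures are assumptions; the calculation CUs described below sit between the two channels.

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

/* Hypothetical channels into and out of the calculation chain. */
channel float16 chain_in  __attribute__((depth(64)));
channel float16 chain_out __attribute__((depth(64)));

/* Streams 16-float blocks from DDR into the chain. */
__kernel void feeder(__global const float16 *restrict src, uint n_blocks)
{
    for (uint i = 0; i < n_blocks; ++i)
        write_channel_intel(chain_in, src[i]);
}

/* Drains calculated 16-float blocks from the chain back to DDR. */
__kernel void writer(__global float16 *restrict dst, uint n_blocks)
{
    for (uint i = 0; i < n_blocks; ++i)
        dst[i] = read_channel_intel(chain_out);
}
```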
The following diagram captures the flow of data into and out of the kernel system:
Between the feeder and writer CUs is the main part of the kernel system: a series of chained and replicated calculation CUs. Data flows into the first calculation CU in the same order it was read from DDR, and the calculated data is piped into the next CU in the chain in the same order. Data coming into a CU is cached, boundaries are updated, and the result of each calculation is immediately sent into the next kernel. In this way, every CU, and hence every iteration, can be calculated in parallel once enough data has been sent through the chain.
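One common way to realize such a chain in Intel FPGA OpenCL is to declare an array of channels and stamp out the replicated CUs with a macro, since every kernel needs a unique name. The sketch below is illustrative only: NUM_CUS, link, process_block, and the kernel names are assumptions, not identifiers from the reference design.

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define NUM_CUS 30                 /* hypothetical chain length */
channel float16 link[NUM_CUS + 1]; /* link[0] fed by the feeder,
                                      link[NUM_CUS] drained by the writer */

/* Placeholder for the per-CU caching and stencil arithmetic
 * described below; a real CU does far more than pass data through. */
float16 process_block(float16 b) { return b; }

#define DEFINE_CALC_CU(i)                                       \
    __kernel void calc_cu_##i(uint n_blocks)                    \
    {                                                           \
        for (uint b = 0; b < n_blocks; ++b)                     \
            write_channel_intel(link[(i) + 1],                  \
                process_block(read_channel_intel(link[(i)])));  \
    }

DEFINE_CALC_CU(0)
DEFINE_CALC_CU(1)
/* ... one instance per iteration in the chain ... */
```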
Within each of these calculation CUs is a local memory system called "cache" that is composed of M20K blocks adjacent to the CUs. The cache must be large enough to store an entire row of the incoming matrix; the height of the matrix can be as large as device memory allows. Matrix attributes are fed forward through the system before stencil computations begin.
Because of the nature of the computation, only 14 of the 16 floating-point values being forwarded can be calculated at a time, so boundary conditions are updated between CUs. Each element requires both its left and right neighbors to perform a stencil computation, so the first and last elements in a block cannot be calculated within that block. The entire computation requires a total of 3 blocks of 16 floats to be loaded into private registers: the row being calculated and its top and bottom neighbors. After a calculation is performed, one block with 14 new and 2 outdated values is piped to the next CU.
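Putting the row cache and the 14-of-16 boundary behavior together, a simplified calculation CU might look like the following sketch. ROW_BLOCKS, the channel names, and stencil_block are hypothetical; the matrix-attribute header, the top/bottom boundary rows, and the inter-CU boundary fix-up are omitted for brevity.

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define ROW_BLOCKS 256   /* hypothetical: 16-float blocks per matrix row */

channel float16 prev_ch; /* from the previous CU in the chain */
channel float16 next_ch; /* to the next CU in the chain */

/* Stencil update for one 16-float block. Lanes 1..14 have both
 * horizontal neighbors inside the block and get new values; lanes 0
 * and 15 are forwarded with outdated values, as described above. */
float16 stencil_block(float16 top, float16 mid, float16 bot)
{
    float t[16], m[16], b[16], o[16];
    vstore16(top, 0, t);
    vstore16(mid, 0, m);
    vstore16(bot, 0, b);
    o[0]  = m[0];    /* left neighbor lives in the previous block */
    o[15] = m[15];   /* right neighbor lives in the next block */
    #pragma unroll
    for (int j = 1; j < 15; ++j)
        o[j] = 0.25f * (t[j] + b[j] + m[j - 1] + m[j + 1]);
    return vload16(0, o);
}

/* A sliding window ("cache") holds two full rows plus one block, so
 * when a block of row r+1 arrives, the matching blocks of rows r and
 * r-1 are still on chip; on the FPGA this storage maps onto M20K
 * memory. In the real design the window is sized at run time from the
 * forwarded matrix attributes. */
__kernel void calc_cu(uint n_blocks)
{
    float16 win[2 * ROW_BLOCKS + 1];
    for (uint i = 0; i < n_blocks; ++i) {
        #pragma unroll
        for (int k = 2 * ROW_BLOCKS; k > 0; --k)
            win[k] = win[k - 1];
        win[0] = read_channel_intel(prev_ch);
        if (i >= 2 * ROW_BLOCKS)
            write_channel_intel(next_ch,
                stencil_block(win[2 * ROW_BLOCKS],   /* top row    */
                              win[ROW_BLOCKS],       /* middle row */
                              win[0]));              /* bottom row */
    }
}
```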
Performance
The performance of the reference design was compared against the following GPUs and CPUs:
- GPUs
  - NVIDIA* Tesla* C2075 companion processor (C2075)
  - NVIDIA* GeForce* GTX 760 graphics card (GTX760)
  - NVIDIA* GeForce* GTX 960 graphics card (GTX960)
- CPUs
  - Intel® Xeon® processor E5-1650 v3 (E5-1650 V3)
  - Intel® Core™ i7-4960X processor Extreme Edition (i7-4960X)
Thirty kernels were chained together in a feed-forward arrangement to perform 30 iterations of the stencil algorithm in parallel. Each kernel began executing as soon as it had received enough data from the previous kernel in the chain.
Processor | Execution Time (s) |
---|---|
A10GX1150 | 42.455 |
C2075 | 232.4 |
GTX760 | 176.5 |
GTX960 | 111.7 |
E5-1650 V3 | 258.9 |
i7-4960X | 260.2 |
Performance can be improved further by:
- Compiling with a newer version of the Intel® Quartus® Prime Design Suite
- Fitting more calculation kernels in the chain
- Using an FPGA device that is larger, faster, or both
- Removing profiling hardware
The following table shows how performance scales with the number of chained calculation kernels for three data set sizes:

No. of Kernels | 4088x65536 Exec. Time (ms) | 4088x65536 GFLOPS | 4088x65536 Stall % (Worst) | 4088x32760 Exec. Time (ms) | 4088x32760 GFLOPS | 4088x32760 Stall % (Worst) | 4088x4088 Exec. Time (ms) | 4088x4088 GFLOPS | 4088x4088 Stall % (Worst)
---|---|---|---|---|---|---|---|---|---
1 | 151.34 | 7.081040518 | 29.9 | 75.75 | 7.0718352 | 29.25 | 9.71 | 6.8843 | 34.41
2 | 148.89 | 14.39511951 | 34.74 | 74.56 | 14.369408 | 34.67 | 9.62 | 13.898 | 38.03
3 | 150.16 | 21.41005605 | 32.45 | 75.1 | 21.399129 | 32.44 | 9.2 | 21.798 | 30.05
10 | 150.33 | 71.28614861 | 35.03 | 75.23 | 71.207167 | 34.99 | 9.51 | 70.291 | 35.23
20 | 164.24 | 130.4974028 | 19.78 | 82.12 | 130.46554 | 20.09 | 10.38 | 128.8 | 33.22
28 | 165.77 | 181.0101394 | 15.41 | 82.88 | 180.97686 | 15.13 | 10.45 | 179.11 | 22.77
29 | 162.62 | 191.1062322 | 22.53 | 81.4 | 190.84833 | 22.78 | 10.42 | 186.04 | 15.94
30 | 165.8 | 193.9043435 | 17.46 | 82.92 | 193.81025 | 17.35 | 10.43 | 192.27 | 12.03
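The GFLOPS figures are consistent with counting four floating-point operations per matrix element per chained kernel (three additions and one multiplication in each stencil update). For example, the 30-kernel entry for the 4088x65536 data set checks out as follows:

```c
#include <stdio.h>

/* GFLOPS = kernels * 4 FLOPs * elements / seconds. For 30 kernels on
 * the 4088x65536 data set at 165.8 ms:
 *   30 * 4 * 4088 * 65536 / 0.1658 s ≈ 193.9 GFLOPS */
int main(void)
{
    double flops = 30.0 * 4.0 * 4088.0 * 65536.0;
    printf("%.1f GFLOPS\n", flops / 0.1658 / 1e9); /* prints 193.9 */
    return 0;
}
```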
The following heat maps show the input data and the convergence of values after running the kernel. The first pair of images represents the unprocessed raw input: 280x280 grids where the values in the left map were created with a pseudorandom function and the values in the right map were created by looping from 1 to 100.
To estimate the execution time for a run requiring more than 30 iterations, multiply the time 30 iterations take to run by the desired number of iterations divided by 30, that is, t(n) ≈ (n / 30) × t30. For example, 300 iterations applied to a 4088x65536 data set should take around (300 / 30) × 165.8 ms ≈ 1658 ms.
The following heat map shows the result of running the initial values through a kernel with 30 chained calculation CUs. You can see the beginnings of convergence emerge.
Document Revision History for AN 870: Stencil Computation Reference Design
Document Version | Changes |
---|---|
2018.10.10 | Initial release. |