Intel FPGA SDK for OpenCL Standard Edition: Best Practices Guide
Introduction to Intel FPGA SDK for OpenCL Standard Edition Best Practices Guide
This document assumes that you are familiar with OpenCL concepts and application programming interfaces (APIs), as described in the OpenCL Specification version 1.0 by the Khronos Group™. It also assumes that you have experience in creating OpenCL applications.
To achieve the highest performance of your OpenCL™ application for FPGAs, familiarize yourself with details of the underlying hardware. In addition, understand the compiler optimizations that convert and map your OpenCL application to FPGAs.
For more information on the OpenCL Specification version 1.0, refer to the OpenCL Reference Pages on the Khronos Group website. For detailed information on the OpenCL APIs and programming language, refer to the OpenCL Specification version 1.0.
FPGA Overview

With FPGAs, low-level operations like bit masking, shifting, and addition are all configurable. Also, you can assemble these operations in any order. To implement computation pipelines, FPGAs integrate combinations of lookup tables (LUTs), registers, on-chip memories, and arithmetic hardware (for example, digital signal processor (DSP) blocks) through a network of reconfigurable connections. As a result, FPGAs achieve a high level of programmability. LUTs are responsible for implementing various logic functions. For example, reprogramming a LUT can change an operation from a bit-wise AND logic function to a bit-wise XOR logic function.
The key benefit of using FPGAs for algorithm acceleration is that they support wide, heterogeneous and unique pipeline implementations. This characteristic is in contrast to many different types of processing units such as symmetric multiprocessors, DSPs, and graphics processing units (GPUs). In these types of devices, parallelism is achieved by replicating the same generic computation hardware multiple times. In FPGAs, however, you can achieve parallelism by duplicating only the logic that your algorithm exercises.
A processor implements an instruction set that limits the amount of work it can perform each clock cycle. For example, most processors do not have a dedicated instruction that can execute the following C code:
E = (((A + B) ^ C) & D) >> 2;
Without a dedicated instruction for this C code example, a CPU, DSP, or GPU must execute multiple instructions to perform the operation. In contrast, you may think of an FPGA as a hardware platform that can implement any instruction set that your software algorithm requires. You can configure an FPGA to perform a sequence of operations that implements the code example above in a single clock cycle. An FPGA implementation connects specialized addition hardware with a LUT that performs the bit-wise XOR and AND operations. The device then leverages its programmable connections to perform a right shift by two bits without consuming any hardware resources. The result of this operation then becomes a part of subsequent operations to form complex pipelines.
Pipelines
The designs of microprocessors, digital signal processors (DSPs), hardware accelerators, and other high performance implementations of digital hardware often contain pipeline architectures.
For example, the diagram below represents the following example code fragment as a multistage pipeline:
for (i = 0; i < 1024; i++) {
    y[i] = (a[i] + b[i] + c[i] + d[i] + e[i] + f[i] + g[i] + h[i]) >> 3;
}
With a pipelined architecture, each arithmetic operation passes into the pipeline one at a time. Therefore, as shown in the diagram above, a saturated pipeline consists of eight stages that calculate the arithmetic operations simultaneously and in parallel. In addition, because of the large number of loop iterations, the pipeline stages continue to perform these arithmetic instructions concurrently for each subsequent loop iteration.
Intel® FPGA SDK for OpenCL™ Pipeline Approach
A new pipeline is constructed based on your design. As a result, it can accommodate the highly configurable nature of FPGAs.
Consider the following OpenCL code fragment:
C = (A >> 5) + B;
F = (D - E) << 3;
G = C + F;
You can configure an FPGA to instantiate a complex pipeline structure that executes the entire code simultaneously. In this case, the SDK implements the code as two independent pipelined entities that feed into a pipelined adder, as shown in the figure below.
The Intel® FPGA SDK for OpenCL™ Offline Compiler provides a custom pipeline structure that speeds up computation by allowing operations within a large number of work-items to occur concurrently. The offline compiler can create a custom pipeline that calculates the values for variables C, F and G every clock cycle, as shown below. After a ramp-up phase, the pipeline sustains a throughput of one work-item per cycle.
A traditional processor has a limited set of shared registers. Eventually, a processor must write the stored data out to memory to allow more data to occupy the registers. The offline compiler keeps data "live" by generating enough registers to store the data for all the active work-items within the pipeline. The following code example and figure illustrate a live variable C in the OpenCL pipeline:
size_t index = get_global_id(0);
C = A[index] + B[index];
E[index] = C - D[index];
Single Work-Item Kernel versus NDRange Kernel
The Intel® FPGA SDK for OpenCL™ host can execute a kernel as a single work-item, which is equivalent to launching a kernel with an NDRange size of (1, 1, 1).
The OpenCL Specification version 1.0 describes this mode of operation as task parallel programming. A task refers to a kernel executed with one work-group that contains one work-item.
Generally, the host launches multiple work-items in parallel. However, this data parallel programming model is not suitable for situations where fine-grained data must be shared among parallel work-items. In these cases, you can maximize throughput by expressing your kernel as a single work-item. Unlike NDRange kernels, single work-item kernels follow a natural sequential model similar to C programming. Particularly, you do not have to partition the data across work-items.
To ensure high-throughput single work-item-based kernel execution on the FPGA, the Intel® FPGA SDK for OpenCL™ Offline Compiler must process multiple pipeline stages in parallel at any given time. This parallelism is realized by pipelining the iterations of loops.
Consider the following simple example code that shows accumulation with a single work-item:
kernel void accum_swg (global int* a, global int* c, int size, int k_size) {
    int sum[1024];
    for (int k = 0; k < k_size; ++k) {
        for (int i = 0; i < size; ++i) {
            int j = k * size + i;
            sum[k] += a[j];
        }
    }
    for (int k = 0; k < k_size; ++k) {
        c[k] = sum[k];
    }
}


The following figure illustrates how each iteration of i enters into the block:
When we observe the outer loop, an II value of 1 also means that a new thread can enter the outer loop every clock cycle. In this example, a k_size of 20 and a size of 4 are considered. This holds for the first eight clock cycles, because outer loop iterations 0 to 7 can enter without any downstream block stalling them. Once thread 0 enters the inner loop, it takes four iterations to finish. Threads 1 to 8 cannot enter the inner loop and are stalled for four cycles by thread 0. Thread 1 enters the inner loop after thread 0's iterations are completed. As a result, thread 9 enters the outer loop on clock cycle 13. Threads 9 to 20 enter the loop every four clock cycles, which is the value of size. Through this example, we can observe that the dynamic initiation interval of the outer loop is greater than the statically predicted initiation interval of 1, and that it is a function of the trip count of the inner loop.

- Using any of the following functions will cause your kernel to be interpreted as an NDRange:
- get_local_id()
- get_global_id()
- get_group_id()
- get_local_linear_id()
- barrier
- If the reqd_work_group_size attribute is specified to be anything other than (1, 1, 1), your kernel is interpreted as an NDRange. Otherwise, your kernel will be interpreted as a single-work-item kernel.
Consider the same accumulate example written in NDRange:
kernel void accum_ndr (global int* a, global int* c, int size) {
    int k = get_global_id(0);
    int sum[1024];
    for (int i = 0; i < size; ++i) {
        int j = k * size + i;
        sum[k] += a[j];
    }
    c[k] = sum[k];
}


Limitations
The OpenCL task parallel programming model does not support the notion of a barrier in single-work-item execution. Replace barriers (barrier) with memory fences (mem_fence) in your kernel.
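The following is a minimal sketch, not taken from the guide, of what that replacement can look like when an NDRange-style kernel is rewritten as a single work-item task; the kernel name, buffer arguments, and loop bound are illustrative assumptions.

// Hypothetical single work-item kernel: the barrier that an NDRange version
// would require between the write phase and the read phase is replaced with
// a memory fence.
kernel void swi_fence_example (global int* restrict in,
                               global int* restrict out)
{
    local int lmem[64];

    for (int i = 0; i < 64; i++) {
        lmem[i] = in[i];
    }

    // Order the local memory writes before the subsequent reads.
    mem_fence(CLK_LOCAL_MEM_FENCE);

    int sum = 0;
    for (int i = 0; i < 64; i++) {
        sum += lmem[i];
    }
    out[0] = sum;
}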
Multi-Threaded Host Application
Parallel threads are launched one thread per clock cycle in a pipelined manner, whereas loop pipelining enables pipeline parallelism and the communication of state information between loop iterations. Loop dependencies might not be resolved in one clock cycle.
The figure below illustrates how a single-threaded host application processes parallel, independent data paths between kernel executions:
With a single-threaded host application, you need to build an external synchronization mechanism around the OpenCL™ host function calls. Using a multi-threaded host application in a thread-safe runtime environment allows you to simplify the host code. In addition, processing multiple sets of data in the host simultaneously can speed up kernel execution.
The figure below illustrates how a multi-threaded host application processes parallel, independent data paths between kernel executions:
Reviewing Your Kernel's report.html File
High Level Design Report Layout
Report Menu
From the View reports pull-down menu, you can select a report to see an analysis of different parts of your kernel design.
Analysis Pane
The analysis pane displays detailed information of the report that you selected from the View reports pull-down menu.
Source Code Pane
The source code pane displays the code for all the source files in your kernel.
To select between different source files in your kernel, click the pull-down menu at the top of the source code pane. To collapse the source code pane, do one of the following actions:
- Click the X icon beside the source code pane pull-down menu.
- Click the vertical ellipsis icon on the right-hand side of the report menu and then select Show/Hide source code.
If you previously collapsed the source code pane and want to expand it, click the vertical ellipsis icon on the right-hand side of the report menu and then select Show/Hide source code.
Details Pane
For each line that appears in a loop analysis or area report, the details pane shows additional information, if available, that elaborates on the comment in the Details column of the report. To collapse the details pane, do one of the following actions:
- Click the X icon on the right-hand side of the details pane.
- Click the vertical ellipsis icon on the right-hand side of the report menu and then select Show/Hide details.
Reviewing the Report Summary
The report summary gives you a quick overview of the results of compiling your design, including a summary of each kernel in your design and a summary of the estimated resources that each kernel uses.
The report summary is divided into four sections: Info, Kernel Summary, Estimated Resource Usage, and Compile Warnings.
Info
- Name of the project
- Target FPGA family, device, and board
- Intel® Quartus® Prime Standard Edition software version
- Intel® FPGA SDK for OpenCL™ Offline Compiler version
- The command that was used to compile the design
- The date and time at which the reports were generated
Kernel Summary
- Whether the kernel is an NDRange or a Single Work-Item kernel
- Whether the autorun attribute is used
- The required workgroup size for the kernel
- The number of compute units
- The vectorization of the kernel
- The maximum global work dimension
- The maximum workgroup size
Estimated Resource Usage
The Estimated Resource Usage section shows a summary of the estimated resources used by each kernel in your design, as well as the estimated resources used for all channels, estimated resources for the global interconnect, constant cache, and board interface.
Compile Warnings
The Compile Warnings section shows some of the compiler warnings generated during the compilation.
Reviewing Loop Information
The High Level Design Report (<your_kernel_filename>/reports/report.html) file contains information about all the loops in your design and their unroll statuses. This loop analysis report helps you examine whether the Intel® FPGA SDK for OpenCL™ Offline Compiler is able to maximize the throughput of your kernel.
The following loop pragmas affect the loop structures that the offline compiler generates:
- #pragma unroll: For details, see "Unrolling a Loop" in the Intel FPGA SDK for OpenCL Standard Edition Programming Guide.
- #pragma loop_coalesce: For details, see "Coalescing Nested Loops" in the Intel FPGA SDK for OpenCL Standard Edition Programming Guide.
- #pragma ii: For details, see "Specifying a loop initiation interval (II)" in the Intel FPGA SDK for OpenCL Standard Edition Programming Guide. A placement sketch for #pragma loop_coalesce and #pragma ii follows this list.
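As a point of reference, the sketch below, which is not taken from the guide, shows where #pragma loop_coalesce and #pragma ii might be placed in a single work-item kernel; the kernel name, loop bodies, and the II value of 2 are illustrative assumptions. #pragma unroll is demonstrated in the design example later in this section.

// Hypothetical kernel showing pragma placement only.
kernel void pragma_examples (global int* restrict data, int N)
{
    // Relax the initiation interval of this loop to 2 cycles.
    #pragma ii 2
    for (int i = 1; i < N; i++) {
        data[i] += data[i - 1];
    }

    // Coalesce the following two nested loops into a single loop structure.
    #pragma loop_coalesce 2
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            data[i * N + j] *= 2;
        }
    }
}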
- Click View reports > Loop Analysis.
- In the analysis pane, select Show fully unrolled loops to obtain information about the loops in your design.
- Consult the flowchart below to identify actions you can take to improve the throughput of your design.
Remember: II refers to the initiation interval of a loop, which is the launch frequency of a new loop iteration. An II value of 1 is ideal; it indicates that the pipeline is functioning at maximum efficiency because the pipeline can process a new loop iteration every clock cycle.
Loop Analysis Report of an OpenCL Design Example
 1 // ND-Range kernel with unrolled loops
 2 __attribute((reqd_work_group_size(1024,1,1)))
 3 kernel void t (global int * out, int N) {
 4   int i = get_global_id(0);
 5   int j = 1;
 6   for (int k = 0; k < 4; k++) {
 7     #pragma unroll
 8     for (int n = 0; n < 4; n++) {
 9       j += out[k+n];
10     }
11   }
12   out[i] = j;
13
14   int m = 0;
15   #pragma unroll 1
16   for (int k = 0; k < N; k++) {
17     m += out[k/3];
18   }
19   #pragma unroll
20   for (int k = 0; k < 6; k++) {
21     m += out[k];
22   }
23   #pragma unroll 2
24   for (int k = 0; k < 6; k++) {
25     m += out[k];
26   }
27   out[2] = m;
28 }
The loop analysis report of this design example highlights the unrolling strategy for the different kinds of loops defined in the code.

The Intel® FPGA SDK for OpenCL™ Offline Compiler executes the following loop unrolling strategies based on the source code:
- Fully unrolls the first loop (line 6) automatically
- Fully unrolls the inner loop (line 8) within the first loop because of the #pragma unroll specification
- Does not unroll the second outer loop, Block2 (line 16), because of the #pragma unroll 1 specification
- Fully unrolls the third outer loop (line 20) because of the #pragma unroll specification
- Unrolls the fourth outer loop, Block4 (line 24), twice because of the #pragma unroll 2 specification
Changing the Memory Access Pattern Example
kernel void big_lmem_4r_4w_nosplit (global int* restrict in,
                                    global int* restrict out) {
    local int lmem[4][1024];
    int gi = get_global_id(0);
    int gs = get_global_size(0);
    int li = get_local_id(0);
    int ls = get_local_size(0);
    int res = in[gi];
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        lmem[i][(li*i) % ls] = res;
        res >>= 1;
    }
    // Global memory barrier
    barrier(CLK_GLOBAL_MEM_FENCE);
    res = 0;
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        res ^= lmem[i][((ls-li)*i) % ls];
    }
    out[gi] = res;
}
The system viewer report of this example highlights the stallable loads and stores.
Observe that only two memory banks are created, with high arbitration on the first bank between load and store operations. Now, switch the banking indices to the second dimension, as shown in the following example code:
kernel void big_lmem_4r_4w_nosplit (global int* restrict in,
                                    global int* restrict out) {
    local int lmem[1024][4];
    int gi = get_global_id(0);
    int gs = get_global_size(0);
    int li = get_local_id(0);
    int ls = get_local_size(0);
    int res = in[gi];
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        lmem[(li*i) % ls][i] = res;
        res >>= 1;
    }
    // Global memory barrier
    barrier(CLK_GLOBAL_MEM_FENCE);
    res = 0;
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        res ^= lmem[((ls-li)*i) % ls][i];
    }
    out[gi] = res;
}
In the kernel memory viewer, you can observe that four memory banks are now created, with separate load-store units. All load-store instructions are stall-free.
Reducing the Area Consumed by Nested Loops Using loop_coalesce
Consider the following example where orig and lc_test kernels are used to illustrate how to reduce latency in nested loops.
The orig kernel has nested loops to a depth of four. The nested loops created extra blocks (Blocks 2, 3, 4, 6, 7, and 8) that consume area due to the variables being carried, as shown in the following reports:
Due to loop coalescing, you can see the reduced latency in the lc_test kernel. Block 5 of the orig kernel and Block 12 of the lc_test kernel are the innermost loops.
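The orig and lc_test kernels are not reproduced in this section. The following sketch is an assumption about their general shape, shown only to illustrate the pattern: applying #pragma loop_coalesce to a nest of depth four so that the offline compiler generates a single loop structure instead of one block per loop level.

// Hypothetical nest of depth four, similar in spirit to the orig kernel.
kernel void orig_like (global int* restrict data, int N)
{
    for (int a = 0; a < N; a++)
        for (int b = 0; b < N; b++)
            for (int c = 0; c < N; c++)
                for (int d = 0; d < N; d++)
                    data[((a * N + b) * N + c) * N + d] += 1;
}

// Hypothetical coalesced version, similar in spirit to the lc_test kernel.
kernel void lc_test_like (global int* restrict data, int N)
{
    #pragma loop_coalesce 4
    for (int a = 0; a < N; a++)
        for (int b = 0; b < N; b++)
            for (int c = 0; c < N; c++)
                for (int d = 0; d < N; d++)
                    data[((a * N + b) * N + c) * N + d] += 1;
}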
Reviewing Area Information
The area report serves the following purposes:
- Provides detailed area breakdown of the whole OpenCL system. The breakdown is related to the source code.
- Provides architectural details to give insight into the generated hardware and offers actionable suggestions to resolve potential inefficiencies.
The area information is grouped into the following levels of hierarchy:
- System area: Used by all kernels, channels, interconnects, and board logic.
- Kernel area: Used by a specific kernel, including overheads, for example, dispatch logic.
- Basic block area: Used by a specific basic block within a kernel. A basic block area represents a branch-free section of your source code, for example, a loop body.

Area Analysis by Source
OpenCL kernel example that includes four loops:
 1 // ND-Range kernel with unrolled loops
 2 __attribute((reqd_work_group_size(1024,1,1)))
 3 kernel void t (global int * out, int N) {
 4   int i = get_global_id(0);
 5   int j = 1;
 6   for (int k = 0; k < 4; k++) {
 7     #pragma unroll
 8     for (int n = 0; n < 4; n++) {
 9       j += out[k+n];
10     }
11   }
12   out[i] = j;
13
14   int m = 0;
15   #pragma unroll 1
16   for (int k = 0; k < N; k++) {
17     m += out[k/3];
18   }
19   #pragma unroll
20   for (int k = 0; k < 6; k++) {
21     m += out[k];
22   }
23   #pragma unroll 2
24   for (int k = 0; k < 6; k++) {
25     m += out[k];
26   }
27   out[2] = m;
28 }
The area report below lists the area usage for the kernel system, board interface, and global interconnects. These elements are system-level IP and are dependent on the Custom or Reference Platform that your design targets. The kernel t is within the hierarchy of the kernel system and is where the source code begins. The report specifies all the variables declared within the source code under kernel t and sorts the remaining area information by line number.

In this example, for the code line j += out[k+n] (line 9), the Intel® FPGA SDK for OpenCL™ Offline Compiler calculates the estimated area usage based on the area required to perform the addition and to load data from global memory. For the code line out[i] = j (line 12), the offline compiler calculates the estimated area usage based on the area required to compute the pointer value and then store it back to global memory.
Area Analysis of System
OpenCL kernel example that includes four loops:
 1 // ND-Range kernel with unrolled loops
 2 __attribute((reqd_work_group_size(1024,1,1)))
 3 kernel void t (global int * out, int N) {
 4   int i = get_global_id(0);
 5   int j = 1;
 6   for (int k = 0; k < 4; k++) {
 7     #pragma unroll
 8     for (int n = 0; n < 4; n++) {
 9       j += out[k+n];
10     }
11   }
12   out[i] = j;
13
14   int m = 0;
15   #pragma unroll 1
16   for (int k = 0; k < N; k++) {
17     m += out[k/3];
18   }
19   #pragma unroll
20   for (int k = 0; k < 6; k++) {
21     m += out[k];
22   }
23   #pragma unroll 2
24   for (int k = 0; k < 6; k++) {
25     m += out[k];
26   }
27   out[2] = m;
28 }

In the system view, the kernel is divided into logic blocks. To view the area usage information for the code lines associated with a block, simply expand the report entry for that block. In this example, area information for the code line out[i] = j (line 12) is available under Block1. The estimated area usage for line 12 in the system view is the same as the estimation in the source view.
Verifying Information on Memory Replication and Stalls
The system viewer shows an abstracted netlist of your OpenCL system. Reviewing the graphical representation of your OpenCL design in the system viewer allows you to verify memory replication, and identify any load and store instructions that are stallable.
Features of the System Viewer
You may interact with the system viewer in the following ways:
- Use the mouse wheel to zoom in and out within the system viewer.
- Review portions of your design that are associated with red logic blocks. For example, a logic block that has a pipelined loop with a high initiation interval (II) value might be highlighted in red because the high II value might affect design throughput.
- Hover over any node within a block to view information on that node in the tooltip and in the details pane.
- Select the type of connections you wish to include in the system viewer by unchecking the type of connections you wish to hide. By default, both Control and Memory are checked in the system viewer. Control refers to connections between blocks and loops. Memory refers to connections to and from global or local memories. If your design includes connections to and from read or write channels, you also have a Channels option in the system viewer.
Features of the Kernel Memory Viewer
Data movement is often a bottleneck in many algorithms. The kernel memory viewer in the High Level Design Report (report.html) shows you how the Intel® FPGA SDK for OpenCL™ Offline Compiler interprets the data connections across the memory system of your kernel. Use the Kernel Memory Viewer to help you identify data movement bottlenecks in your kernel design.
Also, some patterns in memory accesses can cause undesired arbitration in the load-store units (LSUs), which can affect the throughput performance of your kernel. Use the Kernel Memory Viewer to find where you might have unwanted arbitration in the LSUs.
- Memory List: The Memory List pane shows you a hierarchy of kernels, memories in that kernel, and the corresponding memory banks.
Clicking a memory name in the list displays a graphical representation of the memory in the Kernel Memory Viewer pane. Also, the line in your code where you declared the memory is highlighted in the Source Code pane.
Clearing a check box for a memory bank collapses that bank in the Kernel Memory Viewer pane, which can help you to focus on specific memory banks when you view a complex memory design. By default, all banks in kernel memory are selected and shown in the Kernel Memory Viewer pane.
- Kernel Memory Viewer: The Kernel Memory Viewer pane shows you connections between loads and stores to specific logical ports on the banks in a memory system. The following types of nodes might be shown in the Kernel Memory Viewer pane, depending on the kernel memory system:
- Memory node: The kernel memory.
- Bank node: A bank in the memory. Only banks selected in the Memory List pane are shown. Select banks in the Memory List pane to help you focus on specific memory banks when you view a complex memory design.
- Port node: The logical port for a bank. There are three types of ports:
- R: A read-only port
- W: A write-only port
- RW: A read and write port
- LSU node: A store (ST) or load (LD) node connected to the memory.
- Arbitration node: An arbitration (ARB) node shows that LSUs compete for access to a shared port node, which can lead to stalls.
- Port-sharing node: A port-sharing node (SHARE) shows that LSUs have mutually exclusive access to a shared port node, so the load-store units are free from stalls.
Hover over any node to view the attributes of that node.
Hover over an LSU node to highlight the path from the LSU node to all of the ports that the LSU connects to.
Hover over a port node to highlight the path from the port node to all of the LSUs that store to the port node.
Click on a node to select it and have the node attributes displayed in the Details pane.
- Details: The Details pane shows the attributes of the node selected in the Kernel Memory Viewer pane. For example, when you select a memory in a kernel, the Details pane shows information such as the widths and depths of the memory banks, as well as any user-defined attributes that you specified in your source code.
The content of the Details pane persists until you select a different node in the Kernel Memory Viewer pane.
Optimizing an OpenCL Design Example Based on Information in the HTML Report
OpenCL design example that performs matrix square AxA:
// performs matrix square A*A
// A is a square len*len matrix
kernel void matrix_square (global float* restrict A,
                           unsigned len,
                           global float* restrict out)
{
    for (unsigned oi = 0; oi < len*len; oi++) {
        float sum = 0;
        int row = oi % len;
        for (int col = 0; col < len; col++) {
            unsigned i = (row * len) + col; // 0, 1, 2, 3, 4,...
            unsigned j = (col * len) + row; // 0, 3, 6, 9, 1,...
            sum += A[i] * A[j];
        }
        out[oi] = sum;
    }
}
The system view of the area report for the kernel matrix_square indicates that the estimated usages of flipflops (FF) and RAMs for Block3 are high. Further examination of Block3 in the system viewer shows that Block3 also has a high latency value.

These performance bottlenecks occur because the system loads data from global memory from inside a loop. Therefore, the first optimization step you can take is to preload the data into local memory, as shown in the modified code below.
kernel void matrix_square_v1 (global float* restrict A,
                              unsigned len,
                              global float* restrict out)
{
    // 1. preload the data into local memory
    //    - suppose we know the max size is 4X4
    local int cache_a[16];
    for (unsigned k = 0; k < len*len; k++) {
        cache_a[k] = A[k];
    }
    for (unsigned oi = 0; oi < len*len; oi++) {
        float sum = 0;
        int row = oi % len;
        for (int col = 0; col < len; col++) {
            unsigned i = (row * len) + col; // 0, 1, 2, 3, 4,...
            unsigned j = (col * len) + row; // 0, 3, 6, 9, 1,...
            sum += cache_a[i] * cache_a[j];
        }
        out[oi] = sum;
    }
}

As shown in the area report and the system viewer results, preloading the data into local memory reduces the RAM usage by one-third and lowers the latency value from 255 to 97.
Further examination of the area report of matrix_square_v1 shows that the code line int row = oi % len, which is line 30 in the area report below, uses an unusually large amount of area because of the modulus computation.

If you remove the modulus computation and replace it with a column counter, as shown in the modified kernel matrix_square_v2, you can reduce the amount of adaptive look-up table (ALUT) and FF usages by 50%.
kernel void matrix_square_v2 (global float* restrict A,
                              unsigned len,
                              global float* restrict out)
{
    // 1. preload the data into local memory
    //    - suppose we know the max size is 4X4
    // 2. remove the modulus computation
    local int cache_a[16];
    for (unsigned k = 0; k < len*len; k++) {
        cache_a[k] = A[k];
    }
    unsigned row = 0;
    unsigned ci = 0;
    for (unsigned oi = 0; oi < len*len; oi++) {
        float sum = 0;
        // keep a column counter to know when to increment row
        if (ci == len) {
            ci = 0;
            row += 1;
        }
        ci += 1;
        for (int col = 0; col < len; col++) {
            unsigned i = (row * len) + col; // 0, 1, 2, 3, 4,...
            unsigned j = (col * len) + row; // 0, 3, 6, 9, 1,...
            sum += cache_a[i] * cache_a[j];
        }
        out[oi] = sum;
    }
}

Further examination of the area report of matrix_square_v2 reveals that the computations for indexes i and j (that is, unsigned i = (row * len) + col and unsigned j = (col * len) + row, respectively) have very different ALUT and FF usage estimations. The area report also shows that these two computations use digital signal processing (DSP) blocks.

One way to optimize DSP and RAM block usage for the index calculation is to remove the multiplication and simply keep track of the addition, as shown in the modified kernel matrix_square_v3 below.
kernel void matrix_square_v3 (global float* restrict A,
                              unsigned len,
                              global float* restrict out)
{
    // 1. preload the data into local memory
    //    - suppose we know the max size is 4X4
    // 2. remove the modulus computation
    // 3. remove DSP and RAM blocks for index calculation helps reduce the latency
    local int cache_a[16];
    for (unsigned k = 0; k < len*len; k++) {
        cache_a[k] = A[k];
    }
    unsigned row_i = 0;
    unsigned row_j = 0;
    unsigned ci = 0;
    for (unsigned oi = 0; oi < len*len; oi++) {
        float sum = 0;
        unsigned i, j;
        // keep a column counter to know when to increment row
        if (ci == len) {
            ci = 0;
            row_i += len;
            row_j += 1;
        }
        ci += 1;
        i = row_i; // initialize i and j
        j = row_j;
        for (int col = 0; col < len; col++) {
            i += 1;   // 0, 1, 2, 3, 0,...
            j += len; // 0, 3, 6, 9, 1,...
            sum += cache_a[i] * cache_a[j];
        }
        out[oi] = sum;
    }
}
By removing the multiplication step, you can reduce DSP usage by 50%, as shown in the area report below. In addition, the modification helps reduce latency.

To further reduce latency, you can review the loop analysis report of the modified kernel matrix_square_v3. As shown below, the analysis pane and the details pane of the report indicate that the line sum += cache_a[i] * cache_a[j] has a loop-carried dependency, causing Block27 to have an II bottleneck.

To resolve the loop-carried dependency, you can separate the multiplication and the addition portions of the computation, as shown in the highlighted code below in the modified kernel matrix_square_v4.
kernel void matrix_square_v4 (global float* restrict A,
                              unsigned len,
                              global float* restrict out)
{
    // 1. preload the data into local memory
    //    - suppose we know the max size is 4X4
    // 2. remove the modulus computation
    // 3. remove DSP and RAM blocks for index calculation helps reduce the latency
    // 4. remove loop-carried dependency 'sum' to improve throughput by trading off area
    local int cache_a[16];
    for (unsigned k = 0; k < len*len; k++) {
        cache_a[k] = A[k];
    }
    unsigned row_i = 0;
    unsigned row_j = 0;
    unsigned ci = 0;
    for (unsigned oi = 0; oi < len*len; oi++) {
        float sum = 0;
        unsigned i, j;
        float prod[4]; // make register
        #pragma unroll
        for (unsigned k = 0; k < 4; k++) {
            prod[k] = 0;
        }
        // keep a column counter to know when to increment row
        if (ci == len) {
            ci = 0;
            row_i += len;
            row_j += 1;
        }
        ci += 1;
        i = row_i; // initialize i and j
        j = row_j;
        for (int col = 0; col < len; col++) {
            i += 1;   // 0, 1, 2, 3, 0,...
            j += len; // 0, 3, 6, 9, 1,...
            prod[col] = cache_a[i] * cache_a[j];
        }
        sum = prod[0];
        #pragma unroll
        for (unsigned k = 1; k < 4; k++) {
            sum += prod[k];
        }
        out[oi] = sum;
    }
}
As shown in the area report and system viewer results below, by breaking up the computation steps, you can achieve higher throughput at the expense of increased area usage. The modification also reduces the II value of the loops to 1, and reduces the latency from 30 to 24.

HTML Report: Area Report Messages
- Area Report Message for Board Interface: The area report identifies the amount of logic that the Intel® FPGA SDK for OpenCL™ Offline Compiler generates for the Custom Platform, or board interface.
- Area Report Message for Function Overhead: The area report identifies the amount of logic that the Intel® FPGA SDK for OpenCL™ Offline Compiler generates for tasks such as dispatching kernels.
- Area Report Message for State: The area report identifies the amount of resources that your design uses for live values and control logic.
- Area Report Message for Feedback: The area report specifies the resources that your design uses for loop-carried dependencies.
- Area Report Message for Constant Memory: The area report specifies the size of the constant cache memory. It also provides information such as data replication and the number of read operations.
- Area Report Messages for Private Variable Storage: The area report provides information on the implementation of private memory based on your OpenCL™ design.
Area Report Message for Board Interface
Message | Notes |
---|---|
Platform interface logic. | — |
Area Report Message for Function Overhead
Message | Notes |
---|---|
Kernel dispatch logic. | A kernel that includes the max_global_work_dim(0) kernel attribute contains no overhead. As a result, this row is not present in the corresponding area report. |
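For reference, a kernel declares this attribute as shown in the sketch below; the kernel name and body are illustrative assumptions, not taken from the guide.

// Hypothetical single work-item kernel marked with max_global_work_dim(0).
// Because the kernel is never launched over an NDRange, no dispatch logic is
// generated and the corresponding row is absent from the area report.
__attribute__((max_global_work_dim(0)))
kernel void no_dispatch_overhead (global int* restrict data, int N)
{
    for (int i = 0; i < N; i++) {
        data[i] *= 2;
    }
}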
Area Report Message for State
To reduce the reported area consumption under State, modify your design as follows:
- Decrease the size of local variables
- Decrease the scope of local variables by localizing them whenever possible
- Decrease the number of nested loops in the kernel
Area Report Message for Feedback
To reduce the reported area consumption under Feedback, decrease the number and size of loop-carried variables in your design.
Area Report Message for Constant Memory
Message | Notes |
---|---|
<N> bytes constant cache accessible to all kernels and is persistent across kernel invocations. Data inside the cache is replicated <X> times to support <Y> reads. Cache optimized for hits, misses incur a large penalty. If amount of data in the cache is small, consider passing it by value as a kernel argument. Use Intel® FPGA Dynamic Profiler for OpenCL™ to check stalls on accesses to the cache to assess the cache's effectiveness. Profiling actual cache hit rate is currently not supported. | — |
Area Report Messages for Private Variable Storage
Message | Notes |
---|---|
Implementation of Private Memory Using On-Chip Block RAM | |
Private memory implemented in on-chip block RAM. | The block RAM implementation creates a system that is similar to local memory for NDRange kernels. |
Implementation of Private Memory Using On-Chip Block ROM | |
— | For each usage of an on-chip block ROM, the offline compiler creates another instance of the same ROM. There is no explicit annotation for private variables that the offline compiler implements in on-chip block ROM. |
Implementation of Private Memory Using Registers | |
Implemented using registers of the following size: <X> registers of width <Y> and depth <Z> [(depth was increased by a factor of <N> due to a loop initiation interval of <M>.)] ... | Reports that the offline compiler implements a private variable in registers. The offline compiler might implement a private variable in many registers. This message provides a list of the registers with their specific widths and depths. |
Implementation of Private Memory Using Shift Registers | |
Implemented as a shift register with <N> or fewer tap points. This is a very efficient storage type. Implemented using registers of the following sizes: <X> register(s) of width <Y> and depth <Z> ... | Reports that the offline compiler implements a private variable in shift registers. This message provides a list of shift registers with their specific widths and depths. The offline compiler might break a single array into several smaller shift registers depending on its tap points. Note: The offline compiler might overestimate the number of tap points. |
Implementation of Private Memory Using Barrel Shifters with Registers | |
Implemented as a barrel shifter with registers due to dynamic indexing. This is a high overhead storage type. If possible, change to compile-time known indexing. The area cost of accessing this variable is shown on the lines where the accesses occur. Implemented using registers of the following size: <X> registers of width <Y> and depth <Z> [(depth was increased by a factor of <N> due to a loop initiation interval of <M>.)] ... | Reports that the offline compiler implements a private variable in a barrel shifter with registers because of dynamic indexing. This row in the report does not specify the full area usage of the private variable. The report shows additional area usage information on the lines where the variable is accessed. |
- The area report annotates memory information on the line of code that declares or uses private memory, depending on its implementation.
- When the offline compiler implements private memory in on-chip block RAM, the area report displays the local-memory-specific messages that are relevant to the private memory system.
HTML Report: Kernel Design Concepts

- Kernels: The Intel® FPGA SDK for OpenCL™ Offline Compiler compiles a kernel that does not use any built-in work-item functions, such as get_global_id() and get_local_id(), as a single work-item kernel.
- Global Memory Interconnect: There are various types of global memory interconnect that can exist in an OpenCL™ system. A memory interconnect is sometimes referred to as a load-store unit (LSU).
- Local Memory
- Nested Loops
- Loops in a Single Work-Item Kernel: The Intel® FPGA SDK for OpenCL™ Offline Compiler implements an algorithm that optimizes kernels to maximize performance for data processing.
- Channels: The Intel® FPGA SDK for OpenCL™ channel implementation provides a flexible way to pass data from one kernel to another kernel to improve performance.
- Load-Store Units: The Intel® FPGA SDK for OpenCL™ Offline Compiler generates a number of different types of load-store units (LSUs). For some types of LSU, the compiler might modify the LSU behavior and properties depending on the memory access pattern and other memory attributes.
Kernels
For more information on built-in work-item functions, refer to section 6.11.1: Work-Item Functions of the OpenCL Specification version 1.0.
For single work-item kernels, the offline compiler attempts to pipeline every loop in the kernel to allow multiple loop iterations to execute concurrently. Kernel performance might degrade if the compiler cannot pipeline some of the loops effectively, or if it cannot pipeline the loops at all.
The offline compiler cannot pipeline loops in NDRange kernels. However, these loops can accept multiple work-items simultaneously. A kernel might have multiple loops, each with nested loops. If you tabulate the total number of iterations of nested loops for each outer loop, kernel throughput is usually reduced by the largest total iterations value that you have tabulated.
To execute an NDRange kernel efficiently, there usually needs to be a large number of threads.
Global Memory Interconnect
Unlike a GPU, an FPGA can build any custom LSU that is most optimal for your application. As a result, your ability to write OpenCL code that selects the ideal LSU types for your application might help improve the performance of your design significantly.
When reviewing the HTML area report of your design, the values in the Global interconnect entry at the system level represent the size of the global memory interconnect.

In the HTML report, the memory system viewer depicts global memory interconnects as loads (LD), stores (ST), and connections (gray lines).

The Intel® FPGA SDK for OpenCL Offline Compiler selects the appropriate type of LSU for your OpenCL system based on the memory access pattern of your design. Example LSU types include contiguous access (or consecutive access) and burst-interleaved access. Contiguous Memory Accesses and Optimize Global Memory Accesses illustrate the difference in access patterns between contiguous and burst-interleaved memory accesses, respectively.
Local Memory
Local memory is a complex system. Unlike the typical GPU architecture, which has different levels of caches, the FPGA implements local memory in dedicated memory blocks inside the FPGA.
Local Memory Characteristics
- Ports—Each bank of local memory has one write port and one read port that your design can access simultaneously.
- Double pumping—The double-pumping feature allows each local memory bank to support up to three read ports. Refer to the Double Pumping section for more information.
In your kernel code, declare local memory as a variable with type local:
local int lmem[1024];
The Intel® FPGA SDK for OpenCL™ Offline Compiler customizes the local memory properties such as width, depth, banks, replication, and interconnect. The offline compiler analyzes the access pattern based on your code and then optimizes the local memory to minimize access contention.
The diagrams below illustrate these basic local memory properties: size, width, depth, banks, and replication.
In the HTML report, the overall state of the local memory is reported as optimal, good but replicated, and potentially inefficient.
The key to designing a highly efficient kernel is to have memory accesses that never stall. In this case, all possible concurrent memory access sites in the data path are guaranteed to access memory without contention.
In a complex kernel, the offline compiler might not have enough information to infer whether a memory access has any conflict. As a result, the offline compiler infers a local memory load-store unit (LSU) to arbitrate the memory access. However, inferring an LSU might cause inefficiencies. Refer to Local Memory LSU for more information.
The offline compiler does not always implement local memory with the exact size that you specified. Since FPGA RAM blocks have specific dimensions, the offline compiler implements a local memory size that rounds up to the next supported RAM block dimension. Refer to device-specific information for more details on RAM blocks.
Local Memory Banks
Local memory banking only works on the lowest dimension by default. Having multiple banks allows simultaneous writes to take place. The figure below illustrates the implementation of the following local variable declaration:
local int lmem[1024][4];
In the following code example, each local memory access in the loop has a separate address, so the offline compiler can infer lmem and create four separate banks. The loop allows four simultaneous accesses to lmem[][], which achieves the optimal configuration.
kernel void bank_arb_consecutive_multidim (global int* restrict in,
                                           global int* restrict out) {
    local int lmem[1024][BANK_SIZE];
    int gi = get_global_id(0);
    int gs = get_global_size(0);
    int li = get_local_id(0);
    int ls = get_local_size(0);
    int res = in[gi];
    #pragma unroll
    for (int i = 0; i < BANK_SIZE; i++) {
        lmem[((li+i) & 0x7f)][i] = res + i;
        res >>= 1;
    }
    int rdata = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);
    #pragma unroll
    for (int i = 0; i < BANK_SIZE; i++) {
        rdata ^= lmem[((li+i) & 0x7f)][i];
    }
    out[gi] = rdata;
    return;
}
You can also control the banking geometry explicitly by applying the bank_bits and bankwidth attributes to the local memory declaration, as in the following example:
local int a[4][128] __attribute__((bank_bits(8,7),bankwidth(4)));
#define BANK_SIZE 4
kernel void bank_arb_consecutive_multidim_origin (global int* restrict in,
                                                  global int* restrict out) {
    local int a[BANK_SIZE][128] __attribute__((bank_bits(8,7),bankwidth(4)));
    int gi = get_global_id(0);
    int li = get_local_id(0);
    int res = in[gi];
    #pragma unroll
    for (int i = 0; i < BANK_SIZE; i++) {
        a[i][((li+i) & 0x7f)] = res + i;
        res >>= 1;
    }
    int rdata = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);
    #pragma unroll
    for (int i = 0; i < BANK_SIZE; i++) {
        rdata ^= a[i][((li+i) & 0x7f)];
    }
    out[gi] = rdata;
    return;
}
The view of the resulting memory is the same as the initial view from the first example. However, if you specify the wrong bits to bank on, the memory arbitration logic changes.
local int a[4][128] __attribute__((bank_bits(4,3),bankwidth(4)));

If the compiler cannot infer the local memory accesses to separate addresses, it uses a local memory interconnect to arbitrate the accesses, which degrades performance.
Local Memory Replication
Local memory replication allows for simultaneous read operations to occur. The offline compiler optimizes your design for efficient local memory access in order to maximize overall performance. Although memory replication leads to inefficient hardware in some cases, memory replication does not always increase RAM usage.
When the offline compiler recognizes that more than two work groups are reading from local memory simultaneously, it replicates the local memory. If local memory replication increases your design area significantly, consider reducing the number of barriers in the kernel or increasing the max_work_group_size value to help reduce the replication factor.
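For reference, the max_work_group_size kernel attribute is applied as in the sketch below; the kernel name, the value of 64, and the body are illustrative assumptions, not taken from the guide.

// Hypothetical kernel with an explicit work-group size bound. The offline
// compiler can take this bound into account when it decides how to replicate
// local memory for the accesses separated by the barrier.
__attribute__((max_work_group_size(64)))
kernel void capped_wg (global int* restrict in,
                       global int* restrict out)
{
    local int lmem[64];
    int gi = get_global_id(0);
    int li = get_local_id(0);

    lmem[li] = in[gi];
    barrier(CLK_LOCAL_MEM_FENCE);
    out[gi] = lmem[(li + 1) % 64];
}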
Double Pumping
By default, each local memory bank has one read port and one write port. The double pumping feature allows each local memory bank to support up to three read ports.
The underlying mechanism that enables double pumping is in the M20K hardware. During the first clock cycle, the M20K block is double clocked. Then, during the second clock cycle, the ports are multiplexed to create two more read ports.
By enabling the double pumping feature, the offline compiler trades off area versus maximum frequency. The offline compiler uses heuristic algorithms to determine the optimal memory configurations.
Advantages of double pumping:
- Increases from one read port to three read ports
- Saves RAM usage
Disadvantages of double pumping:
- Implements additional logic
- Might reduce maximum frequency
The following code example illustrates the implementation of local memory with eight read ports and one write port. The offline compiler enables double pumping and replicates the local memory three times to implement a memory configuration that can support up to nine read ports.
#define NUM_WRITES 1
#define NUM_READS 8
#define NUM_BARRIERS 1

local int lmem[1024];
int li = get_local_id(0);
int res = in[gi];
#pragma unroll
for (int i = 0; i < NUM_WRITES; i++) {
    lmem[li - i] = res;
    res >>= 1;
}
// successive barriers are not optimized away
#pragma unroll
for (int i = 0; i < NUM_BARRIERS; i++) {
    barrier(CLK_GLOBAL_MEM_FENCE);
}
res = 0;
#pragma unroll
for (int i = 0; i < NUM_READS; i++) {
    res ^= lmem[li - i];
}
Nested Loops
The Intel® FPGA SDK for OpenCL™ Offline Compiler does not infer pipelined execution because of the ordering of loop iterations. As a result, outer loop iterations might be out of order with respect to the ensuing inner loops because the number of iterations of the inner loops might differ for different outer loop iterations.
To solve the problem of out-of-order outer loop iterations, design inner loops with lower and upper bounds that do not change between outer loop iterations.
Single Work-Item Execution
To ensure high-throughput single work-item-based kernel execution on the FPGA, the Intel® FPGA SDK for OpenCL™ Offline Compiler must process multiple pipeline stages in parallel at any given time. This parallelism is realized by pipelining the iterations of loops.
Consider the following simple example code that shows accumulation with a single work-item:
kernel void accum_swg (global int* a, global int* c, int size, int k_size) {
    int sum[1024];
    for (int k = 0; k < k_size; ++k) {
        for (int i = 0; i < size; ++i) {
            int j = k * size + i;
            sum[k] += a[j];
        }
    }
    for (int k = 0; k < k_size; ++k) {
        c[k] = sum[k];
    }
}


The following figure illustrates how each iteration of i enters into the block:
When we observe the outer loop, an II value of 1 also means that a new thread can enter the outer loop every clock cycle. In this example, a k_size of 20 and a size of 4 are considered. This holds for the first eight clock cycles, because outer loop iterations 0 to 7 can enter without any downstream block stalling them. Once thread 0 enters the inner loop, it takes four iterations to finish. Threads 1 to 8 cannot enter the inner loop and are stalled for four cycles by thread 0. Thread 1 enters the inner loop after thread 0's iterations are completed. As a result, thread 9 enters the outer loop on clock cycle 13. Threads 9 to 20 enter the loop every four clock cycles, which is the value of size. Through this example, we can observe that the dynamic initiation interval of the outer loop is greater than the statically predicted initiation interval of 1, and that it is a function of the trip count of the inner loop.

Nonlinear Execution
The following loop structure does not support linear execution. The outer loop i contains two divergent inner loops, and each iteration of the outer loop may execute one inner loop or the other, which is a nonlinear execution pattern.
__kernel void structure (__global unsigned* restrict output1,
                         __global unsigned* restrict output2,
                         int N)
{
    for (unsigned i = 0; i < N; i++) {
        if ((i & 3) == 0) {
            for (unsigned j = 0; j < N; j++) {
                output1[i+j] = i * j;
            }
        } else {
            for (unsigned j = 0; j < N; j++) {
                output2[i+j] = i * j;
            }
        }
    }
}
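One possible restructuring, sketched below as an assumption rather than a technique stated in the guide, fuses the two divergent inner loops into a single inner loop and moves the selection inside the loop body, so that every outer loop iteration executes the same loop structure.

// Hypothetical rework of the kernel above: one inner loop whose bounds do not
// depend on the outer iteration, with the destination selected per iteration.
__kernel void structure_fused (__global unsigned* restrict output1,
                               __global unsigned* restrict output2,
                               int N)
{
    for (unsigned i = 0; i < N; i++) {
        bool first = ((i & 3) == 0);
        for (unsigned j = 0; j < N; j++) {
            if (first) {
                output1[i + j] = i * j;
            } else {
                output2[i + j] = i * j;
            }
        }
    }
}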
Out-of-Order Loop Iterations
The number of iterations of an inner loop can differ for each iteration of the outer loop. Consider the following code example:
__kernel void order (__global unsigned* restrict input,
                     __global unsigned* restrict output,
                     int N)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < N; i++) {
        for (unsigned j = 0; j < i; j++) {
            sum += input[i+j];
        }
    }
    output[0] = sum;
}
This example shows that for i = 0, inner loop j iterates zero times. For i = 1, j iterates once, and so on. Because the number of iterations changes for the inner loop, the offline compiler cannot infer pipelining.
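One way to make the trip count invariant, sketched below under the assumption that a compile-time upper bound MAX_J on i (and therefore on N) is acceptable, is to run the inner loop for a fixed number of iterations and guard the accumulation with a condition.

// Hypothetical rework of the order kernel: the inner loop always executes
// MAX_J iterations, and the guard preserves the original behavior as long as
// N does not exceed MAX_J.
#define MAX_J 64

__kernel void order_fixed (__global unsigned* restrict input,
                           __global unsigned* restrict output,
                           int N)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < N; i++) {
        for (unsigned j = 0; j < MAX_J; j++) {
            if (j < i) {
                sum += input[i + j];
            }
        }
    }
    output[0] = sum;
}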
Serial Regions
Serial regions might occur in nested loops when an inner loop access causes an outer loop dependency. The inner loop becomes a serial region in the outer loop iteration because of data or memory dependencies.
At steady state, the II of the outer loop = II of the inner loop * trip count of the inner loop. For inner loops with an II greater than 1 and an outer loop that has no serially executed regions, it is possible to interleave threads from the outer loop.
Consider the following code example:
kernel void serially_execute (global int * restrict A,
                              global int * restrict B,
                              global int * restrict result,
                              unsigned N) {
    int sum = 0;
    for (unsigned i = 0; i < N; i++) {
        int res;
        for (int j = 0; j < N; j++) {
            sum += A[i*N+j];
        }
        sum += B[i];
    }
    *result = sum;
}


Loops in a Single Work-Item Kernel
The launch frequency of a new loop iteration is called the initiation interval (II). II refers to the number of hardware clock cycles for which the pipeline must wait before it can process the next loop iteration. An optimally pipelined loop has an II value of 1 because one loop iteration is processed every clock cycle.
In the HTML report, the loop analysis of an optimally pipelined loop will state that the offline compiler has pipelined the loop successfully.
Consider the following example:
kernel void simple_loop (unsigned N,
                         global unsigned* restrict b,
                         global unsigned* restrict c,
                         global unsigned* restrict out) {
    for (unsigned i = 1; i < N; i++) {
        c[i] = c[i-1] + b[i];
    }
    out[0] = c[N-1];
}
The figure depicts how the offline compiler leverages parallel execution and loop pipelining to execute simple_loop efficiently. The loop analysis report of this simple_loop kernel will show that for loop "for.body", the Pipelined column will indicate Yes, and the II column will indicate 1.
Trade-Off Between Critical Path and Maximum Frequency
The offline compiler attempts to achieve an II value of 1 for a given loop whenever possible. In some cases, the offline compiler might strive for an II of 1 at the expense of a reduced target fmax.
Consider the following example:
kernel void nd (global int *dst, int N) {
    int res = N;
    #pragma unroll 9
    for (int i = 0; i < N; i++) {
        res += 1;
        res ^= i;
    }
    dst[0] = res;
}
The following logical diagram is a simplified representation of the actual, more complex hardware implementation of kernel nd.
The feedback with the addition operation and the XOR gate is the critical path that limits the offline compiler's ability to achieve the target frequency. The resulting HTML report describes a breakdown, in percentages, of the contributors that make up the critical path.

Loop-Carried Dependencies that Affect the Initiation Interval of a Loop
There are cases where a loop is pipelined but it does not achieve an II value of 1. These cases are usually caused by data dependencies or memory dependencies within a loop.
Data dependency refers to a situation where a loop iteration uses variables that rely on the previous iteration. In this case, a loop can be pipelined, but its II value will be greater than 1. Consider the following example:
 1 // An example that shows data dependency
 2 // choose(n, k) = n! / (k! * (n-k)!)
 3
 4 kernel void choose( unsigned n, unsigned k,
 5                     global unsigned* restrict result )
 6 {
 7   unsigned product = 1;
 8   unsigned j = 1;
 9
10   for( unsigned i = k; i <= n; i++ ) {
11     product *= i;
12     if( j <= n-k ) {
13       product /= j;
14     }
15     j++;
16   }
17
18   *result = product;
19 }
For every loop iteration, the value for the product variable in the kernel choose is calculated by multiplying the current value of index i by the value of product from the previous iteration. As a result, a new iteration of the loop cannot launch until the current iteration finishes processing. The figure below shows the logical view of the kernel choose, as it appears in the system viewer.

The loop analysis report for the kernel choose indicates that Block1 has an II value of 13. In addition, the details pane reports that the high II value is caused by a data dependency on product, and the largest contributor to the critical path is the integer division operation on line 13.


Memory dependency refers to a situation where memory access in a loop iteration cannot proceed until memory access from the previous loop iteration is completed. Consider the following example:
kernel void mirror_content( unsigned max_i,
                            global int* restrict out)
{
    for (int i = 1; i < max_i; i++) {
        out[max_i*2-i] = out[i];
    }
}

In the loop analysis for the kernel mirror_content, the details pane identifies the source of the memory dependency and the breakdown, in percentages, of the contributors to the critical path for Block2.

Channels
When declaring a channel in your kernel code, precede the declaration with the keyword channel.
For example: channel long16 myCh __attribute__((depth(16)));
In the HTML report, the area report maps the channel area to the declaration line in the source code. Channels and channel arrays are reported with their width and depth.
Note that the implemented channel depth can differ from the depth that you specify in the channel declaration. The Intel FPGA SDK for OpenCL Offline Compiler can implement the channel in shift registers or RAM blocks. The offline compiler decides on the type of channel implementation based on the channel depth.
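The producer/consumer sketch below shows one common way channels are used between kernels. It assumes the cl_intel_channels extension and the blocking read_channel_intel and write_channel_intel calls provided by the SDK; the kernel names and the depth value of 16 are illustrative.

#pragma OPENCL EXTENSION cl_intel_channels : enable

// Hypothetical channel connecting a producer kernel to a consumer kernel.
channel int data_ch __attribute__((depth(16)));

kernel void producer (global const int* restrict src, int N)
{
    for (int i = 0; i < N; i++) {
        write_channel_intel(data_ch, src[i]);   // blocking write
    }
}

kernel void consumer (global int* restrict dst, int N)
{
    for (int i = 0; i < N; i++) {
        dst[i] = read_channel_intel(data_ch);   // blocking read
    }
}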
Load-Store Units
The Intel® FPGA SDK for OpenCL™ Offline Compiler generates a number of different types of load-store units (LSUs). For some types of LSU, the compiler might modify the LSU behavior and properties depending on the memory access pattern and other memory attributes.
While you cannot explicitly choose the load-store unit type or modifier, you can affect the type of LSU the compiler instantiates by changing the memory access pattern in your code, the types of memory available, and whether the memory accesses are to local or global memory.
Load-Store Unit Types
Burst-Coalesced Load-Store Units
A burst-coalesced LSU is the default LSU type instantiated by the compiler. It buffers requests until the largest possible burst can be made. The burst-coalesced LSU can provide efficient access to global memory, but it requires a considerable amount of FPGA resources.
kernel void burst_coalesced (global int * restrict in,
                             global int * restrict out) {
    int i = get_global_id(0);
    int value = in[i/2];   // Burst-coalesced LSU
    out[i] = value;
}
Prefetching Load-Store Units
A prefetching LSU instantiates a FIFO (sometimes called a named pipe) which burst reads large blocks from memory to keep the FIFO full of valid data based on the previous address and assuming contiguous reads. Non-contiguous reads are supported, but a penalty is incurred to flush and refill the FIFO.
kernel void prefetching (global int * restrict in,
                         global int * restrict out, int N) {
    int res = 1;
    for (int i = 0; i < N; i++) {
        int v = in[i];   // Prefetching LSU
        res ^= v;
    }
    out[0] = res;
}
Streaming Load-Store Units
A streaming LSU instantiates a FIFO which burst reads large blocks from memory to keep the FIFO full of valid data. This block of data can be used only if memory accesses are in-order, and addresses can be calculated as a simple offset from the base address.
kernel void streaming (global int * restrict in,
                       global int * restrict out) {
    int i = get_global_id(0);
    int idx = out[i];              // Streaming LSU
    int cached_value = in[idx];
    out[i] = cached_value;         // Streaming LSU
}
Semi-Streaming Load-Store Units
A semi-streaming LSU instantiates a read-only cache. The cache will have an area overhead, but will provide improved performance in cases where you make repeated accesses to the same data location in the global memory. You must ensure that your data is not overwritten by a store within the kernel, as that would break the coherency of the cache. The LSU cache is flushed each time the associated kernels are started.
#define N 16

kernel void semi_streaming (global int * restrict in,
                            global int * restrict out)
{
  #pragma unroll 1
  for (int i = 0; i < N; i++) {
    int value = in[i]; // Semi-streaming LSU
    out[i] = value;
  }
}
Local-Pipelined Load-Store Units
A local-pipelined LSU is a pipelined LSU that is used for accessing local memory. Requests are submitted as soon as they are received. Memory accesses are pipelined, so multiple requests can be in flight at a time. If there is no arbitration between the LSU and the local memory, a local-pipelined never-stall LSU is created.
__attribute((reqd_work_group_size(1024,1,1)))
kernel void local_pipelined (global int* restrict in,
                             global int* restrict out)
{
  local int lmem[1024];
  int gi = get_global_id(0);
  int li = get_local_id(0);

  int res = in[gi];
  for (int i = 0; i < 4; i++) {
    lmem[li - i] = res; // Local-pipelined LSU
    res >>= 1;
  }

  barrier(CLK_GLOBAL_MEM_FENCE);

  res = 0;
  for (int i = 0; i < 4; i++) {
    res ^= lmem[li - i]; // Local-pipelined LSU
  }
  out[gi] = res;
}
Global Infrequent Load-Store Units
A global infrequent LSU is a pipelined LSU that is used for global memory accesses that can be proven to be infrequent. The global infrequent LSU is instantiated only for memory operations that are not contained in a loop, and are active only for a single thread in an NDRange kernel.
The compiler implements a global infrequent LSU as pipelined LSU because a pipelined LSU is smaller than other LSU types. While a pipelined LSU might have lower throughput, this throughput tradeoff is acceptable because the memory accesses are infrequent.
kernel void global_infrequent (global int * restrict in,
                               global int * restrict out,
                               int N)
{
  int a = 0;
  if (get_global_id(0) == 0)
    a = in[0]; // Global infrequent LSU
  for (int i = 0; i < N; i++) {
    out[i] = in[i] + a;
  }
}
Constant-Pipelined Load-Store Units
A constant pipelined LSU is a pipelined LSU that is used mainly to read from the constant cache. The constant pipelined LSU consumes less area than a burst-coalesced LSU. The throughput of a constant-pipelined LSU depends greatly on whether the reads hit in the constant cache. Cache misses are expensive.
kernel void constant_pipelined (constant int *src,
                                global int *dst)
{
  int i = get_global_id(0);
  dst[i] = src[i]; // Constant-pipelined LSU
}
For information about the constant cache, see Constant Cache Memory.
Atomic-Pipelined Load-Store Units
An atomic-pipelined LSU is used for all atomic operations. Using atomic operations can significantly reduce kernel performance.
kernel void atomic_pipelined (global int* restrict out)
{
  atomic_add(&out[0], 1); // Atomic LSU
}
Load-Store Unit Modifiers
Depending on the memory access pattern in your kernel, the compiler modifies some LSUs.
Cached
Burst-coalesced LSUs might sometimes include a cache. A cache is created when the memory access pattern is data-dependent or appears to be repetitive. The cache cannot be shared with other loads even if the loads want the same data. The cache is flushed on kernel start and consumes more hardware resources than an equivalent LSU without a cache. The cache can be disabled by simplifying the access pattern or marking the pointer as volatile.
kernel void cached (global int * restrict in,
                    global int * restrict out)
{
  int i = get_global_id(0);
  int idx = out[i];
  int cached_value = in[idx]; // Burst-coalesced cached LSU
  out[i] = cached_value;
}
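For example, a variant of the cached kernel above can mark the input pointer volatile to prevent the compiler from instantiating the private cache. The following sketch is illustrative only; the kernel name is arbitrary:

kernel void not_cached (volatile global int * restrict in,
                        global int * restrict out)
{
  int i = get_global_id(0);
  int idx = out[i];
  out[i] = in[idx]; // Burst-coalesced LSU without a private cache
}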
Write-Acknowledge (write-ack)
Burst-coalesced store LSUs sometimes require a write-acknowledgment signal when data dependencies exist. LSUs with a write-acknowledge signal require additional hardware resources. Throughput might be reduced if multiple write-acknowledge LSUs access the same memory.
kernel void write_ack (global int * restrict in,
                       global int * restrict out,
                       int N)
{
  for (int i = 0; i < N; i++) {
    if (i < 2)
      out[i] = 0; // Burst-coalesced write-ack LSU
    out[i] = in[i];
  }
}
Nonaligned
When a burst-coalesced LSU can access memory that is not aligned to the external memory word size, a nonaligned LSU is created. Additional hardware resources are required to implement a nonaligned LSU. The throughput of a nonaligned LSU might be reduced if it receives many unaligned requests. In the following example, some of the 96-bit wide memory accesses span two memory words, which requires two full lines of data to be read from memory:
kernel void non_aligned (global int * restrict in,
                         global int * restrict out)
{
  int i = get_global_id(0);
  // Three loads are statically coalesced into one,
  // creating a burst-coalesced nonaligned LSU.
  int a1 = in[3*i+0];
  int a2 = in[3*i+1];
  int a3 = in[3*i+2];
  // Three stores are statically coalesced into one.
  out[3*i+0] = a3;
  out[3*i+1] = a2;
  out[3*i+2] = a1;
}
Never-stall
If a local-pipelined LSU is connected to a local memory without arbitration, a never-stall LSU is created because all accesses to the memory take a fixed number of cycles that are known to the compiler.
In the following example, the accesses to the local memory lmem take a fixed number of cycles that are known to the compiler, so the compiler implements them with never-stall LSUs:
__attribute((reqd_work_group_size(1024,1,1)))
kernel void never_stall (global int* restrict in,
                         global int* restrict out,
                         int N)
{
  local int lmem[1024];
  int gi = get_global_id(0);
  int li = get_local_id(0);

  lmem[li] = in[gi]; // Local-pipelined never-stall LSU
  barrier(CLK_GLOBAL_MEM_FENCE);
  out[gi] = lmem[li] ^ lmem[li + 1];
}
OpenCL Kernel Design Best Practices
In general, you should optimize a kernel that targets a single compute unit first. After you optimize this compute unit, increase the performance by scaling the hardware to fill the remainder of the FPGA. The hardware footprint of the kernel correlates with the time it takes for hardware compilation. Therefore, the more optimizations you can perform with a smaller footprint (that is, a single compute unit), the more hardware compilations you can perform in a given amount of time.
In addition to data processing and memory access optimizations, consider implementing the following design practices, if applicable, when you create your kernels.
- Transferring Data Via Intel FPGA SDK for OpenCL Channels or OpenCL Pipes: To increase data transfer efficiency between kernels, implement the Intel® FPGA SDK for OpenCL™ channels extension in your kernel programs. If you want to leverage the capabilities of channels but also retain the ability to run your kernel program using other SDKs, implement OpenCL pipes.
- Unrolling Loops: If your OpenCL kernel contains loop iterations, increase performance by unrolling the loop.
- Optimizing Floating-Point Operations: For floating-point operations, you can manually direct the Intel® FPGA SDK for OpenCL™ Offline Compiler to perform optimizations that create more efficient pipeline structures in hardware and reduce the overall hardware usage.
- Allocating Aligned Memory: When allocating host-side memories that are used to transfer data to and from the FPGA, the memory must be at least 64-byte aligned.
- Aligning a Struct with or without Padding: A properly aligned struct helps the Intel® FPGA SDK for OpenCL™ Offline Compiler generate the most efficient hardware.
- Maintaining Similar Structures for Vector Type Elements: If you update one element of a vector type, update all elements of the vector.
- Avoiding Pointer Aliasing: Insert the restrict keyword in pointer arguments whenever possible.
- Avoid Expensive Functions: Some functions are expensive to implement in FPGAs. Expensive functions might decrease kernel performance or require a large amount of hardware to implement.
- Avoiding Work-Item ID-Dependent Backward Branching: Avoid including any work-item ID-dependent backward branching (that is, branching that occurs in a loop) in your kernel because it degrades performance.
Transferring Data Via Intel FPGA SDK for OpenCL Channels or OpenCL Pipes
Sometimes, FPGA-to-global memory bandwidth constrains the data transfer efficiency between kernels. The theoretical maximum FPGA-to-global memory bandwidth varies depending on the number of global memory banks available in the targeted Custom Platform and board. To determine the theoretical maximum bandwidth for your board, refer to your board vendor's documentation.
In practice, a kernel does not achieve 100% utilization of the maximum global memory bandwidth available. The level of utilization depends on the access pattern of the algorithm.
If global memory bandwidth is a performance constraint for your OpenCL kernel, first try to break down the algorithm into multiple smaller kernels. Second, eliminate some of the global memory accesses by implementing the SDK's channels or OpenCL pipes for data transfer between kernels.
For more information on the usage of channels, refer to the Implementing Intel® FPGA SDK for OpenCL™ Channels Extension section of the Intel® FPGA SDK for OpenCL™ Standard Edition Programming Guide.
For more information on the usage of pipes, refer to the Implementing OpenCL Pipes section of the Intel® FPGA SDK for OpenCL™ Standard Edition Programming Guide.
Characteristics of Channels and Pipes
Default Behavior
The default behavior of channels is blocking. The default behavior of pipes is nonblocking.
Concurrent Execution of Multiple OpenCL Kernels
You can execute multiple OpenCL kernels concurrently. To enable concurrent execution, modify the host code to instantiate multiple command queues. Each concurrently executing kernel is associated with a separate command queue.
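For example, a host program can create one command queue per kernel so that a producer kernel and a consumer kernel run at the same time. The following host-side sketch is illustrative only; it assumes that context, device, and program already exist, uses clEnqueueTask for brevity, and omits argument setup and error checking:

cl_int status;

/* One command queue per kernel enables concurrent execution. */
cl_command_queue producer_queue = clCreateCommandQueue(context, device, 0, &status);
cl_command_queue consumer_queue = clCreateCommandQueue(context, device, 0, &status);

cl_kernel producer = clCreateKernel(program, "producer", &status);
cl_kernel consumer = clCreateKernel(program, "consumer", &status);

/* Set the kernel arguments here, then launch each kernel on its own queue. */
clEnqueueTask(producer_queue, producer, 0, NULL, NULL);
clEnqueueTask(consumer_queue, consumer, 0, NULL, NULL);

clFinish(producer_queue);
clFinish(consumer_queue);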
Pipe-specific considerations:
The OpenCL pipe modifications outlined in Ensuring Compatibility with Other OpenCL SDKs in the Intel® FPGA SDK for OpenCL™ Standard Edition Programming Guide allow you to run your kernel on the SDK. However, they do not maximize the kernel throughput. The OpenCL Specification version 2.0 requires that pipe writes occur before pipe reads so that the kernel is not reading from an empty pipe. As a result, the kernels cannot execute concurrently. Because the Intel® FPGA SDK for OpenCL™ supports concurrent execution, you can modify your host application and kernel program to take advantage of this capability. The modifications increase the throughput of your application; however, you can no longer port your kernel to another SDK. Despite this limitation, the modifications are minimal, and it does not require much effort to maintain both types of code.
To enable concurrent execution of kernels containing pipes, replace the depth attribute in your kernel code with the blocking attribute (that is, __attribute__((blocking))). The blocking attribute introduces a blocking behavior in the read_pipe and write_pipe function calls. The call site blocks kernel execution until the other end of the pipe is ready.
If you add both the blocking attribute and the depth attribute to your kernel, the read_pipe calls will only block when the pipe is empty, and the write_pipe calls will only block when the pipe is full. Blocking behavior causes an implicit synchronization between the kernels, which forces the kernels to run in lock step with each other.
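As an illustrative sketch (the kernel name, pipe name, and depth value are arbitrary), a pipe kernel argument that combines both attributes can be declared as follows:

__kernel void consumer (read_only pipe int __attribute__((blocking))
                                           __attribute__((depth(64))) c0,
                        __global int * restrict dst)
{
  for (int i = 0; i < 64; i++) {
    int x;
    read_pipe(c0, &x); // Blocks only when the pipe is empty
    dst[i] = x;
  }
}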
Implicit Kernel Synchronization
Synchronize the kernels implicitly via blocking channel calls or blocking pipe calls. Consider the following examples:
Kernel with Blocking Channel Call:

channel int c0;

__kernel void producer (__global int * in_buf)
{
  for (int i = 0; i < 10; i++) {
    write_channel_intel(c0, in_buf[i]);
  }
}

__kernel void consumer (__global int * ret_buf)
{
  for (int i = 0; i < 10; i++) {
    ret_buf[i] = read_channel_intel(c0);
  }
}

Kernel with Blocking Pipe Call:

__kernel void producer (__global int * in_buf,
                        write_only pipe int __attribute__((blocking)) c0)
{
  for (int i = 0; i < 10; i++) {
    write_pipe(c0, &in_buf[i]);
  }
}

__kernel void consumer (__global int * ret_buf,
                        read_only pipe int __attribute__((blocking)) c0)
{
  for (int i = 0; i < 10; i++) {
    int x;
    read_pipe(c0, &x);
    ret_buf[i] = x;
  }
}
You can synchronize the kernels such that a producer kernel writes data and a consumer kernel reads the data during each loop iteration. If the write_channel_intel or write_pipe call in producer does not write any data, consumer blocks and waits at the read_channel_intel or read_pipe call until producer sends valid data, and vice versa.
Data Persistence Across Invocations
After the write_channel_intel call writes data to a channel or the write_pipe call writes data to a pipe, the data is persistent across work-groups and NDRange invocations. Data that a work-item writes to a channel or a pipe remains in that channel or pipe until another work-item reads from it. In addition, the order of data in a channel or a pipe is equivalent to the sequence of write operations to that channel or pipe, and the order is independent of the work-item that performs the write operation.
For example, if multiple work-items try to access a channel or a pipe simultaneously, only one work-item can access it. The write_channel_intel call or write_pipe call writes the particular work-item data, called DATAX, to the channel or pipe, respectively. Similarly, the first work-item to access the channel or pipe reads DATAX from it. This sequential order of read and write operations makes channels and pipes an effective way to share data between kernels.
Imposed Work-Item Order
The SDK imposes a work-item order to maintain the consistency of the read and write operations for a channel or a pipe.
Execution Order for Channels and Pipes
Consider the following code examples:
Kernel with Two Read Channel Calls:

__kernel void consumer (__global uint * restrict dst)
{
  for (int i = 0; i < 5; i++) {
    dst[2*i] = read_channel_intel(c0);
    dst[2*i+2] = read_channel_intel(c1);
  }
}

Kernel with Two Read Pipe Calls:

__kernel void consumer (__global uint * restrict dst,
                        read_only pipe uint __attribute__((blocking)) c0,
                        read_only pipe uint __attribute__((blocking)) c1)
{
  for (int i = 0; i < 5; i++) {
    read_pipe(c0, &dst[2*i]);
    read_pipe(c1, &dst[2*i+2]);
  }
}
The code example on the left makes two read channel calls. The code example on the right makes two read pipe calls. In most cases, the kernel executes these channel or pipe calls in parallel; however, channel and pipe call executions might occur out of sequence. Out-of-sequence execution means that the read operation from c1 can occur and complete before the read operation from c0.
Optimizing Buffer Inference for Channels or Pipes
During compilation, the offline compiler computes scheduling mismatches between interacting channels or pipes. These mismatches might cause imbalances between read and write operations. The offline compiler performs buffer inference optimization automatically to correct the imbalance.
Consider the following examples:
Kernel with Channels:

__kernel void producer (__global const uint * restrict src,
                        const uint iterations)
{
  for (int i = 0; i < iterations; i++) {
    write_channel_intel(c0, src[2*i]);
    write_channel_intel(c1, src[2*i+1]);
  }
}

__kernel void consumer (__global uint * restrict dst,
                        const uint iterations)
{
  for (int i = 0; i < iterations; i++) {
    dst[2*i] = read_channel_intel(c0);
    dst[2*i+1] = read_channel_intel(c1);
  }
}

Kernel with Pipes:

__kernel void producer (__global const uint * restrict src,
                        const uint iterations,
                        write_only pipe uint __attribute__((blocking)) c0,
                        write_only pipe uint __attribute__((blocking)) c1)
{
  for (int i = 0; i < iterations; i++) {
    write_pipe(c0, &src[2*i]);
    write_pipe(c1, &src[2*i+1]);
  }
}

__kernel void consumer (__global uint * restrict dst,
                        const uint iterations,
                        read_only pipe uint __attribute__((blocking)) c0,
                        read_only pipe uint __attribute__((blocking)) c1)
{
  for (int i = 0; i < iterations; i++) {
    read_pipe(c0, &dst[2*i]);
    read_pipe(c1, &dst[2*i+1]);
  }
}
The offline compiler performs buffer inference optimization if channels or pipes between kernels cannot form a cycle. A cycle between kernels is a path that originates from a kernel, through a write channel or a write pipe call, and returns to the original kernel. For this example, assume that the write channel or write pipe calls in the kernel producer are scheduled 10 cycles apart and the read channel or read pipe calls are scheduled 15 cycles apart. There exists a temporary mismatch in the read and write operations to c1 because five extra write operations might occur before a read operation to c1 occurs. To correct this imbalance, the offline compiler assigns a buffer size of five cycles to c1 to avoid stalls. The extra buffer capacity decouples the c1 write operations in the producer kernel from the c1 read operations in the consumer kernel.
Best Practices for Channels and Pipes
- Use single-threaded kernels over multi-threaded kernels.
- Consider how the design model can be represented with a feed forward datapath, for example, back-to-back loops or discrete processing steps. Determine whether you should split the design into multiple kernels connected by channels.
- Aggregate data on channels only when the entire data is used at the same point in the kernel.
- Attempt to keep the number of channels per kernel reasonable.
- Do not use non-blocking channels or pipes if you are using a looping structure to wait for the data, as illustrated in the sketch after this list. Non-blocking channels consume more resources than blocking channels.
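The following sketch illustrates the last guideline; the kernel names and channel declarations are arbitrary. The first kernel polls a non-blocking read inside a loop, which consumes extra logic, while the second kernel relies on a blocking read that stalls in hardware until data arrives:

#pragma OPENCL EXTENSION cl_intel_channels : enable

channel int c0;
channel int c1;

// Discouraged: spinning on a non-blocking read inside a loop.
__kernel void consumer_polling (__global int * restrict dst)
{
  for (int i = 0; i < 10; i++) {
    bool valid = false;
    int x;
    do {
      x = read_channel_nb_intel(c0, &valid);
    } while (!valid);
    dst[i] = x;
  }
}

// Preferred: a blocking read waits for data without a polling loop.
__kernel void consumer_blocking (__global int * restrict dst)
{
  for (int i = 0; i < 10; i++) {
    dst[i] = read_channel_intel(c1);
  }
}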
Unrolling Loops
Consider the OpenCL code for a parallel application in which each work-item is responsible for computing the accumulation of four elements in an array:
__kernel void example (__global const int * restrict x,
                       __global int * restrict sum)
{
  int accum = 0;
  for (size_t i = 0; i < 4; i++) {
    accum += x[i + get_global_id(0) * 4];
  }
  sum[get_global_id(0)] = accum;
}
Notice the following three main operations that occur in this kernel:
- Load operations from input x
- Accumulation
- Store operations to output sum
The offline compiler arranges these operations in a pipeline according to the data flow semantics of the OpenCL kernel code. For example, the offline compiler implements loops by forwarding the results from the end of the pipeline to the top of the pipeline, depending on the loop exit condition.
The OpenCL kernel performs one loop iteration of each work-item per clock cycle. With sufficient hardware resources, you can increase kernel performance by unrolling the loop, which decreases the number of iterations that the kernel executes. To unroll a loop, add a #pragma unroll directive to the main loop, as shown in the code example below. Keep in mind that loop unrolling significantly changes the structure of the compute unit that the offline compiler creates.
__kernel void example (__global const int * restrict x,
                       __global int * restrict sum)
{
  int accum = 0;

  #pragma unroll
  for (size_t i = 0; i < 4; i++) {
    accum += x[i + get_global_id(0) * 4];
  }
  sum[get_global_id(0)] = accum;
}
In this example, the #pragma unroll directive causes the offline compiler to unroll the four iterations of the loop completely. To accomplish the unrolling, the offline compiler expands the pipeline by tripling the number of addition operations and loading four times more data. With the removal of the loop, the compute unit assumes a feed-forward structure. As a result, the compute unit can store the sum elements every clock cycle after the completion of the initial load operations and additions. The offline compiler further optimizes this kernel by coalescing the four load operations so that the compute unit can load all the necessary input data to calculate a result in one load operation.
Avoid nested looping structures. Instead, implement a large single loop or unroll inner loops by adding the #pragma unroll directive whenever possible.
For example, if you compile a kernel that has a heavily-nested loop structure, wherein each loop includes a #pragma unroll directive, you might experience a long compilation time. The Intel® FPGA SDK for OpenCL™ Offline Compiler might fail to meet scheduling because it cannot unroll this nested loop structure easily, resulting in a high II. In this case, the offline compiler will issue the following error message along with the line number of the outermost loop:
Kernel <function> exceeded the Max II. The Kernel's resource usage is estimated to be much larger than FPGA capacity. It will perform poorly even if it fits. Reduce resource utilization of the kernel by reducing loop unroll factors within it (if any) or otherwise reduce amount of computation within the kernel.

Unrolling the loop and coalescing the load operations from global memory allow the hardware implementation of the kernel to perform more operations per clock cycle. In general, the methods you use to improve the performance of your OpenCL kernels should achieve the following results:
- Increase the number of parallel operations
- Increase the memory bandwidth of the implementation
- Increase the number of operations per clock cycle that the kernels can perform in hardware
The offline compiler might not be able to unroll a loop completely under the following circumstances:
- You specify complete unrolling of a data-dependent loop with a very large number of iterations. Consequently, the hardware implementation of your kernel might not fit into the FPGA.
- You specify complete unrolling and the loop bounds are not constants.
- The loop consists of complex control flows (for example, a loop containing complex array indexes or exit conditions that are unknown at compilation time).
For the last two cases listed above, the offline compiler issues the following warning:
Full unrolling of the loop is requested but the loop bounds cannot be determined. The loop is not unrolled.
To enable loop unrolling in these situations, specify the #pragma unroll <N> directive, where <N> is the unroll factor. The unroll factor limits the number of iterations that the offline compiler unrolls. For example, to prevent a loop in your kernel from unrolling, add the directive #pragma unroll 1 to that loop.
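For example, the following illustrative kernel (the kernel name is arbitrary) limits the unroll factor to 4, so the offline compiler replicates the loop body four times even though the trip count N is not known at compilation time:

__kernel void partial_unroll (__global const int * restrict x,
                              __global int * restrict sum,
                              unsigned N)
{
  int accum = 0;

  #pragma unroll 4
  for (unsigned i = 0; i < N; i++) {
    accum += x[i];
  }
  sum[0] = accum;
}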
Refer to Good Design Practices for Single Work-Item Kernel for tips on constructing well-structured loops.
Optimizing Floating-Point Operations
Tree Balancing
Order of operation rules apply in the OpenCL™ language. In the following example, the offline compiler performs multiplications and additions in a strict order, beginning with operations within the innermost parentheses:
result = (((A * B) + C) + (D * E)) + (F * G);
By default, the offline compiler creates an implementation that resembles a long vine for such computations. Long, unbalanced chains of operations lead to more expensive hardware. A more efficient hardware implementation is a balanced tree.
In a balanced tree implementation, the offline compiler converts the long vine of floating-point adders into a tree pipeline structure. The offline compiler does not perform tree balancing of floating-point operations automatically because the outcomes of the floating-point operations might differ. As a result, this optimization is inconsistent with the IEEE Standard 754-2008.
If you want the offline compiler to optimize floating-point operations using balanced trees and your program can tolerate small differences in floating-point results, include the -fp-relaxed option in the aoc command, as shown below:
aoc -fp-relaxed <your_kernel_filename>.cl
Rounding Operations
The balanced tree implementation of a floating-point operation includes multiple rounding operations. These rounding operations can require a significant amount of hardware resources in some applications. The offline compiler does not reduce the number of rounding operations automatically because doing so violates the results required by IEEE Standard 754-2008.
You can reduce the amount of hardware necessary to implement floating-point operations with the -fpc option of the aoc command. If your program can tolerate small differences in floating-point results, invoke the following command:
aoc -fpc <your_kernel_filename>.cl
The -fpc option directs the offline compiler to perform the following tasks:
- Remove floating-point rounding operations and conversions whenever possible. If possible, the -fpc argument directs the offline compiler to round a floating-point operation only once, at the end of the tree of the floating-point operations.
- Carry additional mantissa bits to maintain precision. The offline compiler carries additional precision bits through the floating-point calculations, and removes these precision bits at the end of the tree of floating-point operations.
This type of optimization results in hardware that performs a fused floating-point operation, and it is a feature of many new hardware processing systems. Fusing multiple floating-point operations minimizes the number of rounding steps, which leads to more accurate results. An example of this optimization is a fused multiply-accumulate (FMAC) instruction available in new processor architectures. The offline compiler can provide fused floating-point mathematical capabilities for many combinations of floating-point operators in your kernel.
Floating-Point versus Fixed-Point Representations
The OpenCL® standard does not support fixed-point representation; you must implement fixed-point representations using integer data types. Hardware developers commonly achieve hardware savings by using fixed-point data representations and only retain a data resolution required for performing calculations. You must use an 8, 16, 32, or 64-bit scalar data type because the OpenCL standard supports only these data resolutions. However, you can incorporate the appropriate masking operations in your source code so that the hardware compilation tools can perform optimizations to conserve hardware resources.
For example, if an algorithm uses a fixed-point representation of 17-bit data, you must use a 32-bit data type to store the value. If you then direct the Intel® FPGA SDK for OpenCL™ Offline Compiler to add two 17-bit fixed-point values together, the offline compiler must create extra hardware to handle the addition of the excess upper 15 bits. To avoid having this additional hardware, you can use static bit masks to direct the hardware compilation tools to disregard the unnecessary bits during hardware compilation. The code below implements this masking operation:
__kernel void fixed_point_add (__global const unsigned int * restrict a,
                               __global const unsigned int * restrict b,
                               __global unsigned int * restrict result)
{
  size_t gid = get_global_id(0);
  unsigned int temp;
  temp = 0x3FFFF & ((0x1FFFF & a[gid]) + (0x1FFFF & b[gid]));
  result[gid] = temp & 0x3FFFF;
}
In this code example, the upper 15 bits of inputs a and b are masked away and added together. Because the result of adding two 17-bit values cannot exceed an 18-bit resolution, the offline compiler applies an additional mask to mask away the upper 14 bits of the result. The final hardware implementation is a 17-bit addition as opposed to a full 32-bit addition. The logic savings in this example are relatively minor compared to the sheer number of hardware resources available in the FPGA. However, these small savings, if applied often, can accumulate into a larger hardware saving across the entire FPGA.
Allocating Aligned Memory
Aligning the host-side memories allows direct memory access (DMA) transfers to occur to and from the FPGA and improves buffer transfer efficiency.
To set up aligned memory allocations, add the following source code to your host program:
- For Windows:
#define AOCL_ALIGNMENT 64
#include <malloc.h>

void *ptr = _aligned_malloc(size, AOCL_ALIGNMENT);
To free up an aligned memory block, include the function call _aligned_free(ptr);
- For Linux:
#define AOCL_ALIGNMENT 64
#include <stdlib.h>

void *ptr = NULL;
posix_memalign(&ptr, AOCL_ALIGNMENT, size);
To free up an aligned memory block, include the function call free(ptr);
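On either operating system, you can then use the aligned pointer for buffer transfers as usual. The following Linux host-side sketch is illustrative only; the helper function name is arbitrary, the context and queue are assumed to exist already, and error checking is omitted:

#define AOCL_ALIGNMENT 64
#include <stdlib.h>
#include <string.h>
#include <CL/opencl.h>

/* Copies num_bytes of input data to the device from a 64-byte aligned
   host buffer so that the runtime can use DMA for the transfer. */
static cl_mem copy_aligned_input (cl_context context, cl_command_queue queue,
                                  const void *data, size_t num_bytes)
{
  cl_int status;
  void *host_ptr = NULL;
  posix_memalign(&host_ptr, AOCL_ALIGNMENT, num_bytes);
  memcpy(host_ptr, data, num_bytes);

  cl_mem dev_buf = clCreateBuffer(context, CL_MEM_READ_ONLY, num_bytes, NULL, &status);
  clEnqueueWriteBuffer(queue, dev_buf, CL_TRUE, 0, num_bytes, host_ptr, 0, NULL, NULL);

  free(host_ptr); /* Safe after the blocking write completes. */
  return dev_buf;
}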
Aligning a Struct with or without Padding
For example, you can force a 4-byte alignment of a four-character struct by wrapping it in a union with an int member:

typedef struct {
  char r, g, b, alpha;
} Pixel_s;

typedef union {
  Pixel_s p;
  int not_used;
} Pixel;

You can also use the aligned attribute to force a 4-byte alignment, as shown in the following example code:

typedef struct {
  char r, g, b, alpha;
} __attribute__((aligned(4))) Pixel;
The offline compiler conforms with the ISO C standard that requires the alignment of a struct to satisfy all of the following criteria:
- The alignment must be an integer multiple of the lowest common multiple between the alignments of all struct members.
- The alignment must be a power of two.
You may set the struct alignment by including the aligned(N) attribute in your kernel code. Without an aligned attribute, the offline compiler determines the alignment of each struct in an array of struct based on the size of the struct. Consider the following example:
__kernel void test (__global struct mystruct* A,
                    __global struct mystruct* B)
{
  A[get_global_id(0)] = B[get_global_id(0)];
}
If the size of mystruct is 101 bytes, each load or store access will be 1-byte aligned. If the size of mystruct is 128 bytes, each load or store access will be 128-byte aligned, which generates the most efficient hardware.
When the struct fields are not aligned within the struct, the offline compiler inserts padding to align them. Inserting padding between struct fields affects hardware efficiency in the following manner:
- Increases the size of the struct
- Might affect the alignment
To prevent the offline compiler from inserting padding, include the packed attribute in your kernel code. The aforementioned ISO C standard applies when determining the alignment of a packed or unpacked struct. Consider the following example:
struct mystruct1 { char a; int b; };
The size of mystruct1 is 8 bytes. Therefore, the struct is 8-byte aligned, resulting in efficient accesses in the kernel. Now consider another example:
struct mystruct2 { char a; int b; int c; };
The size of mystruct2 is 12 bytes and the struct is 4-byte aligned. Because the struct fields are padded and the struct is unaligned, accesses in the kernel are inefficient.
Following is an example of a struct that includes the packed attribute:
struct __attribute__((packed)) mystruct3 { char a; int b; int c; };
The size of mystruct3 is 9 bytes, so the struct is 1-byte aligned. Although the packed attribute removes the padding between the struct fields, the struct itself is unaligned, so accesses in the kernel are inefficient.
To include both the aligned(N) and packed attributes in a struct, consider the following example:
struct __attribute__((packed)) __attribute__((aligned(16))) mystruct5 { char a; int b; int c; };
The size of mystruct5 is 9 bytes. Because of the aligned(16) attribute, the struct is stored at 16-byte aligned addresses in an array. Because mystruct5 is 16-byte aligned and has no padding, accesses in this kernel will be efficient.
For more information on struct alignment and the aligned(N) and packed attributes, refer to the following documents:
- Section 6.11.1 of the OpenCL Specification version 1.2
- Disabling Insertion of Data Structure Padding section of the Intel® FPGA SDK for OpenCL™ Standard Edition Programming Guide
- Specifying the Alignment of a Struct section of the Intel® FPGA SDK for OpenCL™ Standard Edition Programming Guide
Maintaining Similar Structures for Vector Type Elements
The following code example illustrates a scenario where you should update a vector element:
__kernel void update (__global const float4 * restrict in,
                      __global float4 * restrict out)
{
  size_t gid = get_global_id(0);
  out[gid].x = process(in[gid].x);
  out[gid].y = process(in[gid].y);
  out[gid].z = process(in[gid].z);
  out[gid].w = 0; // Update w even if that variable is not required.
}
Avoiding Pointer Aliasing
The restrict keyword informs the offline compiler that the pointer does not alias other pointers. For example, if your kernel has two pointers to global memory, A and B, that never overlap each other, declare the kernel in the following manner:
__kernel void myKernel (__global int * restrict A, __global int * restrict B)
Avoid Expensive Functions
The following functions are expensive:
- Integer division and modulo (remainder) operators
- Most floating-point operators except addition, multiplication, absolute value, and comparison. Note: For more information on optimizing floating-point operations, refer to the Optimizing Floating-Point Operations section.
- Atomic functions
In contrast, inexpensive functions have minimal effects on kernel performance, and their implementation consumes minimal hardware.
The following functions are inexpensive:
- Binary logic operations such as AND, NAND, OR, NOR, XOR, and XNOR
- Logical operations with one constant argument
- Shift by constant
- Integer multiplication and division by a constant that is a power of two
If an expensive function produces a new piece of data for every work-item in a work-group, it is beneficial to code it in a kernel. In contrast, the code example below shows a case of an expensive floating-point operation (division) executed by every work-item in the NDRange:
__kernel void myKernel (__global const float * restrict a,
                        __global float * restrict b,
                        const float c,
                        const float d)
{
  size_t gid = get_global_id(0);
  // Inefficient because each work-item must calculate c divided by d:
  b[gid] = a[gid] * (c / d);
}
The result of this calculation is always the same. To avoid this redundant and hardware resource-intensive operation, perform the calculation in the host application and then pass the result to the kernel as an argument for all work-items in the NDRange to use. The modified code is shown below:
__kernel void myKernel (__global const float * restrict a,
                        __global float * restrict b,
                        const float c_divided_by_d)
{
  size_t gid = get_global_id(0);
  /* The host calculates c divided by d once and passes it into the kernel
     to avoid redundant expensive calculations. */
  b[gid] = a[gid] * c_divided_by_d;
}
The Intel® FPGA SDK for OpenCL™ Offline Compiler consolidates operations that are not work-item-dependent across the entire NDRange into a single operation. It then shares the result across all work-items. In the first code example, because the division of c by d remains constant across all work-items, the offline compiler creates a single divider block shared by all work-items. This optimization helps minimize the amount of redundant hardware. However, the implementation of the division still requires a significant amount of hardware resources. Therefore, it is beneficial to off-load the division operation to the host processor and then pass the result as an argument to the kernel to conserve hardware resources.
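A corresponding host-side sketch (the kernel object, variable names, and argument index are illustrative) computes the quotient once and passes the result as the kernel argument:

/* The host computes c / d once and passes the result as the third
   argument of the modified kernel shown above. */
float c_divided_by_d = c / d;
clSetKernelArg(kernel, 2, sizeof(float), &c_divided_by_d);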
Avoiding Work-Item ID-Dependent Backward Branching
The Intel® FPGA SDK for OpenCL™ Offline Compiler collapses conditional statements into single bits that indicate when a particular functional unit becomes active. The offline compiler completely eliminates simple control flows that do not involve looping structures, resulting in a flat control structure and more efficient hardware usage. The offline compiler compiles kernels that include forward branches, such as conditional statements, efficiently.
For example, the following code fragment illustrates branching that involves work-item ID such as get_global_id or get_local_id:
for (size_t i = 0; i < get_global_id(0); i++) { // statements }
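As an illustrative sketch (MAX_ITER is an assumed compile-time bound, not part of the original fragment), one way to avoid the ID-dependent backward branch is to give the loop a uniform trip count and move the ID-dependent condition into the loop body, where the offline compiler can treat it as a forward branch:

#define MAX_ITER 64

for (size_t i = 0; i < MAX_ITER; i++) {
  if (i < get_global_id(0)) {
    // statements
  }
}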
Profiling Your Kernel to Identify Performance Bottlenecks
Consider the following OpenCL kernel program:
__kernel void add (__global int * a,
                   __global int * b,
                   __global int * c)
{
  int gid = get_global_id(0);
  c[gid] = a[gid] + b[gid];
}
The Profiler instruments and connects performance counters in a daisy chain throughout the pipeline generated for the kernel program. The host then reads the data collected by these counters. For example, in PCI Express® (PCIe®)-based systems, the host reads the data via the PCIe control register access (CRA) or control and status register (CSR) port.
Work-item execution stalls might occur at various stages of an Intel® FPGA SDK for OpenCL™ pipeline. Applications with large amounts of memory accesses or load and store operations might stall frequently to enable the completion of memory transfers. The Profiler helps identify the load and store operations or channel accesses that cause the majority of stalls within a kernel pipeline.
For usage information on the Intel® FPGA Dynamic Profiler for OpenCL™ , refer to the Profiling Your OpenCL Kernel section of the Intel® FPGA SDK for OpenCL™ Standard Edition Programming Guide.
Intel FPGA Dynamic Profiler for OpenCL Best Practices
- Include the -profile Intel® FPGA SDK for OpenCL™ Offline Compiler command option in your aoc command during development to insert performance counters into your kernel.
- Periodically check kernel fmax and performance without using the Profiler.
- Run the host application from a local folder to reduce profiler overhead. Avoid running your host from a remote or NAS folder.
- Ensure that kernel runtime is longer than 20 ms; otherwise, the Profiler's overhead takes over.
- Understand how all the load and store operations and channels are connected in the data flow.
Intel FPGA Dynamic Profiler for OpenCL GUI
Heading | Description |
---|---|
Board | Name of the accelerator board that the Intel® FPGA SDK for OpenCL™ Offline Compiler uses during kernel emulation and execution. |
Global Memory BW (DDR) | Maximum theoretical global memory bandwidth available for each memory type (for example, DDR). |
Directly below the summary heading, you can view detailed profile information by clicking on the available tabs.
- Source Code Tab: The Source Code tab in the Intel® FPGA Dynamic Profiler for OpenCL™ GUI contains source code information and detailed statistics about memory and channel accesses.
- Kernel Execution Tab: The Kernel Execution tab in the Intel® FPGA Dynamic Profiler for OpenCL™ GUI provides a graphical representation of the overall kernel program execution process.
- Autorun Captures Tab: To view the autorun statistical data, use the Intel® FPGA Dynamic Profiler for OpenCL™ similar to how you view an enqueued kernel's data. Both autorun and enqueued kernel statistical data are stored in a single profile.mon file.
Source Code Tab

The Source Code tab provides detailed information on specific lines of kernel code.
Column | Description | Access Type |
---|---|---|
Attributes | Memory or channel attributes information such as memory type (local or global), corresponding memory system (DDR or quad data rate (QDR)), and read or write access. | All memory and channel accesses |
Stall% | Percentage of time the memory or channel access is causing pipeline stalls. It is a measure of the ability of the memory or channel access to fulfill an access request. | All memory and channel accesses |
Occupancy% | Percentage of the overall profiled time frame when a valid work-item executes the memory or channel instruction. | All memory and channel accesses |
Bandwidth | Average memory bandwidth that the memory access uses and its overall efficiency. For each global memory access, FPGA resources are assigned to acquire data from the global memory system. However, the amount of data a kernel program uses might be less than the acquired data. The overall efficiency is the percentage of total bytes, acquired from the global memory system, that the kernel program uses. | Global memory accesses |
If a line of source code issues more than one memory or channel operation, the profile statistics appear in a drop-down list box, and you can select the relevant operation to view its information.

Tool Tip Options

Column | Tool Tip | Description | Example Message | Access Type |
---|---|---|---|---|
Attributes | Cache Hits | The number of memory accesses using the cache. A high cache hit rate reduces memory bandwidth utilization. | Cache Hit%=30% | Global memory |
| Unaligned Access | The percentage of unaligned memory accesses. A high unaligned access percentage signifies inefficient memory accesses. Consider modifying the access patterns in the kernel code to improve efficiency. | Unaligned Access%=20% | Global memory |
| Statically Coalesced | Indication of whether the load or store memory operation is statically coalesced. Generally, static memory coalescing merges multiple memory accesses that access consecutive memory addresses into a single wide access. | Coalesced | Global or local memory |
Occupancy% | Activity | The percentage of time a predicated channel or memory instruction is enabled (that is, when conditional execution is true). Note: The activity percentage might be less than the occupancy of the instruction. | Activity=20% | Global or local memory, and channels |
Bandwidth | Burst Size | The average burst size of the memory operation. If the memory system does not support burst mode (for example, on-chip RAM), no burst information will be available. | Average Burst Size=7.6 (Max Burst=16) | Global memory |
Kernel Execution Tab
The Kernel Execution tab provides a graphical representation of the overall kernel program execution process. For example, if you run the host application from a networked directory with slow network disk accesses, the GUI can display the resulting delays between kernel launches while the runtime stores profile output data to disk.

The horizontal bar graph represents kernel execution through time. The combination of the two bars shown in the first entry (fft1d) represents the total time. The second and last entries show kernel executions that occupy the time span. These bars represent the concurrent execution of output_kernel and input_kernel, and indicate that the kernels share common resources such as memory bandwidth.
The Kernel Execution tab also displays information on memory transfers between the host and your devices, shown below:

To enable the display of memory transfer information, set the environment variable ACL_PROFILE_TIMER to a value of 1 and then run your host application. Setting the ACL_PROFILE_TIMER environment variable enables the recording of memory transfers. The information is stored in the profile.mon file and is then parsed by the Intel® FPGA Dynamic Profiler for OpenCL™ GUI.
Autorun Captures Tab
The autorun profile data is displayed similarly to the enqueued profile data. However, autorun kernels do not have a runtime representation in the Kernel Execution tab because autorun kernels run continuously.
If you profile autorun kernels at least once, the Autorun Captures tab appears in the Intel® FPGA Dynamic Profiler for OpenCL™ GUI. This tab displays a table of all autorun profile captures organized by device and kernel. To view the profile data of an autorun kernel for a specific capture, select the associated button and a new profiler window opens to display data only from that autorun capture (instead of the overall average).
In the following figure, there are four autorun capture instances. If you want to view the autorun profile data from the capture done at 0.03ms for the streamer autorun kernel on device 0, then select the 0.03ms button in the Device 0 streamer row.

The Profiler Captures buttons are labeled with the time at which each capture started. This time is relative to the start of the host program.
Interpreting the Profiling Information
Following are explanations on Intel® FPGA Dynamic Profiler for OpenCL™ metrics that are recorded in the Profiler reports.
- Stall, Occupancy, Bandwidth: For specific lines of kernel code, the Source Code tab in the Intel® FPGA Dynamic Profiler for OpenCL™ GUI shows stall percentage, occupancy percentage, and average memory bandwidth.
- Activity: Activity measures the percentage of time that a predicated instruction is enabled, that is, the percentage of time an LSU receives data that it acts on.
- Cache Hit: Cache hit rate measures the effectiveness of a private cache.
- Profiler Analyses of Example OpenCL Design Scenarios: Understanding the problems and solutions presented in example OpenCL design scenarios might help you leverage the Profiler metrics of your design to optimize its performance.
- Autorun Profiler Data: Similar to enqueued kernels, you can view the autorun profiler statistical data by launching the Intel® FPGA Dynamic Profiler for OpenCL™ GUI with the aocl command.
Stall, Occupancy, Bandwidth
For definitions of stall, occupancy, and bandwidth, refer to Table 9.
The Intel® FPGA SDK for OpenCL™ generates a pipeline architecture where work-items traverse through the pipeline stages sequentially (that is, in a pipeline-parallel manner). As soon as a pipeline stage becomes empty, a work-item enters and occupies the stage. Pipeline parallelism also applies to iterations of pipelined loops, where iterations enter a pipelined loop sequentially.
The Profiler calculates stall, occupancy, and bandwidth from the performance counters that it attaches to each load, store, and channel operation.
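In simplified form, and consistent with the metric definitions in the Source Code tab table (the Profiler's exact counter arithmetic may differ), the metrics behave approximately as follows:

Stall%     = (cycles the access stalls the pipeline / total profiled cycles) x 100%
Occupancy% = (cycles a valid work-item executes the instruction / total profiled cycles) x 100%
Bandwidth  = bytes transferred by the access / total profiled execution time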
Ideal kernel pipeline conditions:
- Stall percentage equals 0%
- Occupancy percentage equals 100%
- Bandwidth equals the board's bandwidth
For a given location in the kernel pipeline, if the sum of the stall percentage and the occupancy percentage approximately equals 100%, the Profiler identifies the location as the stall source. If the stall percentage is low, the Profiler identifies the location as the victim of the stall.
The Profiler reports a high occupancy percentage if the offline compiler generates a highly efficient pipeline from your kernel, where work-items or iterations are moving through the pipeline stages without stalling.
If all LSUs are accessed the same number of times, they will have the same occupancy value.
- If work-items cannot enter the pipeline consecutively, they insert bubbles into the pipeline.
- In loop pipelining, loop-carried dependencies also form bubbles in the pipeline because of bubbles that exist between iterations.
- If an LSU is accessed less frequently than other LSUs, such as the case when an LSU is outside a loop that contains other LSUs, this LSU will have a lower occupancy value than the other LSUs.
The same rule regarding occupancy value applies to channels.
Stalling Channels
A stalling channel indicates that one side of the channel cannot keep pace with the other. For example, if a kernel has a read channel call to an Ethernet I/O and the Profiler identifies a stall, it implies that the write channel is not writing data to the Ethernet I/O at the same rate as the read rate of the kernel.
For kernel-to-kernel channels, stalls occur if there is an imbalance between the read and write sides of the channel, or if the read and write kernels are not running concurrently.
For example, if the kernel that reads is not launched concurrently with the kernel that writes, or if the read operations occur much slower than the write operations, the Profiler identifies a stall for the write_channel_intel call in the write kernel.
Activity
In the Source Code tab of the Intel® FPGA Dynamic Profiler for OpenCL™ GUI, the tool tip on the Occupancy% column might specify an Activity percentage. Activity differs from occupancy in that activity relates to predication, as explained below.
Each LSU has a predicate signal in addition to the ivalid signal. The ivalid signal indicates that upstream logic is providing valid data to the LSU. The predicate signal indicates that the LSU should act on the data that it receives. A work-item or loop iteration can occupy a memory instruction even if the instruction is predicated. If the branch statements do not contain loops, the offline compiler converts the branches to minimize control flow, which leads to more efficient hardware. As part of the conversion, memory and channel instructions must be predicated and the output results must be selected through multiplexer logic.
Consider the following code example:
int addr = compute_address();
int x = 0;
if (some_rare_condition)
  x = src[addr];
The offline compiler will modify the code as follows:
int addr = compute_address();
int x = 0;
x = src[addr] if some_rare_condition;
In this case, src[] receives a valid address every clock cycle. Assuming src[] itself does not generate stalls into the pipeline, the ivalid signal for src[] will be high most of the time. In actuality, src[] only performs loading if the predicate signal some_rare_condition is true. Therefore, for this load operation, occupancy will be high but activity will be low.
Because activity percentages available in the tool tips do not account for predicated accesses, you can identify predicated instructions based on low activity percentages. Despite having low activity percentages, these instructions might have high occupancies.
Cache Hit
In the Source Code tab of the Intel® FPGA Dynamic Profiler for OpenCL™ GUI, the tool tip on the Attributes column might specify a Cache Hit rate. For some global load units, the Intel® FPGA SDK for OpenCL™ Offline Compiler might instantiate a private cache. In this case, the offline compiler creates an additional hardware counter to measure the effectiveness of this cache. Note that details of this private cache are available in the HTML area report.
Profiler Analyses of Example OpenCL Design Scenarios
High Stall Percentage
Memory instructions stall often whenever bandwidth usage is inefficient or if a large amount of data transfer is necessary during the execution of your application. Inefficient memory accesses lead to suboptimal bandwidth utilization. In such cases, analyze your kernel memory accesses for possible improvements.
Channel instructions stall whenever there is a strong imbalance between read and write accesses to the channel. Imbalances might be caused by channel reads or writes operating at different rates.
For example, if you find that the stall percentage of a write channel call is high, check to see if the occupancy and activity of the read channel call are low. If they are, the performing speed of the kernel controlling the read channel call is too slow for the kernel controlling the write channel call, leading to a performance bottleneck.
If a memory or channel access causes a high percentage of pipeline stalls, the line in the source code that issues the memory or channel instruction is highlighted in red. A stall percentage of 20% or higher results in a high stall identification. The higher the stall percentage, the darker the red highlight. To traverse the high stall percentage values easily, use the right and left arrows at the bottom right corner of the Source Code tab.

Low Occupancy Percentage
Consider the following code example:
__kernel void proc (__global int * a, ...)
{
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < 1000; j++) {
      write_channel_intel(c0, data0);
    }
    for (int k = 0; k < 3; k++) {
      write_channel_intel(c1, data1);
    }
  }
}
Assuming all the loops are pipelined, the first inner loop with a trip count of 1000 is the critical loop. The second inner loop with a trip count of three will be executed infrequently. As a result, you can expect that the occupancy and activity percentages for channel c0 are high and for channel c1 are low.
Occupancy percentage might also be low if you define a small work-group size and the kernel does not receive sufficient work-items. This is problematic because the pipeline is mostly empty for the duration of kernel execution, which leads to poor performance.
Low Bandwidth Efficiency
Review your memory accesses to see if you can rewrite them so that they address consecutive memory regions.
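For example, the following illustrative kernels (the kernel names are arbitrary) perform the same copy, but the second one makes consecutive work-items address consecutive memory locations, which the offline compiler can turn into wider, burst-friendly global memory accesses:

// Strided: consecutive work-items touch addresses that are stride elements apart.
__kernel void strided_copy (__global const int * restrict in,
                            __global int * restrict out,
                            unsigned stride)
{
  size_t gid = get_global_id(0);
  out[gid * stride] = in[gid * stride];
}

// Sequential: consecutive work-items touch consecutive addresses.
__kernel void sequential_copy (__global const int * restrict in,
                               __global int * restrict out)
{
  size_t gid = get_global_id(0);
  out[gid] = in[gid];
}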
High Stall and High Occupancy Percentages
Usually, the sum of the stall and occupancy percentages approximately equals 100%. If a load or store operation or a channel has a high stall percentage, it means that the operation or channel has the ability to execute every cycle but is generating stalls.
Solutions for stalling global load and store operations:
- Use local memory to cache data.
- Reduce the number of times you read the data.
- Improve global memory accesses.
- Change the access pattern for more global-memory-friendly addressing (for example, change from stride accessing to sequential accessing).
- Compile your kernel with the -no-interleaving=default Intel® FPGA SDK for OpenCL™ Offline Compiler command option, and separate the read and write buffers into different DDR banks.
- Have fewer but wider global memory accesses.
- Acquire an accelerator board that has more bandwidth (for example, a board with three DDRs instead of two).
Solution for stalling local load and store operations:
- Review the HTML area report to verify the local memory configuration and modify the configuration to make it stall-free.
Solutions for stalling channels:
- Fix stalls on the other side of the channel. For example, if channel read stalls, it means that the writer to the channel is not writing data into the channel fast enough and needs to be adjusted.
- If there are channel loops in your design, specify the channel depth.
No Stalls, Low Occupancy Percentage, and Low Bandwidth Efficiency

In this example, dst[] is executed once every 20 iterations of the FACTOR2 loop and once every four iterations of the FACTOR1 loop. Therefore, the FACTOR2 loop is the source of the bottleneck.
Solutions for resolving loop bottlenecks:
- Unroll the FACTOR1 and FACTOR2 loops evenly. Simply unrolling the FACTOR1 loop further does not resolve the bottleneck.
- Vectorize your kernel to allow multiple work-items to execute during each loop iteration.
No Stalls, High Occupancy Percentage, and Low Bandwidth Efficiency

In this example, the accelerator board can provide a bandwidth of 25600 megabytes per second (MB/s). However, the vector_add kernel is requesting (2 reads + 1 write) x 4 bytes x 294 MHz = 12 bytes/cycle x 294 MHz = 3528 MB/s, which is 14% of the available bandwidth. To increase the bandwidth, increase the number of tasks performed in each clock cycle.
Solutions for low bandwidth:
- Automatically or manually vectorize the kernel to make wider requests
- Unroll the innermost loop to make more requests per clock cycle
- Delegate some of the tasks to another kernel
High Stall and Low Occupancy Percentages
Autorun Profiler Data
Intel FPGA Dynamic Profiler for OpenCL Limitations
- The Profiler can only extract one set of profile data from a kernel while it is running. If the Profiler collects the profile data after kernel execution completes, you can call the host API to generate the profile.mon file multiple times. For more information on how to collect profile data during kernel execution, refer to the Collecting Profile Data During Kernel Execution section of the Intel® FPGA SDK for OpenCL™ Standard Edition Programming Guide.
- Profile data is not persistent across OpenCL programs or multiple devices. You can request profile data from a single OpenCL program and on a single device only. If your host swaps a new kernel program in and out of the FPGA, the Profiler will not save the profile data.
- Instrumenting the Verilog code with performance counters increases hardware resource utilization (that is, FPGA area usage) and typically decreases performance. For information on instrumenting the Verilog code with performance counters, refer to the Instrumenting the Kernel Pipeline with Performance Counters section of the Intel® FPGA SDK for OpenCL™ Standard Edition Programming Guide.
Strategies for Improving Single Work-Item Kernel Performance
- Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback: In many cases, designing your OpenCL™ application as a single work-item kernel is sufficient to maximize performance without performing additional optimization steps.
- Removing Loop-Carried Dependencies Caused by Accesses to Memory Arrays: Include the ivdep pragma in your single work-item kernel to assert that accesses to memory arrays will not cause loop-carried dependencies.
- Good Design Practices for Single Work-Item Kernel: If your OpenCL™ kernels contain loop structures, follow the Intel®-recommended guidelines to construct the kernels in a way that allows the Intel® FPGA SDK for OpenCL™ Offline Compiler to analyze them effectively.
Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback
The following flowchart outlines the approach you can take to iterate on your design and optimize your single work-item kernel. For usage information on the Intel® FPGA SDK for OpenCL™ Emulator and the Profiler, refer to the Emulating and Debugging Your OpenCL Kernel and Profiling Your OpenCL Kernel sections of the Intel® FPGA SDK for OpenCL™ Standard Edition Programming Guide, respectively. For information on the Intel® FPGA Dynamic Profiler for OpenCL™ GUI and profiling information, refer to the Profile Your Kernel to Identify Performance Bottlenecks section.
Intel® recommends the following optimization options to address single work-item kernel loop-carried dependencies, in order of applicability: removal, relaxation, simplification, and transfer to local memory.
- Removing Loop-Carried Dependency
Based on the feedback from the optimization report, you can remove a loop-carried dependency by implementing a simpler memory access pattern.
- Relaxing Loop-Carried Dependency
Based on the feedback from the optimization report, you can relax a loop-carried dependency by increasing the dependence distance.
- Simplifying Loop-Carried Dependency
In cases where you cannot remove or relax the loop-carried dependency in your kernel, you might be able to simplify the dependency to improve single work-item kernel performance.
- Transferring Loop-Carried Dependency to Local Memory
For a loop-carried dependency that you cannot remove, improve the II by moving the array with the loop-carried dependency from global memory to local memory.
- Removing Loop-Carried Dependency by Inferring Shift Registers
To enable the Intel® FPGA SDK for OpenCL™ Offline Compiler to handle single work-item kernels that carry out double precision floating-point operations efficiently, remove loop-carried dependencies by inferring a shift register.
Removing Loop-Carried Dependency
Consider the following kernel:
 1 #define N 128
 2
 3 __kernel void unoptimized (__global int * restrict A,
 4                            __global int * restrict B,
 5                            __global int* restrict result)
 6 {
 7   int sum = 0;
 8
 9   for (unsigned i = 0; i < N; i++) {
10     for (unsigned j = 0; j < N; j++) {
11       sum += A[i*N+j];
12     }
13     sum += B[i];
14   }
15
16   * result = sum;
17 }
The optimization report for kernel unoptimized resembles the following:
==================================================================================
Kernel: unoptimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Loop "Block1" (file k.cl line 9)
| Pipelined with successive iterations launched every 2 cycles due to:
|
|   Pipeline structure: every terminating loop with subloops has iterations
|   launched at least 2 cycles apart.
|   Having successive iterations launched every two cycles should still lead to
|   good performance if the inner loop is pipelined well and has sufficiently
|   high number of iterations.
|
|   Iterations executed serially across the region listed below.
|   Only a single loop iteration will execute inside the listed region.
|   This will cause performance degradation unless the region is pipelined well
|   (can process an iteration every cycle).
|
|   Loop "Block2" (file k.cl line 10)
|   due to:
|     Data dependency on variable sum (file k.cl line 7)
|
|
|-+ Loop "Block2" (file k.cl line 10)
    Pipelined well. Successive iterations are launched every cycle.
- The first row of the report indicates that the Intel® FPGA SDK for OpenCL™ Offline Compiler successfully infers pipelined execution for the outer loop, and a new loop iteration will launch every other cycle.
- The message due to Pipeline structure indicates that the offline compiler creates a pipeline structure that causes an outer loop iteration to launch every two cycles. This behavior is not a result of how you structure your kernel code.
Note: For recommendations on how to structure your single work-item kernel, refer to the Good Design Practices for Single Work-Item Kernel section.
- The remaining messages in the first row of the report indicate that the loop executes a single iteration at a time across the subloop because of the data dependency on the variable sum. This data dependency exists because each outer loop iteration requires the value of sum from the previous iteration to return before the inner loop can start executing.
- The second row of the report notifies you that the inner loop executes in a pipelined fashion with no performance-limiting loop-carried dependencies.
To optimize the performance of this kernel, remove the data dependency on variable sum so that the outer loop iterations do not execute serially across the subloop. Perform the following tasks to decouple the computations involving sum in the two loops:
- Define a local variable (for example, sum2) for use in the inner loop only.
- Use the local variable from Step 1 to store the cumulative values of A[i*N + j] as the inner loop iterates.
- In the outer loop, use the variable sum to store the cumulative values of B[i] and the value stored in the local variable.
 1 #define N 128
 2
 3 __kernel void optimized (__global int * restrict A,
 4                          __global int * restrict B,
 5                          __global int * restrict result)
 6 {
 7   int sum = 0;
 8
 9   for (unsigned i = 0; i < N; i++) {
10     // Step 1: Definition
11     int sum2 = 0;
12
13     // Step 2: Accumulation of array A values for one outer loop iteration
14     for (unsigned j = 0; j < N; j++) {
15       sum2 += A[i*N+j];
16     }
17
18     // Step 3: Addition of array B value for an outer loop iteration
19     sum += sum2;
20     sum += B[i];
21   }
22
23   * result = sum;
24 }
An optimization report similar to the one below indicates the successful removal of the loop-carried dependency on the variable sum:
==================================================================================
Kernel: optimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Loop "Block1" (file optimized.cl line 9)
| Pipelined with successive iterations launched every 2 cycles due to:
|
|   Pipeline structure: every terminating loop with subloops has iterations
|   launched at least 2 cycles apart.
|   Having successive iterations launched every two cycles should still lead to
|   good performance if the inner loop is pipelined well and has sufficiently
|   high number of iterations.
|
|
|-+ Loop "Block2" (file optimized.cl line 14)
    Pipelined well. Successive iterations are launched every cycle.
==================================================================================
You have addressed all the loop-carried dependence issues successfully when you see only the following messages in the optimization report:
- Pipelined execution inferred for innermost loops.
- Pipelined execution inferred. Successive iterations launched every 2 cycles due to: Pipeline structure for all other loops.
Relaxing Loop-Carried Dependency
Consider the following code example:
 1 #define N 128
 2
 3 __kernel void unoptimized (__global float * restrict A,
 4                            __global float * restrict result)
 5 {
 6   float mul = 1.0f;
 7
 8   for (unsigned i = 0; i < N; i++)
 9     mul *= A[i];
10
11   * result = mul;
12 }
==================================================================================
Kernel: unoptimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Loop "Block1" (file unoptimized.cl line 8)
  Pipelined with successive iterations launched every 6 cycles due to:

    Data dependency on variable mul (file unoptimized.cl line 9)
    Largest Critical Path Contributor:
      100%: Fmul Operation (file unoptimized.cl line 9)
===================================================================================
The optimization report above shows that the Intel® FPGA SDK for OpenCL™ Offline Compiler infers pipelined execution for the loop successfully. However, the loop-carried dependency on the variable mul causes loop iterations to launch every six cycles. In this case, the floating-point multiplication operation on line 9 (that is, mul *= A[i]) contributes the largest delay to the computation of the variable mul.
To relax the loop-carried data dependency, instead of using a single variable to store the multiplication results, operate on M copies of the variable and use one copy every M iterations:
- Declare multiple copies of the variable mul (for example, in an array called mul_copies).
- Initialize all the copies of mul_copies.
- Use the last copy in the array in the multiplication operation.
- Perform a shift operation on the copies, inserting the newly computed value at the beginning of the shift register.
- Reduce all the copies to mul and write the final value to result.
 1 #define N 128
 2 #define M 8
 3
 4 __kernel void optimized (__global float * restrict A,
 5                          __global float * restrict result)
 6 {
 7   float mul = 1.0f;
 8
 9   // Step 1: Declare multiple copies of variable mul
10   float mul_copies[M];
11
12   // Step 2: Initialize all copies
13   for (unsigned i = 0; i < M; i++)
14     mul_copies[i] = 1.0f;
15
16   for (unsigned i = 0; i < N; i++) {
17     // Step 3: Perform multiplication on the last copy
18     float cur = mul_copies[M-1] * A[i];
19
20     // Step 4a: Shift copies
21     #pragma unroll
22     for (unsigned j = M-1; j > 0; j--)
23       mul_copies[j] = mul_copies[j-1];
24
25     // Step 4b: Insert updated copy at the beginning
26     mul_copies[0] = cur;
27   }
28
29   // Step 5: Perform reduction on copies
30   #pragma unroll
31   for (unsigned i = 0; i < M; i++)
32     mul *= mul_copies[i];
33
34   * result = mul;
35 }
An optimization report similar to the one below indicates the successful relaxation of the loop-carried dependency on the variable mul:
==================================================================================
Kernel: optimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Fully unrolled loop (file optimized2.cl line 13)
  Loop was automatically and fully unrolled.
  Add "#pragma unroll 1" to prevent automatic unrolling.

+ Loop "Block1" (file optimized2.cl line 16)
| Pipelined well. Successive iterations are launched every cycle.
|
|
|-+ Fully unrolled loop (file optimized2.cl line 22)
    Loop was fully unrolled due to "#pragma unroll" annotation.

+ Fully unrolled loop (file optimized2.cl line 31)
  Loop was fully unrolled due to "#pragma unroll" annotation.
Simplifying Loop-Carried Dependency
Consider the following kernel example:
 1 #define N 128
 2 #define NUM_CH 3
 3
 4 channel uchar CH_DATA_IN[NUM_CH];
 5 channel uchar CH_DATA_OUT;
 6
 7 __kernel void unoptimized()
 8 {
 9   unsigned storage = 0;
10   unsigned num_bytes = 0;
11
12   for (unsigned i = 0; i < N; i++) {
13
14     #pragma unroll
15     for (unsigned j = 0; j < NUM_CH; j++) {
16       if (num_bytes < NUM_CH) {
17         bool valid = false;
18         uchar data_in = read_channel_nb_intel(CH_DATA_IN[j], &valid);
19         if (valid) {
20           storage <<= 8;
21           storage |= data_in;
22           num_bytes++;
23         }
24       }
25     }
26
27     if (num_bytes >= 1) {
28       num_bytes -= 1;
29       uchar data_out = storage >> (num_bytes*8);
30       write_channel_intel(CH_DATA_OUT, data_out);
31     }
32   }
33 }
This kernel reads one byte of data from three input channels in a nonblocking fashion. It then writes the data one byte at a time to an output channel. It uses the variable storage to store up to 4 bytes of data, and uses the variable num_bytes to keep track of how many bytes are stored in storage. If storage has space available, then the kernel reads a byte of data from one of the channels and stores it in the least significant byte of storage.
The optimization report below indicates that there is a loop-carried dependency on the variable num_bytes:
==================================================================================
Kernel: unoptimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Loop "Block1" (file unoptimized3.cl line 12)
| Pipelined with successive iterations launched every 7 cycles due to:
|
|   Data dependency on variable num_bytes (file unoptimized3.cl line 10)
|   Largest Critical Path Contributors:
|     16%: Integer Compare Operation (file unoptimized3.cl line 16)
|     16%: Integer Compare Operation (file unoptimized3.cl line 16)
|     16%: Integer Compare Operation (file unoptimized3.cl line 16)
|      7%: Integer Compare Operation (file unoptimized3.cl line 27)
|      6%: Add Operation (file unoptimized3.cl line 10, line 22, line 28)
|      6%: Add Operation (file unoptimized3.cl line 10, line 22, line 28)
|      6%: Add Operation (file unoptimized3.cl line 10, line 22, line 28)
|      3%: Non-Blocking Channel Read Operation (file unoptimized3.cl line 18)
|      3%: Non-Blocking Channel Read Operation (file unoptimized3.cl line 18)
|      3%: Non-Blocking Channel Read Operation (file unoptimized3.cl line 18)
|
|
|-+ Fully unrolled loop (file unoptimized3.cl line 15)
    Loop was fully unrolled due to "#pragma unroll" annotation.
The computation path of num_bytes is as follows:
- Comparison on line 16 (if (num_bytes < NUM_CH)).
- Computation of variable valid by the nonblocking channel read operation on line 18 (uchar data_in = read_channel_nb_intel(CH_DATA_IN[j], &valid)) for the comparison on line 19.
- Addition on line 22 (num_bytes++).
- Comparison on line 27 (if (num_bytes >= 1)).
- Subtraction on line 28 (num_bytes -= 1).
Because of the unroll pragma on line 14, the Intel® FPGA SDK for OpenCL™ Offline Compiler unrolls the loop, causing the comparisons and additions in the loop body to replicate three times. The optimization report shows that the comparisons are the most expensive operations on the computation path of num_bytes, followed by the additions on line 22.
To simplify the loop-carried dependency on num_bytes, consider restructuring the application to perform the following tasks:
- Ensure that the kernel reads from the channels only if there is enough space available in storage, in the event that all channel read operations return data (that is, there are at least 3 bytes of empty space in storage).
Setting this condition simplifies the computation path of the variable num_bytes by reducing the number of comparisons.
- Increase the size of storage from 4 bytes to 8 bytes to satisfy the 3-byte space threshold more easily.
 1 #define N 128
 2 #define NUM_CH 3
 3
 4 channel uchar CH_DATA_IN[NUM_CH];
 5 channel uchar CH_DATA_OUT;
 6
 7 __kernel void optimized()
 8 {
 9   // Change storage to 64 bits
10   ulong storage = 0;
11   unsigned num_bytes = 0;
12
13   for (unsigned i = 0; i < N; i++) {
14
15     // Ensure that we have enough space if we read from ALL channels
16     if (num_bytes <= (8-NUM_CH)) {
17       #pragma unroll
18       for (unsigned j = 0; j < NUM_CH; j++) {
19         bool valid = false;
20         uchar data_in = read_channel_nb_intel(CH_DATA_IN[j], &valid);
21         if (valid) {
22           storage <<= 8;
23           storage |= data_in;
24           num_bytes++;
25         }
26       }
27     }
28
29     if (num_bytes >= 1) {
30       num_bytes -= 1;
31       uchar data_out = storage >> (num_bytes*8);
32       write_channel_intel(CH_DATA_OUT, data_out);
33     }
34   }
35 }
An optimization report similar to the one below indicates the successful simplification of the loop-carried dependency on the variable num_bytes:
==================================================================================
Kernel: optimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Loop "Block1" (file optimized3.cl line 13)
| Pipelined well. Successive iterations are launched every cycle.
|
|
|-+ Fully unrolled loop (file optimized3.cl line 18)
    Loop was fully unrolled due to "#pragma unroll" annotation.
Transferring Loop-Carried Dependency to Local Memory
Consider the following kernel example:
1 #define N 128
2
3 __kernel void unoptimized( __global int* restrict A )
4 {
5   for (unsigned i = 0; i < N; i++)
6     A[N-i] = A[i];
7 }
==================================================================================
Kernel: unoptimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Loop "Block1" (file unoptimized4.cl line 5)
  Pipelined with successive iterations launched every 324 cycles due to:

    Memory dependency on Load Operation from: (file unoptimized4.cl line 6)
      Store Operation (file unoptimized4.cl line 6)
    Largest Critical Path Contributors:
      49%: Load Operation (file unoptimized4.cl line 6)
      49%: Store Operation (file unoptimized4.cl line 6)
Global memory accesses have long latencies. In this example, the loop-carried dependency on the array A[i] causes the long latency. This latency is reflected by an II of 324 in the optimization report. To reduce the II value by transferring the loop-carried dependency from global memory to local memory, perform the following tasks:
- Copy the array with the loop-carried dependency to local memory. In this example, array A[i] becomes array B[i] in local memory.
- Execute the loop with the loop-carried dependence on array B[i].
- Copy the array back to global memory.
Below is the restructured kernel optimized:
 1 #define N 128
 2
 3 __kernel void optimized( __global int* restrict A )
 4 {
 5   int B[N];
 6
 7   for (unsigned i = 0; i < N; i++)
 8     B[i] = A[i];
 9
10   for (unsigned i = 0; i < N; i++)
11     B[N-i] = B[i];
12
13   for (unsigned i = 0; i < N; i++)
14     A[i] = B[i];
15 }
An optimization report similar to the one below indicates the successful reduction of II from 324 to 2:
==================================================================================
Kernel: optimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Loop "Block1" (file optimized4.cl line 7)
  Pipelined well. Successive iterations are launched every cycle.

+ Loop "Block2" (file optimized4.cl line 10)
  Pipelined with successive iterations launched every 2 cycles due to:

    Memory dependency on Load Operation from: (file optimized4.cl line 11)
      Store Operation (file optimized4.cl line 11)
    Largest Critical Path Contributors:
      65%: Load Operation (file optimized4.cl line 11)
      34%: Store Operation (file optimized4.cl line 11)

+ Loop "Block3" (file optimized4.cl line 13)
  Pipelined well. Successive iterations are launched every cycle.
Removing Loop-Carried Dependency by Inferring Shift Registers
Consider the following kernel:
 1 __kernel void double_add_1 (__global double *arr,
 2                             int N,
 3                             __global double *result)
 4 {
 5   double temp_sum = 0;
 6
 7   for (int i = 0; i < N; ++i)
 8   {
 9     temp_sum += arr[i];
10   }
11
12   *result = temp_sum;
13 }
The optimization report for kernel double_add_1 resembles the following:
==================================================================================
Kernel: double_add_1
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Loop "Block1" (file unoptimized5.cl line 7)
  Pipelined with successive iterations launched every 11 cycles due to:

    Data dependency on variable temp_sum (file unoptimized5.cl line 9)
    Largest Critical Path Contributor:
      97%: Fadd Operation (file unoptimized5.cl line 9)
The kernel double_add_1 is an accumulator that sums the elements of a double precision floating-point array arr[i]. For each loop iteration, the offline compiler takes 11 cycles to compute the result of the addition and then stores it in the variable temp_sum. Each loop iteration requires the value of temp_sum from the previous loop iteration, which creates a data dependency on temp_sum.
Below is the restructured kernel double_add_2:
 1 //Shift register size must be statically determinable
 2 #define II_CYCLES 12
 3
 4 __kernel void double_add_2 (__global double *arr,
 5                             int N,
 6                             __global double *result)
 7 {
 8   //Create shift register with II_CYCLE+1 elements
 9   double shift_reg[II_CYCLES+1];
10
11   //Initialize all elements of the register to 0
12   for (int i = 0; i < II_CYCLES + 1; i++)
13   {
14     shift_reg[i] = 0;
15   }
16
17   //Iterate through every element of input array
18   for(int i = 0; i < N; ++i)
19   {
20     //Load ith element into end of shift register
21     //if N > II_CYCLE, add to shift_reg[0] to preserve values
22     shift_reg[II_CYCLES] = shift_reg[0] + arr[i];
23
24     #pragma unroll
25     //Shift every element of shift register
26     for(int j = 0; j < II_CYCLES; ++j)
27     {
28       shift_reg[j] = shift_reg[j + 1];
29     }
30   }
31
32   //Sum every element of shift register
33   double temp_sum = 0;
34
35   #pragma unroll
36   for(int i = 0; i < II_CYCLES; ++i)
37   {
38     temp_sum += shift_reg[i];
39   }
40
41   *result = temp_sum;
42 }
The following optimization report indicates that the inference of the shift register shift_reg[II_CYCLES] successfully removes the data dependency on the variable temp_sum:
==================================================================================
Kernel: double_add_2
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Fully unrolled loop (file optimized5.cl line 12)
  Loop was automatically and fully unrolled.
  Add "#pragma unroll 1" to prevent automatic unrolling.

+ Loop "Block1" (file optimized5.cl line 18)
| Pipelined well. Successive iterations are launched every cycle.
|
|
|-+ Fully unrolled loop (file optimized5.cl line 26)
    Loop was fully unrolled due to "#pragma unroll" annotation.

+ Fully unrolled loop (file optimized5.cl line 36)
  Loop was fully unrolled due to "#pragma unroll" annotation.
Removing Loop-Carried Dependencies Caused by Accesses to Memory Arrays
- If all accesses to memory arrays that are inside a loop will not cause loop-carried dependencies, add the line #pragma ivdep before the loop in your kernel code.
Example kernel code:
// no loop-carried dependencies for A and B array accesses
#pragma ivdep
for (int i = 0; i < N; i++) {
  A[i] = A[i - X[i]];
  B[i] = B[i - Y[i]];
}
- To specify that accesses to a particular memory array inside a loop will not cause loop-carried dependencies, add the line #pragma ivdep array(array_name) before the loop in your kernel code.
The array specified by the ivdep pragma must be a local or private memory array, or a pointer variable that points to global, local, or private memory storage. If the specified array is a pointer, the ivdep pragma also applies to all arrays that may alias with the specified pointer.
The array specified by the ivdep pragma can also be an array or a pointer member of a struct.
Example kernel code:
// No loop-carried dependencies for A array accesses
// The offline compiler will insert hardware that reinforces dependency constraints for B
#pragma ivdep array(A)
for (int i = 0; i < N; i++) {
  A[i] = A[i - X[i]];
  B[i] = B[i - Y[i]];
}

// No loop-carried dependencies for array A inside struct
#pragma ivdep array(S.A)
for (int i = 0; i < N; i++) {
  S.A[i] = S.A[i - X[i]];
}

// No loop-carried dependencies for array A inside the struct pointed to by S
#pragma ivdep array(S->X[2][3].A)
for (int i = 0; i < N; i++) {
  S->X[2][3].A[i] = S.A[i - X[i]];
}

// No loop-carried dependencies for A and B because ptr aliases
// with both arrays
int *ptr = select ? A : B;
#pragma ivdep array(ptr)
for (int i = 0; i < N; i++) {
  A[i] = A[i - X[i]];
  B[i] = B[i - Y[i]];
}

// No loop-carried dependencies for A because ptr only aliases with A
int *ptr = &A[10];
#pragma ivdep array(ptr)
for (int i = 0; i < N; i++) {
  A[i] = A[i - X[i]];
  B[i] = B[i - Y[i]];
}
Good Design Practices for Single Work-Item Kernel
Avoid Pointer Aliasing
Insert the restrict keyword in pointer arguments whenever possible. Including the restrict keyword in pointer arguments prevents the offline compiler from creating unnecessary memory dependencies between non-conflicting read and write operations. Consider a loop where each iteration reads data from one array, and then it writes data to another array in the same physical memory. Without including the restrict keyword in these pointer arguments, the offline compiler might assume dependence between the two arrays and extract less pipeline parallelism as a result.
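The following hedged sketch (the kernel names and arguments are illustrative) contrasts the two cases. Without restrict, the offline compiler may conservatively assume that the source and destination buffers overlap and enforce a dependency between the load and the store in each iteration; with restrict, it can pipeline them freely:

// Without restrict: the compiler may assume src and dst alias,
// creating an unnecessary memory dependency between iterations.
__kernel void copy_no_restrict (__global const float *src,
                                __global float *dst,
                                int N)
{
  for (int i = 0; i < N; i++)
    dst[i] = src[i];
}

// With restrict: the compiler knows the buffers do not alias and can
// pipeline the loads and stores more aggressively.
__kernel void copy_restrict (__global const float * restrict src,
                             __global float * restrict dst,
                             int N)
{
  for (int i = 0; i < N; i++)
    dst[i] = src[i];
}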
Construct "Well-Formed" Loops
A "well-formed" loop has an exit condition that compares against an integer bound, and has a simple induction increment of one per iteration. Including "well-formed" loops in your kernel improves performance because the offline compiler can analyze these loops efficiently.
The following example is a "well-formed" loop:
for (i = 0; i < N; i++) {
  //statements
}
The following example is a "well-formed" nested loop structure:
for (i = 0; i < N; i++) {
  //statements
  for(j = 0; j < M; j++) {
    //statements
  }
}
Minimize Loop-Carried Dependencies
The loop structure below creates a loop-carried dependence because each loop iteration reads data written by the previous iteration. As a result, each read operation cannot proceed until the write operation from the previous iteration completes. The presence of loop-carried dependencies decreases the extent of pipeline parallelism that the offline compiler can achieve, which reduces kernel performance.
for (int i = 0; i < N; i++) { A[i] = A[i - 1] + i; }
The offline compiler performs a static memory dependence analysis on loops to determine the extent of parallelism that it can achieve. In some cases, the offline compiler might assume dependence between two array accesses and extract less pipeline parallelism as a result. The offline compiler assumes loop-carried dependence if it cannot resolve the dependencies at compilation time because of unknown variables, or if the array accesses involve complex addressing.
To minimize loop-carried dependencies, follow the guidelines below whenever possible:
- Avoid pointer arithmetic.
Compiler output is suboptimal when the kernel accesses arrays by dereferencing pointer values derived from arithmetic operations. For example, avoid accessing an array in the following manner:
for (int i = 0; i < N; i++) {
  int t = *(A++);
  *A = t;
}
- Introduce simple array indexes.
Avoid the following types of complex array indexes because the offline compiler cannot analyze them effectively, which might lead to suboptimal compiler output:
  - Nonconstants in array indexes.
    For example, A[K + i], where i is the loop index variable and K is an unknown variable.
  - Multiple index variables in the same subscript location.
    For example, A[i + 2 × j], where i and j are loop index variables for a double nested loop.
    Note: The offline compiler can analyze the array index A[i][j] effectively because the index variables are in different subscripts.
  - Nonlinear indexing.
    For example, A[i & C], where i is a loop index variable and C is a constant or a nonconstant variable.
- Use loops with constant bounds in your kernel whenever possible.
Loops with constant bounds allow the offline compiler to perform range analysis effectively, as illustrated in the sketch below.
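A minimal sketch of the difference (the kernel names and the runtime bound n are illustrative):

// Constant trip count: the offline compiler can fully analyze the iteration range.
#define N 128

__kernel void constant_bound (__global int * restrict data)
{
  for (int i = 0; i < N; i++)
    data[i] += 1;
}

// Runtime trip count n: range analysis is less effective, which can limit optimization.
__kernel void variable_bound (__global int * restrict data, int n)
{
  for (int i = 0; i < n; i++)
    data[i] += 1;
}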
Avoid Complex Loop Exit Conditions
The offline compiler evaluates exit conditions to determine if subsequent loop iterations can enter the loop pipeline. There are times when the offline compiler requires memory accesses or complex operations to evaluate the exit condition. In these cases, subsequent iterations cannot launch until the evaluation completes, decreasing overall loop performance.
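As a hedged illustration (the kernel and variable names are hypothetical), an exit condition that depends on a global memory load forces each iteration launch to wait for that load, whereas a bound read once before the loop reduces the exit condition to a simple integer comparison:

// The exit condition requires a global memory load every iteration;
// subsequent iterations cannot launch until the load completes.
__kernel void complex_exit (__global const int * restrict flags,
                            __global int * restrict data)
{
  int i = 0;
  while (flags[i] != 0) {
    data[i] += 1;
    i++;
  }
}

// The bound is loaded once before the loop, so the exit condition is a
// simple comparison against an integer held in a register.
__kernel void simple_exit (__global const int * restrict count,
                           __global int * restrict data)
{
  const int n = count[0];
  for (int i = 0; i < n; i++)
    data[i] += 1;
}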
Convert Nested Loops into a Single Loop
To maximize performance, combine nested loops into a single form whenever possible. Restructuring nested loops into a single loop reduces hardware footprint and computational overhead between loop iterations.
The following code examples illustrate the conversion of a nested loop into a single loop:
Nested Loop:

for (i = 0; i < N; i++) {
  //statements
  for (j = 0; j < M; j++) {
    //statements
  }
  //statements
}

Converted Single Loop:

for (i = 0; i < N*M; i++) {
  //statements
}
Declare Variables in the Deepest Scope Possible
To reduce the hardware resources necessary for implementing a variable, declare the variable prior to its use in a loop. Declaring variables in the deepest scope possible minimizes data dependencies and hardware usage because the offline compiler does not need to preserve the variable data across loops that do not use the variables.
Consider the following example:
int a[N];

for (int i = 0; i < m; ++i)
{
  int b[N];

  for (int j = 0; j < n; ++j)
  {
    // statements
  }
}
The array a requires more resources to implement than the array b. To reduce hardware usage, declare array a in the deepest scope possible, for example inside the outer loop in the same scope as array b, unless it is necessary to maintain its data through iterations of the outer loop.
Strategies for Improving NDRange Kernel Data Processing Efficiency
Consider the following kernel code:
__kernel void sum (__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
  size_t gid = get_global_id(0);

  answer[gid] = a[gid] + b[gid];
}
This kernel adds arrays a and b, one element at a time. Each work-item is responsible for adding two elements, one from each array, and storing the sum into the array answer. Without optimization, the kernel performs one addition per work-item.
- Specifying a Maximum Work-Group Size or a Required Work-Group Size
Specify the max_work_group_size or reqd_work_group_size attribute for your kernels whenever possible. These attributes allow the Intel® FPGA SDK for OpenCL™ Offline Compiler to perform aggressive optimizations to match the kernel to hardware resources without any excess logic.
- Kernel Vectorization
Kernel vectorization allows multiple work-items to execute in a single instruction multiple data (SIMD) fashion.
- Multiple Compute Units
To achieve higher throughput, the Intel® FPGA SDK for OpenCL™ Offline Compiler can generate multiple compute units for each kernel.
- Combination of Compute Unit Replication and Kernel SIMD Vectorization
If your replicated or vectorized OpenCL kernel does not fit in the FPGA, you can modify the kernel by both replicating the compute unit and vectorizing the kernel.
- Reviewing Kernel Properties and Loop Unroll Status in the HTML Report
When you compile an NDRange kernel, the Intel® FPGA SDK for OpenCL™ Offline Compiler generates a <your_kernel_filename>/reports/report.html file that provides information on select kernel properties and loop unroll status.
Specifying a Maximum Work-Group Size or a Required Work-Group Size
The offline compiler assumes a default work-group size for your kernel depending on certain constraints imposed during compilation time and runtime.
The offline compiler imposes the following constraints at compilation time:
- If you specify a value for the reqd_work_group_size attribute, the work-group size must match this value.
- If you specify a value for the max_work_group_size attribute, the work-group size must not exceed this value.
- If you do not specify values for reqd_work_group_size and max_work_group_size, and the kernel contains a barrier, the offline compiler defaults to a maximum work-group size of 256 work-items.
- If you do not specify values for both attributes and the kernel does not contain any barrier, the offline compiler does not impose any constraint on the work-group size at compilation time.
The OpenCL™ standard imposes the following constraints at runtime:
- The work-group size in each dimension must divide evenly into the requested NDRange size in each dimension.
- The work-group size must not exceed the device constraints specified by the CL_DEVICE_MAX_WORK_GROUP_SIZE and CL_DEVICE_MAX_WORK_ITEM_SIZES queries to the clGetDeviceInfo API call.
If you do not specify values for both the reqd_work_group_size and max_work_group_size attributes, the runtime determines a default work-group size as follows:
- If the kernel contains a barrier or refers to the local work-item ID, or if you use the clGetKernelWorkGroupInfo and clGetDeviceInfo API calls in your host code to query the work-group size, the runtime defaults the work-group size to one work-item.
- If the kernel does not contain a barrier or refer to the local work-item ID, or if your host code does not query the work-group size, the default work-group size is the global NDRange size.
When queuing an NDRange kernel (that is, not a single work-item kernel), specify an explicit work-group size under the following conditions:
- If your kernel uses memory barriers, local memory, or local work-item IDs.
- If your host program queries the work-group size.
If your kernel uses memory barriers, perform one of the following tasks to minimize hardware resources:
- Specify a value for the reqd_work_group_size attribute.
- Assign to the max_work_group_size attribute the smallest work-group size that accommodates all your runtime work-group size requests.
Specifying a smaller work-group size than the default at runtime might lead to excessive hardware consumption. Therefore, if you require a work-group size other than the default, specify the max_work_group_size attribute to set a maximum work-group size. If the work-group size remains constant through all kernel invocations, specify a required work-group size by including the reqd_work_group_size attribute. The reqd_work_group_size attribute instructs the offline compiler to allocate exactly the correct amount of hardware to manage the number of work-items per work-group you specify. This allocation results in hardware resource savings and improved efficiency in the implementation of kernel compute units. By specifying the reqd_work_group_size attribute, you also prevent the offline compiler from implementing additional hardware to support work-groups of unknown sizes.
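As a hedged sketch (the value of 128 is an arbitrary illustration), a kernel whose runtime work-group size varies but never exceeds 128 work-items can cap the allocated hardware with the max_work_group_size attribute:

__attribute__((max_work_group_size(128)))
__kernel void sum (__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
  size_t gid = get_global_id(0);

  answer[gid] = a[gid] + b[gid];
}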
For example, the code fragment below assigns a fixed work-group size of 64 work-items to a kernel:
__attribute__((reqd_work_group_size(64,1,1)))
__kernel void sum (__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
  size_t gid = get_global_id(0);

  answer[gid] = a[gid] + b[gid];
}
Kernel Vectorization
Include the num_simd_work_items attribute in your kernel code to direct the offline compiler to perform more additions per work-item without modifying the body of the kernel. The following code fragment applies a vectorization factor of four to the original kernel code:
__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(64,1,1)))
__kernel void sum (__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
  size_t gid = get_global_id(0);

  answer[gid] = a[gid] + b[gid];
}
To use the num_simd_work_items attribute, you must also specify a required work-group size of the kernel using the reqd_work_group_size attribute. The work-group size you specify for reqd_work_group_size must be divisible by the value you assign to num_simd_work_items. In the code example above, the kernel has a fixed work-group size of 64 work-items. Within each work-group, the work-items are distributed evenly among the four SIMD vector lanes. After the offline compiler implements the four SIMD vector lanes, each work-item now performs four times more work.
The offline compiler vectorizes the code and might coalesce memory accesses. You do not need to change any kernel code or host code because the offline compiler applies these optimizations automatically.
You can vectorize your kernel code manually, but you must adjust the NDRange in your host application to reflect the amount of vectorization you implement. The following example shows the changes in the code when you duplicate operations in the kernel manually:
__kernel void sum (__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
  size_t gid = get_global_id(0);

  answer[gid * 4 + 0] = a[gid * 4 + 0] + b[gid * 4 + 0];
  answer[gid * 4 + 1] = a[gid * 4 + 1] + b[gid * 4 + 1];
  answer[gid * 4 + 2] = a[gid * 4 + 2] + b[gid * 4 + 2];
  answer[gid * 4 + 3] = a[gid * 4 + 3] + b[gid * 4 + 3];
}
In this form, the kernel loads four elements from arrays a and b, calculates the sums, and stores the results into the array answer. Because the FPGA pipeline loads and stores data to neighboring locations in memory, you can manually direct the offline compiler to coalesce each group of four load and store operations.
Static Memory Coalescing
The figure below shows a common case where kernel performance might benefit from static memory coalescing:
Consider the following vectorized kernel:
__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(64,1,1)))
__kernel void sum (__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
  size_t gid = get_global_id(0);

  answer[gid] = a[gid] + b[gid];
}
The OpenCL™ kernel performs four load operations that access consecutive locations in memory. Instead of performing four memory accesses to competing locations, the offline compiler coalesces the four loads into a single wider vector load. This optimization reduces the number of accesses to a memory system and potentially leads to better memory access patterns.
Although the offline compiler performs static memory coalescing automatically when it vectorizes the kernel, you should use wide vector loads and stores in your OpenCL code whenever possible to ensure efficient memory accesses. To implement static memory coalescing manually, you must write your code in such a way that a sequential access pattern can be identified at compilation time. The original kernel code shown in the figure above can benefit from static memory coalescing because all the indexes into buffers a and b increment with offsets that are known at compilation time. In contrast, the following code does not allow static memory coalescing to occur:
__kernel void test (__global float * restrict a,
                    __global float * restrict b,
                    __global float * restrict answer,
                    __global int * restrict offsets)
{
  size_t gid = get_global_id(0);

  answer[gid*4 + 0] = a[gid*4 + 0 + offsets[gid]] + b[gid*4 + 0];
  answer[gid*4 + 1] = a[gid*4 + 1 + offsets[gid]] + b[gid*4 + 1];
  answer[gid*4 + 2] = a[gid*4 + 2 + offsets[gid]] + b[gid*4 + 2];
  answer[gid*4 + 3] = a[gid*4 + 3 + offsets[gid]] + b[gid*4 + 3];
}
The value offsets[gid] is unknown at compilation time. As a result, the offline compiler cannot statically coalesce the read accesses to buffer a.
Multiple Compute Units
To increase overall kernel throughput, the hardware scheduler in the FPGA dispatches work-groups to additional available compute units. A compute unit is available for work-group assignments as long as it has not reached its full capacity.
Assume each work-group takes the same amount of time to complete its execution. If the offline compiler implements two compute units, each compute unit executes half of the work-groups. Because the hardware scheduler dispatches the work-groups, you do not need to manage this process in your own code.
The offline compiler does not automatically determine the optimal number of compute units for a kernel. To increase the number of compute units for your kernel implementation, you must specify the number of compute units that the offline compiler should create using the num_compute_units attribute, as shown in the code sample below.
__attribute__((num_compute_units(2)))
__kernel void sum (__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
  size_t gid = get_global_id(0);

  answer[gid] = a[gid] + b[gid];
}
Increasing the number of compute units achieves higher throughput. However, as shown in the figure below, you do so at the expense of increasing global memory bandwidth among the compute units. You also increase hardware resource utilization.
Compute Unit Replication versus Kernel SIMD Vectorization
Both the num_compute_units and num_simd_work_items attributes increase throughput by increasing the amount of hardware that the Intel® FPGA SDK for OpenCL™ Offline Compiler uses to implement your kernel. The num_compute_units attribute modifies the number of compute units to which work-groups can be scheduled, which also modifies the number of times a kernel accesses global memory. In contrast, the num_simd_work_items attribute modifies the amount of work a compute unit can perform in parallel on a single work-group. The num_simd_work_items attribute duplicates only the datapath of the compute unit by sharing the control logic across each SIMD vector lane.
Generally, using the num_simd_work_items attribute leads to more efficient hardware than using the num_compute_units attribute to achieve the same goal. The num_simd_work_items attribute also allows the offline compiler to coalesce your memory accesses.
Multiple compute units competing for global memory might lead to undesired memory access patterns. You can alter the undesired memory access pattern by introducing the num_simd_work_items attribute instead of the num_compute_units attribute. In addition, the num_simd_work_items attribute potentially offers the same computational throughput as the equivalent kernel compute unit duplication that the num_compute_units attribute offers.
You cannot implement the num_simd_work_items attribute in your kernel under the following circumstances:
- The value you specify for num_simd_work_items is not 2, 4, 8 or 16.
- The value of reqd_work_group_size is not divisible by num_simd_work_items.
For example, the following declaration is incorrect because 50 is not divisible by 4:
__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(50,0,0)))
- Kernels with complex control flows. You cannot vectorize lines of code within a kernel in which different work-items follow different control paths (for example, the control paths depend on get_global_id or get_local_id).
During kernel compilation, the offline compiler issues messages informing you whether the implementation of vectorization optimizations is successful. Kernel vectorization is successful if the reported vectorization factor matches the value you specify for the num_simd_work_items attribute.
Combination of Compute Unit Replication and Kernel SIMD Vectorization
Consider a case where a kernel with a num_simd_work_items attribute set to 16 does not fit in the FPGA. The kernel might fit if you modify it by duplicating a narrower SIMD kernel compute unit. Determining the optimal balance between the number of compute units and the SIMD width might require some experimentation. For example, duplicating a four lane-wide SIMD kernel compute unit three times might achieve better throughput than duplicating an eight lane-wide SIMD kernel compute unit twice.
The following example code shows how you can combine the num_compute_units and num_simd_work_items attributes in your OpenCL™ code:
__attribute__((num_simd_work_items(4)))
__attribute__((num_compute_units(3)))
__attribute__((reqd_work_group_size(8,8,1)))
__kernel void matrixMult(__global float * restrict C,
                         __global float * restrict A,
. . .
The figure below illustrates the data flow of the kernel described above. The num_compute_units attribute specifies three replicated compute units, and the num_simd_work_items attribute specifies four SIMD vector lanes.
Reviewing Kernel Properties and Loop Unroll Status in the HTML Report
Strategies for Improving Memory Access Efficiency
An interconnect topology connects shared global, constant, and local memory systems to their underlying memory. Interconnect includes access arbitration to memory ports.
Memory accesses compete for shared memory resources (that is, global, local, and constant memories). If your OpenCL kernel performs a large number of memory accesses, the Intel® FPGA SDK for OpenCL™ Offline Compiler must generate complex arbitration logic to handle the memory access requests. The complex arbitration logic might cause a drop in the maximum operating frequency (fmax), which degrades kernel performance.
The following sections discuss memory access optimizations in detail. In summary, minimizing global memory accesses is beneficial for the following reasons:
- Typically, increases in OpenCL kernel performance lead to increases in global memory bandwidth requirements.
- The maximum global memory bandwidth is much smaller than the maximum local memory bandwidth.
- The maximum computational bandwidth of the FPGA is much larger than the global memory bandwidth.
Attention: Use local, private or constant memory whenever possible to increase the memory bandwidth of the kernel.
- General Guidelines on Optimizing Memory Accesses
Optimizing the memory accesses in your OpenCL™ kernels can improve overall kernel performance.
- Optimize Global Memory Accesses
The offline compiler interleaves global memory across each of the external memory banks.
- Performing Kernel Computations Using Constant, Local or Private Memory
To optimize memory access efficiency, minimize the number of global memory accesses by performing your OpenCL™ kernel computations in constant, local, or private memory.
- Improving Kernel Performance by Banking the Local Memory
Specifying the numbanks(N) and bankwidth(M) advanced kernel attributes allows you to configure the local memory banks for parallel memory accesses.
- Optimizing Accesses to Local Memory by Controlling the Memory Replication Factor
To control the memory replication factor, include the singlepump or doublepump kernel attribute in your OpenCL™ kernel.
- Minimizing the Memory Dependencies for Loop Pipelining
The Intel® FPGA SDK for OpenCL™ Offline Compiler ensures that the memory accesses from the same thread respect the program order. When you compile an NDRange kernel, use barriers to synchronize memory accesses across threads in the same work-group.
General Guidelines on Optimizing Memory Accesses
Consider implementing the following techniques for optimizing memory accesses, whenever possible:
- If your OpenCL program has a pair of kernels, one that produces data and one that consumes that data, convert them into a single kernel that performs both functions. Also, implement helper functions to logically separate the functions of the two original kernels.
FPGA implementations favor one large kernel over separate smaller kernels. Kernel unification removes the need to write the results from one kernel into global memory temporarily before fetching the same data in the other kernel; see the sketch after this list.
- The Intel® FPGA SDK for OpenCL™ Offline Compiler implements local memory in FPGAs very differently than in GPUs. If your OpenCL kernel contains code to avoid GPU-specific local memory bank conflicts, remove that code because the offline compiler generates hardware that avoids local memory bank conflicts automatically whenever possible.
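A minimal, hypothetical sketch of the kernel unification described above (the kernel, helper function, and buffer names are illustrative, and the computations are placeholders): the original producer and consumer stages become helper functions called from one kernel, so the intermediate value stays on the device instead of round-tripping through global memory.

// Helper functions keep the logic of the two original kernels separate.
float produce (float x)
{
  return x * 2.0f;   // placeholder for the original producer computation
}

float consume (float x)
{
  return x + 1.0f;   // placeholder for the original consumer computation
}

// Single unified kernel: the intermediate value never touches global memory.
__kernel void producer_consumer (__global const float * restrict in,
                                 __global float * restrict out)
{
  size_t gid = get_global_id(0);

  float intermediate = produce(in[gid]);
  out[gid] = consume(intermediate);
}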
Optimize Global Memory Accesses
In most circumstances, the default burst-interleaved configuration leads to the best load balancing between the memory banks. However, in some cases, you might want to partition the banks manually as two non-interleaved (and contiguous) memory regions to achieve better load balancing.
The figure below illustrates the differences in memory mapping patterns between burst-interleaved and non-interleaved memory partitions.
Contiguous Memory Accesses
Consider the following code example:
__kernel void sum ( __global const float * restrict a,
                    __global const float * restrict b,
                    __global float * restrict c )
{
  size_t gid = get_global_id(0);

  c[gid] = a[gid] + b[gid];
}
The load operation from array a uses an index that is a direct function of the work-item global ID. By basing the array index on the work-item global ID, the offline compiler can direct contiguous load operations. These load operations retrieve the data sequentially from the input array and send the read data to the pipeline as required. Contiguous store operations then store elements of the result that exit the computation pipeline in sequential locations within global memory.
The following figure illustrates an example of the contiguous memory access optimization:
Contiguous load and store operations improve memory access efficiency because they lead to increased access speeds and reduced hardware resource needs. The data travels in and out of the computational portion of the pipeline concurrently, allowing overlaps between computation and memory accesses. If possible, use work-item IDs that index consecutive memory locations for load and store operations that access global memory. Sequential accesses to global memory increase memory efficiency because they provide an ideal access pattern.
Manual Partitioning of Global Memory
If your kernel accesses two buffers of equal size in memory, you can distribute your data to both memory banks simultaneously regardless of dynamic scheduling between the loads. This optimization step might increase your apparent memory bandwidth.
Heterogeneous Memory Buffers
If your FPGA board offers heterogeneous global memory types, keep in mind that they handle different memory accesses with varying efficiencies.
For example:
- Use DDR SDRAM for long sequential accesses.
- Use QDR SDRAM for random accesses.
- Use on-chip RAM for random low latency accesses.
For more information on how to allocate buffers in global memory and how to modify your host application to leverage heterogeneous buffers, refer to the Specifying Buffer Location in Global Memory and Allocating OpenCL Buffer for Manual Partitioning of Global Memory sections of the Intel® FPGA SDK for OpenCL™ Standard Edition Programming Guide.
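A hedged sketch of placing each buffer in a specific global memory type with the buffer_location kernel argument attribute (the memory names "DDR" and "QDR" are illustrative; the valid names depend on your board support package, so confirm them against the Programming Guide sections referenced above):

// Sequential, bandwidth-heavy streams in DDR; a randomly accessed table in QDR.
__kernel void heterogeneous (__attribute__((buffer_location("DDR"))) __global const float * restrict stream_in,
                             __attribute__((buffer_location("QDR"))) __global const int * restrict lookup,
                             __attribute__((buffer_location("DDR"))) __global float * restrict stream_out,
                             int N)
{
  for (int i = 0; i < N; i++)
    stream_out[i] = stream_in[i] * lookup[i & 255];
}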
Performing Kernel Computations Using Constant, Local or Private Memory
To minimize global memory accesses, you must first preload data from a group of computations from global memory to constant, local, or private memory. You perform the kernel computations on the preloaded data, and then write the results back to global memory.
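A minimal sketch of this pattern for an NDRange kernel (the kernel name, tile size, and computation are illustrative): each work-group preloads a block of global data into local memory, synchronizes with a barrier, computes on the local copy, and writes the results back to global memory.

#define TILE 64

__attribute__((reqd_work_group_size(TILE,1,1)))
__kernel void scale_tile (__global const float * restrict in,
                          __global float * restrict out,
                          float factor)
{
  __local float tile[TILE];

  size_t gid = get_global_id(0);
  size_t lid = get_local_id(0);

  // Preload one element per work-item from global memory to local memory.
  tile[lid] = in[gid];
  barrier(CLK_LOCAL_MEM_FENCE);

  // Perform the computation on the preloaded local data.
  float result = tile[lid] * factor;

  // Write the result back to global memory.
  out[gid] = result;
}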
Constant Cache Memory
By default, the constant cache size is 16 kB. You can specify the constant cache size by including the -const-cache-bytes=<N> option in your aoc command, where <N> is the constant cache size in bytes.
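For example, a command along the following lines (the kernel file name is illustrative) doubles the default cache size to 32 kB:

aoc -const-cache-bytes=32768 myKernel.cl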
Unlike global memory accesses that have extra hardware for tolerating long memory latencies, the constant cache suffers large performance penalties for cache misses. If the __constant arguments in your OpenCL™ kernel code cannot fit in the cache, you might achieve better performance with __global const arguments instead. If the host application writes to constant memory that is already loaded into the constant cache, the cached data is discarded (that is, invalidated) from the constant cache.
For more information on the -const-cache-bytes=<N> option, refer to the Configuring Constant Memory Cache Size (-const-cache-bytes=<N>) section of the Intel® FPGA SDK for OpenCL™ Standard Edition Programming Guide.
Preloading Data to Local Memory
The Intel® FPGA SDK for OpenCL™ Offline Compiler implements OpenCL™ local memory in on-chip memory blocks in the FPGA. On-chip memory blocks have two read and write ports, and they can be clocked at an operating frequency that is double the operating frequency of the OpenCL kernels. This doubling of the clock frequency allows the memory to be “double pumped,” resulting in twice the bandwidth from the same memory. As a result, each on-chip memory block supports up to four simultaneous accesses.
Ideally, the accesses to each bank are distributed uniformly across the on-chip memory blocks of the bank. Because only four simultaneous accesses to an on-chip memory block are possible in a single clock cycle, distributing the accesses helps avoid bank contention.
This banking configuration is usually effective; however, the offline compiler must create a complex memory system to accommodate a large number of banks. A large number of banks might complicate the arbitration network and can reduce the overall system performance.
Because the offline compiler implements local memory that resides in on-chip memory blocks in the FPGA, the offline compiler must choose the size of local memory systems at compilation time. The method the offline compiler uses to determine the size of a local memory system depends on the local data types used in your OpenCL code.
Optimizing Local Memory Accesses
To optimize local memory access efficiency, consider the following guidelines:
- Implementing certain optimization techniques, such as loop unrolling, might lead to more concurrent memory accesses.
CAUTION: Increasing the number of memory accesses can complicate the memory systems and degrade performance.
- Simplify the local memory subsystem by limiting the number of unique local memory accesses in your kernel to four or fewer, whenever possible.
You achieve maximum local memory performance when there are four or fewer memory accesses to a local memory system. If the number of accesses to a particular memory system is greater than four, the offline compiler arranges the on-chip memory blocks of the memory system into a banked configuration.
- If you have function scope local data, the offline compiler statically sizes the local data that you define within a function body at compilation time. You should define local memories by directing the offline compiler to set the memory to the required size, rounded up to the closest value that is a power of two.
- For pointers to __local kernel arguments, the host assigns their memory sizes dynamically at runtime through clSetKernelArg calls. However, the offline compiler must set these physical memory sizes at compilation time.
By default, pointers to __local kernel arguments are 16 kB in size. You can specify an allocation size by including the local_mem_size attribute in your pointer declaration; see the example after this list.
For usage information on the local_mem_size attribute, refer to the Specifying Pointer Size in Local Memory section of the Intel® FPGA SDK for OpenCL™ Standard Edition Programming Guide.
Note: clSetKernelArg calls can request a smaller data size than has been physically allocated at compilation time, but never a larger size.
- When accessing local memory, use the simplest address calculations possible and avoid pointer math operations that are not mandatory.
Intel® recommends this coding style to reduce FPGA resource utilization and increase local memory efficiency by allowing the offline compiler to make better guarantees about access patterns through static code analysis. Complex address calculations and pointer math operations can prevent the offline compiler from creating independent memory systems representing different portions of your data, leading to increased area usage and decreased runtime performance.
- Avoid storing pointers to memory whenever possible. Stored pointers often prevent static compiler analysis from determining the data sets accessed, when the pointers are subsequently retrieved from memory. Storing pointers to memory almost always leads to suboptimal area and performance results.
- Create local array elements that are a power of 2 bytes to allow the offline compiler to provide an efficient memory configuration.
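The following hedged sketch illustrates the local_mem_size attribute mentioned in the list above (the 2048-byte size and the kernel name are illustrative; confirm the exact syntax in the Specifying Pointer Size in Local Memory section of the Programming Guide):

// Pointer B receives a 2048-byte physical allocation at compilation time;
// pointer A keeps the default 16 kB allocation.
__kernel void local_pointer_args (__local float * A,
                                  __attribute__((local_mem_size(2048))) __local float * B,
                                  __global float * restrict result)
{
  size_t lid = get_local_id(0);

  A[lid] = (float)lid;
  B[lid] = (float)lid * 2.0f;
  barrier(CLK_LOCAL_MEM_FENCE);

  result[get_global_id(0)] = A[lid] + B[lid];
}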
Storing Variables and Arrays in Private Memory
For more information on the implementation of private memory using registers, refer to the Inferring a Register section of the Intel® FPGA SDK for OpenCL™ Standard Edition Programming Guide.
Improving Kernel Performance by Banking the Local Memory
The following code example depicts an 8 x 4 local memory system that is implemented in a single bank. As a result, no two elements in the system can be accessed in parallel.
local int lmem[8][4];

#pragma unroll
for(int i = 0; i<4; i+=2) {
  lmem[i][x] = …;
}
To improve performance, you can add numbanks(N) and bankwidth(M) in your code to define the number of memory banks and the bank widths in bytes. The following code implements eight memory banks, each 16-bytes wide. This memory bank configuration enables parallel memory accesses down the 8 x 4 array.
local int __attribute__((numbanks(8), bankwidth(16))) lmem[8][4];

#pragma unroll
for (int i = 0; i < 4; i+=2) {
  lmem[i][x & 0x3] = …;
}
To enable parallel access, you must mask the dynamic access on the lower array index. Masking the dynamic access on the lower array index informs the Intel® FPGA SDK for OpenCL™ Offline Compiler that x will not exceed the lower index bounds.
By specifying different values for the numbanks(N) and bankwidth(M) kernel attributes, you can change the parallel access pattern. The following code implements four memory banks, each 4-bytes wide. This memory bank configuration enables parallel memory accesses across the 8 x 4 array.
local int __attribute__((numbanks(4), bankwidth(4))) lmem[8][4];

#pragma unroll
for (int i = 0; i < 4; i+=2) {
  lmem[x][i] = …;
}
Optimizing the Geometric Configuration of Local Memory Banks Based on Array Index
The following code examples illustrate how the bank geometry changes based on the values you assign to numbanks and bankwidth.
local int __attribute__((numbanks(2), bankwidth(16))) lmem[2][4];

local int __attribute__((numbanks(2), bankwidth(8))) lmem[2][4];

local int __attribute__((numbanks(2), bankwidth(4))) lmem[2][4];

local int __attribute__((numbanks(4), bankwidth(8))) lmem[2][4];

local int __attribute__((numbanks(4), bankwidth(4))) lmem[2][4];
Optimizing Accesses to Local Memory by Controlling the Memory Replication Factor
Intel®'s M20K memory blocks have two physical ports. The number of logical ports that are available in each M20K block depends on the degree of pumping. Pumping is a measure of the clock frequency of the M20K blocks relative to the rest of the design.
Consider an example design where the kernel specifies three read ports and one write port for the local memory system, lmem. As shown in the code example below, including the singlepump kernel attribute in the local variable declaration indicates that the M20K blocks will run at the same frequency as the rest of the design.
int __attribute__((memory,
                   numbanks(1),
                   bankwidth(64),
                   singlepump,
                   numreadports(3),
                   numwriteports(1))) lmem[16];
Each single-pumped M20K block has two logical ports available. Each write port in the local memory system must be connected to all the M20K blocks that your design uses to implement the memory system, and each read port must be connected to one M20K block. Because of these connection constraints, three M20K blocks are needed to implement the specified number of ports in lmem.
If you include the doublepump kernel attribute in your local variable declaration, you specify that the M20K memory blocks run at double the frequency of the rest of the design.
int __attribute__((memory,
                   numbanks(1),
                   bankwidth(64),
                   doublepump,
                   numreadports(3),
                   numwriteports(1))) lmem[16];
Each double-pumped M20K block has four logical ports available. As a result, only one M20K block is needed to implement all three read ports and the single write port in lmem. A sketch of a loop with this access pattern appears after the notes below.
- Double pumping the memory increases resource overhead. Use the doublepump kernel attribute only if it results in actual M20K savings, improved performance, or both.
- Stores must be connected to every replicate and must not suffer contention. Hence, if there are more than three stores, the memory is not replicated. Local memory replication works well with a single store.
- Because the entire memory system is replicated, you might observe potentially large M20K memory block usage.
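As a hedged sketch of an access pattern that matches these port counts (the kernel and buffer names are hypothetical, and the initialization loop is included only to make the fragment self-contained), a single work-item loop might perform three loads and one store to lmem in each iteration:

__kernel void three_reads_one_write(__global const int * restrict src,
                                    __global int * restrict dst,
                                    int N)
{
    // lmem uses the same attributes as the doublepump example above: one
    // 64-byte bank with three read ports and one write port.
    int __attribute__((memory,
                       numbanks(1),
                       bankwidth(64),
                       doublepump,
                       numreadports(3),
                       numwriteports(1))) lmem[16];

    // Initialization added only to make the sketch self-contained.
    #pragma unroll
    for (int j = 0; j < 16; j++) {
        lmem[j] = 0;
    }

    for (int i = 0; i < N; i++) {
        // Three loads and one store to lmem per iteration, matching the
        // three read ports and one write port requested above.
        int a = lmem[i & 0xF];
        int b = lmem[(i + 4) & 0xF];
        int c = lmem[(i + 8) & 0xF];
        dst[i] = a + b + c;
        lmem[i & 0xF] = src[i];
    }
}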
Minimizing the Memory Dependencies for Loop Pipelining
Loop dependencies might introduce bottlenecks for single work-item kernels because of the latency associated with memory accesses. The offline compiler defers a memory operation until a dependent memory operation completes, which can increase the loop initiation interval (II). The offline compiler reports memory dependencies in the optimization report.
- Ensure that the offline compiler does not assume false dependencies. When the static memory dependence analysis fails to prove that a dependency does not exist, the offline compiler assumes that a dependency exists and modifies the kernel execution to enforce the dependency. The impact of the dependency enforcement is lower if the memory system is stall-free.
- Write-after-read operations with a data dependency on a load-store unit can take just two clock cycles (II=2). Other stall-free scenarios can take up to seven clock cycles.
- Read-after-write (control dependency) operations can be fully resolved by the offline compiler.
- Override the static memory dependence analysis by adding the line #pragma ivdep before the loop in your kernel code if you are sure that the loop carries no dependencies, as shown in the sketch after this list.
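For example, a hedged sketch of the ivdep usage described in the last item might look like the following; the kernel and buffer names are hypothetical, and the pragma is safe only if the host guarantees that the output indices never collide with the input indices.

__kernel void scatter_add(__global int * restrict A,
                          __global const int * restrict in_idx,
                          __global const int * restrict out_idx,
                          int N)
{
    // The host guarantees that out_idx[i] never equals in_idx[j] for any
    // pair of iterations, so the accesses to A carry no loop dependency and
    // the static dependence analysis can safely be overridden.
    #pragma ivdep
    for (int i = 0; i < N; i++) {
        A[out_idx[i]] = A[in_idx[i]] + 1;
    }
}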
Strategies for Optimizing FPGA Area Usage
Optimizing kernel performance generally requires additional FPGA resources. In contrast, area optimization often results in performance decreases. During kernel optimization, Intel® recommends that you run multiple versions of the kernel on the FPGA board to determine the kernel programming strategy that generates the best size versus performance trade-off.
Compilation Considerations
- To review the estimated resource usage summary on-screen, compile your kernel by including the -report flag in your aoc command. To review kernel-specific area usage information, refer to the <your_kernel_filename>/reports/report.html file.
- If possible, perform floating-point computations by compiling your OpenCL kernel with the -fpc or -fp-relaxed option of the aoc command.
For more usage information on the -report, -fp-relaxed and -fpc options, refer to the Displaying Estimated Resource Usage Summary (-report), Relaxing Order of Floating-Point Operations (-fp-relaxed), and Reducing Floating-Point Operations (-fpc) sections of the Intel® FPGA SDK for OpenCL™ Programming Guide.
For more information on floating-point operations, refer to Optimize Floating-Point Operations.
Board Variant Selection Considerations
Target a board variant in your Custom Platform that provides only the external memory banks that your kernel needs. For example, if your kernel requires one external memory bank, target a board variant that supports only a single external memory bank. Targeting a board with multiple external memory banks increases the area usage of your kernel unnecessarily.
If your Custom Platform does not provide a board variant that meets your needs, consider creating a board variant. Consult the Intel® FPGA SDK for OpenCL™ Custom Platform Toolkit User Guide for more information.
Memory Access Considerations
- Minimize the number of access points to external memory.
If possible, structure your kernel such that it reads its input from one location, processes the data internally, and then writes the output to another location.
- Instead of relying on local or global memory accesses, structure your kernel as a single work-item with shift register inference whenever possible.
- Instead of creating a kernel that writes data to external memory and a kernel that reads data from external memory, implement the Intel® FPGA SDK for OpenCL™ channels extension between the kernels for direct data transfer.
- If your OpenCL application includes many separate constant data accesses, declare the corresponding pointers using __constant instead of __global const. Declaration using __global const creates a private cache for each load or store operation. On the other hand, declaration using __constant creates a single constant cache on the chip only.
  CAUTION: If your kernel targets a Cyclone® V device (for example, Cyclone V SoC), declaring __constant pointer kernel arguments might degrade FPGA performance.
- If your kernel passes a small number of constant arguments, pass them as values instead of pointers to global memory.
For example, instead of passing __constant int * coef and then dereferencing coef with index 0 to 10, pass coef as a value (int16 coef). If coef was the only __constant pointer argument, passing it as a value eliminates the constant cache and the corresponding load and store operations completely.
- Conditionally shifting large shift registers inside pipelined loops leads to the creation of inefficient hardware. For example, the following kernel consumes more resources when the if (K > 5) condition is present:
#define SHIFT_REG_LEN 1024
__kernel void bad_shift_reg (__global int * restrict src,
                             __global int * restrict dst,
                             int K)
{
    float shift_reg[SHIFT_REG_LEN];
    int sum = 0;
    for (unsigned i = 0; i < K; i++)
    {
        sum += shift_reg[0];
        shift_reg[SHIFT_REG_LEN - 1] = src[i];

        // This condition causes severe area bloat.
        if (K > 5)
        {
            #pragma unroll
            for (int m = 0; m < SHIFT_REG_LEN - 1; m++)
            {
                shift_reg[m] = shift_reg[m + 1];
            }
        }
        dst[i] = sum;
    }
}
Attention: Conditionally accessing a shift register does not degrade hardware efficiency; a sketch of this distinction follows. If it is necessary to implement conditional shifting of a large shift register in your kernel, consider modifying your code so that it uses local memory.
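As a hedged sketch of the distinction drawn in the attention note above, the hypothetical good_shift_reg variant below shifts the register unconditionally and applies the loop-invariant condition only to the access, so the large unrolled copy no longer sits under a condition.

#define SHIFT_REG_LEN 1024
__kernel void good_shift_reg (__global int * restrict src,
                              __global int * restrict dst,
                              int K)
{
    float shift_reg[SHIFT_REG_LEN];
    int sum = 0;
    for (unsigned i = 0; i < K; i++)
    {
        // Conditional *access* to the shift register; this does not degrade
        // hardware efficiency.
        if (K > 5)
        {
            sum += shift_reg[0];
        }
        shift_reg[SHIFT_REG_LEN - 1] = src[i];

        // The shift itself is unconditional, so the large unrolled copy is
        // no longer placed under a condition and does not bloat the area.
        #pragma unroll
        for (int m = 0; m < SHIFT_REG_LEN - 1; m++)
        {
            shift_reg[m] = shift_reg[m + 1];
        }
        dst[i] = sum;
    }
}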
Arithmetic Operation Considerations
- Introduce floating-point arithmetic operations only when necessary.
- The Intel® FPGA SDK for OpenCL™ Offline Compiler defaults floating-point constants to the double data type. Add an f designation to the constant to make it a single precision floating-point operation.
For example, the arithmetic operation sin(1.0) represents a double precision floating-point sine function. The arithmetic operation sin(1.0f) represents a single precision floating-point sine function.
- If you do not require a full-precision result for a complex function, compute simpler arithmetic operations to approximate the result. Consider the following example scenarios:
- Instead of computing the function pow(x,n) where n is a small value, approximate the result by performing repeated squaring operations because they require significantly fewer hardware resources and less area (see the sketch after this list).
- Ensure that you compare the original and the approximated area usage because, in some cases, computing a result via approximation might use more area. For example, the sqrt function is not resource-intensive. Other than a rough approximation, replacing the sqrt function with arithmetic operations that the host has to compute at runtime might result in larger area usage.
- If you work with a small set of input values, consider using a LUT instead.
- If your kernel performs a complex arithmetic operation with a constant that the offline compiler computes at compilation time (for example, log(PI/2.0)), perform the arithmetic operation on the host instead and pass the result as an argument to the kernel at runtime.
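The following minimal sketch illustrates the repeated-squaring guideline above for a small, compile-time-constant exponent; the pow8 helper and the scale_by_pow8 kernel are hypothetical names.

// Hypothetical helper: computes x^8 with three multiplications instead of
// calling pow(x, 8.0f), which instantiates far more hardware.
float pow8(float x)
{
    float x2 = x * x;    // x^2
    float x4 = x2 * x2;  // x^4
    return x4 * x4;      // x^8
}

__kernel void scale_by_pow8(__global const float * restrict src,
                            __global float * restrict dst)
{
    int gid = get_global_id(0);
    dst[gid] = pow8(src[gid]);
}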
Data Type Selection Considerations
- Select the most appropriate data type for your application.
For example, do not define your variable as float if the data type short is sufficient.
- Ensure that both sides of an arithmetic expression belong to the same data type.
Consider an example where one side of an arithmetic expression is a floating-point value and the other side is an integer. The mismatched data types cause the Intel® FPGA SDK for OpenCL™ Offline Compiler to create implicit conversion operators, which can become expensive if they are present in large numbers.
- Take advantage of padding if it exists in your data structures.
For example, if you only need the float3 data type, which occupies the same amount of space as float4, you may change the data type to float4 and use the extra component to carry an unrelated value, as shown in the sketch below.
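A minimal sketch of this padding guideline, using hypothetical names: the position data needs only three components, so the fourth float4 component carries an unrelated per-element weight instead of wasted padding.

__kernel void weight_positions(__global const float4 * restrict pos_and_weight,
                               __global float * restrict dst)
{
    int gid = get_global_id(0);
    float4 v = pos_and_weight[gid];

    // v.x, v.y, v.z hold a 3-component position; v.w carries an unrelated
    // weight in the slot that float3 would otherwise leave unused.
    float len = sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    dst[gid] = len * v.w;
}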
Additional Information
Document Revision History for the Intel FPGA SDK for OpenCL Standard Edition Best Practices Guide
Document Version | Intel® Quartus® Prime Version | Changes |
---|---|---|
2018.09.24 | 18.1 | |
2018.05.04 | 18.0 | |

Date | Version | Changes |
---|---|---|
December 2017 | 2017.12.08 | |
November 2017 | 2017.11.06 | |
May 2017 | 2017.05.08 | |
December 2016 | 2016.12.02 | Minor editorial modification. |
October 2016 | 2016.10.31 | |
May 2016 | 2016.05.02 | |
November 2015 | 2015.11.02 | |
May 2015 | 15.0.0 | |
December 2014 | 14.1.0 | |
June 2014 | 14.0.0 | |
December 2013 | 13.1.1 | |
November 2013 | 13.1.0 | |
June 2013 | 13.0 SP1.0 | |
May 2013 | 13.0.1 | |
May 2013 | 13.0.0 | |
November 2012 | 12.1.0 | Initial release. |