Intel FPGA SDK for OpenCL Standard Edition: Programming Guide
Intel FPGA SDK for OpenCL Standard Edition Overview
Intel FPGA SDK for OpenCL Standard Edition Programming Guide Prerequisites
Before using the Intel® FPGA SDK for OpenCL™ or the Intel® FPGA Runtime Environment (RTE) for OpenCL to program your device, familiarize yourself with the respective getting started guides. This document assumes that you have performed the following tasks:
- For developing and deploying OpenCL kernels, download the tar file and run the installers to install the SDK, the Intel® Quartus® Prime software, and device support.
- For deployment of OpenCL kernels, download and install the RTE.
- If you want to use the SDK or the RTE to program an Intel® SoC FPGA, you also have to download and install the Intel® SoC FPGA Embedded Development Suite (EDS).
- Install and set up your FPGA board.
- Verify that board installation is successful, and the board functions correctly.
If you have not performed the tasks described above, refer to the SDK's getting started guides for more information.
Intel FPGA SDK for OpenCL Standard Edition FPGA Programming Flow
The following SDK components work together to program an Intel® FPGA:
- The host application and the host compiler
- The OpenCL kernel(s) and the offline compiler
- The Custom Platform
The Custom Platform provides the board support package. Typically, the board manufacturer develops the Custom Platform that supports a specific OpenCL board. The offline compiler targets the Custom Platform when compiling an OpenCL kernel to generate a hardware programming image. The host then runs the host application, which typically programs the FPGA with the hardware image and then executes kernels on the device.
In a sequential implementation of a program (for example, on a conventional processor), the program counter controls which instructions execute on the hardware, and in what order over time. In a spatial implementation of a program, such as a program implemented within the Intel® FPGA SDK for OpenCL™, instructions execute as soon as their prerequisite data is available. Programs are interpreted as a series of connections representing the data dependencies.
Intel FPGA SDK for OpenCL Offline Compiler Kernel Compilation Flows
An OpenCL kernel source file (.cl) contains your OpenCL kernel source code that runs on the FPGA. The offline compiler groups one or more kernels into a temporary file and then compiles this file to generate the following files and folders:
- A .aoco object file is an intermediate object file that contains information for later stages of the compilation.
- A .aocx image file is the hardware configuration file and contains information necessary to program the FPGA at runtime.
- The work folder or subdirectory, which contains data necessary to create the .aocx file. By default, the name of the work directory is the name of your .cl file. If you compile multiple kernel source files, the name of the work directory is the name of the last .cl file you list in the aoc command line.
The .aocx file contains data that the host application uses to create program objects, a concept within the OpenCL runtime API, for the target FPGA. The host application first loads these program objects into memory. Then the host runtime uses these program objects to program the target FPGA, as required for kernel launch operations by the host program.
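The loading step described above can be sketched on the host side as follows. This is an illustrative C sketch, not code from this guide: the load_aocx helper name is an assumption, and the OpenCL API calls that consume the buffer are shown only in comments.

```c
/* Illustrative sketch: read a .aocx hardware configuration file into memory
 * so the host can hand it to the OpenCL runtime. Error handling is minimal
 * for brevity; load_aocx is a hypothetical helper, not an SDK function. */
#include <stdio.h>
#include <stdlib.h>

/* Read the whole file into a malloc'd buffer; returns NULL on failure. */
unsigned char *load_aocx(const char *path, size_t *size_out) {
    FILE *fp = fopen(path, "rb");
    if (!fp) return NULL;
    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);
    unsigned char *buf = malloc((size_t)size);
    if (buf && fread(buf, 1, (size_t)size, fp) != (size_t)size) {
        free(buf);
        buf = NULL;
    }
    fclose(fp);
    if (buf) *size_out = (size_t)size;
    return buf;
}

/* The buffer is then used to create a program object, for example:
 *   cl_program program = clCreateProgramWithBinary(context, 1, &device,
 *       &size, (const unsigned char **)&buf, &binary_status, &err);
 * followed by clBuildProgram() and clCreateKernel(). */
```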
One-Step Compilation for Simple Kernels
The following figure illustrates the OpenCL kernel design flow that has a single compilation step.
A successful compilation results in the following files and reports:
- A .aoco file
- A .aocx file
- In the <your_kernel_filename>/reports/report.html file, the estimated resource usage summary provides a preliminary assessment of area usage. If you have a single work-item kernel, the optimization report identifies performance bottlenecks.
Multistep Intel FPGA SDK for OpenCL Standard Edition Design Flow
The figure below outlines the stages in the SDK's design flow. The steps in the design flow serve as checkpoints for identifying functional errors and performance bottlenecks. They allow you to modify your OpenCL kernel code without performing a full compilation after each iteration.
The SDK's design flow includes the following steps:
- Emulation
Assess the functionality of your OpenCL kernel by executing it on one or multiple emulation devices on an x86-64 host. For Linux systems, the Emulator offers symbolic debug support. Symbolic debug allows you to locate the origins of functional errors in your kernel code.
- Intermediate compilation
The intermediate compilation step checks for syntactic errors. It then generates a .aoco file without building the hardware configuration file.
- Review HTML Report
Review the <your_kernel_filename>/reports/report.html file of your OpenCL application to determine whether the estimated kernel performance data is acceptable. The HTML report also provides suggestions on how you can modify your kernel to increase performance.
- Profiling
Instruct the Intel® FPGA SDK for OpenCL™ Offline Compiler to instrument performance counters in the Verilog code in the .aocx file. During execution, the performance counters collect performance information which you can then review in the Intel® FPGA Dynamic Profiler for OpenCL™ GUI.
- Full deployment
If you are satisfied with the performance of your OpenCL kernel throughout the design flow, perform a full compilation. You can then execute the .aocx file on the FPGA.
For more information on the HTML report and kernel profiling, refer to the Intel® FPGA SDK for OpenCL™ Best Practices Guide.
Obtaining General Information on Software, Compiler, and Custom Platform
- The Intel® FPGA SDK for OpenCL™ Offline Compiler command options ( aoc <command_option> ) now have single dashes (-) instead of double dashes (--). The double-dash convention was deprecated in the 17.1 release and will be removed in a future release.
- The Intel® FPGA SDK for OpenCL™ Offline Compiler command options now follow the convention <command_option>=<value>, where value can be a comma-separated list of user input values. The use of -option value1 -option value2 was deprecated in the 17.1 release and will be removed in a future release.
Displaying the Software Version (version)
aocl <version>.<build> (Intel(R) FPGA SDK for OpenCL(TM), Version <version> Build <build>, Copyright (C) <year> Intel Corporation)
Displaying the Compiler Version (-version)
Intel(R) FPGA SDK for OpenCL(TM), 64-Bit Offline Compiler Version <version> Build <build> Copyright (C) <year> Intel Corporation
Listing the Intel FPGA SDK for OpenCL Standard Edition Utility Command Options (help)
Displaying Information on an Intel FPGA SDK for OpenCL Utility Command Option (help <command_option>)
aocl install - Installs a board onto your host system. Usage: aocl install Description: This command installs a board's drivers and other necessary software for the host operating system to communicate with the board. For example, this command might install PCIe® drivers.
Listing the Intel FPGA SDK for OpenCL Offline Compiler Command Options (no argument, -help, or -h)
- aoc
- aoc -help
- aoc -h
Listing the Available FPGA Boards in Your Custom Platform (-list-boards)
Board list: <board_name_1> <board_name_2> ...
Where <board_name_N> is the board name you use in your aoc command to target a specific FPGA board.
Displaying the Compilation Environment of an OpenCL Binary (env)
INPUT_ARGS=-march=emulator -v device/vector_add.cl -o bin/vector_add.aocx
BUILD_NUMBER=90
ACL_VERSION=16.1.0
OPERATING_SYSTEM=linux
PLATFORM_TYPE=s5_net
Managing an FPGA Board
You can install multiple Custom Platforms simultaneously on the same system using the SDK utilities, such as aocl diagnose with multiple Custom Platforms. The Custom Platform subdirectory contains the board_env.xml file.
In a system with multiple Custom Platforms, ensure that the host program uses the FPGA Client Driver (FCD) to discover the boards rather than linking to the Custom Platforms' memory-mapped device (MMD) libraries directly. As long as FCD is correctly set up for your Custom Platform, FCD finds all the installed boards at runtime.
Installing an FPGA Board (install)
- Follow your board vendor's instructions to connect the FPGA board to your system.
- Download the Custom Platform for your FPGA board from your board vendor's website. To download an Intel® FPGA SDK for OpenCL Reference Platform, refer to the Intel® FPGA SDK for OpenCL FPGA Platforms page.
- Install the Custom Platform in a directory that you own (that is, not a system directory).
- Set the QUARTUS_ROOTDIR_OVERRIDE user environment variable to point to the Intel® Quartus® Prime software installation directory.
- Add the paths to the Custom Platform libraries (for example, the memory-mapped device (MMD) library) to the PATH (Windows) or LD_LIBRARY_PATH (Linux) environment variable setting.
The Intel® FPGA SDK for OpenCL™ Standard Edition Getting Started Guide contains more information on the init_opencl script. For information on setting user environment variables and running the init_opencl script, refer to the Setting the Intel® FPGA SDK for OpenCL™ Standard Edition User Environment Variables section.
- Invoke the command aocl install <path_to_customplatform> at a command prompt.
Invoking aocl install <path_to_customplatform> also installs a board driver that allows communication between host applications and hardware kernel programs. Remember: You need administrative rights to install a board. To run a Windows command prompt as an administrator, click Start > All Programs > Accessories, right-click Command Prompt, and then click Run as Administrator.
- To query a list of FPGA devices installed in your machine, invoke the aocl diagnose command.
The software generates an output that includes the <device_name>, which is an acl number that ranges from acl0 to acl31. Attention: For more information on querying the <device_name> of your accelerator board, refer to the Querying the Device Name of Your FPGA Board section.
- To verify the successful installation of the FPGA board, invoke the command aocl diagnose <device_name> to run any board vendor-recommended diagnostic test.
Uninstalling the FPGA Board (uninstall)
To uninstall your FPGA board, perform the following tasks:
- Disconnect the board from your machine by following the instructions provided by your board vendor.
- Invoke the aocl uninstall <path_to_customplatform> utility command to remove the current host computer drivers (for example, PCIe® drivers). The Intel® FPGA SDK for OpenCL™ uses these drivers to communicate with the FPGA board.
- Uninstall the Custom Platform.
- Unset the LD_LIBRARY_PATH (for Linux) or PATH (for Windows) environment variable.
Querying the Device Name of Your FPGA Board (diagnose)
aocl diagnose: Running diagnostic from INTELFPGAOCLSDKROOT/board/<board_name>/<platform>/libexec
Verified that the kernel mode driver is installed on the host machine.
Using board package from vendor: <board_vendor_name>
Querying information for all supported devices that are installed on the host machine ...

device_name   Status   Information
acl0          Passed   <descriptive_board_name>
                       PCIe dev_id = <device_ID>, bus:slot.func = 02:00.00, at Gen 2 with 8 lanes.
                       FPGA temperature = 43.0 degrees C.
acl1          Passed   <descriptive_board_name>
                       PCIe dev_id = <device_ID>, bus:slot.func = 03:00.00, at Gen 2 with 8 lanes.
                       FPGA temperature = 35.0 degrees C.

Found 2 active device(s) installed on the host machine, to perform a full diagnostic on a specific device, please run aocl diagnose <device_name>

DIAGNOSTIC_PASSED
Running a Board Diagnostic Test (diagnose <device_name>)
Programming the FPGA Offline or without a Host (program <device_name>)
<device_name> refers to the acl number (for example, acl0 to acl31) that corresponds to your FPGA device, and
<your_kernel_filename>.aocx is the executable file you use to program the hardware.
Programming the Flash Memory (flash <device_name>)
<device_name> refers to the acl number (for example, acl0 to acl31) that corresponds to your FPGA device, and
<your_kernel_filename>.aocx is the executable file you use to program the hardware.
Structuring Your OpenCL Kernel
Guidelines for Naming the Kernel
- Begin a file name with an alphanumeric character.
If the file name of your OpenCL™ application begins with a nonalphanumeric character, compilation fails with the following error message:
Error: Quartus compilation FAILED See quartus_sh_compile.log for the output log.
- Do not differentiate file names using nonalphanumeric characters.
The Intel® FPGA SDK for OpenCL™ Offline Compiler translates any nonalphanumeric character into an underscore ("_"). If you differentiate two file names by ending them with different nonalphanumeric characters only (for example, myKernel#.cl and myKernel&.cl), the offline compiler translates both file names to <your_kernel_filename>_.cl (for example, myKernel_.cl).
- For Windows systems, ensure that the combined length of the kernel file name and its file path does not exceed 260 characters.
64-bit Windows 7 and Windows 8.1 have a 260-character limit on the length of a file path. If the combined length of the kernel file name and its file path exceeds 260 characters, the offline compiler generates the following error message:
The filename or extension is too long. The system cannot find the path specified.
In addition to the compiler error message, the following error message appears in the <your_kernel_filename>/quartus_sh_compile.log file:
Error: Can’t copy <file_type> files: Can’t open <your_kernel_filename> for write: No such file or directory
For Windows 10, you can remove the 260-character limit. For more information, see your Windows 10 documentation.
- Do not name your .cl OpenCL kernel source file "kernel", "Verilog", or "VHDL", as these are reserved keywords.
Naming the source file kernel.cl, Verilog.cl, or VHDL.cl causes the offline compiler to generate intermediate design files that have the same names as certain internal files, which leads to a compilation error.
Programming Strategies for Optimizing Data Processing Efficiency
Unrolling a Loop
Loop unrolling involves replicating a loop body multiple times, and reducing the trip count of a loop. Unroll loops to reduce or eliminate loop control overhead on the FPGA. In cases where there are no loop-carried dependencies and the offline compiler can perform loop iterations in parallel, unrolling loops can also reduce latency and overhead on the FPGA.
The Intel® FPGA SDK for OpenCL™ Offline Compiler might unroll simple loops even if they are not annotated by a pragma.
- Provide an unroll factor whenever possible. To specify an unroll factor N, insert the #pragma unroll <N> directive before a loop in your kernel code.
The offline compiler attempts to unroll the loop at most <N> times. Consider the code fragment below. By assigning a value of 2 as the unroll factor, you direct the offline compiler to unroll the loop twice.
#pragma unroll 2
for (size_t k = 0; k < 4; k++) {
    mac += data_in[(gid * 4) + k] * coeff[k];
}
- To unroll a loop fully, you may omit the unroll factor by simply inserting the #pragma unroll directive before a loop in your kernel code.
The offline compiler attempts to unroll the loop fully if it understands the trip count. The offline compiler issues a warning if it cannot execute the unroll request.
- To prevent a loop from unrolling, specify an unroll factor of 1 (that is, #pragma unroll 1).
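The transformation that #pragma unroll 2 requests can be modeled in plain C. The sketch below is illustrative only (the function names and the assumption of an even trip count are not from this guide): the loop body is replicated twice and the trip count is halved, which is what reduces loop-control overhead on the FPGA.

```c
/* Illustrative C model of unrolling a multiply-accumulate loop by a factor
 * of 2. Both functions compute the same result; the unrolled form has half
 * the loop-control iterations. Assumes n is even. */
float mac_rolled(const float *data, const float *coeff, int n) {
    float mac = 0.0f;
    for (int k = 0; k < n; k++) {
        mac += data[k] * coeff[k];
    }
    return mac;
}

float mac_unrolled_by_2(const float *data, const float *coeff, int n) {
    float mac = 0.0f;
    for (int k = 0; k < n; k += 2) {       /* trip count halved */
        mac += data[k]     * coeff[k];     /* replicated body, copy 1 */
        mac += data[k + 1] * coeff[k + 1]; /* replicated body, copy 2 */
    }
    return mac;
}
```

Because the two bodies accumulate in the same order, the results are identical; on the FPGA, the replicated bodies can additionally execute in parallel when no loop-carried dependencies exist.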
Coalescing Nested Loops
Coalescing nested loops also reduces the latency of the component, which could further reduce your kernel area usage. However, in some cases, coalescing loops might lengthen the critical loop initiation interval path, so coalescing loops might not be suitable for all kernels.
For NDRange kernels, the compiler automatically attempts to coalesce loops even if they are not annotated by the loop_coalesce pragma. Coalescing loops in NDRange kernels improves throughput as well as reducing kernel area usage. You can use the loop_coalesce pragma to prevent the automatic coalescing of loops in NDRange kernels.
#pragma loop_coalesce <loop_nesting_level>
The <loop_nesting_level> parameter is optional and is an integer that specifies how many nested loop levels you want the compiler to attempt to coalesce. If you do not specify the <loop_nesting_level> parameter, the compiler attempts to coalesce all of the nested loops.
for (A)
  for (B)
    for (C)
      for (D)
    for (E)
- Loop (A) has a loop nesting level of 1.
- Loop (B) has a loop nesting level of 2.
- Loop (C) has a loop nesting level of 3.
- Loop (D) has a loop nesting level of 4.
- Loop (E) has a loop nesting level of 3.
- If you specify #pragma loop_coalesce 1 on loop (A), the compiler does not attempt to coalesce any of the nested loops.
- If you specify #pragma loop_coalesce 2 on loop (A), the compiler attempts to coalesce loops (A) and (B).
- If you specify #pragma loop_coalesce 3 on loop (A), the compiler attempts to coalesce loops (A), (B), (C), and (E).
- If you specify #pragma loop_coalesce 4 on loop (A), the compiler attempts to coalesce all of the loops [loop (A) - loop (E)].
Example
The following simple example shows how the compiler coalesces two loops into a single loop.
#pragma loop_coalesce
for (int i = 0; i < N; i++)
  for (int j = 0; j < M; j++)
    sum[i][j] += i+j;
int i = 0;
int j = 0;
while (i < N) {
  sum[i][j] += i+j;
  j++;
  if (j == M) {
    j = 0;
    i++;
  }
}
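The equivalence of the two forms above can be checked directly in C. The sketch below is illustrative (the fixed N and M values are stand-ins): both versions visit the same (i, j) pairs in the same order and produce identical results.

```c
/* Illustrative check that the coalesced while loop is behaviorally
 * equivalent to the original nested loops from the coalescing example. */
#define N 3
#define M 4

void nested(int sum[N][M]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            sum[i][j] += i + j;
}

void coalesced(int sum[N][M]) {
    int i = 0, j = 0;
    while (i < N) {          /* single loop: one set of control logic */
        sum[i][j] += i + j;
        j++;
        if (j == M) {        /* inner counter wraps, outer advances */
            j = 0;
            i++;
        }
    }
}
```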
Specifying a Loop Initiation Interval (II)
The ii pragma applies to single work-item kernels (that is, single-threaded kernels) in which loops are pipelined. Refer to the Single Work-Item Kernel versus NDRange Kernel section of the Intel® FPGA SDK for OpenCL™ Best Practices Guide for information on loop pipelining, and on kernel properties that drive the offline compiler's decision on whether to treat a kernel as single-threaded.
The higher the II value, the longer the wait before the subsequent loop iteration starts executing. Refer to the Reviewing Your Kernel's report.html File section of the Intel® FPGA SDK for OpenCL™ Best Practices Guide for information on II, and on the compiler reports that provide you with details on the performance implications of II on a specific loop.
For some loops in your kernel, specifying a higher II value with the ii pragma than the value the compiler chooses by default can increase the maximum operating frequency (fmax) of your kernel without a decrease in throughput, provided that the following conditions hold:
- The loop is pipelined because the kernel is single-threaded.
- The loop is not critical to the throughput of your kernel.
- The running time of the loop is small compared to that of the other loops in the kernel.
#pragma ii <desired_initiation_interval>
The <desired_initiation_interval> parameter is required and is an integer that specifies the number of clock cycles to wait between the beginning of execution of successive loop iterations.
Example
Consider a case where your kernel has two distinct, pipelineable loops: a short-running initialization loop that has a loop-carried dependence and a long-running loop that does the bulk of your processing. In this case, the compiler does not know that the initialization loop has a much smaller impact on the overall throughput of your design. If possible, the compiler attempts to pipeline both loops with an II of 1.
Because the initialization loop has a loop-carried dependence, it has a feedback path in the generated hardware. To achieve an II of 1 with such a feedback path, some clock frequency might have to be sacrificed. If the feedback path in the main loop is less restrictive, the rest of your design could otherwise run at a higher operating frequency.
If you specify #pragma ii 2 on the initialization loop, you tell the compiler that it can be less aggressive in optimizing II for this loop. Less aggressive optimization allows the compiler to pipeline the path limiting the fmax and could allow your overall kernel design to achieve a higher fmax.
The initialization loop takes longer to run with its new II. However, the decrease in the running time of the long-running loop due to higher fmax compensates for the increased length in running time of the initialization loop.
Loop Concurrency (max_concurrency Pragma)
The max_concurrency pragma applies to single work-item kernels (that is, single-threaded kernels) in which loops are pipelined. Refer to the Single Work-Item Kernel versus NDRange Kernel section of the Intel® FPGA SDK for OpenCL™ Standard Edition Best Practices Guide for information on loop pipelining, and on kernel properties that drive the offline compiler's decision on whether to treat a kernel as single-threaded.
The max_concurrency pragma enables you to control the on-chip memory resources required to implement your loop. To achieve simultaneous execution of loop iterations, the offline compiler must create independent copies of any memory that is private to a single iteration. The greater the permitted concurrency, the more copies the compiler must make.
The kernel's HTML report (report.html) provides the following information pertaining to loop concurrency:
- Maximum concurrency that the offline compiler has chosen
This information is available in the Loop Analysis report. A message in the Details pane reports that the maximum number of simultaneous executions has been limited to N.
- Impact to memory usage
This information is available in the Area Analysis report. A message in the Details pane reports that the offline compiler has created N independent copies of the memory to enable simultaneous execution of N loop iterations.
If you want to exchange some performance for physical memory savings, apply #pragma max_concurrency <N> to the loop, as shown below. When you apply this pragma, the offline compiler limits the number of simultaneously-executed loop iterations to N. The number of independent copies of loop memories is also reduced to N.
#pragma max_concurrency 1
for (int i = 0; i < N; i++) {
  int arr[M];
  // Doing work on arr
}
Specifying Work-Group Sizes
- If your kernel contains a barrier, the offline compiler sets a default maximum scalarized work-group size of 256 work-items.
- If your kernel does not query any OpenCL intrinsics that allow different threads to behave differently (that is, local or global thread IDs, or work-group ID), the offline compiler infers a single-threaded execution mode and sets the maximum work-group size to (1,1,1). In this case, the OpenCL runtime also enforces a global enqueue size of (1,1,1), and loop pipelining optimizations are enabled within the offline compiler.
To specify the work-group size, modify your kernel code in the following manner:
- To specify the maximum number of work-items that the offline compiler provisions for a work-group in a kernel, insert the max_work_group_size(X, Y, Z) attribute in your kernel source code. For example:
__attribute__((max_work_group_size(512,1,1)))
__kernel void sum (__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
  size_t gid = get_global_id(0);
  answer[gid] = a[gid] + b[gid];
}
- To specify the required number of work-items that the offline compiler provisions for a work-group in a kernel, insert the reqd_work_group_size(X, Y, Z) attribute in your kernel source code. For example:
__attribute__((reqd_work_group_size(64,1,1)))
__kernel void sum (__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
  size_t gid = get_global_id(0);
  answer[gid] = a[gid] + b[gid];
}
Specifying Number of Compute Units
__attribute__((num_compute_units(2)))
__kernel void test(__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
  size_t gid = get_global_id(0);
  answer[gid] = a[gid] + b[gid];
}
Specifying Number of SIMD Work-Items
__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(64,1,1)))
__kernel void test(__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
  size_t gid = get_global_id(0);
  answer[gid] = a[gid] + b[gid];
}
Programming Strategies for Optimizing Pointer-to-Local Memory Size
__kernel void myLocalMemoryPointer(
    __local float * A,
    __attribute__((local_mem_size(1024))) __local float * B,
    __attribute__((local_mem_size(32768))) __local float * C)
{
  //statements
}
In the myLocalMemoryPointer kernel, 16 kB of local memory (default) is allocated to pointer A, 1 kB is allocated to pointer B, and 32 kB is allocated to pointer C.
Implementing the Intel FPGA SDK for OpenCL Standard Edition Channels Extension
Overview of the Intel FPGA SDK for OpenCL Standard Edition Channels Extension
Implementation of channels decouples data movement between concurrently executing kernels from the host processor.
Channel Data Behavior
Data in channels does not persist between context, program, device, kernel, or platform releases, even if the OpenCL implementation performs optimizations that avoid reprogramming operations on a device. For example, if you run a host program twice using the same .aocx file, or if a host program releases and reacquires a context, the data in the channel might or might not persist across the operation. FPGA device reset operations might happen behind the scenes on object releases that purge data in any channels.
Consider the following code example:
channel int c0;

__kernel void producer() {
  for (int i = 0; i < 10; i++) {
    write_channel_intel(c0, i);
  }
}

__kernel void consumer (__global uint * restrict dst) {
  for (int i = 0; i < 5; i++) {
    dst[i] = read_channel_intel(c0);
  }
}
The kernel producer writes ten elements ([0, 9]) to the channel. The kernel consumer does not contain any work-item identifier queries; therefore, it will receive an implicit reqd_work_group_size attribute of (1,1,1). The implied reqd_work_group_size(1,1,1) attribute means that consumer must be launched as a single work-item kernel. In the example above, consumer reads five elements from the channel per invocation. During the first invocation, the kernel consumer reads values 0 to 4 from the channel. Because the data persists across NDRange invocations, the second time you execute the kernel consumer, it reads values 5 to 9.
For this example, to avoid a deadlock from occurring, you need to invoke the kernel consumer twice for every invocation of the kernel producer. If you call consumer less than twice, producer stalls because the channel becomes full. If you call consumer more than twice, consumer stalls because there is insufficient data in the channel.
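The producer/consumer balance described above can be simulated on the host with a plain C FIFO. This is an illustrative model, not SDK code: the channel depth, the channel_t type, and the helper names are assumptions, and blocking behavior is modeled with assertions rather than stalls.

```c
/* Illustrative host-side model of the channel example: producer writes ten
 * values, each consumer invocation drains five, so two consumer invocations
 * balance one producer invocation. */
#include <assert.h>

#define CHANNEL_DEPTH 16  /* assumed depth; real channel depth is a build choice */

typedef struct {
    int buf[CHANNEL_DEPTH];
    int head, tail, count;
} channel_t;

void chan_write(channel_t *c, int v) {
    assert(c->count < CHANNEL_DEPTH);  /* a full channel would stall producer */
    c->buf[c->tail] = v;
    c->tail = (c->tail + 1) % CHANNEL_DEPTH;
    c->count++;
}

int chan_read(channel_t *c) {
    assert(c->count > 0);  /* an empty channel would stall consumer */
    int v = c->buf[c->head];
    c->head = (c->head + 1) % CHANNEL_DEPTH;
    c->count--;
    return v;
}

void producer(channel_t *c) {
    for (int i = 0; i < 10; i++) chan_write(c, i);
}

void consumer(channel_t *c, int *dst) {
    for (int i = 0; i < 5; i++) dst[i] = chan_read(c);
}
```

Invoking consumer twice per producer invocation leaves the channel empty, mirroring the deadlock-free invocation ratio described above.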
Multiple Work-Item Ordering for Channels
Multiple work-item accesses to a channel can be useful in some scenarios. For example, they are useful when data words in the channel are independent, or when the channel is implemented for control logic. The main concern regarding multiple work-item accesses to a channel is the order in which the kernel writes data to and reads data from the channel. If possible, the SDK's channels extension processes work-item read and write operations to the channel in a deterministic order. As such, the read and write operations remain consistent across kernel invocations.
Requirements for Deterministic Multiple Work-Item Ordering
To guarantee deterministic ordering, the SDK checks that a channel access is work-item invariant based on the following characteristics:
- All paths through the kernel must execute the channel access.
- If the first requirement is not satisfied, none of the branch conditions that determine whether the channel call executes may depend on the work-item ID.
- The kernel is not inferred as a single work-item kernel.
If the SDK cannot guarantee deterministic ordering of multiple work-item accesses to a channel, it warns you that the channels might not have well-defined ordering and therefore might exhibit nondeterministic execution. Primarily, the SDK fails to provide deterministic ordering if you have work-item-variant code on loop executions with channel calls, as illustrated below:
__kernel void ordering (__global int * restrict check,
                        __global int * restrict data) {
  int condition = check[get_global_id(0)];
  if (condition) {
    for (int i = 0; i < N; i++) {
      process(data);
      write_channel_intel(req, data[i]);
    }
  } else {
    process(data);
  }
}
Work-Item Serial Execution of Channels
When you implement channels in a kernel, the Intel® FPGA SDK for OpenCL™ Offline Compiler enforces that kernel behavior is equivalent to having at most one work-group in flight within the compute unit at a time. The compiler also ensures that the kernel executes channels in work-item serial execution, where the kernel executes work-items with smaller IDs first. A work-item has the identifier (x, y, z, group), where x, y, z are the local 3D identifiers, and group is the work-group identifier.
The work-item ID (x0, y0, z0, group0) is considered to be smaller than the ID (x1, y1, z1, group1) if one of the following conditions is true:
- group0 < group1
- group0 = group1 and z0 < z1
- group0 = group1 and z0 = z1 and y0 < y1
- group0 = group1 and z0 = z1 and y0 = y1 and x0 < x1
Work-items with incremental IDs execute in a sequential order. For example, the work-item with an ID (x0, y0, z0, group0) executes the write channel call first. Then, the work-item with an ID (x1, y0, z0, group0) executes the call, and so on. Defining this order ensures that the system is verifiable with external models.
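The ordering rule in the bullets above is a lexicographic comparison of (group, z, y, x). The following C sketch makes that explicit; the struct and function names are illustrative, not SDK definitions.

```c
/* Illustrative comparator for the work-item ordering rule: ID a is smaller
 * than ID b if the tuple (group, z, y, x) of a is lexicographically smaller
 * than that of b. */
typedef struct {
    int x, y, z;  /* local 3D identifiers */
    int group;    /* work-group identifier */
} work_item_id;

int id_less_than(work_item_id a, work_item_id b) {
    if (a.group != b.group) return a.group < b.group;  /* group dominates */
    if (a.z != b.z) return a.z < b.z;
    if (a.y != b.y) return a.y < b.y;
    return a.x < b.x;                                  /* x compared last */
}
```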
Channel Execution in Loop with Multiple Work-Items
When channels exist in the body of a loop with multiple work-items, as shown below, each loop iteration executes prior to subsequent iterations. This implies that loop iteration 0 of each work-item in a work-group executes before iteration 1 of each work-item in a work-group, and so on.
__kernel void ordering (__global int * data, int X) {
  int n = 0;
  while (n < X) {
    write_channel_intel(req, data[get_global_id(0)]);
    n++;
  }
}
Restrictions in the Implementation of Intel FPGA SDK for OpenCL Standard Edition Channels Extension
Multiple Channel Call Site
__kernel void k1() {
  read_channel_intel(channel1);
  read_channel_intel(channel1);
  read_channel_intel(channel1);
}
__kernel void k1() {
  write_channel_intel(channel1, 1);
}

__kernel void k2() {
  write_channel_intel(channel1, 2);
}
Feedback and Feed-Forward Channels
Channels within a kernel can be either read_only or write_only. Performance of a kernel that reads and writes to the same channel might be poor.
Static Indexing
The Intel® FPGA SDK for OpenCL™ channels extension does not support dynamic indexing into arrays of channel IDs because it leads to inefficient hardware.
Consider the following example:
channel int ch[WORKGROUP_SIZE];

__kernel void consumer() {
  int gid = get_global_id(0);
  int value = read_channel_intel(ch[gid]);
  //statements
}
Compilation of this example kernel fails with the following error message:
Compiler Error: Indexing into channel array ch could not be resolved to all constant
To avoid this compilation error, index into arrays of channel IDs statically, as shown below:
channel int ch[WORKGROUP_SIZE];

__kernel void consumer() {
  int gid = get_global_id(0);
  int value;
  switch (gid) {
    case 0: value = read_channel_intel(ch[0]); break;
    case 1: value = read_channel_intel(ch[1]); break;
    case 2: value = read_channel_intel(ch[2]); break;
    case 3: value = read_channel_intel(ch[3]); break;
    //statements
    case WORKGROUP_SIZE-1: value = read_channel_intel(ch[WORKGROUP_SIZE-1]); break;
  }
  //statements
}
Kernel Vectorization Support
You cannot vectorize kernels that use channels; that is, do not include the num_simd_work_items kernel attribute in your kernel code. Vectorizing a kernel that uses channels creates multiple channel accesses inside the same kernel and requires arbitration, which negates the advantages of vectorization. As a result, the SDK's channel extension does not support kernel vectorization.
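For illustration only, a kernel such as the following combines the num_simd_work_items attribute with a channel read, and is therefore not supported. The channel and kernel names are hypothetical, and num_simd_work_items requires the reqd_work_group_size attribute:

```
channel int c0;

//Not supported: vectorization would replicate the channel access
//and require arbitration.
__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void vectorized_reader(__global int * restrict dst) {
    dst[get_global_id(0)] = read_channel_intel(c0);
}
```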
Instruction-Level Parallelism on read_channel_intel and write_channel_intel Calls
If no data dependencies exist between read_channel_intel and write_channel_intel calls, the offline compiler attempts to execute these instructions in parallel. As a result, the offline compiler might execute these read_channel_intel and write_channel_intel calls in an order that does not follow the sequence expressed in the OpenCL kernel code.
Consider the following code sequence:
in_data1 = read_channel_intel(channel1);
in_data2 = read_channel_intel(channel2);
in_data3 = read_channel_intel(channel3);
Because there are no data dependencies between the read_channel_intel calls, the offline compiler can execute them in any order.
Enabling the Intel FPGA SDK for OpenCL Standard Edition Channels for OpenCL Kernel
To enable the channel extension, use the following pragma:
#pragma OPENCL EXTENSION cl_intel_channels : enable

Channel declarations are unique within a given OpenCL kernel program. Also, channel instances are unique for every OpenCL kernel program-device pair. If the runtime loads a single OpenCL kernel program onto multiple devices, each device has a single copy of each channel. However, these channel copies are independent and do not share data across the devices.
Declaring the Channel Handle
To read from and write to a channel, the kernel must pass the channel variable to each of the corresponding API calls.
- Declare the channel handle as a file scope variable in the kernel source code using the following convention: channel <type> <variable_name>

  For example: channel int c;

- The Intel® FPGA SDK for OpenCL™ channels extension supports simultaneous channel accesses by multiple variables declared in a data structure. Declare a struct data structure for a channel in the following manner:

  typedef struct type_ {
      int a;
      int b;
  } type_t;

  channel type_t foo;
Implementing Blocking Channel Writes
The write_channel_intel function call has the following convention:

write_channel_intel(channel_id, data);

Where:
channel_id identifies the buffer to which the channel connects, and it must match the channel_id of the corresponding read channel (read_channel_intel).
data is the data that the channel write operation writes to the channel.
<type> defines a channel data width. Follow the OpenCL™ conversion rules to ensure that data the kernel writes to a channel is convertible to <type>.
//Defines chan, a kernel file-scope channel variable.
channel long chan;

/*Defines the kernel, which reads eight bytes (size of long) from
  global memory and passes this data to the channel.*/
__kernel void kernel_write_channel( __global const long * src ) {
    for (int i = 0; i < N; i++) {
        //Writes the eight bytes to the channel.
        write_channel_intel(chan, src[i]);
    }
}
Implementing Nonblocking Channel Writes
Consider a scenario where your application has one data producer with two identical workers. Assume the time each worker takes to process a message varies depending on the contents of the data. In this case, there might be situations where one worker is busy while the other is free. A nonblocking write can facilitate work distribution such that both workers are busy.
channel long worker0, worker1;

__kernel void producer( __global const long * src ) {
    for (int i = 0; i < N; i++) {
        bool success = false;
        do {
            success = write_channel_nb_intel(worker0, src[i]);
            if (!success) {
                success = write_channel_nb_intel(worker1, src[i]);
            }
        } while (!success);
    }
}
Implementing Blocking Channel Reads
The read_channel_intel function call has the following convention:

read_channel_intel(channel_id);

Where:
channel_id identifies the buffer to which the channel connects, and it must match the channel_id of the corresponding write channel (write_channel_intel).
<type> defines a channel data width. Ensure that the variable the kernel assigns to read the channel data is convertible from <type>.
//Defines chan, a kernel file-scope channel variable.
channel long chan;

/*Defines the kernel, which reads eight bytes (size of long) from
  the channel and writes them back to global memory.*/
__kernel void kernel_read_channel (__global long * dst) {
    for (int i = 0; i < N; i++) {
        //Reads the eight bytes from the channel.
        dst[i] = read_channel_intel(chan);
    }
}
Implementing Nonblocking Channel Reads
On a successful read (valid set to true), the value read from the channel is returned by the read_channel_nb_intel function. On a failed read (valid set to false), the return value of the read_channel_nb_intel function is not defined.
channel long chan;

__kernel void kernel_read_channel (__global long * dst) {
    int i = 0;
    while (i < N) {
        bool valid0, valid1;
        long data0 = read_channel_nb_intel(chan, &valid0);
        long data1 = read_channel_nb_intel(chan, &valid1);
        if (valid0) {
            process(data0);
            i++; //advance only when a read succeeds
        }
        if (valid1) {
            process(data1);
            i++;
        }
    }
}
Implementing I/O Channels Using the io Channels Attribute
The io("chan_id") attribute specifies the I/O feature of an accelerator board with which a channel will be connected, where chan_id is the name of the I/O interface listed in the board_spec.xml file of your Custom Platform.
Because peripheral interface usage might differ for each device type, consult your board vendor's documentation when you implement I/O channels in your kernel program. Your OpenCL™ kernel code must be compatible with the type of data generated by the peripheral interfaces.
- Implicit data dependencies might exist for channels that connect to the board directly and communicate with peripheral devices via I/O channels. These implicit data dependencies might lead to unexpected behavior because the Intel® FPGA SDK for OpenCL™ Offline Compiler does not have visibility into these dependencies.
- External I/O channels communicating with the same peripherals do not obey any sequential ordering. Ensure that the external device does not require sequential ordering because unexpected behavior might occur.
- Consult the board_spec.xml file in your Custom Platform to identify the input and output features available on your FPGA board. For example, a board_spec.xml file might include the following information on I/O features:

  <channels>
      <interface name="udp_0" port="udp0_out" type="streamsource" width="256" chan_id="eth0_in"/>
      <interface name="udp_0" port="udp0_in" type="streamsink" width="256" chan_id="eth0_out"/>
      <interface name="udp_0" port="udp1_out" type="streamsource" width="256" chan_id="eth1_in"/>
      <interface name="udp_0" port="udp1_in" type="streamsink" width="256" chan_id="eth1_out"/>
  </channels>
The width attribute of an interface element specifies the width, in bits, of the data type used by that channel. In the example above, each channel is 256 bits wide, which corresponds to a ulong4 data type (4 × 64 bits). For comparison, the uint and float data types are 32 bits wide; larger or vectorized data types must match the bit width specified in the board_spec.xml file.
- Implement the io channel attribute as demonstrated in the following code example. The io channel attribute names must match those of the I/O channels (chan_id) specified in the board_spec.xml file.

  channel QUDPWord udp_in_IO __attribute__((depth(0))) __attribute__((io("eth0_in")));
  channel QUDPWord udp_out_IO __attribute__((depth(0))) __attribute__((io("eth0_out")));
  //Internal channels that connect the I/O kernels to the rest of the design.
  //This example assumes QUDPWord is a 256-bit type (such as ulong4) defined
  //by the Custom Platform.
  channel ulong4 udp_in;
  channel ulong4 udp_out;

  __kernel void io_in_kernel (__global ulong4 *mem_read, uchar read_from, int size) {
      int index = 0;
      ulong4 data;
      int half_size = size >> 1;
      while (index < half_size) {
          if (read_from & 0x01) {
              data = read_channel_intel(udp_in_IO);
          } else {
              data = mem_read[index];
          }
          write_channel_intel(udp_in, data);
          index++;
      }
  }

  __kernel void io_out_kernel (__global ulong2 *mem_write, uchar write_to, int size) {
      int index = 0;
      int half_size = size >> 1;
      while (index < half_size) {
          ulong4 data = read_channel_intel(udp_out);
          if (write_to & 0x01) {
              write_channel_intel(udp_out_IO, data);
          } else {
              //only write data portion
              ulong2 udp_data;
              udp_data.s0 = data.s0;
              udp_data.s1 = data.s1;
              mem_write[index] = udp_data;
          }
          index++;
      }
  }
Attention: Declare a unique io("chan_id") handle for each I/O channel specified in the channels eXtensible Markup Language (XML) element within the board_spec.xml file.
Emulating I/O Channels
When you emulate a kernel that has a channel declared with the io attribute, I/O channel input is emulated by reading from a file, and channel output is emulated by writing to a file.
channel uint chanA __attribute__((io("myIOChannel")));

An I/O channel can be either read from or written to, but not both. To emulate both input and output on the same I/O interface, declare a separate channel for each direction with the same io attribute:

channel uint readChannel __attribute__((io("myIOChannel")));
channel uint writeChannel __attribute__((io("myIOChannel")));
Emulating Reading from an I/O Channel
- Non-blocking read: If the file does not exist or there is insufficient data, the read attempt returns with a failure message.
- Blocking read: If the file does not exist or there is insufficient data, the read attempt blocks your program until the file is created on the disk, or the file contains sufficient data.
Emulating Writing to an I/O Channel
- Non-blocking write: If the write attempt fails, an error is returned.
- Blocking write: If the write attempt fails, further write attempts are made.
Use Models of Intel FPGA SDK for OpenCL Standard Edition Channels Implementation
The following use models provide an overview on how to exploit concurrent execution safely and efficiently.
Feed-Forward Design Model
Implement the feed-forward design model to send data from one kernel to the next without creating any cycles between them. Consider the following code example:
__kernel void producer (__global const uint * src, const uint iterations) {
    for (int i = 0; i < iterations; i++) {
        write_channel_intel(c0, src[2*i]);
        write_channel_intel(c1, src[2*i+1]);
    }
}

__kernel void consumer (__global uint * dst, const uint iterations) {
    for (int i = 0; i < iterations; i++) {
        dst[2*i] = read_channel_intel(c0);
        dst[2*i+1] = read_channel_intel(c1);
    }
}
The producer kernel writes data to channels c0 and c1. The consumer kernel reads data from c0 and c1. The figure below illustrates the feed-forward data flow between the two kernels:
Buffer Management
In the feed-forward design model, data traverses between the producer and consumer kernels one word at a time. To facilitate the transfer of large data messages consisting of several words, you can implement a ping-pong buffer, a common double-buffering design pattern in communication applications. The figure below illustrates the interactions between kernels and a ping-pong buffer:
The manager kernel manages circular buffer allocation and deallocation between the producer and consumer kernels. After the consumer kernel processes data, the manager receives memory regions that the consumer frees up and sends them to the producer for reuse. The manager also sends to the producer kernel the initial set of free locations, or tokens, to which the producer can write data.
The following figure illustrates the sequence of events that take place during buffer management:
- The manager kernel sends a set of tokens to the producer kernel to indicate initially which regions in memory are free for producer to use.
- After manager allocates the memory region, producer writes data to that region of the ping-pong buffer.
- After producer completes the write operation, it sends a synchronization token to the consumer kernel to indicate which memory region contains data for processing. The consumer kernel then reads data from that region of the ping-pong buffer.
Note: When consumer is performing the read operation, producer can write to other free memory locations for processing because of the concurrent execution of the producer, consumer, and manager kernels.
- After consumer completes the read operation, it releases the memory region and sends a token back to the manager kernel. The manager kernel then recycles that region for producer to use.
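The manager's role in steps 1 and 4 can be sketched as a simple token-recycling kernel. This is an illustrative sketch, not SDK-provided code: the done channel, NUM_BUFFERS, and the 256-word region size are assumptions (the region size matches the producer example in the next section):

```
#define NUM_BUFFERS 4

channel int req;   //manager -> producer: offsets of free regions
channel int done;  //consumer -> manager: offsets of released regions

__kernel void manager() {
    //Step 1: hand the producer the initial set of free regions (tokens).
    for (int i = 0; i < NUM_BUFFERS; i++) {
        write_channel_intel(req, i * 256);
    }
    //Step 4: as the consumer releases regions, recycle them for reuse.
    while (1) {
        int offset = read_channel_intel(done);
        write_channel_intel(req, offset);
    }
}
```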
Implementation of Buffer Management for OpenCL Kernels
To ensure that the SDK implements buffer management properly, the ordering of channel read and write operations is important. Consider the following kernel example:
__kernel void producer (__global const uint * restrict src,
                        __global volatile uint * restrict shared_mem,
                        const uint iterations) {
    int base_offset;

    for (uint gID = 0; gID < iterations; gID++) {
        // Assume each block of memory is 256 words
        uint lID = 0x0ff & gID;

        if (lID == 0) {
            base_offset = read_channel_intel(req);
        }

        shared_mem[base_offset + lID] = src[gID];

        // Make sure all memory operations are committed before
        // sending token to the consumer
        mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_CHANNEL_MEM_FENCE);

        if (lID == 255) {
            write_channel_intel(c, base_offset);
        }
    }
}
In this kernel, because the following lines of code are independent, the Intel® FPGA SDK for OpenCL™ Offline Compiler can schedule them to execute concurrently:
shared_mem[base_offset + lID] = src[gID];
and
write_channel_intel(c, base_offset);
Because writing base_offset to a channel can be much faster than writing data to global memory, the channel write might complete before the store to shared_mem does. The consumer kernel might then read base_offset from the channel and use it as an index to read from global memory. Without synchronization, consumer might read data from producer before shared_mem[base_offset + lID] = src[gID]; finishes executing. As a result, consumer reads invalid data. To avoid this scenario, the synchronization token must occur after the producer kernel commits data to memory. In other words, the consumer kernel cannot consume data from the producer kernel until producer successfully stores its data in global memory.
To preserve this ordering, include an OpenCL mem_fence call in your kernels. The mem_fence construct takes two flags: CLK_GLOBAL_MEM_FENCE and CLK_CHANNEL_MEM_FENCE. The mem_fence call effectively creates a control flow dependence between operations that occur before and after it. The CLK_GLOBAL_MEM_FENCE flag indicates that global memory operations must obey the control flow. The CLK_CHANNEL_MEM_FENCE flag indicates that channel operations must obey the control flow. As a result, the write_channel_intel call in the example cannot start until the global memory operation is committed to the shared memory buffer.
Implementing Buffered Channels Using the depth Channels Attribute
You may use a buffered channel to control data traffic, such as limiting throughput or synchronizing accesses to shared memory. In an unbuffered channel, a write operation cannot proceed until the read operation reads a data value. In a buffered channel, a write operation cannot proceed until the data value is copied to the buffer. If the buffer is full, the operation cannot proceed until the read operation reads a piece of data and removes it from the channel.
channel int c __attribute__((depth(10)));

__kernel void producer (__global int * in_data) {
    for (int i = 0; i < N; i++) {
        if (in_data[i]) {
            write_channel_intel(c, in_data[i]);
        }
    }
}

__kernel void consumer (__global int * restrict check_data,
                        __global int * restrict out_data) {
    int last_val = 0;

    for (int i = 0; i < N; i++) {
        if (check_data[i]) {
            last_val = read_channel_intel(c);
        }
        out_data[i] = last_val;
    }
}
In this example, the write operation can write ten data values to the channel without blocking. Once the channel is full, the write operation cannot proceed until an associated read operation to the channel occurs.
Because the channel read and write calls are conditional statements, the channel might experience an imbalance between read and write calls. You may add a buffer capacity to the channel to ensure that the producer and consumer kernels are decoupled. This step is particularly important if the producer kernel is writing data to the channel when the consumer kernel is not reading from it.
Enforcing the Order of Channel Calls
When the Intel® FPGA SDK for OpenCL™ Offline Compiler generates a compute unit, it does not always create instruction-level parallelism on all instructions that are independent of each other. As a result, channel read and write operations might not execute independently of each other even if there is no control or data dependence between them. When channel calls interact with each other, or when channels write data to external devices, deadlocks might occur.
For example, the code snippet below consists of a producer kernel and a consumer kernel. Channels c0 and c1 are unbuffered channels. The schedule of the channel read operations from c0 and c1 might occur in the reverse order of the channel write operations to c0 and c1. That is, the producer kernel writes to c0 first, but the consumer kernel might read from c1 first. This rescheduling of channel calls might cause a deadlock because the consumer kernel is reading from an empty channel.
__kernel void producer (__global const uint * src, const uint iterations) {
    for (int i = 0; i < iterations; i++) {
        write_channel_intel(c0, src[2*i]);
        write_channel_intel(c1, src[2*i+1]);
    }
}

__kernel void consumer (__global uint * dst, const uint iterations) {
    for (int i = 0; i < iterations; i++) {
        /*During compilation, the AOC might reorder the way the consumer
          kernel writes to memory to optimize memory access. Therefore,
          c1 might be read before c0, which is the reverse of what
          appears in code.*/
        dst[2*i+1] = read_channel_intel(c0);
        dst[2*i] = read_channel_intel(c1);
    }
}
To enforce the order of the channel calls, insert mem_fence calls with the CLK_CHANNEL_MEM_FENCE flag between them:

channel uint c0 __attribute__((depth(0)));
channel uint c1 __attribute__((depth(0)));

__kernel void producer (__global const uint * src, const uint iterations) {
    for (int i = 0; i < iterations; i++) {
        write_channel_intel(c0, src[2*i]);
        mem_fence(CLK_CHANNEL_MEM_FENCE);
        write_channel_intel(c1, src[2*i+1]);
    }
}

__kernel void consumer (__global uint * dst, const uint iterations) {
    for (int i = 0; i < iterations; i++) {
        dst[2*i+1] = read_channel_intel(c0);
        mem_fence(CLK_CHANNEL_MEM_FENCE);
        dst[2*i] = read_channel_intel(c1);
    }
}
In this example, mem_fence in the producer kernel ensures that the channel write operation to c0 occurs before that to c1. Similarly, mem_fence in the consumer kernel ensures that the channel read operation from c0 occurs before that from c1.
Defining Memory Consistency Across Kernels When Using Channels
__kernel void producer( __global const uint * src, const uint iterations ) {
    for (int i = 0; i < iterations; i++) {
        write_channel_intel(c0, src[2*i]);
        mem_fence(CLK_CHANNEL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
        write_channel_intel(c1, src[2*i+1]);
    }
}
In this kernel, the mem_fence function ensures that the write operation to c0 and memory access to src[2*i] occur before the write operation to c1 and memory access to src[2*i+1]. This allows data written to c0 to be visible to the read channel before data is written to c1.
Implementing OpenCL Pipes
Implement pipes if it is important that your OpenCL kernel is compatible with other SDKs.
Refer to the OpenCL Specification version 2.0 for OpenCL C programming language specification and general information about pipes.
The Intel® FPGA SDK for OpenCL™ implementation of pipes does not encompass the entire pipes specification. As such, it is not fully conformant to the OpenCL Specification version 2.0. The goal of the SDK's pipes implementation is to provide a solution that works seamlessly on a different OpenCL 2.0-conformant device. To enable pipes for Intel® FPGA products, your design must satisfy certain additional requirements.
Overview of the OpenCL Pipe Functions
Implementation of pipes decouples kernel execution from the host processor. The foundation of the Intel® FPGA SDK for OpenCL™ pipes support is the SDK's channels extension. However, the syntax for pipe functions differs from the channels syntax.
For more information on blocking and nonblocking functions, refer to the corresponding documentation on channels.
Pipe Data Behavior
Consider the following code example:
__kernel void producer (write_only pipe uint __attribute__((blocking)) c0) {
    for (uint i = 0; i < 10; i++) {
        write_pipe (c0, &i);
    }
}

__kernel void consumer (__global uint * restrict dst,
                        read_only pipe uint __attribute__((blocking))
                        __attribute__((depth(10))) c0) {
    for (int i = 0; i < 5; i++) {
        read_pipe (c0, &dst[i]);
    }
}
A read operation to a pipe reads the least recent piece of data written to the pipe first. Pipe data maintains a FIFO ordering within the pipe.
The kernel producer writes ten elements ([0, 9]) to the pipe. The kernel consumer reads five elements from the pipe per NDRange invocation. During the first invocation, the kernel consumer reads values 0 to 4 from the pipe. Because the data persists across NDRange invocations, the second time you execute the kernel consumer, it reads values 5 to 9.
For this example, to avoid a deadlock from occurring, you need to invoke the kernel consumer twice for every invocation of the kernel producer. If you call consumer less than twice, producer stalls because the pipe becomes full. If you call consumer more than twice, consumer stalls because there is insufficient data in the pipe.
Multiple Work-Item Ordering for Pipes
Multiple work-item accesses to a pipe can be useful in some scenarios. For example, they are useful when data words in the pipe are independent, or when the pipe is implemented for control logic. The main concern regarding multiple work-item accesses to a pipe is the order in which the kernel writes data to and reads data from the pipe. Where possible, OpenCL pipes process work-item read and write operations to a pipe in a deterministic order. As such, the read and write operations remain consistent across kernel invocations.
Requirements for Deterministic Multiple Work-Item Ordering
To guarantee deterministic ordering, the SDK checks that the pipe call is work-item invariant based on the following characteristics:
- All paths through the kernel must execute the pipe call.
- If the first requirement is not satisfied, none of the branch conditions that reach the pipe call should execute in a work-item-dependent manner.
If the SDK cannot guarantee deterministic ordering of multiple work-item accesses to a pipe, it warns you that the pipes might not have well-defined ordering with nondeterministic execution. Primarily, the SDK fails to provide deterministic ordering if you have work-item-variant code on loop executions with pipe calls, as illustrated below:
__kernel void ordering (__global int * check, __global int * data,
                        write_only pipe int __attribute__((blocking)) req) {
    int condition = check[get_global_id(0)];

    if (condition) {
        for (int i = 0; i < N; i++) {
            process(data);
            write_pipe (req, &data[i]);
        }
    } else {
        process(data);
    }
}
Work-Item Serial Execution of Pipes
When you implement pipes in a kernel, the Intel® FPGA SDK for OpenCL™ Offline Compiler enforces that kernel behavior is equivalent to having at most one work-group in flight. The offline compiler also ensures that the kernel executes pipes in work-item serial execution, where the kernel executes work-items with smaller IDs first. A work-item has the identifier (x, y, z, group), where x, y, z are the local 3D identifiers, and group is the work-group identifier.
The work-item ID (x0, y0, z0, group0) is considered to be smaller than the ID (x1, y1, z1, group1) if one of the following conditions is true:
- group0 < group1
- group0 = group1 and z0 < z1
- group0 = group1 and z0 = z1 and y0 < y1
- group0 = group1 and z0 = z1 and y0 = y1 and x0 < x1
Work-items with incremental IDs execute in a sequential order. For example, the work-item with an ID (x0, y0, z0, group0) executes the pipe write call first. Then, the work-item with an ID (x1, y0, z0, group0) executes the call, and so on. Defining this order ensures that the system is verifiable with external models.
Pipe Execution in Loop with Multiple Work-Items
When pipes exist in the body of a loop with multiple work-items, as shown below, each loop iteration executes prior to subsequent iterations. This implies that loop iteration 0 of each work-item in a work-group executes before iteration 1 of each work-item in a work-group, and so on.
__kernel void ordering (__global int * data,
                        write_only pipe int __attribute__((blocking)) req,
                        int X) {
    int n = 0;
    while (n < X) {
        write_pipe (req, &data[get_global_id(0)]);
        n++;
    }
}
Restrictions in OpenCL Pipes Implementation
Default Behavior
By default, pipes exhibit nonblocking behavior.
Emulation Support
The Intel® FPGA SDK for OpenCL™ Emulator supports emulation of kernels that contain pipes. The level of Emulator support aligns with the subset of OpenCL pipes support that is implemented for the FPGA hardware.
Pipes API Support
Currently, the SDK's implementation of pipes does not support all the built-in pipe functions in the OpenCL Specification version 2.0. For a list of supported and unsupported pipe APIs, refer to OpenCL 2.0 C Programming Language Restrictions for Pipes.
Single Call Site
Because the pipe read and write operations do not function deterministically, for a given kernel, you can only assign one call site per pipe ID. For example, the Intel® FPGA SDK for OpenCL™ Offline Compiler cannot compile the following code example:
read_pipe(pipe1, &in_data1);
read_pipe(pipe2, &in_data2);
read_pipe(pipe1, &in_data3);
The second read_pipe call to pipe1 causes compilation failure because it creates a second call site to pipe1.
To gather multiple data from a given pipe, divide the pipe into multiple pipes, as shown below:
read_pipe(pipe1, &in_data1);
read_pipe(pipe2, &in_data2);
read_pipe(pipe3, &in_data3);
Because you can only assign a single call site per pipe ID, you cannot unroll loops containing pipes. Consider the following code:
#pragma unroll 4
for (int i = 0; i < 4; i++) {
    read_pipe (pipe1, &in_data1);
}
The offline compiler issues the following warning message during compilation:
Compiler Warning: Unroll is required but the loop cannot be unrolled.
Feedback and Feed-Forward Pipes
Pipes within a kernel can be either read_only or write_only. Performance of a kernel that reads and writes to the same pipe might be poor.
Kernel Vectorization Support
You cannot vectorize kernels that use pipes; that is, do not include the num_simd_work_items kernel attribute in your kernel code. Vectorizing a kernel that uses pipes creates multiple pipe masters and requires arbitration, which OpenCL pipes specification does not support.
Instruction-Level Parallelism on read_pipe and write_pipe Calls
If no data dependencies exist between read_pipe and write_pipe calls, the offline compiler attempts to execute these instructions in parallel. As a result, the offline compiler might execute these read_pipe and write_pipe calls in an order that does not follow the sequence expressed in the OpenCL kernel code.
Consider the following code sequence:
in_data1 = read_pipe(pipe1);
in_data2 = read_pipe(pipe2);
in_data3 = read_pipe(pipe3);
Because there are no data dependencies between the read_pipe calls, the offline compiler can execute them in any order.
Enabling OpenCL Pipes for Kernels
Pipe declarations are unique within a given OpenCL kernel program. Also, pipe instances are unique for every OpenCL kernel program-device pair. If the runtime loads a single OpenCL kernel program onto multiple devices, each device will have a single copy of each pipe. However, these pipe copies are independent and do not share data across the devices.
Ensuring Compatibility with Other OpenCL SDKs
Host Code Modification
Below is an example of a modified host application:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "CL/opencl.h"

#define SIZE 1000

const char *kernel_source =
    "__kernel void pipe_writer(__global int *in,\n"
    "                          write_only pipe int p_in)\n"
    "{\n"
    "    int gid = get_global_id(0);\n"
    "    write_pipe(p_in, &in[gid]);\n"
    "}\n"
    "__kernel void pipe_reader(__global int *out,\n"
    "                          read_only pipe int p_out)\n"
    "{\n"
    "    int gid = get_global_id(0);\n"
    "    read_pipe(p_out, &out[gid]);\n"
    "}\n";

int main() {
    int *input = (int *)malloc(sizeof(int) * SIZE);
    int *output = (int *)malloc(sizeof(int) * SIZE);
    memset(output, 0, sizeof(int) * SIZE);
    for (int i = 0; i != SIZE; ++i) {
        input[i] = rand();
    }

    cl_int status;

    cl_platform_id platform;
    cl_uint num_platforms;
    status = clGetPlatformIDs(1, &platform, &num_platforms);

    cl_device_id device;
    cl_uint num_devices;
    status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1,
                            &device, &num_devices);

    cl_context context = clCreateContext(0, 1, &device, NULL, NULL, &status);

    cl_command_queue queue = clCreateCommandQueue(context, device, 0, &status);

    size_t len = strlen(kernel_source);
    cl_program program = clCreateProgramWithSource(
        context, 1, (const char **)&kernel_source, &len, &status);

    status = clBuildProgram(program, num_devices, &device, "", NULL, NULL);

    cl_kernel pipe_writer = clCreateKernel(program, "pipe_writer", &status);
    cl_kernel pipe_reader = clCreateKernel(program, "pipe_reader", &status);

    cl_mem in_buffer = clCreateBuffer(context,
        CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
        sizeof(int) * SIZE, input, &status);
    cl_mem out_buffer = clCreateBuffer(context,
        CL_MEM_WRITE_ONLY, sizeof(int) * SIZE, NULL, &status);

    cl_mem pipe = clCreatePipe(context, 0, sizeof(cl_int), SIZE,
                               NULL, &status);

    status = clSetKernelArg(pipe_writer, 0, sizeof(cl_mem), &in_buffer);
    status = clSetKernelArg(pipe_writer, 1, sizeof(cl_mem), &pipe);
    status = clSetKernelArg(pipe_reader, 0, sizeof(cl_mem), &out_buffer);
    status = clSetKernelArg(pipe_reader, 1, sizeof(cl_mem), &pipe);

    size_t size = SIZE;
    cl_event sync;
    status = clEnqueueNDRangeKernel(queue, pipe_writer, 1, NULL,
                                    &size, &size, 0, NULL, &sync);
    status = clEnqueueNDRangeKernel(queue, pipe_reader, 1, NULL,
                                    &size, &size, 1, &sync, NULL);
    status = clFinish(queue);

    status = clEnqueueReadBuffer(queue, out_buffer, CL_TRUE, 0,
                                 sizeof(int) * SIZE, output, 0, NULL, NULL);

    int golden = 0, result = 0;
    for (int i = 0; i != SIZE; ++i) {
        golden += input[i];
        result += output[i];
    }

    int ret = 0;
    if (golden != result) {
        printf("FAILED!");
        ret = 1;
    } else {
        printf("PASSED!");
    }
    printf("\n");
    return ret;
}
Kernel Code Modification
If your kernel code runs on OpenCL SDKs that conform to the OpenCL Specification version 2.0, you must modify it before running it on the Intel® FPGA SDK for OpenCL™. Perform the following modifications:
- Rename the pipe arguments so that they are the same in both kernels. For example, rename p_in and p_out to p.
- Specify the depth attribute for the pipe arguments. Assign a depth attribute value equal to the maximum number of packets that the pipe is created to hold in the host application.
- Execute the kernel program in the offline compilation mode because the Intel® FPGA SDK for OpenCL™ has an offline compiler.
The modified kernel code appears as follows:
#define SIZE 1000

__kernel void pipe_writer(__global int *in,
                          write_only pipe int __attribute__((depth(SIZE))) p) {
    int gid = get_global_id(0);
    write_pipe(p, &in[gid]);
}

__kernel void pipe_reader(__global int *out,
                          read_only pipe int __attribute__((depth(SIZE))) p) {
    int gid = get_global_id(0);
    read_pipe(p, &out[gid]);
}
Declaring the Pipe Handle
To read from and write to a pipe, the kernel must pass the pipe variable to each of the corresponding API calls.
The <type> of the pipe may be any OpenCL™ built-in scalar or vector data type with a scalar size of 1024 bits or less. It may also be any user-defined type that is composed of scalar or vector data types with a scalar size of 1024 bits or less.
Consider the following pipe handle declarations:
__kernel void first (pipe int c)
__kernel void second (write_only pipe int c)
The first example declares a read-only pipe handle of type int in the kernel first. The second example declares a write-only pipe in the kernel second. The kernel first may only read from pipe c, and the kernel second may only write to pipe c.
In an Intel® OpenCL system, only one kernel may read from a pipe. Similarly, only one kernel may write to a pipe. If a non-I/O pipe does not have at least one corresponding read operation and one write operation, the offline compiler issues an error.
For more information on the Intel® FPGA SDK for OpenCL™ I/O pipe implementation, refer to Implementing I/O Pipes Using the io Attribute.
Implementing Pipe Writes
Intel® only supports the convenience version of the write_pipe function. By default, write_pipe calls are nonblocking. Pipe write operations are successful only if there is capacity in the pipe to hold the incoming packet.
The write_pipe function signature is as follows:

int write_pipe (write_only pipe <type> pipe_id, <type> *data);

Where:
pipe_id identifies the buffer to which the pipe connects, and it must match the pipe_id of the corresponding read pipe (read_pipe).
data is the data that the pipe write operation writes to the pipe. It is a pointer to the packet type of the pipe. Note that writing to the pipe might lead to a global or local memory load, depending on the source address space of the data pointer.
<type> defines a pipe data width. The return value indicates whether the pipe write operation is successful. If successful, the return value is 0. If pipe write is unsuccessful, the return value is -1.
/*Declares the writable nonblocking pipe, p, which contains packets of type int*/
__kernel void kernel_write_pipe (__global const int *src,
                                 write_only pipe int p)
{
    for (int i = 0; i < N; i++) {
        //Performs the actual writing
        //Emulates blocking behavior via the use of a while loop
        while (write_pipe(p, &src[i]) < 0) {
        }
    }
}
The while loop is unnecessary if you specify the blocking attribute. To facilitate better hardware implementations, Intel® provides a facility for blocking write_pipe calls: specify the blocking attribute (that is, __attribute__((blocking))) on the pipe argument declaration of the kernel. Blocking write_pipe calls always return success.
Implementing Pipe Reads
Intel® only supports the convenience version of the read_pipe function. By default, read_pipe calls are nonblocking.
The read_pipe function signature is as follows:

int read_pipe (read_only pipe <type> pipe_id, <type> *data);

Where:
pipe_id identifies the buffer to which the pipe connects, and it must match the pipe_id of the corresponding pipe write operation (write_pipe).
data is the data that the pipe read operation reads from the pipe. It is a pointer to the location of the data. Note that the read_pipe call might lead to a global or local memory store, depending on the destination address space of the data pointer.
<type> defines the packet size of the data.
/*Declares the read_only pipe that contains packets of type long.*/
/*Declares that read_pipe calls within the kernel will exhibit blocking behavior*/
__kernel void kernel_read_pipe(__global long *dst,
                               read_only pipe long __attribute__((blocking)) p)
{
    for (int i = 0; i < N; i++) {
        /*Reads a long from the pipe and stores it into global memory
          at the specified location*/
        read_pipe(p, &dst[i]);
    }
}
To facilitate better hardware implementations, Intel® provides a facility for blocking read_pipe calls: specify the blocking attribute (that is, __attribute__((blocking))) on the pipe argument declaration of the kernel. Blocking read_pipe calls always return success.
Implementing Buffered Pipes Using the depth Attribute
You may use a buffered pipe to control data traffic, such as limiting throughput or synchronizing accesses to shared memory. In an unbuffered pipe, a write operation can proceed only when the read operation expects to read data. Use unbuffered pipes in conjunction with blocking read and write behaviors in kernels that execute concurrently; they provide efficient, self-synchronizing data transfers.
In a buffered pipe, a write operation can only proceed if there is capacity in the pipe to hold the incoming packet. A read operation can only proceed if there is at least one packet in the pipe.
Use buffered pipes if pipe calls are predicated differently in the writer and reader kernels, and the kernels do not execute concurrently.
__kernel void producer (__global int *in_data,
                        write_only pipe int __attribute__((blocking))
                                            __attribute__((depth(10))) c)
{
    for (int i = 0; i < N; i++) {
        if (in_data[i]) {
            write_pipe( c, &in_data[i] );
        }
    }
}

__kernel void consumer (__global int *check_data,
                        __global int *out_data,
                        read_only pipe int __attribute__((blocking)) c )
{
    int last_val = 0;

    for (int i = 0; i < N; i++) {
        if (check_data[i]) {
            read_pipe( c, &last_val );
        }
        out_data[i] = last_val;
    }
}
In this example, the write operation can write ten data values to the pipe successfully. After the pipe is full, the blocking write stalls until the consumer kernel reads some of the data in the pipe.
Because the pipe read and write calls are conditional statements, the pipe might experience an imbalance between read and write calls. You may add a buffer capacity to the pipe to ensure that the producer and consumer kernels are decoupled. This step is particularly important if the producer kernel is writing data to the pipe when the consumer kernel is not reading from it.
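The capacity behavior described above can be modeled with an ordinary ring buffer. The following plain-C sketch is a conceptual model only, not SDK API (the names model_pipe, model_write_pipe, and model_read_pipe are invented for illustration): a nonblocking write returns 0 while the pipe holds fewer than depth packets and -1 once it is full, mirroring the write_pipe return convention.

```c
#include <assert.h>

#define DEPTH 10

/* Conceptual model of a buffered pipe with depth 10 (not the SDK API). */
typedef struct {
    int data[DEPTH];
    int count;      /* packets currently buffered */
    int head;       /* index of the next packet to read */
} model_pipe;

/* Mirrors nonblocking write_pipe: 0 on success, -1 when the pipe is full. */
static int model_write_pipe(model_pipe *p, int value) {
    if (p->count == DEPTH)
        return -1;
    p->data[(p->head + p->count) % DEPTH] = value;
    p->count++;
    return 0;
}

/* Mirrors nonblocking read_pipe: 0 on success, -1 when the pipe is empty. */
static int model_read_pipe(model_pipe *p, int *value) {
    if (p->count == 0)
        return -1;
    *value = p->data[p->head];
    p->head = (p->head + 1) % DEPTH;
    p->count--;
    return 0;
}
```

In this model, ten writes succeed, the eleventh fails, and one read frees capacity for another write, which is exactly the decoupling a depth-10 pipe provides.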
Implementing I/O Pipes Using the io Attribute
In the Intel® FPGA SDK for OpenCL™ channels extension, the io("chan_id") attribute specifies the I/O feature of an accelerator board with which a channel interfaces. The chan_id argument is the name of the I/O interface listed in the board_spec.xml file of your Custom Platform. The same I/O features can be used to identify I/O pipes.
Because peripheral interface usage might differ for each device type, consult your board vendor's documentation when you implement I/O pipes in your kernel program. Your OpenCL™ kernel code must be compatible with the type of data generated by the peripheral interfaces. If there is a difference in the byte ordering between the external I/O pipes and the kernel, the Intel® FPGA SDK for OpenCL™ Offline Compiler converts the byte ordering seamlessly upon entry and exit.
- Implicit data dependencies might exist for pipes that connect to the board directly and communicate with peripheral devices via I/O pipes. These implicit data dependencies might lead to compilation issues because the offline compiler cannot identify these dependencies.
- External I/O pipes communicating with the same peripherals do not obey any sequential ordering. Ensure that the external device does not require sequential ordering because unexpected behavior might occur.
- Consult the board_spec.xml file in your Custom Platform to identify the input and output features available on your FPGA board.
For example, a board_spec.xml file might include the following information on I/O features:
<channels> <interface name="udp_0" port="udp0_out" type="streamsource" width="256" chan_id="eth0_in"/> <interface name="udp_0" port="udp0_in" type="streamsink" width="256" chan_id="eth0_out"/> <interface name="udp_0" port="udp1_out" type="streamsource" width="256" chan_id="eth1_in"/> <interface name="udp_0" port="udp1_in" type="streamsink" width="256" chan_id="eth1_out"/> </channels>
The width attribute of an interface element specifies the width, in bits, of the data type used by that pipe. For example, a 32-bit wide interface matches the uint and float data types. Bigger or vectorized data types must match the appropriate bit width specified in the board_spec.xml file.
- Implement the io attribute as demonstrated in the following code example. The io attribute names must match those of the I/O channels (chan_id) specified in the board_spec.xml file.
__kernel void test (pipe uint pkt __attribute__((io("enet"))),
                    pipe float data __attribute__((io("pcie"))));
Attention: Declare a unique io("chan_id") handle for each I/O pipe specified in the channels XML element within the board_spec.xml file.
Enforcing the Order of Pipe Calls
When the Intel® FPGA SDK for OpenCL™ Offline Compiler generates a compute unit, it does not always create instruction-level parallelism on all instructions that are independent of each other. As a result, pipe read and write operations might not execute independently of each other even if there is no control or data dependence between them. When pipe calls interact with each other, or when pipes write data to external devices, deadlocks might occur.

For example, the code snippet below consists of a producer kernel and a consumer kernel. Pipes c0 and c1 are unbuffered pipes. The pipe read operations from c0 and c1 might be scheduled in the reverse order of the pipe write operations to c0 and c1. That is, the producer kernel writes to c0 first, but the consumer kernel might read from c1 first. This rescheduling of pipe calls might cause a deadlock because the consumer kernel is reading from an empty pipe.
__kernel void producer (__global const uint * restrict src,
                        const uint iterations,
                        write_only pipe uint __attribute__((blocking)) c0,
                        write_only pipe uint __attribute__((blocking)) c1)
{
    for (int i = 0; i < iterations; i++) {
        write_pipe (c0, &src[2*i  ]);
        write_pipe (c1, &src[2*i+1]);
    }
}

__kernel void consumer (__global uint * restrict dst,
                        const uint iterations,
                        read_only pipe uint __attribute__((blocking)) c0,
                        read_only pipe uint __attribute__((blocking)) c1)
{
    for (int i = 0; i < iterations; i++) {
        read_pipe (c0, &dst[2*i+1]);
        read_pipe (c1, &dst[2*i  ]);
    }
}
To enforce the order of the pipe calls, insert a memory fence (mem_fence) between them, as shown below:

__kernel void producer (__global const uint * src,
                        const uint iterations,
                        write_only pipe uint __attribute__((blocking)) c0,
                        write_only pipe uint __attribute__((blocking)) c1)
{
    for (int i = 0; i < iterations; i++) {
        write_pipe(c0, &src[2*i  ]);
        mem_fence(CLK_CHANNEL_MEM_FENCE);
        write_pipe(c1, &src[2*i+1]);
    }
}

__kernel void consumer (__global uint * dst,
                        const uint iterations,
                        read_only pipe uint __attribute__((blocking)) c0,
                        read_only pipe uint __attribute__((blocking)) c1)
{
    for (int i = 0; i < iterations; i++) {
        read_pipe(c0, &dst[2*i  ]);
        mem_fence(CLK_CHANNEL_MEM_FENCE);
        read_pipe(c1, &dst[2*i+1]);
    }
}
In this example, mem_fence in the producer kernel ensures that the pipe write operation to c0 occurs before that to c1. Similarly, mem_fence in the consumer kernel ensures that the pipe read operation from c0 occurs before that from c1.
Defining Memory Consistency Across Kernels When Using Pipes
__kernel void producer (__global const uint * restrict src,
                        const uint iterations,
                        write_only pipe uint __attribute__((blocking)) c0,
                        write_only pipe uint __attribute__((blocking)) c1)
{
    for (int i = 0; i < iterations; i++) {
        write_pipe(c0, &src[2*i]);
        mem_fence(CLK_CHANNEL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
        write_pipe(c1, &src[2*i+1]);
    }
}
In this kernel, the mem_fence function ensures that the write operation to c0 and memory access to src[2*i] occur before the write operation to c1 and memory access to src[2*i+1]. This allows data written to c0 to be visible to the read pipe before data is written to c1.
Direct Communication with Kernels via Host Pipes
The extension legalizes two new values in the flags argument of clCreatePipe to make a pipe host accessible, and adds four new API functions (clReadPipeIntelFPGA, clWritePipeIntelFPGA, clMapHostPipeIntelFPGA, clUnmapHostPipeIntelFPGA) to allow the host to read from and write to a pipe that was created with host access enabled. A new optional kernel argument attribute is added to specify in the kernel language that the opposing end of a pipe kernel argument will be the host program, and consequently that the pipe will not be connected to another kernel. A pipe kernel argument is specialized in the kernel definition to connect to either a host pipe or another kernel, and cannot dynamically switch between the two at runtime.
When a pipe kernel argument is marked for host accessibility, the kernel language pipe accessors are restricted to a subset of the OpenCL 2.x functions (reservations are not supported), and no memory consistency or visibility guarantees are made beyond OpenCL synchronization points.
Optional intel_host_accessible Kernel Argument Attribute
__attribute__((intel_host_accessible))
API Functions for Interacting with cl_mem Pipe Objects Bound to Host-Accessible Pipe Kernel Arguments
- clReadPipeIntelFPGA and clWritePipeIntelFPGA functions operate on single words of the pipe’s width.
- clMapHostPipeIntelFPGA function is an advanced mechanism to reduce latency and overhead when performing many word reads or writes on a host pipe.
- clUnmapHostPipeIntelFPGA function allows the host program to signal to the OpenCL runtime that it has written to or read from either a portion of or the entire mapped region that was created through a previous clMapHostPipeIntelFPGA function call.
Function | Description |
---|---|
cl_int clReadPipeIntelFPGA (cl_mem pipe, gentype *ptr); |
Reads a data packet from a pipe that was created with the CL_MEM_HOST_READ_ONLY flag. Each clReadPipeIntelFPGA function call reads one packet from the pipe. The operation is non-blocking; it does not wait until data is available in the pipe to successfully read before returning. |
cl_int clWritePipeIntelFPGA (cl_mem pipe, gentype *ptr); |
Writes a data packet to a pipe that was created with the CL_MEM_HOST_WRITE_ONLY flag. Each clWritePipeIntelFPGA function call writes one packet to the pipe. The operation is non-blocking; it does not wait until there is capacity in the pipe to successfully write before returning. A return status of CL_SUCCESS does not imply that data is available to the kernel for reading. The data will eventually be available for reading by the kernel, assuming that any previously mapped buffers on the host pipe are unmapped. |
void * clMapHostPipeIntelFPGA (cl_mem pipe, cl_map_flags map_flags, size_t requested_size, size_t * mapped_size, cl_int * errcode_ret); |
Returns a void * in the host address space. The host can write data to this address space if the pipe was created with the CL_MEM_HOST_WRITE_ONLY flag, and can read data from it if the pipe was created with the CL_MEM_HOST_READ_ONLY flag. The mapped_size argument returns the number of bytes that the host can access in the mapped region, as determined by the runtime. The value returned in mapped_size is less than or equal to the requested_size value that the caller specifies. After writing to or reading from the returned void *, the host must execute one or more clUnmapHostPipeIntelFPGA function calls to signal to the runtime that data is ready for transfer to the device (on a write), and that the runtime can reclaim the memory for reuse (on a read or write). If the clMapHostPipeIntelFPGA function is called before the clUnmapHostPipeIntelFPGA function unmaps all memory mapped by a previous clMapHostPipeIntelFPGA function call, the buffer returned by the second clMapHostPipeIntelFPGA call does not overlap with that returned by the first call. |
cl_int clUnmapHostPipeIntelFPGA ( cl_mem pipe, void * mapped_ptr, size_t size_to_unmap, size_t * unmapped_size ); |
Signals to the runtime that the host no longer uses size_to_unmap bytes of a host-addressable buffer that the clMapHostPipeIntelFPGA function has returned previously. In the case of a writeable host pipe, calling clUnmapHostPipeIntelFPGA allows the unmapped data to become available to the kernel. If the size_to_unmap value is smaller than the mapped_size value specified by the clMapHostPipeIntelFPGA function, multiple clUnmapHostPipeIntelFPGA function calls are necessary to unmap the full capacity of the buffer. You may include multiple clUnmapHostPipeIntelFPGA function calls to unmap successive bytes in the buffer returned by a clMapHostPipeIntelFPGA function call, up to the mapped_size value defined by the clMapHostPipeIntelFPGA call. |
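The map/unmap bookkeeping described in the table above can be sketched as follows. This is a conceptual model of the accounting only, not the runtime implementation; the model_map and model_unmap names are invented for illustration. A map call grants at most the remaining capacity, and successive unmap calls retire bytes until the full mapped size has been returned.

```c
#include <assert.h>
#include <stddef.h>

/* Conceptual model of the map/unmap accounting only (not the real runtime):
 * a map hands out up to the remaining capacity, and successive unmaps
 * retire bytes from the front of the mapped region. */
typedef struct {
    size_t capacity;   /* bytes the runtime can expose at once */
    size_t mapped;     /* bytes currently mapped and not yet unmapped */
} map_state;

/* Returns the number of bytes actually mapped (<= requested_size),
 * mirroring the mapped_size out-parameter of clMapHostPipeIntelFPGA. */
static size_t model_map(map_state *s, size_t requested_size) {
    size_t granted = s->capacity - s->mapped;
    if (granted > requested_size)
        granted = requested_size;
    s->mapped += granted;
    return granted;
}

/* Retires size_to_unmap bytes of a previous mapping; returns 0 on success,
 * -1 if the caller tries to unmap more than is currently mapped. */
static int model_unmap(map_state *s, size_t size_to_unmap) {
    if (size_to_unmap > s->mapped)
        return -1;
    s->mapped -= size_to_unmap;
    return 0;
}
```

The model shows why several partial unmaps may be needed before the runtime can grant a fresh, non-overlapping mapping of the full capacity.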
Creating a Host Accessible Pipe
To enable host access (reading or writing) to pipes, the cl_intel_fpga_host_pipe extension legalizes the following two flags values to clCreatePipe:
- CL_MEM_HOST_READ_ONLY
- CL_MEM_HOST_WRITE_ONLY
When one of these flags is passed to the clCreatePipe function, the corresponding cl_mem object can be passed as the first argument to clReadPipeIntelFPGA and clWritePipeIntelFPGA functions. Throughout the remainder of the cl_intel_fpga_host_pipe extension, such a pipe is referred to as a host pipe.
Example Use of the cl_intel_fpga_host_pipe Extension
Kernel Code
#pragma OPENCL EXTENSION cl_intel_fpga_host_pipe : enable

kernel void reader(__attribute__((intel_host_accessible))
                   __read_only pipe ulong4 host_in)
{
    ulong4 val;
    if (read_pipe(host_in, &val)) {
        ....
    }
    ....
}

kernel void writer(__attribute__((intel_host_accessible))
                   __write_only pipe ulong4 device_out)
{
    ulong4 val;
    ....
    if (write_pipe(device_out, &val)) {
        ....
    }
}
Host Code
....
cl_kernel read_kern  = clCreateKernel(program, "reader", NULL);
cl_kernel write_kern = clCreateKernel(program, "writer", NULL);

cl_mem read_pipe = clCreatePipe(context, CL_MEM_HOST_READ_ONLY,
                                sizeof( cl_ulong4 ),
                                128,  // Number of packets that can be buffered
                                NULL, &error);

cl_mem write_pipe = clCreatePipe(context, CL_MEM_HOST_WRITE_ONLY,
                                 sizeof( cl_ulong4 ),
                                 64,  // Number of packets that can be buffered
                                 NULL, &error);

// Bind pipes to kernels
clSetKernelArg(read_kern, 0, sizeof(cl_mem), (void *)&write_pipe);
clSetKernelArg(write_kern, 0, sizeof(cl_mem), (void *)&read_pipe);

// Enqueue kernels
....

cl_ulong4 val;
if (!clReadPipeIntelFPGA (read_pipe, &val)) {
    cl_int result = clWritePipeIntelFPGA (write_pipe, &val);
    // Check write success/failure and handle
    ....
}
....
Implementing Arbitrary Precision Integers
Use the Intel® FPGA SDK for OpenCL™ arbitrary precision integer extension to define integers with a custom bit-width. You can define integer custom bit-widths up to and including 64 bits.
In your kernel code, include the ihc_apint.h header file:

#include "ihc_apint.h"
When you compile your kernel, include the path to the header file:

aoc <other command options> -I $INTELFPGAOCLSDKROOT/include/kernel_headers <my_kernel_file>
In the header file, the arbitrary precision integer types ap_int<d> and ap_uint<d> are exposed as intd_t and uintd_t, where d is the bit width:

#define ap_int<d> intd_t
#define ap_uint<d> uintd_t
For example, the following declarations create 10-bit signed and unsigned integers:

int10_t x_signed;
uint10_t x_unsigned;
You can declare arbitrary precision integers with widths up to 64 bits.
You can also enable the extension in your kernel code with the following pragma:

#pragma OPENCL EXTENSION cl_intel_arbitrary_precision_integers : enable
After enabling the extension, declare variables using the ap_int<d> and ap_uint<d> types directly:

ap_int<d> my_signed_integer;
ap_uint<d> my_unsigned_integer;
If you do operations where the bit width of the result is larger than the bit widths of the arguments, you must explicitly cast one of the arguments to the resulting bit width.
int10_t a;
int10_t b;
int20_t res;

res = a * b;
In the example, the compiler instantiates a multiplier that multiplies two 10-bit integers and puts the result into another 10-bit integer. The result is then sign extended or zero extended up to 20 bits.
To multiply at the full 20-bit precision instead, cast one of the arguments to the result width:

res = ((int20_t)a) * b;
When you compile a program for x86-64 platforms, the bit widths for arbitrary precisions integers are rounded up to either 32 bits or 64 bits. When you compile a kernel for an FPGA platform, the bit widths are not rounded up and the arbitrary precision integers remain at their declared bit width.
As a result, an operation that appears to work correctly in an x86-64 program can overflow and lose precision when you compile that same operation in an FPGA kernel. The additional precision provided by bit-width rounding on x86-64 platforms masks possible overflow and precision-loss problems you might encounter when you compile your FPGA kernel.
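The overflow-masking effect can be reproduced on the host by emulating a 10-bit signed type in plain C. The to_int10 helper below is hypothetical (it is not part of ihc_apint.h); it truncates a 32-bit value to 10 bits and sign-extends, which is what the declared bit width implies on the FPGA.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: truncate a value to 10 bits and sign-extend,
 * emulating what an int10_t holds on the FPGA (plain C, not the SDK header). */
static int32_t to_int10(int32_t v) {
    v &= 0x3FF;               /* keep the low 10 bits */
    if (v & 0x200)            /* sign bit of a 10-bit value */
        v -= 0x400;           /* sign-extend */
    return v;
}
```

For example, 300 + 300 evaluates to 600 in the rounded-up 32-bit emulation, but a genuine 10-bit signed type (range -512..511) wraps the same sum to a negative value.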
Using Predefined Preprocessor Macros in Conditional Compilation
- To include device-specific (for example, FPGA_board_1) code in your kernel program, structure your kernel program in the following manner:
#if defined(AOCL_BOARD_FPGA_board_1)
    //FPGA_board_1-specific statements
#else
    //FPGA_board_2-specific statements
#endif
When you target your kernel compilation to a specific board, the SDK sets the predefined preprocessor macro AOCL_BOARD_<board_name> to 1. If <board_name> is FPGA_board_1, the Intel® FPGA SDK for OpenCL™ Offline Compiler compiles the FPGA_board_1-specific parameters and features.
- To introduce Intel® FPGA SDK for OpenCL™ Offline Compiler-specific compiler features and optimizations, structure your kernel program in the following manner:
#if defined(INTELFPGA_CL)
    //statements
#else
    //statements
#endif
Where INTELFPGA_CL is the Intel® predefined preprocessor macro for the offline compiler.
Declaring __constant Address Space Qualifiers
Function Scope __constant Variables
The Intel® FPGA SDK for OpenCL™ Offline Compiler does not support function scope __constant variables. Replace function scope __constant variables with file scope constant variables. You can also replace function scope __constant variables with __constant buffers that the host passes to the kernel.
File Scope __constant Variables
If the host always passes the same constant data to your kernel, consider declaring that data as a constant preinitialized file scope array within the kernel file. Declaration of a constant preinitialized file scope array creates a ROM directly in the hardware to store the data. This ROM is available to all work-items in the NDRange.
For example:
__constant int my_array[8] = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7};
__kernel void my_kernel (__global int * my_buffer)
{
size_t gid = get_global_id(0);
my_buffer[gid] += my_array[gid % 8];
}
In this case, the offline compiler sets the values for my_array in a ROM because the file scope constant data does not change between kernel invocations.
Pointers to __constant Parameters from the Host
You can replace file scope constant data with a pointer to a __constant parameter in your kernel code if the data is not fixed across kernel invocations. You must then modify your host application in the following manner:
- Create cl_mem memory objects associated with the pointers in global memory.
- Load constant data into cl_mem objects with clEnqueueWriteBuffer prior to kernel execution.
- Pass the cl_mem objects to the kernel as arguments with the clSetKernelArg function.
For simplicity, if a constant variable is of a complex type, use a typedef argument, as shown in the table below:
If your source code is structured as follows: | Rewrite your code to resemble the following syntax: |
---|---|
__constant int Payoff[2][2] = {{ 1, 3}, {5, 3}};

__kernel void original(__global int * A)
{
    *A = Payoff[1][1];
    // and so on
} |
typedef int Payoff_type[2][2];

__kernel void modified(__global int * A,
                       __constant Payoff_type * restrict PayoffPtr)
{
    *A = (*PayoffPtr)[1][1];
    // and so on
} |
Including Structure Data Types as Arguments in OpenCL Kernels
The table below describes how you can convert structure parameters:
If your source code is structured as follows: | Rewrite your code to resemble the following syntax: |
---|---|
struct Context
{
float param1;
float param2;
int param3;
uint param4;
};
__kernel void algorithm(__global float * A,
struct Context c)
{
if (c.param3)
{
// statements
}
}
|
struct Context
{
float param1;
float param2;
int param3;
uint param4;
};
__kernel void algorithm(__global float * A,
__global struct Context * restrict c)
{
if (c->param3)
{
// Dereference through a
// pointer and so on
}
}
|
Matching Data Layouts of Host and Kernel Structure Data Types
To match member data types, use the cl_ version of the data type in your host application that corresponds to the data type in the kernel code. The cl_ version of the data type is available in the opencl.h header file. For example, if you have a data member of type float4 in your kernel code, the corresponding data member you declare in the host application is cl_float4.
Align the structures and align the struct data members between the host and kernel applications. Manage the alignments carefully because of the variability among different host compilers.
For example, if you have float4 OpenCL data types in the struct, the alignments of these data items must satisfy the OpenCL specification (that is, 16-byte alignment for float4).
The following rules apply when the Intel® FPGA SDK for OpenCL™ Offline Compiler compiles your OpenCL kernels:
- Alignment of built-in scalar and vector types follows the rules outlined in Section 6.1.5 of the OpenCL Specification version 1.0.
The offline compiler usually aligns a data type based on its size. However, the compiler aligns a value of a three-element vector the same way it aligns a four-element vector.
- An array has the same alignment as one of its elements.
- A struct (or a union) has the same alignment as the maximum alignment necessary for any of its
data members.
Consider the following example:
struct my_struct
{
    char data[3];
    float4 f4;
    int index;
};
The offline compiler aligns the struct elements above at 16-byte boundaries because of the float4 data type. As a result, both data and index also have 16-byte alignment boundaries.
- The offline compiler does not reorder data members of a struct.
- Normally, the offline compiler inserts a minimum amount of data structure
padding between data members of a struct to satisfy the
alignment requirements for each data member.
- In your OpenCL kernel code, you may specify data packing (that is, no insertion of data structure padding) by applying the packed attribute to the struct declaration. If you impose data packing, ensure that the alignment of data members satisfies the OpenCL alignment requirements. The Intel® FPGA SDK for OpenCL™ does not enforce these alignment requirements. Ensure that your host compiler respects the kernel attribute and sets the appropriate alignments.
- In your OpenCL kernel code, you may specify the amount of data
structure padding by applying the aligned(N) attribute to a data member, where N is the amount of padding. The SDK does not enforce these alignment requirements. Ensure
that your host compiler respects the kernel attribute and sets the appropriate
alignments.
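These rules can be observed on a C11 host by modeling the float4 member of the earlier my_struct example as a 16-byte-aligned array of four floats (a stand-in used here for illustration; the actual host type is cl_float4):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Host-side model of the kernel struct from the alignment example:
 * float4 is modeled as a 16-byte-aligned array of four floats. */
struct my_struct {
    char data[3];
    alignas(16) float f4[4];
    int index;
};
```

With this layout, data begins at offset 0, f4 is padded up to offset 16, index lands at offset 32, and trailing padding brings the total size to 48 bytes, matching the 16-byte boundaries described above.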
For Windows systems, some versions of the Microsoft Visual Studio compiler pack structure data types by default. If you do not want to apply data packing, specify an amount of data structure padding as shown below:
struct my_struct
{
    __declspec(align(16)) char data[3];
    /*Note that cl_float4 is the only known float4 definition on the host*/
    __declspec(align(16)) cl_float4 f4;
    __declspec(align(16)) int index;
};
Tip: An alternative way of adding data structure padding is to insert dummy struct members of type char or array of char.
Disabling Insertion of Data Structure Padding
struct __attribute__((packed)) Context
{
    float param1;
    float param2;
    int param3;
    uint param4;
};

__kernel void algorithm(__global float * restrict A,
                        __global struct Context * restrict c)
{
    if ( c->param3 )
    {
        // Dereference through a pointer and so on
    }
}
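On a GCC or Clang host, the effect of the packed attribute is visible through sizeof. The Context struct above happens to contain only 4-byte members, so packing does not change its layout; the hypothetical struct below, where a char precedes an int, shows the difference:

```c
#include <assert.h>

/* Hypothetical example: a char followed by an int normally gets 3 bytes
 * of padding so the int lands on a 4-byte boundary. */
struct padded {
    char c;
    int  i;
};

/* GCC/Clang packed attribute: the padding is removed, so the int
 * becomes misaligned but the struct shrinks. */
struct __attribute__((packed)) packed_s {
    char c;
    int  i;
};
```

This is why the host compiler must honor the same packing decision as the kernel: otherwise the host and device disagree about where each member lives.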
Specifying the Alignment of a Struct
struct __attribute__((aligned(2))) Context
{
    float param1;
    float param2;
    int param3;
    uint param4;
};

__kernel void algorithm(__global float * A,
                        __global struct Context * restrict c)
{
    if ( c->param3 )
    {
        // Dereference through a pointer and so on
    }
}
Inferring a Register
The offline compiler infers private arrays as registers either as single values or in a piecewise fashion. Piecewise implementation results in very efficient hardware; however, the offline compiler must be able to determine data accesses statically. To facilitate piecewise implementation, hardcode the access points into the array. You can also facilitate register inference by unrolling loops that access the array.
If array accesses are not inferable statically, the offline compiler might infer the array as registers. However, the offline compiler limits the size of these arrays to 64 bytes in length for single work-item kernels. There is effectively no size limit for kernels with multiple work-items.
Consider the following code example:
int array[SIZE];

for (int j = 0; j < N; ++j)
{
    for (int i = 0; i < SIZE - 1; ++i)
    {
        array[i] = array[i + 1];
    }
}
The indexing into array[i] is not inferable statically because the loop is not unrolled. If the size of array[SIZE] is less than or equal to 64 bytes for single work-item kernels, the offline compiler implements array[SIZE] into registers as a single value. If the size of array[SIZE] is greater than 64 bytes for single work-item kernels, the offline compiler implements the entire array in block RAMs. For multiple work-item kernels, the offline compiler implements array[SIZE] into registers as a single value as long as its size is less than 1 kilobyte (KB).
Inferring a Shift Register
Consider the following code example:
channel int in, out;

#define SIZE 512 //Shift register size must be statically determinable

__kernel void foo()
{
    //The key is that the array size is a compile-time constant
    int shift_reg[SIZE];

    //Initialization loop
    #pragma unroll
    for (int i = 0; i < SIZE; i++)
    {
        //All elements of the array should be initialized to the same value
        shift_reg[i] = 0;
    }

    while (1)
    {
        //Fully unrolling the shifting loop produces constant accesses
        #pragma unroll
        for (int j = 0; j < SIZE - 1; j++)
        {
            shift_reg[j] = shift_reg[j + 1];
        }

        shift_reg[SIZE - 1] = read_channel_intel(in);

        //Using fixed access points of the shift register
        int res = (shift_reg[0] + shift_reg[1]) / 2;

        //'out' channel will have running average of the input channel
        write_channel_intel(out, res);
    }
}
In each clock cycle, the kernel shifts a new value into the array. By placing this shift register into a block RAM, the Intel® FPGA SDK for OpenCL™ Offline Compiler can efficiently handle multiple access points into the array. The shift register design pattern is ideal for implementing filters (for example, image filters like a Sobel filter or time-delay filters like a finite impulse response (FIR) filter).
When implementing a shift register in your kernel code, keep in mind the following key points:
- Unroll the shifting loop so that it can access every element of the array.
- All access points must have constant data accesses. For example, if you write a calculation in nested loops using multiple access points, unroll these loops to establish the constant access points.
- Initialize all elements of the array to the same value. Alternatively, you may leave the elements uninitialized if you do not require a specific initial value.
- If some accesses to a large array are not inferable statically, they force the offline compiler to create inefficient hardware. If these accesses are necessary, use __local memory instead of __private memory.
- Do not shift a large shift register conditionally. The shifting must occur in every loop iteration that contains the shifting code to avoid creating inefficient hardware.
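The shift-register pattern can be checked functionally in plain C. The sketch below models one iteration of the kernel's while loop with a small, illustrative size (the kernel above used 512); shift_and_average is an invented name for this model, not SDK API.

```c
#include <assert.h>

#define SR_SIZE 4   /* small size for illustration; the kernel used 512 */

/* Plain-C model of one iteration of the kernel's while loop: shift every
 * element down by one, append the new sample, and return the running
 * average of the two oldest samples. */
static int shift_and_average(int shift_reg[SR_SIZE], int new_sample) {
    for (int j = 0; j < SR_SIZE - 1; j++)
        shift_reg[j] = shift_reg[j + 1];
    shift_reg[SR_SIZE - 1] = new_sample;
    return (shift_reg[0] + shift_reg[1]) / 2;
}
```

On the FPGA, fully unrolling the inner loop turns every index into a constant, which is what lets the offline compiler map the array onto registers instead of memory.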
Enabling Double Precision Floating-Point Operations
Before declaring any double precision floating-point data type in your OpenCL kernel, include the following OPENCL EXTENSION pragma in your kernel code:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
Single-Cycle Floating-Point Accumulator for Single Work-Item Kernels
The offline compiler supports an accumulator that adds or subtracts a value. To leverage this feature, describe the accumulation in a way that allows the offline compiler to infer the accumulator.
- The accumulator is only available on Arria® 10 devices.
- The accumulator must be part of a loop.
- The accumulator must have an initial value of 0.
- The accumulator cannot be conditional.
Below are examples of a description that results in the correct inference of the accumulator by the offline compiler.
channel float4 RANDOM_STREAM;

__kernel void acc_test(__global float *a, int k)
{
    // Simplest example of an accumulator.
    // In this loop, the accumulator acc is incremented by 5.
    int i;
    float acc = 0.0f;

    for (i = 0; i < k; i++) {
        acc += 5;
    }
    a[0] = acc;
}

__kernel void acc_test2(__global float *a, int k)
{
    // Extended example showing that an accumulator can be
    // conditionally incremented. The key here is to describe the increment
    // as conditional, not the accumulation itself.
    int i;
    float acc = 0.0f;

    for (i = 0; i < k; i++) {
        acc += ((i < 30) ? 5 : 0);
    }
    a[0] = acc;
}

__kernel void acc_test3(__global float *a, int k)
{
    // A more complex case where the accumulator is fed
    // by a dot product.
    int i;
    float acc = 0.0f;

    for (i = 0; i < k; i++) {
        float4 v = read_channel_intel(RANDOM_STREAM);
        float x1 = v.x;
        float x2 = v.y;
        float y1 = v.z;
        float y2 = v.w;
        acc += (x1*y1 + x2*y2);
    }
    a[0] = acc;
}

__kernel void loader(__global float *a, int k)
{
    int i;
    float4 my_val = 0;

    for (i = 0; i < k; i++) {
        if ((i%4) == 0)
            write_channel_intel(RANDOM_STREAM, my_val);
        if ((i%4) == 0) my_val.x = a[i];
        if ((i%4) == 1) my_val.y = a[i];
        if ((i%4) == 2) my_val.z = a[i];
        if ((i%4) == 3) my_val.w = a[i];
    }
}
Programming Strategies for Inferring the Accumulator
Describing an Accumulator Using Multiple Loops
Consider a case where you want to describe an accumulator using multiple loops, with some of the loops being unrolled:
float acc = 0.0f;

for (i = 0; i < k; i++)
{
    #pragma unroll
    for (j = 0; j < 16; j++)
        acc += (x[i+j] * y[i+j]);
}
In this situation, it is important to compile the kernel with the -fp-relaxed Intel® FPGA SDK for OpenCL™ Offline Compiler command option to enable the offline compiler to rearrange the operations in a way that exposes the accumulation. If you do not compile the kernel with -fp-relaxed, the resulting accumulator structure will have a high initiation interval (II). II is the number of cycles between launching successive loop iterations. The higher the II value, the longer the accumulator structure must wait before it can process the next loop iteration.
Modifying a Multi-Loop Accumulator Description
In cases where you cannot compile an accumulator description using the -fp-relaxed offline compiler command option, rewrite the code to expose the accumulation.
For the code example above, rewrite it in the following manner:
float acc = 0.0f;
for (i = 0; i < k; i++) {
  float my_dot = 0.0f;
  #pragma unroll
  for (j = 0; j < 16; j++)
    my_dot += (x[i + j] * y[i + j]);
  acc += my_dot;
}
Modifying an Accumulator Description Containing a Variable or Non-Zero Initial Value
Consider a situation where you might want to apply an offset to a description of an accumulator that begins with a non-zero value:
float acc = array[0];
for (i = 0; i < k; i++) {
  acc += x[i];
}
Because the accumulator hardware does not support variable or non-zero initial values in a description, you must rewrite the description as follows:
float acc = 0.0f;
for (i = 0; i < k; i++) {
  acc += x[i];
}
acc += array[0];
Rewriting the description in this manner enables the kernel to use an accumulator in the loop; the loop is then followed by a single addition of array[0] to the accumulated result.
Designing Your Host Application
Host Programming Requirements
Host Machine Memory Requirements
The host machine must support the following components:
- The host application and operating system.
- The working set for the host application.
- The maximum amount of OpenCL™ buffer memory that can be allocated at once. Every device-side cl_mem buffer is associated with a corresponding storage area in the host process; therefore, the amount of host memory required might be as large as the amount of external memory that the FPGA supports.
Host Binary Requirement
Multiple Host Threads
All OpenCL APIs are thread safe except the clSetKernelArg function.
It is safe to call clSetKernelArg from any host thread or in a reentrant way as long as concurrent calls to clSetKernelArg operate on different cl_kernel objects.
Out-of-Order Command Queues
Requirement for Multiple Command Queues to Execute Kernels Concurrently
Allocating OpenCL Buffers for Manual Partitioning of Global Memory
Manual partitioning of global memory buffers allows you to control memory accesses across buffers to maximize the memory bandwidth. You can partition buffers across interfaces of the same memory type or across interfaces of different memory types.
Partitioning Buffers Across Multiple Interfaces of the Same Memory Type
The figure below illustrates the differences between burst-interleaved and non-interleaved memory partitions.
To manually partition some or all of the available global memory types, perform the following tasks:
- Compile your OpenCL kernel using the -no-interleaving=<global_memory_type> flag to configure the memory bank(s) of the specified memory type as separate addresses. For more information on the usage of the -no-interleaving=<global_memory_type> flag, refer to the Disabling Burst-Interleaving of Global Memory (-no-interleaving=<global_memory_type>) section.
- Create an OpenCL buffer in your host application, and allocate the buffer to one of the banks using the CL_CHANNEL flags.
- Specify CL_CHANNEL_1_INTELFPGA to allocate the buffer to the lowest available memory region.
- Specify CL_CHANNEL_2_INTELFPGA to allocate the buffer to the second memory region (if available).
Attention: Allocate each buffer to a single memory bank only. If the second bank is not available at runtime, the memory is allocated to the first bank. If no global memory is available, the clCreateBuffer call fails with the error message CL_MEM_OBJECT_ALLOCATION_FAILURE.
Partitioning Buffers Across Different Memory Types (Heterogeneous Memory)
To use the heterogeneous memory, modify the code in your .cl file as follows:
- Determine the names of the global memory types available on your FPGA board in one of the following ways:
- Refer to the board vendor's documentation for more information.
- Find the names in the board_spec.xml file of your board Custom Platform. For each global memory type, the name is the unique string assigned to the name attribute of the global_mem element.
- To instruct the host to allocate a buffer to a specific global memory type, insert the buffer_location("<memory_type>") attribute, where <memory_type> is the name of the global memory type provided by your board vendor. For example:
__kernel void foo(__global __attribute__((buffer_location("DDR"))) int *x,
                  __global __attribute__((buffer_location("QDR"))) int *y)
If you do not specify the buffer_location attribute, the host allocates the buffer to the default memory type automatically. To determine the default memory type, consult the documentation provided by your board vendor. Alternatively, in the board_spec.xml file of your Custom Platform, search for the memory type that is defined first or has the attribute default=1 assigned to it.
Intel® recommends that you define the buffer_location attribute in a preprocessor macro for ease of reuse, as follows:
#define QDR \
  __global __attribute__((buffer_location("QDR")))

#define DDR \
  __global __attribute__((buffer_location("DDR")))

__kernel void foo (QDR uint * data, DDR uint * lup)
{
  //statements
}
Attention: If you assign a kernel argument to a non-default memory (for example, QDR uint * data and DDR uint * lup from the code above), you cannot declare that argument using the constant keyword. In addition, you cannot perform atomic operations with pointers derived from that argument.
By default, the host allocates buffers into the main memory when you load kernels into the OpenCL runtime via the clCreateProgramWithBinary function. During kernel invocation, the host automatically relocates heterogeneous memory buffers that are bound to kernel arguments to the main memory.
- To avoid the initial allocation of heterogeneous memory buffers in the main memory, include the CL_MEM_HETEROGENEOUS_INTEL flag when you call the clCreateBuffer function. Also, bind the cl_mem buffer to the argument that uses the buffer_location attribute using clSetKernelArg before doing any reads or writes from that buffer, as follows:
mem = clCreateBuffer(context, flags|CL_MEM_HETEROGENEOUS_INTEL, memSize,
                     NULL, &errNum);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &mem);
clEnqueueWriteBuffer(queue, mem, CL_FALSE, 0, N, 0, NULL, &write_event);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, NULL, 0,
                       NULL, &kernel_event);
For example, the following clCreateBuffer call allocates memory into the lowest available memory region of a nondefault memory bank:
mem = clCreateBuffer(context,
                     (CL_MEM_HETEROGENEOUS_INTEL|CL_CHANNEL_1_INTELFPGA),
                     memSize, NULL, &errNum);
The clCreateBuffer call allocates memory into a global memory type based on what you specify in the kernel argument. If a memory (cl_mem) object residing in one memory type is set as a kernel argument that corresponds to a different memory technology, the host moves the memory object automatically when it queues the kernel. Do not pass a single buffer to kernel arguments that associate it with multiple memory technologies.
For more information on optimizing heterogeneous global memory accesses, refer to the Heterogeneous Memory Buffers and the Manual Partitioning of Global Memory sections of the Intel® FPGA SDK for OpenCL™ Best Practices Guide.
Creating a Pipe Object in Your Host Application
An SDK-specific pipe object is not a true OpenCL pipe object as described in the OpenCL Specification version 2.0. This implementation allows you to migrate away from Intel® FPGA products with a conformant solution. The SDK-specific pipe object is a memory object (cl_mem); however, the host does not allocate any memory for the pipe itself.
The following clCreatePipe host API creates a pipe object:
cl_mem clCreatePipe(cl_context context,
                    cl_mem_flags flags,
                    cl_uint pipe_packet_size,
                    cl_uint pipe_max_packets,
                    const cl_pipe_properties *properties,
                    cl_int *errcode_ret)
For more information on the clCreatePipe host API function, refer to section 5.4.1 of the OpenCL Specification version 2.0.
Below is an example syntax of the clCreatePipe host API function:
cl_int status;
cl_mem c0_pipe = clCreatePipe(context, 0, sizeof(int), 1, NULL, &status);
status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &c0_pipe);
Collecting Profile Data During Kernel Execution
extern CL_API_ENTRY cl_int CL_API_CALL clGetProfileInfoIntelFPGA(cl_event);
where cl_event is the kernel event. The kernel event you pass to this host library call must be the same one you pass to the clEnqueueNDRangeKernel call.
int main()
{
  ...
  clEnqueueNDRangeKernel(queue, kernel, ..., NULL);
  ...
  clEnqueueNDRangeKernel(queue, kernel, ..., NULL);
  ...
}
This host application assumes that the kernel launches twice and then completes. The profile.mon file then contains two sets of profile data, one for each kernel invocation. To collect profile data while the kernel is running, modify the host code in the following manner:
int main()
{
  ...
  clEnqueueNDRangeKernel(queue, kernel, ..., &event);
  //Get the profile data before the kernel completes
  clGetProfileInfoIntelFPGA(event);
  //Wait until the kernel completes
  clFinish(queue);
  ...
  clEnqueueNDRangeKernel(queue, kernel, ..., NULL);
  ...
}
The call to clGetProfileInfoIntelFPGA adds a new entry in the profile.mon file. The Intel® FPGA Dynamic Profiler for OpenCL™ GUI then parses this entry in the report.
- Profile Your Kernel to Identify Performance Bottlenecks in the Intel® FPGA SDK for OpenCL™ Best Practices Guide
- Profiling Your OpenCL Kernel
Profiling Enqueued and Autorun Kernels
Following is the code snippet of the clGetProfileDataDeviceIntelFPGA host library call:
cl_int clGetProfileDataDeviceIntelFPGA(cl_device_id device_id,
                                       cl_program program,
                                       cl_bool read_enqueue_kernels,
                                       cl_bool read_auto_enqueued,
                                       cl_bool clear_counters_after_readback,
                                       size_t param_value_size,
                                       void *param_value,
                                       size_t *param_value_size_ret,
                                       cl_int *errcode_ret);
where:
- The read_enqueue_kernels parameter profiles enqueued kernels. In the 17.1.1 release, this parameter has no effect.
- The read_auto_enqueued parameter profiles autorun kernels.
- The following parameters are placeholders for future releases:
- clear_counters_after_readback
- param_value_size
- param_value
- param_value_size_ret
- errcode_ret
In 17.1.1, this host library call supports only autorun kernels. You can pass TRUE for the read_enqueue_kernels parameter, but the boolean is ignored. This does not mean that enqueued kernels are not profiled: if the compiler profile flag is set to include enqueued kernels, their profile data is still captured normally at the end of execution. The only difference is that the clGetProfileDataDeviceIntelFPGA host library call does not profile enqueued kernels in addition to the profiling that is already performed automatically for them.
The clGetProfileDataDeviceIntelFPGA host library call returns CL_SUCCESS on success. Otherwise, it returns one of the following errors:
- CL_INVALID_DEVICE if the device is not a valid device.
- CL_INVALID_PROGRAM if the program is not a valid program.
Profile Data Acquisition
Pausing data acquisition is not synchronized exactly across all kernels. The skew between halting profile data acquisition across kernels is dependent on the communication link with the device, driver overhead, and congestion on communication buses. Exact synchronized snapshotting of profile data between kernels should not be relied upon.
Multiple Autorun Profiling Calls
Accessing Custom Platform-Specific Functions
The clGetBoardExtensionFunctionAddressIntelFPGA extension specifies an API that retrieves a pointer to a user-accessible function from the Custom Platform.
Definitions of the extension interfaces are available in the INTELFPGAOCLSDKROOT/host/include/CL/cl_ext.h file.
void* clGetBoardExtensionFunctionAddressIntelFPGA(const char* function_name,
                                                  cl_device_id device);

where function_name is the name of the user-accessible function that your Custom Platform vendor provides, and device is the device ID returned by the clGetDeviceIDs function.
To access the clGetBoardExtensionFunctionAddressIntelFPGA API via the Installable Client Driver (ICD), ensure that the ICD extension API clGetExtensionFunctionAddressIntelFPGA retrieves the pointer to the clGetBoardExtensionFunctionAddressIntelFPGA API first.
The following code example shows how you can access the Custom Platform-specific function via ICD:
clGetBoardExtensionFunctionAddressIntelFPGA_fn
    clGetBoardExtensionFunctionAddressIntelFPGA =
        (clGetBoardExtensionFunctionAddressIntelFPGA_fn)
        clGetExtensionFunctionAddressForPlatform(platform,
            "clGetBoardExtensionFunctionAddressIntelFPGA");

if (clGetBoardExtensionFunctionAddressIntelFPGA == NULL) {
  printf("Failed to get clGetBoardExtensionFunctionAddressIntelFPGA\n");
}

void *board_extension_function_ptr =
    clGetBoardExtensionFunctionAddressIntelFPGA("function_name", device_id);
Modifying Host Program for Structure Parameter Conversion
Perform the following changes to your host application:
- Allocate a cl_mem buffer to store the structure contents.
Attention: You need a separate cl_mem buffer for every kernel that uses a different structure value.
- Set the structure kernel argument with a pointer to the structure buffer, not with a pointer to the structure contents.
- Populate the structure buffer contents before queuing the kernel. Perform one of the following steps to ensure that the structure buffer is populated before the kernel launches:
- Queue the structure buffer on the same command queue as the kernel queue.
- Synchronize separate kernel queues and structure buffer queues with an event.
- When your application no longer needs to call a kernel that uses the structure buffer, release the cl_mem buffer.
Managing Host Application
For Linux systems, if you debug your host application using the GNU Project Debugger (GDB), invoke the following command prior to running the host application:
handle SIG44 nostop
Without this command, the GDB debugging process terminates with the following error message:
Program received signal SIG44, Real-time event 44.
Displaying Example Makefile Fragments (example-makefile or makefile)
The following are example Makefile fragments for compiling and linking a host program against the host runtime libraries included with the Intel® FPGA SDK for OpenCL™.
Example GNU makefile on Linux, with GCC toolchain:
AOCL_COMPILE_CONFIG=$(shell aocl compile-config)
AOCL_LINK_CONFIG=$(shell aocl link-config)
host_prog : host_prog.o
g++ -o host_prog host_prog.o $(AOCL_LINK_CONFIG)
host_prog.o : host_prog.cpp
g++ -c host_prog.cpp $(AOCL_COMPILE_CONFIG)
Example GNU makefile on Windows, with Microsoft Visual C++ command line compiler:
AOCL_COMPILE_CONFIG=$(shell aocl compile-config)
AOCL_LINK_CONFIG=$(shell aocl link-config)
host_prog.exe : host_prog.obj
link -nologo /OUT:host_prog.exe host_prog.obj $(AOCL_LINK_CONFIG)
host_prog.obj : host_prog.cpp
cl /MD /Fohost_prog.obj -c host_prog.cpp $(AOCL_COMPILE_CONFIG)
Example GNU makefile cross-compiling to ARM SoC from Linux or Windows, with Linaro GCC cross-compiler toolchain:
CROSS-COMPILER=arm-linux-gnueabihf-
AOCL_COMPILE_CONFIG=$(shell aocl compile-config --arm)
AOCL_LINK_CONFIG=$(shell aocl link-config --arm)
host_prog : host_prog.o
$(CROSS-COMPILER)g++ -o host_prog host_prog.o $(AOCL_LINK_CONFIG)
host_prog.o : host_prog.cpp
$(CROSS-COMPILER)g++ -c host_prog.cpp $(AOCL_COMPILE_CONFIG)
Compiling and Linking Your Host Application
Displaying Flags for Compiling Host Application (compile-config)
- At a command prompt, invoke the aocl compile-config utility command.
The software displays the path to the folder or directory in which the OpenCL™ API header files reside. For example:
- For Windows systems, the path is -I%INTELFPGAOCLSDKROOT%/host/include
- For Linux systems, the path is -I$INTELFPGAOCLSDKROOT/host/include
where INTELFPGAOCLSDKROOT points to the location of the software installation.
- Add this path to the include file search path of your C preprocessor.
Displaying Paths to OpenCL Host Runtime and MMD Libraries (ldflags)
- The OpenCL™ host runtime libraries that provide OpenCL platform and runtime APIs. The OpenCL host runtime libraries are available in the INTELFPGAOCLSDKROOT/host/<OS_platform>/lib directory.
- The path to the Custom Platform-specific MMD libraries. The MMD libraries are available in the <board_family_name>/<OS_platform>/lib directory of your Custom Platform.
Listing OpenCL Host Runtime and MMD Libraries (ldlibs)
The software lists the OpenCL host runtime libraries residing in the INTELFPGAOCLSDKROOT/host/<OS_platform>/lib directory. It also lists the Custom Platform-specific MMD libraries residing in the /<board_family_name>/<OS_platform>/lib directory of your Custom Platform.
- For Windows systems, the output might resemble the following example:
alterahalmmd.lib
<board_vendor_name>_<board_family_name>_mmd.[lib|so|a|dll]
alteracl.lib
acl_emulator_kernel_rt.lib
pkg_editor.lib
libelf.lib
acl_hostxml.lib
If you set up FCD correctly, the output will be OpenCL.lib.
- For Linux systems, the output might resemble the following example:
-lalteracl -ldl -lacl_emulator_kernel_rt -lalterahalmmd
-l<board_vendor_name>_<board_family_name>_mmd -lelf -lrt -lstdc++
If you set up FCD correctly, the output will be -lOpenCL.
Displaying Information on OpenCL Host Runtime and MMD Libraries (link-config or linkflags)
- The path to and the names of OpenCL™ host runtime libraries that provide OpenCL platform and runtime APIs. The OpenCL host runtime libraries are available in the INTELFPGAOCLSDKROOT/host/<OS_platform>/lib directory.
- The path to and the names of the Custom Platform-specific MMD libraries. The MMD libraries are available in the <board_family_name>/<OS_platform>/lib directory of your Custom Platform.
Note: If you set up FCD correctly, the software does not print the path and names of the MMD libraries because the host no longer needs to link to the MMD libraries directly. The MMD libraries are loaded at runtime through the FCD loader.
- For Windows systems, the link options might resemble the following example output:
/libpath:%INTELFPGAOCLSDKROOT%/board/<board_name>/windows64/lib
/libpath:%INTELFPGAOCLSDKROOT%/host/windows64/lib
alterahalmmd.lib
<board_vendor_name>_<board_family_name>_mmd.[lib|so|a|dll]
alteracl.lib
acl_emulator_kernel_rt.lib
pkg_editor.lib
libelf.lib
acl_hostxml.lib
If you set up FCD correctly, the output will be /libpath:%INTELFPGAOCLSDKROOT%/host/windows64/lib OpenCL.lib
- For Linux systems, the link options might resemble the following example output:
-L/$INTELFPGAOCLSDKROOT/board/<board_name>/linux64/lib
-L/$INTELFPGAOCLSDKROOT/host/linux64/lib
-lalteracl -ldl -lacl_emulator_kernel_rt -lalterahalmmd
-l<board_vendor_name>_<board_family_name>_mmd -lelf -lrt -lstdc++
If you set up FCD correctly, the output will be -L/$INTELFPGAOCLSDKROOT/host/[linux64|arm32]/lib -lOpenCL
Linking Your Host Application to the Khronos ICD Loader Library
In addition to the SDK's host runtime libraries, Intel® supplies a version of the ICD Loader Library that supports the OpenCL Specification version 1.0 and the implemented APIs from the OpenCL Specification versions 1.1, 1.2, and 2.0. To use an ICD library from another vendor, consult the vendor's documentation on how to link to their ICD library.
Linking to the ICD Loader Library on Windows
- Consult with your board vendor to identify the libraries that the FCD requires. Alternatively, you can invoke the aocl ldlibs command and identify the libraries that your OpenCL application requires.
- Specify the libraries in the registry key HKEY_LOCAL_MACHINE\SOFTWARE\Intel\OpenCL\Boards. Specify the value name to be the path to the library, and specify the data to be a DWORD that is set to 0.
Attention: If your board vendor provides multiple libraries, you might need to load them in a particular order. Consult with your board vendor to determine the correct order to load the libraries. List the libraries in the registry in their loading order.
To enumerate board vendor-specific libraries, the FCD loader scans the values in the HKEY_LOCAL_MACHINE\SOFTWARE\Intel\OpenCL\Boards registry key. For each DWORD value that is set to 0, the FCD loader opens the corresponding DLL that is specified in the value name.
Consider the following registry key value:
[HKEY_LOCAL_MACHINE\SOFTWARE\Intel\OpenCL\Boards]
"c:\board_vendor a\my_board_mmd.dll"=dword:00000000
The FCD loader scans this value and loads the library my_board_mmd.dll from the board_vendor a folder.
If your host application fails to run while it is linking to the ICD, ensure that the HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors registry key contains the following value:
[HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors]
"alteracl_icd.dll"=dword:00000000
Linking to the ICD Loader Library on Linux
- If you need to manually set up FCD support for your Custom Platform, perform the following tasks:
- Consult with your board vendor to identify the libraries that the FCD requires. Alternatively, you can invoke the aocl ldlibs command and identify the libraries that your OpenCL application requires.
- Ensure that the file /opt/Intel/OpenCL/Boards/my_board.fcd exists in your Custom Platform and contains the names of the vendor-specific libraries (for example, /data/board_vendor_a/libmy_board_mmd.so).
The FCD loader scans the contents of the .fcd file and then loads the libmy_board_mmd.so library from the board_vendor_a folder.
- For Cyclone V SoC boards, when you build the SD flash card image for your Custom Platform, create an Altera.icd file containing the text libalteracl.so. Store the Altera.icd file in the /etc/OpenCL/vendors directory of your Custom Platform.
Refer to the Building an SD Flash Card Image section of the Intel® FPGA SDK for OpenCL™ Cyclone V SoC Development Kit Reference Platform Porting Guide for more information.
Programming an FPGA via the Host
- Compile your OpenCL kernel with the offline compiler to create the .aocx file.
- Include the clCreateProgramWithBinary function in your host application to create the cl_program OpenCL program objects from the .aocx file.
- Include the clBuildProgram function in your host application to create the program executable for the specified device.
Below is example host code that uses clCreateProgramWithBinary to program an FPGA device:
size_t lengths[1];
unsigned char* binaries[1] = {NULL};
cl_int status[1];
cl_int error;
cl_program program;
const char options[] = "";

FILE *fp = fopen("program.aocx", "rb");
fseek(fp, 0, SEEK_END);
lengths[0] = ftell(fp);
binaries[0] = (unsigned char*)malloc(sizeof(unsigned char)*lengths[0]);
rewind(fp);
fread(binaries[0], lengths[0], 1, fp);
fclose(fp);

program = clCreateProgramWithBinary(context, 1, device_list, lengths,
                                    (const unsigned char **)binaries,
                                    status, &error);
clBuildProgram(program, 1, device_list, options, NULL, NULL);
If the clBuildProgram function executes successfully, it returns CL_SUCCESS.
- Create kernel objects from the program executable using the clCreateKernelsInProgram or clCreateKernel function.
- Include the kernel execution function to instruct the host runtime to execute the scheduled kernel(s) on the FPGA.
- To enqueue a command to execute an NDRange kernel, use clEnqueueNDRangeKernel.
- To enqueue a single work-item kernel, use clEnqueueTask.
Attention: Intel® recommends that you release an event object when it is not in use. The SDK keeps an event object live until you explicitly instruct it to release the event object. Keeping an unused event object live causes unnecessary memory usage.
To release an event object, call the clReleaseEvent function.
You can load multiple FPGA programs into memory, which the host then uses to reprogram the FPGA as required.
Programming Multiple FPGA Devices
Linking your host application to FCD allows you to target multiple FPGA devices from different Custom Platforms. However, this feature has limited support for Custom Platforms that are compatible with SDK versions prior to 16.1.
You can present up to 32 FPGA devices to your system in the following manner:
- Multiple FPGA accelerator boards, each consisting of a single FPGA.
- Multiple FPGAs on a single accelerator board that connects to the host system via a PCIe® switch.
- Combinations of the above.
The host runtime can load kernels onto all of the FPGA devices, and the devices can then operate in parallel.
Probing the OpenCL FPGA Devices
- To query a list of FPGA devices installed in your machine, invoke the aocl diagnose command.
- To direct the host to identify the number of OpenCL FPGA devices, add the following lines of code to your host application:
//Get the platform
ciErrNum = clGetPlatformIDs(1, &cpPlatform, NULL);

//Get the devices
ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_ALL, 0, NULL,
                          &ciDeviceCount);
cdDevices = (cl_device_id *)malloc(ciDeviceCount * sizeof(cl_device_id));
ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_ALL, ciDeviceCount,
                          cdDevices, NULL);
Querying Device Information
char buf[1024];
for (unsigned i = 0; i < ciDeviceCount; i++) {
  clGetDeviceInfo(cdDevices[i], CL_DEVICE_NAME, 1023, buf, 0);
  printf("Device %d: '%s'\n", i, buf);
}
Device <N>: <board_name>: <name_of_FPGA_board>
Where:
- <N> is the device number.
- <board_name> is the board designation you use to target your FPGA device when you invoke the aoc command.
- <name_of_FPGA_board> is the advertised name of the FPGA board.
For example, if you have two identical FPGA boards on your system, the host generates an output that resembles the following:
Device 0: board_1: Stratix V FPGA Board
Device 1: board_1: Stratix V FPGA Board
Loading Kernels for Multiple FPGA Devices
The following host code demonstrates the usage of the clCreateProgramWithBinary and createMultiDeviceProgram functions to program multiple FPGA devices:
cl_program createMultiDeviceProgram(cl_context context,
                                    const cl_device_id *device_list,
                                    cl_uint num_devices,
                                    const char *aocx_name);

// Utility function for loading a file into a binary string
unsigned char* load_file(const char* filename, size_t *size_ret)
{
  FILE *fp = fopen(filename, "rb");
  fseek(fp, 0, SEEK_END);
  size_t len = ftell(fp);
  unsigned char *result = (unsigned char*)malloc(sizeof(unsigned char)*len);
  rewind(fp);
  fread(result, len, 1, fp);
  fclose(fp);
  *size_ret = len;
  return result;
}

// Create a program that is compiled for the devices in the "device_list"
cl_program createMultiDeviceProgram(cl_context context,
                                    const cl_device_id *device_list,
                                    cl_uint num_devices,
                                    const char *aocx_name)
{
  printf("creating multi device program %s for %d devices\n",
         aocx_name, num_devices);
  const unsigned char **binaries =
      (const unsigned char**)malloc(num_devices*sizeof(unsigned char*));
  size_t *lengths = (size_t*)malloc(num_devices*sizeof(size_t));
  cl_int err;

  for (cl_uint i = 0; i < num_devices; i++) {
    binaries[i] = load_file(aocx_name, &lengths[i]);
    if (!binaries[i]) {
      printf("couldn't load %s\n", aocx_name);
      exit(-1);
    }
  }
  cl_program p = clCreateProgramWithBinary(context, num_devices, device_list,
                                           lengths, binaries, NULL, &err);
  free(lengths);
  free(binaries);
  if (err != CL_SUCCESS) {
    printf("Program Create Error\n");
  }
  return p;
}

// main program
main()
{
  // Normal OpenCL setup
  program = createMultiDeviceProgram(context, device_list, num_devices,
                                     "program.aocx");
  clBuildProgram(program, num_devices, device_list, options, NULL, NULL);
}
Termination of the Runtime Environment and Error Recovery
The runtime environment is a library that is compiled as part of the host application. When the host application terminates, the runtime environment will also terminate along with any tracking activity that it performs. If you restart the host application, a new runtime environment and its associated tracking activities will reinitialize. The initialization functions reset the kernel's hardware state.
In some cases, unexpected termination of the host application leaves the configuration of certain hardware (for example, the PCIe® hard IP) incomplete. To restore the configuration of this hardware, the host must reprogram the FPGA.
If you use a Custom Platform that implements customized hardware blocks, be aware that restarting the host application and resetting these blocks might have design implications:
- When the host application calls the clGetPlatformIDs function, all kernels and channels will be reset for all available devices.
- When the host application calls the clGetPlatformIDs function, it resets FIFO buffers and channels as it resets the device.
- The host application initializes memory buffers via the clCreateBuffer and clEnqueueWriteBuffer function calls. You cannot access the contents of buffers from a previous host execution within a new host execution.
Allocating Shared Memory for OpenCL Kernels Targeting SoCs
- Mark the shared buffers between kernels as volatile to ensure that buffer modification by one kernel is visible to the other kernel.
- To access shared memory, you only need to modify the host code. Modifications to the kernel code are unnecessary.
- You cannot use the library function malloc or the operator new to allocate physically shared memory. Also, the CL_MEM_USE_HOST_PTR flag does not work with shared memory.
In DDR memory, shared memory must be physically contiguous. The FPGA cannot consume virtually contiguous memory without a scatter-gather direct memory access (SG-DMA) controller core. The malloc function and the new operator are for accessing memory that is virtually contiguous.
- CPU caching is disabled for the shared memory.
- When you use shared memory, one copy of the data is used for both the host and the kernel. When this memory is used, OpenCL memory calls are performed as zero-copy transfers for buffer reads, buffer writes, maps, and unmaps.
- To allocate and access shared memory, structure your host code in a similar manner as the following example:
cl_mem src = clCreateBuffer(…, CL_MEM_ALLOC_HOST_PTR, size, …);
int *src_ptr = (int*)clEnqueueMapBuffer(…, src, size, …);
*src_ptr = input_value;   //host writes to ptr directly
clSetKernelArg(…, src);
clEnqueueNDRangeKernel(…);
clFinish();
printf("Result = %d\n", *dst_ptr);   //result is available immediately
clEnqueueUnmapMemObject(…, src, src_ptr, …);
clReleaseMemObject(src);  // actually frees physical memory
You can include the CONFIG_CMA_SIZE_MBYTES kernel configuration option to control the maximum total amount of shared memory available for allocation. In practice, the total amount of allocated shared memory is smaller than the value of CONFIG_CMA_SIZE_MBYTES.
Important:
- If your target board has multiple DDR memory banks, the clCreateBuffer(..., CL_MEM_READ_WRITE, ...) function allocates memory to the nonshared DDR memory banks. However, if the FPGA has access to a single DDR bank that is shared memory, then clCreateBuffer(..., CL_MEM_READ_WRITE, ...) allocates to shared memory, similar to using the CL_MEM_ALLOC_HOST_PTR flag.
- The shared memory that you request with the clCreateBuffer(..., CL_MEM_ALLOC_HOST_PTR, size, ...) function is allocated in the Linux OpenCL kernel driver, and it relies on the contiguous memory allocator (CMA) feature of the Linux kernel. For detailed information on enabling and configuring the CMA, refer to the Recompiling the Linux Kernel and the OpenCL Linux Kernel Driver section of the Intel® FPGA SDK for OpenCL™ Cyclone V SoC Development Kit Reference Platform Porting Guide.
- To transfer data from shared hard processor system (HPS) DDR to FPGA DDR efficiently, include a kernel that performs the memcpy function, as shown below.
__attribute__((num_simd_work_items(8)))
__kernel void mem_stream(__global uint * src, __global uint * dst) {
    size_t gid = get_global_id(0);
    dst[gid] = src[gid];
}
Attention: Allocate the src pointer in the HPS DDR as shared memory using the CL_MEM_ALLOC_HOST_PTR flag.
- If the host allocates constant memory to the shared HPS DDR and then modifies it after kernel execution, the modifications might not take effect. As a result, subsequent kernel executions might use outdated data. To prevent kernel executions from using outdated constant memory, perform one of the following tasks:
- Do not modify constant memory after its initialization.
- Create multiple constant memory buffers if you require multiple __constant data sets.
- If available, allocate constant memory to the FPGA DDR on your accelerator board.
Debugging Your OpenCL System That is Gradually Slowing Down
The following host code registers a context callback through which the OpenCL runtime reports diagnostic messages, such as the warning shown below:
void oclContextCallback (const char *errinfo, const void *, size_t, void *) {
    printf ("Context callback: %s\n", errinfo);
}

int main() {
    …
    // Create the context.
    context = clCreateContext(NULL, num_devices, device, &oclContextCallback, NULL, &status);
    …
}
[Runtime Warning]: Too many 'event' objects in the host. This causes deterioration in runtime performance.
Compiling Your OpenCL Kernel
Before you compile an OpenCL™ kernel, verify that the QUARTUS_ROOTDIR_OVERRIDE environment variable points to the Intel® Quartus® Prime Standard Edition software.
Improper installation of the QuickUDP IP license causes kernel compilation to fail with the following error message:
Error (292014): Can't find valid feature line for core PLDA QUICKTCP (73E1_AE12) in current license.
Note that the error has no actual dependency on the TCP Hardware Stack QuickTCP IP from PLDA.
Compiling Your Kernel to Create Hardware Configuration File
Intel® recommends that you use this one-step compilation strategy under the following circumstances:
- After you optimize your kernel via the Intel® FPGA SDK for OpenCL™ design flow, and you are now ready to create the .aocx file for deployment onto the FPGA.
- You have one or more simple kernels that do not require any optimization.
To compile the kernel and generate the .aocx file in one step, invoke the aoc <your_kernel_filename1>.cl [<your_kernel_filename2>.cl ...] command.
where <your_kernel_filename2>.cl ... are the optional, space-delimited file names of additional kernels that you can compile along with <your_kernel_filename1>.cl.
The Intel® FPGA SDK for OpenCL™ Offline Compiler groups the .cl files into a temporary file. It then compiles this file to generate the .aocx file.
Compiling Your Kernel without Building Hardware (-c)
- A .aoco file for each .cl kernel source file. The offline compiler creates the .aoco file(s) in a matter of seconds to minutes.
- A <your_kernel_filename> folder or subdirectory. It contains intermediate files that the SDK uses to build the hardware configuration file necessary for FPGA programming.
Specifying the Location of Header Files (-I=<directory>)
If the header files are in the same directory as your kernel, you do not need to include the -I=<directory> option in your aoc command. The offline compiler automatically searches the current folder or directory for header files.
For Windows systems, ensure that your include path does not contain any trailing slashes. The offline compiler considers a trailing forward slash (/) or backward slash (\) as illegal.
The offline compiler generates an error message if you invoke the aoc command in the following manner:
aoc -I=<drive>\<folder>\<subfolder>\ <your_kernel_filename>.cl
or
aoc -I=<drive>/<folder>/<subfolder>/ <your_kernel_filename>.cl
The correct way to specify the include path is as follows:
aoc -I=<drive>\<folder>\<subfolder> <your_kernel_filename>.cl
or
aoc -I=<drive>/<folder>/<subfolder> <your_kernel_filename>.cl
Specifying the Name of an Intel FPGA SDK for OpenCL Offline Compiler Output File (-o=<filename>)
- If you implement the multistep compilation flow, specify the names of the output files in the following manner:
- To specify the name of the .aoco file that the offline compiler creates during an intermediate compilation step, invoke the aoc -c -o=<your_object_filename>.aoco <your_kernel_filename>.cl command.
- To specify the name of the .aocx file that the offline compiler creates during the final compilation step, invoke the aoc -o=<your_executable_filename>.aocx <your_object_filename>.aoco command.
- If you implement the one-step compilation flow, specify the name of the .aocx file by invoking the aoc -o=<your_executable_filename>.aocx <your_kernel_filename>.cl command.
Compiling a Kernel for a Specific FPGA Board (-board=<board_name>)
When you compile your kernel by including the -board=<board_name> option in the aoc command, the Intel® FPGA SDK for OpenCL™ Offline Compiler defines the preprocessor macro AOCL_BOARD_<board_name> to be 1, which allows you to compile device-optimized code in your kernel.
- To obtain the names of the available FPGA boards in your Custom Platform, invoke the aoc -list-boards command.
For example, the offline compiler generates the following output:
Board List: FPGA_board_1
where FPGA_board_1 is the <board_name>.
- To compile your OpenCL kernel for FPGA_board_1, invoke the aoc -board=FPGA_board_1 <your_kernel_filename>.cl command.
The offline compiler defines the preprocessor macro AOCL_BOARD_FPGA_board_1 to be 1 and compiles kernel code that targets FPGA_board_1.
To readily identify compiled kernel files that target a specific FPGA board, Intel® recommends that you rename the kernel binaries by including the -o option in the aoc command.
To target your kernel to FPGA_board_1 in the one-step compilation flow, invoke the following command:
aoc -board=FPGA_board_1 <your_kernel_filename>.cl -o=<your_executable_filename>_FPGA_board_1.aocx
To target your kernel to FPGA_board_1 in the multistep compilation flow, perform the following tasks:
- Invoke the following command to generate the .aoco file:
aoc -c -board=FPGA_board_1 <your_kernel_filename>.cl -o=<my_object_filename>_FPGA_board_1.aoco
- Invoke the following command to generate the .aocx file:
aoc -board=FPGA_board_1 <your_object_filename>_FPGA_board_1.aoco -o=<your_executable_filename>_FPGA_board_1.aocx
If you have an accelerator board consisting of two FPGAs, each FPGA device has an equivalent "board" name (for example, board_fpga_1 and board_fpga_2). To target a kernel_1.cl to board_fpga_1 and a kernel_2.cl to board_fpga_2, invoke the following commands:
aoc -board=board_fpga_1 kernel_1.cl
aoc -board=board_fpga_2 kernel_2.cl
Resolving Hardware Generation Fitting Errors during Kernel Compilation (-high-effort)
When kernel compilation fails because of a fitting constraint problem, the Intel® FPGA SDK for OpenCL™ Offline Compiler displays the following error message:
Error: Kernel fit error, recommend using -high-effort. Error: Cannot fit kernel(s) on device
After you recompile with the -high-effort option, the offline compiler displays the following message:
High-effort hardware generation selected, compile time may increase significantly.
The offline compiler will make three attempts to recompile your kernel and generate hardware. Modify your kernel if compilation still fails after the -high-effort attempt.
Defining Preprocessor Macros to Specify Kernel Parameters (-D<macro_name>)
- To pass a preprocessor macro definition to the offline compiler, invoke the aoc -D <macro_name> <kernel_filename>.cl command.
- To override the existing value of a defined preprocessor macro, invoke the aoc -D <macro_name>=<value> <kernel_filename>.cl command.
Consider the following code snippet for the kernel sum:
#ifndef UNROLL_FACTOR
#define UNROLL_FACTOR 1
#endif

__kernel void sum (__global const int * restrict x,
                   __global int * restrict sum) {
    int accum = 0;
    #pragma unroll UNROLL_FACTOR
    for (size_t i = 0; i < 4; i++) {
        accum += x[i + get_global_id(0) * 4];
    }
    sum[get_global_id(0)] = accum;
}
To override the UNROLL_FACTOR of 1 and set it to 4, invoke the aoc -DUNROLL_FACTOR=4 sum.cl command. Invoking this command is equivalent to replacing the line #define UNROLL_FACTOR 1 with #define UNROLL_FACTOR 4 in the sum kernel source code.
- To use preprocessor macros to control how the offline compiler optimizes your kernel without modifying your kernel source code, invoke the aoc -o=<hardware_filename>.aocx -D<macro_name>=<value> <kernel_filename>.cl command.
Where:
-o is the offline compiler option you use to specify the name of the .aocx file that the offline compiler generates.
<hardware_filename> is the name of the .aocx file that the offline compiler generates using the preprocessor macro value you specify.
Tip: To preserve the results from both compilations on your file system, compile your kernels as separate binaries by using the -o flag of the aoc command.
For example, if you want to compile the same kernel multiple times with required work-group sizes of 64 and 128, you can define a WORK_GROUP_SIZE preprocessor macro for the kernel attribute reqd_work_group_size, as shown below:
__attribute__((reqd_work_group_size(WORK_GROUP_SIZE,1,1)))
__kernel void myKernel(...) {
    for (size_t i = 0; i < 1024; i++) {
        // statements
    }
}
Compile the kernel multiple times by typing the following commands:
aoc -o=myKernel_64.aocx -DWORK_GROUP_SIZE=64 myKernel.cl
aoc -o=myKernel_128.aocx -DWORK_GROUP_SIZE=128 myKernel.cl
Generating Compilation Progress Report (-v)
- To direct the offline compiler to report on the progress of a full compilation, invoke the aoc -v <your_kernel_filename>.cl command.
The offline compiler generates a compilation progress report similar to the following example:
aoc: Environment checks are completed successfully.
You are now compiling the full flow!!
aoc: Selected target board s5_net
aoc: Running OpenCL parser....
aoc: OpenCL parser completed successfully.
aoc: Compiling....
aoc: Linking with IP library ...
aoc: First stage compilation completed successfully.
aoc: Setting up project for CvP revision flow....
aoc: Hardware generation completed successfully.
- To direct the offline compiler to report on the progress of an intermediate compilation step that does not build hardware, invoke the aoc -c -v <your_kernel_filename>.cl command.
The offline compiler generates a compilation progress report similar to the following example:
aoc: Environment checks are completed successfully.
aoc: Selected target board s5_net
aoc: Running OpenCL parser....
aoc: OpenCL parser completed successfully.
aoc: Compiling....
aoc: Linking with IP library ...
aoc: First stage compilation completed successfully.
aoc: To compile this project, run "aoc <your_kernel_filename>.aoco"
- To direct the offline compiler to report on the progress of a compilation for emulation, invoke the aoc -march=emulator -v <your_kernel_filename>.cl command.
The offline compiler generates a compilation progress report similar to the following example:
aoc: Environment checks are completed successfully.
You are now compiling the full flow!!
aoc: Selected target board s5_net
aoc: Running OpenCL parser....
aoc: OpenCL parser completed successfully.
aoc: Compiling for Emulation ....
aoc: Emulator Compilation completed successfully.
Emulator flow is successful.
Displaying the Estimated Resource Usage Summary On-Screen (-report)
You can review the estimated resource usage summary without performing a full compilation. To review the summary on-screen prior to generating the hardware configuration file, include the -report option together with the -c option in your aoc command.
+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ;   35%                     ;
; ALUTs                                  ;   22%                     ;
; Dedicated logic registers              ;   15%                     ;
; Memory blocks                          ;   29%                     ;
; DSP blocks                             ;    0%                     ;
+----------------------------------------+---------------------------;
Suppressing Warning Messages from the Intel FPGA SDK for OpenCL Offline Compiler (-W)
Converting Warning Messages from the Intel FPGA SDK for OpenCL Offline Compiler into Error Messages (-Werror)
Removing Debug Data from Compiler Reports and Source Code from the .aocx File (-g0)
Disabling Burst-Interleaving of Global Memory (-no-interleaving=<global_memory_type>)
- To direct the offline compiler to disable burst-interleaving for the default global memory, invoke the aoc <your_kernel_filename>.cl -no-interleaving=default command.
Your accelerator board might include multiple global memory types. To identify the default global memory type, refer to the board vendor's documentation for your Custom Platform.
- For a heterogeneous memory system, to direct the offline compiler to disable burst-interleaving of a specific global memory type, perform the following tasks:
- Consult the board_spec.xml file of your Custom Platform for the names of the available global memory types (for example, DDR and quad data rate (QDR)).
- To disable burst-interleaving for one of the memory types (for example, DDR), invoke the aoc <your_kernel_filename>.cl -no-interleaving=DDR command.
The offline compiler enables manual partitioning for the DDR memory bank, and configures the other memory bank in a burst-interleaved fashion.
- To disable burst-interleaving for more than one type of global memory buffer, include a -no-interleaving=<global_memory_type> option for each global memory type.
For example, to disable burst-interleaving for both DDR and QDR, invoke the aoc <your_kernel_filename>.cl -no-interleaving=DDR -no-interleaving=QDR command.
Configuring Constant Memory Cache Size (-const-cache-bytes=<N>)
The default constant cache size is 16 kB.
Relaxing the Order of Floating-Point Operations (-fp-relaxed)
Implementing a balanced tree structure leads to more efficient hardware at the expense of numerical variation in results.
Reducing Floating-Point Rounding Operations (-fpc)
Implementing this optimization control also changes the rounding mode. It rounds towards zero only at the end of a chain of floating-point arithmetic operations (that is, multiplications, additions, and subtractions).
Emulating and Debugging Your OpenCL Kernel
The Intel® FPGA SDK for OpenCL™ Emulator generates a .aocx file that executes on an x86-64 Windows or Linux host. This feature allows you to emulate the functionality of your kernel and iterate on your design without executing it on the actual FPGA each time. On Linux, you can also use the Emulator to perform functional debugging.
Modifying Channels Kernel Code for Emulation
channel ulong4 inchannel __attribute__((io("eth0_in")));

__kernel void send (int size) {
    for (unsigned i = 0; i < size; i++) {
        ulong4 data = read_channel_intel(inchannel);
        //statements
    }
}
To enable the Emulator to emulate a kernel with a channel that interfaces with an I/O channel, perform the following tasks:
- Modify the kernel code in one of the following ways:
- Add a matching write_channel_intel call, such as the one shown below:
#ifdef EMULATOR
__kernel void io_in (__global char * restrict arr, int size) {
    for (unsigned i = 0; i < size; i++) {
        ulong4 data = arr[i]; //arr[i] being an alternate data source
        write_channel_intel(inchannel, data);
    }
}
#endif
- Replace the I/O channel access with a memory access, as shown below:
__kernel void send (int size) {
    for (unsigned i = 0; i < size; i++) {
#ifndef EMULATOR
        ulong4 data = read_channel_intel(inchannel);
#else
        ulong4 data = arr[i]; //arr[i] being an alternate data source
#endif
        //statements
    }
}
- Modify the host application to create and start this conditional kernel during emulation.
Emulating a Kernel that Passes Pipes or Channels by Reference
For example, you may emulate a kernel that has the following structure:
void my_function (pipe uint * pipe_ref, __global uint * dst, int i) {
    read_pipe (*pipe_ref, &dst[i]);
}

__kernel void consumer (__global uint * restrict dst,
                        read_only pipe uint __attribute__((blocking)) c0) {
    for (int i = 0; i < 5; i++) {
        my_function(&c0, dst, i);
    }
}
Emulating Channel Depth
When you compile your OpenCL* kernel for emulation, the default channel depth is different from the default channel depth generated when your kernel is compiled for hardware. You can change this behavior when you compile your kernel for emulation with the -emulator-channel-depth-model option.
- default
- Channels with an explicit depth attribute have their specified depth. Channels without a specified depth are given a default channel depth that is chosen to provide the fastest execution time for your kernel emulation.
- strict
- All channel depths in the emulation are given a depth that matches the depth given for the FPGA compilation.
- ignore-depth
- All channels are given a channel depth chosen to provide the fastest execution time for your kernel emulation. Any explicitly set channel depth attribute is ignored.
Compiling a Kernel for Emulation (-march=emulator)
- Before you perform kernel emulation, perform the following tasks:
- Install a Custom Platform from your board vendor for your FPGA accelerator boards.
- Verify that the environment variable QUARTUS_ROOTDIR_OVERRIDE points to the Intel® Quartus® Prime Standard Edition software installation directory.
- To emulate your kernels on Windows systems, you need the Microsoft linker and additional compilation time libraries. Verify that the PATH environment variable setting includes all the paths described in the Setting the Intel® FPGA SDK for OpenCL™ Standard Edition User Environment Variables (Windows) section of the Intel® FPGA SDK for OpenCL™ Standard Edition Getting Started Guide.
The PATH environment variable setting must include the path to the LINK.EXE file in Microsoft Visual Studio.
- Ensure that your LIB environment variable setting includes the path to the Microsoft compilation time libraries.
The compilation time libraries are available with Microsoft Visual Studio.
- Verify that the LD_LIBRARY_PATH environment variable setting includes all the paths described in the Setting the Intel® FPGA SDK for OpenCL™ Standard Edition User Environment Variables (Linux) section in the Intel® FPGA SDK for OpenCL™ Standard Edition Getting Started Guide.
- To create kernel programs that are executable on x86-64 host systems, invoke the aoc -march=emulator <your_kernel_filename>.cl command.
- To compile a kernel for emulation that targets a specific board, invoke the aoc -march=emulator -board=<board_name> <your_kernel_filename>.cl command.
- For Linux systems, the Intel® FPGA SDK for OpenCL™ Offline Compiler offers symbolic debug support for the GDB debugger.
The offline compiler's debug support allows you to pinpoint the origins of functional errors in your kernel source code.
Emulating Your OpenCL Kernel
To emulate your kernel, perform the following steps:
- Run the utility command aocl linkflags to find out which libraries are necessary for building a host application. The software lists the libraries for both emulation and regular kernel compilation flows.
- Build a host application and link it to the libraries from Step 1.
Attention: To emulate multiple devices alongside other OpenCL SDKs, link your host application to the Khronos ICD Loader Library before linking it to the host runtime libraries. Link the host application to the ICD Loader Library by modifying the Makefile for the host application. Refer to Linking Your Host Application to the Khronos ICD Loader Library for more information.
- If necessary, move the <your_kernel_filename>.aocx file to a location where the host can find it easily, preferably the current working directory.
- To run the host application for emulation:
- For Windows, first define the number of emulated devices by invoking the set CL_CONTEXT_EMULATOR_DEVICE_INTELFPGA=<number_of_devices> command, and then run the host application.
After you run the host application, invoke set CL_CONTEXT_EMULATOR_DEVICE_INTELFPGA= to unset the variable.
- For Linux, invoke the env CL_CONTEXT_EMULATOR_DEVICE_INTELFPGA=<number_of_devices> <host_application_filename> command.
This command specifies the number of identical emulation devices that the Emulator needs to provide.
Remember: When the environment variable CL_CONTEXT_EMULATOR_DEVICE_INTELFPGA is set, only the emulated devices are available; access to all physical boards is disabled.
- If you change your host or kernel program and you want to test it, only recompile the modified host or kernel program and then rerun emulation.
Debugging Your OpenCL Kernel on Linux
- During program execution, the debugger cannot step from the host code to the kernel code. You must set a breakpoint before the actual kernel invocation by adding these lines:
- break <your_kernel>
This line sets a breakpoint before the kernel.
- continue
If you have not begun debugging your host, then type start instead.
- The kernel is loaded as a shared library immediately before the host loads the kernels. The debugger does not recognize the kernel names until the host actually loads the kernel functions. As a result, the debugger generates the following warning for the breakpoint you set before the execution of the first kernel:
Function "<your_kernel>" not defined.
Make breakpoint pending on future shared library load? (y or [n])
Answer y. After initial program execution, the debugger will recognize the function and variable names, and line number references for the duration of the session.
To compile your OpenCL kernel for debugging, perform the following steps:
- To generate a .aocx file for debugging that targets a specific accelerator board, invoke the aoc -march=emulator <your_kernel_filename>.cl -board=<board_name> command.
Attention: Specify the name of your FPGA board when you run your host application. To verify the name of the target board for which you compile your kernel, invoke the aoc -march=emulator -v <your_kernel_filename>.cl command. The Intel® FPGA SDK for OpenCL™ Offline Compiler will display the name of the target FPGA board.
- Run the utility command aocl linkflags to find out the additional libraries necessary to build a host application that supports kernel debugging.
- Build a host application and link it to the libraries from Step 2.
- Ensure that the <your_kernel_filename>.aocx file is in a location where the host can find it, preferably the current working directory.
- To run the application, invoke the env CL_CONTEXT_EMULATOR_DEVICE_INTELFPGA=<number_of_devices> gdb --args <your_host_program_name> command, where <number_of_devices> is the number of identical emulation devices that the Emulator needs to provide.
- If you change your host or kernel program and you want to test it, only recompile the modified host or kernel program and then rerun the debugger.
Limitations of the Intel FPGA SDK for OpenCL Standard Edition Emulator
- Execution model
The Emulator supports the same compilation modes as the FPGA variant. As a result, you must call the clCreateProgramWithBinary function to create cl_program objects for emulation.
- Concurrent execution
Modeling of concurrent kernel executions has limitations. During execution, the Emulator does not actually run interacting work-items in parallel. Therefore, some concurrent execution behaviors, such as different kernels accessing global memory without a barrier for synchronization, might generate inconsistent emulation results between executions.
- Kernel performance
The .aocx file that you generate for emulation does not include any optimizations. Therefore, it might execute at a significantly slower speed than what an optimized kernel might achieve. In addition, because the Emulator does not implement actual parallel execution, the execution time multiplies with the number of work-items that the kernel executes.
- The Emulator executes the host runtime and the kernels in the same address space. Certain pointer or array usages in your host application might cause the kernel program to fail, and vice versa. Example usages include indexing externally allocated memory and writing to random pointers. You may use memory leak detection tools, such as Valgrind, to analyze your program. However, the host might encounter a fatal error caused by out-of-bounds write operations in your kernel, and vice versa.
- Emulation of channel behavior has limitations, especially for conditional channel operations where the kernel does not call the channel operation in every loop iteration. In these cases, the Emulator might execute channel operations in a different order than on the hardware.
Discrepancies in Hardware and Emulator Results
The most common reasons for differences in emulator and hardware results are as follows:
- Your OpenCL kernel code is using the #pragma ivdep directive. The Emulator does not model your OpenCL system correctly when a true dependence is broken by an ivdep directive. During a full hardware compilation, you observe this as an incorrect result.
- Your OpenCL kernel code is relying on uninitialized data. Examples of uninitialized data include uninitialized variables and uninitialized or partially initialized global buffers, local arrays, and private arrays.
- Your OpenCL kernel code behavior depends on the precise results of floating-point operations. The Emulator uses the floating-point computation hardware of the CPU, whereas the hardware run uses floating-point cores implemented as FPGA cores. The use of the -fp-relaxed aoc option in your OpenCL kernel compilation might change the order of operations, leading to further divergence in the floating-point results.
Note: The OpenCL standard allows one or more least significant bits of floating-point computations to differ between platforms, while still being considered correct on both such platforms.
- Your OpenCL kernel code behavior depends on the order of channel accesses in different kernels. The emulation of channel behavior has limitations, especially for conditional channel operations where the kernel does not call the channel operation in every loop iteration. In such cases, the Emulator might execute channel operations in an order different from that on the hardware.
- Your OpenCL kernel or host code is accessing global memory buffers out-of-bounds.
Attention:
- Uninitialized memory read and write behaviors are platform-dependent. Verify the sizes of your global memory buffers against all addresses used within kernels, the allocation size in the clCreateBuffer function call, and the transfer sizes in the clEnqueueReadBuffer and clEnqueueWriteBuffer function calls.
- You may use software memory leak detection tools, such as Valgrind, on the emulated version of your OpenCL system to analyze memory-related problems. The absence of warnings from such tools does not mean the absence of problems; it only means that the tool could not detect any problem. In such a scenario, Intel recommends manual verification of your OpenCL kernel or host code.
- Your OpenCL kernel code is accessing local or private variables out-of-bounds, for example, accessing a local or private array out-of-bounds, or accessing a private variable after it has gone out of scope.
Attention: In software terms, these issues are referred to as stack corruption issues because accessing variables out-of-bounds usually affects unrelated variables located close to the variable being accessed on a software stack. Emulated OpenCL kernels are implemented as regular CPU functions, and have an actual stack that can be corrupted. When targeting hardware, no stack exists and hence, the stack corruption issues are guaranteed to manifest differently. You may use memory leak analyzer tools, such as Valgrind, when a stack corruption is suspected. However, stack-related issues are usually difficult to identify. Intel recommends manual verification of your OpenCL kernel code to debug a stack-related issue.
- Your OpenCL kernel code is using shifts that are larger than the type being shifted, for example, shifting a 64-bit integer by 65 bits. According to the OpenCL specification version 1.0, the behavior of such shifts is undefined.
Warning: If the shift amount is known during compilation, the offline compiler issues a warning message. You must heed the warning message.
- When you compile your OpenCL kernel for emulation, the default channel depth is different from the default channel depth generated when your kernel is compiled for hardware. This difference in channel depths might lead to scenarios where execution on the hardware hangs while kernel emulation works without any issue. Refer to Emulating Channel Depth for information on how to fix the channel depth difference.
- In terms of ordering the printed lines, the output of the printf function might be ordered differently on the Emulator and hardware. This is because, in the hardware, printf data is stored in a global memory buffer and flushed from the buffer only when the kernel execution is complete, or when the buffer is full. In the Emulator, the printf function uses the x86 stdout.
Reviewing Your Kernel's report.html File
The analyze-area Intel® FPGA SDK for OpenCL™ utility option has been deprecated. To view your kernel's estimated area usage, refer to the report.html file.
For reference information on the deprecated area report, refer to the Review Your Kernel's Area Report to Identify Inefficiencies in Resource Usage section in version 16.0 of the Altera SDK for OpenCL Best Practices Guide.
Profiling Your OpenCL Kernel
Instrumenting the Kernel Pipeline with Performance Counters (-profile)
- To instrument the Verilog code in the <your_kernel_filename>.aocx file with performance counters, invoke the aoc -profile=(all|autorun|enqueued) <your_kernel_filename>.cl command, where:
- The all argument instruments all kernels in the <your_kernel_filename>.cl file with performance counters. This is the default option if no argument is provided.
- The autorun argument instruments only the autorun kernels with performance counters.
- The enqueued argument instruments only the non-autorun kernels with performance counters.
Attention:
- When profiling multiple, different kernels, do not use the same kernel names across different .aocx files. If the kernel names are the same, the profile data is wrong for these kernels.
- Regardless of the input to the clGetProfileDataDeviceIntelFPGA host library call, the Intel® FPGA Dynamic Profiler for OpenCL™ only profiles kernel types that you indicate during compilation.
CAUTION: Profiling autorun kernels results in some hardware overhead for the counters. For large designs, the overhead can cause fmax and design frequency degradation. It can also lead to designs that cannot fit on the chip if the Intel® FPGA Dynamic Profiler for OpenCL™ profiles every kernel.
- Run your host application from a local disk to execute the <your_kernel_filename>.aocx file on your FPGA. During kernel execution, the performance counters throughout the kernel pipeline collect profile information. The host saves the information in a profile.mon monitor description file in your current working directory.
CAUTION:Because of slow network disk accesses, running the host application from a networked directory might introduce delays between kernel executions. These delays might increase the overall execution time of the host application. In addition, they might introduce delays between kernel launches while the runtime stores profile output data to disk.
Launching the Intel FPGA Dynamic Profiler for OpenCL GUI (report)
The Intel® FPGA Dynamic Profiler for OpenCL™ stores performance data in a profile.mon file in your current working directory.
Profiling Autorun Kernels
Kernels that are marked with the autorun attribute are referred to as autorun kernels. An autorun kernel starts executing automatically before the host explicitly launches other kernels, and restarts automatically on completion. For more information about the autorun attribute, refer to the Omit Communication Hardware between the Host and the Kernel topic.
Because autorun kernels never complete, you must call the clGetProfileDataDeviceIntelFPGA host library function to capture the autorun profiler data. You can instruct the host application to make this call at any point during execution.
Developing OpenCL Applications Using Intel Code Builder for OpenCL
The Intel® Code Builder for OpenCL™ provides a set of Microsoft Visual Studio and Eclipse plug-ins that enable capabilities for creating, building, debugging, and analyzing Windows and Linux applications accelerated with OpenCL.
Configuring the Intel Code Builder for OpenCL Offline Compiler Plug-in for Microsoft Visual Studio
To enable the Intel® Code Builder for OpenCL™ offline compiler plug-in for Microsoft Visual Studio, perform the following steps:
- In the Visual Studio software, select Project > Properties.
- In the Project > Properties > Code Builder page, change the Device to your desired FPGA device.
- In the C/C++ > General property page, under Additional Include Directories, enter the full path to the directory where the OpenCL code header files are located ($(INTELFPGAOCLSDKROOT)\include).
- In the Linker > General property page, under Additional Library Directories, enter the full path to the directory where the OpenCL code run-time import library file is located. For example, for a 64-bit application, add $(INTELFPGAOCLSDKROOT)\lib\x64.
- In the Linker > Input property page, under Additional Dependencies, enter the name of the OpenCL ICD import library file as OpenCL.lib.
Configuring the Intel Code Builder for OpenCL Offline Compiler Plug-in for Eclipse
To enable the Intel® Code Builder for OpenCL™ offline compiler plug-in for Eclipse IDE, perform the following steps:
- Copy the CodeBuilder_<version>.jar plug-in file from $INTELFPGAOCLSDKROOT/eclipse-plug-in to <ECLIPSE_ROOT_FOLDER>/dropins.
Attention: In Linux, you must add $INTELFPGAOCLSDKROOT/bin to the LD_LIBRARY_PATH environment variable.
- Run the Eclipse IDE.
- Select Windows > Preferences.
- Switch to the Intel® OpenCL dialog.
- Set the OpenCL binary directory to $INTELFPGAOCLSDKROOT/bin.
Once the offline compiler is configured, you can use the Code-Builder menu to perform the following basic operations:
- Create a new session
- Open an existing session
- Save a session
- Build a session
- Compile a session
- Configure a session
For more information about the Intel® Code Builder for OpenCL™ , refer to Developer Guide for Intel SDK for OpenCL Applications. For information about how to configure the Intel® Code Builder for OpenCL™ for Microsoft Visual Studio, refer to Intel Code Builder for OpenCL API for Microsoft Visual Studio. For information about how to configure the Intel® Code Builder for OpenCL™ for Eclipse, refer to Intel Code Builder for OpenCL API for Eclipse.
Creating a Session in the Intel Code Builder for OpenCL
Perform the following steps to create a session in the Intel® Code Builder for OpenCL™ :
- Select Code-Builder > OpenCL Kernel Development > New Session.
- Specify the session name, the folder in which to store the session file, and the session content (either an empty session or predefined OpenCL code).
- Click Done.
Once the session is created, the new session appears in the Code Builder Sessions Explorer view.

Configuring a Session
You can configure a session by right-clicking the session in the Code Builder Session Explorer and selecting Session Options. Alternatively, open the Session Settings dialog box by selecting Code-Builder > OpenCL Kernel Development > Session Options.
The Session Settings dialog box allows you to configure:
- Device options such as target machine, OpenCL platform, and OpenCL device.
- Build options such as offline compiler flags and build architecture.
- Build artifacts such as .aocx and .aoco files, and static reports.
- General options such as job architecture and network settings.
In the Device Options tab, ensure that you select Intel® FPGA SDK for OpenCL™ in the OpenCL platform drop-down list.
Under the Build Options tab, in the OpenCL Build Options section, enter the Intel® FPGA SDK for OpenCL™ Offline Compiler flags manually.
For more information about configuring a session and variable management, refer to the Developer Guide for Intel SDK for OpenCL Applications.
Intel FPGA SDK for OpenCL Standard Edition Advanced Features
OpenCL Library
You may use a previously created library or create your own library. To use an OpenCL library, you do not require in-depth knowledge of hardware design or of the implementation of library components. To create an OpenCL library, you need to create the following files and components:
File or Component | Description |
---|---|
RTL Components | |
RTL source files | Verilog, SystemVerilog, or VHDL files that define the RTL component. Additional files such as Intel® Quartus® Prime IP File (.qip), Synopsys Design Constraints File (.sdc), and Tcl Script File (.tcl) are not allowed. |
eXtensible Markup Language File (.xml) | Describes the properties of the RTL component. The Intel® FPGA SDK for OpenCL™ Offline Compiler uses these properties to integrate the RTL component into the OpenCL pipeline. |
Header file (.h) | A C-style header file that declares the signatures of the function(s) implemented by the RTL component. |
OpenCL emulation model file (.cl) | Provides a C model for the RTL component that is used only for emulation. Full hardware compilations use the RTL source files. |
OpenCL Functions | |
OpenCL source files (.cl) | Contains definitions of the OpenCL functions. These functions are used during emulation and full hardware compilations. |
Header file (.h) | A C-style header file that declares the signatures of function(s) that are defined in the OpenCL source files. |
Understanding RTL Modules and the OpenCL Pipeline
Use RTL modules under the following circumstances:
- You want to use optimized and verified RTL modules in OpenCL kernels without rewriting the modules as OpenCL functions.
- You want to implement OpenCL kernel functionality that you cannot express effectively in OpenCL.
Overview: Intel FPGA SDK for OpenCL Pipeline Approach
Assume each level of operation is one stage in the pipeline. At each stage, the Intel® FPGA SDK for OpenCL™ Offline Compiler executes, in parallel, all the operations of the thread at that stage. For example, thread 2 executes Load A and Load B, and copies the current global ID (gid) to the next pipeline stage. Similar to pipelined instruction execution in reduced instruction set computing (RISC) processors, the SDK's pipeline stages execute in parallel. A thread advances to the next pipeline stage only after all stages have completed execution.
Some operations can stall the Intel FPGA SDK for OpenCL pipeline. Examples include variable-latency operations such as memory loads and stores. To support stalls, ready and valid signals must propagate throughout the pipeline so that the offline compiler can schedule the pipeline stages. However, ready signals are not necessary when all operations have fixed latency. In that case, the offline compiler statically schedules the operations, which significantly reduces the logic necessary to implement the pipeline.
Integration of an RTL Module into the Intel FPGA SDK for OpenCL Pipeline
The depicted RTL module has a balanced latency: the number of threads in flight inside the RTL module matches the number of pipeline stages. A balanced latency allows the RTL module's threads to execute without stalling the SDK's pipeline.
Setting the latency of the RTL module in the RTL specification file allows the offline compiler to balance the pipeline latency. Because the RTL module supports Avalon™ Streaming ( Avalon™ -ST) interfaces, its latency can be variable (that is, not fixed). However, the variability in the latency should be small to maximize performance. In addition, specify the latency in the <RTL module description file name>.xml specification file so that it is a good approximation of the module's actual latency in steady state.
Stall-Free RTL
- To instruct the offline compiler to remove stall logic around the RTL module, if appropriate, set the IS_STALL_FREE attribute under the FUNCTION element to "yes". This modification informs the offline compiler that the RTL module produces valid data every EXPECTED_LATENCY cycle(s). Note: EXPECTED_LATENCY is an attribute that you specify in the .xml file under the FUNCTION element.
- Specify a value for EXPECTED_LATENCY such that the latency equals the number of pipeline stages in the module. CAUTION: An inaccurate EXPECTED_LATENCY value causes the RTL module to be out of sync with the rest of the pipeline.
A stall-free RTL module might receive an invalid input signal (that is, ivalid is low). In this case, the module ignores the input and produces invalid data on the output. For a stall-free RTL module without an internal state, it might be easier to propagate the invalid input through the module. However, for an RTL module with an internal state, you must handle an ivalid=0 input carefully.
RTL Module Interfaces
For an RTL module to interact properly with other compiler-generated operations, the module must support a simplified Avalon Streaming interface at both its input and its output.
The following diagram shows the complete interface of the myMod RTL module shown in Figure 17.
In this diagram, myMod interacts with the upstream module through data signals, A and B, and control signals, ivalid (input) and oready (output). The ivalid control signal equals 1 (ivalid = 1) if and only if data signal A and data signal B contain valid data. When the control signal oready equals 1 (oready = 1), it indicates that the myMod RTL module can process the data signals A and B if they are valid (that is, ivalid = 1). When ivalid = 1 and oready = 0, the upstream module is expected to hold the values of ivalid, A, and B in the next clock cycle.
myMod interacts with the downstream module through data signal C and control signals ovalid (output) and iready (input). The ovalid control signal equals 1 (ovalid = 1) if and only if data signal C contains valid data. The iready control signal equal to 1 (iready = 1) indicates that the downstream module can process data signal C if it is valid. When ovalid = 1 and iready = 0, the myMod RTL module is expected to hold the values of the ovalid and C signals in the next clock cycle.
The myMod module asserts oready for a single clock cycle to indicate that it is ready for an active cycle. Cycles during which the myMod module is ready for data are called ready cycles. During ready cycles, the upstream module can assert ivalid to send data to myMod.
For a detailed explanation of data transfer under backpressure, refer to the Data Transfer with Backpressure section in the Avalon Interface Specifications. Ignore the information pertaining to the readyLatency option.
Avalon Streaming (Avalon-ST) Interface
The offline compiler expects the RTL module to support an Avalon-ST interface with readyLatency = 0 at both its input and its output:
- ivalid and oready, as the input Avalon-ST interface
- ovalid and iready, as the output Avalon-ST interface

For an RTL module with a fixed latency, the output signals (ovalid and oready) can have constant high values, and the input ready signal (iready) can be ignored.
A stall-free RTL module might receive an invalid input signal (ivalid is low). In this case, the module ignores the input and produces invalid data on the output. For a stall-free RTL module without an internal state, it might be easier to propagate the invalid input through the module. However, for an RTL module with an internal state, you must handle an ivalid = 0 input carefully.
RTL Reset and Clock Signals
Because of the common clock and reset drivers, an RTL module runs in the same clock domain as the OpenCL kernel. The module is reset only when the OpenCL kernel is first loaded onto the FPGA, either via the Intel® FPGA SDK for OpenCL™ program utility or the clCreateProgramWithBinary host function. In particular, if the host restarts a kernel via successive clEnqueueNDRangeKernel or clEnqueueTask invocations, the associated RTL modules do not reset in between these restarts.
The following steps outline the process of setting the kernel clock frequency:
- The Intel® Quartus® Prime software's Fitter applies an aggressive constraint on the kernel clock.
- The Intel® Quartus® Prime software's Timing Analyzer performs static timing analysis to determine the frequency that the Fitter actually achieves.
- The phase-locked loop (PLL) that drives the kernel clock sets the frequency determined in Step 2 to be the kernel clock frequency.
XML Syntax of an RTL Module
The following XML specification file is for an RTL module named my_fp_sqrt_double (line 2) that implements an OpenCL™ helper function named my_sqrtfd (also line 2).
 1: <RTL_SPEC>
 2:   <FUNCTION name="my_sqrtfd" module="my_fp_sqrt_double">
 3:     <ATTRIBUTES>
 4:       <IS_STALL_FREE value="yes"/>
 5:       <IS_FIXED_LATENCY value="yes"/>
 6:       <EXPECTED_LATENCY value="31"/>
 7:       <CAPACITY value="1"/>
 8:       <HAS_SIDE_EFFECTS value="no"/>
 9:       <ALLOW_MERGING value="yes"/>
10:     </ATTRIBUTES>
11:     <INTERFACE>
12:       <AVALON port="clock" type="clock"/>
13:       <AVALON port="resetn" type="resetn"/>
14:       <AVALON port="ivalid" type="ivalid"/>
15:       <AVALON port="iready" type="iready"/>
16:       <AVALON port="ovalid" type="ovalid"/>
17:       <AVALON port="oready" type="oready"/>
18:       <INPUT port="datain" width="64"/>
19:       <OUTPUT port="dataout" width="64"/>
20:     </INTERFACE>
21:     <C_MODEL>
22:       <FILE name="c_model.cl" />
23:     </C_MODEL>
24:     <REQUIREMENTS>
25:       <FILE name="my_fp_sqrt_double_s5.v" />
26:       <FILE name="fp_sqrt_double_s5.vhd" />
27:     </REQUIREMENTS>
28:     <RESOURCES>
29:       <ALUTS value="2057"/>
30:       <FFS value="3098"/>
31:       <RAMS value="15"/>
32:       <MLABS value="43"/>
33:       <DSPS value="1.5"/>
34:     </RESOURCES>
35:   </FUNCTION>
36: </RTL_SPEC>
XML Element | Description |
---|---|
RTL_SPEC | Top-level element in the XML specification file. There can only be one such top-level element in the file. In this example, the name RTL_SPEC is historic and carries no file-specific meaning. |
FUNCTION |
Element that defines the OpenCL function that the RTL module implements. The name attribute within the FUNCTION element specifies the function's name. You may have multiple FUNCTION elements, each declaring a different function that you can call from the OpenCL kernel. The same RTL module can implement multiple functions by specifying different parameters. |
ATTRIBUTES | Element containing other XML elements that describe various characteristics (for example, latency) of the RTL module. The example RTL module takes one PARAMETER setting named WIDTH, which has a value of 32. Refer to Table 7 for details on other ATTRIBUTES-specific elements. Note: If you create multiple OpenCL helper functions for different modules, or use the same RTL module with different PARAMETER settings, you must create a separate FUNCTION element for each function. |
INTERFACE | Element containing other XML elements that describe the RTL module's interface. The example XML specification file shows the Avalon®-ST interface signals that every RTL module must provide (that is, clock, resetn, ivalid, iready, ovalid, and oready). The signal names must match the ones specified in the .xml file. An error will occur during library creation if a signal name is inconsistent. |
C_MODEL | Element specifying one or more files that implement OpenCL C model for the function. The model is used only during emulation. However, the C_MODEL element and the associated file(s) must be present when you create the library file. |
REQUIREMENTS | Element specifying one or more RTL resource files (that is, .v, .sv, .vhd, .hex, and .mif). The specified paths to these files are relative to the location of the XML specification file. Each RTL resource file becomes part of the associated Platform Designer component that corresponds to the entire OpenCL system. Note: The Intel® FPGA SDK for OpenCL™ library feature does not support .qip files. An Intel® FPGA SDK for OpenCL™ Offline Compiler error will occur if you compile an OpenCL kernel while using a library that includes an unsupported resource file type. |
RESOURCES | Optional element specifying the FPGA resources that the RTL module uses. If you do not specify this element, the FPGA resources that the RTL module uses will default to zero. |
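The rule that each parameterization needs its own FUNCTION element can be sketched as follows. This is an illustrative fragment, not part of the example design; the PARAMETER element syntax shown here is an assumption based on the WIDTH setting mentioned above, and the ATTRIBUTES and INTERFACE bodies are elided:

```xml
<RTL_SPEC>
  <!-- Hypothetical: one RTL module instantiated with two widths,
       each exposed to OpenCL kernels as a separate helper function. -->
  <FUNCTION name="my_fn_32" module="my_rtl_fn">
    <ATTRIBUTES> ... </ATTRIBUTES>
    <PARAMETER name="WIDTH" value="32"/>
    <INTERFACE> ... </INTERFACE>
  </FUNCTION>
  <FUNCTION name="my_fn_64" module="my_rtl_fn">
    <ATTRIBUTES> ... </ATTRIBUTES>
    <PARAMETER name="WIDTH" value="64"/>
    <INTERFACE> ... </INTERFACE>
  </FUNCTION>
</RTL_SPEC>
```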
XML Elements for ATTRIBUTES
XML Element | Description |
---|---|
IS_STALL_FREE |
Instructs the Intel® FPGA SDK for OpenCL™ Offline Compiler to remove all stall logic around the RTL module. Set IS_STALL_FREE to "yes" to indicate that the module neither generates stalls internally nor can it properly handle incoming stalls. The module simply ignores its stall input. If you set IS_STALL_FREE to "no", the module must properly handle all stall and valid signals. Note: If you set IS_STALL_FREE to "yes", you must also set IS_FIXED_LATENCY to "yes". Also, if the RTL module has an internal state, it must properly handle ivalid=0 inputs. An incorrect IS_STALL_FREE setting leads to incorrect results in hardware. |
IS_FIXED_LATENCY |
Indicates whether the RTL module has a fixed latency. Set IS_FIXED_LATENCY to "yes" if the RTL module always takes a known number of clock cycles to compute its output. The value you assign to the EXPECTED_LATENCY element specifies the number of clock cycles. The safe value for IS_FIXED_LATENCY is "no". Note: For a given module, you may set IS_FIXED_LATENCY to "yes" and IS_STALL_FREE to "no". Such a module produces its output in a fixed number of clock cycles and handles stall signals properly. |
EXPECTED_LATENCY |
Specifies the expected latency of the RTL module. If you set IS_FIXED_LATENCY to "yes", the EXPECTED_LATENCY value indicates the number of pipeline stages inside the module. In this case, you must set this value to be the exact latency of the module. Otherwise, the offline compiler will generate incorrect hardware. For a module with variable latency, the offline compiler balances the pipeline around this module to the EXPECTED_LATENCY value that you specify. The specified value and the actual latency might differ, which might affect the number of stalls inside the pipeline. However, the resulting hardware will be correct. |
CAPACITY |
Specifies the number of inputs that this module can process simultaneously. You must specify a value for CAPACITY if you also set IS_STALL_FREE="no" and IS_FIXED_LATENCY="no". Otherwise, you do not need to specify a value for CAPACITY. If CAPACITY is strictly less than EXPECTED_LATENCY, the offline compiler will automatically insert capacity-balancing FIFO buffers after this module when necessary. The safe value for CAPACITY is 1. |
HAS_SIDE_EFFECTS | Indicates whether the RTL module has side effects. Modules that have internal states or communicate with external memories are examples of modules with side effects. Set HAS_SIDE_EFFECTS to "yes" to indicate that the module has side effects. Setting HAS_SIDE_EFFECTS to "yes" ensures that optimization efforts do not remove calls to modules with side effects. Stall-free modules with side effects (that is, IS_STALL_FREE="yes" and HAS_SIDE_EFFECTS="yes") must properly handle ivalid=0 input cases because the module might receive invalid data occasionally. The safe value for HAS_SIDE_EFFECTS is "yes". |
ALLOW_MERGING | Instructs the offline compiler to merge multiple instances of the RTL module. Set ALLOW_MERGING to "yes" to allow merging of multiple instances of the module. Intel® recommends setting ALLOW_MERGING to "yes". The safe value for ALLOW_MERGING is "no". Note: Marking the module with HAS_SIDE_EFFECTS="yes" does not prevent merging. |
XML Elements for INTERFACE
XML Element | Description |
---|---|
INPUT |
Specifies the input parameter of the RTL module. INPUT attributes:
The input parameters are concatenated to form the input stream. Aggregate data structures such as structs and arrays are not supported as input parameters. |
OUTPUT |
Specifies the output parameter of the RTL module. OUTPUT attributes:
The function's return value is sent out via the output parameter on the output stream. Aggregate data structures such as structs and arrays are not supported as output parameters. |
If your RTL module communicates with external memory, you need to include additional XML elements:
<MEM_INPUT port="m_input_A" access="readonly"/>
<MEM_INPUT port="m_input_sum" access="readwrite"/>
<AVALON_MEM port="avm_port0" width="512" burstwidth="5" optype="read" buffer_location=""/>
XML Element | Description |
---|---|
MEM_INPUT |
Describes a pointer input to the RTL module. MEM_INPUT attributes:
Because all pointers to external memory must be 64 bits, there is no width attribute associated with MEM_INPUT. |
AVALON_MEM |
Declares the Avalon-MM interface for your RTL module. AVALON_MEM attributes:
|
For the AVALON_MEM element defined in the code example above, the corresponding RTL module ports are as follows:
output          avm_port0_enable,
input  [511:0]  avm_port0_readdata,
input           avm_port0_readdatavalid,
input           avm_port0_waitrequest,
output [31:0]   avm_port0_address,
output          avm_port0_read,
output          avm_port0_write,
input           avm_port0_writeack,
output [511:0]  avm_port0_writedata,
output [63:0]   avm_port0_byteenable,
output [4:0]    avm_port0_burstcount
There is no assumed correspondence between pointers that you specify with MEM_INPUT and the Avalon-MM interfaces that you specify with AVALON_MEM. An RTL module can use a single pointer to address zero, one, or multiple Avalon-MM interfaces.
XML Elements for RESOURCES
XML Element | Description |
---|---|
ALUTS | Specifies the number of combinational adaptive look-up tables (ALUTs) that the module uses. |
FFS | Specifies the number of dedicated logic registers that the module uses. |
RAMS | Specifies the number of block RAMs that the module uses. |
DSPS | Specifies the number of digital signal processing (DSP) blocks that the module uses. |
MLABS | Specifies the number of memory logic arrays (MLABs) that the module uses. This value is equal to the number of adaptive logic modules (ALMs) that is used for memory divided by 10 because each MLAB consumes 10 ALMs. |
Interaction between RTL Module and External Memory
Allow your RTL module to interact with external memory only if the interaction is necessary and unavoidable.
The following examples demonstrate how to structure code in an RTL module for easy integration into an OpenCL library:
Complex RTL Module | Simplified RTL Module |
---|---|
// my_rtl_fn does: // out_ptr[idx] = fn(in_ptr[idx]) my_rtl_fn (in_ptr, out_ptr, idx); |
int in_value = in_ptr[idx]; // my_rtl_fn now does: out = fn(in) int out_value = my_rtl_fn (in_value); out_ptr[idx] = out_value; |
The complex RTL module on the left reads a value from external memory, performs a scalar function on the value, and then writes the value back to global memory. Such an RTL module is difficult to describe when you integrate it into an OpenCL library. In addition, this RTL module is harder to verify and causes very conservative pointer analysis in the Intel® FPGA SDK for OpenCL™ Offline Compiler.
The simplified RTL module on the right provides the same overall functionality as the complex RTL module. However, the simplified RTL module only performs a scalar-to-scalar calculation without connecting to global memory. Integrating this simplified RTL module into the OpenCL library makes it much easier for the offline compiler to analyze the resulting OpenCL kernel.
There are times when an RTL module requires an Avalon®-MM port to communicate with external memory. This Avalon-MM port connects to the same arbitration network to which all other global load and store units in the OpenCL kernels connect.
If an RTL module receives a memory pointer as an argument, the offline compiler enforces the following memory model:
- If an RTL module writes to a pointer, nothing else in the OpenCL kernel can read from or write to this pointer.
- If an RTL module reads from a pointer, the rest of the OpenCL kernel and other RTL modules may also read from this pointer.
- You may set the access field of the MEM_INPUT attribute to specify how the RTL module uses the memory pointer. Ensure that you set the value for access correctly because there is no way to verify the value.
Order of Threads Entering an RTL Module
OpenCL C Model of an RTL Module
Example OpenCL C model file for a square root function:
double my_sqrtfd (double a) { return sqrt(a); }
Intel® recommends that you emulate your OpenCL system. If you decide not to emulate your OpenCL system, create an empty function with a name that matches the function name you specified in the XML specification file.
Potential Incompatibility between RTL Modules and Partial Reconfiguration
Consider a situation where you create and verify your library on a device that does not support Partial Reconfiguration (PR). If a library user then uses the library's RTL module inside a PR region, the module might not function correctly after PR. To make an RTL module safe for use in a PR region, ensure that the following conditions hold:
- The RTL modules do not use memory logic array blocks (MLABs) with initialized content.
- The RTL modules do not make any assumptions regarding the power-up values of any logic.
For complete PR coding guidelines, refer to Creating a Partial Reconfiguration Design in volume 1 of Intel® Quartus® Prime Pro Edition Handbook.
Packaging an OpenCL Helper Function File for an OpenCL Library
In general, you do not need to create a library to share helper functions written in OpenCL. You can distribute a helper function in source form (for example, <shared_file>.cl) and then insert the line #include "<shared_file>.cl" in the OpenCL kernel source code.
Consider creating a library under the following circumstances:
- The helper functions are in multiple files and you want to simplify distribution.
- You do not want to expose the helper functions' source code.
The helper functions are stored as LLVM IR, an assembly-like language, without comments inside the associated library.
Hardware generation is not necessary for the creation of a .aoco file. Compile the OpenCL source file using the -c offline compiler command option.
Packaging an RTL Component for an OpenCL Library
Hardware generation is not necessary for the creation of a .aoco file. Compile the OpenCL source file using the -c Intel® FPGA SDK for OpenCL™ Offline Compiler command option.
Restrictions and Limitations in RTL Support for the Intel FPGA SDK for OpenCL Standard Edition Library Feature
When creating your RTL module, ensure that it operates within the following restrictions:
- An RTL module must use a single input Avalon®-ST interface. That is, a single pair of ready and valid logic must control all the inputs. You have the option to provide the necessary Avalon-ST ports but declare the RTL module as stall-free. In this case, you do not have to implement proper stall behavior because the Intel® FPGA SDK for OpenCL™ Offline Compiler creates a wrapper for your module. Refer to XML Syntax of an RTL Module and Using an OpenCL Library that Works with Simple Functions (Example 1) for syntax and usage information, respectively. Note: You must handle ivalid signals properly if your RTL module has an internal state. Refer to Stall-Free RTL for more information.
- The RTL module must work correctly with exactly one clock, regardless of the clock frequency.
- Data input and output sizes must match valid OpenCL data types, from 8 bits for char to 1024 bits for long16. For example, if you work with 24-bit values inside an RTL module, declare the inputs to be 32 bits and declare the function signature in the SDK's library header file to accept the uint data type. Then, inside the RTL module, accept the 32-bit input but discard the top 8 bits.
- RTL modules cannot connect to external I/O signals. All input and output signals must come from an OpenCL kernel.
- An RTL module must have a clock port, a resetn port, and Avalon-ST input and output ports (that is, ivalid, ovalid, iready, oready). Name the ports as specified here.
- RTL modules that communicate with external memory must have Avalon Memory-Mapped (Avalon-MM) port parameters that match the corresponding Custom Platform parameters. The offline compiler does not perform any width or burst adaptation.
- RTL modules that communicate with external memory must behave as follows:
  - They cannot burst across the burst boundary.
  - They cannot make requests every clock cycle and stall the hardware by monopolizing the arbitration logic. An RTL module must pause its requests regularly to allow other load or store units to execute their operations.
- RTL modules cannot act as stand-alone OpenCL kernels. RTL modules can only be helper functions and be integrated into an OpenCL kernel during kernel compilation.
- Every function call that corresponds to RTL module instantiation is completely independent of other instantiations. There is no hardware sharing.
- Do not incorporate kernel code (that is, functions marked as kernel) into a .aoclib library file. Incorporating kernel code into the library file causes the offline compiler to issue an error message. You may incorporate helper functions into the library file.
- An RTL component must receive all its inputs at the same time. A single ivalid input signifies that all inputs contain valid data.
- The SDK does not support I/O RTL modules.
- You can only set RTL module parameters in the <RTL module description file name>.xml specification file, not the OpenCL kernel source file. To use the same RTL module with multiple parameters, create a separate FUNCTION tag for each parameter combination.
Currently, the SDK's RTL module support for the library feature has the following limitations:
- You can only pass data inputs to an RTL module by value via the OpenCL kernel code. Do not pass data inputs to an RTL module by reference, or via structs or channels. In the case of channel data, extract the data from the channel first and then pass the extracted scalar data to the RTL module. Note: Passing data inputs to an RTL module by reference or via structs causes a fatal error in the offline compiler.
- The debugger (for example, GDB for Linux) cannot step into a library function during emulation. In addition, optimization and area reports will not include code line numbers beside the library functions.
- Names of RTL module source files cannot conflict with the file names of Intel® FPGA SDK for OpenCL™ Offline Compiler IP. Both the RTL module source files and the offline compiler IP files are stored in the <kernel file name>/system/synthesis/submodules directory. Naming conflicts will cause existing offline compiler IP files in the directory to be overwritten by the RTL module source files.
- The SDK does not support .qip files. You must manually parse nested .qip files to create a flat list of RTL files.
- It is very difficult to debug an RTL module that works correctly on its own but works incorrectly as part of an OpenCL kernel. Double check all parameters under the ATTRIBUTES element in the <RTL module description file name>.xml file.
- All offline compiler area estimation tools assume that RTL module area is 0. The SDK does not currently support the capability of specifying an area model for RTL modules.
- RTL modules cannot access a 2x clock that is in-phase with the kernel clock and at twice the kernel clock frequency.
Verifying the RTL Modules
- Verify each RTL module using standard hardware verification methods.
- Modify one of the Intel® FPGA SDK for OpenCL™ library design examples to test your RTL modules inside the overall OpenCL system. This testing step is critical to prevent library users from encountering hardware problems.
It is crucial that you set the values for the ATTRIBUTES elements in the XML specification file correctly. Because you cannot simulate the entire OpenCL system, you will likely not discover problems caused by interface-level errors until you run the full system in hardware.
- Invoke the aocl library [<command option>] command. Note: The Intel® FPGA SDK for OpenCL™ library utility performs consistency checks on the XML specification file and source files, with some limitations.
- For a list of supported <command options>, invoke the aocl library command.
- The library utility does not detect errors in values assigned to elements within the ATTRIBUTES, MEM_INPUT, and AVALON_MEM elements in the XML specification file.
- The library utility does not detect RTL syntax errors. You must check the <your_kernel_filename>/quartus_sh_compile.log file for RTL syntax errors. However, parsing the errors might be time-consuming.
Packaging Multiple Object Files into a Library File
Specifying an OpenCL Library when Compiling an OpenCL Kernel
You may include multiple instances of -l <library file name> and -L <library directory> in the offline compiler command.
For example, if you create a library that includes the functions my_divfd(), my_sqrtfd(), and my_rsqrtfd(), the OpenCL kernel code might resemble the following:
    #include "lib_header.h"

    kernel void test_lib (global double * restrict in,
                          global double * restrict out,
                          int N) {
      int i = get_global_id(0);
      for (int k = 0; k < N; k++) {
        double x = in[i*N + k];
        out[i*N + k] = my_divfd (my_rsqrtfd(x),
                                 my_sqrtfd(my_rsqrtfd(x)));
      }
    }
The corresponding lib_header.h file might resemble the following:
    double my_sqrtfd (double x);
    double my_rsqrtfd(double x);
    double my_divfd(double a, double b);
Using an OpenCL Library that Works with Simple Functions (Example 1)
The example1.tgz tar ball includes a library, a kernel, and a host system. The example1.cl kernel source file includes two kernels. The kernel test_lib uses library functions; the kernel test_builtin uses built-in functions. The host runs both kernels and then compares their outputs and runtimes. Intel® recommends that you use the same strategy to verify your own library functions.
To compile this design example, perform the following tasks:
- Obtain example1.tgz from the OpenCL Design Examples web page.
- Unpack it into a local directory.
-
Follow the instructions in the README.html
file, which is located in the top-level of the unpacked example.
When you run the compiled host program, it should produce the following output:
    Loading example1.aocx ...
    Create buffers
    Generate random data for conversion...
    Enqueuing both library and builtin in kernels 4 times with global size 65536
    Kernel computation using library function took 5.35333 seconds
    Kernel computation using built-in function took 5.39949 seconds
    Reading results to buffers...
    Checking results...
    Library function throughput is within 5% of builtin throughput.
    PASSED
Using an OpenCL Library that Works with External Memory (Example 2)
The example2.tgz tar ball includes a library, a kernel, and a host system. In this example, the RTL code that communicates with global memory is Custom Platform- or Reference Platform-dependent. Ensure that the compilation targets the board that corresponds to the Stratix® V Network Reference Platform.
Intel® generated the RTL modules copyElement() and sumOfElements() using the Intel® FPGA SDK for OpenCL™ Offline Compiler, which explains the extra inputs in the code.
The example2.cl kernel source file includes two kernels. The kernel test6 is an NDRange kernel that calls the copyElement() RTL function, which copies data from B[] to A[] and then stores global_id+100 in C[]. The kernel test11 is a single work-item kernel that uses the sumOfElements() RTL function, which determines the sum of the elements of A[] in the range [i, N] and then adds the result to C[i].
To compile this design example, perform the following tasks:
- Obtain the example2.tgz from the OpenCL Design Examples web page.
- Unpack it into a local directory.
-
Follow the instructions in the README.html
file, which is located in the top-level of the unpacked example.
When you run the compiled host program, it should produce the following output:
    Loading example2.aocx ...
    Running test6
    Launching the kernel test6 with globalsize=128 localSize=16
    Loading example2.aocx ...
    Running test11
    Launching the kernel test11 with globalsize=1 localSize=1
    PASSED
OpenCL Library Command-Line Options
| Command Option | Description |
|---|---|
| -shared | In conjunction with the -c command option, compiles an OpenCL source file into an object file (.aoco) that you can then include into a library. Example: aoc -c -shared <OpenCL source file name>.cl -o <OpenCL object file name>.aoco |
| -I=<library_directory> | Adds <library_directory> to the header file search path. Example: aoc -I <library_header_file_directory> -l <library_file_name>.aoclib <kernel_file_name>.cl |
| -L=<library_directory> | Adds <library_directory> to the OpenCL library search path. The space after -L is optional. Example: aoc -l=<library_file_name>.aoclib [-L=<library directory>] <kernel file name>.cl |
| -l=<library_file_name>.aoclib | Specifies the OpenCL library file (<library_file_name>.aoclib). The space after -l is optional. Example: aoc -l=<library_file_name>.aoclib [-L=<library directory>] <kernel file name>.cl |
| -library-debug | Generates debug output that relates to libraries. Part of the additional output appears in stdout; the other part appears in the <kernel_file_name>/<kernel_file_name>.log file. Example: aoc -l=<library_file_name>.aoclib -library-debug <kernel_file_name>.cl |
| Command Option | Description |
|---|---|
| hdl-comp-pkg <XML_specification_file>.xml | Packages a single HDL component into a .aoco file that you then include into a library. Invoking this command option is similar to invoking aoc -c <XML_specification_file>.xml; however, the processing time is faster because the aocl utility does not perform any environment checks. Example: aocl library hdl-comp-pkg <XML_specification_file>.xml -o <output_file>.aoco |
| -c <XML_specification_file>.xml | Same function as hdl-comp-pkg <XML_specification_file>.xml. Example: aocl library -c <XML_specification_file>.xml |
| create | Creates a library file from the .aoco files that you created by invoking the hdl-comp-pkg utility option or the aoc -shared command, and from any other .aoclib libraries. Example: aocl library create [-name <library_name>] [-vendor <library_vendor>] [-version <library_version>] [-o <output_file>.aoclib] [.aoco...] [.aoclib...], where -name, -vendor, and -version are optional information strings that you can specify and add to the library. |
| list <library_name> | Lists all the RTL components in the library. Currently, this option is not available for listing OpenCL functions. Example: aocl library list <library_name> |
| help | Prints the list of Intel® FPGA SDK for OpenCL™ library utility options and their descriptions on screen. Example: aocl library help |
Kernel Attributes for Configuring Local and Private Memory Systems
| Attribute | Description |
|---|---|
| register | Specifies that the local variable must be implemented in a register. |
| memory | Specifies that the local variable must be implemented in a memory system. Including the memory kernel attribute is equivalent to declaring the local variable with the __local qualifier. |
| numbanks(N) | Specifies that the memory system implementing the local variable must have N banks, where N is a power-of-2 integer value greater than zero. |
| bankwidth(N) | Specifies that the memory system implementing the local variable must have banks that are N bytes wide, where N is a power-of-2 integer value greater than zero. |
| singlepump | Specifies that the memory system implementing the local variable must be single pumped. |
| doublepump | Specifies that the memory system implementing the local variable must be double pumped. |
| numreadports(N) | Specifies that the memory system implementing the local variable must have N read ports, where N is an integer value greater than zero. |
| numwriteports(N) | Specifies that the memory system implementing the local variable must have N write ports, where N is an integer value greater than zero. |
| merge("label", "direction") | Forces two or more variables to be implemented in the same memory system. label is an arbitrary string; assign the same label to all variables that you want to merge. Specify direction as either width or depth to identify whether the memories should be merged width-wise or depth-wise, respectively. |
| bank_bits(b0, b1, ..., bn) | Forces the memory system to split into 2^(n+1) banks, with {b0, b1, ..., bn} forming the bank-select bits. Important: b0, b1, ..., bn must be consecutive, positive integers. If you specify the numbanks(N) attribute without the bank_bits attribute, the bank-select bits default to the least significant bits (that is, 0, 1, ..., log2(numbanks)-1). |
| Example Use Case | Syntax |
|---|---|
| Implements a variable in a register | int __attribute__((register)) a[12]; |
| Implements a memory system with eight banks, each with a width of 8 bytes | int __attribute__((memory, numbanks(8), bankwidth(8))) b[16]; |
| Implements a double-pumped memory system with one 128-byte wide bank, one write port, and four read ports | int __attribute__((memory, numbanks(1), bankwidth(128), doublepump, numwriteports(1), numreadports(4))) c[32]; |
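The bank_bits and merge attributes from the table above can be applied the same way. The following is a minimal sketch with illustrative variable names and bit positions of our choosing; bank_bits(6, 5) selects four banks using two consecutive address bits:

```c
__kernel void attr_sketch() {
  // Sketch: four banks of 4-byte width, using (illustrative) address
  // bits 6 and 5 as the bank-select bits.
  local int __attribute__((memory, numbanks(4), bankwidth(4),
                           bank_bits(6, 5))) d[128];

  // Sketch: merge two variables depth-wise into a single memory system
  // by giving them the same label.
  local int __attribute__((merge("mem_a", "depth"))) e[64];
  local int __attribute__((merge("mem_a", "depth"))) f[64];
}
```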
Restrictions on the Usage of Variable-Specific Attributes
Unsupported usages of variable-specific attributes that cause compilation errors:
- You use the kernel attributes in declarations other than local or private variable declarations (for example, declarations for function parameters, global variable declarations, or function declarations).
- You use the register attribute in conjunction with any of the other variable-specific attributes.
- You specify the numbanks attribute but not the bankwidth attribute in the same variable declaration, or vice versa.
- You include both the singlepump and doublepump attributes in the same variable declaration.
- You specify the numreadports and numwriteports attributes without also including the singlepump or doublepump attribute in the same variable declaration.
- You specify the numreadports attribute but not the numwriteports attribute in the same variable declaration, or vice versa.
- You specify any of the following attributes without also specifying the numbanks and bankwidth attributes in the same variable declaration:
  - numreadports
  - numwriteports
  - singlepump
  - doublepump
Incorrect memory configurations that cause the offline compiler to issue warnings during compilation:
- The memory configuration that is defined by the variable-specific attributes exceeds the available storage size (for example, specifying eight banks of local memory for an integer variable).
Incorrect memory configurations that cause compilation errors:
- The bank width is smaller than the data storage size (for example, bank width is 2 bytes for an array of 4-byte integers).
- You specify memory configurations for the variables. However, because of compiler restrictions or coding style, the offline compiler implements the variables in the same memory instead of splitting the memory.
- You specify the register attribute for a variable. However, because of compiler restrictions or coding style, the offline compiler cannot implement the variable in a register.
Kernel Attributes for Reducing the Overhead on Hardware Usage
Hardware for Kernel Interface
Hardware around the kernel pipeline is necessary for functions such as the following:
- Dispatching IDs for work-items and work-groups
- Communicating with the host regarding kernel arguments and work-group sizes
Figure 18 illustrates the hardware that the offline compiler generates when it compiles the following kernel:
    __kernel void my_kernel(global int* arg) {
      …
      int sum = 0;
      for (unsigned i = 0; i < n; i++) {
        if (sum < m)
          sum += val;
      }
      *arg = sum;
      …
    }
Omit Hardware that Generates and Dispatches Kernel IDs
Semantically, the max_global_work_dim(0) kernel attribute specifies that the global work dimension of the kernel is zero. Setting this kernel attribute means that the kernel does not use any global, local, or group IDs. The presence of this attribute in the kernel code serves as a guarantee to the offline compiler that the kernel is a single work-item kernel.
When compiling the following kernel, the offline compiler will generate interface hardware as illustrated in Figure 19.
    channel int chan_in;
    channel int chan_out;

    __attribute__((max_global_work_dim(0)))
    __kernel void plusK (int N, int k) {
      for (int i = 0; i < N; ++i) {
        int data_in = read_channel_intel(chan_in);
        write_channel_intel(chan_out, data_in + k);
      }
    }
If your current kernel implementation has multiple work-items but does not use global, local, or group IDs, you can use the max_global_work_dim(0) kernel attribute if you modify the kernel code accordingly:
- Wrap the kernel body in a for loop that iterates as many times as the number of work-items.
- Launch the modified kernel with only one work-item.
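The two steps above can be sketched as follows. The kernel body and the NUM_ITEMS count are illustrative, not taken from the SDK examples:

```c
#define NUM_ITEMS 65536  // illustrative: the former NDRange size

// Before (NDRange kernel launched with NUM_ITEMS work-items):
// __kernel void scale(global int * restrict data) {
//   data[get_global_id(0)] *= 2;
// }

// After: the body is wrapped in a loop over the former work-items,
// and the kernel is launched with a single work-item.
__attribute__((max_global_work_dim(0)))
__kernel void scale(global int * restrict data) {
  for (unsigned i = 0; i < NUM_ITEMS; ++i) {
    data[i] *= 2;
  }
}
```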
Omit Communication Hardware between the Host and the Kernel
The autorun kernel attribute notifies the offline compiler that the kernel runs on its own and will not be enqueued by any host.
To leverage the autorun attribute, a kernel must meet all of the following criteria:
- Does not use I/O channels. Note: Kernel-to-kernel channels are supported.
- Does not have any arguments
- Has either the max_global_work_dim(0) attribute or the reqd_work_group_size(X,Y,Z) attribute. Note: The parameters of the reqd_work_group_size(X,Y,Z) attribute must be divisors of 2^32.
As mentioned above, kernels with the autorun attribute cannot have any arguments and start executing without the host launching them explicitly. As a result, the offline compiler does not need to generate the logic for communication between the host and the kernel. Omitting this logic reduces logic utilization and allows the offline compiler to apply additional performance optimizations.
A typical use case for the autorun attribute is a kernel that reads data from one or more kernel-to-kernel channels, processes the data, and then writes the results to one or more channels. When compiling the kernel, the offline compiler will generate hardware as illustrated in Figure 20.
    channel int chan_in;
    channel int chan_out;

    __attribute__((max_global_work_dim(0)))
    __attribute__((autorun))
    __kernel void plusOne () {
      while (1) {
        int data_in = read_channel_intel(chan_in);
        write_channel_intel(chan_out, data_in + 1);
      }
    }
Kernel Replication Using the num_compute_units(X,Y,Z) Attribute
As mentioned in Specifying Number of Compute Units, including the num_compute_units(N) kernel attribute in your kernel instructs the Intel® FPGA SDK for OpenCL™ Offline Compiler to generate multiple compute units to process data. The num_compute_units(N) attribute instructs the offline compiler to generate N identical copies of the kernel in hardware.
Customization of Replicated Kernels Using the get_compute_id() Function
Retrieving compute IDs is a convenient alternative to replicating your kernel in source code and then adding specialized code to each kernel copy. When a kernel uses the num_compute_units(X,Y,Z) attribute and calls the get_compute_id() function, the Intel® FPGA SDK for OpenCL™ Offline Compiler assigns a unique compute ID to each compute unit. The get_compute_id() function then retrieves these unique compute IDs. You can use the compute ID to specify how the associated compute unit should behave differently from the other compute units that are derived from the same kernel source code. For example, you can use the return value of get_compute_id() to index into an array of channels to specify which channel each compute unit should read from or write to.
The num_compute_units attribute accepts up to three arguments (that is, num_compute_units(X,Y,Z)). In conjunction with the get_compute_id() function, this attribute allows you to create one-dimensional, two-dimensional, and three-dimensional logical arrays of compute units. An example use case of a 1D array of compute units is a linear pipeline of kernels (also called a daisy chain of kernels). An example use case of a 2D array of compute units is a systolic array of kernels.
    __attribute__((max_global_work_dim(0)))
    __attribute__((autorun))
    __attribute__((num_compute_units(4,4)))
    __kernel void PE() {
      row = get_compute_id(0);
      col = get_compute_id(1);
      …
    }
For a 3D array of compute units, you can retrieve the X, Y, and Z coordinates of a compute unit in the logical compute unit array using get_compute_id(0), get_compute_id(1), and get_compute_id(2), respectively. In this case, the API is very similar to the API of the work-item's intrinsic functions (that is, get_global_id(), get_local_id(), and get_group_id()).
Global IDs, local IDs, and group IDs can vary at runtime based on how the host invokes the kernel. However, compute IDs are known at compilation time, allowing the offline compiler to generate optimized hardware for each compute unit.
Using Channels with Kernel Copies
The example code below implements channels within multiple compute units.
    #define N 4

    channel int chain_channels[N+1];

    __attribute__((max_global_work_dim(0)))
    __kernel void reader(global int *data_in, int size) {
      for (int i = 0; i < size; ++i) {
        write_channel_intel(chain_channels[0], data_in[i]);
      }
    }

    __attribute__((max_global_work_dim(0)))
    __attribute__((autorun))
    __attribute__((num_compute_units(N)))
    __kernel void plusOne() {
      int compute_id = get_compute_id(0);
      int input = read_channel_intel(chain_channels[compute_id]);
      write_channel_intel(chain_channels[compute_id+1], input + 1);
    }

    __attribute__((max_global_work_dim(0)))
    __kernel void writer(global int *data_out, int size) {
      for (int i = 0; i < size; ++i) {
        data_out[i] = read_channel_intel(chain_channels[N]);
      }
    }
Support Statuses of OpenCL Features
Support Statuses of OpenCL 1.0 Features
The following sections outline the support statuses of the OpenCL™ features described in the OpenCL Specification version 1.0.
OpenCL 1.0 C Programming Language Implementation
Support Status column legend:
| Symbol | Description |
|---|---|
| ● | The feature is supported. There might be a clarification for the supported feature in the Notes column. |
| ○ | The feature is supported with exceptions identified in the Notes column. |
| X | The feature is not supported. |
| Section | Feature | Support Status | Notes |
|---|---|---|---|
| 6.1.1 | Built-in Scalar Data Types | | |
| | double precision float | ○ | Preliminary support for the double precision float built-in scalar data type. This feature might not conform with the OpenCL Specification version 1.0. Currently, the following double precision floating-point functions conform with the OpenCL Specification version 1.0: add / subtract / multiply / divide / ceil / floor / rint / trunc / fabs / fmax / fmin / sqrt / rsqrt / exp / exp2 / exp10 / log / log2 / log10 / sin / cos / asin / acos / sinh / cosh / tanh / asinh / acosh / atanh / pow / pown / powr / tan / atan / atan2 / ldexp / log1p / sincos |
| | half precision float | ○ | Support for scalar addition, subtraction, and multiplication. Support for conversions to and from single-precision floating point. This feature might not conform with the OpenCL Specification version 1.0. This feature is supported in the Emulator. |
| 6.1.2 | Built-in Vector Data Types | ○ | Preliminary support for vectors with three elements. Three-element vector support is a supplement to the OpenCL Specification version 1.0. |
| 6.1.3 | Built-in Data Types | X | |
| 6.1.4 | Reserved Data Types | X | |
| 6.1.5 | Alignment of Types | ● | All scalar and vector types are aligned as required (vectors with three elements are aligned as if they had four elements). |
| 6.2.1 | Implicit Conversions | ● | Refer to Section 6.2.6: Usual Arithmetic Conversions in the OpenCL Specification version 1.2 for an important clarification of implicit conversions between scalar and vector types. |
| 6.2.2 | Explicit Casts | ● | The SDK allows scalar data casts to a vector with a different element type. |
| 6.5 | Address Space Qualifiers | ○ | Function-scope __constant variables are not supported. |
| 6.6 | Image Access Qualifiers | X | |
| 6.7 | Function Qualifiers | | |
| 6.7.2 | Optional Attribute Qualifiers | ● | Refer to the Intel® FPGA SDK for OpenCL™ Best Practices Guide for tips on using reqd_work_group_size to improve kernel performance. The SDK parses but ignores the vec_type_hint and work_group_size_hint attribute qualifiers. |
| 6.9 | Preprocessor Directives and Macros | | |
| | #pragma directive: #pragma unroll | ● | The Intel® FPGA SDK for OpenCL™ Offline Compiler supports only #pragma unroll. You may assign an integer argument to the unroll directive to control the extent of loop unrolling. For example, #pragma unroll 4 unrolls four iterations of a loop. By default, an unroll directive with no unroll factor causes the offline compiler to attempt to unroll the loop fully. Refer to the Intel® FPGA SDK for OpenCL™ Best Practices Guide for tips on using #pragma unroll to improve kernel performance. |
| | __ENDIAN_LITTLE__ defined to be value 1 | ● | The target FPGA is little-endian. |
| | __IMAGE_SUPPORT__ | X | __IMAGE_SUPPORT__ is undefined; the SDK does not support images. |
| 6.10 | Attribute Qualifiers—The offline compiler parses attribute qualifiers as follows: | | |
| 6.10.2 | Specifying Attributes of Functions—Structure-type kernel arguments | X | Convert structure arguments to a pointer to a structure in global memory. |
| 6.10.3 | Specifying Attributes of Variables—endian | X | |
| 6.10.4 | Specifying Attributes of Blocks and Control-Flow-Statements | X | |
| 6.10.5 | Extending Attribute Qualifiers | ● | The offline compiler can parse attributes on various syntactic structures. It reserves some attribute names for its own internal use. Refer to the Intel® FPGA SDK for OpenCL™ Best Practices Guide for tips on how to optimize kernel performance using these kernel attributes. |
| 6.11.2 | Math Functions | | |
| | built-in math functions | ○ | Preliminary support for built-in math functions for double precision float. These functions might not conform with the OpenCL Specification version 1.0. |
| | built-in half_ and native_ math functions | ○ | Preliminary support for built-in half_ and native_ math functions for double precision float. These functions might not conform with the OpenCL Specification version 1.0. |
| 6.11.5 | Geometric Functions | ○ | Preliminary support for built-in geometric functions for double precision float. These functions might not conform with the OpenCL Specification version 1.0. Refer to Argument Types for Built-in Geometric Functions for a list of built-in geometric functions supported by the SDK. |
| 6.11.8 | Image Read and Write Functions | X | |
| 6.11.9 | Synchronization Functions—the barrier synchronization function | ○ | Clarifications and exceptions: If a kernel specifies the reqd_work_group_size or max_work_group_size attribute, barrier supports the corresponding number of work-items. If neither attribute is specified, a barrier is instantiated with a default limit of 256 work-items. The work-item limit is the maximum supported work-group size for the kernel; this limit is enforced by the runtime. |
| 6.11.11 | Async Copies from Global to Local Memory, Local to Global Memory, and Prefetch | ○ | The implementation is naive: work-item (0,0,0) performs the copy, and wait_group_events is implemented as a barrier. If a kernel specifies the reqd_work_group_size or max_work_group_size attribute, wait_group_events supports the corresponding number of work-items. If neither attribute is specified, wait_group_events is instantiated with a default limit of 256 work-items. |
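As a brief sketch of the async-copy support described above (the kernel, tile size, and work-group size are illustrative), the standard OpenCL pattern is:

```c
#define TILE 256  // illustrative tile size

__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void sum_tile(global const int * restrict src,
                       global int * restrict dst) {
  local int tile[TILE];

  // Per the note above, work-item (0,0,0) performs the actual copy...
  event_t e = async_work_group_copy(tile, src + get_group_id(0) * TILE,
                                    TILE, 0);
  // ...and wait_group_events acts as a barrier for the work-group.
  wait_group_events(1, &e);

  int lid = get_local_id(0);
  int acc = 0;
  for (int i = lid; i < TILE; i += get_local_size(0))
    acc += tile[i];
  dst[get_global_id(0)] = acc;
}
```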
OpenCL C Programming Language Restrictions
| Feature | Support Status | Notes |
|---|---|---|
| pointer assignments between address spaces | ● | Arguments to __kernel functions declared in a program that are pointers must be declared with the __global, __constant, or __local qualifier. The offline compiler enforces the OpenCL restriction against pointer assignments between address spaces. |
| pointers to functions | X | The offline compiler does not enforce this restriction. |
| structure-type kernel arguments | X | Convert structure arguments to a pointer to a structure in global memory. |
| images | X | The SDK does not support images. |
| bit fields | X | The offline compiler does not enforce this restriction. |
| variable length arrays and structures | X | |
| variable macros and functions | X | |
| C99 headers | X | |
| extern, static, auto, and register storage-class specifiers | X | The offline compiler does not enforce this restriction. |
| predefined identifiers | ● | Use the -D option of the aoc command to provide preprocessor symbol definitions in your kernel code. |
| recursion | X | The offline compiler does not return an error for this restriction, but this feature is not supported. |
| irreducible control flow | X | The offline compiler does not return an error for this restriction, but this feature is not supported. |
| writes to memory of built-in types less than 32 bits in size | ○ | Store operations less than 32 bits in size might result in lower memory performance. |
| declaration of arguments to __kernel functions of type event_t | X | The offline compiler does not enforce this restriction. |
| elements of a struct or a union belonging to different address spaces | X | The offline compiler does not enforce this restriction. Warning: Assigning elements of a struct or a union to different address spaces might cause a fatal error. |
Argument Types for Built-in Geometric Functions
| Function | float | double |
|---|---|---|
| cross | ● | ● |
| dot | ● | ● |
| distance | ● | ● |
| length | ● | ● |
| normalize | ● | ● |
| fast_distance | ● | — |
| fast_length | ● | — |
| fast_normalize | ● | — |
Numerical Compliance Implementation
The table below summarizes the implementation statuses of the floating-point operators:
| Section | Feature | Support Status | Notes |
|---|---|---|---|
| 7.1 | Rounding Modes | ○ | Conversions between integer and single and half precision floating-point types support all rounding modes. Conversions between integer and double precision floating-point types support all rounding modes on a preliminary basis. This feature might not conform with the OpenCL Specification version 1.0. |
| 7.2 | INF, NaN and Denormalized Numbers | ○ | Infinity (INF) and Not a Number (NaN) results for single precision operations are generated in a manner that conforms with the OpenCL Specification version 1.0. Most operations flush denormalized numbers prior to and after a floating-point operation. Preliminary support for double precision floating-point operations. This feature might not conform with the OpenCL Specification version 1.0. |
| 7.3 | Floating-Point Exceptions | X | |
| 7.4 | Relative Error as ULPs | ○ | Single precision floating-point operations conform with the numerical accuracy requirements for an embedded profile of the OpenCL Specification version 1.0. Preliminary support for double precision floating-point operations. This feature might not conform with the OpenCL Specification version 1.0. |
| 7.5 | Edge Case Behavior | ● | |
Image Addressing and Filtering Implementation
Atomic Functions
- Section 9.5: Atomic Functions for 32-bit Integers—The SDK supports all 32-bit global and local memory atomic functions. The SDK also supports the 32-bit atomic functions described in Section 6.11.11 of the OpenCL Specification version 1.1 and Section 6.12.11 of the OpenCL Specification version 1.2.
- The SDK does not support the 64-bit atomic functions described in Section 9.7 of the OpenCL Specification version 1.0.
Embedded Profile Implementation
The table below summarizes the clarifications and exceptions to the OpenCL embedded profile:
| Clause | Feature | Support Status | Notes |
|---|---|---|---|
| 1 | 64-bit integers | ● | |
| 2 | 3D images | X | The SDK does not support images. |
| 3 | Create 2D and 3D images with image_channel_data_type values | X | The SDK does not support images. |
| 4 | Samplers | X | |
| 5 | Rounding modes | ● | The default rounding mode for CL_DEVICE_SINGLE_FP_CONFIG is CL_FP_ROUND_TO_NEAREST. |
| 6 | Restrictions listed for single precision basic floating-point operations | X | |
| 7 | half type | X | This clause of the OpenCL Specification version 1.0 does not apply to the SDK. |
| 8 | Error bounds listed for conversions from CL_UNORM_INT8, CL_SNORM_INT8, CL_UNORM_INT16 and CL_SNORM_INT16 to float | ● | Refer to the table below for a list of allocation limits. |
Support Statuses of OpenCL 1.2 Features
The following sections outline the support statuses of the OpenCL™ features described in the OpenCL Specification version 1.2.
OpenCL 1.2 Runtime Implementation
- clSetMemObjectDestructorCallback
- clGetKernelArgInfo
- clSetEventCallback
For more information on these APIs, refer to sections 5.4.1, 5.7.3, and 5.9 of the OpenCL Specification 1.2, respectively.
OpenCL 1.2 C Programming Language Implementation
| Section | Feature | Support Status | Notes |
|---|---|---|---|
| 6.1.3 | Other Built-in Data Types | ● | Preliminary support. This feature might not conform with the OpenCL Specification version 1.0. |
| 6.12.12 | Miscellaneous Vector Functions | ● | The SDK supports implementations of the following additional built-in vector functions: |
| 6.12.13 | printf | ○ | Preliminary support. This feature might not conform with the OpenCL Specification version 1.0. See below for details. |
The printf function in OpenCL has syntax and features similar to the printf function in C99, with a few exceptions. For details, refer to the OpenCL Specification version 1.2.

To use a printf function, there are no requirements for special compilation steps, buffers, or flags. You can compile kernels that include printf instructions with the usual aoc command.

During kernel execution, printf data is stored in a global printf buffer that the Intel® FPGA SDK for OpenCL™ Offline Compiler allocates automatically. The size of this buffer is 64 kB; the total size of the data arguments to a printf call should not exceed this size. When kernel execution completes, the contents of the printf buffer are printed to standard output. Buffer overflows are handled seamlessly; printf instructions can be executed an unlimited number of times. However, if the printf buffer overflows, kernel pipeline execution stalls until the host reads the buffer and prints the buffer contents.

Because printf functions store their data in a global memory buffer, the performance of your kernel drops if it includes such functions.

There are no usage limitations on printf functions. You can use printf instructions inside if-then-else statements, loops, and so on. A kernel can contain multiple printf instructions executed by multiple work-items.

Format string arguments and literal string arguments of printf calls are transferred to the host system from the FPGA using a special memory region. This memory region can overflow if the total size of the printf string arguments is large (3000 characters or less is usually safe in a typical OpenCL application). If there is an overflow, the error message "cannot parse auto-discovery string at byte offset 4096" is printed during host program execution.

Output from printf is never intermixed, even though work-items may execute printf functions concurrently. However, the order of concurrent printf execution is not guaranteed. In other words, printf outputs might not appear in program order if the printf instructions are in concurrent datapaths.
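As a brief sketch (the kernel name and message are illustrative), printf can be used directly in kernel code with no special flags:

```c
__kernel void debug_print(global const int * restrict data) {
  int gid = get_global_id(0);
  // Each work-item's output is buffered on the device and printed to
  // standard output after kernel execution completes; ordering across
  // concurrent work-items is not guaranteed.
  if (data[gid] < 0)
    printf("negative value %d at index %d\n", data[gid], gid);
}
```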
Support Statuses of OpenCL 2.0 Features
The following sections outline the support statuses of the OpenCL™ features described in the OpenCL Specification version 2.0.
OpenCL 2.0 Headers
OpenCL 2.0 Runtime Implementation
OpenCL 2.0 C Programming Language Restrictions for Pipes
Function | Support Status |
---|---|
int read_pipe (pipe gentype p, gentype *ptr) | ● |
int write_pipe (pipe gentype p, const gentype *ptr) | ● |
int read_pipe (pipe gentype p, reserve_id_t reserve_id, uint index, gentype *ptr) | X |
int write_pipe (pipe gentype p, reserve_id_t reserve_id, uint index, const gentype *ptr) | X |
reserve_id_t reserve_read_pipe (pipe gentype p, uint num_packets) reserve_id_t reserve_write_pipe (pipe gentype p, uint num_packets) | X |
void commit_read_pipe (pipe gentype p, reserve_id_t reserve_id) void commit_write_pipe (pipe gentype p, reserve_id_t reserve_id) | X |
bool is_valid_reserve_id (reserve_id_t reserve_id) | X |
Function | Support Status |
---|---|
reserve_id_t work_group_reserve_read_pipe (pipe gentype p, uint num_packets) reserve_id_t work_group_reserve_write_pipe (pipe gentype p, uint num_packets) | X |
void work_group_commit_read_pipe (pipe gentype p, reserve_id_t reserve_id) void work_group_commit_write_pipe (pipe gentype p, reserve_id_t reserve_id) | X |
Function | Support Status |
---|---|
uint get_pipe_num_packets (pipe gentype p) | X |
uint get_pipe_max_packets (pipe gentype p) | X |
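As the tables above indicate, only the basic read_pipe and write_pipe forms (without reservation IDs) are supported. A hypothetical producer/consumer pair using only those supported forms might look like this sketch:

```c
// Hypothetical kernels using only the supported pipe built-ins:
// the two-argument read_pipe and write_pipe (no reservations,
// no work-group variants, no packet queries).
__kernel void producer(__global const int *src,
                       write_only pipe int p) {
    int v = src[get_global_id(0)];
    // write_pipe returns 0 when the packet is written successfully.
    write_pipe(p, &v);
}

__kernel void consumer(__global int *dst,
                       read_only pipe int p) {
    int v;
    // read_pipe returns 0 when a packet is read successfully.
    if (read_pipe(p, &v) == 0) {
        dst[get_global_id(0)] = v;
    }
}
```

Reservation-based accesses (reserve_read_pipe, commit_write_pipe, and so on) and the get_pipe_num_packets/get_pipe_max_packets queries must be avoided, since they are unsupported.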
Intel FPGA SDK for OpenCL Standard Edition Allocation Limits
Item | Limit |
---|---|
Maximum number of contexts | Limited only by host memory size |
Maximum number of devices | 32 |
Minimum global memory allocation by runtime | The runtime allocates 64 kB of device memory when the context is created. This memory is reserved for program variables in the global address space and for static variables inside functions. If the OpenCL kernel uses the printf function, the runtime allocates an additional 64 kB of device memory. |
Maximum number of queues | 256. Attention: Each context uses two queues for system purposes. |
Maximum number of program objects per context | 20 |
Maximum number of event objects per context | Limited only by host memory size |
Maximum number of dependencies between events within a context | 1000 |
Maximum number of event dependencies per command | 20 |
Maximum number of concurrently running kernels | The total number of queues |
Maximum number of enqueued kernels | 1000 |
Maximum number of kernels per FPGA device | Hardware: no static limit. Emulator: 256 |
Maximum number of arguments per kernel | 128 |
Maximum total size of kernel arguments | 256 bytes per kernel |
Maximum number of declared variables in the local memory per kernel | 128 |
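Some of these limits can be confirmed at run time through the standard OpenCL query API. As a hedged sketch (assuming an installed OpenCL runtime and at least one platform/device), the host code below queries the maximum total size of kernel arguments, which the table above lists as 256 bytes:

```c
// Sketch: query device limits with the standard clGetDeviceInfo API.
// Requires linking against an OpenCL runtime (-lOpenCL).
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS)
        return 1;
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1,
                       &device, NULL) != CL_SUCCESS)
        return 1;

    size_t max_param_size = 0;
    // CL_DEVICE_MAX_PARAMETER_SIZE reports the maximum total size,
    // in bytes, of the arguments that can be passed to a kernel.
    clGetDeviceInfo(device, CL_DEVICE_MAX_PARAMETER_SIZE,
                    sizeof(max_param_size), &max_param_size, NULL);
    printf("Max total kernel argument size: %zu bytes\n",
           max_param_size);
    return 0;
}
```

The reported value depends on the device and runtime; the 256-byte figure in the table is the documented limit for this SDK.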
Document Revision History of the Intel FPGA SDK for OpenCL Standard Edition Programming Guide
Document Version | Intel® Quartus® Prime Version | Changes |
---|---|---|
2019.04.22 | 18.1 | Fixed broken links. |
2018.09.24 | 18.1 | |
2018.05.04 | 18.0 | |
Date | Version | Changes |
---|---|---|
December 2017 | 2017.12.08 | |
November 2017 | 2017.11.06 | |
May 2017 | 2017.05.08 | |
October 2016 | 2016.10.31 | |
May 2016 | 2016.05.02 | |
November 2015 | 2015.11.02 | |
May 2015 | 15.0.0 | |
December 2014 | 14.1.0 | |
June 2014 | 14.0.0 | |
December 2013 | 13.1.1 | |
November 2013 | 13.1.0 | |
June 2013 | 13.0 SP1.0 | |
May 2013 | 13.0.1 | |
May 2013 | 13.0.0 | |
November 2012 | 12.1.0 | Initial release. |