Kernels for performing the Scan operation.
Functions
kernel void inclusiveScan_i (global int4 *in, global int4 *out, local int *data, global int *sums, uint n)
Performs an inclusive scan operation on the columns of an array.
kernel void exclusiveScan_i (global int4 *in, global int4 *out, local int *data, global int *sums, uint n)
Performs an exclusive scan operation on the columns of an array.
kernel void addGroupSums_i (global int *sums, global int4 *out, uint n)
Adds the group sums in the associated blocks.
kernel void addGroupSums_i (global int *sums, global int4 *out, uint n)
Adds the group sums in the associated blocks.
This is the second part of the Blelloch scan algorithm.
scan handled 2 int4 elements per work-item, whereas addGroupSums handles 1 int4 element per work-item. The global workspace should be \( 2*(wgXdim-1)*lXdim_{scan} \) in the x dimension, and \( M \) in the y dimension. The global workspace should also have an offset of \( 2*lXdim_{scan} \) in the x dimension. The local workspace should be \( 2*lXdim_{scan} \) in the x dimension, and 1 in the y dimension.
[in] | sums | (scan) array of work-group sums. Its size is \( M \times wgXdim \). |
[out] | out | (scan) output array of int elements (before processing, it contains the block scans performed in a previous step). |
[in] | n | the number of elements in a row of the array divided by 4. |
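The workspace arithmetic above can be sketched on the host side as follows. This is an illustrative helper, not part of the documented API: the type AddGroupSumsGeometry and the function add_group_sums_geometry are hypothetical names, wgXdim is assumed to be the number of work-groups the scan kernel used in the x dimension, and lXdim_scan is assumed to be the scan kernel's local size in x.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the launch geometry of addGroupSums_i. */
typedef struct {
    size_t global[2];  /* global work size   (x, y) */
    size_t offset[2];  /* global work offset (x, y) */
    size_t local[2];   /* local work size    (x, y) */
} AddGroupSumsGeometry;

/* Derives the geometry from the parameters of the preceding scan launch.
 *   wg_x_dim     : work-groups the scan used in x (assumed meaning of wgXdim)
 *   l_x_dim_scan : local size in x of the scan kernel (lXdim_scan)
 *   m            : number of rows in the array (M) */
static AddGroupSumsGeometry add_group_sums_geometry(size_t wg_x_dim,
                                                    size_t l_x_dim_scan,
                                                    size_t m)
{
    /* With a single scan work-group the global x size would be 0,
     * so there is nothing to launch. */
    assert(wg_x_dim >= 2);
    assert(l_x_dim_scan > 0 && m > 0);

    AddGroupSumsGeometry g;
    /* One work-item per int4; every block except the first is updated. */
    g.global[0] = 2 * (wg_x_dim - 1) * l_x_dim_scan;
    g.global[1] = m;
    /* Skip the first block, which spans 2 * lXdim_scan int4 elements. */
    g.offset[0] = 2 * l_x_dim_scan;
    g.offset[1] = 0;
    /* One addGroupSums work-group covers exactly one scan block. */
    g.local[0] = 2 * l_x_dim_scan;
    g.local[1] = 1;
    return g;
}
```

The three arrays correspond to the global_work_size, global_work_offset and local_work_size arguments of clEnqueueNDRangeKernel with work_dim set to 2.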
kernel void exclusiveScan_i (global int4 *in, global int4 *out, local int *data, global int *sums, uint n)
Performs an exclusive scan operation on the columns of an array; each row is scanned independently.
The parallel scan algorithm by Blelloch is implemented.
The number of elements, N, in a row of the array should be a multiple of 4 (the data are handled as int4). The x dimension of the global workspace, \( gXdim \), should be greater than or equal to the number of elements in a row of the array divided by 8. That is, \( gXdim \geq N/8 \). Each work-item handles 8 int (= 2 int4) elements in a row of the array. The y dimension of the global workspace, \( gYdim \), should be equal to the number of rows, M, in the array. That is, \( gYdim = M \). The local workspace should be 1 in the y dimension, and a power of 2 in the x dimension. It is recommended to use one wavefront/warp per work-group.
[...] 0, in the sums array, since in the next phase the sums array is going to be handled as int4.
[in] | in | input array of int elements. |
[out] | out | (scan per work-group) output array of int elements. |
[in] | data | local buffer. Its size should be 2 int elements for each work-item in a work-group. That is, \( 2*lXdim*sizeof\ (int) \). |
[out] | sums | array of block sums. Each work-group outputs the sum of its elements. Its size should be \( M \times wgXdim \). |
[in] | n | the number of elements in a row of the array divided by 4. |
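The sizing rules above can be turned into a small host-side calculation. The following is a minimal sketch under stated assumptions: ScanGeometry and scan_geometry are hypothetical names, and the global x size is rounded up to a multiple of the local x size, which is an assumption made here (OpenCL 1.x requires it when a local size is given) and is compatible with the \( gXdim \geq N/8 \) condition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of a scan kernel launch. */
typedef struct {
    size_t global[2];    /* global work size (x, y) */
    size_t local[2];     /* local work size  (x, y) */
    size_t wg_x_dim;     /* work-groups in x (wgXdim) */
    size_t local_bytes;  /* size of the local buffer `data` */
    size_t sums_count;   /* ints in the `sums` array, M * wgXdim */
    unsigned int n;      /* kernel argument n = N / 4 */
} ScanGeometry;

/*   n_elems : elements per row (N), must be a multiple of 4
 *   m       : number of rows (M)
 *   l_x_dim : local size in x, must be a power of 2 */
static ScanGeometry scan_geometry(size_t n_elems, size_t m, size_t l_x_dim)
{
    assert(n_elems > 0 && n_elems % 4 == 0);
    assert(m > 0);
    assert(l_x_dim > 0 && (l_x_dim & (l_x_dim - 1)) == 0);

    /* Each work-item handles 8 int elements: ceil(N / 8) work-items. */
    size_t items = (n_elems + 7) / 8;
    /* Assumption: round up to whole work-groups. */
    size_t groups = (items + l_x_dim - 1) / l_x_dim;

    ScanGeometry g;
    g.global[0] = groups * l_x_dim;  /* gXdim >= N / 8 */
    g.global[1] = m;                 /* gYdim = M */
    g.local[0] = l_x_dim;
    g.local[1] = 1;
    g.wg_x_dim = groups;
    g.local_bytes = 2 * l_x_dim * sizeof(int);  /* 2 ints per work-item */
    g.sums_count = m * groups;
    g.n = (unsigned int)(n_elems / 4);
    return g;
}
```

For example, a 2 x 1024 array with a local x size of 64 needs 128 work-items in x, which gives 2 work-groups per row and 4 entries in the sums array.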
kernel void inclusiveScan_i (global int4 *in, global int4 *out, local int *data, global int *sums, uint n)
Performs an inclusive scan operation on the columns of an array; each row is scanned independently.
The parallel scan algorithm by Blelloch is implemented.
The number of elements, N, in a row of the array should be a multiple of 4 (the data are handled as int4). The x dimension of the global workspace, \( gXdim \), should be greater than or equal to the number of elements in a row of the array divided by 8. That is, \( gXdim \geq N/8 \). Each work-item handles 8 int (= 2 int4) elements in a row of the array. The y dimension of the global workspace, \( gYdim \), should be equal to the number of rows, M, in the array. That is, \( gYdim = M \). The local workspace should be 1 in the y dimension, and a power of 2 in the x dimension. It is recommended to use one wavefront/warp per work-group.
[...] 0, in the sums array, since in the next phase the sums array is going to be handled as int4.
[in] | in | input array of int elements. |
[out] | out | (scan per work-group) output array of int elements. |
[in] | data | local buffer. Its size should be 2 int elements for each work-item in a work-group. That is, \( 2*lXdim*sizeof\ (int) \). |
[out] | sums | array of block sums. Each work-group outputs the sum of its elements. Its size should be \( M \times wgXdim \). |
[in] | n | the number of elements in a row of the array divided by 4. |
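To check the final output of either scan kernel (after the group sums have been added), a serial CPU reference is handy. The sketch below is not the kernels' implementation; the function names are hypothetical, and it assumes a row-major array and addition as the scan operator.

```c
#include <assert.h>
#include <stddef.h>

/* Serial inclusive scan of each row:
 * out[r][c] = in[r][0] + ... + in[r][c].  Assumes row-major storage. */
static void inclusive_scan_rows_ref(const int *in, int *out,
                                    size_t rows, size_t cols)
{
    assert(in != NULL && out != NULL);
    for (size_t r = 0; r < rows; ++r) {
        int running = 0;
        for (size_t c = 0; c < cols; ++c) {
            running += in[r * cols + c];
            out[r * cols + c] = running;
        }
    }
}

/* Serial exclusive scan of each row:
 * out[r][c] = in[r][0] + ... + in[r][c-1], with out[r][0] = 0. */
static void exclusive_scan_rows_ref(const int *in, int *out,
                                    size_t rows, size_t cols)
{
    assert(in != NULL && out != NULL);
    for (size_t r = 0; r < rows; ++r) {
        int running = 0;
        for (size_t c = 0; c < cols; ++c) {
            out[r * cols + c] = running;
            running += in[r * cols + c];
        }
    }
}
```

The two results are related element-wise: the inclusive value equals the exclusive value plus the corresponding input element, which gives a cheap consistency check between the outputs of the two kernels.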