Kernels for performing the Scan operation.
Functions
| kernel void | inclusiveScan_i (global int4 *in, global int4 *out, local int *data, global int *sums, uint n) |
| Performs an inclusive scan operation on the columns of an array. | |
| kernel void | exclusiveScan_i (global int4 *in, global int4 *out, local int *data, global int *sums, uint n) |
| Performs an exclusive scan operation on the columns of an array. | |
| kernel void | addGroupSums_i (global int *sums, global int4 *out, uint n) |
| Adds the group sums in the associated blocks. | |
kernel void addGroupSums_i (global int *sums, global int4 *out, uint n)
Adds the group sums in the associated blocks.
It's the second part of the Blelloch scan algorithm.
The scan kernels handle 2 int4 elements per work-item, whereas addGroupSums handles 1 int4 element per work-item.
The global workspace should be \( 2*(wgXdim-1)*lXdim_{scan} \) in the x dimension, and \( M \) in the y dimension. The global workspace should also have an offset of \( 2*lXdim_{scan} \) in the x dimension, so that the first block of each row, which needs no correction, is skipped. The local workspace should be \( 2*lXdim_{scan} \) in the x dimension, and 1 in the y dimension.
| [in] | sums | (scan) array of work-group sums. Its size is \( M \times wgXdim \). |
| [out] | out | (scan) output array of int elements. Before processing, it contains the block scans performed in a previous step. |
| [in] | n | the number of elements in a row of the array divided by 4. |
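The effect of this second phase can be illustrated with a sequential CPU reference. This is only a sketch, not the kernel's code: the function name `add_group_sums_ref` and its parameters are hypothetical, and it assumes that `sums` already holds an inclusive scan of the per-block sums of each row, so that block b (for b >= 1) is corrected by the entry for block b-1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference for the second phase of a blocked scan.
 * Assumption: scanned_sums[r * blocks_per_row + b] holds the inclusive
 * scan of the block sums of row r, up to and including block b. */
void add_group_sums_ref(const int *scanned_sums, int *out,
                        size_t rows, size_t row_len,
                        size_t block_len, size_t blocks_per_row)
{
    assert(block_len > 0);
    for (size_t r = 0; r < rows; ++r) {
        /* Block 0 is skipped, mirroring the x offset of the global workspace. */
        for (size_t b = 1; b < blocks_per_row; ++b) {
            int carry = scanned_sums[r * blocks_per_row + b - 1];
            size_t begin = b * block_len;
            size_t end = begin + block_len;
            if (end > row_len)
                end = row_len;              /* clamp the last, partial block */
            for (size_t i = begin; i < end; ++i)
                out[r * row_len + i] += carry;
        }
    }
}
```

For a single row 1..8 scanned inclusively in two blocks of 4, `out` starts as {1, 3, 6, 10, 5, 11, 18, 26} and the scanned sums are {10, 36}; after the call the second block becomes {15, 21, 28, 36}.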
kernel void exclusiveScan_i (global int4 *in, global int4 *out, local int *data, global int *sums, uint n)
Performs an exclusive scan operation on the columns of an array.
The parallel scan algorithm by Blelloch is implemented.
The number of elements, N, in a row of the array should be a multiple of 4 (the data are handled as int4). The x dimension of the global workspace, \( gXdim \), should be greater than or equal to the number of elements in a row of the array divided by 8. That is, \( gXdim \geq N/8 \). Each work-item handles 8 int (= 2 int4) elements in a row of the array. The y dimension of the global workspace, \( gYdim \), should be equal to the number of rows, M, in the array. That is, \( gYdim = M \). The local workspace should be 1 in the y dimension, and a power of 2 in the x dimension. It is recommended to use one wavefront/warp per work-group.
0, in the sums array, since in the next phase the sums array is going to be handled as int4.
| [in] | in | input array of int elements. |
| [out] | out | (scan per work-group) output array of int elements. |
| [in] | data | local buffer. Its size should be 2 int elements for each work-item in a work-group. That is, \( 2*lXdim*sizeof\ (int) \). |
| [out] | sums | array of block sums. Each work-group outputs the sum of its elements. Its size should be \( M \times wgXdim \). |
| [in] | n | the number of elements in a row of the array divided by 4. |
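To make the reference to Blelloch's algorithm concrete, the sketch below emulates its two sweeps sequentially on the CPU. It is not the kernel's code: the name `blelloch_exclusive_scan` is hypothetical, it works on plain int elements rather than int4, and it assumes the length is a power of 2. The returned total plays the role of the block sum that a work-group would write to `sums`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sequential emulation of Blelloch's exclusive scan.
 * Scans data[0..n-1] in place and returns the sum of all elements.
 * Assumption: n is 0 or a power of 2. */
int blelloch_exclusive_scan(int *data, size_t n)
{
    assert((n & (n - 1)) == 0);
    if (n == 0)
        return 0;

    /* Up-sweep (reduce): build partial sums up a binary tree. */
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            data[i] += data[i - stride];

    int total = data[n - 1];
    data[n - 1] = 0;                     /* identity at the root */

    /* Down-sweep: push prefixes back down the tree. */
    for (size_t stride = n / 2; stride >= 1; stride /= 2)
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            int left = data[i - stride];
            data[i - stride] = data[i];  /* left child takes the parent's prefix */
            data[i] += left;             /* right child adds the left subtree sum */
        }
    return total;
}
```

Applied to {1, 2, 3, 4, 5, 6, 7, 8}, the array becomes {0, 1, 3, 6, 10, 15, 21, 28} and the returned total is 36.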
kernel void inclusiveScan_i (global int4 *in, global int4 *out, local int *data, global int *sums, uint n)
Performs an inclusive scan operation on the columns of an array.
The parallel scan algorithm by Blelloch is implemented.
The number of elements, N, in a row of the array should be a multiple of 4 (the data are handled as int4). The x dimension of the global workspace, \( gXdim \), should be greater than or equal to the number of elements in a row of the array divided by 8. That is, \( gXdim \geq N/8 \). Each work-item handles 8 int (= 2 int4) elements in a row of the array. The y dimension of the global workspace, \( gYdim \), should be equal to the number of rows, M, in the array. That is, \( gYdim = M \). The local workspace should be 1 in the y dimension, and a power of 2 in the x dimension. It is recommended to use one wavefront/warp per work-group.
0, in the sums array, since in the next phase the sums array is going to be handled as int4.
| [in] | in | input array of int elements. |
| [out] | out | (scan per work-group) output array of int elements. |
| [in] | data | local buffer. Its size should be 2 int elements for each work-item in a work-group. That is, \( 2*lXdim*sizeof\ (int) \). |
| [out] | sums | array of block sums. Each work-group outputs the sum of its elements. Its size should be \( M \times wgXdim \). |
| [in] | n | the number of elements in a row of the array divided by 4. |
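The workspace requirements above (shared with exclusiveScan_i) can be collected in a small host-side helper. This is a sketch under stated assumptions, not part of the library: `ScanLaunch` and `plan_scan_launch` are hypothetical names, \( wgXdim \) is taken to be the number of work-groups in the x dimension, and \( gXdim \) is rounded up to a multiple of \( lXdim \) so that the global size divides evenly into work-groups.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the scan launch parameters. */
typedef struct {
    size_t global[2];    /* gXdim, gYdim */
    size_t local[2];     /* lXdim, 1 */
    size_t groups_x;     /* wgXdim (assumed: work-groups in x) */
    size_t local_bytes;  /* size of the `data` local buffer */
    size_t sums_elems;   /* M * wgXdim ints in `sums` */
} ScanLaunch;

/* Fills *plan for a rows x n_elems array and a local x size of lx.
 * Returns 0 on success, -1 if a documented constraint is violated. */
int plan_scan_launch(size_t n_elems, size_t rows, size_t lx, ScanLaunch *plan)
{
    assert(plan != NULL);
    if (n_elems == 0 || n_elems % 4 != 0)
        return -1;                           /* N must be a multiple of 4 */
    if (lx == 0 || (lx & (lx - 1)) != 0)
        return -1;                           /* lXdim must be a power of 2 */

    size_t items = (n_elems + 7) / 8;        /* 8 ints per work-item */
    size_t groups = (items + lx - 1) / lx;   /* round up to whole groups */

    plan->global[0] = groups * lx;           /* gXdim >= N/8 */
    plan->global[1] = rows;                  /* gYdim = M */
    plan->local[0] = lx;
    plan->local[1] = 1;
    plan->groups_x = groups;
    plan->local_bytes = 2 * lx * sizeof(int32_t);  /* OpenCL int is 32-bit */
    plan->sums_elems = rows * groups;
    return 0;
}
```

For example, N = 1024, M = 3 and lXdim = 64 give gXdim = 128, two work-groups per row, a 512-byte local buffer and 6 entries in `sums`.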