RandomBallCover  1.2.1 Hosted by GitHub
scan_kernels.cl File Reference

Kernels for performing the Scan operation.

## Functions

kernel void inclusiveScan_i (global int4 *in, global int4 *out, local int *data, global int *sums, uint n)
Performs an inclusive scan operation on the columns of an array.

kernel void exclusiveScan_i (global int4 *in, global int4 *out, local int *data, global int *sums, uint n)
Performs an exclusive scan operation on the columns of an array.

kernel void addGroupSums_i (global int *sums, global int4 *out, uint n)
Adds the group sums in the associated blocks.

## Detailed Description

Kernels for performing the Scan operation.

Version
1.0
Date
2015
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

## Function Documentation

 kernel void addGroupSums_i ( global int * sums, global int4 * out, uint n )

Adds the group sums in the associated blocks.

It is the final phase of the multi-block Blelloch scan: the scanned group sums are added to the block scans produced by the scan kernel.

Note
The scan kernel handles 2 int4 elements per work-item, whereas addGroupSums handles 1 int4 element per work-item. The global workspace should be $$2*(wgXdim-1)*lXdim_{scan}$$ in the x dimension, and $$M$$ in the y dimension. The global workspace should also have an offset of $$2*lXdim_{scan}$$ in the x dimension. The local workspace should be $$2*lXdim_{scan}$$ in the x dimension, and 1 in the y dimension.
This part should follow after a scan has been performed on the group sums.
Parameters
[in] sums (scan) array of work-group sums. Its size is $$M \times wgXdim$$.
[out] out (scan) output array of int elements (before processing, it contains the block scans performed in a previous step).
[in] n the number of elements in a row of the array divided by 4.

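
The workspace sizes in the note above can be collected into a small host-side helper. This is a minimal sketch, not part of RandomBallCover: the type `AddGroupSumsRange` and the function `add_group_sums_range` are hypothetical names, and the requirement `wg_x >= 2` is an assumption (with a single work-group the formula yields an empty global workspace, so there is nothing to add). The three arrays are laid out so that they could be passed as the `global_work_size`, `global_work_offset` and `local_work_size` arguments of `clEnqueueNDRangeKernel`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type (not part of the library): 2-D NDRange sizes. */
typedef struct {
    size_t global[2];  /* global workspace (x, y) */
    size_t offset[2];  /* global workspace offset (x, y) */
    size_t local[2];   /* local workspace (x, y) */
} AddGroupSumsRange;

/* lx_scan: local x size that was used for the scan kernel (lXdim_scan).
 * wg_x:    number of scan work-groups per row (wgXdim).
 * rows:    number of rows in the array (M).
 * Assumption: wg_x >= 2, i.e. the multi-work-group case. */
AddGroupSumsRange add_group_sums_range(size_t lx_scan, size_t wg_x, size_t rows)
{
    assert(lx_scan > 0 && wg_x >= 2 && rows > 0);

    AddGroupSumsRange r;
    /* The scan handled 2 int4 per work-item, so one scan block spans
     * 2 * lx_scan int4 elements. addGroupSums handles 1 int4 per
     * work-item, so one of its work-groups covers exactly one block. */
    r.local[0] = 2 * lx_scan;
    r.local[1] = 1;
    /* The offset skips the first block of every row. */
    r.offset[0] = 2 * lx_scan;
    r.offset[1] = 0;
    /* The remaining (wg_x - 1) blocks are covered, one row per y index. */
    r.global[0] = 2 * (wg_x - 1) * lx_scan;
    r.global[1] = rows;
    return r;
}
```

For example, with a scan local size of 64, 4 scan work-groups per row and 10 rows, the helper gives a global workspace of 384 x 10, an x offset of 128 and a local workspace of 128 x 1.
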
 kernel void exclusiveScan_i ( global int4 * in, global int4 * out, local int * data, global int * sums, uint n )

Performs an exclusive scan operation on the columns of an array.

The parallel scan algorithm by Blelloch is implemented.
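
For orientation, the two sweeps of Blelloch's algorithm can be emulated sequentially on the CPU. This is a minimal sketch under stated assumptions, not the kernel source: `blelloch_exclusive_scan` is a hypothetical name, the input length must be a power of 2, and each inner loop stands in for the work that a parallel implementation would distribute over work-items between barriers.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential emulation of Blelloch's exclusive scan.
 * Scans data[0..n) in place and returns the sum of all elements.
 * Assumption: n is a power of 2. */
int blelloch_exclusive_scan(int *data, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);

    /* Up-sweep (reduce): build partial sums up a balanced binary tree. */
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            data[i] += data[i - stride];
    }

    /* The root now holds the total; replace it with the identity, 0. */
    int total = data[n - 1];
    data[n - 1] = 0;

    /* Down-sweep: each node passes its value to its left child and
     * its value plus the old left child to its right child. */
    for (size_t stride = n / 2; stride >= 1; stride /= 2) {
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            int left = data[i - stride];
            data[i - stride] = data[i];
            data[i] += left;
        }
    }
    return total;
}
```

Applied to the array 3, 1, 7, 0, 4, 1, 6, 3, the function leaves 0, 3, 4, 11, 11, 15, 16, 22 in place and returns 25.
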

Note
When there are multiple rows in the array, a scan operation is performed per row, in parallel.
The number of elements, N, in a row of the array should be a multiple of 4 (the data are handled as int4). The x dimension of the global workspace, $$gXdim$$, should be greater than or equal to the number of elements in a row of the array divided by 8. That is, $$gXdim \geq N/8$$. Each work-item handles 8 int (= 2 int4) elements in a row of the array. The y dimension of the global workspace, $$gYdim$$, should be equal to the number of rows, M, in the array. That is, $$gYdim = M$$. The local workspace should be 1 in the y dimension, and a power of 2 in the x dimension. It is recommended to use one wavefront/warp per work-group.
When the number of elements per row of the array is small enough to be handled by a single work-group, the output array will contain the true scan result. When a row has more elements than that, it is partitioned into blocks that are scanned independently. In this case, the kernel outputs the result of each block scan. A scan should then be performed on the block sums of each row. Finally, the results of that block-sums scan should be added to the corresponding blocks. For the case of multiple work-groups, the number of work-groups in the x dimension, $$wgXdim$$, should be made a multiple of 4. The potential extra work-groups are used for enforcing correctness: they write the necessary identity operand, 0, in the sums array, since in the next phase the sums array is going to be handled as int4.
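
The three phases described above (independent block scans, a scan of the block sums, and the addition of the scanned sums to the blocks) can be illustrated with a CPU reference for one row. This is a sketch, not the library's implementation: the function names are hypothetical, the block length is a free parameter here rather than being derived from the work-group size, and it assumes the block sums are scanned exclusively so that block k receives the total of blocks 0..k-1.

```c
#include <assert.h>
#include <stddef.h>

/* Exclusive scan of one block, in place. Returns the block total. */
int scan_block(int *block, size_t len)
{
    int running = 0;
    for (size_t i = 0; i < len; ++i) {
        int v = block[i];
        block[i] = running;
        running += v;
    }
    return running;
}

/* Exclusive scan of one row of n elements, processed in blocks.
 * sums is scratch space for n / block_len ints.
 * Assumption: n is a multiple of block_len. */
void blocked_exclusive_scan_row(int *row, size_t n, size_t block_len, int *sums)
{
    assert(block_len > 0 && n % block_len == 0);
    size_t blocks = n / block_len;

    /* Phase 1: scan every block independently and record its sum. */
    for (size_t b = 0; b < blocks; ++b)
        sums[b] = scan_block(row + b * block_len, block_len);

    /* Phase 2: scan the block sums. */
    scan_block(sums, blocks);

    /* Phase 3: add the scanned sums to the blocks. Block 0 is skipped
     * because nothing precedes it. */
    for (size_t b = 1; b < blocks; ++b)
        for (size_t i = 0; i < block_len; ++i)
            row[b * block_len + i] += sums[b];
}
```

With the row 1, 2, ..., 8 and a block length of 4, phase 1 produces the block scans 0, 1, 3, 6 and 0, 5, 11, 18 with sums 10 and 26; after phases 2 and 3 the row holds the full exclusive scan 0, 1, 3, 6, 10, 15, 21, 28.
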
Parameters
[in] in input array of int elements.
[out] out (scan per work-group) output array of int elements.
[in] data local buffer. Its size should be 2 int elements for each work-item in a work-group. That is, $$2*lXdim*sizeof\ (int)$$.
[out] sums array of block sums. Each work-group outputs the sum of its elements. Its size should be $$M \times wgXdim$$.
[in] n the number of elements in a row of the array divided by 4.

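
The constraints from the note and the parameter list can be combined into one host-side calculation. The following is a sketch under assumptions: `ScanLaunch` and `scan_launch` are hypothetical names, the work-group count is chosen as the smallest value that satisfies $$gXdim \geq N/8$$ (other valid choices exist), and the host `int` is taken to have the same size as the device `int` (real host code would normally use `cl_int`).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type (not part of the library). */
typedef struct {
    size_t global[2];    /* gXdim, gYdim */
    size_t local[2];     /* lXdim, 1 */
    size_t wg_x;         /* wgXdim: work-groups per row */
    size_t local_bytes;  /* size of the local buffer `data` */
    size_t sums_bytes;   /* size of the `sums` buffer */
    unsigned int n;      /* kernel argument n = N / 4 */
} ScanLaunch;

/* cols: elements per row (N), a multiple of 4.
 * rows: number of rows (M).
 * lx:   local x size (lXdim), a power of 2. */
ScanLaunch scan_launch(size_t cols, size_t rows, size_t lx)
{
    assert(cols > 0 && cols % 4 == 0);
    assert(rows > 0);
    assert(lx > 0 && (lx & (lx - 1)) == 0);

    /* Each work-item handles 8 ints, so ceil(N / 8) work-items are needed. */
    size_t items = (cols + 7) / 8;
    size_t wg_x = (items + lx - 1) / lx;

    /* With multiple work-groups, round wgXdim up to a multiple of 4. */
    if (wg_x > 1)
        wg_x = (wg_x + 3) / 4 * 4;

    ScanLaunch s;
    s.global[0] = wg_x * lx;
    s.global[1] = rows;
    s.local[0] = lx;
    s.local[1] = 1;
    s.wg_x = wg_x;
    s.local_bytes = 2 * lx * sizeof(int);       /* 2 ints per work-item */
    s.sums_bytes = rows * wg_x * sizeof(int);   /* M x wgXdim ints */
    s.n = (unsigned int)(cols / 4);
    return s;
}
```

For a 3 x 1024 array with a local size of 64, 128 work-items are needed per row, which gives 2 work-groups, rounded up to 4; the global workspace is therefore 256 x 3. For a 1 x 256 array a single work-group of 64 work-items suffices and no rounding is applied.
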
 kernel void inclusiveScan_i ( global int4 * in, global int4 * out, local int * data, global int * sums, uint n )

Performs an inclusive scan operation on the columns of an array.

The parallel scan algorithm by Blelloch is implemented.

Note
When there are multiple rows in the array, a scan operation is performed per row, in parallel.
The number of elements, N, in a row of the array should be a multiple of 4 (the data are handled as int4). The x dimension of the global workspace, $$gXdim$$, should be greater than or equal to the number of elements in a row of the array divided by 8. That is, $$gXdim \geq N/8$$. Each work-item handles 8 int (= 2 int4) elements in a row of the array. The y dimension of the global workspace, $$gYdim$$, should be equal to the number of rows, M, in the array. That is, $$gYdim = M$$. The local workspace should be 1 in the y dimension, and a power of 2 in the x dimension. It is recommended to use one wavefront/warp per work-group.
When the number of elements per row of the array is small enough to be handled by a single work-group, the output array will contain the true scan result. When a row has more elements than that, it is partitioned into blocks that are scanned independently. In this case, the kernel outputs the result of each block scan. A scan should then be performed on the block sums of each row. Finally, the results of that block-sums scan should be added to the corresponding blocks. For the case of multiple work-groups, the number of work-groups in the x dimension, $$wgXdim$$, should be made a multiple of 4. The potential extra work-groups are used for enforcing correctness: they write the necessary identity operand, 0, in the sums array, since in the next phase the sums array is going to be handled as int4.
Parameters
[in] in input array of int elements.
[out] out (scan per work-group) output array of int elements.
[in] data local buffer. Its size should be 2 int elements for each work-item in a work-group. That is, $$2*lXdim*sizeof\ (int)$$.
[out] sums array of block sums. Each work-group outputs the sum of its elements. Its size should be $$M \times wgXdim$$.
[in] n the number of elements in a row of the array divided by 4.
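
The difference between the inclusive and the exclusive variant is easiest to see in a sequential CPU reference, which can also serve to check kernel output. This is a sketch under assumptions: `scan_rows` is a hypothetical name, and the array is assumed to be stored row-major with each row scanned independently, as the note describes.

```c
#include <assert.h>
#include <stddef.h>

/* Scans each row of a rows x cols array independently.
 * inclusive != 0: out[i] is the sum of in[0..i]   (includes in[i]).
 * inclusive == 0: out[i] is the sum of in[0..i-1] (starts at 0).
 * Assumptions: row-major storage; in and out do not alias. */
void scan_rows(const int *in, int *out, size_t rows, size_t cols, int inclusive)
{
    assert(in != out);
    for (size_t r = 0; r < rows; ++r) {
        int running = 0;  /* each row starts from the identity, 0 */
        for (size_t c = 0; c < cols; ++c) {
            size_t idx = r * cols + c;
            if (inclusive) {
                running += in[idx];
                out[idx] = running;
            } else {
                out[idx] = running;
                running += in[idx];
            }
        }
    }
}
```

For the 2 x 4 input with rows 1, 2, 3, 4 and 5, 6, 7, 8, the inclusive scan gives 1, 3, 6, 10 and 5, 11, 18, 26, while the exclusive scan gives 0, 1, 3, 6 and 0, 5, 11, 18.
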