Safe Haskell | None |
---|---|
Language | Haskell2010 |

Perform a restricted form of register tiling corresponding to
the following pattern:
* a stream is perfectly nested inside a kernel with at least
  three parallel dimensions (the perfectly nested restriction
  can be relaxed a bit);
* all streamed arrays are one dimensional;
* all streamed arrays are variant to exactly one of the three
  innermost parallel dimensions, and conversely, for each of
  the three innermost parallel dimensions, there is at least
  one streamed array variant to it;
* the stream's result is a tuple of scalar values, which are
  also the "thread-in-space" return of the kernel.

Target code can be found in "tests/reg-tiling/reg-tiling-3d.fut".
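The variance requirement in the pattern above can be illustrated with a minimal sketch. It is not the pass's actual code: the `Dim` type, the `Variance` association list and the name `tileable3D` are hypothetical stand-ins, and the sketch assumes the variance information has already been computed by some analysis of the kernel body.

```haskell
import Data.List (nub, sort)

-- | Hypothetical stand-in for the three innermost parallel dimensions.
data Dim = DimX | DimY | DimZ
  deriving (Eq, Ord, Show)

-- | For each streamed array (by name), the innermost dimensions it is
-- variant to. Assumed to be produced by a separate variance analysis.
type Variance = [(String, [Dim])]

-- | True when every streamed array is variant to exactly one dimension
-- and every one of the three dimensions is covered by at least one array.
tileable3D :: Variance -> Bool
tileable3D vs =
  all ((== 1) . length . nub . snd) vs
    && sort (nub (concatMap snd vs)) == [DimX, DimY, DimZ]
```

For instance, three arrays variant to `DimZ`, `DimY` and `DimX` respectively satisfy the check, while dropping any one of them, or making one array variant to two dimensions, does not.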

# Documentation

doRegTiling3D :: Stm Kernels -> TileM (Maybe (Stms Kernels, Stm Kernels))

Expects a kernel statement as argument.
CONDITIONS for the 3D tiling optimization to fire are:
1. a) The kernel body can be broken into
      scalar-code-1 ++ [GroupStream stmt] ++ scalar-code-2.
   b) The kernel has a "ThreadsReturn ThreadsInSpace" result,
      and obviously the result is variant to the 3rd dimension
      (counted from innermost to outermost).
2. For the GroupStream (morally StreamSeq):
   a) the arrays' outer size must equal the maximal chunk size;
   b) the streamed arrays are one dimensional;
   c) each of the array arguments of GroupStream is variant
      to exactly one of the three innermost parallel dimensions
      of the kernel. This condition can be relaxed by interchanging
      kernel dimensions whenever possible.
3. For scalar-code-1:
   a) each of the statements is a slice that produces one of the
      streamed arrays.
4. For simplicity, assume scalar-code-2 is empty!
   (To be extended later.)
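Conditions 1a, 3a and 4 amount to a structural match on the kernel body. The following is a simplified sketch of that match, not the real implementation: the `Stmt` type and `matchBody` are hypothetical, whereas the actual pass inspects `Stm Kernels` values.

```haskell
-- | Hypothetical, heavily simplified statement type.
data Stmt
  = Slice String    -- a slice binding the named array
  | Scalar String   -- any other statement, identified by the name it binds
  | Stream [String] -- the stream, with the names of its array arguments
  deriving (Eq, Show)

isStream :: Stmt -> Bool
isStream (Stream _) = True
isStream _ = False

-- | Split a body into (scalar-code-1, stream). Succeeds only if
-- scalar-code-2 is empty (condition 4) and every statement in
-- scalar-code-1 is a slice producing a streamed array (condition 3a).
matchBody :: [Stmt] -> Maybe ([Stmt], Stmt)
matchBody stms =
  case break isStream stms of
    (code1, [s@(Stream arrs)])
      | all (producesOneOf arrs) code1 -> Just (code1, s)
    _ -> Nothing
  where
    producesOneOf arrs (Slice v) = v `elem` arrs
    producesOneOf _ _ = False
```

Using `break` keeps the split to a single pass over the body; any statement after the stream, or any non-slice statement before it, makes the match fail.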
ASSUME the initial kernel is (as in tests/reg-tiling/reg-tiling-3d.fut):

    kernel map(num groups: num_groups, group size: group_size,
               num threads: num_threads, global TID -> global_tid,
               local TID -> local_tid, group ID -> group_id)
              (gtid_z < size_z, gtid_y < size_xy, gtid_x < size_xy) : {f32} {
      let {[size_com]f32 flags} = empty_or_match_cert_6685fss_6664[gtid_z, 0i32:+size_com*1i32]
      let {[size_com]f32 ass} = ass_6662[gtid_y, 0i32:+size_com*1i32]
      let {[size_com]f32 bss} = res_6687[gtid_x, 0i32:+size_com*1i32]
      let {f32 res_ker} =
        stream(size_com, size_com,
               fn (int chunk_size_out, int chunk_offset_6736, f32 acc_out,
                   [chunk_size_out]f32 flags_chunk_out,
                   [chunk_size_out]f32 ass_chunk_out,
                   [chunk_size_out]f32 bss_chunk_out) =>
                 let {f32 res_out} =
                   stream(chunk_size_out, 1i32,
                          fn (int chunk_size_in, int i_6743, f32 acc_in,
                              [chunk_size_in]f32 flags_chunk_in,
                              [chunk_size_in]f32 ass_chunk_in,
                              [chunk_size_in]f32 bss_chunk_in) =>
                            let {f32 f} = flags_chunk_in[0i32]
                            let {f32 a} = ass_chunk_in[0i32]
                            let {f32 b} = bss_chunk_in[0i32]
                            let {bool cond} = lt32(f, 9.0f32)
                            let {f32 tmp} =
                              if cond
                              then {
                                let {f32 tmp1} = fmul32(a, b)
                                in {tmp1}
                              } else {0.0f32}
                            let {f32 res_in} = fadd32(acc_in, tmp)
                            in {res_in},
                          {acc_out},
                          flags_chunk_out, ass_chunk_out, bss_chunk_out)
                 in {res_out},
               {0.0f32},
               flags, ass, bss)
      return {thread in space returns res_ker}
    }
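As a reading aid, the value computed by the kernel above can be written as a plain Haskell function over lists. This is only a reference for the semantics any tiled version would have to preserve, under the assumption that the three two-dimensional inputs are given as lists of rows; the name `kernelRef` and the list representation are illustrative and do not appear in the compiler.

```haskell
-- | Reference semantics of the kernel: for every (z, y, x), accumulate
-- a * b over the common dimension wherever the flag is below 9.0.
-- The result is indexed as res !! z !! y !! x.
kernelRef :: [[Float]] -> [[Float]] -> [[Float]] -> [[[Float]]]
kernelRef fss ass bss =
  [ [ [ foldl (+) 0.0 (zipWith3 step fRow aRow bRow)
      | bRow <- bss -- innermost dimension, gtid_x
      ]
    | aRow <- ass -- middle dimension, gtid_y
    ]
  | fRow <- fss -- third dimension, gtid_z
  ]
  where
    -- Mirrors the body of the inner stream: lt32, fmul32, else 0.0.
    step f a b = if f < 9.0 then a * b else 0.0
```

Each of the three rows is reused across the other two dimensions, which is the redundancy that register tiling exploits.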