https://github.com/shader-slang/slang
Revision e5d0f3360f44a4cdd2390e7817db17bb3cc0dd04 authored by Tim Foley on 26 May 2020, 22:11:22 UTC, committed by GitHub on 26 May 2020, 22:11:22 UTC
* Synthesize "active mask" for CUDA (#1352)

The Big Picture
===============

The most important change here is to `hlsl.meta.slang`, where the declaration of `WaveGetActiveMask()` is changed so that instead of mapping to `__activemask()` on CUDA (which is semantically incorrect) it maps to a dedicated IR instruction.

The other `WaveActive*()` intrinsics that make use of the implicit "active mask" concept had already been changed in #1336 so that they explicitly translate to call the equivalent `WaveMask*()` intrinsic with the result of `WaveGetActiveMask()`. As a result, all of the `WaveActive*()` functions are now no different from a user-defined function that uses `WaveGetActiveMask()`.

The bulk of the work in this change goes into an IR pass that replaces the new instruction for getting the active mask with appropriately computed values before we generate output CUDA code. That work is in `slang-ir-synthesize-active-mask.{h,cpp}`.

Utilities
=========

There are a few pieces of code that were helpful in writing the main pass but that can be explained separately:

* IR instructions were added corresponding to the Slang `WaveMaskBallot()` and `WaveMaskMatch()` functions, which map to the CUDA `__ballot_sync()` and `__match_any_sync()` operations, respectively. These are only implemented for the CUDA target because they are only being generated as part of our CUDA-only pass.

* The `IRDominatorTree` type was updated to make it a bit more robust in the presence of unreachable blocks in the CFG. It is possible that the same ends could be achieved more efficiently by folding the corner cases into the main logic, but I went ahead and made things very explicit for now.

* I added an `IREdge` utility type to better encapsulate the way that certain code operating on the predecessors/successors of an `IRBlock` was using an `IRUse*` to represent a control-flow edge. The `IREdge` type makes the logic of those operations more explicit. A future change should probably change `IRBlock::getPredecessors()` and `getSuccessors()` into `getIncomingEdges()` and `getOutgoingEdges()` that work as iterators over `IREdge` values, given the way that the predecessor and successor lists today can contain duplicates.

* Using the above `IREdge` type, the logic for detecting and breaking critical edges was broken down into something that is a bit more clear (I hope), and that also factors out the breaking of an edge (by inserting a block along it) into a reusable subroutine.
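
To make the edge-breaking idea concrete, here is a standalone sketch over a toy CFG. The `Block`/`Edge` types here are simplified stand-ins, not the actual `IRBlock`/`IREdge` types:

```cpp
#include <memory>
#include <vector>

// Toy CFG block, standing in for IRBlock.
struct Block
{
    std::vector<Block*> succs;
    std::vector<Block*> preds;
};

// Like IREdge: identify an edge by its source block plus successor slot,
// so parallel edges between the same pair of blocks stay distinct.
struct Edge
{
    Block* from;
    size_t succIndex;
    Block* to() const { return from->succs[succIndex]; }
};

// An edge is critical when its source has multiple successors and its
// target has multiple predecessors.
static bool isCriticalEdge(const Edge& e)
{
    return e.from->succs.size() > 1 && e.to()->preds.size() > 1;
}

// Break an edge by inserting a fresh block along it; returns that block.
static Block* breakEdge(const Edge& e, std::vector<std::unique_ptr<Block>>& arena)
{
    Block* to = e.to();
    arena.push_back(std::make_unique<Block>());
    Block* mid = arena.back().get();

    // Redirect: from -> mid -> to.
    e.from->succs[e.succIndex] = mid;
    mid->preds.push_back(e.from);
    mid->succs.push_back(to);
    for (auto& p : to->preds)
        if (p == e.from) { p = mid; break; }
    return mid;
}
```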

The Main Pass
=============

The implementation of the new pass is in `slang-ir-synthesize-active-mask.cpp`, and that file attempts to include enough comments to make the logic clear. A brief summary for the benefit of the commit history:

* The first order of business is to identify functions that need to have the active mask value piped into them, and to add an additional parameter to them so that the active mask is passed down explicitly. Call sites are adjusted to pass down the active mask, which can in turn cause new functions to be identified as needing it.

* The next challenge, for a function that uses the active mask, is to compute the active mask value to use in each basic block. The entry block can simply use the active mask value that was passed in, while other blocks need more work.

* When doing a conditional branch, we can compute the new mask for the block we branch to as a function of the existing mask and the branch condition. E.g., the value `WaveMaskBallot(existingMask, condition)` can be used as the mask for the "then" block of an `if` statement.

* When control flow paths need to "reconverge" at a point after a structured control-flow statement, we need to insert logic to synchronize and re-build the mask that will execute after the statement, while also excluding any lanes/threads that exited the statement in other ways (e.g., an early `return` from the function).

The explanation here is fairly hand-wavy, but the actual pass works with much crisper definitions, so the code itself should be inspected if you care about the details.

Tests
=====

The tests for the new feature are all under `tests/hlsl-intrinsic/active-mask/`. Most of them stress a single control-flow construct (`if`, `switch`, or loop) and write out the value of `WaveGetActiveMask()` at various points in the code.

In practice, our definition of the active mask doesn't always agree with what D3D/Vulkan implementations seem to produce, and as a result a certain amount of effort has gone into adding tweaks to the tests that force them to produce the expected output on existing graphics APIs. These tweaks usually amount to introducing conditional branches that aren't actually conditional in practice (the branch condition is always `true` or always `false` at runtime), in order to trick some simplistic analysis approaches that downstream compilers seem to employ.
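
The flavor of those tweaks can be shown with a C++ stand-in (the actual tests are Slang shaders; `opaque` here is a hypothetical device for hiding a constant from the optimizer):

```cpp
#include <cstdint>

// A value that is always zero at runtime, but written so that a simplistic
// constant-folding analysis cannot prove it: reading through a volatile
// forces the compiler to treat the value as unknown.
static volatile int opaque = 0;

static bool alwaysTrueButOpaque()
{
    return opaque == 0;
}

// A branch on such a condition never actually diverges, but a downstream
// compiler that can't see that must keep the branch and its reconvergence
// point, which is roughly what the test tweaks aim for.
static int tweakedTest(int x)
{
    if (alwaysTrueButOpaque())
        x += 1;
    return x;
}
```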

One test case currently fails on our CUDA target (`switch-trivial-fallthrough.slang`) and has been disabled. This is an expected failure, because making it produce the expected value requires a bit of detailed/careful coding that would add a lot of additional complexity to this change. It seemed better to leave that as future work.

Future Work
===========

* As discussed under "Tests" above, the handling of simple `switch` statements in the current pass is incomplete.

* There's an entire can of worms to be dealt with around the handling of fall-through for `switch`.

* The current work also doesn't handle `discard` statements, which is unimportant right now (CUDA doesn't have fragment shaders), but might matter if we decide to synthesize masks for other targets. Similar work would probably be needed if we ever have `throw` or other non-local control flow that crosses function boundaries.

* An important optimization opportunity is being left on the floor in this change. When the block that comes "after" a structured control-flow region (which is encoded explicitly in Slang IR and SPIR-V) post-dominates the entry block of the region, then we know that the active mask when exiting the region must be the same as the mask when entering the region, and there is no need to insert explicit code to cause "re-convergence." This should be addressed in a follow-on change once we add code to Slang for computing a post-dominator tree from a function CFG.

* Related to the above, the decision-making around whether a basic block "needs" the active mask is perhaps too conservative, since it decides that any block that precedes one needing the active mask also needs it. This isn't true in cases where the active mask for a merge block can be inferred by post-dominance (as described above), so that the blocks that branch to it don't need to compute an active mask at all.

* If/when we extend the CPU target to support these operations (along with SIMD code generation, I assume), we will also need to synthesize an active mask on that platform, but the approach taken here (which pretty much relies on support for CUDA "cooperative groups") wouldn't seem to apply in the SIMD case.

* Similarly, the approach taken to computing the active mask here requires a new enough CUDA SM architecture version to support explicit cooperative groups. If we want to run on older CUDA-supporting architectures, we will need a new and potentially very different strategy.

* Because the new pass here changes the signature of functions that require the active mask (and not those that don't), it creates possible problems for generating code that uses dynamic dispatch (via function pointers). In principle, we need to know at a call site whether or not the callee uses the active mask. There are multiple possible solutions to this problem, and they'd need to be worked through before we can make the implicit active mask and dynamic dispatch be mutually compatible.

* Related to changing function signatures: no effort is made in this pass to clean up the IR type of the functions it modifies, so there could technically be mismatches between the IR type of a function and its actual signature. If/when this causes problems for downstream passes we probably need to do some cleanup.

* fixup: backslash-escaped lines

I did some "ASCII art" sorts of diagrams to explain cases in the CFG, and some of those diagrams used backslash (`\`) characters as the last character on the line, causing them to count as escaped newlines for C/C++.

The gcc compiler apparently balked at those lines, since they made some of the single-line comments into multi-line comments.

I solved the problem by adding a terminating column of `|` characters at the end of each line that was part of an ASCII art diagram.

* fixup: typos

Co-authored-by: jsmall-nvidia <jsmall@nvidia.com>