Revision - b101595 - Exhaustively compute max on host for non-monotonic shared memory sizes - origin: https://github.com/halide/Halide

Revision b10159529cb327e056193ce23ec155e9b07560cf authored by Andrew Adams on 25 August 2020, 19:52:18 UTC, committed by Andrew Adams on 25 August 2020, 19:52:18 UTC

Exhaustively compute max on host for non-monotonic shared memory sizes

GPU kernel launches must use the same amount of shared memory per block,
and this has to be computed ahead of time on the host. The expressions
that give the sizes of allocations scheduled compute_at blocks live inside
the kernel, though, and are a function of bounds inference. We therefore have
to take the max of these sizes over all blocks. This is extremely prone
to interval arithmetic being overconservative, because these are extents
computed from a max minus a min, and the max and min are both frequently
correlated with the block variable. This causes a lot of otherwise fine
schedules to fail at runtime with CUDA_ERROR_INVALID_VALUE.
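
To make the overconservatism concrete, here is a minimal sketch in plain C++ (not Halide's actual implementation). The `Interval` struct, the `region_min`/`region_max` formulas, and both function names are hypothetical, chosen only for illustration. Each block reads an 11-element region, but bounding the max and the min independently loses their shared dependence on the block index:

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical closed interval [lo, hi], for illustration only.
struct Interval {
    int lo, hi;
};

// Hypothetical per-block region bounds: block b touches [8*b - 1, 8*b + 9].
// Both bounds are correlated with the block variable b.
inline int region_min(int b) { return 8 * b - 1; }
inline int region_max(int b) { return 8 * b + 9; }

// Interval-arithmetic bound on (max - min + 1) over blocks [0, num_blocks).
// The max and the min are bounded independently, so the fact that they
// move together with b is lost.
inline int interval_extent_bound(int num_blocks) {
    assert(num_blocks > 0);
    Interval max_range = {region_max(0), region_max(num_blocks - 1)};
    Interval min_range = {region_min(0), region_min(num_blocks - 1)};
    return max_range.hi - min_range.lo + 1;
}

// Exact largest extent, found by visiting every block.
inline int exhaustive_extent_bound(int num_blocks) {
    assert(num_blocks > 0);
    int best = 0;
    for (int b = 0; b < num_blocks; b++) {
        best = std::max(best, region_max(b) - region_min(b) + 1);
    }
    return best;
}
```

With 32 blocks, `interval_extent_bound(32)` yields 259 elements while `exhaustive_extent_bound(32)` yields 11. An overestimate of this kind, scaled up to real allocation sizes, is what can push a launch past the device's shared memory limit.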

This PR detects cases where interval arithmetic is going to be
overconservative using is_monotonic, and hoists the computation of
shared memory size to an explicit loop over blocks on the CPU, taking
the max shared allocation size exhaustively. This costs some work on
the CPU, but:

1) A loop over blocks is typically at least 32x fewer iterations than
the loop over pixels
2) This work can overlap with the previous kernel launch on the GPU
still running
3) The alternative is crashing
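
The hoisted host-side loop can be sketched roughly as follows. This is an illustrative plain C++ approximation, not the code the compiler generates; `shared_bytes_for_block`, `max_shared_bytes`, and the clamped region formula are assumptions made up for the example. Clamping to the image edges makes the per-block size non-monotonic in the block index (smaller at both ends, larger in the middle), which is the kind of case the commit message says is detected with is_monotonic:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical per-block shared allocation size in bytes. Block b wants
// the region [8*b - 1, 8*b + 9], clamped to the image [0, width - 1].
// Assumes every block overlaps the image, so hi >= lo.
inline std::size_t shared_bytes_for_block(int b, int width) {
    int lo = std::max(8 * b - 1, 0);
    int hi = std::min(8 * b + 9, width - 1);
    assert(hi >= lo);
    return static_cast<std::size_t>(hi - lo + 1) * sizeof(float);
}

// Explicit loop over blocks on the host, taking the max allocation size
// exhaustively. The result is the single per-block shared memory size
// that would be passed to the kernel launch.
inline std::size_t max_shared_bytes(int num_blocks, int width) {
    assert(num_blocks > 0 && width > 0);
    std::size_t best = 0;
    for (int b = 0; b < num_blocks; b++) {
        best = std::max(best, shared_bytes_for_block(b, width));
    }
    return best;
}
```

For a 256-wide image split into 32 blocks, the first block needs 10 floats, the last needs 9, and the interior blocks need 11, so `max_shared_bytes(32, 256)` returns `11 * sizeof(float)`. The loop runs once per block rather than once per pixel, which is why its cost is small relative to the kernel itself.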

This feature has proved to make GPU schedules much more robust in the
gpu autoscheduler branch, so I think we should promote it to master.

It's a bit wild though, because this is the first instance I can think
of where we inject a new unscheduled loop for some bounds inference
purpose.

1 parent 211a4ef

File	Mode	Size
.github
apps
cmake
dependencies
doc
packaging
python_bindings
src
test
tools
tutorial
util
.clang-format	-rw-r--r--	1.5 KB
.clang-format-ignore	-rw-r--r--	265 bytes
.clang-tidy	-rw-r--r--	469 bytes
.gitattributes	-rw-r--r--	342 bytes
.gitignore	-rw-r--r--	1.1 KB
.gitmodules	-rw-r--r--	0 bytes
CMakeLists.txt	-rw-r--r--	4.4 KB
CODE_OF_CONDUCT.md	-rw-r--r--	3.5 KB
LICENSE.txt	-rw-r--r--	3.2 KB
Makefile	-rw-r--r--	91.9 KB
README.md	-rw-r--r--	19.1 KB
README_cmake.md	-rw-r--r--	12.3 KB
README_rungen.md	-rw-r--r--	12.1 KB
README_webassembly.md	-rw-r--r--	7.5 KB
