Revision ed529e04bb181185dd68abc8681929c1cb72959c authored by Andrew Adams on 29 November 2020, 22:07:28 UTC, committed by Andrew Adams on 29 November 2020, 22:07:28 UTC
When we codegen something like f[ramp(x + 1, 2, 16)], where f is an
internal allocation, we subtract the 1, do the dense load f[ramp(x, 1,
32)] and then take the odd lanes of the result. The reason for this is
that it's likely that there's an f[ramp(x, 2, 16)] nearby, and aligning
down the x+1 to x means we can share the dense loads and just
deinterleave.
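
The equivalence this relies on can be sketched in plain C++ outside of Halide. This is only an illustration under assumptions: `ramp_load`, `odd_lanes`, and `aligned_odd_load` are hypothetical helpers, not Halide functions, and `ramp(base, stride, lanes)` is modeled as the index sequence base, base + stride, and so on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of f[ramp(base, stride, lanes)]: gather f at
// base, base + stride, ..., base + stride * (lanes - 1).
std::vector<int> ramp_load(const std::vector<int> &f, std::size_t base,
                           std::size_t stride, std::size_t lanes) {
    std::vector<int> out(lanes);
    for (std::size_t i = 0; i < lanes; i++) {
        out[i] = f.at(base + stride * i);  // .at() checks bounds
    }
    return out;
}

// Keep lanes 1, 3, 5, ... of an even-width vector.
std::vector<int> odd_lanes(const std::vector<int> &v) {
    assert(v.size() % 2 == 0);
    std::vector<int> out;
    for (std::size_t i = 1; i < v.size(); i += 2) {
        out.push_back(v[i]);
    }
    return out;
}

// f[ramp(x + 1, 2, 16)] expressed as the odd lanes of the dense load
// f[ramp(x, 1, 32)].
std::vector<int> aligned_odd_load(const std::vector<int> &f, std::size_t x) {
    return odd_lanes(ramp_load(f, x, 1, 32));
}
```

Note that the dense form also reads f[x], which the strided form never touches, so this sketch assumes that element is valid to read.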

This PR does the same when there's no x, just an odd constant. This
means that cases like f[ramp(64, 2, 16)] + f[ramp(65, 2, 16)] now
generate much better assembly. In one case I have, it speeds up an entire
pipeline by 8%, because aligning the loads in this way causes them to
all be promoted off the stack into registers.
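
As a rough sketch of the constant case, the sum f[ramp(64, 2, 16)] + f[ramp(65, 2, 16)] can be computed from a single dense load at 64 by adding its even and odd lanes. The function names below are hypothetical and the code models the indexing only, not Halide's actual codegen.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec16 = std::array<int, 16>;

// Direct form: two separate stride-2 gathers starting at 64 and 65.
Vec16 sum_strided(const std::vector<int> &f) {
    assert(f.size() >= 96);  // highest index read is 65 + 2 * 15 = 95
    Vec16 out{};
    for (std::size_t i = 0; i < 16; i++) {
        out[i] = f[64 + 2 * i] + f[65 + 2 * i];
    }
    return out;
}

// Aligned form: 65 is rounded down to 64, so both operands come from
// one dense 32-lane load f[ramp(64, 1, 32)], split into even and odd lanes.
Vec16 sum_deinterleaved(const std::vector<int> &f) {
    assert(f.size() >= 96);
    std::array<int, 32> dense{};
    for (std::size_t i = 0; i < 32; i++) {
        dense[i] = f[64 + i];
    }
    Vec16 out{};
    for (std::size_t i = 0; i < 16; i++) {
        out[i] = dense[2 * i] + dense[2 * i + 1];  // even lane + odd lane
    }
    return out;
}
```

Both functions read exactly the elements 64 through 95; the second form just makes it explicit that one contiguous load covers both operands.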
1 parent bfbfacd
File Mode Size
.github
apps
cmake
dependencies
doc
packaging
python_bindings
src
test
tools
tutorial
util
.clang-format -rw-r--r-- 1.5 KB
.clang-format-ignore -rw-r--r-- 265 bytes
.clang-tidy -rw-r--r-- 1.8 KB
.gitattributes -rw-r--r-- 342 bytes
.gitignore -rw-r--r-- 1.1 KB
.gitmodules -rw-r--r-- 0 bytes
CMakeLists.txt -rw-r--r-- 4.5 KB
CODE_OF_CONDUCT.md -rw-r--r-- 3.5 KB
LICENSE.txt -rw-r--r-- 3.2 KB
Makefile -rw-r--r-- 94.7 KB
README.md -rw-r--r-- 20.6 KB
README_cmake.md -rw-r--r-- 67.0 KB
README_rungen.md -rw-r--r-- 12.1 KB
README_webassembly.md -rw-r--r-- 7.5 KB
run-clang-format.sh -rwxr-xr-x 1.1 KB
run-clang-tidy.sh -rwxr-xr-x 2.8 KB
