Revision ed529e04bb181185dd68abc8681929c1cb72959c authored by Andrew Adams on 29 November 2020, 22:07:28 UTC, committed by Andrew Adams on 29 November 2020, 22:07:28 UTC
When we codegen something like f[ramp(x + 1, 2, 16)], where f is an internal allocation, we subtract the 1, do the dense load f[ramp(x, 1, 32)], and then take the odd lanes of the result. The reason is that there is likely an f[ramp(x, 2, 16)] nearby, and aligning x + 1 down to x lets both accesses share the dense load and just deinterleave it. This PR does the same when there's no x, just an odd constant, so cases like f[ramp(64, 2, 16)] + f[ramp(65, 2, 16)] now generate much better assembly. In one case I have, it speeds up an entire pipeline by 8%, because aligning the loads in this way causes them all to be promoted off the stack into registers.
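The index arithmetic in the commit message can be checked with a small simulation. This is only an illustrative sketch in plain Python, not Halide's codegen: `ramp`, `strided_load`, and `aligned_stride2_load` are hypothetical helper names, and a Python list stands in for the internal allocation f.

```python
def ramp(base, stride, lanes):
    """Indices base, base + stride, ... (lanes entries), as in ramp(base, stride, lanes)."""
    return [base + stride * i for i in range(lanes)]


def strided_load(f, base, stride, lanes):
    """Reference behaviour: gather f at each index of the ramp."""
    return [f[i] for i in ramp(base, stride, lanes)]


def aligned_stride2_load(f, base, lanes):
    """Hypothetical model of the rewrite: load f[ramp(base, 2, lanes)] by
    aligning an odd base down by one, doing one dense load of 2 * lanes
    elements, and keeping every other lane."""
    shift = base % 2                  # 1 for an odd base, 0 for an even one
    aligned = base - shift            # e.g. 65 -> 64
    dense = [f[i] for i in ramp(aligned, 1, 2 * lanes)]
    return dense[shift::2]            # odd lanes if we shifted, else even lanes


# Stand-in for the internal allocation (assumed large enough for the dense load).
f = list(range(1000, 1200))

even = aligned_stride2_load(f, 64, 16)   # f[ramp(64, 2, 16)]
odd = aligned_stride2_load(f, 65, 16)    # f[ramp(65, 2, 16)]
total = [a + b for a, b in zip(even, odd)]
```

Both calls read the same dense window, indices 64 through 95, which is why the two strided loads can share a single dense load followed by a deinterleave.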
1 parent bfbfacd
File | Mode | Size |
---|---|---|
.github | ||
apps | ||
cmake | ||
dependencies | ||
doc | ||
packaging | ||
python_bindings | ||
src | ||
test | ||
tools | ||
tutorial | ||
util | ||
.clang-format | -rw-r--r-- | 1.5 KB |
.clang-format-ignore | -rw-r--r-- | 265 bytes |
.clang-tidy | -rw-r--r-- | 1.8 KB |
.gitattributes | -rw-r--r-- | 342 bytes |
.gitignore | -rw-r--r-- | 1.1 KB |
.gitmodules | -rw-r--r-- | 0 bytes |
CMakeLists.txt | -rw-r--r-- | 4.5 KB |
CODE_OF_CONDUCT.md | -rw-r--r-- | 3.5 KB |
LICENSE.txt | -rw-r--r-- | 3.2 KB |
Makefile | -rw-r--r-- | 94.7 KB |
README.md | -rw-r--r-- | 20.6 KB |
README_cmake.md | -rw-r--r-- | 67.0 KB |
README_rungen.md | -rw-r--r-- | 12.1 KB |
README_webassembly.md | -rw-r--r-- | 7.5 KB |
run-clang-format.sh | -rwxr-xr-x | 1.1 KB |
run-clang-tidy.sh | -rwxr-xr-x | 2.8 KB |