https://github.com/halide/Halide
Revision ed529e04bb181185dd68abc8681929c1cb72959c authored by Andrew Adams on 29 November 2020, 22:07:28 UTC, committed by Andrew Adams on 29 November 2020, 22:07:28 UTC
When we codegen something like f[ramp(x + 1, 2, 16)], where f is an
internal allocation, we subtract the 1, do the dense load f[ramp(x, 1,
32)] and then take the odd lanes of the result. The reason for this is
that it's likely that there's an f[ramp(x, 2, 16)] nearby, and aligning
down the x+1 to x means we can share the dense loads and just
deinterleave.
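
The equivalence this relies on can be checked with a small scalar model in plain C++. This is only a sketch of the indexing, not Halide's codegen: `ramp`, `load`, and `odd_lanes` are hypothetical helpers that mimic the meaning of ramp(base, stride, lanes) and of lane selection, and the `int` element type is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: the indices base, base + stride, ... that
// ramp(base, stride, lanes) denotes.
std::vector<int> ramp(int base, int stride, int lanes) {
    std::vector<int> idx(lanes);
    for (int i = 0; i < lanes; i++) idx[i] = base + i * stride;
    return idx;
}

// Scalar stand-in for a vector load: gather f at each index.
std::vector<int> load(const std::vector<int> &f, const std::vector<int> &idx) {
    std::vector<int> out;
    out.reserve(idx.size());
    for (int i : idx) out.push_back(f[i]);
    return out;
}

// Keep lanes 1, 3, 5, ... of a vector.
std::vector<int> odd_lanes(const std::vector<int> &v) {
    std::vector<int> out;
    for (std::size_t i = 1; i < v.size(); i += 2) out.push_back(v[i]);
    return out;
}

// f[ramp(x + 1, 2, 16)] evaluated directly as a strided load.
std::vector<int> strided_direct(const std::vector<int> &f, int x) {
    return load(f, ramp(x + 1, 2, 16));
}

// The same values via the dense load f[ramp(x, 1, 32)] followed by
// taking the odd lanes.
std::vector<int> strided_via_dense(const std::vector<int> &f, int x) {
    return odd_lanes(load(f, ramp(x, 1, 32)));
}
```

For any x with x + 31 in bounds, both functions return the same 16 values. The even lanes of the same dense load are exactly f[ramp(x, 2, 16)], which is why the two strided loads can share it.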

This PR does the same when there's no x and the base is just an odd
constant. This means that cases like f[ramp(64, 2, 16)] + f[ramp(65, 2,
16)] now generate much better assembly. In one case I have, it speeds up
an entire pipeline by 8%, because aligning the loads in this way causes
them all to be promoted off the stack into registers.
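
As a rough illustration of the constant-base case, the scalar C++ sketch below models f[ramp(64, 2, 16)] + f[ramp(65, 2, 16)] in two ways: as two separate strided reads, and as one dense read of 32 elements starting at 64 whose even and odd lanes are added. The function names and the `int` element type are assumptions for illustration. It models only the indexing, and says nothing about what a particular backend emits.

```cpp
#include <array>
#include <vector>

// Two independent strided reads, one from the even base 64 and one
// from the odd base 65.
std::array<int, 16> sum_two_strided(const std::vector<int> &f) {
    std::array<int, 16> out{};
    for (int i = 0; i < 16; i++) {
        out[i] = f[64 + 2 * i] + f[65 + 2 * i];
    }
    return out;
}

// Align the odd base 65 down to 64, so both operands come from the
// single dense read f[ramp(64, 1, 32)]: even lanes plus odd lanes.
std::array<int, 16> sum_shared_dense(const std::vector<int> &f) {
    std::array<int, 32> dense{};
    for (int i = 0; i < 32; i++) dense[i] = f[64 + i];

    std::array<int, 16> out{};
    for (int i = 0; i < 16; i++) {
        out[i] = dense[2 * i] + dense[2 * i + 1];
    }
    return out;
}
```

Both functions compute the same result for any f with at least 96 elements. The second touches memory through one contiguous 32-element read, which is the shape the aligned loads take.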
1 parent bfbfacd
Align the base when doing strided loads from constant addresses