Revision history - refs/heads/abadams/rounding_shift_right_use_average - origin: https://github.com/halide/Halide

Revision	Author	Date	Message	Commit Date
357a12a	Andrew Adams	13 December 2021, 16:37:12 UTC	Address review comments	13 December 2021, 16:37:12 UTC
ddde625	Andrew Adams	11 December 2021, 15:42:32 UTC	rounding shift rights should use rounding halving add On x86 currently we lower cast<uint8_t>((cast<uint16_t>(x) + 8) / 16) to: cast<uint8_t>(shift_right(widening_add(x, 8), 4)) This compiles to 8 instructions on x86: Widen each half of the input vector, add 8 to each half-vector, shift each half-vector, then narrow each half-vector. First, this should have been a rounding_shift_right. Some patterns were missing in FindIntrinsics. Second, rounding_shift_right had suboptimal codegen in the case where the second arg is a positive const. On archs without a rounding shift right instruction you can further rewrite this to: shift_right(rounding_halving_add(x, 7), 3) which is just two instructions on x86.	11 December 2021, 15:42:32 UTC
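Note on ddde625: the rewrite described in this message relies on the two forms computing the same value for every 8-bit input. The sketch below is a scalar model only, not Halide code. It assumes that rounding_halving_add(a, b) means (a + b + 1) / 2 computed without overflow, which is also what an x86 byte-average instruction such as pavgb computes. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Reference form from the commit message: widen to 16 bits, add 8, divide by 16.
inline uint8_t widened_round_div16(uint8_t x) {
    return static_cast<uint8_t>((static_cast<uint16_t>(x) + 8) / 16);
}

// Assumed scalar model of a rounding halving add: (a + b + 1) / 2,
// evaluated in a wider type so the sum cannot overflow.
inline uint8_t rounding_halving_add_u8(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>((static_cast<unsigned>(a) + b + 1) >> 1);
}

// Rewritten form: one rounding halving add, then a shift right by 3.
inline uint8_t halving_round_div16(uint8_t x) {
    return static_cast<uint8_t>(rounding_halving_add_u8(x, 7) >> 3);
}

// Exhaustively compare both forms over all 256 possible inputs.
inline bool forms_agree_for_all_u8() {
    for (unsigned x = 0; x < 256; ++x) {
        const uint8_t v = static_cast<uint8_t>(x);
        if (widened_round_div16(v) != halving_round_div16(v)) {
            return false;
        }
    }
    return true;
}

// Small self-check that can be called from a test or from main().
inline void check_rounding_shift_rewrite() {
    assert(forms_agree_for_all_u8());
}
```

The identity holds because (x + 7 + 1) / 2 followed by a division by 8 is the same as dividing x + 8 by 16 when both divisions round down.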
11448b2	Steven Johnson	10 December 2021, 19:31:01 UTC	Document the usage of llvm::legacy::PassManager (#6491) * Document the usage of llvm::legacy::PassManager There is some confusion about whether this usage is acceptable. TL;DR: it's not just acceptable, it's required for the foreseeable future. Add comments to capture this to avoid future such questions. (With great thanks to Alina for pointing me at the relevant LLVM discussion links!) * Add date	10 December 2021, 19:31:01 UTC
7fe1e2c	Andrew Adams	10 December 2021, 15:06:30 UTC	Let lerp lowering incorporate a final cast. (#6480) * Let lerp lowering incorporate a final cast This lets it save a few instructions on x86 and arm. cast(UInt(16), lerp(some_u8s)) produces the following, before and after this PR Before: x86: vmovdqu (%r15,%r13), %xmm4 vpmovzxbw -2(%r15,%r13), %ymm5 vpxor %xmm0, %xmm4, %xmm6 vpmovzxbw %xmm6, %ymm6 vpmovzxbw -1(%r15,%r13), %ymm7 vpmullw %ymm6, %ymm5, %ymm5 vpmovzxbw %xmm4, %ymm4 vpmullw %ymm4, %ymm7, %ymm4 vpaddw %ymm4, %ymm5, %ymm4 vpaddw %ymm1, %ymm4, %ymm4 vpmulhuw %ymm2, %ymm4, %ymm4 vpsrlw $7, %ymm4, %ymm4 vpand %ymm3, %ymm4, %ymm4 vmovdqu %ymm4, (%rbx,%r13,2) addq $16, %r13 decq %r10 jne .LBB0_10 arm: ldr q0, [x17] ldur q2, [x17, #-1] ldur q1, [x17, #-2] subs x0, x0, #1 // =1 mvn v3.16b, v0.16b umull v4.8h, v2.8b, v0.8b umull2 v0.8h, v2.16b, v0.16b umlal v4.8h, v1.8b, v3.8b umlal2 v0.8h, v1.16b, v3.16b urshr v1.8h, v4.8h, #8 urshr v2.8h, v0.8h, #8 raddhn v1.8b, v1.8h, v4.8h raddhn v0.8b, v2.8h, v0.8h ushll v0.8h, v0.8b, #0 ushll v1.8h, v1.8b, #0 add x17, x17, #16 // =16 stp q1, q0, [x18, #-16] add x18, x18, #32 // =32 b.ne .LBB0_10 After: x86: vpmovzxbw -2(%r15,%r13), %ymm3 vmovdqu (%r15,%r13), %xmm4 vpxor %xmm0, %xmm4, %xmm5 vpmovzxbw %xmm5, %ymm5 vpmullw %ymm5, %ymm3, %ymm3 vpmovzxbw -1(%r15,%r13), %ymm5 vpmovzxbw %xmm4, %ymm4 vpmullw %ymm4, %ymm5, %ymm4 vpaddw %ymm4, %ymm3, %ymm3 vpaddw %ymm1, %ymm3, %ymm3 vpmulhuw %ymm2, %ymm3, %ymm3 vpsrlw $7, %ymm3, %ymm3 vmovdqu %ymm3, (%rbp,%r13,2) addq $16, %r13 decq %r10 jne .LBB0_10 arm: ldr q0, [x17] ldur q2, [x17, #-1] ldur q1, [x17, #-2] subs x0, x0, #1 // =1 mvn v3.16b, v0.16b umull v4.8h, v2.8b, v0.8b umull2 v0.8h, v2.16b, v0.16b umlal v4.8h, v1.8b, v3.8b umlal2 v0.8h, v1.16b, v3.16b ursra v4.8h, v4.8h, #8 ursra v0.8h, v0.8h, #8 urshr v1.8h, v4.8h, #8 urshr v0.8h, v0.8h, #8 add x17, x17, #16 // =16 stp q1, q0, [x18, #-16] add x18, x18, #32 // =32 b.ne .LBB0_10 So on X86 we skip a pointless and instruction, and 
on ARM we get a rounding add and shift right instead of a rounding narrowing add shift right followed by a widen. * Add test * Fix bug in test * Don't produce out-of-range lerp values	10 December 2021, 15:06:30 UTC
bcfd6af	Steven Johnson	09 December 2021, 23:06:05 UTC	Fail if no_bounds_query specified for HL_JIT_TARGET (#6489) * Fail if no_bounds_query specified for HL_JIT_TARGET JIT requires the use of bounds_query; disabling it will almost certainly fail in JIT mode, either with a confusing assert message, or a crash (if you also specify no_asserts). This adds a more useful failure message. * Update Target.cpp	09 December 2021, 23:06:05 UTC
59118de	Steven Johnson	08 December 2021, 22:12:53 UTC	Deal with Printer::scratch (#6469) (#6472) Instead of trying to optimize every Printer instance to use stack (and failing), move the StackPrinter concept into printer.h directly and require opt-in at the point of compilation to use stack instead of malloc. This PR also does a few other drive-by cleanups: - Ensures that all Printer ctors are explicit - Makes some template aliases to make using (e.g.) ErrorPrinter with a custom buffer size slightly cleaner syntax - Have tracing use the `.str()` method, which already deals with MSAN internally - Make all the Printer data members private - Fix some evil code in opencl.cpp that previously used the now-private data members	08 December 2021, 22:12:53 UTC
d089588	Steven Johnson	06 December 2021, 17:02:21 UTC	Move null check from Printer to halide_string_to_string() The Printer is (currently) usually inlined into every module, so this check is repeated in multiple chunks of code. Since the goal is to avoid crashing when debugging, let's move it to halide_string_to_string() (which will catch all these, and possibly more) and save some code size. (Further improvements in Printer code size on the way; this change seems worthy of considering separately.)	08 December 2021, 19:11:28 UTC
7199e7d	Andrew Adams	08 December 2021, 01:45:10 UTC	Try removing optional buffer added to closure	08 December 2021, 18:53:35 UTC
7992369	Andrew Adams	07 December 2021, 16:16:50 UTC	Add a fast integer divide that rounds to zero (#6455) * Add a version of fast_integer_divide that rounds towards zero * clang-format * Fix test condition * Clean up debugging code * Add explanatory comment to performance test * Pacify clang tidy	07 December 2021, 16:16:50 UTC
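Note on 7992369: the phrase "rounds to zero" only matters when the numerator and denominator have opposite signs and the division is inexact. The sketch below contrasts the two rounding conventions using plain C++ integers. It is not the Halide implementation, and the function names are hypothetical. It assumes the pre-existing fast_integer_divide rounds toward negative infinity, which the commit message implies but does not state.

```cpp
#include <cassert>
#include <cstdint>

// C++11 and later define integer '/' as truncating toward zero.
inline int32_t div_round_to_zero(int32_t a, int32_t b) {
    return a / b;
}

// Division that rounds toward negative infinity (floor division).
inline int32_t div_round_down(int32_t a, int32_t b) {
    int32_t q = a / b;
    // Step down by one when the result is inexact and the signs differ.
    if ((a % b != 0) && ((a < 0) != (b < 0))) {
        --q;
    }
    return q;
}

// Small self-check that can be called from a test or from main().
inline void check_division_rounding() {
    assert(div_round_to_zero(-7, 2) == -3);
    assert(div_round_down(-7, 2) == -4);
}
```

For non-negative numerators and positive denominators the two functions return the same value, so the choice is only visible for signed inputs.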
fb305fd	Roman Lebedev	07 December 2021, 02:15:18 UTC	`apps/linear_algebra/benchmarks/macros.h`: don't forget SSE guard (#6471) This is breaking i386 build: https://buildd.debian.org/status/fetch.php?pkg=halide&arch=i386&ver=13.0.1-3&stamp=1638786518&raw=0	07 December 2021, 02:15:18 UTC
e0df687	Marcos Slomp	06 December 2021, 20:34:44 UTC	decommissioning StackPrinter (#6470)	06 December 2021, 20:34:44 UTC
392430d	Steven Johnson	02 December 2021, 21:23:25 UTC	Fix Closure API (#6464) The current API requires calling a Visitor from the Closure ctor, which means we implicitly call virtual methods from the class ctor, which is a no-no for a non-final class (see comments on https://github.com/halide/Halide/pull/6443).	02 December 2021, 21:23:25 UTC
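Note on 392430d: the problem named in this message is a general C++ rule, shown here with hypothetical classes rather than the real Closure types. While a base-class constructor is running, a virtual call resolves to the base-class version, so an override in a derived class is silently skipped. Moving the call into a separate step after construction is one common way to avoid this. Whether the actual fix took that shape is not stated in the message.

```cpp
#include <cassert>
#include <string>

// Problematic shape: a virtual call made from the base constructor.
struct EagerBase {
    std::string seen;
    EagerBase() { seen = name(); }  // resolves to EagerBase::name() here
    virtual ~EagerBase() = default;
    virtual std::string name() const { return "EagerBase"; }
};

struct EagerDerived : EagerBase {
    std::string name() const override { return "EagerDerived"; }
};

// Alternative shape: construction does nothing virtual, and the caller
// invokes init() once the object is fully constructed.
struct TwoPhaseBase {
    std::string seen;
    virtual ~TwoPhaseBase() = default;
    void init() { seen = name(); }  // normal virtual dispatch applies here
    virtual std::string name() const { return "TwoPhaseBase"; }
};

struct TwoPhaseDerived : TwoPhaseBase {
    std::string name() const override { return "TwoPhaseDerived"; }
};

// Small self-check that can be called from a test or from main().
inline void check_ctor_dispatch() {
    EagerDerived eager;
    assert(eager.seen == "EagerBase");  // the override was not used

    TwoPhaseDerived two_phase;
    two_phase.init();
    assert(two_phase.seen == "TwoPhaseDerived");
}
```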
0ed461b	Steven Johnson	02 December 2021, 18:38:50 UTC	Add operator<< for Closure (#6443) * Add operator<< for Closure Moves the ad-hoc implementation out of HostClosure::arguments() for easier debugging usage. Also, drive-by elimination of the body of HostClosure ctor, which was identical to the one inherited from Closure. * Update DeviceArgument.cpp * Add explanatory comment	02 December 2021, 18:38:50 UTC
5cf9ae5	Andrew Adams	02 December 2021, 15:04:43 UTC	Reduce overhead of sampling profiler by having only one thread do it (#6433) * Reduce overhead of sampling profiler by having only one thread do it * Use const ref * One line per member	02 December 2021, 15:04:43 UTC
479d839	Steven Johnson	02 December 2021, 03:42:04 UTC	Add LinkageType::ExternalPlusArgv (#6452) (#6463) Allows us to skip generating metadata for offloaded hexagon funcs, which will never use it.	02 December 2021, 03:42:04 UTC
4877d26	Steven Johnson	01 December 2021, 19:22:00 UTC	Tweak Hexagon codegen output to match the pattern in Lower.cpp more accurately (for level 1 vs 2); also prefix the outputs so they are easier to read as Hexagon-specific when debugging (#6461)	01 December 2021, 19:22:00 UTC
c0192ff	Steven Johnson	30 November 2021, 06:13:44 UTC	Re-enable performance_async_gpu for D3D12Compute (#6450) * Re-enable performance_async_gpu for D3D12Compute It's been disabled for ~2 years because of flaky failures (#3586); we should see if the many changes since then have improved things or not. * tickle buildbots	30 November 2021, 06:13:44 UTC
5aeea6a	Andrew Adams	26 November 2021, 22:32:24 UTC	Fixes for c++20 (#6446) Fixes #6445	26 November 2021, 22:32:24 UTC
76c0946	Martijn Courteaux	26 November 2021, 20:03:24 UTC	Syntax highlighting for embedded PTX code. (#6447) * Include GPU source kernels in Stmt and StmtHtml file. * Syntax highlighting for embedded PTX code.	26 November 2021, 20:03:24 UTC
3bde22a	Martijn Courteaux	24 November 2021, 20:59:37 UTC	Include GPU source kernels in Stmt and StmtHtml file. (#6444)	24 November 2021, 20:59:37 UTC
8b68f85	Andrew Adams	23 November 2021, 21:13:48 UTC	Avoid needless gather in fast_integer_divide lowering (#6441) * Avoid needless gather in fast_integer_divide lowering fast_integer_divide did two lookups, one for a multiplier, and one for a shift. It turns out you can just use count leading zeros to compute a workable shift instead of having to do a lookup. This PR speeds up use of fast_integer_divide in cases where the denominator varies across vector lanes by ~70% or so by avoiding one of the two expensive gathers. * Fix slash direction * Pacify clang-tidy * Use portable bit-counting methods * Cleaner initialization of tables	23 November 2021, 21:13:48 UTC
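Note on 8b68f85: the idea that a shift amount can come from a leading-zero count rather than a table can be illustrated with the classic multiply-and-shift scheme for unsigned division by a run-time invariant. The sketch below is a generic scalar version of that scheme, not Halide's lowering. In particular it computes the multiplier directly, whereas the commit message says the multiplier is still looked up. All names are hypothetical, and the denominator is assumed to be at least 1.

```cpp
#include <cassert>
#include <cstdint>

// Portable count of leading zero bits in a 32-bit value (returns 32 for 0).
inline int clz32(uint32_t v) {
    int n = 0;
    for (uint32_t bit = 0x80000000u; bit != 0 && (v & bit) == 0; bit >>= 1) {
        ++n;
    }
    return n;
}

struct Divider {
    uint32_t mul;  // multiplier applied with a high 32-bit multiply
    int sh1;       // first shift, 0 or 1
    int sh2;       // second shift, derived from ceil(log2(d))
};

// Precondition: d >= 1.
inline Divider make_divider(uint32_t d) {
    // l = ceil(log2(d)), obtained from a leading-zero count.
    const int l = (d <= 1) ? 0 : 32 - clz32(d - 1);
    const uint64_t pow_l = uint64_t(1) << l;  // l is at most 32
    const uint32_t mul =
        static_cast<uint32_t>(((pow_l - d) << 32) / d + 1);
    return {mul, l < 1 ? l : 1, l > 1 ? l - 1 : 0};
}

// Computes x / d using only a multiply, a subtract, an add and two shifts.
inline uint32_t divide(uint32_t x, const Divider &dv) {
    const uint32_t t =
        static_cast<uint32_t>((uint64_t(dv.mul) * x) >> 32);
    return (t + ((x - t) >> dv.sh1)) >> dv.sh2;
}

// Compare against the built-in division over a grid of small cases.
inline bool all_small_cases_match() {
    for (uint32_t d = 1; d <= 300; ++d) {
        const Divider dv = make_divider(d);
        for (uint32_t x = 0; x <= 2000; ++x) {
            if (divide(x, dv) != x / d) {
                return false;
            }
        }
    }
    return true;
}

// Small self-check that can be called from a test or from main().
inline void check_divider() {
    assert(all_small_cases_match());
}
```

Because the shifts depend only on the position of the highest set bit of the denominator, they need no table, which is the property the commit message exploits to drop one of the two gathers.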
d12fbd1	Steven Johnson	23 November 2021, 17:33:38 UTC	Codegen_C: buffer compilation needs to special-case scalar buffers (#6442) The existing code will emit something like `halide_dimension_t foo_buffer_shape[] = {};` for these, which is a zero-length array, which some compilers will (justifiably) say has no effect. We should be able to just use nullptr for the shape in these cases.	23 November 2021, 17:33:38 UTC
59d6da7	Andrew Adams	23 November 2021, 17:25:47 UTC	Skip custom cuda context test on older GPUs (#6437)	23 November 2021, 17:25:47 UTC
a89041b	Steven Johnson	22 November 2021, 21:29:11 UTC	Ensure that halide_start_clock() is called before halide_current_time_ns() in hexagon_host.cpp (#6438) This oversight was causing an assert with the -debug feature flag enabled (with presumably-misleading timing results as well)	22 November 2021, 21:29:11 UTC
57d1e05	Steven Johnson	22 November 2021, 19:46:52 UTC	Set up SANITIZER_FLAGS and OPTIMIZE for apps/Makefile.inc (#6435) Minor hygiene to make it easy to build AOT apps with TSAN or ASAN.	22 November 2021, 19:46:52 UTC
2239443	Andrew Adams	19 November 2021, 22:56:12 UTC	Do target-specific lowering of lerp (#6432) * Do target-specific lowering of lerp Saves instructions on x86. Before #6426 vpaddw %ymm0, %ymm1, %ymm1 vpsrlw $8, %ymm1, %ymm2 vpaddw %ymm2, %ymm1, %ymm1 vpsrlw $8, %ymm1, %ymm1 After #6426 vpsrlw $7, %ymm2, %ymm3 vpand %ymm0, %ymm3, %ymm3 vpsrlw $8, %ymm2, %ymm4 vpaddw %ymm2, %ymm4, %ymm2 vpaddw %ymm3, %ymm2, %ymm2 vpsrlw $7, %ymm2, %ymm3 vpand %ymm0, %ymm3, %ymm3 vpsrlw $8, %ymm2, %ymm2 vpaddw %ymm2, %ymm3, %ymm2 vpand %ymm1, %ymm2, %ymm2 This PR: vpaddw %ymm0, %ymm3, %ymm3 vpmulhuw %ymm1, %ymm3, %ymm3 vpsrlw $7, %ymm3, %ymm3 * Target is a struct	19 November 2021, 22:56:12 UTC
cfd03c9	Steven Johnson	19 November 2021, 17:41:05 UTC	Don't remap the function name or the target in the metadata (#6430) The remapping is only intended to be used for output argument(s), not the function name; if you have an output with the same name as the function, you can get the metadata emitted with incorrect information. (And remapping the target string is just silly.) This is almost impossible to do currently, but if you construct a Generator just right, you can make it happen.	19 November 2021, 17:41:05 UTC
c3040cb	Volodymyr Kysenko	19 November 2021, 17:10:15 UTC	Rewrite integer lerp using intrinsics (#6426) * Rewrite integer lerp using intrinsics * Comment	19 November 2021, 17:10:15 UTC
0e40edc	Ashish Uthama	18 November 2021, 21:27:53 UTC	Include LICENSE.txt in package (#6428) Co-authored-by: Ashish Uthama <you@example.com>	18 November 2021, 21:27:53 UTC
36dd10f	Steven Johnson	17 November 2021, 23:14:21 UTC	Fix Introspection issues (#6424) - DWARF v5 has a slightly different header; this recognizes it so we don't fail immediately - Add support for the line_strp form - Allow for a graceful failure if a debug abbreviation is missing; I've only seen this when compiling for TSAN, and I'm honestly not entirely sure if this is a bug in the DWARF generation for those tools vs a subtle flaw in our parsing, but bailing out early and skipping introspection seems kinder than assert-fail.	17 November 2021, 23:14:21 UTC
16fa3ce	Steven Johnson	12 November 2021, 23:17:53 UTC	[hannk] Pacify clang-tidy (#6412) * [hannk] Pacify clang-tidy * One more ASAN fix We must use use_global_gc = false to work properly with the JIT * Revert "One more ASAN fix" This reverts commit 9ed07a70b4a656790236a5ff6966155df823a319. * Rework Op::mutate() to avoid UB	12 November 2021, 23:17:53 UTC
b63f6af	Steven Johnson	12 November 2021, 20:56:57 UTC	[hannk] Fix lower_tflite_fullyconnected (#6414) Fixed the bounds calculation in lower_tflite_fullyconnected() to preserve the invariants expected, and added a testcase that previously failed.	12 November 2021, 20:56:57 UTC
8c2dd5f	Steven Johnson	12 November 2021, 20:34:14 UTC	One more ASAN fix (#6413) We must use use_global_gc = false to work properly with the JIT	12 November 2021, 20:34:14 UTC
0153c6b	Steven Johnson	12 November 2021, 16:35:37 UTC	Revamp Hannk IR (#6379) Refactor Hannk IR and transforms to use a Mutator-based approach	12 November 2021, 16:35:37 UTC
79da2a0	Steven Johnson	12 November 2021, 16:34:30 UTC	Fix broken ASAN code (#6408) * Fix broken ASAN code Various changes and merges ended up with us using multiple ASAN passes, which was pretty crashy (we just didn't notice because it isn't tested well enough on our buildbots, but is elsewhere). I think we really only want to use the ModuleAddressSanitizerPass (not the non-Module version), which is what Clang does. * set UseAfterScope = true	12 November 2021, 16:34:30 UTC
02a394d	Steven Johnson	12 November 2021, 03:25:52 UTC	x86_cpuid_halide must preserve all 64 bits of rbx/rsi (#6409) The existing code attempts to preserve ebx (since the cpuid instruction can trash it), but it only preserves the lower 32 bits; on 64-bit systems, this (amazingly) usually works OK unless you are compiling in (e.g.) ASAN mode, which can subtly change codegen such that the full 64 bits of rbx must be preserved. I'm genuinely astonished this hasn't bitten us before now!	12 November 2021, 03:25:52 UTC
d763406	Volodymyr Kysenko	12 November 2021, 01:30:05 UTC	Change implementation of round_f* in CodeGen_C to use nearbyint to match CodeGen_LLVM (#6406)	12 November 2021, 01:30:05 UTC
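Note on d763406: the choice of rounding function matters for values exactly halfway between two integers. The sketch below uses only the C++ standard library and assumes the default floating-point rounding mode (round to nearest, ties to even). It does not show what CodeGen_C emitted before this change, which the message does not say. The wrapper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// std::nearbyint honours the current rounding mode; under the default
// mode, halfway cases go to the nearest even integer.
inline float round_with_nearbyint(float x) {
    return std::nearbyint(x);
}

// std::round always sends halfway cases away from zero.
inline float round_with_round(float x) {
    return std::round(x);
}

// Small self-check that can be called from a test or from main().
inline void check_rounding_functions() {
    assert(round_with_nearbyint(2.5f) == 2.0f);
    assert(round_with_round(2.5f) == 3.0f);
}
```

The two functions agree everywhere except on exact ties, so a mismatch between two code generators would only show up on inputs such as 0.5, 2.5 or -2.5.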
9ff87ce	Steven Johnson	11 November 2021, 18:04:09 UTC	_halide_buffer_crop() needs to check for runtime failures (v2) (#6403) * _halide_buffer_crop() needs to check for runtime failures (v2) (Alternate to #6402) We currently assume that _halide_buffer_crop() will never fail. This is a bad assumption, as it can call device_crop(), which can fail due to unexpected runtime errors, or from a backend simply leaving the device_crop field at the default (unimplemented) case (as is currently the case for the OGLC backend). When this happens, the dst buffer was left in an inconsistent, invalid state (which was what led to the crashes fixed by #6401). This change modifies _halide_buffer_crop() to return nullptr in the event of an error, and ensure that all cropped buffers are checked for null at the right point. (This is not optimal, of course, since the specific error returned by device_crop is getting dropped on the floor, but the existence of an error is no longer ignored.) This addresses at least some of the failure issues we are seeing in performance_async_gpu with the OpenGLCompute backend. (Also: drive-by whitespace fix in CodegenC) * Oops	11 November 2021, 18:04:09 UTC
d343e76	Andrew Adams	11 November 2021, 17:06:00 UTC	Fix obscure bug in widening let substitution (#6405) Fix obscure bug in widening let substitution	11 November 2021, 17:06:00 UTC
8e34a35	Steven Johnson	09 November 2021, 23:10:40 UTC	Remove halide_abort_if_false() usage in runtime/metal (#6398) * Remove halide_abort_if_false() usage in runtime/metal This converts all the usage of `halide_abort_if_false()` in runtime/metal into either an explicit runtime check-and-return-error-code (if the check looks plausible), or `halide_debug_assert()` (if the check seems to be stating an invariant that shouldn't be possible in well-structured code). These changes are admittedly subjective, so feedback is especially welcome. Also, driveby change to sync-common.h to use `halide_debug_assert()` rather than a local equivalent. * nits	09 November 2021, 23:10:40 UTC