https://github.com/halide/Halide

3a24168 Merge branch 'abadams/fix_lossless_cast_of_sub' into srj/lossless-test 05 June 2024, 17:10:56 UTC
360add6 Merge branch 'main' into abadams/fix_lossless_cast_of_sub 05 June 2024, 17:07:21 UTC
69bfd80 Merge branch 'main' into xtensa-codegen 05 June 2024, 16:35:38 UTC
5b5b0c6 Merge branch 'main' into xtensa-codegen 05 June 2024, 16:26:46 UTC
74b9044 It's generally a bad idea for simplifier rules to multiply constants (#8234) Fixes #8227 but may break other things. Needs thorough testing. Also, there are more rules like this lurking. 05 June 2024, 16:24:06 UTC
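The commit above doesn't spell out why constant-multiplying rules are risky; the usual hazard (an inference on my part, not taken from the PR) is that folding two in-range constants can produce an out-of-range one, so a rewrite like (x * c0) * c1 ==> x * (c0 * c1) changes meaning:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Both factors fit comfortably in int32_t...
    int64_t c0 = 65536, c1 = 65536;
    // ...but the folded constant a rule like
    //   (x * c0) * c1  ==>  x * (c0 * c1)
    // would synthesize does not: 65536 * 65536 == 2^32.
    int64_t folded = c0 * c1;
    printf("folded constant = %lld (INT32_MAX = %d)\n",
           (long long)folded, INT32_MAX);
}
```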
c33dbfb Don't introduce out-of-range shifts in lossless_cast 04 June 2024, 23:01:15 UTC
46e866d Report useful error to user if the promise_clamped call fails to losslessly cast. (#8238) Co-authored-by: Steven Johnson <srj@google.com> 04 June 2024, 16:32:54 UTC
775bfbf Python binding support for int64 literals (#8254) This makes >32bit python integers get mapped to `hl.i64` implicitly. Fixes #8224 04 June 2024, 16:31:30 UTC
9c75554 Fix Metal handling for float16 literals (#8260) * Fix Metal handling of float16 from bits, infinity, neg infinity, and nans * Disable test for OpenCL half for now * Formatting 04 June 2024, 15:21:04 UTC
7414ee6 Delete commented-out code 03 June 2024, 20:37:04 UTC
9570818 Fix mul_shift_right expansion 03 June 2024, 19:09:56 UTC
c8f7e8f Fix bugs in lossless_cast 03 June 2024, 18:39:47 UTC
ac5b13d Rework find_mpy_ops to handle more structures 03 June 2024, 18:39:39 UTC
1ef63f7 Merge branch 'main' into xtensa-codegen 03 June 2024, 17:49:00 UTC
bf28e00 Merge remote-tracking branch 'origin/main' into abadams/fix_lossless_cast_of_sub 02 June 2024, 21:46:23 UTC
7ca95d8 Expose BFloat in Python bindings (#8255) There are two parts to support for BFloat16 in Python: 1) Ability to define kernels and AOT compile them [fixed in this PR] 2) Ability to call kernels from Python. This fixes part 1, which is what I need for my use case. Part 2 is blocked on bfloat16 support in Python buffer protocols. See #6849 for more details. 02 June 2024, 21:39:44 UTC
a0f1d23 Add constant interval test 02 June 2024, 21:36:57 UTC
7cf2951 Remove max size assert from Anderson2021 (#8253) Fixes #8252 02 June 2024, 21:34:36 UTC
a9b8fbf Rework the simplifier to use ConstantInterval for bounds (#8222) * Update the simplifier to use ConstantInterval and track the bounds through more types * Move the simplify fuzzer back to a correctness test * Make debug_indent not static. Otherwise it causes a race condition in any parallel tests * Track expr info on non-overflowing casts to int * Delete commented-out code * clang-tidy * Delete unused member * Fix cmakelists for the fuzzer removal * Handle contradictions more gracefully in learn_true. The contradiction was arising from: `if (extent > 0) { ... } else { for (x = 0; x < extent; x++) { /* here we can assume extent > 0, but the if statement says extent <= 0 */ } }` * Better comments * Address review comments * Fix failure to pop loop var info 02 June 2024, 21:33:45 UTC
35143d2 Mark host_dirty() and device_dirty() with no_discard. (#8248) Co-authored-by: Steven Johnson <srj@google.com> 02 June 2024, 21:19:04 UTC
711dc88 Add HVX_v68 target to support Hexagon HVX v68. (#8232) 31 May 2024, 17:53:47 UTC
3ea4747 [xtensa] added support for sqrt_f16 (#8247) 30 May 2024, 17:27:38 UTC
33d5ba9 Fix saturating add matching in associativity checking (#8220) * Fix saturating add matching in associativity checking. The associative ops table defined saturating add as saturating_narrow(widen(x + y)), instead of saturating_narrow(widen(x) + y) 24 May 2024, 19:56:03 UTC
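The distinction in the commit above matters because widening after the add is too late: the narrow add has already wrapped. A self-contained sketch of the difference, assuming uint8 operands (illustrative, not the associative-ops table's actual representation):

```cpp
#include <algorithm>
#include <cstdint>

// For x = 200, y = 100:
//   saturating_narrow(widen(x + y)): x + y wraps to 44 first, so the
//     widened value is already wrong -> 44.
//   saturating_narrow(widen(x) + y): 300 is computed in the wide type,
//     then clamped -> 255, the true saturating add.
uint8_t wrong_pattern(uint8_t x, uint8_t y) {
    uint16_t wide = (uint16_t)(uint8_t)(x + y);  // widen(x + y): wraps first
    return (uint8_t)std::min<uint16_t>(wide, 255);
}

uint8_t right_pattern(uint8_t x, uint8_t y) {
    uint16_t wide = (uint16_t)x + y;             // widen(x) + y: exact
    return (uint8_t)std::min<uint16_t>(wide, 255);
}
```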
b5f5065 Add some EVAL_IN_LAMBDAs to Simplify_Sub.cpp (#8230) Massively reduces compile time and peak cl.exe memory consumption on Windows (from 9.5 GB down to 2.3 GB). Simplify_LT.cpp has these same EVAL_IN_LAMBDAs, which is probably why it hasn't been causing build problems. 23 May 2024, 18:17:49 UTC
8a316d1 [xtensa] Added vector load for two vectors for f16 and f32 (#8226) 23 May 2024, 16:23:09 UTC
17d4351 Merge branch 'main' into xtensa-codegen 20 May 2024, 16:37:14 UTC
e9f8b04 Fix for top-of-tree LLVM (#8223) * Fix for top-of-tree LLVM * Update LLVM_Runtime_Linker.cpp 15 May 2024, 21:43:17 UTC
16d77e9 Fix give-up case in ModulusRemainder (#8221) A default-constructed ModulusRemainder means no information, which is what we want here. ModulusRemainder{0, 1} means the constant one! 15 May 2024, 17:43:34 UTC
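For readers unfamiliar with the representation the commit above relies on, here is a hedged sketch of the convention, inferred from the commit text rather than copied from Halide's ModulusRemainder.h:

```cpp
#include <cstdint>

// ModulusRemainder{m, r} asserts x == m*k + r for some integer k.
//   {1, 0} -> x == k        : true of every x, i.e. no information
//   {0, 1} -> x == 1        : exactly the constant one
//   {4, 2} -> x == 4*k + 2  : e.g. ..., -2, 2, 6, 10, ...
struct ModulusRemainder {
    int64_t modulus = 1;   // default-constructed: no information
    int64_t remainder = 0;
};
```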
211bafa Fix Reinterpret cmp in IREquality (#8217) fix Reinterpret cmp 14 May 2024, 20:15:57 UTC
d61390a [xtensa] Fixed index conversion for gather_load with undefined ramp (#8215) Fixed index conversion for gather_load with undefined ramp 10 May 2024, 17:55:47 UTC
dfaf6ad Insert apparently-missing `break;` in IREquality.cpp (#8211) * Insert apparently-missing `break;` in IREquality.cpp * Enable -Wimplicit-fallthrough * Also add -Wimplicit-fallthrough to runtime builds * Add missing break to runtime/webgpu.cpp * Also add flag to Makefile --------- Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com> 30 April 2024, 15:08:26 UTC
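The class of bug fixed above is easy to reproduce; a minimal illustration of what -Wimplicit-fallthrough catches, and how an intentional fallthrough is annotated:

```cpp
// With -Wimplicit-fallthrough, a case that drops into the next one
// without `break` is flagged unless annotated as intentional.
int lanes_for(int kind) {
    switch (kind) {
    case 0:
        return 1;
    case 1:
        [[fallthrough]];  // deliberate: kinds 1 and 2 share a path
    case 2:
        return 4;
    default:
        return 0;  // an accidental missing `break` above would silently
                   // change which case a value lands in
    }
}
```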
8141197 [x86 & HVX & WASM] Use bounds inference for saturating_narrow instruction selection (#7805) * x86 bounds inference for saturating_narrow * bounds inference for HVX too * use can_represent(ConstantInterval) + clang-format * use bounds inference for WASM IS too + add tests * add tracking issue for scoped constant bounds * add TODO about lossless_cast usage --------- Co-authored-by: Steven Johnson <srj@google.com> 30 April 2024, 13:38:30 UTC
d55d82b Update debug_to_file API to remove type_code (#8183) * Add .npy support to halide_image_io. The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that supports floating-point data more robustly than any of the others that we currently support. This adds load/save support for a useful subset: - We support the int/uint/float types common in Halide (except for f16/bf16 for now) - We don't support reading or writing files that are in `fortran_order` - We don't support any object/struct/etc files, only numeric primitives - We only support loading files that are in the host's endianness (typically little-endian) Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty. The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully. TODO: we could probably add this as an option for `debug_to_file()` without too much pain in a followup PR. * clang-tidy * clang-tidy * Address review comments * Allow for "keys" as well as 'keys' * Add .npy support to debug_to_file(). Built on top of https://github.com/halide/Halide/pull/8175, this adds .npy as an option. This is actually pretty great because it's easy to do something like `ss = numpy.load("my_file.npy"); print(ss)` in Python and get nicely-formatted output, which can sometimes be a lot easier for debugging than inserting lots of print() statements (see https://github.com/halide/Halide/issues/8176) Did a drive-by change to the correctness test to use this format instead of .mat. * Add float16 support * Add support for Float16 images in npy * Assume little-endian * Remove redundant halide_error() * naming convention * naming convention * Test both mat and npy * Don't call halide_error() * Use old-school parser * clang-tidy * Update debug_to_file API to remove type_code * Clean up into single table * Update CodeGen_LLVM.cpp * Fix tmp codes * Update InjectHostDevBufferCopies.cpp * Update InjectHostDevBufferCopies.cpp * trigger buildbots 29 April 2024, 16:38:30 UTC
3101277 Merge branch 'main' into xtensa-codegen 29 April 2024, 15:21:33 UTC
8202163 More aggressively unify duplicate lets (#8204) * Make unify_duplicate_lets more aggressive The simplifier can also clean up most of these, but it's harder for it because it has to consider that other mutations may have taken place. Beefing this up has no impact on lowering times for most apps, but something pathological was going on for local_laplacian. At 20 pyramid levels, this speeds up lowering by 1.3x. At 50 pyramid levels it's 2.3x. At 100 pyramid levels it's 4.1x. It also slightly reduces binary size. * Clarify comment; Avoid double-lookup into the scope Looking up with an Expr key and deep equality is expensive, so this was bad. * Add a std::move 28 April 2024, 21:39:41 UTC
64caf31 Faster vars used tracking in simplify let visitor (#8205) * Speed up the vars_used visitor in the simplifier let visitor. This visitor shows up as the main cost of lowering in very large pipelines. This visitor is for tracking which lets are actually used for real inside the body of a let block (as opposed to the tracking we do when mutating, which is approximate, because we could construct an Expr that uses a Var and then discard it in a later mutation). The old implementation made a map of all variables referenced, and then checked each let name against that map one by one. If there are a small number of lets outside a huge Stmt, this is bad, because the data structure has to hold a number of names proportional to the stmt size instead of proportional to the number of lets. This new implementation instead makes a hash set of the let names, and then traverses the Stmt, removing names from the set as they are encountered. This is a big speed-up. We then make the speed-up larger by about the same factor again by doing the following: 1) Only add names to the map that might be used based on the recursive mutate call. These are very very likely to be used, because we saw them at least once, and mutations that remove *all* uses of a Var are rare. 2) The visitor should early out when the map becomes empty. The let variables are often all used immediately, so this is frequent. Speeds up lowering of local laplacian by 1.44x, 2.6x, and 4.8x respectively for 20, 50, and 100 pyramid levels. Speeds up lowering of resnet50 by 1.04x. Speeds up lowering of lens blur by 1.06x * Exploit the ref count of the replacement Expr * Fix is_sole_reference logic in Simplify_Let.cpp * Reduce hash map size 28 April 2024, 21:38:54 UTC
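A self-contained model of the strategy the commit above describes, assuming variable names are already tokenized (the real visitor walks Halide IR, and also applies the two refinements listed in the message):

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Seed a hash set with the let names (size proportional to the number of
// lets, not the Stmt), erase names as the body mentions them, and stop
// as soon as the set is empty.
std::unordered_set<std::string>
unused_lets(const std::vector<std::string> &let_names,
            const std::vector<std::string> &vars_in_body) {
    std::unordered_set<std::string> remaining(let_names.begin(),
                                              let_names.end());
    for (const std::string &v : vars_in_body) {
        remaining.erase(v);
        if (remaining.empty()) {
            break;  // early-out: every let is provably used
        }
    }
    return remaining;  // whatever survives was never referenced
}
```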
302aa1c Refactor ConstantInterval (#8179) * Make ConstantInterval more of a first-class thing and use it in Monotonic.cpp * Restore bound_correlated_differences calls * Elaborate on TODO * Handle some TODOs Also explicit ignore lossless_cast bugs that will be fixed in #8155 * Fix constant interval mod, clean up constant interval saturating cast * Improve comment * Avoid unsigned overflow * Fix the most obvious bug in lossless_cast, to make the fuzzer pass more * Skip over pipelines that fail the lossless_cast check * Drop iteration count on lossless_cast test * Add test to CMakeLists.txt * Avoid UB in constant_interval test (signed integer overflow of the scalars) * Restore accidentally-deleted line from CMakeLists.txt * Print on success * Handle Lets in constant_integer_bounds Also, plumb the cache through the recursive calls * Delete duplicate operator<< * Just always cast the bounds back to the range of the op type * Address review comments * Redo operator<< for ConstantIntervals * Improve comment; disable buggy code for now 25 April 2024, 18:58:23 UTC
e39497b Make Interval::is_single_point check for deep equality (#8202) * Make is_single_point compare min and max by deep equality. Interval::is_single_point() used to only compare expressions by shallow equality to see if they are the same Expr object. However, bounds_of_expr_in_scope is really improved if it uses deep equality instead, so it has a prepass that goes over the provided scope, calls equal(min, max) on everything, and fixes up anything where deep equality is true but shallow equality is false. This prepass costs O(n) for n things in scope, regardless of how complex the expression being analyzed is. So if you ask for the bounds of, say, '4' in a context where there are lots of things in the scope, it's absurdly slow. We were doing this! BoxTouched calls bounds_of_expr_in_scope lots of times on small index Exprs within the same very large scope. It's better to just make Interval::is_single_point() check deep equality. This speeds up local laplacian lowering by 1.1x, and resnet50 lowering by 1.5x. There were also places where intervals that were a single point were diverging due to carelessly written code. E.g. the interval [40*8, 40*8], where both of those 40*8s are the same Mul node, was being simplified like this: `interval.min = simplify(interval.min); interval.max = simplify(interval.max);` Not only does this do twice the simplification work it needs to, but it also caused something that was a single point to diverge into not being a single point, because the repeated constant-folding creates a new Expr. With the new is_single_point this matters a lot less, but even so, I centralized simplification of intervals into a single helper that doesn't do the pointless double-simplification for single points. Some of these shallowly-unequal but deeply-equal Intervals were being created in bounds inference itself after the prepass, which may have been generating suboptimal bounds. This change should fix that in addition to the compile-time benefits. Also added a simplify call in SkipStages because I noticed when it processed specializations it was creating things like (condition) || (!condition). 21 April 2024, 03:43:38 UTC
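A self-contained model of the divergence the commit above describes, with shared_ptr identity standing in for shallow Expr equality and a fresh allocation standing in for constant folding (a sketch, not Halide's actual classes):

```cpp
#include <cassert>
#include <memory>

using Expr = std::shared_ptr<const int>;  // stand-in for Halide::Expr

Expr simplify(const Expr &e) {
    return std::make_shared<int>(*e);  // constant folding allocates a new node
}

int main() {
    Expr mul = std::make_shared<int>(40 * 8);  // one "Mul node"
    Expr min = mul, max = mul;   // interval [40*8, 40*8]: a single point
    assert(min.get() == max.get());   // shallowly equal
    min = simplify(min);
    max = simplify(max);
    assert(*min == *max);             // still deeply equal (both 320)...
    assert(min.get() != max.get());   // ...but shallow equality is gone
}
```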
31c52ab Faster substitute_facts (#8200) * Fix computational complexity of substitute_facts It was O(n) for n facts. This makes it O(log(n)) This was particularly bad for pipelines with lots of inputs or outputs, because those pipelines have lots of asserts, which make for lots of facts to substitute in. Speeds up lowering of local laplacian with 20 pyramid levels (which has only one input and one output) by 1.09x Speeds up lowering of the adams 2019 cost model training pipeline (lots of weight inputs and lots outputs due to derivatives) by 1.5x Speeds up resnet50 (tons of weight inputs) lowering by 7.3x! * Add missing switch breaks * Add missing comments * Elaborate on why we treat NaNs as equal 19 April 2024, 19:59:34 UTC
dd1d0e8 [HEXAGON] Keep support for hexagon_remote/Makefile (#8186) Update hexagon_remote/Makefile 19 April 2024, 17:33:44 UTC
4e0b313 Rewrite IREquality to use a more compact stack instead of deep recursion (#8198) * Rewrite IREquality to use a more compact stack instead of deep recursion Deletes a bunch of code and speeds up lowering time of local laplacian with 20 pyramid levels by ~2.5% * clang-tidy * Fold in the version of equal in IRMatch.h/cpp * Add missing switch breaks * Add missing comments * Elaborate on why we treat NaNs as equal 18 April 2024, 19:48:59 UTC
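The general shape of the rewrite above, as a self-contained sketch on a toy binary tree (Halide's IR has many node kinds, but the stack discipline is the same): push pairs onto an explicit worklist instead of recursing, so comparison depth is bounded by the heap rather than the C++ call stack.

```cpp
#include <stack>
#include <utility>

struct Node {
    int op;
    const Node *left = nullptr, *right = nullptr;
};

bool equal_iterative(const Node *x, const Node *y) {
    std::stack<std::pair<const Node *, const Node *>> work;
    work.push({x, y});
    while (!work.empty()) {
        auto [a, b] = work.top();
        work.pop();
        if (a == b) continue;                  // same node, or both null
        if (!a || !b || a->op != b->op) return false;
        work.push({a->left, b->left});         // defer children to the
        work.push({a->right, b->right});       // worklist, not the stack
    }
    return true;
}
```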
7994e70 Fix corner case in if_then_else simplification (#8189) Co-authored-by: Steven Johnson <srj@google.com> 16 April 2024, 21:27:43 UTC
3e712ba Disable fused mul-add for f16 while investigating 12 April 2024, 17:29:06 UTC
f247636 Merge branch 'main' into xtensa-codegen 12 April 2024, 17:25:08 UTC
f4c7831 Don't print on parallel task entry/exit with -debug flag (#8185) Fixes #8184 11 April 2024, 22:07:20 UTC
dc83707 Add .npy support to debug_to_file() (#8177) * Add .npy support to halide_image_io. The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that supports floating-point data more robustly than any of the others that we currently support. This adds load/save support for a useful subset: - We support the int/uint/float types common in Halide (except for f16/bf16 for now) - We don't support reading or writing files that are in `fortran_order` - We don't support any object/struct/etc files, only numeric primitives - We only support loading files that are in the host's endianness (typically little-endian) Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty. The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully. TODO: we could probably add this as an option for `debug_to_file()` without too much pain in a followup PR. * clang-tidy * clang-tidy * Address review comments * Allow for "keys" as well as 'keys' * Add .npy support to debug_to_file(). Built on top of https://github.com/halide/Halide/pull/8175, this adds .npy as an option. This is actually pretty great because it's easy to do something like `ss = numpy.load("my_file.npy"); print(ss)` in Python and get nicely-formatted output, which can sometimes be a lot easier for debugging than inserting lots of print() statements (see https://github.com/halide/Halide/issues/8176) Did a drive-by change to the correctness test to use this format instead of .mat. * Add float16 support * Add support for Float16 images in npy * Assume little-endian * Remove redundant halide_error() * naming convention * naming convention * Test both mat and npy * Don't call halide_error() * Use old-school parser * clang-tidy 11 April 2024, 18:04:42 UTC
8f3f6cf Update Hexagon Install Instructions (#8182) update Hexagon install instructions 11 April 2024, 16:58:36 UTC
e3d3c8c Fix unused variable. (#8180) 08 April 2024, 15:29:33 UTC
35f0c29 Add .npy support to halide_image_io (#8175) * Add .npy support to halide_image_io. The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that supports floating-point data more robustly than any of the others that we currently support. This adds load/save support for a useful subset: - We support the int/uint/float types common in Halide (except for f16/bf16 for now) - We don't support reading or writing files that are in `fortran_order` - We don't support any object/struct/etc files, only numeric primitives - We only support loading files that are in the host's endianness (typically little-endian) Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty. The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully. TODO: we could probably add this as an option for `debug_to_file()` without too much pain in a followup PR. * clang-tidy * clang-tidy * Address review comments * Allow for "keys" as well as 'keys' * Add float16 support * Use old-school parser * clang-tidy 06 April 2024, 15:17:25 UTC
f42a3d7 Merge branch 'main' into xtensa-codegen 05 April 2024, 16:43:20 UTC
14ae082 Clarify the meaning of Shuffle::is_broadcast() (#8158) * Fix horrifying bug in lossless_cast of a subtract * A 'broadcast' shuffle is more complex than it seems. I was poking at the Shuffle node, and checking its usage, and it seems that despite the comment, Shuffles that return true for is_broadcast are not the same as a Broadcast node. Instead of repeating the input vector some number of times, it repeats a shuffle of the input vector. This means IRPrinter was incorrect. None of the other usages were bad. This PR makes this clearer in the comment, and fixes IRPrinter. * Revert accidental change 05 April 2024, 16:39:07 UTC
a462044 Tighten bounds of abs() (#8168) * Tighten bounds of abs() * make abs bounds tight for non-int32 too * make int32 min expression match non-int32 min expression 05 April 2024, 16:38:46 UTC
ddab1cf Fix bug in lossless_negate 05 April 2024, 16:30:59 UTC
7d99357 Add conversion code for Float16 that was missed in #8174 (#8178) * Add conversion code for Float16 that was missed in #8174 * Don't sniff for _Float16 when building ASAN * Update HalideRuntime.h 05 April 2024, 16:07:05 UTC
3b8a532 Add some missing _Float16 support (#8174) (Changes extracted from https://github.com/halide/Halide/pull/8169, which may or may not land in its current form) Some missing support for _Float16 that will likely be handy: - Allow _Float16 to be detected for Clang 15 (since my local XCode Clang 15 definitely supports it) - Expr(_Float16) - HALIDE_DECLARE_EXTERN_SIMPLE_TYPE(_Float16); - Add _Float16 to the convert matrix in halide_image_io.h 04 April 2024, 17:19:13 UTC
a4158c0 fix ub in lower rounding shift right (#8173) * Avoid out-of-range shifts in lower_rounding_shift_left/right. Consider `lower_rounding_shift_right(a, (uint8)0)`. The term b - 1 becomes 255, and now you have an out-of-range shift, which causes the simplifier to inject a signed_integer_overflow intrinsic, and compilation to fail. This is a little annoying because if b == 0, b_positive is a zero mask, so the result isn't used anyway (this is also why this change is legal). In LLVM, it's a poison value, not UB, so masking it off works. If the simplifier were smarter, it might just drop the signed_integer_overflow intrinsic on detecting that it was being bitwise-and-ed with zero. But the safest thing to do is not overflow. saturating_add/sub are typically as cheap as add/sub. 99.9% of the time b is some positive constant anyway, so it's going to get constant-folded. * Add test 03 April 2024, 19:28:25 UTC
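A scalar sketch of the hazard and the fix described above, assuming uint8 values with in-range shift amounts (b <= 8; out-of-range b is a separate error). This is illustrative only; Halide's actual lowering works on vector Exprs:

```cpp
#include <cstdint>

uint8_t rounding_shift_right_sketch(uint8_t a, uint8_t b) {
    // Naive lowering computes round = 1 << (b - 1). When b == 0, b - 1
    // wraps to 255: an out-of-range shift (UB in C, poison in LLVM).
    // A saturating decrement keeps the shift amount in range; the
    // rounding term is masked to zero anyway when b == 0, so the result
    // is unchanged.
    uint8_t b_minus_1 = b > 0 ? (uint8_t)(b - 1) : 0;  // saturating_sub(b, 1)
    uint8_t b_positive = b > 0 ? 0xff : 0x00;          // zero mask when b == 0
    uint8_t round = (uint8_t)((1u << b_minus_1) & b_positive);
    return (uint8_t)((a + round) >> b);
}
```

For example, rounding_shift_right_sketch(20, 3) adds the rounding term 4 before shifting, giving 3 (20/8 rounded), while b == 0 returns a untouched instead of shifting by 255.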
1737a52 Add shifts to the lossless cast fuzzer. This required a more careful signed-integer-overflow detection routine 02 April 2024, 19:34:37 UTC
c652667 Avoid UB in lowering of rounding_shift_right/left 02 April 2024, 19:11:42 UTC
6bcc66a Fix bad test when lowering mul_shift_right: b_shift + b_shift < missing_q 02 April 2024, 19:11:26 UTC
c6065ff Delete bad rewrite rules 02 April 2024, 19:10:41 UTC
0fb8d38 Stronger assert in Simplify_Div 02 April 2024, 19:10:14 UTC
66c56f1 Fix some UB 01 April 2024, 20:35:01 UTC
ecfae44 It's too late to change the semantics of fixed point intrinsics 01 April 2024, 20:32:27 UTC
16a706d Remove bad TODO. I can't think of a single case that could cause this 01 April 2024, 20:30:50 UTC
0856319 clear_bounds_info -> clear_expr_info 01 April 2024, 20:13:03 UTC
4a293b1 Add missing comment 01 April 2024, 20:08:33 UTC
854122f Remove redundant helpers 01 April 2024, 20:01:16 UTC
413b4a6 Account for more aggressive simplification in fuse test 01 April 2024, 19:59:33 UTC
b053ec6 Add missing files 01 April 2024, 19:54:51 UTC
26efb7c Misc cleanups and test improvements 01 April 2024, 19:28:57 UTC
2f14881 Add a simplifier rule which is apparently now necessary 01 April 2024, 18:09:06 UTC
cffadd8 Fix ConstantInterval multiplication 01 April 2024, 18:08:29 UTC
cff71e1 Merge branch 'main' into xtensa-codegen 29 March 2024, 17:11:43 UTC
f308a8c Add cache for constant bounds queries 28 March 2024, 18:09:48 UTC
6434210 Fix * operator. Add min/max/mod 28 March 2024, 18:08:45 UTC
7f4bb38 Handle bounds of narrower types in the simplifier too 25 March 2024, 22:29:43 UTC
bee38ce Make the simplifier use ConstantInterval 25 March 2024, 21:37:54 UTC
67855a5 Move new classes to new files. Also fix up Monotonic.cpp 25 March 2024, 15:44:24 UTC
214f0fd Using constant_integer_bounds to strengthen FindIntrinsics. In particular, we can do better instruction selection for pmulhrsw 22 March 2024, 18:00:17 UTC
e0f9f8e Fix ARM and HVX instruction selection. Also added more TODOs 21 March 2024, 17:27:52 UTC
8864e8a Python bindings: `add_python_test()`: do set `HL_JIT_TARGET` too (#8156) This one took quite a bit of digging. I wanted to enable opencl tests on the Debian package, and `boundary_conditions.py`+`division.py` were failing when run with `HL_TARGET=host OCL_ICD_VENDORS=no-opencl-please.missing` env variables with `clGetPlatformIDs failed`, which made no sense to me. Empty `HL_JIT_TARGET` results in `opencl` being detected, unsurprisingly. 18 March 2024, 23:09:09 UTC
9c33c94 Use constant integer intervals to analyze safety for lossless_cast. TODO: - Dedup the constant integer code with the same code in the simplifier. - Move constant interval arithmetic operations out of the class. - Make the ConstantInterval part of the return type of lossless_cast (and turn it into an inner helper) so that it isn't constantly recomputed. 18 March 2024, 22:43:41 UTC
f60781f Merge branch 'main' into xtensa-codegen 15 March 2024, 22:07:53 UTC
a132246 Fix two compute_with bugs. (#8152) * Fix two compute_with bugs. This PR fixes a bug in compute_with, and another bug I found while fixing it (we could really use a compute_with fuzzer). The first bug is that you can get into situations where the bounds of a producer func will refer directly to the loop variable of a consumer func, where the consumer is in a compute_with fused group. In main, that loop variable may not be defined because fused loop names have been rewritten to include the token ".fused.". This PR adds let stmts to define it just inside the fused loop body. The second bug is that not all parent loops in compute_with fused groups were having their bounds expanded to cover the region to be computed of all children, because the logic for deciding which loops to expand only considered the non-specialized pure definition. So e.g. compute_with applied to an update stage would fail to compute values of the child Func where they do not overlap with the parent Func. This PR visits all definitions of the parent Func of the fused group, instead of just the unspecialized pure definition of the parent Func. Fixes #8149 * clang-tidy 15 March 2024, 21:04:44 UTC
76a7dd4 Support for ARM SVE2. (#8051) * Checkpoint SVE2 restart. * Remove dead code. Add new test. * Update cmake for new file. * Checkpoint progress on SVE2. * Checkpoint ARM SVE2 support. Passes correctness_simd_op_check_sve2 test at 128 and 256 bits. * Remove an opportunity for RISC-V codegen to change due to SVE2 support. * Ensure SVE intrinsics get vscale vectors and non-SVE ones get fixed vectors. Use proper prefix for neon intrinsics. Comment cleanups. * Checkpoint SVE2 work. Generally passes test, though using both NEON and SVE2 with simd_op_check_sve2 fails as both possibilities need to be allowed for 128-bit or smaller operations. * Remove an unfavored implementation possibility. * Fix opcode recognition in test to handle some cases that show up. Change name of test class to avoid confusion. * Formatting fixes. Replace internal_error with nop return for CodeGen_LLVM::match_vector_type_scalable called on scalar. * Formatting fix. * Limit SVE2 test to LLVM 19. Remove dead code. * Fix a degenerate case asking for zero-sized vectors via a Halide type with lanes of zero, which is not correct. * Fix confusion about Neon64/Neon128 and make it clear this is just the width multiplier applied to intrinsics. * Remove extraneous commented-out line. * Address some review feedback. Mostly comment fixes. * Fix missed conflict resolution. * Fix some TODOs in SVE code. Move utility function to Util.h and common code the other obvious use. * Formatting. * Add missed refactor change. * Add issue to TODO comment. * Remove TODOs that don't seem necessary. * Add issue for TODO. * Add issue for TODO. * Remove dubious looking FP to int code that was ifdef'ed out. Doesn't look like a TODO is needed anymore. * Add issues for TODOs. * Update simd_op_check_sve2.cpp * Make a deep copy of each piece of test IR so that we can parallelize * Fix two clang-tidy warnings * Remove try/catch block from simd-op-check-sve2 * Don't try to run SVE2 code if vector_bits doesn't match host. * Add support for fcvtm/p, make scalars go through pattern matching too (#8151) * Don't do arm neon instruction selection on scalars. This revealed a bug. FindIntrinsics was not enabled for scalars anyway, so it was semi-pointless. --------- Co-authored-by: Zalman Stern <zalman@macbook-pro.lan> Co-authored-by: Steven Johnson <srj@google.com> Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com> 15 March 2024, 20:01:51 UTC
7d80f8b Fix horrifying bug in lossless_cast of a subtract 14 March 2024, 21:37:38 UTC
f841a27 Bound allocation extents for hoist_storage using loop variables one-by-one (#8154) * Bound allocation extents using loop variable one-by-one * Use emplace_back 14 March 2024, 19:53:17 UTC
83616f2 Fix three nits (#8137) 1) has_gpu_feature already includes Vulkan, so there's no need to check for it. 2) Use emplace(...) instead of insert(make_pair(...)) 3) Fixed a place that should be using a ScopedValue 13 March 2024, 00:00:49 UTC
4988ab5 Feature: mark a Func as no_profiling, to prevent injection of profiling. (2nd implementation) (#8143) * Small feature to allow you to specify that a (typically small inner loop) Func should not be profiled. * Simplified the tuple name handling. * Optimize tuple name normalization in Profiling.cpp * Clang-format * Feedback on Function already being a pointer. Bump the Patch version of the serialization. 12 March 2024, 23:58:14 UTC
bf0d611 Rewrite the pass that adds mutexes for atomic nodes (#8105) * Avoid redundant scope lookups. This pattern has been bugging me for a long time: `if (scope.contains(key)) { Foo f = scope.get(key); }` This redundantly looks up the key in the scope twice. I've finally gotten around to fixing it. I've introduced a find method that either returns a const pointer to the value, if it exists, or null. It also searches any containing scopes, which are held by const pointer, so the method has to return a const pointer. `if (const Foo *f = scope.find(key)) { }` For cases where you want to get and then mutate, I added shallow_find, which doesn't search enclosing scopes, but returns a mutable pointer. We were also doing redundant scope lookups in ScopedBinding. We stored the key in the helper object, and then did a pop on that key in the ScopedBinding destructor. This commit changes Scope so that Scope::push returns an opaque token that you can pass to Scope::pop to have it remove that element without doing a fresh lookup. ScopedBinding now uses this. Under the hood it's just an iterator on the underlying map (map iterators are not invalidated on inserting or removing other stuff). The net effect is to speed up local laplacian lowering by about 5%. I also considered making it look more like an STL class, and having find return an iterator, but it doesn't really work. The iterator it returns might point to an entry in an enclosing scope, in which case you can't compare it to the .end() method of the scope you have. Scopes are different enough from maps that the interface really needs to be distinct. * Pacify clang-tidy * Rewrite the pass that injects mutexes to support atomics. For O(n) nested allocate nodes, this pass was quadratic in n, even if there was no use of atomics. This commit rewrites it to use a linear-time algorithm, and skips it entirely after the first validation pass if there aren't any atomic nodes. It also needlessly used IRGraphMutators, which slowed things down, didn't handle LargeBuffers (could overflow in the allocation), incorrectly thought every producer/consumer node was associated with an output buffer, and didn't print the realization name when printing the atomic node (the body of an atomic node is only atomic w.r.t. a specific realization). I noticed all this because it stuck out in a profile. For resnet 50, the rewrite that changed to a linear algorithm took this stage from 185ms down to 6.7ms, and then skipping it entirely when it doesn't find any atomic nodes added 1.5 for the single IRVisitor check. For local laplacian with 100 pyramid levels (which contains many nested allocate nodes due to a large number of skip connections), the times are 5846 ms -> 16 ms -> 4.6 ms. This is built on top of #8103 * Fix unintentional mutation of interval in scope --------- Co-authored-by: Steven Johnson <srj@google.com> 12 March 2024, 16:49:26 UTC
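A minimal sketch of the two Scope API changes described above, assuming a simplified single-level scope (the real Halide Scope supports nested scopes, shallow_find, and more):

```cpp
#include <map>
#include <string>
#include <vector>

template<typename T>
class Scope {
    std::map<std::string, std::vector<T>> table;

public:
    // One lookup instead of contains() followed by get().
    const T *find(const std::string &name) const {
        auto it = table.find(name);
        if (it == table.end() || it->second.empty()) return nullptr;
        return &it->second.back();
    }

    // push returns an iterator into the map; std::map iterators stay
    // valid across unrelated insertions and erasures, so pop() can use
    // the token directly instead of re-hashing the name.
    using PushToken = typename std::map<std::string, std::vector<T>>::iterator;

    PushToken push(const std::string &name, T value) {
        auto it = table.try_emplace(name).first;
        it->second.push_back(std::move(value));
        return it;
    }

    void pop(PushToken token) {
        token->second.pop_back();
    }
};
```

A ScopedBinding can then hold the token returned by push() and hand it back to pop() in its destructor, avoiding the second lookup the commit describes.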
3c2d809 Use python itself to get the extension suffix, not python-config (#8148) * Use python itself to get the extension suffix, not python-config * Add a comment 12 March 2024, 00:05:44 UTC
30f29fc Merge branch 'main' into xtensa-codegen 08 March 2024, 21:20:30 UTC
009fe7a Handle loads of broadcasts in FlattenNestedRamps (#8139) With sufficiently perverse schedules, it's possible to end up with a load of a broadcast index (rather than a broadcast of a scalar load). This made FlattenNestedRamps divide by zero. Unfortunately this happened in a complex production pipeline, so I'm not entirely sure how to reproduce it. For that pipeline, this change fixes it and produces correct output. 08 March 2024, 16:50:20 UTC
8cc4f02 Fix for top-of-tree LLVM (#8145) 08 March 2024, 02:13:56 UTC
22868a4 Add sobel in hexagon benchmarks app for CMake builds (#8127) * Add sobel in hexagon_benchmarks app for CMake builds Resolved compilation errors caused by the eliminate interleave pass, which changed the instruction from halide.hexagon.pack_satub.vuh to halide.hexagon.trunc_satub.vuh. The latter is only available in v65 or later. This commit ensures compatibility with v65 and later versions. * Minor fix to address the issue. --------- Co-authored-by: Steven Johnson <srj@google.com> 06 March 2024, 21:40:00 UTC
754e6ec [vulkan] Add conform API methods to memory allocator to fix block allocations (#8130) * Add conform API methods to block and region allocator classes Override conform requests for Vulkan memory allocator Cleanup memory requirement constraints for Vulkan Add conform test cases to block_allocator runtime test. * Clang format/tidy pass * Fix unsigned int comparisons * Clang format pass * Fix other unsigned int comparisons * Fix mismatched template types for max() * Fix whitespace for clang format --------- Co-authored-by: Derek Gerstmann <dgerstmann@adobe.com> 06 March 2024, 19:46:23 UTC
aad94de Disable halide_xtensa_mul_add_f32 temporarily 05 March 2024, 22:43:44 UTC
b6449b3 Merge branch 'main' into xtensa-codegen 05 March 2024, 17:54:38 UTC
10e07e6 Add class template type deduction guides to avoid CTAD warning. (#8135) * Add class template type deduction guides to avoid CTAD warning. * Formatting. 05 March 2024, 17:53:29 UTC
05ae15a Make gpu thread and block for loop names opaque (#8133) This is one of our largest remaining types of magic names. These were explicitly constructed in lots of places and then explicitly checked for with ends_with in lots of places. This PR makes the names opaque. Only CanonicalizeGPUVars.cpp knows what they are, and they don't have to be a single fixed thing as long as they're consistent within a process. Also reduced the number of GPU dimensions to three more uniformly. We were already asserting this, but there was lots of dead code in lowering passes after gpu loop validation that allowed for four. Also fixed a bug I found in is_block_uniform. It didn't consider that the dependence on a gpu thread variable in a load index could be because a let variable it encountered depends on a gpu thread variable. 05 March 2024, 17:50:19 UTC
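The pattern being applied above, as a hedged sketch (the names and helpers below are illustrative, not Halide's actual API): one translation unit owns the magic string, and every other pass asks it, instead of scattering ends_with() checks.

```cpp
#include <string>

namespace {
// Private to this file; no other pass may assume the format.
const std::string gpu_block_marker = ".__gpu_block_";  // hypothetical
}  // namespace

std::string gpu_block_var_name(const std::string &func, int dim) {
    return func + gpu_block_marker + std::to_string(dim);
}

bool is_gpu_block_var(const std::string &name) {
    return name.find(gpu_block_marker) != std::string::npos;
}
```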