https://github.com/halide/Halide

3a24168 Merge branch 'abadams/fix_lossless_cast_of_sub' into srj/lossless-test 05 June 2024, 17:10:56 UTC
360add6 Merge branch 'main' into abadams/fix_lossless_cast_of_sub 05 June 2024, 17:07:21 UTC
69bfd80 Merge branch 'main' into xtensa-codegen 05 June 2024, 16:35:38 UTC
5b5b0c6 Merge branch 'main' into xtensa-codegen 05 June 2024, 16:26:46 UTC
74b9044 It's generally a bad idea for simplifier rules to multiply constants (#8234) Fixes #8227 but may break other things. Needs thorough testing. Also, there are more rules like this lurking. 05 June 2024, 16:24:06 UTC
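The commit above doesn't spell out why constant-multiplying rules are risky; the usual hazard (an inference on my part, not taken from the PR) is that folding two in-range constants can produce an out-of-range one, so a rewrite like (x * c0) * c1 ==> x * (c0 * c1) changes meaning:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Both factors fit comfortably in int32_t...
    int64_t c0 = 65536, c1 = 65536;
    // ...but the folded constant a rule like
    //   (x * c0) * c1  ==>  x * (c0 * c1)
    // would synthesize does not: 65536 * 65536 == 2^32.
    int64_t folded = c0 * c1;
    printf("folded constant = %lld (INT32_MAX = %d)\n",
           (long long)folded, INT32_MAX);
}
```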
c33dbfb Don't introduce out-of-range shifts in lossless_cast 04 June 2024, 23:01:15 UTC
46e866d Report useful error to user if the promise_clamped call fails to losslessly cast. (#8238) Co-authored-by: Steven Johnson <srj@google.com> 04 June 2024, 16:32:54 UTC
775bfbf Python binding support for int64 literals (#8254) This makes >32bit python integers get mapped to `hl.i64` implicitly. Fixes #8224 04 June 2024, 16:31:30 UTC
9c75554 Fix Metal handling for float16 literals (#8260) * Fix Metal handling of float16 from bits, infinity, neg infinity, and nans * Disable test for OpenCL half for now * Formatting 04 June 2024, 15:21:04 UTC
7414ee6 Delete commented-out code 03 June 2024, 20:37:04 UTC
9570818 Fix mul_shift_right expansion 03 June 2024, 19:09:56 UTC
c8f7e8f Fix bugs in lossless_cast 03 June 2024, 18:39:47 UTC
ac5b13d Rework find_mpy_ops to handle more structures 03 June 2024, 18:39:39 UTC
1ef63f7 Merge branch 'main' into xtensa-codegen 03 June 2024, 17:49:00 UTC
bf28e00 Merge remote-tracking branch 'origin/main' into abadams/fix_lossless_cast_of_sub 02 June 2024, 21:46:23 UTC
7ca95d8 Expose BFloat in Python bindings (#8255) There are two parts to support for BFloat16 in Python: 1) Ability to define kernels and AOT compile them [fixed in this PR] 2) Ability to call kernels from Python. This fixes part 1, which is what I need for my use case. Part 2 is blocked on bfloat16 support in Python buffer protocols. See #6849 for more details. 02 June 2024, 21:39:44 UTC
a0f1d23 Add constant interval test 02 June 2024, 21:36:57 UTC
7cf2951 Remove max size assert from Anderson2021 (#8253) Fixes #8252 02 June 2024, 21:34:36 UTC
a9b8fbf Rework the simplifier to use ConstantInterval for bounds (#8222) * Update the simplifier to use ConstantInterval and track the bounds through more types * Move the simplify fuzzer back to a correctness test * Make debug_indent not static. Otherwise it causes a race condition in any parallel tests * Track expr info on non-overflowing casts to int * Delete commented-out code * clang-tidy * Delete unused member * Fix cmakelists for the fuzzer removal * Handle contradictions more gracefully in learn_true. The contradiction was arising from: `if (extent > 0) { ... } else { for (x = 0; x < extent; x++) { /* here we can assume extent > 0, but the if statement says extent <= 0 */ } }` * Better comments * Address review comments * Fix failure to pop loop var info 02 June 2024, 21:33:45 UTC
35143d2 Mark host_dirty() and device_dirty() with no_discard. (#8248) Co-authored-by: Steven Johnson <srj@google.com> 02 June 2024, 21:19:04 UTC
711dc88 Add HVX_v68 target to support Hexagon HVX v68. (#8232) 31 May 2024, 17:53:47 UTC
3ea4747 [xtensa] added support for sqrt_f16 (#8247) 30 May 2024, 17:27:38 UTC
33d5ba9 Fix saturating add matching in associativity checking (#8220) * Fix saturating add matching in associativity checking. The associative ops table defined saturating add as saturating_narrow(widen(x + y)), instead of saturating_narrow(widen(x) + y) 24 May 2024, 19:56:03 UTC
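The distinction in the commit above matters because widening after the add is too late: the narrow add has already wrapped. A self-contained sketch of the difference, assuming uint8 operands (illustrative, not the associative-ops table's actual representation):

```cpp
#include <algorithm>
#include <cstdint>

// For x = 200, y = 100:
//   saturating_narrow(widen(x + y)): x + y wraps to 44 first, so the
//     widened value is already wrong -> 44.
//   saturating_narrow(widen(x) + y): 300 is computed in the wide type,
//     then clamped -> 255, the true saturating add.
uint8_t wrong_pattern(uint8_t x, uint8_t y) {
    uint16_t wide = (uint16_t)(uint8_t)(x + y);  // widen(x + y): wraps first
    return (uint8_t)std::min<uint16_t>(wide, 255);
}

uint8_t right_pattern(uint8_t x, uint8_t y) {
    uint16_t wide = (uint16_t)x + y;             // widen(x) + y: exact
    return (uint8_t)std::min<uint16_t>(wide, 255);
}
```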
b5f5065 Add some EVAL_IN_LAMBDAs to Simplify_Sub.cpp (#8230) Massively reduces compile time and peak cl.exe memory consumption on Windows (from 9.5 GB down to 2.3 GB). Simplify_LT.cpp has these same EVAL_IN_LAMBDAs, which is probably why it hasn't been causing build problems. 23 May 2024, 18:17:49 UTC
8a316d1 [xtensa] Added vector load for two vectors for f16 and f32 (#8226) 23 May 2024, 16:23:09 UTC
17d4351 Merge branch 'main' into xtensa-codegen 20 May 2024, 16:37:14 UTC
e9f8b04 Fix for top-of-tree LLVM (#8223) * Fix for top-of-tree LLVM * Update LLVM_Runtime_Linker.cpp 15 May 2024, 21:43:17 UTC
16d77e9 Fix give-up case in ModulusRemainder (#8221) A default-constructed ModulusRemainder means no information, which is what we want here. ModulusRemainder{0, 1} means the constant one! 15 May 2024, 17:43:34 UTC
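For readers unfamiliar with the representation the commit above relies on, here is a hedged sketch of the convention, inferred from the commit text rather than copied from Halide's ModulusRemainder.h:

```cpp
#include <cstdint>

// ModulusRemainder{m, r} asserts x == m*k + r for some integer k.
//   {1, 0} -> x == k        : true of every x, i.e. no information
//   {0, 1} -> x == 1        : exactly the constant one
//   {4, 2} -> x == 4*k + 2  : e.g. ..., -2, 2, 6, 10, ...
struct ModulusRemainder {
    int64_t modulus = 1;   // default-constructed: no information
    int64_t remainder = 0;
};
```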
211bafa Fix Reinterpret cmp in IREquality (#8217) fix Reinterpret cmp 14 May 2024, 20:15:57 UTC
d61390a [xtensa] Fixed index conversion for gather_load with undefined ramp (#8215) Fixed index conversion for gather_load with undefined ramp 10 May 2024, 17:55:47 UTC
dfaf6ad Insert apparently-missing `break;` in IREquality.cpp (#8211) * Insert apparently-missing `break;` in IREquality.cpp * Enable -Wimplicit-fallthrough * Also add -Wimplicit-fallthrough to runtime builds * Add missing break to runtime/webgpu.cpp * Also add flag to Makefile --------- Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com> 30 April 2024, 15:08:26 UTC
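The class of bug fixed above is easy to reproduce; a minimal illustration of what -Wimplicit-fallthrough catches, and how an intentional fallthrough is annotated:

```cpp
// With -Wimplicit-fallthrough, a case that drops into the next one
// without `break` is flagged unless annotated as intentional.
int lanes_for(int kind) {
    switch (kind) {
    case 0:
        return 1;
    case 1:
        [[fallthrough]];  // deliberate: kinds 1 and 2 share a path
    case 2:
        return 4;
    default:
        return 0;  // an accidental missing `break` above would silently
                   // change which case a value lands in
    }
}
```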
8141197 [x86 & HVX & WASM] Use bounds inference for saturating_narrow instruction selection (#7805) * x86 bounds inference for saturating_narrow * bounds inference for HVX too * use can_represent(ConstantInterval) + clang-format * use bounds inference for WASM IS too + add tests * add tracking issue for scoped constant bounds * add TODO about lossless_cast usage --------- Co-authored-by: Steven Johnson <srj@google.com> 30 April 2024, 13:38:30 UTC
d55d82b Update debug_to_file API to remove type_code (#8183) * Add .npy support to halide_image_io. The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that supports floating-point data more robustly than any of the others that we currently support. This adds load/save support for a useful subset: - We support the int/uint/float types common in Halide (except for f16/bf16 for now) - We don't support reading or writing files that are in `fortran_order` - We don't support any object/struct/etc files, only numeric primitives - We only support loading files that are in the host's endianness (typically little-endian) Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty. The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully. TODO: we could probably add this as an option for `debug_to_file()` without too much pain in a followup PR. * clang-tidy * clang-tidy * Address review comments * Allow for "keys" as well as 'keys' * Add .npy support to debug_to_file(). Built on top of https://github.com/halide/Halide/pull/8175, this adds .npy as an option. This is actually pretty great because it's easy to do something like `ss = numpy.load("my_file.npy"); print(ss)` in Python and get nicely-formatted output, which can sometimes be a lot easier for debugging than inserting lots of print() statements (see https://github.com/halide/Halide/issues/8176) Did a drive-by change to the correctness test to use this format instead of .mat. * Add float16 support * Add support for Float16 images in npy * Assume little-endian * Remove redundant halide_error() * naming convention * naming convention * Test both mat and npy * Don't call halide_error() * Use old-school parser * clang-tidy * Update debug_to_file API to remove type_code * Clean up into single table * Update CodeGen_LLVM.cpp * Fix tmp codes * Update InjectHostDevBufferCopies.cpp * Update InjectHostDevBufferCopies.cpp * trigger buildbots 29 April 2024, 16:38:30 UTC
3101277 Merge branch 'main' into xtensa-codegen 29 April 2024, 15:21:33 UTC
8202163 More aggressively unify duplicate lets (#8204) * Make unify_duplicate_lets more aggressive The simplifier can also clean up most of these, but it's harder for it because it has to consider that other mutations may have taken place. Beefing this up has no impact on lowering times for most apps, but something pathological was going on for local_laplacian. At 20 pyramid levels, this speeds up lowering by 1.3x. At 50 pyramid levels it's 2.3x. At 100 pyramid levels it's 4.1x. It also slightly reduces binary size. * Clarify comment; Avoid double-lookup into the scope Looking up with an Expr key and deep equality is expensive, so this was bad. * Add a std::move 28 April 2024, 21:39:41 UTC
64caf31 Faster vars used tracking in simplify let visitor (#8205) * Speed up the vars_used visitor in the simplifier let visitor. This visitor shows up as the main cost of lowering in very large pipelines. This visitor is for tracking which lets are actually used for real inside the body of a let block (as opposed to the tracking we do when mutating, which is approximate, because we could construct an Expr that uses a Var and then discard it in a later mutation). The old implementation made a map of all variables referenced, and then checked each let name against that map one by one. If there are a small number of lets outside a huge Stmt, this is bad, because the data structure has to hold a number of names proportional to the stmt size instead of proportional to the number of lets. This new implementation instead makes a hash set of the let names, and then traverses the Stmt, removing names from the set as they are encountered. This is a big speed-up. We then make the speed-up larger by about the same factor again by doing the following: 1) Only add names to the map that might be used based on the recursive mutate call. These are very very likely to be used, because we saw them at least once, and mutations that remove *all* uses of a Var are rare. 2) The visitor should early out when the map becomes empty. The let variables are often all used immediately, so this is frequent. Speeds up lowering of local laplacian by 1.44x, 2.6x, and 4.8x respectively for 20, 50, and 100 pyramid levels. Speeds up lowering of resnet50 by 1.04x. Speeds up lowering of lens blur by 1.06x * Exploit the ref count of the replacement Expr * Fix is_sole_reference logic in Simplify_Let.cpp * Reduce hash map size 28 April 2024, 21:38:54 UTC
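A self-contained model of the strategy the commit above describes, assuming variable names are already tokenized (the real visitor walks Halide IR, and also applies the two refinements listed in the message):

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Seed a hash set with the let names (size proportional to the number of
// lets, not the Stmt), erase names as the body mentions them, and stop
// as soon as the set is empty.
std::unordered_set<std::string>
unused_lets(const std::vector<std::string> &let_names,
            const std::vector<std::string> &vars_in_body) {
    std::unordered_set<std::string> remaining(let_names.begin(),
                                              let_names.end());
    for (const std::string &v : vars_in_body) {
        remaining.erase(v);
        if (remaining.empty()) {
            break;  // early-out: every let is provably used
        }
    }
    return remaining;  // whatever survives was never referenced
}
```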
302aa1c Refactor ConstantInterval (#8179) * Make ConstantInterval more of a first-class thing and use it in Monotonic.cpp * Restore bound_correlated_differences calls * Elaborate on TODO * Handle some TODOs Also explicit ignore lossless_cast bugs that will be fixed in #8155 * Fix constant interval mod, clean up constant interval saturating cast * Improve comment * Avoid unsigned overflow * Fix the most obvious bug in lossless_cast, to make the fuzzer pass more * Skip over pipelines that fail the lossless_cast check * Drop iteration count on lossless_cast test * Add test to CMakeLists.txt * Avoid UB in constant_interval test (signed integer overflow of the scalars) * Restore accidentally-deleted line from CMakeLists.txt * Print on success * Handle Lets in constant_integer_bounds Also, plumb the cache through the recursive calls * Delete duplicate operator<< * Just always cast the bounds back to the range of the op type * Address review comments * Redo operator<< for ConstantIntervals * Improve comment; disable buggy code for now 25 April 2024, 18:58:23 UTC
e39497b Make Interval::is_single_point check for deep equality (#8202) * Make is_single_point compare min and max by deep equality. Interval::is_single_point() used to only compare expressions by shallow equality to see if they are the same Expr object. However, bounds_of_expr_in_scope is really improved if it uses deep equality instead, so it has a prepass that goes over the provided scope, calls equal(min, max) on everything, and fixes up anything where deep equality is true but shallow equality is false. This prepass costs O(n) for n things in scope, regardless of how complex the expression being analyzed is. So if you ask for the bounds of, say, '4' in a context where there are lots of things in the scope, it's absurdly slow. We were doing this! BoxTouched calls bounds_of_expr_in_scope lots of times on small index Exprs within the same very large scope. It's better to just make Interval::is_single_point() check deep equality. This speeds up local laplacian lowering by 1.1x, and resnet50 lowering by 1.5x. There were also places where intervals that were a single point were diverging due to carelessly written code. E.g. the interval [40*8, 40*8], where both of those 40*8s are the same Mul node, was being simplified like this: `interval.min = simplify(interval.min); interval.max = simplify(interval.max);` Not only does this do twice the simplification work it needs to, but it also caused something that was a single point to diverge into not being a single point, because the repeated constant-folding creates a new Expr. With the new is_single_point this matters a lot less, but even so, I centralized simplification of intervals into a single helper that doesn't do the pointless double-simplification for single points. Some of these shallowly-unequal but deeply-equal Intervals were being created in bounds inference itself after the prepass, which may have been generating suboptimal bounds. This change should fix that in addition to the compile-time benefits. Also added a simplify call in SkipStages because I noticed when it processed specializations it was creating things like (condition) || (!condition). 21 April 2024, 03:43:38 UTC
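A self-contained model of the divergence the commit above describes, with shared_ptr identity standing in for shallow Expr equality and a fresh allocation standing in for constant folding (a sketch, not Halide's actual classes):

```cpp
#include <cassert>
#include <memory>

using Expr = std::shared_ptr<const int>;  // stand-in for Halide::Expr

Expr simplify(const Expr &e) {
    return std::make_shared<int>(*e);  // constant folding allocates a new node
}

int main() {
    Expr mul = std::make_shared<int>(40 * 8);  // one "Mul node"
    Expr min = mul, max = mul;   // interval [40*8, 40*8]: a single point
    assert(min.get() == max.get());   // shallowly equal
    min = simplify(min);
    max = simplify(max);
    assert(*min == *max);             // still deeply equal (both 320)...
    assert(min.get() != max.get());   // ...but shallow equality is gone
}
```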
31c52ab Faster substitute_facts (#8200) * Fix computational complexity of substitute_facts It was O(n) for n facts. This makes it O(log(n)) This was particularly bad for pipelines with lots of inputs or outputs, because those pipelines have lots of asserts, which make for lots of facts to substitute in. Speeds up lowering of local laplacian with 20 pyramid levels (which has only one input and one output) by 1.09x Speeds up lowering of the adams 2019 cost model training pipeline (lots of weight inputs and lots outputs due to derivatives) by 1.5x Speeds up resnet50 (tons of weight inputs) lowering by 7.3x! * Add missing switch breaks * Add missing comments * Elaborate on why we treat NaNs as equal 19 April 2024, 19:59:34 UTC
dd1d0e8 [HEXAGON] Keep support for hexagon_remote/Makefile (#8186) Update hexagon_remote/Makefile 19 April 2024, 17:33:44 UTC
4e0b313 Rewrite IREquality to use a more compact stack instead of deep recursion (#8198) * Rewrite IREquality to use a more compact stack instead of deep recursion Deletes a bunch of code and speeds up lowering time of local laplacian with 20 pyramid levels by ~2.5% * clang-tidy * Fold in the version of equal in IRMatch.h/cpp * Add missing switch breaks * Add missing comments * Elaborate on why we treat NaNs as equal 18 April 2024, 19:48:59 UTC
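The general shape of the rewrite above, as a self-contained sketch on a toy binary tree (Halide's IR has many node kinds, but the stack discipline is the same): push pairs onto an explicit worklist instead of recursing, so comparison depth is bounded by the heap rather than the C++ call stack.

```cpp
#include <stack>
#include <utility>

struct Node {
    int op;
    const Node *left = nullptr, *right = nullptr;
};

bool equal_iterative(const Node *x, const Node *y) {
    std::stack<std::pair<const Node *, const Node *>> work;
    work.push({x, y});
    while (!work.empty()) {
        auto [a, b] = work.top();
        work.pop();
        if (a == b) continue;                  // same node, or both null
        if (!a || !b || a->op != b->op) return false;
        work.push({a->left, b->left});         // defer children to the
        work.push({a->right, b->right});       // worklist, not the stack
    }
    return true;
}
```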
7994e70 Fix corner case in if_then_else simplification (#8189) Co-authored-by: Steven Johnson <srj@google.com> 16 April 2024, 21:27:43 UTC
3e712ba Disable fused mul-add for f16 while investigating 12 April 2024, 17:29:06 UTC
f247636 Merge branch 'main' into xtensa-codegen 12 April 2024, 17:25:08 UTC
f4c7831 Don't print on parallel task entry/exit with -debug flag (#8185) Fixes #8184 11 April 2024, 22:07:20 UTC
dc83707 Add .npy support to debug_to_file() (#8177) * Add .npy support to halide_image_io. The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that supports floating-point data more robustly than any of the others that we currently support. This adds load/save support for a useful subset: - We support the int/uint/float types common in Halide (except for f16/bf16 for now) - We don't support reading or writing files that are in `fortran_order` - We don't support any object/struct/etc files, only numeric primitives - We only support loading files that are in the host's endianness (typically little-endian) Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty. The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully. TODO: we could probably add this as an option for `debug_to_file()` without too much pain in a followup PR. * clang-tidy * clang-tidy * Address review comments * Allow for "keys" as well as 'keys' * Add .npy support to debug_to_file(). Built on top of https://github.com/halide/Halide/pull/8175, this adds .npy as an option. This is actually pretty great because it's easy to do something like `ss = numpy.load("my_file.npy"); print(ss)` in Python and get nicely-formatted output, which can sometimes be a lot easier for debugging than inserting lots of print() statements (see https://github.com/halide/Halide/issues/8176) Did a drive-by change to the correctness test to use this format instead of .mat. * Add float16 support * Add support for Float16 images in npy * Assume little-endian * Remove redundant halide_error() * naming convention * naming convention * Test both mat and npy * Don't call halide_error() * Use old-school parser * clang-tidy 11 April 2024, 18:04:42 UTC
8f3f6cf Update Hexagon Install Instructions (#8182) update Hexagon install instructions 11 April 2024, 16:58:36 UTC
e3d3c8c Fix unused variable. (#8180) 08 April 2024, 15:29:33 UTC
35f0c29 Add .npy support to halide_image_io (#8175) * Add .npy support to halide_image_io. The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that supports floating-point data more robustly than any of the others that we currently support. This adds load/save support for a useful subset: - We support the int/uint/float types common in Halide (except for f16/bf16 for now) - We don't support reading or writing files that are in `fortran_order` - We don't support any object/struct/etc files, only numeric primitives - We only support loading files that are in the host's endianness (typically little-endian) Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty. The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully. TODO: we could probably add this as an option for `debug_to_file()` without too much pain in a followup PR. * clang-tidy * clang-tidy * Address review comments * Allow for "keys" as well as 'keys' * Add float16 support * Use old-school parser * clang-tidy 06 April 2024, 15:17:25 UTC
f42a3d7 Merge branch 'main' into xtensa-codegen 05 April 2024, 16:43:20 UTC
14ae082 Clarify the meaning of Shuffle::is_broadcast() (#8158) * Fix horrifying bug in lossless_cast of a subtract * A 'broadcast' shuffle is more complex than it seems. I was poking at the Shuffle node, and checking its usage, and it seems that despite the comment, Shuffles that return true for is_broadcast are not the same as a Broadcast node. Instead of repeating the input vector some number of times, it repeats a shuffle of the input vector. This means IRPrinter was incorrect. None of the other usages were bad. This PR makes this clearer in the comment, and fixes IRPrinter. * Revert accidental change 05 April 2024, 16:39:07 UTC
a462044 Tighten bounds of abs() (#8168) * Tighten bounds of abs() * make abs bounds tight for non-int32 too * make int32 min expression match non-int32 min expression 05 April 2024, 16:38:46 UTC
ddab1cf Fix bug in lossless_negate 05 April 2024, 16:30:59 UTC
7d99357 Add conversion code for Float16 that was missed in #8174 (#8178) * Add conversion code for Float16 that was missed in #8174 * Don't sniff for _Float16 when building ASAN * Update HalideRuntime.h 05 April 2024, 16:07:05 UTC
3b8a532 Add some missing _Float16 support (#8174) (Changes extracted from https://github.com/halide/Halide/pull/8169, which may or may not land in its current form) Some missing support for _Float16 that will likely be handy: - Allow _Float16 to be detected for Clang 15 (since my local XCode Clang 15 definitely supports it) - Expr(_Float16) - HALIDE_DECLARE_EXTERN_SIMPLE_TYPE(_Float16); - Add _Float16 to the convert matrix in halide_image_io.h 04 April 2024, 17:19:13 UTC
a4158c0 fix ub in lower rounding shift right (#8173) * Avoid out-of-range shifts in lower_rounding_shift_left/right. Consider `lower_rounding_shift_right(a, (uint8)0)`. The term b - 1 becomes 255, and now you have an out-of-range shift, which causes the simplifier to inject a signed_integer_overflow intrinsic, and compilation to fail. This is a little annoying because if b == 0, b_positive is a zero mask, so the result isn't used anyway (this is also why this change is legal). In LLVM, it's a poison value, not UB, so masking it off works. If the simplifier were smarter, it might just drop the signed_integer_overflow intrinsic on detecting that it was being bitwise-and-ed with zero. But the safest thing to do is not overflow. saturating_add/sub are typically as cheap as add/sub. 99.9% of the time b is some positive constant anyway, so it's going to get constant-folded. * Add test 03 April 2024, 19:28:25 UTC
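A scalar sketch of the hazard and the fix described above, assuming uint8 values with in-range shift amounts (b <= 8; out-of-range b is a separate error). This is illustrative only; Halide's actual lowering works on vector Exprs:

```cpp
#include <cstdint>

uint8_t rounding_shift_right_sketch(uint8_t a, uint8_t b) {
    // Naive lowering computes round = 1 << (b - 1). When b == 0, b - 1
    // wraps to 255: an out-of-range shift (UB in C, poison in LLVM).
    // A saturating decrement keeps the shift amount in range; the
    // rounding term is masked to zero anyway when b == 0, so the result
    // is unchanged.
    uint8_t b_minus_1 = b > 0 ? (uint8_t)(b - 1) : 0;  // saturating_sub(b, 1)
    uint8_t b_positive = b > 0 ? 0xff : 0x00;          // zero mask when b == 0
    uint8_t round = (uint8_t)((1u << b_minus_1) & b_positive);
    return (uint8_t)((a + round) >> b);
}
```

For example, rounding_shift_right_sketch(20, 3) adds the rounding term 4 before shifting, giving 3 (20/8 rounded), while b == 0 returns a untouched instead of shifting by 255.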
1737a52 Add shifts to the lossless cast fuzzer. This required a more careful signed-integer-overflow detection routine 02 April 2024, 19:34:37 UTC
c652667 Avoid UB in lowering of rounding_shift_right/left 02 April 2024, 19:11:42 UTC
6bcc66a Fix bad test when lowering mul_shift_right: b_shift + b_shift < missing_q 02 April 2024, 19:11:26 UTC
c6065ff Delete bad rewrite rules 02 April 2024, 19:10:41 UTC
0fb8d38 Stronger assert in Simplify_Div 02 April 2024, 19:10:14 UTC
66c56f1 Fix some UB 01 April 2024, 20:35:01 UTC
ecfae44 It's too late to change the semantics of fixed point intrinsics 01 April 2024, 20:32:27 UTC
16a706d Remove bad TODO. I can't think of a single case that could cause this 01 April 2024, 20:30:50 UTC
0856319 clear_bounds_info -> clear_expr_info 01 April 2024, 20:13:03 UTC
4a293b1 Add missing comment 01 April 2024, 20:08:33 UTC
854122f Remove redundant helpers 01 April 2024, 20:01:16 UTC
413b4a6 Account for more aggressive simplification in fuse test 01 April 2024, 19:59:33 UTC
b053ec6 Add missing files 01 April 2024, 19:54:51 UTC
26efb7c Misc cleanups and test improvements 01 April 2024, 19:28:57 UTC
2f14881 Add a simplifier rule which is apparently now necessary 01 April 2024, 18:09:06 UTC
cffadd8 Fix ConstantInterval multiplication 01 April 2024, 18:08:29 UTC
cff71e1 Merge branch 'main' into xtensa-codegen 29 March 2024, 17:11:43 UTC
f308a8c Add cache for constant bounds queries 28 March 2024, 18:09:48 UTC
6434210 Fix * operator. Add min/max/mod 28 March 2024, 18:08:45 UTC
7f4bb38 Handle bounds of narrower types in the simplifier too 25 March 2024, 22:29:43 UTC
bee38ce Make the simplifier use ConstantInterval 25 March 2024, 21:37:54 UTC
67855a5 Move new classes to new files. Also fix up Monotonic.cpp 25 March 2024, 15:44:24 UTC
214f0fd Using constant_integer_bounds to strengthen FindIntrinsics. In particular, we can do better instruction selection for pmulhrsw 22 March 2024, 18:00:17 UTC
e0f9f8e Fix ARM and HVX instruction selection. Also added more TODOs 21 March 2024, 17:27:52 UTC
8864e8a Python bindings: `add_python_test()`: do set `HL_JIT_TARGET` too (#8156) This one took quite a bit of digging. I wanted to enable opencl tests on the Debian package, and `boundary_conditions.py`+`division.py` were failing when run with `HL_TARGET=host OCL_ICD_VENDORS=no-opencl-please.missing` env variables with `clGetPlatformIDs failed`, which made no sense to me. Empty `HL_JIT_TARGET` results in `opencl` being detected, unsurprisingly. 18 March 2024, 23:09:09 UTC
9c33c94 Use constant integer intervals to analyze safety for lossless_cast. TODO: - Dedup the constant integer code with the same code in the simplifier. - Move constant interval arithmetic operations out of the class. - Make the ConstantInterval part of the return type of lossless_cast (and turn it into an inner helper) so that it isn't constantly recomputed. 18 March 2024, 22:43:41 UTC
f60781f Merge branch 'main' into xtensa-codegen 15 March 2024, 22:07:53 UTC
a132246 Fix two compute_with bugs. (#8152) * Fix two compute_with bugs. This PR fixes a bug in compute_with, and another bug I found while fixing it (we could really use a compute_with fuzzer). The first bug is that you can get into situations where the bounds of a producer func will refer directly to the loop variable of a consumer func, where the consumer is in a compute_with fused group. In main, that loop variable may not be defined because fused loop names have been rewritten to include the token ".fused.". This PR adds let stmts to define it just inside the fused loop body. The second bug is that not all parent loops in compute_with fused groups were having their bounds expanded to cover the region to be computed of all children, because the logic for deciding which loops to expand only considered the non-specialized pure definition. So e.g. compute_with applied to an update stage would fail to compute values of the child Func where they do not overlap with the parent Func. This PR visits all definitions of the parent Func of the fused group, instead of just the unspecialized pure definition of the parent Func. Fixes #8149 * clang-tidy 15 March 2024, 21:04:44 UTC
76a7dd4 Support for ARM SVE2. (#8051) * Checkpoint SVE2 restart. * Remove dead code. Add new test. * Update cmake for new file. * Checkpoint progress on SVE2. * Checkpoint ARM SVE2 support. Passes correctness_simd_op_check_sve2 test at 128 and 256 bits. * Remove an opportunity for RISC-V codegen to change due to SVE2 support. * Ensure SVE intrinsics get vscale vectors and non-SVE ones get fixed vectors. Use proper prefix for neon intrinsics. Comment cleanups. * Checkpoint SVE2 work. Generally passes test, though using both NEON and SVE2 with simd_op_check_sve2 fails as both possibilities need to be allowed for 128-bit or smaller operations. * Remove an unfavored implementation possibility. * Fix opcode recognition in test to handle some cases that show up. Change name of test class to avoid confusion. * Formatting fixes. Replace internal_error with nop return for CodeGen_LLVM::match_vector_type_scalable called on scalar. * Formatting fix. * Limit SVE2 test to LLVM 19. Remove dead code. * Fix a degenerate case asking for zero-sized vectors via a Halide type with lanes of zero, which is not correct. * Fix confusion about Neon64/Neon128 and make it clear this is just the width multiplier applied to intrinsics. * Remove extraneous commented-out line. * Address some review feedback. Mostly comment fixes. * Fix missed conflict resolution. * Fix some TODOs in SVE code. Move utility function to Util.h and common code the other obvious use. * Formatting. * Add missed refactor change. * Add issue to TODO comment. * Remove TODOs that don't seem necessary. * Add issue for TODO. * Add issue for TODO. * Remove dubious looking FP to int code that was ifdef'ed out. Doesn't look like a TODO is needed anymore. * Add issues for TODOs. * Update simd_op_check_sve2.cpp * Make a deep copy of each piece of test IR so that we can parallelize * Fix two clang-tidy warnings * Remove try/catch block from simd-op-check-sve2 * Don't try to run SVE2 code if vector_bits doesn't match host. * Add support for fcvtm/p, make scalars go through pattern matching too (#8151) * Don't do arm neon instruction selection on scalars. This revealed a bug. FindIntrinsics was not enabled for scalars anyway, so it was semi-pointless. --------- Co-authored-by: Zalman Stern <zalman@macbook-pro.lan> Co-authored-by: Steven Johnson <srj@google.com> Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com> 15 March 2024, 20:01:51 UTC
7d80f8b Fix horrifying bug in lossless_cast of a subtract 14 March 2024, 21:37:38 UTC
f841a27 Bound allocation extents for hoist_storage using loop variables one-by-one (#8154) * Bound allocation extents using loop variable one-by-one * Use emplace_back 14 March 2024, 19:53:17 UTC
83616f2 Fix three nits (#8137) 1) has_gpu_feature already includes Vulkan, so there's no need to check for it. 2) Use emplace(...) instead of insert(make_pair(...)) 3) Fixed a place that should be using a ScopedValue 13 March 2024, 00:00:49 UTC
4988ab5 Feature: mark a Func as no_profiling, to prevent injection of profiling. (2nd implementation) (#8143) * Small feature to allow you to specify that a (typically small inner loop) Func should not be profiled. * Simplified the tuple name handling. * Optimize tuple name normalization in Profiling.cpp * Clang-format * Feedback on Function already being a pointer. Bump the Patch version of the serialization. 12 March 2024, 23:58:14 UTC
bf0d611 Rewrite the pass that adds mutexes for atomic nodes (#8105) * Avoid redundant scope lookups. This pattern has been bugging me for a long time: `if (scope.contains(key)) { Foo f = scope.get(key); }` This redundantly looks up the key in the scope twice. I've finally gotten around to fixing it. I've introduced a find method that either returns a const pointer to the value, if it exists, or null. It also searches any containing scopes, which are held by const pointer, so the method has to return a const pointer. `if (const Foo *f = scope.find(key)) { }` For cases where you want to get and then mutate, I added shallow_find, which doesn't search enclosing scopes, but returns a mutable pointer. We were also doing redundant scope lookups in ScopedBinding. We stored the key in the helper object, and then did a pop on that key in the ScopedBinding destructor. This commit changes Scope so that Scope::push returns an opaque token that you can pass to Scope::pop to have it remove that element without doing a fresh lookup. ScopedBinding now uses this. Under the hood it's just an iterator on the underlying map (map iterators are not invalidated on inserting or removing other stuff). The net effect is to speed up local laplacian lowering by about 5%. I also considered making it look more like an STL class, and having find return an iterator, but it doesn't really work. The iterator it returns might point to an entry in an enclosing scope, in which case you can't compare it to the .end() method of the scope you have. Scopes are different enough from maps that the interface really needs to be distinct. * Pacify clang-tidy * Rewrite the pass that injects mutexes to support atomics. For O(n) nested allocate nodes, this pass was quadratic in n, even if there was no use of atomics. This commit rewrites it to use a linear-time algorithm, and skips it entirely after the first validation pass if there aren't any atomic nodes. It also needlessly used IRGraphMutators, which slowed things down, didn't handle LargeBuffers (could overflow in the allocation), incorrectly thought every producer/consumer node was associated with an output buffer, and didn't print the realization name when printing the atomic node (the body of an atomic node is only atomic w.r.t. a specific realization). I noticed all this because it stuck out in a profile. For resnet 50, the rewrite that changed to a linear algorithm took this stage from 185ms down to 6.7ms, and then skipping it entirely when it doesn't find any atomic nodes added 1.5 for the single IRVisitor check. For local laplacian with 100 pyramid levels (which contains many nested allocate nodes due to a large number of skip connections), the times are 5846 ms -> 16 ms -> 4.6 ms. This is built on top of #8103 * Fix unintentional mutation of interval in scope --------- Co-authored-by: Steven Johnson <srj@google.com> 12 March 2024, 16:49:26 UTC
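A minimal sketch of the two Scope API changes described above, assuming a simplified single-level scope (the real Halide Scope supports nested scopes, shallow_find, and more):

```cpp
#include <map>
#include <string>
#include <vector>

template<typename T>
class Scope {
    std::map<std::string, std::vector<T>> table;

public:
    // One lookup instead of contains() followed by get().
    const T *find(const std::string &name) const {
        auto it = table.find(name);
        if (it == table.end() || it->second.empty()) return nullptr;
        return &it->second.back();
    }

    // push returns an iterator into the map; std::map iterators stay
    // valid across unrelated insertions and erasures, so pop() can use
    // the token directly instead of re-hashing the name.
    using PushToken = typename std::map<std::string, std::vector<T>>::iterator;

    PushToken push(const std::string &name, T value) {
        auto it = table.try_emplace(name).first;
        it->second.push_back(std::move(value));
        return it;
    }

    void pop(PushToken token) {
        token->second.pop_back();
    }
};
```

A ScopedBinding can then hold the token returned by push() and hand it back to pop() in its destructor, avoiding the second lookup the commit describes.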
3c2d809 Use python itself to get the extension suffix, not python-config (#8148) * Use python itself to get the extension suffix, not python-config * Add a comment 12 March 2024, 00:05:44 UTC
30f29fc Merge branch 'main' into xtensa-codegen 08 March 2024, 21:20:30 UTC
009fe7a Handle loads of broadcasts in FlattenNestedRamps (#8139) With sufficiently perverse schedules, it's possible to end up with a load of a broadcast index (rather than a broadcast of a scalar load). This made FlattenNestedRamps divide by zero. Unfortunately this happened in a complex production pipeline, so I'm not entirely sure how to reproduce it. For that pipeline, this change fixes it and produces correct output. 08 March 2024, 16:50:20 UTC
8cc4f02 Fix for top-of-tree LLVM (#8145) 08 March 2024, 02:13:56 UTC
22868a4 Add sobel in hexagon benchmarks app for CMake builds (#8127) * Add sobel in hexagon_benchmarks app for CMake builds Resolved compilation errors caused by the eliminate interleave pass, which changed the instruction from halide.hexagon.pack_satub.vuh to halide.hexagon.trunc_satub.vuh. The latter is only available in v65 or later. This commit ensures compatibility with v65 and later versions. * Minor fix to address the issue. --------- Co-authored-by: Steven Johnson <srj@google.com> 06 March 2024, 21:40:00 UTC
754e6ec [vulkan] Add conform API methods to memory allocator to fix block allocations (#8130) * Add conform API methods to block and region allocator classes Override conform requests for Vulkan memory allocator Cleanup memory requirement constraints for Vulkan Add conform test cases to block_allocator runtime test. * Clang format/tidy pass * Fix unsigned int comparisons * Clang format pass * Fix other unsigned int comparisons * Fix mismatched template types for max() * Fix whitespace for clang format --------- Co-authored-by: Derek Gerstmann <dgerstmann@adobe.com> 06 March 2024, 19:46:23 UTC
aad94de Disable halide_xtensa_mul_add_f32 temporarily 05 March 2024, 22:43:44 UTC
b6449b3 Merge branch 'main' into xtensa-codegen 05 March 2024, 17:54:38 UTC
10e07e6 Add class template type deduction guides to avoid CTAD warning. (#8135) * Add class template type deduction guides to avoid CTAD warning. * Formatting. 05 March 2024, 17:53:29 UTC
05ae15a Make gpu thread and block for loop names opaque (#8133) This is one of our largest remaining types of magic names. These were explicitly constructed in lots of places and then explicitly checked for with ends_with in lots of places. This PR makes the names opaque. Only CanonicalizeGPUVars.cpp knows what they are, and they don't have to be a single fixed thing as long as they're consistent within a process. Also reduced the number of GPU dimensions to three more uniformly. We were already asserting this, but there was lots of dead code in lowering passes after gpu loop validation that allowed for four. Also fixed a bug I found in is_block_uniform. It didn't consider that the dependence on a gpu thread variable in a load index could be because a let variable it encountered depends on a gpu thread variable. 05 March 2024, 17:50:19 UTC
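The pattern being applied above, as a hedged sketch (the names and helpers below are illustrative, not Halide's actual API): one translation unit owns the magic string, and every other pass asks it, instead of scattering ends_with() checks.

```cpp
#include <string>

namespace {
// Private to this file; no other pass may assume the format.
const std::string gpu_block_marker = ".__gpu_block_";  // hypothetical
}  // namespace

std::string gpu_block_var_name(const std::string &func, int dim) {
    return func + gpu_block_marker + std::to_string(dim);
}

bool is_gpu_block_var(const std::string &name) {
    return name.find(gpu_block_marker) != std::string::npos;
}
```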