Revision history - refs/heads/srj/apps-hamal - origin: https://github.com/halide/Halide

Revision	Author	Date	Message	Commit Date
1fbdf07	Steven Johnson	18 April 2024, 22:54:28 UTC	Initial push	18 April 2024, 22:54:28 UTC
7994e70	Andrew Adams	16 April 2024, 21:27:43 UTC	Fix corner case in if_then_else simplification (#8189) Co-authored-by: Steven Johnson <srj@google.com>	16 April 2024, 21:27:43 UTC
f4c7831	Andrew Adams	11 April 2024, 22:07:20 UTC	Don't print on parallel task entry/exit with -debug flag (#8185) Fixes #8184	11 April 2024, 22:07:20 UTC
dc83707	Steven Johnson	11 April 2024, 18:04:42 UTC	Add .npy support to debug_to_file() (#8177) * Add .npy support to halide_image_io The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that support floating-point data more robustly than any of the others that we current support. This adds load/save support for a useful subset: - We support the int/uint/float types common in Halide (except for f16/bf16 for now) - We don't support reading or writing files that are in `fortran_order` - We don't support any object/struct/etc files, only numeric primitives - We only support loading files that are in the host's endianness (typically little-endian) Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty. The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully. TODO: we could probably add this as an option for `debug_to_file()` without too much pain in a followup PR. * clang-tidy * clang-tidy * Address review comments * Allow for "keys" as well as 'keys' * Add .npy support to debug_to_file() Built on top of https://github.com/halide/Halide/pull/8175, this adds .npy as an option. This is actually pretty great because it's easy to do something like ``` ss = numpy.load("my_file.npy") print(ss) ``` in Python and get nicely-formatted output, which can sometimes be a lot easier for debugging that inserting lots of print() statements (see https://github.com/halide/Halide/issues/8176) Did a drive-by change to the correctness test to use this format instead of .mat. 
* Add float16 support * Add support for Float16 images in npy * Assume little-endian * Remove redundant halide_error() * naming convention * naming convention * Test both mat and npy * Don't call halide_error() * Use old-school parser * clang-tidy	11 April 2024, 18:04:42 UTC
8f3f6cf	Fabian Schuetze	11 April 2024, 16:58:36 UTC	Update Hexagon Install Instructions (#8182) update Hexagon install instructions	11 April 2024, 16:58:36 UTC
e3d3c8c	Martijn Courteaux	08 April 2024, 15:29:33 UTC	Fix unused variable. (#8180)	08 April 2024, 15:29:33 UTC
35f0c29	Steven Johnson	06 April 2024, 15:17:25 UTC	Add .npy support to halide_image_io (#8175) * Add .npy support to halide_image_io The .npy format is NumPy's native format for storing multidimensional arrays (aka tensors/buffers). Being able to load/save in this format makes it (potentially) a lot easier to interchange data with the Python ecosystem, as well as providing a file format that support floating-point data more robustly than any of the others that we current support. This adds load/save support for a useful subset: - We support the int/uint/float types common in Halide (except for f16/bf16 for now) - We don't support reading or writing files that are in `fortran_order` - We don't support any object/struct/etc files, only numeric primitives - We only support loading files that are in the host's endianness (typically little-endian) Note that at present this doesn't support f16 / bf16 formats, but that could likely be added with minimal difficulty. The tricky bit of this is that the reading code has to parse a (limited) Python dict in text form. Please review that part carefully. TODO: we could probably add this as an option for `debug_to_file()` without too much pain in a followup PR. * clang-tidy * clang-tidy * Address review comments * Allow for "keys" as well as 'keys' * Add float16 support * Use old-school parser * clang-tidy	06 April 2024, 15:17:25 UTC
14ae082	Andrew Adams	05 April 2024, 16:39:07 UTC	Clarify the meaning of Shuffle::is_broadcast() (#8158) * Fix horrifying bug in lossless_cast of a subtract * A 'broadcast' shuffle is more complex than it seems I was poking at the Shuffle node, and checking its usage, and it seems that despite the comment, Shuffles that return true for is_broadcast are not the same as a Broadcast node. Instead of repeating the input vector some number of times, it repeats a shuffle of the input vector. This means IRPrinter was incorrect. None of the other usages were bad. This PR makes this clearer in the comment, and fixes IRPrinter. * Revert accidental change	05 April 2024, 16:39:07 UTC
a462044	Alexander Root	05 April 2024, 16:38:46 UTC	Tighten bounds of abs() (#8168) * Tighten bounds of abs() * make abs bounds tight for non-int32 too * make int32 min expression match non-int32 min expression	05 April 2024, 16:38:46 UTC
7d99357	Steven Johnson	05 April 2024, 16:07:05 UTC	Add conversion code for Float16 that was missed in #8174 (#8178) * Add conversion code for Float16 that was missed in #8174 * Don't sniff for _Float16 when building ASAN * Update HalideRuntime.h	05 April 2024, 16:07:05 UTC
3b8a532	Steven Johnson	04 April 2024, 17:19:13 UTC	Add some missing _Float16 support (#8174) (Changes extracted from https://github.com/halide/Halide/pull/8169, which may or may not land in its current form) Some missing support for _Float16 that will likely be handy: - Allow _Float16 to be detected for Clang 15 (since my local XCode Clang 15 definitely supports it) - Expr(_Float16) - HALIDE_DECLARE_EXTERN_SIMPLE_TYPE(_Float16); - Add _Float16 to the convert matrix in halide_image_io.h	04 April 2024, 17:19:13 UTC
a4158c0	Andrew Adams	03 April 2024, 19:28:25 UTC	fix ub in lower rounding shift right (#8173) * Avoid out-of-range shifts in lower_rounding_shift_left/right Consider `lower_rounding_shift_right(a, (uint8)0)` The term b - 1 becomes 255, and now you have an out-of-range shift, which causes the simplifier to inject a signed_integer_overflow intrinsic, and compilation to fail. This is a little annoying because if b == 0, b_positive is a zero mask, so the result isn't used anyway (this is also why this change is legal). In llvm, it's a poison value, not UB, so masking it off works. If the simplifier were smarter, it might just drop the signed_integer_overflow intrinsic on detecting that it was being bitwise-and-ed with zero. But the safest thing to do is not overflow. saturating_add/sub are typically as cheap as add/sub. 99.9% of the time b is some positive constant anyway, so it's going to get constant-folded. * Add test	03 April 2024, 19:28:25 UTC
8864e8a	Roman Lebedev	18 March 2024, 23:09:09 UTC	Python bindings: `add_python_test()`: do set `HL_JIT_TARGET` too (#8156) This one took quite a bit of digging. I wanted to enable opencl tests on debian package, and `boundary_conditions.py`+`division.py` were failing when run with `HL_TARGET=host OCL_ICD_VENDORS=no-opencl-please.missing` env variables with `clGetPlatformIDs failed`, which made no sense to me. Empty `HL_JIT_TARGET` results in `opencl` being detected, unsurprisingly.	18 March 2024, 23:09:09 UTC
a132246	Andrew Adams	15 March 2024, 21:04:44 UTC	Fix two compute_with bugs. (#8152) * Fix two compute_with bugs. This PR fixes a bug in compute_with, and another bug I found while fixing it (we could really use a compute_with fuzzer). The first bug is that you can get into situations where the bounds of a producer func will refer directly to the loop variable of a consumer func, where the consumer is in a compute_with fused group. In main, that loop variable may not be defined because fused loop names have been rewritten to include the token ".fused.". This PR adds let stmts to define it just inside the fused loop body. The second bug is that not all parent loops in compute_with fused groups were having their bounds expanded to cover the region to be computed of all children, because the logic for deciding which loops to expand only considered the non-specialized pure definition. So e.g. compute_with applied to an update stage would fail to compute values of the child Func where they do not overlap with the parent Func. This PR visits all definitions of the parent Func of the fused group, instead of just the unspecialized pure definition of the parent Func. Fixes #8149 * clang-tidy	15 March 2024, 21:04:44 UTC
76a7dd4	Zalman Stern	15 March 2024, 20:01:51 UTC	Support for ARM SVE2. (#8051) * Checkpoint SVE2 restart. * Remove dead code. Add new test. * Update cmake for new file. * Checkpoint progress on SVE2. * Checkpoint ARM SVE2 support. Passes correctness_simd_op_check_sve2 test at 128 and 256 bits. * Remove an opportunity for RISC V codegen to change due to SVE2 support. * Ensure SVE intrinsics get vscale vectors and non-SVE ones get fixed vectors. Use proper prefix for neon intrinsics. Comment cleanups. * Checkpoint SVE2 work. Generally passes test, though using both NEON and SVE2 with simd_op_check_sve2 fails as both posibilities need to be allowed for 128-bit or smaller operations. * Remove an unfavored implementation possibility. * Fix opcode recognition in test to handle some cases that show up. Change name of test class to avoid confusion. * Formatting fixes. Replace internal_error with nop return for CodeGen_LLVM::match_vector_type_scalable called on scalar. * Formatting fix. * Limit SVE2 test to LLVM 19. Remove dead code. * Fix a degenerate case asking for zero sized vectors via a HAlide type with lanes of zero, which is not correct. * Fix confusion about Neon64/Neon128 and make it clear this is just the width multiplier applied to intrinsics. * REmove extraneous commented out line. * Address some review feedback. Mostly comment fixes. * Fix missed conflict resolution. * Fix some TODOs in SVE code. Move utility function to Util.h and common code the other obvious use. * Formatting. * Add missed refactor change. * Add issue to TODO comment. * Remove TODOs that don't seem necessary. * Add issue for TODO. * Add issue for TODO. * Remove dubious looking FP to int code that was ifdef'ed out. Doesn't look like a TODO is needed anymore. * Add issues for TODOs. 
* Update simd_op_check_sve2.cpp * Make a deep copy of each piece of test IR so that we can parallelize * Fix two clang-tidy warnings * Remove try/catch block from simd-op-check-sve2 * Don't try to run SVE2 code if vector_bits doesn't match host. * Add support for fcvtm/p, make scalars go through pattern matching too (#8151) * Don't do arm neon instruction selection on scalars This revealed a bug. FindIntrinsics was not enabled for scalars anyway, so it was semi-pointless. --------- Co-authored-by: Zalman Stern <zalman@macbook-pro.lan> Co-authored-by: Steven Johnson <srj@google.com> Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com>	15 March 2024, 20:01:51 UTC
f841a27	Volodymyr Kysenko	14 March 2024, 19:53:17 UTC	Bound allocation extents for hoist_storage using loop variables one-by-one (#8154) * Bound allocation extents using loop variable one-by-one * Use emplace_back	14 March 2024, 19:53:17 UTC
83616f2	Andrew Adams	13 March 2024, 00:00:49 UTC	Fix three nits (#8137) 1) has_gpu_feature already includes Vulkan, so there's no need to check for it. 2) Use emplace(...) instead of insert(make_pair(...)) 3) Fixed a place that should be using a ScopedValue	13 March 2024, 00:00:49 UTC
4988ab5	Martijn Courteaux	12 March 2024, 23:58:14 UTC	Feature: mark a Func as no_profiling, to prevent injection of profiling. (2nd implementation) (#8143) * Small feature to allow you to specify that a (typically small inner loop) Func should not be profiled. * Simplified the tuple name handling. * Optimize tuple name normalization in Profiling.cpp * Clang-format * Feedback on Function already being a pointer. Bump the Patch version of the serialization.	12 March 2024, 23:58:14 UTC
bf0d611	Andrew Adams	12 March 2024, 16:49:26 UTC	Rewrite the pass that adds mutexes for atomic nodes (#8105) * Avoid redundant scope lookups This pattern has been bugging me for a long time: ``` if (scope.contains(key)) { Foo f = scope.get(key); } ``` This redundantly looks up the key in the scope twice. I've finally gotten around to fixing it. I've introduced a find method that either returns a const pointer to the value, if it exists, or null. It also searches any containing scopes, which are held by const pointer, so the method has to return a const pointer. ``` if (const Foo *f = scope.find(key)) { } ``` For cases where you want to get and then mutate, I added shallow_find, which doesn't search enclosing scopes, but returns a mutable pointer. We were also doing redundant scope lookups in ScopedBinding. We stored the key in the helper object, and then did a pop on that key in the ScopedBinding destructor. This commit changes Scope so that Scope::push returns an opaque token that you can pass to Scope::pop to have it remove that element without doing a fresh lookup. ScopedBinding now uses this. Under the hood it's just an iterator on the underlying map (map iterators are not invalidated on inserting or removing other stuff). The net effect is to speed up local laplacian lowering by about 5% I also considered making it look more like an stl class, and having find return an iterator, but it doesn't really work. The iterator it returns might point to an entry in an enclosing scope, in which case you can't compare it to the .end() method of the scope you have. Scopes are different enough from maps that the interface really needs to be distinct. Pacify clang-tidy * Rewrite the pass that injects mutexes to support atomics For O(n) nested allocate nodes, this pass was quadratic in n, even if there was no use of atomics. This commit rewrites it to use a linear-time algorithm, and skips it entirely after the first validation pass if there aren't any atomic nodes. 
It also needlessly used IRGraphMutators, which slowed things down, didn't handle LargeBuffers (could overflow in the allocation), incorrectly thought every producer/consumer node was associated with an output buffer, and didn't print the realization name when printing the atomic node (the body of an atomic node is only atomic w.r.t. a specific realization). I noticed all this because it stuck out in a profile. For resnet 50, the rewrite that changed to a linear algorithm took this stage from 185ms down to 6.7ms, and then skipping it entirely when it doesn't find any atomic nodes added 1.5 for the single IRVisitor check. For local laplacian with 100 pyramid levels (which contains many nested allocate nodes due to a large number of skip connections), the times are 5846 ms -> 16 ms -> 4.6 ms This is built on top of #8103 * Fix unintentional mutation of interval in scope --------- Co-authored-by: Steven Johnson <srj@google.com>	12 March 2024, 16:49:26 UTC
3c2d809	Andrew Adams	12 March 2024, 00:05:44 UTC	Use python itself to get the extension suffix, not python-config (#8148) * Use python itself to get the extension suffix, not python-config * Add a comment	12 March 2024, 00:05:44 UTC
009fe7a	Andrew Adams	08 March 2024, 16:50:20 UTC	Handle loads of broadcasts in FlattenNestedRamps (#8139) With sufficiently perverse schedules, it's possible to end up with a load of a broadcast index (rather than a broadcast of a scalar load). This made FlattenNestedRamps divide by zero. Unfortunately this happened in a complex production pipeline, so I'm not entirely sure how to reproduce it. For that pipeline, this change fixes it and produces correct output.	08 March 2024, 16:50:20 UTC
8cc4f02	Steven Johnson	08 March 2024, 02:13:56 UTC	Fix for top-of-tree LLVM (#8145)	08 March 2024, 02:13:56 UTC
22868a4	Prasoon Mishra	06 March 2024, 21:40:00 UTC	Add sobel in hexagon benchmarks app for CMake builds (#8127) * Add sobel in hexagon_benchmarks app for CMake builds Resolved compilation errors caused by the eliminate interleave pass, which changed the instruction from halide.hexagon.pack_satub.vuh to halide.hexagon.trunc_satub.vuh. The latter is only available in v65 or later. This commit ensures compatibility with v65 and later versions. * Minor fix to address the issue. --------- Co-authored-by: Steven Johnson <srj@google.com>	06 March 2024, 21:40:00 UTC
754e6ec	Derek Gerstmann	06 March 2024, 19:46:23 UTC	[vulkan] Add conform API methods to memory allocator to fix block allocations (#8130) * Add conform API methods to block and region allocator classes Override conform requests for Vulkan memory allocator Cleanup memory requirement constraints for Vulkan Add conform test cases to block_allocator runtime test. * Clang format/tidy pas * Fix unsigned int comparisons * Clang format pass * Fix other unsigned int comparisons * Fix mismatched template types for max() * Fix whitespace for clang format --------- Co-authored-by: Derek Gerstmann <dgerstmann@adobe.com>	06 March 2024, 19:46:23 UTC
10e07e6	Zalman Stern	05 March 2024, 17:53:29 UTC	Add class template type deduction guides to avoid CTAD warning. (#8135) * Add class template type dedeuction guides to avoid CTAD warning. * Formatting.	05 March 2024, 17:53:29 UTC
05ae15a	Andrew Adams	05 March 2024, 17:50:19 UTC	Make gpu thread and block for loop names opaque (#8133) This is one of our largest remaining type of magic name. These were explicitly constructed in lots of places and then explicitly checked for with ends_with in lots of places. This PR makes the names opaque. Only CanonicalizeGPUVars.cpp knows what they are, and they don't have to be a single fixed thing as long as they're consistent within a process. Also reduced the number of GPU dimensions to three more uniformly. We were already asserting this, but there was lots of dead code in lowering passes after gpu loop validation that allowed for four. Also fixed a bug I found in is_block_uniform. It didn't consider that the dependence on a gpu thread variable in a load index could be because a let variable encountered depends on a gpu thread variable.	05 March 2024, 17:50:19 UTC
d33ffa2	Andrew Adams	05 March 2024, 17:50:07 UTC	Make realization order invariant to unique_name suffixes (#8124) * Make realization order invariant to unique_name suffixes * Add test * definition_order -> uint64 everywhere * Use visitation order instead of definition order --------- Co-authored-by: Steven Johnson <srj@google.com>	05 March 2024, 17:50:07 UTC
8b3312c	Martijn Courteaux	05 March 2024, 16:16:06 UTC	Add support for setting the default allocator and deallocator functions in Halide::Runtime::Buffer. (#8132)	05 March 2024, 16:16:06 UTC
7636c44	Andrew Adams	27 February 2024, 02:03:33 UTC	Remove two dead vars from the Makefile (#8125) These appear to be unused	27 February 2024, 02:03:33 UTC
36d74a8	Andrew Adams	27 February 2024, 01:56:59 UTC	Rewrite the skip stages lowering pass (#8115) * Avoid redundant scope lookups This pattern has been bugging me for a long time: ``` if (scope.contains(key)) { Foo f = scope.get(key); } ``` This redundantly looks up the key in the scope twice. I've finally gotten around to fixing it. I've introduced a find method that either returns a const pointer to the value, if it exists, or null. It also searches any containing scopes, which are held by const pointer, so the method has to return a const pointer. ``` if (const Foo *f = scope.find(key)) { } ``` For cases where you want to get and then mutate, I added shallow_find, which doesn't search enclosing scopes, but returns a mutable pointer. We were also doing redundant scope lookups in ScopedBinding. We stored the key in the helper object, and then did a pop on that key in the ScopedBinding destructor. This commit changes Scope so that Scope::push returns an opaque token that you can pass to Scope::pop to have it remove that element without doing a fresh lookup. ScopedBinding now uses this. Under the hood it's just an iterator on the underlying map (map iterators are not invalidated on inserting or removing other stuff). The net effect is to speed up local laplacian lowering by about 5% I also considered making it look more like an stl class, and having find return an iterator, but it doesn't really work. The iterator it returns might point to an entry in an enclosing scope, in which case you can't compare it to the .end() method of the scope you have. Scopes are different enough from maps that the interface really needs to be distinct. Pacify clang-tidy * Fix unintentional mutation of interval in scope * Fix accidental Scope::get * Rewrite the skip stages lowering pass Skip stages was slow due to crappy computational complexity (quadratic?) I reworked it into a two-pass linear-time algorithm. 
The first part remembers which pieces of IR are actually relevant to the task, and the second pass performs the task using a bounds-inference-like algorithm. On main resnet50 spends 519 ms in this pass. This commit reduces it to 40 ms. Local laplacian with 100 pyramid levels spends 7.4 seconds in this pass. This commit reduces it to ~3 ms. This commit also moves the cache store for memoized Funcs into the produce node, instead of at the top of the consume node, because it naturally places it inside a condition you inject into the produce node. * clang-tidy fixes * Fix skip stages interaction with compute_with * Unify let visitors, and use fewer stack frames for them * Fix accidental leakage of .used into .loaded * Visit the bodies of uninteresting let chains * Another used -> loaded * Fix hoist_storage not handling condition correctly. --------- Co-authored-by: Steven Johnson <srj@google.com>	27 February 2024, 01:56:59 UTC
2b5beb3	Andrew Adams	27 February 2024, 01:11:47 UTC	Fix hoist_storage not handling condition correctly. (#8123) The allocation condition wasn't getting relaxed over the scope and loop vars like the extents were.	27 February 2024, 01:11:47 UTC
aae84f6	Andrew Adams	26 February 2024, 17:56:17 UTC	Use a caching version of stmt_uses_vars in TightenProducerConsumer nodes (#8102) We were making a very large number stmt_uses_vars queries that covered the same sub-stmts. I solved it by adding a cache. Speeds up local laplacian lowering by 10% by basically removing this pass from the profile. Also a drive-by typo fix in Lower.cpp	26 February 2024, 17:56:17 UTC
4399ed8	Zalman Stern	23 February 2024, 04:07:47 UTC	Add Intel APX and AVX10 target flags and LLVM attribute setting. (#8052) * Add target flag and LLVM enables support for Intel AVX10. * Go ahead and add APX support as well. Correct spelling of APX target attributes. * Implement AVX10 and APX cpu feature detection. (As yet untested.) * Expand target feature flags for AVX10. --------- Co-authored-by: Steven Johnson <srj@google.com>	23 February 2024, 04:07:47 UTC
57164df	Andrew Adams	22 February 2024, 18:52:54 UTC	Avoid redundant scope lookups (#8103) * Avoid redundant scope lookups This pattern has been bugging me for a long time: ``` if (scope.contains(key)) { Foo f = scope.get(key); } ``` This redundantly looks up the key in the scope twice. I've finally gotten around to fixing it. I've introduced a find method that either returns a const pointer to the value, if it exists, or null. It also searches any containing scopes, which are held by const pointer, so the method has to return a const pointer. ``` if (const Foo *f = scope.find(key)) { } ``` For cases where you want to get and then mutate, I added shallow_find, which doesn't search enclosing scopes, but returns a mutable pointer. We were also doing redundant scope lookups in ScopedBinding. We stored the key in the helper object, and then did a pop on that key in the ScopedBinding destructor. This commit changes Scope so that Scope::push returns an opaque token that you can pass to Scope::pop to have it remove that element without doing a fresh lookup. ScopedBinding now uses this. Under the hood it's just an iterator on the underlying map (map iterators are not invalidated on inserting or removing other stuff). The net effect is to speed up local laplacian lowering by about 5% I also considered making it look more like an stl class, and having find return an iterator, but it doesn't really work. The iterator it returns might point to an entry in an enclosing scope, in which case you can't compare it to the .end() method of the scope you have. Scopes are different enough from maps that the interface really needs to be distinct.	22 February 2024, 18:52:54 UTC
ef31bf9	Andrew Adams	22 February 2024, 17:13:43 UTC	Do less redundant work in UnpackBuffers (#8104) We were redundantly creating a handle Variable every time we encountered something like foo.stride.0, instead of just the first time we encounter a Variable that refers to an input Parameter/Buffer. Speeds up this already-fast lowering pass by 10% or so. No measurable impact on total lowering time.	22 February 2024, 17:13:43 UTC
4613217	Andrew Adams	22 February 2024, 17:13:15 UTC	Optionally print the time taken by each lowering pass (#8116) * Optionally print the time taken by each lowering pass I've been copy-pasting this from branch to branch, but I should just check it in. This is useful for performance optimization of the compiler itself.	22 February 2024, 17:13:15 UTC
c4d56c6	Tarushii Goel	19 February 2024, 22:46:15 UTC	Small Tutorial Fix (#8111) * Update lesson_17_predicated_rdom.cpp * Update lesson_17_predicated_rdom.cpp	19 February 2024, 22:46:15 UTC
4fc1e57	Zalman Stern	16 February 2024, 21:58:23 UTC	Fix an issue where the Halide compiler hits an internal error for bool types in widening intrinsics. (#8099) * Fix an issue where the Halide compiler hits an internal error when bool types are used with e.g. widening_mul. This situation did not arise from user code doing this directly, but rather through some chain o lowering with float16 types. The test cases added to correctness_intrinsics target the issue directly and do fail without the fix. I did not add broader coverage for bool types and intrinsics as it would require more thinking. Most of them overflow for the true/true case and thus are of questionable use, however widening operations cannot overflow... Certainly we could define the language to forbid this, but currently the frontend does not do so. As indicated above, the use case driving this was not using bool arithmetic to begin with. * Formatting.	16 February 2024, 21:58:23 UTC
d9668c5	Steven Johnson	15 February 2024, 17:57:16 UTC	Fix clang-tidy error in runtime.printer.h (parameter shadows member) (#8074)	15 February 2024, 17:57:16 UTC
2855ca3	Andrew Adams	15 February 2024, 17:06:36 UTC	Strip asserts right at the end of lowering (#8094) The simplifier exploits asserts to make simplification. When compiling with NoAsserts, certain assertions aren't ever introduced, which means that the simplifier can't exploit certain things that we know to be true. Mostly this has a negative effect on code size. E.g. tail cases get generated even though they are actually dead code. This PR keeps all the assertions right until the end of lowering, when it strips them in a dedicated pass. This reduces object file size for a large production blob of Halide code by ~10%, without measurably affecting runtime.	15 February 2024, 17:06:36 UTC
e6e1b6f	Alex Reinking	15 February 2024, 01:58:55 UTC	Ensure string(REPLACE) is called with the right number of arguments (#8097)	15 February 2024, 01:58:55 UTC
9a740b5	Derek Gerstmann	14 February 2024, 22:41:51 UTC	[Vulkan] Region allocator fixes for memory requirements and allocations (#8087) * Add region allocator tests that check alignment, nearest_multiple and collect routines * Fix can_split() routine to use conformed sizes so that split allocation matches Fix region size accounting so that coalesce never has zero size regions to merge * Fix aligned_offset() routine to check for zero alignment (which means no constraint) * Fix ifdef for internal debugging * Clean up debug internal log messages * Use memory_requirements to determine nearest_multiple during initialization Query memory_requirements for each region, and reallocate if driver requires additional device memory * Formatting pass --------- Co-authored-by: Derek Gerstmann <dgerstmann@adobe.com>	14 February 2024, 22:41:51 UTC
b582561	Andrew Adams	14 February 2024, 21:57:09 UTC	Fix reduce_expr_modulo of vector in Solve.cpp (#8089) * Fix reduce_expr_modulo of vector in Solve.cpp * Fix test	14 February 2024, 21:57:09 UTC
f2d750f	Roman Lebedev	14 February 2024, 20:35:52 UTC	tests: correctness/float16_t: mark `__extendhfsf2` with default visibility (#8084) ``` [2336/4154] /usr/bin/clang++-17 -DHALIDE_ENABLE_RTTI -DHALIDE_VERSION_MAJOR=17 -DHALIDE_VERSION_MINOR=0 -DHALIDE_VERSION_PATCH=0 -DHALIDE_WITH_EXCEPTIONS -I/build/halide-17.0.0/test/common -I/build/halide-17.0.0/tools -I/build/halide-17.0.0/build/stage-1/halide/include -g -fdebug-default-version=4 -fprofile-use=/build/halide-17.0.0/build-profile/default.profdata -fcs-profile-generate -Xclang -mllvm -Xclang -vp-counters-per-site=100.0 -fuse-ld=lld-17 -Wl,--build-id=sha1 -std=c++17 -flto=thin -fPIE -fvisibility=hidden -fvisibility-inlines-hidden -Winvalid-pch -Xclang -include-pch -Xclang /build/halide-17.0.0/build/stage-1/halide/test/CMakeFiles/_test_internal.dir/cmake_pch.hxx.pch -Xclang -include -Xclang /build/halide-17.0.0/build/stage-1/halide/test/CMakeFiles/_test_internal.dir/cmake_pch.hxx -MD -MT test/correctness/CMakeFiles/correctness_float16_t.dir/float16_t.cpp.o -MF test/correctness/CMakeFiles/correctness_float16_t.dir/float16_t.cpp.o.d -o test/correctness/CMakeFiles/correctness_float16_t.dir/float16_t.cpp.o -c /build/halide-17.0.0/test/correctness/float16_t.cpp <...> ld.lld-17: error: undefined hidden symbol: __extendhfsf2 >>> referenced by float16_t.cpp:391 (/build/halide-17.0.0/test/correctness/float16_t.cpp:391) >>> lto.tmp:(main) >>> did you mean: __extendbfsf2 >>> defined in: /lib/x86_64-linux-gnu/libgcc_s.so.1 clang++-17: error: linker command failed with exit code 1 (use -v to see invocation) ```	14 February 2024, 20:35:52 UTC
40a622f	Roman Lebedev	14 February 2024, 20:34:23 UTC	clang does not support `_Float16` when targeting i386 (#8085) See https://github.com/halide/Halide/issues/7678	14 February 2024, 20:34:23 UTC
6edea16	Steven Johnson	14 February 2024, 20:26:27 UTC	Allow disabling of mutlithreading in simd op check (#8096) simd_op_check_xtensa is not threadsafe at present	14 February 2024, 20:26:27 UTC
c8f43f3	Andrew Adams	13 February 2024, 21:47:19 UTC	Parallelize some tests (#8078) * Parallelize some tests This reduces the time taken to run all correctness tests from 8:15 to 3:15 on my machine. * The FIXME is actually fine * Remove debug print * Fix when we're willing to run x86 code in simd_op_check * Use separate imageparams per task * Deep-copy the LoopLevels * Make float16_t neon op check test at least build * Revert accidental serialization * Throw return values from callable into the void We don't have a custom error handler in place, so they're always zero * Skip test under ASAN * Fix unintentional change to test	13 February 2024, 21:47:19 UTC
d8cfed6	Andrew Adams	13 February 2024, 21:47:09 UTC	Forward the partition methods from generator outputs (#8090)	13 February 2024, 21:47:09 UTC
ada6345	Andrew Adams	12 February 2024, 18:10:00 UTC	Fix rfactor adding too many pure loops (#8086) When you rfactor an update definition, the new update definition must use all the pure vars of the Func, even though the one you're rfactoring may not have used them all. We also want to preserve any scheduling already done to the pure vars, so we want to preserve the dims list and splits list from the original definition. The code accounted for this by checking the dims list for any missing pure vars and adding them at the end (just before Var::outermost()), but this didn't account for the fact that they may no longer exist in the dims list due to splits that didn't reuse the outer name. In these circumstances we could end up with too many pure loops. E.g. if x has been split into xo and xi, then the code was adding a loop for x even though there were already loops for xo and xi, which of course produces garbage output. This PR instead just checks which pure vars are actually used in the update definition up front, and then uses that to tell which ones should be added. Fixes #7890	12 February 2024, 18:10:00 UTC
9c3615b	Andrew Adams	11 February 2024, 18:41:01 UTC	Add checks to prevent people from using negative split factors (#8076) * Add checks to prevent people from using negative split factors Our analysis passes assume that loop maxes are greater than loop mins, so negative split factors cause sufficient havoc that not even output bounds queries are safe. These are therefore checked on pipeline entry. This is a new way for output bounds queries to throw errors (in addition to the buffer pointers themselves being null, and maybe some buffer constraints). Testing this, I realized these errors were getting thrown twice, because the output buffer bounds query in Pipeline::realize was built around two recursive calls to realize, and both were calling the custom error handler. In addition to reporting errors in this class twice, this implies several other inefficiencies, e.g. jit call args were being prepped twice. I reworked it to be built around two calls to call_jit_code instead. Fixes #7938 * Add test to cmakelists * Remove pointless target arg to call_jit_code It has to be the same as the cached target in the receiving object anyway	11 February 2024, 18:41:01 UTC
22581bf	Steven Johnson	11 February 2024, 18:40:09 UTC	Remove OpenGLCompute (#8077) * Remove OpenGLCompute This was supposed to be removed in Halide 17 (oops), removing for Halide 18 * Update dynamic_allocation_in_gpu_kernel.cpp * Update dynamic_allocation_in_gpu_kernel.cpp * Update halide_ir.fbs	11 February 2024, 18:40:09 UTC
a3baa5d	James Price	09 February 2024, 18:39:21 UTC	[WebGPU] Update to latest native headers (#8081) * [WebGPU] Update to latest native headers * Remove #ifdef for `requiredFeature[s]Count` * Pass nullptr to wgpuCreateInstance * Emscripten currently requires this * Dawn accepts it too * Use nullptr for another wgpuCreateInstance call	09 February 2024, 18:39:21 UTC
de8e39d	Steven Johnson	09 February 2024, 16:55:00 UTC	Bump serialization version to 18.0.0 (#8080) * Bump serialization version to 18.0.0 As a matter of policy, we should probably bump the version of the serialization format for every version of Halide -- even if changes are minimal-to-nonexistent -- to reinforce the fact that this isn't intended in any way as a long-term archival format. This PR suggests that we bump the major version to match the main Halide version, but I'm open for other suggestions. * Update halide_ir.fbs	09 February 2024, 16:55:00 UTC
55dfa39	Zalman Stern	07 February 2024, 18:23:46 UTC	Add an easy way to print vectors in debug output. (#8072) * Add helper to print containers, or at least vectors, in debug info. * Add documentation comments. * Formatting. * Name change.	07 February 2024, 18:23:46 UTC
39e5c08	Andrew Adams	07 February 2024, 17:49:06 UTC	Better validation of gpu schedules (#8068) * Update makefile to use test/common/terminate_handler.cpp This means we actually print error messages when using exceptions and the makefile * Better validate of GPU schedules GPU loop constraints were checked in two different places. Checking them in ScheduleFunctions was incorrect because it didn't consider update definitions and specializations. Checking them in FuseGPUThreadLoops was too late, because the Var names have gone (they've been renamed to things like __thread_id_x). Furthermore, some problems were internal errors or runtime errors when they should have been user errors. We allowed 4d thread and block dimensions, but then hit an internal error. This PR centralizes checking of GPU loop structure in CanonicalizeGPUVars and adds more helpful error messages that print the problematic loop structure. E.g: ``` Error: GPU thread loop over f$8.s0.v0 is inside three other GPU thread loops. The maximum number of nested GPU thread loops is 3. The loop nest is: compute_at for g$8: for g$8.s0.v7: for g$8.s0.v6: for g$8.s0.v5: for g$8.s0.v4: gpu_block g$8.s0.v3: gpu_block g$8.s0.v2: gpu_thread g$8.s0.v1: gpu_thread g$8.s0.v0: store_at for f$8: compute_at for f$8: gpu_thread f$8.s0.v1: gpu_thread f$8.s0.v0: ``` Fixes the bug found in #7946 * Delete dead code * Actually clear the ostringstream	07 February 2024, 17:49:06 UTC
37153a9	Derek Gerstmann	07 February 2024, 17:43:58 UTC	Fix bool conversion bug in Vulkan code generator (#8067) * Fix bug in Vulkan code generator that was incorrectly passing the address of a byte vector, instead of its contents to builder.declare_constant() * Add bool_predicate_cast correctness test to verify bool conversion for Vulkan codegen works as expected --------- Co-authored-by: Derek Gerstmann <dgerstmann@adobe.com>	07 February 2024, 17:43:58 UTC
78a0762	Prasoon Mishra	07 February 2024, 17:41:51 UTC	Add hexagon_benchmarks app for CMake builds (#8069) * Add hexagon_benchmarks app for CMake builds * Removed unnecessary -lc++abi flag from GCC build	07 February 2024, 17:41:51 UTC
84fe565	Steven Johnson	07 February 2024, 17:41:21 UTC	Outsmart the LLVM optimizer (#8073) The old definitions of bool_1, bool_2, bool_3 in simd_op_check_x86 (etc) all referred to the same entry in in_f32; as of https://github.com/llvm/llvm-project/pull/76367, the LLVM optimizer is smart enough to realize that (eg) bool1 != bool2 by construction, and optimizes away the code that tests their conditions, such as the one for andps and orps. Initing them from different locations is enough to outsmart the compiler. (bug was only noticed in the x86 test, but I updated the other tests to guard against future improvements there too.)	07 February 2024, 17:41:21 UTC
665804c	Steven Johnson	06 February 2024, 23:34:29 UTC	Don't require Halide_WebGPU when using wasm (#8063) (#8065) * Don't require Halide_WebGPU when using wasm (#8063) * trigger buildbots	06 February 2024, 23:34:29 UTC
93bff95	Teo	06 February 2024, 23:34:02 UTC	add unsafe_promise_clamped (#8071) add unsafe_promise_clamp	06 February 2024, 23:34:02 UTC
80e2081	Andrew Adams	05 February 2024, 22:25:05 UTC	Update makefile to use test/common/terminate_handler.cpp (#8066) This means we actually print error messages when using exceptions and the makefile	05 February 2024, 22:25:05 UTC
e2448fe	Andrew Adams	01 February 2024, 17:46:10 UTC	Fix type error in VectorizeLoops (#8055)	01 February 2024, 17:46:10 UTC
47378ee	Steven Johnson	29 January 2024, 01:28:13 UTC	Enable `bugprone-switch-missing-default-case` (#8048) * Upgrade clang-format and clang-tidy to use LLVM 17 * trigger buildbots * trigger buildbots * trigger buildbots * trigger buildbots * Enable `bugprone-switch-missing-default-case` ...and fix existing warnings. * Update .clang-tidy * Update Parameter.cpp * Update .clang-tidy * Update .clang-tidy * Update .clang-tidy * Update .clang-tidy * Update CPlusPlusMangle.cpp	29 January 2024, 01:28:13 UTC
4b2d211	Steven Johnson	27 January 2024, 00:33:24 UTC	Upgrade clang-format and clang-tidy to use LLVM 17 (#8042) * Upgrade clang-format and clang-tidy to use LLVM 17 * trigger buildbots * trigger buildbots * trigger buildbots * trigger buildbots	27 January 2024, 00:33:24 UTC
45d7850	Andrew Adams	26 January 2024, 20:01:41 UTC	Track whether or not let expressions failed to solve in solver (#7982) * Track whether or not let expressions failed to solve in solver After mutating an expression, the solver needs to know two things: 1) Did the expression contain the variable we're solving for 2) Was the expression successfully "solved" for the variable. I.e. the variable only appears once in the leftmost position. We need to know this to know property 1 of any subexpressions (i.e. does the right child of the expression contain the variable). This drives what transformations we do in ways that are guaranteed to terminate and not take exponential time. We were tracking property 1 through lets but not property 2, and this meant we were doing unhelpful transformations in some cases. I found a case in the wild where this made a pipeline take > 1 hour to compile (I killed it after an hour). It may have been in an infinite transformation loop, or it might have just been exponential. Not sure. * Remove surplus comma * Fix use of uninitialized value that could cause bad transformation	26 January 2024, 20:01:41 UTC
3657cf5	Andrew Adams	26 January 2024, 17:26:12 UTC	Fix bounds_of_nested_lanes (#8039) * Fix bounds_of_nested_lanes bounds_of_nested_lanes assumed that one layer of nested vectorization could be removed at a time. When faced with the expression: min(ramp(x8(a), x8(b), 5), x40(27)) It panicked, because on the left hand side it reduced the bounds to x8(a) ... x8(a) + x8(b) * 4, and on the right hand side it reduced the bounds to 27. It then attempted to take a min of mismatched types. In general we can't assume that binary operators on nested vectors have the same nesting structure on both sides, so I just rewrote it to reduce directly to a scalar. Fixes #8038	26 January 2024, 17:26:12 UTC
4590a09	Andrew Adams	26 January 2024, 01:07:40 UTC	Fix for llvm trunk: Force-include more runtime types (#8045) * Fix for llvm trunk: Force-include more runtime types * Include the force-include-types module first * Fix comment * Expand comment	26 January 2024, 01:07:40 UTC
c1923f3	Steven Johnson	24 January 2024, 23:53:28 UTC	HALIDE_VERSION_MAJOR -> 18 (#8044)	24 January 2024, 23:53:28 UTC
6177e51	Steven Johnson	24 January 2024, 20:04:19 UTC	Update Halide version to 18 (#8043)	24 January 2024, 20:04:19 UTC
9b9dfaf	Andrew Adams	24 January 2024, 19:12:17 UTC	Update Makefile for llvm 19 (#8040)	24 January 2024, 19:12:17 UTC
90e909d	Steven Johnson	24 January 2024, 18:44:47 UTC	Allow LLVM 19 in CMake (#8041)	24 January 2024, 18:44:47 UTC
e0e9f63	Steven Johnson	22 January 2024, 21:43:00 UTC	Tweak the Printer code in runtime for smaller code (#8023) * Tweak the Printer code in runtime for smaller code TL;DR: template expansion meant that we had more replicated code than expected from the inline expansion of code in Printer and friends. Restructured and added NEVER_INLINE to try to make the call sites as small as possible. It's a modest code-size savings but nonzero... e.g., the linux-x86-64 .o output from correct_cross_compilation drops from 164280 bytes to 162936 bytes. * Update printer.h * debug * Update HalideTestHelpers.cmake * Update printer.h * fixes	22 January 2024, 21:43:00 UTC
22f9bb9	Steven Johnson	17 January 2024, 16:26:43 UTC	Add test for #8029 (#8032) Tweak correctness_float16_t so that it uses one of the transcendental functions (sqrt) that were missing in Metal.	17 January 2024, 16:26:43 UTC
3a77204	Steven Johnson	17 January 2024, 15:35:07 UTC	Require LLVM >= 16.0 (#8003) * Require LLVM >= 16.0 Per policy, we only support top-of-tree LLVM, plus two versions back; let's update to require LLVM >= 16, and drop workarounds for older versions. * LLVM_VERSION < 170	17 January 2024, 15:35:07 UTC
d2eed57	Steven Johnson	16 January 2024, 20:00:36 UTC	Fix build breakage for wasm targets (#8031) Update HalideTestHelpers.cmake	16 January 2024, 20:00:36 UTC
8d3c12e	Mike Woodworth	16 January 2024, 18:55:53 UTC	adds mappings for f16 variants of halide float math (#8029) * adds mappings for f16 variants of halide float math * fix clang format errors * trigger buildbots --------- Co-authored-by: Steven Johnson <srj@google.com>	16 January 2024, 18:55:53 UTC
91b063d	Volodymyr Kysenko	09 January 2024, 04:57:15 UTC	Stronger chain detection in LoopCarry pass (#8016) * Stronger chain detection in LoopCarry * Make sure that types are the same * Add a comment * Run CSE before calling can_prove * Test for loop carry * clang-tidy * Add missing override * Update comments	09 January 2024, 04:57:15 UTC
cdebeb8	Tom Westerhout	09 January 2024, 01:33:08 UTC	Fix -Wstrict-prototype warnings in HalideRuntime.h (#8027) When HalideRuntime.h is included in a C file, functions that are declared with `()` instead of `(void)` for their arguments change meaning. These may cause issues downstream because different code is generated.	09 January 2024, 01:33:08 UTC
21accad	Steven Johnson	04 January 2024, 17:04:34 UTC	Set warnings on tests as well as src (#8022) * Don't use variable-length arrays There was a rogue use of VLAs (an extension we don't want to use) in one of the runtime tests. Fixed the test. I'll follow up with a separate PR to ensure this warning is enabled everywhere to flush out other usages. * Set warnings on tests as well as src	04 January 2024, 17:04:34 UTC
daf011d	Steven Johnson	04 January 2024, 17:04:18 UTC	Don't use variable-length arrays (#8021) There was a rogue use of VLAs (an extension we don't want to use) in one of the runtime tests. Fixed the test. I'll follow up with a separate PR to ensure this warning is enabled everywhere to flush out other usages.	04 January 2024, 17:04:18 UTC
b661c8d	Zalman Stern	04 January 2024, 01:49:56 UTC	Quick fix for crash that is occurring in SVE2 tests. (#8020) Broken out into separate PR for ease of review and isolated test/tracking.	04 January 2024, 01:49:56 UTC
d2da007	Steven Johnson	03 January 2024, 20:05:37 UTC	Fix for top-of-tree LLVM (Fix #8017) (#8018) Fix for top-of-tree LLVM	03 January 2024, 20:05:37 UTC
8024bdc	Volodymyr Kysenko	02 January 2024, 22:52:53 UTC	Don't add ring_buffer semaphores if the function is not scheduled as async (#8015) Don't add ring_buffer semaphores if the function is not scheduled as async Co-authored-by: Steven Johnson <srj@google.com>	02 January 2024, 22:52:53 UTC
6f26b04	Tyler Hou	02 January 2024, 18:27:51 UTC	Change startswith -> starts_with (#8013) startswith was deprecated in llvm/llvm-project#75491, which means that Halide fails to compile using LLVM 18 (deprecation warning).	02 January 2024, 18:27:51 UTC
61b8d38	Volodymyr Kysenko	19 December 2023, 22:14:05 UTC	Scheduling directive to support ring buffering (#7967) * Half-plumbed * Revert "Half-plumbed" This reverts commit eb9dd02c6c607f0b49c95258ae67f58fe583ff44. * Interface for double buffer * Update Provides, Calls and Realizes for double buffering * Proper sync for double buffering * Use proper name for the semaphore and use correct initial value * Rename the class * Pass expression for index * Adds storage for double buffering index * Use a separate index to go through the double buffer * Failing test * Better handling of hoisted storage in all of the async-related passes * New test and clean-up the generated IR * More tests * Allow double buffering without async and add corresponding test * Filter out incorrect double_buffer schedules * Add tests to the cmake files * Clean up * Update the comment * Clean up * Clean up * Update serialization * complete_x86_target() should enable F16C and FMA when AVX2 is present (#7971) All known AVX2-enabled architectures definitely have these features. 
* Add two new tail strategies for update definitions (#7949) * Add two new tail strategies for update definitions * Stop printing asm * Update expected number of partitions for Partition::Always * Add a comment explaining why the blend safety check is per dimension * Add serialization support for the new tail strategies * trigger buildbots * Add comment --------- Co-authored-by: Steven Johnson <srj@google.com> * Add appropriate mattrs for arm-32 extensions (#7978) * Add appropriate mattrs for arm-32 extensions Fixes #7976 * Pull clauses out of if * Move canonical version numbers into source, not build system (#7980) (#7981) * Move canonical version numbers into source, not build system (#7980) * Fixes * Silence useless "Insufficient parallelism" autoscheduler warning (#7990) * Add a notebook with a visualization of the approx_* functions and their errors (#7974) * Add a notebook with a visualization of the approx_* functions and their errors * Fix spelling error * Make narrowing float->int casts on wasm go via wider ints (#7973) Fixes #7972 * Fix handling of assert statements whose conditions get vectorized (#7989) * Fix handling of assert statements whose conditions get vectorized * Fix test name * Fix all "unscheduled update()" warnings in our code (#7991) * Fix all "unscheduled update()" warnings in our code And also fix the Mullapudi scheduler to explicitly touch all update stages. This allows us to mark this warning as an error if we so choose. 
* fixes * fixes * Update recursive_box_filters.cpp * Silence useless 'Outer dim vectorization of var' warning in Mullapudi… (#7992) Silence useless 'Outer dim vectorization of var' warning in Mullapudi scheduler * Add a tutorial for async and double_buffer * Renamed double_buffer to ring_buffer * ring_buffer() now expects an extent Expr * Actually use extent for ring_buffer() * Address some of the comments * Provide an example of the code structure for producer-consumer async example * Comments updates * Fix clang-format and clang-tidy * Add Python binding for Func::ring_buffer() * Don't use a separate index for ring buffer + add a new test * Rename the tests * Clean up the old name * Add & * Move test to the right folder * Move expr * Add comments for InjectRingBuffering * Improve ring_buffer doc * Fix comments * Comments * A better error message * Mention that extent is expected to be a positive integer * Add another code structure and explain how the indices for ring buffer are computed * Expand test comments * Fix spelling --------- Co-authored-by: Steven Johnson <srj@google.com> Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com>	19 December 2023, 22:14:05 UTC
6bcb695	Steven Johnson	15 December 2023, 00:27:56 UTC	Update Halide version in setup.py to 17.0.0 (#8010)	15 December 2023, 00:27:56 UTC
6d29ad5	Steven Johnson	13 December 2023, 17:02:37 UTC	Add missing Python bindings for various recent additions to Func and Stage (#8002) * Add missing Python bindings for various recent additions to Func and Stage We have been sloppy about maintaining these. Also added a bit of testing. * Update PyEnums.cpp	13 December 2023, 17:02:37 UTC
3d5cf40	Martijn Courteaux	12 December 2023, 17:50:56 UTC	Inject profiling for function calls to 'halide_copy_to_host' and 'halide_copy_to_device'. (#7913) * Inject profiling for function calls to 'halide_copy_to_host' and 'halide_copy_to_device'. * WIP: I get segfaults. The device_interface pointer is bogus. * Figured it out... * Allow global sync on d3d12. * Cleanly time all buffer copies as well. * Cleanup old comment. * Following Andrews suggestion for suffixing buffer copies in the profiler. * Sort the profiler report lines into three sections: funcs, buffer copy to device, and buffer copy to host. * Inject profiling for function calls to 'halide_copy_to_host' and 'halide_copy_to_device'. * WIP: I get segfaults. The device_interface pointer is bogus. * Figured it out... * Allow global sync on d3d12. * Cleanly time all buffer copies as well. * Cleanup old comment. * Following Andrews suggestion for suffixing buffer copies in the profiler. * Sort the profiler report lines into three sections: funcs, buffer copy to device, and buffer copy to host. * Attempt to fix output parsing. * Fix crash for copy_to_device * halide_device_sync_global(NULL) -> success * Fixed the buffer copy bug. Added a new test that will cause buffer copies in two directions within the compiled pipeline. This will catch this better in the future. Tweaked the profile report section header printing. * Clang-format, my dear friend...	12 December 2023, 17:50:56 UTC
357e646	Steven Johnson	08 December 2023, 19:17:30 UTC	Do some basic validation of Target Features (#7986) (#7987) * Do some basic validation of Target Features (#7986) * Update Target.cpp * Update Target.cpp * Fixes * Update Target.cpp * Improve error messaging. * format * Update Target.cpp	08 December 2023, 19:17:30 UTC
9c099c2	Andrew Adams	08 December 2023, 17:53:04 UTC	Teach unrolling to exploit conditions in enclosing ifs (#7969) * Teach unrolling to exploit conditions in enclosing ifs Fixes #7968 * Handle vectorization as well * Remove unused usings * Add missing print	08 December 2023, 17:53:04 UTC
9643518	Steven Johnson	08 December 2023, 17:50:32 UTC	Add join_strings() call and use it from mattrs() (#7997) * Add join_strings() call and use it from mattrs() This is a super-nit kind of fix, but the fact that we had rerolled a join-strings algo in a half-dozen places made my teeth hurt, so I decided to fix it: - Add join_strings() to Util.h - revise the mattrs() calls to use it instead of the janky mess they used This doesn't move the needle on code size or speed but it is less weird. Probably other places we could/should use this too. (Does C++20 have join/split strings in the std library yet? If not, why not?) * Update Util.h * Update Util.h * clang-tidy	08 December 2023, 17:50:32 UTC
19c1c81	Steven Johnson	08 December 2023, 16:50:01 UTC	Make wasm +sign-ext and +nontrapping-fptoint the default (#7995) * Make wasm +sign-ext and +nontrapping-fptoint the default These have been supported in ~all wasm runtimes for a while now, and +nontrapping-fptoint in particular can make a big performance difference. We should enable these by default, and add a new backdoor (wasm_mvponly) for code paths that need to use the original wasm Minimum Viable Product spec only. * Update simd_op_check_wasm.cpp	08 December 2023, 16:50:01 UTC
5aa891a	Steven Johnson	07 December 2023, 18:03:06 UTC	Silence useless 'Outer dim vectorization of var' warning in Mullapudi… (#7992) Silence useless 'Outer dim vectorization of var' warning in Mullapudi scheduler	07 December 2023, 18:03:06 UTC
df36139	Steven Johnson	07 December 2023, 18:02:42 UTC	Fix all "unscheduled update()" warnings in our code (#7991) * Fix all "unscheduled update()" warnings in our code And also fix the Mullapudi scheduler to explicitly touch all update stages. This allows us to mark this warning as an error if we so choose. * fixes * fixes * Update recursive_box_filters.cpp	07 December 2023, 18:02:42 UTC
83febb0	Andrew Adams	07 December 2023, 17:46:27 UTC	Fix handling of assert statements whose conditions get vectorized (#7989) * Fix handling of assert statements whose conditions get vectorized * Fix test name	07 December 2023, 17:46:27 UTC
d1ecc1f	Andrew Adams	07 December 2023, 16:06:57 UTC	Make narrowing float->int casts on wasm go via wider ints (#7973) Fixes #7972	07 December 2023, 16:06:57 UTC
6e57d6c	Volodymyr Kysenko	07 December 2023, 16:06:31 UTC	Add a notebook with a visualization of the approx_* functions and their errors (#7974) * Add a notebook with a visualization of the approx_* functions and their errors * Fix spelling error	07 December 2023, 16:06:31 UTC
9f6ec17	Steven Johnson	07 December 2023, 00:59:53 UTC	Silence useless "Insufficient parallelism" autoscheduler warning (#7990)	07 December 2023, 00:59:53 UTC
17b7366	Steven Johnson	06 December 2023, 23:03:14 UTC	Move canonical version numbers into source, not build system (#7980) (#7981) * Move canonical version numbers into source, not build system (#7980) * Fixes	06 December 2023, 23:03:14 UTC
209ec02	Andrew Adams	05 December 2023, 22:15:23 UTC	Add appropriate mattrs for arm-32 extensions (#7978) * Add appropriate mattrs for arm-32 extensions Fixes #7976 * Pull clauses out of if	05 December 2023, 22:15:23 UTC
