https://github.com/halide/Halide
Revision 474b9e193c460b7b620b610f58f9c7a075128548 authored by Andrew Adams on 13 October 2020, 23:01:42 UTC, committed by Andrew Adams on 13 October 2020, 23:01:42 UTC
Also, it's better to vectorize at the native vector width, for both the new
and baseline schedules. New inner loop (which does the same number of
multiply-adds, but in a 16x4 tile instead of an 8x8 tile):

```
	vmovdqu64	(%rcx,%rdx,8), %zmm5
	vmovq	(%r14,%rdx,8), %xmm6    # xmm6 = mem[0],zero
	vpermw	%zmm5, %zmm0, %zmm7
	vpermw	%zmm5, %zmm1, %zmm5
	vpermd	%zmm6, %zmm2, %zmm8
	vpermd	%zmm8, %zmm3, %zmm8
	vpmaddwd	%zmm8, %zmm7, %zmm7
	vpbroadcastd	%xmm6, %zmm6
	vpmaddwd	%zmm6, %zmm5, %zmm5
	vpaddd	%zmm5, %zmm4, %zmm4
	vpaddd	%zmm7, %zmm4, %zmm4
	incq	%rdx
	cmpq	$32, %rdx

```
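For readers unfamiliar with `vpmaddwd`, the following is a minimal scalar sketch of what the instruction computes, written in Python for illustration only. It is not part of the commit. The helper names `pmaddwd` and `multiply_adds_per_tile` are hypothetical. The model assumes the standard semantics: signed 16-bit lanes are multiplied pairwise, and each adjacent pair of products is summed into one signed 32-bit lane. On a 512-bit register that is 32 input lanes and 16 output lanes, so the two `vpmaddwd` instructions in the listing would account for 64 multiplies per iteration, which matches both tile shapes named in the message.

```python
def pmaddwd(a, b):
    """Scalar model of vpmaddwd (hypothetical helper, for illustration).

    a and b are equal-length lists of signed 16-bit values. Each adjacent
    pair of products is summed into one signed 32-bit result lane.
    """
    assert len(a) == len(b) and len(a) % 2 == 0
    out = []
    for i in range(0, len(a), 2):
        s = a[i] * b[i] + a[i + 1] * b[i + 1]
        # Wrap to signed 32 bits. The only input that overflows is all
        # four operands equal to -32768.
        s = (s + 2**31) % 2**32 - 2**31
        out.append(s)
    return out


def multiply_adds_per_tile(width, height):
    """Multiply-adds needed to update every element of a width x height tile once."""
    return width * height


# Two pairs of 16-bit lanes collapse into two 32-bit lanes.
print(pmaddwd([1, 2, 3, 4], [5, 6, 7, 8]))

# Both tile shapes from the commit message need the same amount of work.
print(multiply_adds_per_tile(16, 4), multiply_adds_per_tile(8, 8))
```

The sketch only models the arithmetic. It says nothing about how the `vpermw` and `vpermd` shuffles in the listing arrange the operands, which depends on the schedule.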
1 parent 21b7492
Use a binary reduction tree for outer iterations
File Mode Size
.github
apps
cmake
dependencies
doc
packaging
python_bindings
src
test
tools
tutorial
util
.clang-format -rw-r--r-- 1.5 KB
.clang-format-ignore -rw-r--r-- 265 bytes
.clang-tidy -rw-r--r-- 506 bytes
.gitattributes -rw-r--r-- 342 bytes
.gitignore -rw-r--r-- 1.1 KB
.gitmodules -rw-r--r-- 0 bytes
CMakeLists.txt -rw-r--r-- 4.5 KB
CODE_OF_CONDUCT.md -rw-r--r-- 3.5 KB
LICENSE.txt -rw-r--r-- 3.2 KB
Makefile -rw-r--r-- 95.7 KB
README.md -rw-r--r-- 19.6 KB
README_cmake.md -rw-r--r-- 66.5 KB
README_rungen.md -rw-r--r-- 12.1 KB
README_webassembly.md -rw-r--r-- 7.5 KB

README.md
