Revision 36d74a8cbf9c4129f608cd97d231961f1bd99c4c authored by Andrew Adams on 27 February 2024, 01:56:59 UTC, committed by GitHub on 27 February 2024, 01:56:59 UTC
* Avoid redundant scope lookups

This pattern has been bugging me for a long time:

```
if (scope.contains(key)) {
  Foo f = scope.get(key);
}
```

This looks up the key in the scope twice. I've finally gotten around to
fixing it. I've introduced a find method that returns a const pointer to
the value if it exists, or null if it doesn't. It also searches any
containing scopes, which are held by const pointer, so the method has to
return a const pointer.

```
if (const Foo *f = scope.find(key)) {
}
```

For cases where you want to get and then mutate, I added shallow_find,
which doesn't search enclosing scopes, but returns a mutable pointer.
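
A minimal sketch of the mutating counterpart (`Foo`, `scope`, and `key` are
the same placeholders used in the find example above):

```
// Only searches the innermost scope, so returning a non-const
// pointer is safe; enclosing scopes are never touched.
if (Foo *f = scope.shallow_find(key)) {
    *f = Foo();  // mutate in place, with a single lookup
}
```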

We were also doing redundant scope lookups in ScopedBinding. We stored
the key in the helper object, and then did a pop on that key in the
ScopedBinding destructor. This commit changes Scope so that Scope::push
returns an opaque token that you can pass to Scope::pop to have it
remove that element without doing a fresh lookup. ScopedBinding now uses
this. Under the hood it's just an iterator on the underlying map (map
iterators are not invalidated on inserting or removing other stuff).
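
Roughly, the token-based interface is used like this (a usage sketch; the
token's concrete type is an implementation detail, so `auto` is used here):

```
// push returns a token identifying the new entry...
auto token = scope.push(key, value);
// ... do work with the binding in place ...
// ...and pop(token) removes exactly that entry, no lookup needed.
scope.pop(token);
```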

The net effect is to speed up local laplacian lowering by about 5%.

I also considered making it look more like an STL class, and having find
return an iterator, but it doesn't really work. The iterator it returns
might point to an entry in an enclosing scope, in which case you can't
compare it to the .end() of the scope you have. Scopes are different
enough from maps that the interface really needs to be distinct.

* Pacify clang-tidy

* Fix unintentional mutation of interval in scope

* Fix accidental Scope::get

* Rewrite the skip stages lowering pass

Skip stages was slow due to poor computational complexity (quadratic?).

I reworked it into a two-pass linear-time algorithm. The first pass
records which pieces of IR are actually relevant to the task, and the
second pass performs the task using a bounds-inference-like algorithm.

On main resnet50 spends 519 ms in this pass. This commit reduces it to
40 ms. Local laplacian with 100 pyramid levels spends 7.4 seconds in
this pass. This commit reduces it to ~3 ms.

This commit also moves the cache store for memoized Funcs into the
produce node, instead of at the top of the consume node, because that
naturally places it inside the condition this pass injects into the
produce node.

* clang-tidy fixes

* Fix skip stages interaction with compute_with

* Unify let visitors, and use fewer stack frames for them

* Fix accidental leakage of .used into .loaded

* Visit the bodies of uninteresting let chains

* Another used -> loaded

* Fix hoist_storage not handling condition correctly.

---------

Co-authored-by: Steven Johnson <srj@google.com>
README_vulkan.md
# Vulkan Support for Halide

Halide supports the Khronos Vulkan framework as a compute API backend for GPU-like 
devices, and compiles directly to a binary SPIR-V representation as part of its 
code generation before submitting it to the Vulkan API. Both JIT and AOT usage 
are supported via the `vulkan` target flag (e.g. `HL_JIT_TARGET=host-vulkan`).

Vulkan support is actively under development, and considered *BETA* quality
at this stage. Tests are passing, but performance tuning and user testing are
needed to identify potential issues before rolling this into production.

See [below](#current-status) for details.

# Compiling Halide w/Vulkan Support

You'll need to configure Halide with the CMake option `TARGET_VULKAN` enabled (it is now ON by default).

For example, on Linux & OSX:

```
% cmake -G Ninja -DTARGET_VULKAN=ON -DCMAKE_BUILD_TYPE=Release -DLLVM_DIR=$LLVM_ROOT/lib/cmake/llvm -S . -B build
% cmake --build build --config Release
```

On Windows, you may need to specify the location of the Vulkan SDK if the paths aren't resolved by CMake automatically.  For example (assuming the Vulkan SDK is installed in the default path):

```
C:\> cmake -G Ninja -DTARGET_VULKAN=ON -DCMAKE_BUILD_TYPE=Release -DLLVM_DIR=%LLVM_ROOT%/lib/cmake/llvm -DVulkan_LIBRARY=C:\VulkanSDK\1.3.231.1\Lib\vulkan-1.lib -DVulkan_INCLUDE_DIR=C:\VulkanSDK\1.3.231.1\Include\vulkan -S . -B build
C:\> cmake --build build --config Release
```

# Vulkan Runtime Environment:

Halide has no direct dependency on Vulkan for code generation, but the runtime
requires a working Vulkan environment to run Halide-generated code. Any valid
Vulkan v1.0+ device driver should work.

Specifically, you'll need:

-   A vendor-specific Vulkan device driver
-   The generic Vulkan loader library

For AMD, NVIDIA, and Intel devices, download and install the latest graphics
driver for your platform. Vulkan support should be included.

## Windows 

To build Halide AOT generators, you'll need the Vulkan SDK (specifically the Vulkan loader library and headers):
https://sdk.lunarg.com/sdk/download/latest/windows/vulkan-sdk.exe

For Vulkan device drivers, consult the appropriate hardware vendor for your device.  A few common ones are listed below.

-   [AMD Vulkan Driver](https://www.amd.com/en/technologies/vulkan)
-   [NVIDIA Vulkan Driver](https://developer.nvidia.com/vulkan-driver)
-   [INTEL Vulkan Driver](https://www.intel.com/content/www/us/en/download-center/home.html)

## Linux 

On Ubuntu Linux v22.04, the Vulkan runtime is distributed in the `vulkan-tools` package. For earlier versions of Ubuntu (e.g. v20.x or v18.x), the contents of the `vulkan-tools` package were distributed as `vulkan-utils`, so use that package instead.

Proprietary drivers can be installed via `apt` using PPAs for each vendor. Examples for AMD and NVIDIA are provided below.

For AMD on Ubuntu v22.04:
```
$ sudo add-apt-repository ppa:oibaf/graphics-drivers
$ sudo apt update
$ sudo apt upgrade
$ sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools
```

For NVIDIA on Ubuntu v22.04:
```
$ sudo add-apt-repository ppa:graphics-drivers/ppa
$ sudo apt update
$ sudo apt upgrade
# - replace ### with latest driver release (e.g. 515)
$ sudo apt install nvidia-driver-### nvidia-settings vulkan vulkan-tools
```

Note that you should install only drivers that are valid for your system, since there
are reports of the Vulkan loader segfaulting just from having an unsupported driver
present. Specifically, the seemingly generic `mesa-vulkan-drivers` package actually
includes the AMD graphics driver, which can cause problems if installed on an
NVIDIA-only system.

## Mac

You're better off using Halide's Metal backend instead, but it is possible to run 
Vulkan apps on a Mac via the MoltenVK library:

-   [MoltenVK Project](https://github.com/KhronosGroup/MoltenVK)

The easiest way to get the necessary dependencies is to use the official MoltenVK SDK
installer provided by LunarG:

-   [MoltenVK SDK (Latest Release)](https://sdk.lunarg.com/sdk/download/latest/mac/vulkan-sdk.dmg)

Alternatively, if you have the [Homebrew](https://brew.sh/) package manager installed 
for MacOS, you can use it to install the Vulkan Loader and MoltenVK compatibility 
layer:

```
$ brew install vulkan-loader molten-vk
```

# Testing Your Vulkan Environment

You can validate that everything is configured correctly by running the `vulkaninfo`
app (bundled in the `vulkan-tools` package mentioned above) to make sure your device
is detected, e.g.:

```
$ vulkaninfo
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.224


Instance Extensions: count = 19
===============================
	...

Layers: count = 10
==================
VK_LAYER_KHRONOS_profiles (Khronos Profiles layer) Vulkan version 1.3.224, layer version 1:
	Layer Extensions: count = 0
	Devices: count = 1
		GPU id = 0 (NVIDIA GeForce RTX 3070 Ti)
		Layer-Device Extensions: count = 1

...

```

Make sure everything looks correct before continuing!

# Targeting Vulkan

To generate Halide code for Vulkan, simply add the `vulkan` flag to your target, along with any other optional device-specific features you wish to enable for Halide:

| Target Feature | Description | 
| --             | --          |
| `vulkan`       | Enables the Vulkan backend |
| `vk_int8`      | Allows 8-bit integer storage types to be used |
| `vk_int16`     | Allows 16-bit integer storage types to be used |
| `vk_int64`     | Allows 64-bit integer storage types to be used |
| `vk_float16`   | Allows 16-bit floating-point values to be used for computation |
| `vk_float64`   | Allows 64-bit floating-point values to be used for computation |
| `vk_v10`       | Generates code compatible with the Vulkan v1.0+ API |
| `vk_v12`       | Generates code compatible with the Vulkan v1.2+ API |
| `vk_v13`       | Generates code compatible with the Vulkan v1.3+ API |

Note that 32-bit integer and floating-point types are always available. All other optional device features are off by default, since they are not required by the Vulkan API and must be explicitly enabled to ensure that the generated code is compatible with the device and API version used for execution.

For AOT generators add `vulkan` (and any other flags you wish to use) to the target command line option:

```
$ ./lesson_15_generate -g my_first_generator -o . target=host-vulkan-vk_int8-vk_int16
```

For JIT apps use the `HL_JIT_TARGET` environment variable:

```
$ HL_JIT_TARGET=host-vulkan-vk_int8-vk_int16 ./tutorial/lesson_01_basics
```
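
For JIT use inside C++ code, the same feature string can also be used to construct a
`Halide::Target` programmatically. Below is a minimal sketch (the pipeline and schedule
are illustrative, not taken from the Halide tutorials):

```
#include "Halide.h"
#include <cstdio>

using namespace Halide;

int main() {
    // A trivial pipeline, just to have something to run.
    Func f("f");
    Var x("x"), y("y"), xi("xi"), yi("yi");
    f(x, y) = cast<float>(x + y);

    // Same feature string as the AOT and HL_JIT_TARGET examples above.
    Target target("host-vulkan-vk_int8-vk_int16");

    // Map the loop nest onto GPU blocks/threads so the Vulkan backend
    // has device work to schedule.
    f.gpu_tile(x, y, xi, yi, 8, 8);

    Buffer<float> out = f.realize({64, 64}, target);
    out.copy_to_host();
    printf("out(10, 10) = %f\n", out(10, 10));
    return 0;
}
```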

# Useful Runtime Environment Variables

To modify the default behavior of the runtime, the following environment 
variables can be used to adjust the configuration of the Vulkan backend 
at execution time:

`HL_VK_LAYERS=...` will tell Halide to choose a suitable Vulkan instance
that supports the given list of layers. If not set, `VK_INSTANCE_LAYERS=...` 
will be used instead. If neither is present, Halide will use the first 
Vulkan compute device it can find.  Multiple layers can be specified using 
the appropriate environment variable list delimiter (`:` on Linux/OSX/Posix, 
or `;` on Windows).

`HL_VK_DEVICE_TYPE=...` will tell Halide which type of device to select
when creating the Vulkan instance. Valid options are 'gpu',
'discrete-gpu', 'integrated-gpu', 'virtual-gpu', or 'cpu'. If not set,
Halide will search for the first 'gpu'-like device it can find, or fall back
to the first compute device it can find.

`HL_VK_ALLOC_CONFIG=...` will tell Halide to configure the Vulkan memory
allocator to use the given constraints, specified as five integer values
separated by the appropriate environment variable list delimiter
(e.g. `N:N:N:N:N` on Linux/OSX/Posix, or `N;N;N;N;N` on Windows). These values
correspond to `maximum_pool_size`, `minimum_block_size`, `maximum_block_size`,
`maximum_block_count` and `nearest_multiple`.

The `maximum_pool_size` constraint will tell Halide to configure the 
Vulkan memory allocator to never request more than N megabytes for the
entire pool of allocations for the context. This includes all resource 
blocks used for suballocations. Setting this to a non-zero value will 
limit the amount of device memory used by Halide, which may be useful when
other applications and frameworks are competing for resources. 
Default is 0 ... meaning no limit.

The `minimum_block_size` constraint will tell Halide to configure the 
Vulkan memory allocator to always request a minimum of N megabytes for 
a resource block, which will be used as a pool for suballocations.  
Increasing this value may improve performance while sacrificing the amount 
of available device memory. Default is 32MB.

The `maximum_block_size` constraint will tell Halide to configure the 
Vulkan memory allocator to never exceed a maximum of N megabytes for a 
resource block.  Decreasing this value may free up more memory but may 
impact performance, and/or restrict allocations to be unusably small. 
Default is 0 ... meaning no limit.

The `maximum_block_count` constraint will tell Halide to configure the 
Vulkan memory allocator to never exceed a total of N block allocations.  
Decreasing this value may free up more memory but may impact performance, 
and/or restrict allocations. Default is 0 ... meaning no limit.

The `nearest_multiple` constraint will tell Halide to configure the 
Vulkan memory allocator to always round up the requested allocation sizes
to the given integer value. This is useful for architectures that
require specific alignments for subregions allocated within a block.
Default is 32 ... setting this to zero means no constraint. 
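
For example, the following sketch sets the allocator configuration from C++ before the
first pipeline runs, using POSIX `setenv` (the specific numbers are illustrative, chosen
only to show how the five fields line up; on Windows use `;` separators and `_putenv_s`):

```
#include <cstdlib>

int main() {
    // maximum_pool_size=512 MB, minimum_block_size=16 MB,
    // maximum_block_size=0 (no limit), maximum_block_count=0 (no limit),
    // nearest_multiple=32. Set before any Halide pipelines execute.
    setenv("HL_VK_ALLOC_CONFIG", "512:16:0:0:32", /*overwrite*/ 1);

    // ... define and run Halide pipelines as usual ...
    return 0;
}
```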

# Debug Environment Variables

The following environment variables may be useful for tracking down potential
issues related to Vulkan:

`HL_DEBUG_CODEGEN=3` will print out debug info that includes output from the
SPIR-V code generator used for Vulkan while it is compiling.

`HL_SPIRV_DUMP_FILE=...` specifies a file to dump the binary SPIR-V generated
during compilation. Useful for debugging CodeGen issues. Can be inspected,
validated and disassembled via the SPIR-V tools:

https://github.com/KhronosGroup/SPIRV-Tools


# Current Status

All correctness tests are now passing on tested configs for Linux & Windows using the target `host-vulkan-vk_int8-vk_int16-vk_int64-vk_float16-vk_float64-vk_v13` on LLVM v14.x. 

MacOS passes most tests, but encounters internal MoltenVK code translation issues for wide vectors and ambiguous function calls.

Python apps, tutorials and correctness tests are now passing, but the AOT cases are skipped since the runtime environment needs to be customized to locate the platform-specific Vulkan loader library.

Android platform support is currently being worked on.

# Caveats:

-   Other than 32-bit floats and integers, every other data type is optional per the Vulkan spec
-   64-bit float types can be enabled, but there aren't any native math functions for them in SPIR-V
-   Only one dynamically sized shared memory allocation can be used, but any number of 
    fixed-size allocations are supported (up to the maximum amount allowed by the device)

# Known TODO:

-   Performance tuning of CodeGen and Runtime
-   More platform support (Android is a work in progress; RISC-V, etc.)
-   Adapt unsupported types to supported types (if missing vk_int8 then promote to uint32_t)?
-   Better debugging utilities using the Vulkan debug hooks.
-   Allow debug symbols to be stripped from SPIR-V during codegen to reduce
    memory overhead for large kernels.
-   Investigate floating point rounding and precision (v1.3 adds more controls)
-   Investigate memory model usage (can Halide gain anything from these?)
