Content - 08819ac17eb965a472943a4557a3fc93e17e9ddf - 988538c/docs/target-compatibility.md

visit type:
Tip revision: 1b6cea2219307f6271e131c43d6e8f48910bd435 authored by Yong He on 21 April 2022, 18:03:30 UTC
Made translation units visible to transitive `import`s. (#2197)
Tip revision: 1b6cea2
target-compatibility.md
Slang Target Compatibility 
==========================

Shader Model (SM) numbers are D3D Shader Model versions, unless explicitly stated otherwise.
OpenGL compatibility is not listed here, because OpenGL isn't an officially supported target. 

Items with a + means that the feature is anticipated to be added in the future.
Items with ^ means there is some discussion about support later in the document for this target.

| Feature                     |    D3D11     |    D3D12     |     VK     |      CUDA     |    CPU
|-----------------------------|--------------|--------------|------------|---------------|---------------
| Half Type                   |     No       |     Yes ^    |   Yes      |     Yes ^     |    No +
| Double Type                 |     Yes      |     Yes      |   Yes      |     Yes       |    Yes
| Double Intrinsics           |     No       |   Limited +  |  Limited   |     Most      |    Yes
| u/int8_t Type               |     No       |   No         |   Yes ^    |     Yes       |    Yes
| u/int16_t Type              |     No       |   Yes ^      |   Yes ^    |     Yes       |    Yes
| u/int64_t Type              |     No       |   Yes ^      |   Yes      |     Yes       |    Yes
| u/int64_t Intrinsics        |     No       |   No         |   Yes      |     Yes       |    Yes
| int matrix                  |     Yes      |   Yes        |   No +     |     Yes       |    Yes
| tex.GetDimension            |     Yes      |   Yes        |   Yes      |     No        |    Yes
| SM6.0 Wave Intrinsics       |     No       |   Yes        |  Partial   |     Yes ^     |    No
| SM6.0 Quad Intrinsics       |     No       |   Yes        |   No +     |     No        |    No
| SM6.5 Wave Intrinsics       |     No       |   Yes ^      |   No +     |     Yes ^     |    No
| WaveMask Intrinsics         |     Yes ^    |   Yes ^      |   Yes +    |     Yes       |    No
| WaveShuffle                 |     No       |   Limited ^  |   Yes      |     Yes       |    No
| Tesselation                 |     Yes ^    |   Yes ^      |   No +     |     No        |    No
| Graphics Pipeline           |     Yes      |   Yes        |   Yes      |     No        |    No
| Ray Tracing DXR 1.0         |     No       |   Yes ^      |   Yes ^    |     No        |    No
| Ray Tracing DXR 1.1         |     No       |   Yes        |   No +     |     No        |    No
| Native Bindless             |     No       |    No        |   No       |     Yes       |    Yes
| Buffer bounds               |     Yes      |   Yes        |   Yes      |   Limited ^   |    Limited ^
| Resource bounds             |     Yes      |   Yes        |   Yes      | Yes (optional)|    Yes
| Atomics                     |     Yes      |   Yes        |   Yes      |     Yes       |    Yes
| Group shared mem/Barriers   |     Yes      |   Yes        |   Yes      |     Yes       |    No + 
| TextureArray.Sample float   |     Yes      |   Yes        |   Yes      |     No        |    Yes
| Separate Sampler            |     Yes      |   Yes        |   Yes      |     No        |    Yes
| tex.Load                    |     Yes      |   Yes        |   Yes      |  Limited ^    |    Yes
| Full bool                   |     Yes      |   Yes        |   Yes      |     No        |    Yes ^ 
| Mesh Shader                 |     No       |   No +       |   No +     |     No        |    No
| `[unroll]`                  |     Yes      |   Yes        |   Yes ^    |     Yes       |    Limited + 
| Atomics                     |     Yes      |   Yes        |   Yes      |     Yes       |    No + 
| Atomics on RWBuffer         |     Yes      |   Yes        |   Yes      |     No        |    No + 
| Sampler Feedback            |     No       |   Yes        |   No +     |     No        |    Yes ^
| RWByteAddressBuffer Atomic  |     No       |   Yes ^      |   Yes ^    |     Yes       |    No +

## Half Type

There appears to be a problem writing to a StructuredBuffer containing half on D3D12. D3D12 also appears to have problems doing calculations with half.

In order for half to work in CUDA, NVRTC must be able to include `cuda_fp16.h` and related files. Please read the [CUDA target documentation](cuda-target.md) for more details. 

## u/int8_t Type

Not currently supported in D3D11/D3D12 because not supported in HLSL/DXIL/DXBC. 

Supported in Vulkan via the extensions `GL_EXT_shader_explicit_arithmetic_types` and `GL_EXT_shader_8bit_storage`.

## u/int16_t Type

Requires SM6.2 which requires DXIL and therefore DXC and D3D12. For DXC this is discussed [here](https://github.com/Microsoft/DirectXShaderCompiler/wiki/16-Bit-Scalar-Types).

Supported in Vulkan via the extensions `GL_EXT_shader_explicit_arithmetic_types` and `GL_EXT_shader_16bit_storage`.

## u/int64_t Type

Requires SM6.0 which requires DXIL for D3D12. Therefore not available with DXBC on D3D11 or D3D12.

## int matrix

Means can use matrix types containing integer types. 

## tex.GetDimensions

tex.GetDimensions is the GetDimensions method on 'texture' objects. This is not supported on CUDA as CUDA has no equivalent functionality to get these values. GetDimensions work on Buffer resource types on CUDA.

## SM6.0 Wave Intrinsics

CUDA has premliminary support for Wave Intrinsics, introduced in [PR #1352](https://github.com/shader-slang/slang/pull/1352). Slang synthesizes the 'WaveMask' based on program flow and the implied 'programmer view' of exectution. This support is built on top of WaveMask intrinsics with Wave Intrinsics being replaced with WaveMask Intrinsic calls with Slang generating the code to calculate the appropriate WaveMasks.

Please read [PR #1352](https://github.com/shader-slang/slang/pull/1352) for a better description of the status.

## SM6.5 Wave Intrinsics

SM6.5 Wave Intrinsics are supported, but requires a downstream DXC compiler that supports SM6.5. As it stands the DXC shipping with windows does not. 

## WaveMask Intrinsics

In order to map better to the CUDA sync/mask model Slang supports 'WaveMask' intrinsics. They operate in broadly the same way as the Wave intrinsics, but require the programmer to specify the lanes that are involved. To write code that uses wave intrinsics acrosss targets including CUDA, currently the WaveMask intrinsics must be used. For this to work, the masks passed to the WaveMask functions should exactly match the 'Active lanes' concept that HLSL uses, otherwise the result is undefined. 

The WaveMask intrinsics are not part of HLSL and are only available on Slang.

## WaveShuffle

`WaveShuffle` and `WaveBroadcastLaneAt` are Slang specific intrinsic additions to expand the options available around `WaveReadLaneAt`. 

To be clear this means they will not compile directly on 'standard' HLSL compilers such as `dxc`, but Slang HLSL *output* (which will not contain these intrinsics) can (and typically is) compiled via dxc.

The difference between them can be summarized as follows

* WaveBroadcastLaneAt - laneId must be a compile time constant 
* WaveReadLaneAt - laneId can be dynamic but *MUST* be the same value across the Wave ie 'dynamically uniform' across the Wave
* WaveShuffle - laneId can be truly dynamic (NOTE! That it is not strictly truly available currently on all targets, specifically HLSL)

Other than the different restrictions on laneId they act identically to WaveReadLaneAt.

`WaveBroadcastLaneAt` and `WaveReadLaneAt` will work on all targets that support wave intrinsics, with the only current restriction being that on GLSL targets, only scalars and vectors are supported.

`WaveShuffle` will always work on CUDA/Vulkan. 

On HLSL based targets currently `WaveShuffle` will be converted into `WaveReadLaneAt`. Strictly speaking this means it *requires* the `laneId` to be `dynamically uniform` across the Wave. In practice some hardware supports the loosened usage, and others does not. In the future this may be fixed in Slang and/or HLSL to work across all hardware. For now if you use `WaveShuffle` on HLSL based targets it will be necessary to confirm that `WaveReadLaneAt` has the loosened behavior for all the hardware intended. If target hardware does not support the loosened restrictions it's behavior is undefined. 

## Tesselation

Although tesselation stages should work on D3D11 and D3D12 they are not tested within our test framework, and may have problems. 

## Native Bindless  

Bindless is possible on targets that support it - but is not the default behavior for those targets, and typically require significant effort in Slang code. 

'Native Bindless' targets use a form of 'bindless' for all targets. On CUDA this requires the target to use 'texture object' style binding and for the device to have 'compute capability 3.0' or higher.

## Resource bounds 

For CUDA this is optional as can be controlled via the SLANG_CUDA_BOUNDARY_MODE macro in the `slang-cuda-prelude.h`. By default it's behavior is `cudaBoundaryModeZero`.

## Buffer Bounds

This is the feature when accessing outside of the bounds of a Buffer there is well defined behavior - on read returning all 0s, and on write, the write being ignored.

On CPU there is only bounds checking on debug compilation of C++ code. This will assert if the access is out of range.

On CUDA out of bounds accesses default to element 0 (!). The behavior can be controlled via the SLANG_CUDA_BOUND_CHECK macro in the `slang-cuda-prelude.h`. This behavior may seem a little strange - and it requires a buffer that has at least one member to not do something nasty. It is really a 'least worst' answer to a difficult problem and is better than out of range accesses or worse writes.

## TextureArray.Sample float 

When using 'Sample' on a TextureArray, CUDA treats the array index parameter as an int, even though it is passed as a float.

## Separate Sampler

This feature means that a multiple Samplers can be used with a Texture. In terms of the HLSL code this can be seen as the 'SamplerState' being a parameter passed to the 'Sample' method on a texture object. 

On CUDA the SamplerState is ignored, because on this target a 'texture object' is the Texture and Sampler combination.

## Graphics Pipeline

CPU and CUDA only currently support compute shaders. 

## Ray Tracing DXR 1.0

Vulkan does not support a local root signature, but there is the concept of a 'shader record'. In Slang a single constant buffer can be marked as a shader record with the `[[vk::shader_record]]` attribute, for example:

```
[[vk::shader_record]]
cbuffer ShaderRecord
{
	uint shaderRecordID;
} 
```

In practice to write shader code that works across D3D12 and VK you should have a single constant buffer marked as 'shader record' for VK and then on D3D that constant buffer should be bound in the local root signature on D3D. 

## tex.Load

tex.Load is only supported on CUDA for Texture1D. Additionally CUDA only allows such access for linear memory, meaning the bound texture can also not have mip maps. Load *is* allowed on RWTexture types of other dimensions including 1D on CUDA.

## Full bool

Means fully featured bool support. CUDA has issues around bool because there isn't a vector bool type built in. Currently bool aliases to an int vector type. 

On CPU there are some issues in so far as bool's size is not well defined in size an alignment. Most C++ compilers now use a byte to represent a bool. In the past it has been backed by an int on some compilers. 

## `[unroll]`

The unroll attribute allows for unrolling `for` loops. At the moment the feature is dependent on downstream compiler support which is mixed. In the longer term the intention is for Slang to contain it's own loop unroller - and therefore not be dependent on the feature on downstream compilers. 

On C++ this attribute becomes SLANG_UNROLL which is defined in the prelude. This can be predefined if there is a suitable mechanism, if there isn't a definition SLANG_UNROLL will be an empty definition. 

On GLSL and VK targets loop unrolling uses the [GL_EXT_control_flow_attributes](https://github.com/KhronosGroup/GLSL/blob/master/extensions/ext/GL_EXT_control_flow_attributes.txt) extension.

Slang does have a cross target mechanism to [unroll loops](language-reference/06-statements.md), in the section `Compile-Time For Statement`.

## Atomics on RWBuffer

For VK the GLSL output from Slang seems plausible, but VK binding fails in tests harness.

On CUDA RWBuffer becomes CUsurfObject, which is a 'texture' type and does not support atomics. 

On the CPU atomics are not supported, but will be in the future.

## Sampler Feedback

The HLSL [sampler feedback feature](https://microsoft.github.io/DirectX-Specs/d3d/SamplerFeedback.html) is available for DirectX12. The features requires shader model 6.5 and therefore a version of [DXC](https://github.com/Microsoft/DirectXShaderCompiler) that supports that model or higher. The Shader Model 6.5 requirement also means only DXIL binary format is supported. 

There doesn't not appear to be a similar feature available in Vulkan yet, but when it is available support should be addeed.

For CPU targets there is the IFeedbackTexture interface that requires an implemention for use. Slang does not currently include CPU implementations for texture types.  

## RWByteAddressBuffer Atomic

The additional supported methods on RWByteAddressBuffer are...

```
void RWByteAddressBuffer::InterlockedAddF32(uint byteAddress, float valueToAdd, out float originalValue);
void RWByteAddressBuffer::InterlockedAddF32(uint byteAddress, float valueToAdd);

void RWByteAddressBuffer::InterlockedAddI64(uint byteAddress, int64_t valueToAdd, out int64_t originalValue);
void RWByteAddressBuffer::InterlockedAddI64(uint byteAddress, int64_t valueToAdd);

void RWByteAddressBuffer::InterlockedCompareExchangeU64(uint byteAddress, uint64_t compareValue, uint64_t value, out uint64_t outOriginalValue);

uint64_t RWByteAddressBuffer::InterlockedExchangeU64(uint byteAddress, uint64_t value);

uint64_t RWByteAddressBuffer::InterlockedMaxU64(uint byteAddress, uint64_t value);
uint64_t RWByteAddressBuffer::InterlockedMinU64(uint byteAddress, uint64_t value);

uint64_t RWByteAddressBuffer::InterlockedAndU64(uint byteAddress, uint64_t value);
uint64_t RWByteAddressBuffer::InterlockedOrU64(uint byteAddress, uint64_t value);
uint64_t RWByteAddressBuffer::InterlockedXorU64(uint byteAddress, uint64_t value);
```

On HLSL based targets this functionality is achieved using [NVAPI](https://developer.nvidia.com/nvapi). Support for NVAPI is described
in the separate [NVAPI Support](nvapi-support.md) document.  

On Vulkan, for float the [`GL_EXT_shader_atomic_float`](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_EXT_shader_atomic_float.html) extension is required. For int64 the [`GL_EXT_shader_atomic_int64`](https://raw.githubusercontent.com/KhronosGroup/GLSL/master/extensions/ext/GL_EXT_shader_atomic_int64.txt) extension is required.

CUDA requires SM6.0 or higher for int64 support.