Content - 413a713879a8d6657996a70747979fd59b6c40f4 - 1ed0695/docs/cpu-target.md

visit type:
Tip revision: bc0d0f9e2a5a7297d8559a2ee1e3bd59454813cf authored by Tim Foley on 27 August 2020, 17:46:50 UTC
Clean up the way that lookup "through" a base type is encoded (#1519)
Tip revision: bc0d0f9
cpu-target.md
Slang CPU Target Support
========================

Slang has preliminary support for producing CPU source and binaries. 

# Features

* Can compile C/C++/Slang to binaries (executables and or shared libraries)
* Can compile Slang source into C++ source code
* Supports compute style shaders 
* C/C++ backend abstracts the command line options, and parses the compiler errors/out such that all supported compilers output available in same format 
* Once compilation is complete can optionally access and run CPU code directly

# Limitations

These limitations apply to Slang transpiling to C++. 

* Barriers are not supported (making these work would require an ABI change)
* Atomics are not supported
* Complex resource types (such as Texture2d) are work in progress
* Out of bounds access to resources has undefined behavior 

For current C++ source output, the compiler needs to support partial specialization. 

# How it works

The initial version works by adding 'back end' compiler support for C/C++ compilers. Currently this is tested to work with Visual Studio, Clang and G++/Gcc on Windows and Linux. The C/C++ backend can be directly accessed much like 'dxc', 'fxc' of 'glslang' can, using the pass-through mechanism with the following new backends... 

```
SLANG_PASS_THROUGH_CLANG,                   ///< Clang C/C++ compiler 
SLANG_PASS_THROUGH_VISUAL_STUDIO,           ///< Visual studio C/C++ compiler
SLANG_PASS_THROUGH_GCC,                     ///< GCC C/C++ compiler
SLANG_PASS_THROUGH_GENERIC_C_CPP,           ///< Generic C or C++ compiler, which is decided by the source type
```

Sometimes it is not important which C/C++ compiler is used, and this can be specified via the 'Generic C/C++' option. This will aim to use the compiler that is most likely binary compatible with the compiler that was used to build the Slang binary being used. 

To make it possible for Slang to produce CPU code, we now need a mechanism to convert Slang code into C/C++. The first iteration only supports C++ generation. If source is desired instead of a binary this can be specified via the SlangCompileTarget. These can be specified on the `slangc` command line as `-target c` or `-target cpp`.

Note that when using the 'pass through' mode for a CPU based target it is currently necessary to set an entry point, even though it's basically ignored. 

In the API the `SlangCompileTarget`s are 

```
SLANG_C_SOURCE             ///< The C language
SLANG_CPP_SOURCE           ///< The C++ language
```        
   
If a CPU binary is required this can be specified as a `SlangCompileTarget` of 
   
```   
SLANG_EXECUTABLE           ///< Executable (for hosting CPU/OS)
SLANG_SHARED_LIBRARY       ///< A shared library/Dll (for hosting CPU/OS)
SLANG_HOST_CALLABLE        ///< A CPU target that makes the compiled code available to be run immediately
```

These can also be specified on the Slang command line as `-target exe` and `-target dll` or `-target sharedlib`. `-target callable` or `-target host-callable` is also possible, but is typically not very useful from the command line, other than to test such code can be loaded for host execution.

In order to be able to use the Slang code on CPU, there needs to be binding via values passed to a function that the C/C++ code will produce and export. How this works is described in the ABI section. 

If a binary target is requested, the binary contents will be returned in a ISlangBlob just like for other targets. To use the CPU binary typically it must be saved as file and then potentially marked for execution by the OS before executing. It may be possible to load shared libraries or dlls from memory - but is a non standard feature, that requires unusual work arounds. 

Under the covers when Slang is used to generate a binary via a C/C++ compiler, it must do so through the file system. Currently this means that the source (say generated by Slang) and the binary (produced by the C/C++ compiler) must all be files. To make this work Slang uses temporary files. The reasoning for hiding this mechanism - and for example not return filenames, is so that in the future when binaries are produced directly (for example with LLVM), nothing will need to change.  

Executing CPU Code
==================

In typical Slang operation when code is compiled it produces either source or a binary that can then be loaded by another API such as a rendering API. With CPU code the binary produced could be saved to a file and then executed as an exe or a shared library/dll. In practice though it is common to want to be able to execute compiled code immediately. Having to save off to a file and then load again can be awkward. It is also not necessarily the case that code needs to be saved to a file to be executed. 

To handle being able call code directly, code can be compiled using the `SLANG_HOST_CALLABLE` code target type. To access the code that has been produced use the function

```
    SLANG_API SlangResult spGetEntryPointHostCallable(
        SlangCompileRequest*    request,
        int                     entryPointIndex,
        int                     targetIndex,
        ISlangSharedLibrary**   outSharedLibrary);
```        

This outputs a `ISlangSharedLibrary` which whilst in scope, any contained functions remain available (even if the request or session go out of scope). The contained functions can then be accessed via the `findFuncByName` method on the `ISlangSharedLibrary` interface. Finding the entry point names, can be achieved using reflection, if not directly known to the client.

The returned function pointer should be cast to the appropriate function signature before calling. For entry points - the function will appear under the same name as the entry point name. See the ABI section for what is the appropriate signature for entry points. 

For pass through compilation of C/C++ this mechanism allows any functions marked for export to be directly queried. 

For a complete example on how to execute CPU code using `spGetEntryPointHostCallable` look at code in `example/cpu-hello-world`. 

ABI
===

Say we have some Slang source like the following:

```
struct Thing { int a; int b; }

Texture2D<float> tex;
SamplerState sampler;
RWStructuredBuffer<int> outputBuffer;        
ConstantBuffer<Thing> thing3;        
        
[numthreads(4, 1, 1)]
void computeMain(
    uint3 dispatchThreadID : SV_DispatchThreadID, 
    uniform Thing thing, 
    uniform Thing thing2)
{
   // ...
}
```

When compiled into a shared library/dll - how is it invoked? An entry point in the slang source code produces several exported functions. The 'default' exported function has the same name as the entry point in the original source. It has the signature  

```
void computeMain(ComputeVaryingInput* varyingInput, UniformEntryPointParams* uniformParams, UniformState* uniformState);
```

ComputeVaryingInput is defined in the prelude as 

```
struct ComputeVaryingInput
{
    uint3 startGroupID;
    uint3 endGroupID;
};
```

`ComputeVaryingInput` allows specifying a range of groupIDs to execute - all the ids in a grid from startGroup to endGroup, but not including the endGroupIDs. Most compute APIs allow specifying an x,y,z extent on 'dispatch'. This would be equivalent as having startGroupID = { 0, 0, 0} and endGroupID = { x, y, z }. The exported function allows setting a range of groupIDs such that client code could dispatch different parts of the work to different cores. This group range mechanism was chosen as the 'default' mechanism as it is most likely to achieve the best performance.

There are two other functions that consist of the entry point name postfixed with `_Thread` and `_Group`. For the entry point 'computeMain' these functions would be accessable from the shared library interface as `computeMain_Group` and `computeMain_Thread`. `_Group` has the same signature as the listed for computeMain, but it doesn't execute a range, only the single group specified by startGroupID (endGroupID is ignored). That is all of the threads within the group (as specified by `[numthreads]`) will be executed in a single call. 

It may be desirable to have even finer control of how execution takes place down to the level of individual 'thread's and this can be achieved with the `_Thread` style. The signiture looks as follows

```
struct ComputeThreadVaryingInput
{
    uint3 groupID;
    uint3 groupThreadID;
};

void computeMain_Thread(ComputeThreadVaryingInput* varyingInput, UniformEntryPointParams* uniformParams, UniformState* uniformState);
```

When invoking the kernel at the `thread` level it is a question of updating the groupID/groupThreadID, to specify which thread of the computation to execute. For the example above we have `[numthreads(4, 1, 1)]`. This means groupThreadID.x can vary from 0-3 and .y and .z must be 0. That groupID.x indicates which 'group of 4' to execute. So groupID.x = 1, with groupThreadID.x=0,1,2,3 runs the 4th, 5th, 6th and 7th 'thread'. Being able to invoke each thread in this way is flexible - in that any specific thread can specified and executed. It is not necessarily very efficient because there is the call overhead and a small amount of extra work that is performed inside the kernel. 

Note that the `_Thread` style signature is likely to change to support 'groupshared' variables in the near future.

In terms of performance the 'default' function is probably the most efficient for most common usages. The `_Group` style allows for slightly less loop overhead, but with many invocations this will likely be drowned out by the extra call/setup overhead. The `_Thread` style in most situations will be the slowest, with even more call overhead, and less options for the C/C++ compiler to use faster paths. 

The UniformState and UniformEntryPointParams struct typically vary by shader. UniformState holds 'normal' bindings, whereas UniformEntryPointParams hold the uniform entry point parameters. Where specific bindings or parameters are located can be determined by reflection. The structures for the example above would be something like the following... 

```
struct UniformEntryPointParams
{
    Thing thing;
    Thing thing2;
};

struct UniformState
{
    Texture2D<float > tex;
    SamplerState sampler;
    RWStructuredBuffer<int32_t> outputBuffer;
    Thing* thing3;
};   
```

Notice that of the entry point parameters `dispatchThreadID` is not part of UniformEntryPointParams and this is because it is not uniform.

`ConstantBuffer` and `ParameterBlock` will become pointers to the type they hold (as `thing3` is in the above structure).
 
`StructuredBuffer<T>`,`RWStructuredBuffer<T>` become

```
    T* data;
    size_t count;
```    

`ByteAddressBuffer`, `RWByteAddressBuffer` become 

```
    uint32_t* data;
    size_t sizeInBytes;
```    


Resource types become pointers to interfaces that implement their features. For example `Texture2D` become a pointer to a `ITexture2D` interface that has to be implemented in client side code. Similarly SamplerState and SamplerComparisonState become `ISamplerState` and `ISamplerComparisonState`.  
 
The actual definitions for the interfaces for resource types, and types are specified in 'slang-cpp-types.h' in the `prelude` directory.

## Unsized arrays

Unsized arrays can be used, which are indicated by an array with no size as in `[]`. For example 

```
    RWStructuredBuffer<int> arrayOfArrays[];
```

With normal 'sized' arrays, the elements are just stored contiguously within wherever they are defined. With an unsized array they map to `Array<T>` which is...

```
    T* data;
    size_t count;
```    

Note that there is no method in the shader source to get the `count`, even though on the CPU target it is stored and easily available. This is because of the behavior on GPU targets 

* That the count has to be stored elsewhere (unlike with CPU) 
* On some GPU targets there is no bounds checking - accessing outside the bound values can cause *undefined behavior*
* The elements may be laid out *contiguously* on GPU

In practice this means if you want to access the `count` in shader code it will need to be passed by another mechanism - such as within a constant buffer. It is possible in the future support may be added to allow direct access of `count` work across targets transparently. 

It is perhaps worth noting that the CPU allows us to have an indirection (a pointer to the unsized arrays contents) which has the potential for more flexibility than is possible on GPU targets. GPU target typically require the elements to be placed 'contiguously' from their location in their `container` - be that registers or in memory. This means on GPU targets there may be other restrictions on where unsized arrays can be placed in a structure for example, such as only at the end. If code needs to work across targets this means these restrictions will need to be followed across targets. 

## Prelude

For C++ targets, the code to support the code generated by Slang must be defined within the 'prelude'. The prelude is inserted text placed before the generated C++ source code. For the Slang command line tools as well as the test infrastructure, the prelude functionality is achieved through a `#include` in the prelude text of the `prelude/slang-cpp-prelude.h` specified with an absolute path. Doing so means other files the `slang-cpp-prelude.h` might need can be specified relatively, and include paths for the backend C/C++ compiler do not need to be modified. 

The prelude needs to define 

* 'Built in' types (vector, matrix, 'object'-like Texture, SamplerState etc) 
* Scalar intrinsic function implementations
* Compiler based definations/tweaks 

For the Slang prelude this is split into the following files...

* 'prelude/slang-cpp-prelude.h' - Header that includes all the other requirements & some compiler tweaks
* 'prelude/slang-cpp-scalar-intrinsics.h' - Scalar intrinsic implementations
* 'prelude/slang-cpp-types.h' - The 'built in types' 
* 'slang.h' - Slang header is used for majority of compiler based definitions

For a client application - as long as the requirements of the generated code are met, the prelude can be implemented by whatever mechanism is appropriate for the client. For example the implementation could be replaced with another implementation, or the prelude could contain all of the required text for compilation. Setting the prelude text can be achieved with the method on the global session...

```
/** Set the 'prelude' for generated code for a 'downstream compiler'.
@param passThrough The downstream compiler for generated code that will have the prelude applied to it. 
@param preludeText The text added pre-pended verbatim before the generated source

That for pass-through usage, prelude is not pre-pended, preludes are for code generation only. 
*/
virtual SLANG_NO_THROW void SLANG_MCALL setDownstreamCompilerPrelude(
SlangPassThrough passThrough,
const char* preludeText) = 0;
```

It may be useful to be able to include `slang-cpp-types.h` in C++ code to access the types that are used in the generated code. This introduces a problem in that the types used in the generated code might clash with types in client code. To work around this problem, you can wrap all of the types defined in the prelude with a namespace of your choosing. For example 

```
#define SLANG_PRELUDE_NAMESPACE CPPPrelude
#include "../../prelude/slang-cpp-types.h"
``` 

Would wrap all the Slang prelude types in the namespace `CPPPrelude`, such that say a `StructuredBuffer<int32_t>` could be specified in C++ source code as `CPPPrelude::StructuredBuffer<int32_t>`.

The code that sets up the prelude for the test infrastucture and command line usage can be found in ```TestToolUtil::setSessionDefaultPrelude```. Essentially this determines what the absolute path is to `slang-cpp-prelude.h` is and then just makes the prelude `#include "the absolute path"`.

Language aspects
================

# Arrays passed by Value

Slang follows the HLSL convention that arrays are passed by value. This is in contrast the C/C++ where arrays are passed by reference. To make generated C/C++ follow this convention an array is turned into a 'FixedArray' struct type. Sinces classes by default in C/C++ are passed by reference the wrapped array is also. 

To get something more similar to C/C++ operation the array can be marked in out or inout to make it passed by reference. 

Limitations
===========

# Out of bounds access

In HLSL code if an access is made out of bounds of a StructuredBuffer, execution proceceeds. If an out of bounds read is performed, a zeroed value is returned. If an out of bounds write is performed it's effectively a noop, as the value is discarded. 

On the CPU target this behavior is *NOT* supported. For a debug CPU build an out of bounds access will assert, for a release build the behaviour is undefined. 

The reason for this is that such an access is difficult and/or slow to implement on the CPU. The underlying reason is that `operator[]` typically returns a reference to the contained value. If this is out of bounds - it's not clear what to return, in particular because the value may be read or written and moreover elements of the type might be written. In practice this means a global zeroed value cannot be returned. 

This could be somewhat supported if code gen worked as followed for say

```
RWStructuredBuffer<float4> values;
values[3].x = 10;
```

Produces

```
template <typename T>
struct RWStructuredBuffer
{
    T& at(size_t index, T& defValue) { return index < size ? values[index] : defValue; } 

    T* values;
    size_t size;
};

RWStructuredBuffer<float4> values;

// ...
Vector<float, 3> defValue = {};         // Zero initialize such that read access returns default values
values.at(3).x = 10;
```

Note that [] would be turned into the `at` function, which takes the default value as a paramter provided by the caller. If this is then written to then only the defValue is corrupted.  Even this mechanism not be quite right, because if we write and then read again from the out of bounds reference in HLSL we may expect that 0 is returned, whereas here we get the value that was last written.

TODO
====

# Main

* groupshared is not yet supported
* Complete support (in terms of interfaces) for 'complex' resource types - such as Texture
* Output of header files 
* Output multiple entry points

# Internal Slang compiler features

These issues are more internal Slang features/improvements 

* Currently only generates C++ code, it would be fairly straight forward to support C (especially if we have 'intrinsic definitions')
* Have 'intrinsic definitions' in standard library - such that they can be generated where appropriate 
  + This will simplify the C/C++ code generation as means Slang language will generate must of the appropriate code
* Currently 'construct' IR inst is supported as is, we may want to split out to separate instructions for specific scenarios
* Refactoring around swizzle. Currently in emit it has to check for a variety of scenarios - could be simplified with an IR pass and perhaps more specific instructions.
Browse the archive

https://github.com/shader-slang/slang