Tuesday, August 4, 2009

Technology: DirectX Compute Support on NVIDIA’s CUDA Architecture GPUs

  • Microsoft’s DirectX Compute is a new GPU computing API that runs on NVIDIA’s current CUDA architecture under both Windows Vista and Windows 7. DirectX Compute is supported on current DX10-class GPUs and future DX11 GPUs. It allows developers to harness the massively parallel computing power of NVIDIA GPUs to create compelling computing applications in consumer and professional markets.

  • Compute Shader
    The Compute Shader is an additional stage independent of the Direct3D 11 pipeline that
    enables general purpose computing on the GPU.

    In addition to all shader features provided by the unified shader core, the Compute Shader also supports scattered reads and writes to resources through Unordered Access Views, a shared memory pool within a group of executing threads, synchronization primitives, atomic operators, and many other advanced data-parallel features. A variant of the Direct3D 11 Compute Shader has been enabled that can operate on Direct3D 10-class hardware. It is therefore possible to develop Compute Shaders on actual hardware, but an updated driver is required. The full functionality of the Direct3D 11 Compute Shader is intended for Direct3D 11-class hardware, so in order to evaluate the full functionality, developers will need to use the Reference Rasterizer until such hardware is available.
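
    To make "scattered writes" and "atomic operators" concrete, here is a minimal CPU-side sketch in C++. It is an analogy only, not Direct3D or HLSL code: each std::thread stands in for a GPU thread, the array of atomics stands in for a buffer bound through an Unordered Access View, and the function name histogram4 is hypothetical.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// CPU analogy (assumption): threads scatter increments into shared bins.
// The bin a thread writes to depends on the data, not on the thread index,
// which is why the increments have to be atomic.
std::array<std::uint32_t, 4> histogram4(const std::vector<std::uint8_t>& data,
                                        unsigned threadCount) {
    assert(threadCount > 0);
    std::array<std::atomic<std::uint32_t>, 4> bins;
    for (auto& b : bins) b.store(0);

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < threadCount; ++t) {
        threads.emplace_back([&data, &bins, t, threadCount] {
            // Thread t handles elements t, t + threadCount, t + 2 * threadCount, ...
            for (std::size_t i = t; i < data.size(); i += threadCount) {
                bins[data[i] & 3u].fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : threads) th.join();

    // Copy the counters out once every thread has finished.
    std::array<std::uint32_t, 4> result{};
    for (std::size_t b = 0; b < 4; ++b) result[b] = bins[b].load();
    return result;
}
```

    Without atomic increments, two threads hitting the same bin at the same time could lose an update.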


  • Multithreaded Rendering
    The key API difference between Direct3D 10 and Direct3D 11 is the addition of deferred contexts, which enable scalable execution of Direct3D commands distributed over multiple cores. A deferred context captures and assembles actions such as state changes and draw submissions so that they can be executed on the actual device at a later time. By using deferred contexts on multiple threads, an application can distribute the CPU overhead of the Direct3D 11 runtime and the driver across multiple cores, making better use of an end user's machine configuration. This feature is available on current Direct3D 10 hardware as well as the reference rasterizer.
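
    The record-now, execute-later idea can be sketched on the CPU in plain C++. This is a rough analogy under stated assumptions, not the Direct3D 11 API: Device, DeferredContext, CommandList and renderFrame are hypothetical stand-ins. In Direct3D 11 itself the corresponding steps are ID3D11Device::CreateDeferredContext, ID3D11DeviceContext::FinishCommandList and ID3D11DeviceContext::ExecuteCommandList.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; these are not Direct3D types.
struct Device { std::vector<std::string> log; };  // records what was executed
using CommandList = std::vector<std::function<void(Device&)>>;

class DeferredContext {
public:
    void setState(const std::string& name) { record("state:" + name); }
    void draw(int object) { record("draw:" + std::to_string(object)); }
    // Hand the recorded commands over; nothing has touched the device yet.
    CommandList finish() { return std::move(commands_); }
private:
    void record(std::string entry) {
        commands_.push_back([e = std::move(entry)](Device& d) { d.log.push_back(e); });
    }
    CommandList commands_;
};

// Each worker thread records into its own context; the calling thread then
// replays the finished lists on the single device in a fixed order.
Device renderFrame(int workers, int drawsPerWorker) {
    assert(workers > 0 && drawsPerWorker >= 0);
    std::vector<CommandList> lists(workers);
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&lists, w, drawsPerWorker] {
            DeferredContext ctx;  // private to this thread, so no locking is needed
            ctx.setState("pass" + std::to_string(w));
            for (int i = 0; i < drawsPerWorker; ++i) ctx.draw(w * drawsPerWorker + i);
            lists[w] = ctx.finish();
        });
    }
    for (auto& t : pool) t.join();

    Device device;
    for (auto& list : lists)
        for (auto& cmd : list) cmd(device);
    return device;
}
```

    Recording happens in parallel, but the replay order is chosen by the calling thread, so the result is deterministic regardless of how the workers were scheduled.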

  • Dynamic Shader Linkage
    To address the combinatorial explosion seen when specializing shaders for performance, Direct3D 11 introduces a limited form of runtime shader linkage that allows near-optimal shader specialization while an application is running. This is achieved by specifying the implementations of specific functions in shader code when the shader is assigned to the pipeline, which lets the driver inline native shader instructions quickly rather than recompile the intermediate language into native instructions for the new configuration. The feature is exposed to shader developers through the introduction of classes and interfaces in HLSL.
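
    A loose C++ analogy may help picture this: one shader body written against an interface, with the concrete implementation chosen at bind time. The names ILight, AmbientLight, DirectionalLight and Shader are hypothetical. Note that the virtual call used here is only a stand-in; as described above, the driver inlines the selected implementation rather than dispatching indirectly.

```cpp
#include <cassert>

// Plays the role of an HLSL interface declared in shader code (assumption).
struct ILight {
    virtual ~ILight() = default;
    virtual float shade(float nDotL) const = 0;
};

// Two interchangeable implementations, comparable to HLSL classes.
struct AmbientLight : ILight {
    float shade(float) const override { return 0.25f; }
};
struct DirectionalLight : ILight {
    float shade(float nDotL) const override { return nDotL > 0.0f ? nDotL : 0.0f; }
};

// One "compiled shader" whose body calls through an interface slot.
class Shader {
public:
    // The implementation is picked when the shader is bound, not when it is compiled.
    void bind(const ILight& light) { light_ = &light; }
    float run(float nDotL) const {
        assert(light_ != nullptr);  // a slot must be filled before running
        return light_->shade(nDotL);
    }
private:
    const ILight* light_ = nullptr;
};
```

    With N interface slots and M implementations each, a single shader replaces up to M^N precompiled variants, which is the combinatorial explosion the feature targets.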


  • DirectX Compute Shader
    • New shader type supported in D3D11
    • Designed for general purpose processing
    • Doesn’t require a separate API; it is integrated with D3D
    • Shares memory resources with graphics shaders
    • Thread invocation is decoupled from input or output domains
    • Single thread can process one or many data elements
    • Can share data between threads
    • Supports random access memory writes
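
    The points about decoupled thread invocation and one thread handling one or many elements can be sketched as a sequential CPU simulation in C++. This is an illustration under assumptions, not Direct3D code: groupCount and dispatchDouble are hypothetical helpers, and the nested loops stand in for threads that a GPU would run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of thread groups needed to cover n elements when each group has
// threadsPerGroup threads and each thread handles elemsPerThread elements.
std::size_t groupCount(std::size_t n, std::size_t threadsPerGroup,
                       std::size_t elemsPerThread) {
    assert(threadsPerGroup > 0 && elemsPerThread > 0);
    const std::size_t perGroup = threadsPerGroup * elemsPerThread;
    return (n + perGroup - 1) / perGroup;  // round up
}

// Simulated 1D dispatch: every (group, thread) pair doubles its slice of the input.
std::vector<int> dispatchDouble(const std::vector<int>& in,
                                std::size_t threadsPerGroup,
                                std::size_t elemsPerThread) {
    std::vector<int> out(in.size(), 0);
    const std::size_t groups = groupCount(in.size(), threadsPerGroup, elemsPerThread);
    for (std::size_t g = 0; g < groups; ++g) {
        for (std::size_t t = 0; t < threadsPerGroup; ++t) {
            const std::size_t globalThread = g * threadsPerGroup + t;
            for (std::size_t e = 0; e < elemsPerThread; ++e) {
                const std::size_t i = globalThread * elemsPerThread + e;
                if (i < in.size()) out[i] = 2 * in[i];  // skip padding threads
            }
        }
    }
    return out;
}
```

    The number of threads launched is chosen by the caller rather than derived from a vertex or pixel count, so the same data can be covered by many threads doing one element each or fewer threads doing several.
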
  • Compute Shaders on D3D10 Hardware
    • Subset of the D3D11 compute shader functionality that runs on current D3D10.x hardware
    • Drivers available now from NVIDIA and AMD
    • You can start experimenting with compute shaders today.
  • Compute Shader 4.0
    • New shader models: CS4.0/CS4.1
  • What’s Missing in CS4.0 Compared to CS5.0?
    • Atomic operations
    • Append/consume
    • Typed UAV (unordered access view)
    • Double precision
    • DispatchIndirect()
    • Only a single output UAV allowed (Not a huge restriction in practice)
    • Thread group grid dimensions limited to 65535
    • Thread group size is restricted to a maximum of 768 threads total (1024 on D3D11 hardware)
    • Thread group shared memory is restricted to 16KB total (32KB on D3D11 hardware)

      Still a lot you can do.
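
      The thread-count and shared-memory limits listed above can be turned into a small check. This is a sketch in C++ using only the figures quoted in the list; the GroupLimits struct, the two constants and the fits function are hypothetical names, not part of any Direct3D header.

```cpp
#include <cassert>
#include <cstddef>

// Per-group limits as quoted in the list above (assumed names).
struct GroupLimits {
    std::size_t maxThreadsPerGroup;
    std::size_t maxSharedBytes;
};

const GroupLimits kD3D10Class{768, 16 * 1024};   // CS4.x
const GroupLimits kD3D11Class{1024, 32 * 1024};  // CS5.0

// True if an x * y * z thread group using sharedBytes of thread group
// shared memory stays within the given limits.
bool fits(const GroupLimits& lim, std::size_t x, std::size_t y, std::size_t z,
          std::size_t sharedBytes) {
    assert(x > 0 && y > 0 && z > 0);
    return x * y * z <= lim.maxThreadsPerGroup && sharedBytes <= lim.maxSharedBytes;
}
```

      For example, a 16 x 16 group fits both hardware classes, while a 32 x 32 group (1024 threads) only fits the D3D11-class limit.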

  • So What Does CS4.x Give Me?
    • Scattered writes
    • Thread Group Shared Memory
      - Allows sharing data between threads
      - Much faster than texture or buffer reads, saves bandwidth
      - Fast reductions, prefix sum (scan)
    • Efficient interoperability with D3D graphics
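
    The "fast reductions" point can be illustrated with a sequential CPU simulation in C++ of a tree reduction inside one thread group. This is a sketch under assumptions, not shader code: groupSums is a hypothetical name, the vector named shared stands in for thread group shared memory (declared with groupshared in HLSL), and the comments mark where a real compute shader would need a barrier such as GroupMemoryBarrierWithGroupSync().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns one partial sum per thread group. groupSize must be a power of two.
std::vector<int> groupSums(const std::vector<int>& in, std::size_t groupSize) {
    assert(groupSize > 0 && (groupSize & (groupSize - 1)) == 0);
    const std::size_t groups = (in.size() + groupSize - 1) / groupSize;
    std::vector<int> partial(groups, 0);
    std::vector<int> shared(groupSize);  // stand-in for group shared memory

    for (std::size_t g = 0; g < groups; ++g) {
        // Each "thread" t loads one element; threads past the end load 0.
        for (std::size_t t = 0; t < groupSize; ++t) {
            const std::size_t i = g * groupSize + t;
            shared[t] = i < in.size() ? in[i] : 0;
        }
        // (barrier on a GPU: all loads must finish before the reduction)

        // Halve the number of active threads at every step.
        for (std::size_t stride = groupSize / 2; stride > 0; stride /= 2) {
            for (std::size_t t = 0; t < stride; ++t) {
                shared[t] += shared[t + stride];
            }
            // (barrier on a GPU: finish this step before starting the next)
        }
        partial[g] = shared[0];  // thread 0 writes the group's result
    }
    return partial;
}
```

    Each input element is read from main memory once; the remaining log2(groupSize) steps touch only the shared array, which is where the bandwidth saving comes from.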

