Technology as a way of life

Life is the art of drawing without an eraser...

"What is DirectX?"

Microsoft DirectX is a collection of application programming interfaces (APIs) for handling tasks related to multimedia, especially game programming and video, on Microsoft platforms. Originally, the names of these APIs all began with Direct, such as Direct3D, DirectDraw, DirectMusic, DirectPlay, DirectSound, and so forth. DirectX, then, was the generic term for all of these APIs and became the name of the collection. After the introduction of the Xbox, Microsoft has also released multiplatform game development APIs such as XInput, which are designed to supplement or replace individual DirectX components.

Direct3D (the 3D graphics API within DirectX) is widely used in the development of video games for Microsoft Windows, Microsoft Xbox, and Microsoft Xbox 360. Direct3D is also used by other software applications for visualization and graphics tasks such as CAD/CAM engineering. As Direct3D is the most widely publicized component of DirectX, it is common to see the names "DirectX" and "Direct3D" used interchangeably.

The DirectX software development kit (SDK) consists of runtime libraries in redistributable binary form, along with accompanying documentation and headers for use in coding. Originally, the runtimes were only installed by games or explicitly by the user. Windows 95 did not launch with DirectX, but DirectX was included with Windows 95 OEM Service Release 2.[1] Windows 98 and Windows NT 4.0 both shipped with DirectX, as has every version of Windows released since. The SDK is available as a free download. While the runtimes are proprietary, closed-source software, source code is provided for most of the SDK samples.

The latest versions of Direct3D, namely, Direct3D 10 and Direct3D 9Ex, are only officially available for Windows Vista, because each of these new versions was built to depend upon the new Windows Display Driver Model that was introduced for Windows Vista. The new Vista/WDDM graphics architecture includes a new video memory manager that supports virtualizing graphics hardware to multiple applications and services such as the Desktop Window Manager

Several components are needed in DirectX

DirectX 10 was introduced with Windows Vista exclusively because previous versions of Windows such as Windows XP are not able to officially run DirectX 10-exclusive applications.

DirectX 10.1 is an incremental update of DirectX 10 which is shipped with, and requires, Windows Vista Service Pack 1.[8] This release mainly sets a few more image quality standards for graphics vendors, while giving developers more control over image quality. It also adds support for parallel cube mapping and requires that the video card supports Shader Model 4.1 or higher and 32-bit floating-point operations. Direct3D 10.1 still fully supports Direct3D 10 hardware, but in order to utilize all of the new features, updated hardware is required.

DirectX 11 was unveiled at the Gamefest 08 event in Seattle, with the major scheduled features including GPGPU support, tessellation[11][12] support, and improved multi-threading support to assist video game developers in developing games that better utilize multi-core processors. Direct3D 11 will run on Windows Vista and its successor Windows 7. Parts of the new API such as multi-threaded resource handling can be supported on Direct3D 9/10/10.1-class hardware. Hardware tessellation and Shader Model 5.0 will require Direct3D 11 supporting hardware.

This is the DirectX 10 pipeline

And this is DirectX 11

Many of the enhancements mean higher performance for features already available in DX10 but less used. Tessellation (made up of the hull shader, tessellator and domain shader) and the Compute Shader are major developments that could close the gap between reality and unreality.

Along with the pipeline changes, we see a whole host of new tweaks and adjustments. DirectX 11 is actually a strict superset of DirectX 10.1, meaning that all of those features are completely encapsulated in and unchanged by DirectX 11. This simple fact means that all DX11 hardware will include the changes required to be DX 10.1 compliant and in addition to these tweaks, there are also new extensions:

While changes in the pipeline allow developers to write programs to accomplish different types of tasks, these more subtle changes allow those programs to be more complex, higher quality, and/or higher performance.

DX11 And The Multi-Threaded Game Engine

In spite of the fact that multi-threaded programming has been around for decades, mainstream programmers didn't start focusing on parallel programming until multi-core CPUs started coming along. Much general purpose code is straightforward as a single thread; extracting performance via parallel programming can be difficult and isn't always obvious. Even with talented programmers, Amdahl's Law is a bitch: your speed up from parallelization is limited by the percent of code that is necessarily sequential.

No matter what anyone does, some stuff in the renderer will need to be sequential. Programs, textures, and resources must be loaded up; geometry happens before pixel processing; draw calls intended to be executed while a certain state is active must have that state set first and not changed until completion. Even in such a massively parallel machine, order must be maintained for many things. But order doesn't always matter.

Making more things thread-safe through an extended device interface using multiple contexts and making a lot of synchronization overhead the responsibility of the API and/or graphics driver, Microsoft has enabled game developers to more easily and effortlessly thread not only their rendering code, but their game code as well. These things will also work on DX10 hardware running on a system with DX11, though some missing hardware optimizations will reduce the performance benefit. But the fundamental ability to write code differently will go a long way to getting programmers more used to and better at parallelization. Let's take a look at the tools available to accomplish this in DX11.

First up is free threaded asynchronous resource loading. That's a bit of a mouthful, but this feature gives developers the ability to upload programs, textures, state objects, and all resources in a thread-safe way and, if desired, concurrent with the rendering process. This doesn't mean that all this stuff will get pushed up in parallel with rendering, as the driver will manage what gets sent to the GPU and when based on priority, but it does mean the developer no longer has to think about synchronizing or manually prioritizing resource loading. Multiple threads can start loading whatever resources they need whenever they need them. The fact that this can also be done concurrently with rendering could improve performance for games that stream in data for massive open worlds in addition to enabling multi-threaded opportunities.

In order to enable this and other threading, the D3D device interface is now split into three separate interfaces: the Device, the Immediate Context, and the Deferred Context. Resource creation is done through the Device. The Immediate Context is the interface for setting device state, draw calls, and queries. There can only be one Device and one Immediate Context. The Deferred Context is another interface for state and draw calls, but many can exist in one program and can be used as the per-thread interface (Deferred Contexts themselves are thread unsafe though). Deferred Contexts and the free threaded resource creation through the device are where DX11 gets it multi-threaded benefit.

Multiple threads submit state and draw calls to their Deferred Context which complies a display list that is eventually executed by the Immediate Context. Games will still need a render thread, and this thread will use the Immediate Context to execute state and draw calls and to consume the display lists generated by Deferred Contexts. In this way, the ultimate destination of all state and draw calls is the Immediate Context, but fine grained synchronization is handled by the API and the display driver so that parallel threads can be better used to contribute to the rendering process. Some limitations on Deferred Contexts include the fact that they cannot query the device and they can't download or read back anything from the GPU. Deferred Contexts can, however, consume the display lists generated by other Deferred Contexts.

The end result of all this is that the future will be more parallel friendly. As two and four core CPUs become more and more popular and 8 and 16 (logical) core CPUs are on the horizon, we need all the help we can get when trying to extract performance from parallelism. This is a good move for DirectX and we hope it will help push game engines to more fully utilize more than two or even four cores when the time comes

The DX11 Compute Shader and OpenCL/OpenGL

Enter DirectX11 and the CS. Developers have the option to pass data structures over to the Compute Shader and run more general purpose algorithms on them. The Compute Shader, like the other fully programmable stages of the DX10 and DX11 pipeline, will share a single set of physical resources (shader processors).

This hardware will need to be a little more flexible than it currently is as when it runs CS code it will have to support random reads and writes and irregular arrays (rather than simple streams or fixed size 2D arrays), multiple outputs, direct invocation of individual or groups of threads as per the programmer's needs, 32k of shared register space and thread group management, atomic instructions, synchronization constructs, and the ability to perform unordered IO operations.

At the same time, the CS loses some features as well. As each thread is no longer treated as a pixel, so the association with geometry is lost (unless specifically passed in a data structure). This means that, although CS programs can still use texture samplers, automatic trilinear LOD calculations are not automatic (LOD must be specified). Additionally, depth culling, anti-aliasing, alpha blending, and other operations that have no meaning to generic data cannot be performed inside a CS program.

The type of new applications opened up by the CS are actually infinite, but the most immediate interest will come from game developers looking to augment their graphics engines with fancy techniques not possible in the Pixel Shader. Some of these applications include A-Buffer techniques to allow very high quality anti-aliasing and order independent transparency, more advanced deferred shading techniques, advanced post processing effects and convolution, FFTs (fast Fourier transforms) for frequency domain operations, and summed area tables.

Beyond the rendering specific applications, game developers may wish to do things like IK (inverse kinematics), physics, AI, and other traditionally CPU specific tasks on the GPU. Having this data on the GPU by performing calculations in the CS means that the data is more quickly available for use in rendering and some algorithms may be much faster on the GPU as well. It might even be an option to run things like AI or physics on both the GPU and the CPU if algorithms that always yield the same result on both types of processors can be found (which would essentially substitute compute power for bandwidth).

Even though the code will run on the same hardware, PS and CS code will perform very differently based on the algorithms being implemented. One of the interesting things to look at is exposure and histogram data often used in HDR rendering. Calculating this data in the PS requires several passes and tricks to take all the pixels and either bin them or average them. Despite the fact that sharing data is going to slow things down quite a bit, sharing data can be much faster than running many passes and this makes the CS an ideal stage for such algorithms.

So What's a Tessellator?

This has been covered before now in other articles about DirectX 11, but we first touched on the subject with the R600 launch. Both R6xx and R7xx hardware have tessellators, but since these are proprietary implementations, they won't be directly compatible with DirectX 11 which uses a much more sophisticated setup. While neither AMD nor the DX11 tessellator itself are programmable, DX11 includes programmable input to and output from the tesselator (TS) through two additional pipeline stages called the Hull Shader (HS) and the Domain Shader (DS).

The tessellator can take coarse shapes and break them up into smaller parts. It can also take these smaller parts and reshape them to form geometry that is much more complex and that more closely approximates reality. It can take a cube and turn it into a sphere with very little overhead and much fewer space requirements. Quality, performance and manageability benefit.

The Hull Shader takes in patches and control points out outputs data on how to configure the tessellator. Patches are a new primitive (like vertices and pixels) that define a segment of a plane to be tessellated. Control points are used to define the parametric shape of the desired surface (like a curve or something). If you've ever used the pen tool in Photoshop, then you know what control points are: these just apply to surfaces (patches) instead of lines. The Hull Shader uses the control points to determine how to set up the tessellator and then passes them forward to the Domain Shader.

The tessellator just tessellates: it breaks up patches fed to it by the Hull Shader based on the parameters set by the Hull shader per patch. It outputs a stream of points to the Domain Shader, which then needs to finish up the process. While programmers must write HS programs for their code, there isn't any programming required for the TS. It's just a fixed function block that processes input based on parameters.

The Domain Shader takes points generated by the tessellator and manipulates them to form the appropriate geometry based on control points and/or displacement maps. It performs this manipulation by running developer designed DS programs which can manipulate how the newly generated points are further shifted or displaced based on control points and textures. The Domain Shader, after processing a point, outputs a vertex. These vertices can be further processed by a Geometry Shader, which can also feed them back up to the Vertex Shader using stream out functionality. More likely than heading back up for a second pass, we will probably see most output of the Domain Shader head straight on to rasterization so that its geometry can be broken down into screen space fragments for Pixel Shader processing.

That covers what the basics of what the tesselator can do and how it does it. But do you find your self wondering: "self, can't the Geometry Shader just be used to create tessellated surfaces and move the resulting vertices around?" Well, you would be right. That is technically possible, but not practical at this point.

Tessellation: Because The GS Isn't Fast Enough

Microsoft and AMD tend to get the most excited about tessellation whenever the topic of DX11 comes up. AMD jumped on the tessellation bandwagon long ago, and perhaps it does make sense for consoles like the XBox 360. Adding fixed function hardware to quickly and efficiently handle a task that improves memory footprint has major advantages in the living room. We still aren't sold on the need for a tessellator on the desktop, but who's to argue with progress?

Or is it really progressive? The tessellator itself is fixed function rather than programmable. Sure, the input to and output of the tessellator can be manipulated a bit through the Hull Shader and Domain Shader, but the heart of the beast is just not that flexible. The Geometry Shader is the programmable block in the pipeline that is capable of tessellation as well as much more, but it just doesn't have the power to do tessellation on any useful scale. So while most everything has been moving towards programmability in the rendering pipe, we have sort of a step backward here. But why?

The argument between fixed function and programmable hardware is always one of performance versus flexibility and usefulness. In the beginning, fixed function was necessary to get the desired performance. As time went on, it became clear that adding in more fixed function hardware to graphics chips just wasn't feasible. The transistors put into specialized hardware just go unused if developers don't program to take advantage of it. This made a shift toward architectures where expanding the pool of compute resources that could be shared and used for many different tasks became a much more attractive way to go. In the general case anyway. But that doesn't mean that fixed function hardware doesn't have it's place.

We do still have the problem that all the transistors put into the tessellator are worthless unless developers take advantage of the hardware. But the reason it makes sense is that the ROI (return on investment: what you get for what you put in) on those transistors is huge if developers do take advantage of the hardware: it's much easier to get huge tessellation performance out of a fixed function tessellator than to put the necessary resources into the Geometry Shader to allow it to be capable of the same tessellation performance programmatically. This doesn't mean we'll start to see a renaissance of fixed function blocks in our graphics hardware; just that significantly advanced features going forward may still require the sacrifice of programability in favor of early adoption of a feature. The majority of tasks will continue to be enabled in a flexible programmable way, and in the future we may see more flexibility introduced into the tessellator until it becomes fully programmable as well (or ends up just being merged into some future version of the Geometry Shader).

Now don't let this technical assessment of fixed function tessellation make you think we aren't interested in reaping the benefits of the tessellator. Currently, artists need to create different versions of their objects for different LODs (Level of Detail -- reducing or increasing complexity as the object moves further or nearer the viewer), and geometry simulation through texturing at each LOD needs to be done by pixel shaders. This requires extra work from both artists and programmers and costs a good bit in terms of performance. There are also some effects than can only be done with more geometry.

Tessellation is a great way to get that geometry in there for more detail, shadowing, and smooth edges. High geometry also allows really cool displacement mapping effects. Currently, much geometry is simulated through textures and techniques like bump mapping or parallax occlusion mapping or some other technique. Even with high geometry, we will want to have large normal maps for our lighting algorithms to use, but we won't need to do so much work to make things like cracks, bumps, ridges, and small detail geometry appear to be there when it isn't because we can just tessellate and displace in a single pass through the pipeline. This is fast, efficient, and can produce very detailed effects while freeing up pixel shader resources for other uses. With tessellation, artists can create one sub division surface that can have a dynamic LOD free of charge; a simple hull shader and a displacement map applied in the domain shader will save a lot of work, increase quality, and improve performance quite a bit.

If developers adopt tessellation, we could see cool things, and with the move to DX11 class hardware both NVIDIA and AMD will be making parts with tessellation capability. But we may not see developers just start using tessellation (or the compute shader for that matter) right away. Because DirectX 11 will run on down level hardware and at the release of DX11 we will already have a huge number cards on the market capable of running a subset of DX11 bringing with it a better, more refined, programming language in the new version of HLSL and seamless parallelization optimizations, we will very likely see the first DX11 games only implementing features that can run completely on DX10 hardware.

Of course, at that point developers can be fully confident of exploiting all the aspects of DX10 hardware, which they still aren't completely taking advantage of. Many people still want and need a DX9 path because of Vista's failure, which means DX10 code tends to be more or less an enhanced DX9 path rather than something fundamentally different. So when DirectX 11 finally debuts, we will start to see what developers could really do with DX10.

Certainly there will be developers experimenting with tessellation, but these will probably just be simple amplification to get rid of those jagged edges around curved surfaces at first. It will take time for the real advanced tessellation techniques everyone is excited about to come to fruition.

The final bit of DX11 we'll touch on is the update to HLSL (MS's High Level Shader Language) in version 5.0 which brings some very developer friendly adjustments. While HLSL has always been similar in syntax to C, 5.0 adds support for classes and interfaces. We still don't get to use pointers though.

These changes are being made because of the sheer size of shader code. Programmers and artists need to build or generate either a single massive shader or tons of smaller shader programs for any given game. These code resources are huge and can be hard to manage without OOP (Object Oriented Programming) constructs. But there are some differences to how things work in other OOP languages. For instance, there is no need for memory management (because there are no pointers) or constructors / destructors in HLSL. Tasks like initialization are handled through updates to constant buffers, which generally reflect member data.

Aside from the programmability aspect, classes and interfaces were added to support dynamic shader linkage to combat the intricacy of developing with huge numbers of resources and effects. Dynamic linking allows the application to decide at runtime what shaders to compile and link and enables interfaces to be left ambiguous until runtime. At runtime, shaders are dynamically linked and based on what is linked all possible function bodies are then compiled and optimized. Compiled hardware-native code isn't inlined until the appropriate SetShader function is called.

The flexibility this provides will enable development of much more complex and dynamic shader code, as it won't all need to be in one giant block with lots of "ifs", nor will there need to be thousands of smaller shaders cluttering up the developers mind. Performance of the shaders will still limit what can be done, but with this step DirectX helps reduce code complexity as a limiting factor in development.

With all of this - the ability to perform unordered memory accesses, multi-threading, tessellation, and the Compute Shader - DX11 is pretty aggressive. The complexity of the upgrade, however, is mitigated by the fact that this is nothing like the wholesale changes made in the move from DX9 to DX10: DX11 is really just a superset of DX10 in terms of features. This enables the ability for DX11 to run on down-level hardware (where DX11 specific features are not used), which when combined with the enhancements to HLSL with OOP and dynamic shader linking mean that developers should really have fewer qualms about moving from DX10 to DX11 than we saw with the transition from DX9. (Of course, that's nothing new: the first DX8 games shipped when DX9 was out, and it wasn't until DX10 that we saw a reasonable number of DX9 titles.)

To be fair, the OS upgrade requirement also threw a wrench in the gears. That won't be a problem this time, as Vista still sucks but will be getting DX11 support and Windows 7 looks like a better upgrade option for XP users than Vista. Developers who haven't already moved from DX9 may well skip DX10 altogether in favor of DX11 depending on the predicted ship dates of their titles; all signs point to DX11 as setting the time frame when we start to see the revolution promised with the move to DX10 take place.


Piotr Klimczyk said...

Nice article, pleasant to read.

Post a Comment