Speculation: pipelining geometry shaders

A followup to the older "discussion" about how/why geometry shaders would be okay/slow:

Graphics hardware has been quite successful so far at hiding memory latencies (e.g. when sampling textures). It does so (according to my understanding) by having a looong pixel pipeline, where hundreds (or thousands) of pixels might be at one processing stage or another. ATI talks about this in big letters (R520 dispatch processor), and speculation suggests that GeForceFX had something like that (article). I have no idea about the older cards, but presumably they did something similar as well.
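To make the latency-hiding guess concrete, here is a toy model in Python. Everything in it is my assumption, not a description of any real GPU: a single ALU, every pixel doing a fixed number of math cycles followed by one texture fetch with fixed latency, and free switching between pixels. The name `alu_utilization` and the cycle counts are made up for illustration.

```python
def alu_utilization(in_flight, compute_cycles, fetch_latency):
    """Fraction of cycles the ALU is busy in a toy round-robin model.

    Each pixel does `compute_cycles` of math, then waits `fetch_latency`
    cycles for its texture fetch. While it waits, the ALU works on other
    in-flight pixels (switching is assumed to be free).
    """
    if in_flight < 1 or compute_cycles < 1 or fetch_latency < 0:
        raise ValueError("need in_flight >= 1, compute_cycles >= 1, fetch_latency >= 0")
    # One pixel keeps the ALU busy for compute_cycles out of every
    # (compute_cycles + fetch_latency) cycles; more pixels fill the gap
    # until the ALU is saturated.
    per_pixel_period = compute_cycles + fetch_latency
    return min(1.0, in_flight * compute_cycles / per_pixel_period)


if __name__ == "__main__":
    # Hypothetical numbers: 4 cycles of math per ~196-cycle fetch.
    for n in (1, 10, 50, 200):
        print(n, alu_utilization(n, compute_cycles=4, fetch_latency=196))
```

In this model the ALU only stays fully busy once the number of pixels in flight reaches roughly (math + latency) / math, which is one way to see why the pipeline would need to hold so many pixels.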

I am not sure how vertex texture fetches are pipelined - the pretty slow performance on GeForce6/7 suggests that they aren't :) Probably vertex shaders in current cards operate in a simpler way - just fetch the vertices and run whole shaders on them (in contrast to pixel shaders, which seem to run just several instructions, then go to other pixels, come back, etc.).

With DX10, we have arbitrary memory fetches in any stage of the pipeline. Even the boundary between different fetch types is somewhat blurry (constant buffers vs. arbitrary buffers vs. textures) - perhaps they will differ only in bandwidth/latency (e.g. constant buffers live near the GPU while textures live in video memory).

So, with arbitrary memory fetches anywhere (and some of them being high latency), everything needs to have long pipelines (again, just my guess). This is all great, but the longer the pipeline, the worse it performs in unfriendly scenarios: a pipeline flush is more expensive, drawing just a couple of "things" (primitives, vertices, pixels) is inefficient, etc.
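The "couple of things is inefficient" part can be sketched with the textbook fill/drain model of a pipeline. This is a rough illustration under my own assumptions (an idealized pipeline that accepts one item per cycle and starts empty), and `pipeline_efficiency` is a hypothetical helper, not anything a real driver exposes.

```python
def pipeline_efficiency(items, depth):
    """Useful work divided by total cycles for an idealized pipeline.

    The pipeline has `depth` stages, accepts one item per cycle and starts
    empty (e.g. right after a flush). The first result appears after `depth`
    cycles, the rest follow one per cycle.
    """
    if items < 1 or depth < 1:
        raise ValueError("items and depth must be >= 1")
    total_cycles = depth + items - 1
    return items / total_cycles


if __name__ == "__main__":
    # Hypothetical 100-stage pipeline fed with batches of different sizes.
    for n in (1, 10, 100, 10000):
        print(n, round(pipeline_efficiency(n, depth=100), 4))
```

The deeper the pipeline, the more items a batch needs before the fill/drain cost stops dominating, and every flush pays that cost again.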

I guess we'll just learn a new set of performance rules for tomorrow's hardware!

Back to GS pipelining: I imagine that the "slow" scenarios would be like this: vertices have shaders with dynamic branches or memory fetches that differ vastly in execution length - so the GS has to wait for all vertex shaders of the current primitive (optional: plus topology) to finish; and then each GS has dynamic branches or memory fetches, and outputs a different number of primitives to the rasterizer. If I were hardware, I'd be scared :)
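A small Python sketch of the "GS waits for the slowest vertex" scenario. It assumes (my assumptions, purely for illustration) a non-indexed triangle list, all vertices of a primitive starting at the same time, and made-up cycle counts; `gs_stall_cycles` is a hypothetical name.

```python
def gs_stall_cycles(vertex_cycles, verts_per_prim=3):
    """Per-primitive idle cycles spent waiting for the slowest vertex shader.

    `vertex_cycles` lists how long each vertex shader invocation runs,
    grouped consecutively by primitive. The geometry shader can only start
    once the slowest vertex of its primitive is done, so every faster
    vertex sits finished but unused for the difference.
    """
    if verts_per_prim < 1 or len(vertex_cycles) % verts_per_prim != 0:
        raise ValueError("vertex count must be a multiple of verts_per_prim")
    stalls = []
    for i in range(0, len(vertex_cycles), verts_per_prim):
        prim = vertex_cycles[i:i + verts_per_prim]
        ready = max(prim)  # the GS for this primitive starts here
        stalls.append(sum(ready - c for c in prim))
    return stalls


if __name__ == "__main__":
    # First triangle: uniform vertices. Second: one vertex takes a long branch.
    print(gs_stall_cycles([10, 10, 10, 10, 10, 100]))
```

One divergent vertex is enough to hold back its whole primitive, and that is before the GS itself starts branching or emitting a variable number of primitives.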

A quote from the same Beyond3D article: Unlike ATI's Xenos graphics processors for the XBOX 360, the R520 architecture is a little more traditional in that it doesn't make use of a unified architecture and hence has distinct vertex shader and pixel shader processors. This can probably be taken as a bit of a hint.
Yeah, I know (=heard :)) that Xenos has unified shader pipes, with dynamic allocation happening somehow. In fact, both MS and ATI have stated that multiple times.

Future hardware also seems to be going in that direction, as most of the "stuff" in the shaders will have the same capabilities (with some exceptions: break/emit is GS only; derivatives/automatic mip level is PS only). Maybe the hardware is just a big array of "computation/sampling units", and someone "from above" allocates them on demand.

Speculating is fun :)