64k coding continued

I'm making steady but very slow progress on "my" 64k intro. Over the last week I couldn't get over 13 kilobytes, so you can see the progress really is slow. Not because I don't code anything - every increase in code size was cancelled out by data size optimizations.

So far, coding and designing data for small sizes isn't that much pain at all. Just, well, code and, well, keep your data small :) We're only talking about the size of the initial data, not the runtime size, though.

A few obvious or new notes:
  • Code to construct a cylinder is more complex than the one to construct a sphere. That's what I expected. However, code to construct a box with multiple segments per side is the most complex of all!
  • Dropping the last byte from floats is usually okay - and an instant 25% saving! For some of the numbers, I plan to switch to half-style floats (2 bytes) if space becomes a concern.
  • Storing quaternions in 4 bytes (a byte per component) is good. Actually, now that I think of it, it makes more sense to store three components at 10 bits each and just store the sign of the 4th component - better precision for the same size.
  • This intro literally has the most complex and most automated "art pipeline" of any demo/game I (directly) worked on! I've got maxscripts generating C++ sources, custom commandline tools preprocessing C++ sources (mostly floats packing - due to lack of maxscript functionality), lua scripts for batch-compiling HLSL shaders, "development code" generating .obj models for import back into max, etc. It's wacky, weird and cool!
  • Compiling HLSL in two steps (HLSL->asm, then asm->bytecode) instead of directly (HLSL->bytecode) gets rid of the constant table and some copyright strings, and hence is good. (thanks blackpawn!)
  • Getting FFD code to behave remotely similar to how 3dsmax does FFD is hard :)
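The float truncation and the 10-bits-per-component quaternion tricks above can be sketched in a few lines. This is not the actual intro code - just a hypothetical illustration of the packing; the names (packFloat24, packQuat, etc.) and the exact bias/rounding choices are made up:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Truncated float: drop the least significant mantissa byte,
// storing only the top 3 bytes of the IEEE 754 representation.
uint32_t packFloat24(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, 4);
    return bits >> 8; // keep sign, exponent and top 15 mantissa bits
}

float unpackFloat24(uint32_t packed) {
    uint32_t bits = packed << 8; // the dropped mantissa byte becomes zero
    float f;
    std::memcpy(&f, &bits, 4);
    return f;
}

// Quaternion in 32 bits: x, y, z quantized to 10 bits each,
// plus the sign of w; |w| is reconstructed from unit length.
uint32_t packQuat(float x, float y, float z, float w) {
    auto q10 = [](float v) {
        int i = (int)std::lround(v * 511.0f); // [-511, 511]
        return (uint32_t)(i + 512);           // biased to [1, 1023]
    };
    uint32_t signW = (w < 0.0f) ? 1u : 0u;
    return (signW << 30) | (q10(x) << 20) | (q10(y) << 10) | q10(z);
}

void unpackQuat(uint32_t p, float& x, float& y, float& z, float& w) {
    auto u10 = [](uint32_t v) {
        return ((int)(v & 0x3FF) - 512) / 511.0f;
    };
    x = u10(p >> 20);
    y = u10(p >> 10);
    z = u10(p);
    float t = 1.0f - x * x - y * y - z * z;
    w = (t > 0.0f) ? std::sqrt(t) : 0.0f; // assumes a unit quaternion
    if (p & (1u << 30))
        w = -w;
}
```

The byte-per-component scheme gives ~1/255 precision on each of four components; the 10-10-10+sign layout spends the same 32 bits but gets ~1/511 precision on three components and recovers the fourth exactly up to quantization, which is where the precision win comes from.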
The best thing so far is that I've got the music track from x_dynamics - it's already done in the V2 synth, takes a small amount of space and is really good. Now I "just" have to finish the intro...



Speculation: pipelining geometry shaders

A followup to the older "discussion" about how/why geometry shaders would be okay/slow:

The graphics hardware has been quite successful so far at hiding memory latencies (i.e. when sampling textures). It does so (according to my understanding) by having a looong pixel pipeline, where hundreds (or thousands) of pixels might be at one or another processing stage. ATI talks about this in big letters (R520 dispatch processor) and speculation suggests that GeForceFX had something like that (article). I have no idea about the older cards, but presumably they did something similar as well.

I am not sure how vertex texture fetches are pipelined - the pretty slow performance on GeForce 6/7 suggests that they aren't :) Probably vertex shaders in current cards operate in a simpler way - just fetch the vertices and run the whole shader on them (in contrast to pixel shaders, which seem to run just several instructions, switch to other pixels, come back, etc.).

With DX10, we have arbitrary memory fetches in any stage of the pipeline. Even the boundary between different fetch types is somewhat blurry (constant buffers vs. arbitrary buffers vs. textures) - perhaps they will differ only in bandwidth/latency (e.g. constant buffers live near the GPU while textures live in video memory).

So, with arbitrary memory fetches anywhere (and some of them being high latency), everything needs to have long pipelines (again, just my guess). This is all great, but the longer the pipeline, the worse it performs in non-friendly scenarios: pipeline flush is more expensive, drawing just a couple of "things" (primitives, vertices, pixels) is inefficient, etc.

I guess we'll just learn a new set of performance rules for tomorrow's hardware!

Back to GS pipelining: I imagine the "slow" scenarios would be like this: vertex shaders have dynamic branches or memory fetches with vastly differing execution lengths - so the GS has to wait for all vertex shaders of the current primitive (optionally plus topology) to finish; and then each GS invocation has dynamic branches or memory fetches, and outputs a different number of primitives to the rasterizer. If I were hardware, I'd be scared :)



Reading DX10 docs...

Reading the DirectX10 preview documentation right now (you know, the one released with the Dec 2005 SDK). It is pretty impressive, I must say! Seems like a huge leap forward. Back to reading!




(the lack of updates recently is because I have lots of stuff going on here)

A few weeks ago I was visiting OTEE, and over the weekend we were jamming on a small game called Pakimono! The idea of the game was pretty cool - you're the naked guy and have to ruin tourists' photos :)

The whole experience was great. It was my first time using a Mac, first time working with Unity (their game development tool), etc. I coded & tuned most of the bullet-time character controller, where you drag your limbs with the mouse, trying to cover as much of the view as possible.

The coding was a bit unusual - for most of my coding life I've been doing pretty low-level C++ programming. This time it was completely different - I'd set up "the game" directly in the editor, write some short C# scripts and boom! - everything works, without me having to worry about any of the low-level stuff. No recompiling or any of that. Cool.

Ironically, I haven't seen the final Pakimono build yet. I left earlier than the others and don't have a Mac anywhere nearby. But the guys promised me a Windows build!