AxisOxy wrote:Hey Douglas,
Hi! It's always good to meet another traveler on the same road.
Thanks for the comments and summary of some of your own experiences here...
AxisOxy wrote:And I know what I´m talking about. I did something very similiar for AGA-Amigas some time ago.
I still work a bit on them from time to time.
Funnily I was also focusing on Doom and Quake 2. Perhaps they are just the sweet-spot for this kind of hardware.
I think that's probably true - I had a sort of checklist of reasons to focus on Q2 and it covered quite a lot of ground. I think most of them held firm as well (so far, anyway!).
The original engines were a special kind of genius to begin with, but they also left plenty of rich pickings for explorers.
Not just code optimizations but interchangeable techniques as well. We'll discuss one here (spheres!) and a couple of other things I did in my own version.
AxisOxy wrote:I´d like to share some of the things I stumbled upon during the development.
Perhaps it helps a bit on squeezing more out of your engine.
Cool! I'll offer up what I can here also.
AxisOxy wrote:Looks like I had a little bit different approach than you. If I followed it right, you based your engine on the original code and optimized it.
Like porting from float to int and porting parts from C to ASM/DSP-ASM.
That's mostly true, but the process overall was quite complicated and happened in differently-shaped stages. I'll try to summarize it a bit so it makes more sense.
There are a few reasons I wanted to start with the original code (or at least, a correctly functioning reference version, original or otherwise!). I had some quite painful experiences with the earlier Doom project because the rendering part had been created before I got anywhere near ID's source, and worked quite differently from the original. Remember all those cool tricks mapmakers were able to exploit to produce interesting effects? Yep - lots of those failed to work in my version because the techniques didn't match, the bugs didn't match and figuring out the differences was in some cases just too much trouble for me. By the end it got pretty close but still not 100%, and a lot of time was burned getting it sensible.
Doom is internally quite convoluted, compared with Q2 (which I think is more elegant/simple relative to what it outputs, even if it does require careful study!) so with hindsight it probably wouldn't have been so bad. I think only the lightmap dimension estimation needs to be 100% right (a lot of people had trouble with this it seems) and the rest can be reproduced more or less independently - at least it seems manageable without lots of divergence. But I approached the Q2 source pretty cold - I had last visited the Q1 source in 1997 and didn't remember much detail - and adapting the existing code was a good way to catch up again. I only remembered a few important properties about the renderer that were enough to make me think it might translate well.
Anyway I planned to start with ID's source but immediately ran into problems with RAM - both physical machines and emulation. So I had to scrap that and adopt an alternate engine as the reference (that was Alexey Goloshubin's PolyEngine). This is a much leaner, simplified framework which I could run natively without all the data and the same early demand for RAM.
Unfortunately I discovered that PolyEngine contained a number of quite serious bugs, and I spent nearly as much time diagnosing and fixing those as I could have spent reducing the memory & data footprint of ID's original. But that kind of thing always happens to me.
One of the bugs actually turned out to be the lightmap extents calculation being wrong, causing every 11th map to have 1 or 2 skewed lightmaps. Just enough not to notice until too late.
Other bugs were more insidious... particularly those involving clipping....
But I can't complain because it got the whole process started and was a win overall.
So between the ID source and PolyEngine, I cobbled together a lean-ish reference engine that would display maps correctly as Q2 did (i.e. without the bugs, glitches, and with an actual BSP algorithm driving the scene, which PolyEngine did not use - it used a flat list of clusters only, seemingly aimed more at hardware).
From there I started converting all of the render-side float work (and related data) into fixed-point, and made sure the changes could be confirmed against the original after each step of progress. I didn't include the texture transform math in this conversion - since it would need to be replaced by a different solution, it remained as floats until later on. I did however implement the per-pixel steps using integers only, as a sub-project of the whole thing. I won't detail that here because it would take too long, but it was extremely useful to have a C reference version for that!
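To make the float-to-fixed conversion concrete, here is a minimal sketch of the idea, assuming a 16.16 format (the actual formats used in the engine aren't stated here, so the type and helper names are purely illustrative):

```c
#include <stdint.h>

/* Hypothetical 16.16 fixed-point type -- an illustration only; the
   real engine may well use several different precisions. */
typedef int32_t fixed_t;
#define FIX_SHIFT 16
#define FLOAT_TO_FIX(f) ((fixed_t)((f) * (1 << FIX_SHIFT)))
#define FIX_TO_FLOAT(x) ((float)(x) / (1 << FIX_SHIFT))

/* Fixed-point multiply: widen to 64 bits so the intermediate doesn't overflow. */
static fixed_t fix_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) >> FIX_SHIFT);
}

/* A dot product in fixed point, the bread-and-butter operation of a
   float-free vertex pipeline. */
static fixed_t fix_dot3(const fixed_t a[3], const fixed_t b[3])
{
    return fix_mul(a[0], b[0]) + fix_mul(a[1], b[1]) + fix_mul(a[2], b[2]);
}
```

The point of keeping a C version like this alongside the asm is exactly what's described above: every converted stage can be diffed against the float reference before moving on.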
Once I was sure that the main parts (vertex pipeline, clipping, spans) were all float-free, I started to review the algorithms/functional blocks being used and began to replace these with alternative solutions. This is where it starts to look a lot more like a 'from scratch' project. I'll explain a couple of specific cases below.
Once all the functional blocks had been replaced with more Falcon-friendly methods I started to create 68k versions of those to see how they would map to the CPU cache etc. and occasionally would go back and redo the C reference if necessary to keep the two versions closely related.
When that seemed settled, I'd start on a 68k/DSP hybrid implementation and in some cases a pure-DSP implementation where it made sense (most are hybrid). In a few cases I changed the technique again and would rewrite all 3 versions (ouch), but that was rare.
By the time all that was done, the engine had quite a different structure and methods from either ID's or PolyEngine, and I did eventually lose sync with the reference C version of the renderer (because it was becoming a lot of work to maintain) so it is now quite far behind the Falcon implementation - it has no transparency support etc. I keep it around mainly to help debug the loader when adding a new format, and will refer to it again when I start working properly on the dynamic models.
So yes - this one is not a from-cold, from-scratch engine, but OTOH there is no part of it left (the renderer) which is from any other engine - most of the final blocks are essentially from scratch (err... except for a small number of functions, e.g. R_PushDLights, which I lifted from Q2 and kept pretty much as-is). The whole process was slow and incremental but has resulted in something quite different in structure from ID's.
AxisOxy wrote:I started a complete new engine from scratch with only the fileformat as shared base.
And even for the fileformat I used a converter on the PC, which improved loading times and also enabled some more aggressive optimizations of the data-structures.
Well, that would have been a tough but rewarding task.
I think I did everything I could with the loader without actually preprocessing the maps offline. I thought of a number of convert steps which could have helped but decided to see how far I could get without it. It's now at the point though where further progress would need to involve an offline step (or a lot of waiting at runtime! already the case with coloured lightmaps...).
The new loader is C++ with a specialization per format - and once loaded, the map data is split into what's needed by the game layer (e.g. collision stuff) and what's needed by the renderer, since optimizing for each is different. So there is already separation between what's in the BSP file and what is in memory. This could be a starting point for offline preprocessing later on.
AxisOxy wrote:A big bottleneck in my engine was the BSP-traversal and culling of the node bounding boxes. Moving from an AABB-tree to a minimum fit sphere-tree helped alot on that. It cut the amount of needed dot-products for the culling by about 50%. Another thing that helped was removing the recursion from the BSP-traversal. The alternative is to use a loop with a stack. Michael Abrash wrote an article about that technique in his "Zen of Graphics programming" book. Dunno why they dont used the tech in Quake, perhaps it just didnt matter on Pentium class CPU´s.
You're dead right about the BSP as a bottleneck - I spent quite a lot of time getting it to work sensibly.
I mentioned a few posts back that AABBs could perhaps be dumped for spheres, but wasn't sure about the change in culling effectiveness on typical node shapes. Hadn't done the experiment. Sounds like it really has been a win for you though - I will have to try it now.
(I did try it in Doom but the gain was limited because AABBs were more often optimal in 2D for those maps)
The Falcon's DSP is very good at dot products so that part isn't too much of an issue. Dynamically ordering the AABB min/max points, however, is very painful, and it also involves (slowly) transmitting 6 words per node box (vs 4 per sphere), which makes a mockery of DSP-side performance. I even tried some nasty SMC on the DSP to do the reordering but it just moved cost around and didn't help overall.
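The appeal of the sphere test is easy to see in a sketch. A sphere-vs-plane cull is one dot product and a compare, with no per-plane corner selection; the type and function names below are hypothetical, not taken from either engine:

```c
typedef struct { float x, y, z; } vec3;
typedef struct { vec3 n; float d; } plane;   /* plane: dot(n, p) + d = 0 */
typedef struct { vec3 c; float r; } sphere;  /* 4 values, vs 6 for an AABB */

/* Cull a node's bounding sphere against one frustum plane.
   One dot product and a radius compare -- no dynamic reordering of
   min/max corner points as an AABB test would need. */
static int sphere_outside(const sphere *s, const plane *p)
{
    float dist = p->n.x * s->c.x + p->n.y * s->c.y + p->n.z * s->c.z + p->d;
    return dist < -s->r;   /* fully behind the plane: cull the subtree */
}
```

The trade-off, as noted above, is that a minimum-fit sphere can be a looser bound than an AABB on long, thin nodes, so culling effectiveness on typical map geometry is the thing to measure.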
Removing the BSP recursion - I did exactly the same thing. It uses a data stack with one child on the stack and the other always in a register. However I had to do a number of other things with the BSP algorithm to get it to cache properly and still cover all the necessary aspects.
- The original BSP algorithm implements multiple duties at once, all mixed up: depth-ordering (priority keys), backface culling, PVS intersections and drawing-related work. I split these duties up into separate blocks of work, each narrowing the results of the previous (and outputting a finer grain than the previous). This allowed each block to fit in the tiny i-cache. The priority keys are essential later when trying to draw models without a zbuffer :-/ so I needed this side of things not to get lost.
- The DSP is better at dotproducts and math than the CPU, but can't access main RAM, so I brewed an algorithm which had them working synchronously, each with their own mirror of the BSP stack. This allows each chip to predict to some extent what the other will do next, in order to limit stalls.
- The original algorithm implements backface removal by testing the 'side' flag of each face, in relation to the current node 'side' being visited. This involves looking up faces and testing words etc. per face which I didn't like (not to mention the drawing part) so I rewired the node face lists into pairs of winding pools and index the correct winding pool with the active node side. This gets all the visible faces at once without any conditions (admittedly not many at once - but still helped quite a bit).
- The 68030 can cache aligned longword writes, so I fiddle the PMMU to make nodes non-cacheable while the stack remains cacheable, so the pushes get buffered and the pops get read from the d-cache. This offers a small speed advantage.
- I ended up with 3 different in-memory representations for planes. One float (really for the un-converted collision code), one storing 15bit normals for the BSP algorithm, and one storing 23bit normals, for precise texture transforms. A bit wasteful but was necessary for speed.
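The recursion removal mentioned above (one child kept "in hand", the other pushed) might look roughly like this in C - a simplified sketch with a hypothetical node layout, ignoring the camera-side near/far ordering, the split into separate duty blocks, and the DSP mirroring, and assuming every internal node has both children:

```c
#define MAX_BSP_DEPTH 64

typedef struct node {
    struct node *front, *back;   /* both NULL-children == leaf */
    int id;
} node;

/* Iterative BSP walk: recursion replaced by an explicit stack.
   The loop variable plays the role of the register-held child; the
   deferred child is pushed for later.  visit_order collects leaf ids
   so the traversal can be checked against a recursive reference. */
static int walk_bsp(node *root, int *visit_order)
{
    node *stack[MAX_BSP_DEPTH];
    int sp = 0, count = 0;
    node *n = root;

    for (;;) {
        while (n->front) {                 /* descend one side...        */
            stack[sp++] = n->back;         /* ...deferring the other     */
            n = n->front;
        }
        visit_order[count++] = n->id;      /* leaf reached */
        if (sp == 0)
            break;
        n = stack[--sp];                   /* pop the most recent deferral */
    }
    return count;
}
```

Keeping the stack in a small cacheable buffer is also what makes the PMMU trick described above (non-cacheable nodes, cacheable stack) pay off.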
AxisOxy wrote:Another thing I stumbled upon is the fact, that there are alot of "unneccesary" points in the polygons. Like 3 or more points along the same edge, which is caused by splitting the polygons for the BSP-generation. But I didnt got them away. My problem was, removing them worked fine when rendering the edges in float, but produced ugly artifacts (gaps) when rendering them in int. I guess that caused by T-junctions. But normally everything that works fine in float should also be possible in int, if you are careful enough about the normalization of your values.
I was aware of the extra colinear points - but didn't try to remove them for that very reason. In fact I've been juggling with the idea of introducing extra points to seal existing t-junctions. I still have problems with texture sparkles and most are caused by texture addressing imprecision (because the pixel colours are wild) - but some structures when viewed at great distances appear a bit like sky colour and might be t-juncts opening up. So we'll see what happens there....
I suppose having 3 options isn't bad either - 1) optimize colinear points away, 2) leave it as is, 3) close existing t-juncts with new points. Which one suits will depend on the use case, I suppose!
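For option 1, the detection step at least can be exact. Shown in 2D for brevity (the real polygons are 3D, so this is only an illustration of the idea, with hypothetical names), an integer cross product identifies a removable point with no float epsilon to mis-tune - the gap artifacts described above come from the later integer rasterization of the modified edges, not from the test itself:

```c
#include <stdint.h>

typedef struct { int32_t x, y; } ivec2;

/* Is b colinear with the edge a->c?  The 2D cross product of (b-a) and
   (c-a) is zero exactly when the three points lie on one line; 64-bit
   intermediates keep the products from overflowing. */
static int colinear(ivec2 a, ivec2 b, ivec2 c)
{
    int64_t cross = (int64_t)(b.x - a.x) * (c.y - a.y)
                  - (int64_t)(b.y - a.y) * (c.x - a.x);
    return cross == 0;
}
```

Option 3 is the mirror image: instead of deleting such points from the long edge, you insert the neighbouring polygon's vertex into it so both sides rasterize the identical edge.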
AxisOxy wrote:Thats the first things I remember from my work on the Quake renderer. I guess, there will be more to come.
So, greetings from the Amiga-Scene,
Greetings straight back from the Atari-Scene. Thanks again for chipping in - I'm definitely thinking seriously about the bounding spheres again, and the whole offline thing.