Another update from last night after a couple of hours of hacking.
I highly recommend to anyone interested in 3D (or graphics programming, or any performance programming) to have a go at adapting your own version of one of these engines and try to improve them. I don't mean just assembler-optimizing the usual bits and pieces that everyone does for a port - I mean trying to change how the engine works by understanding each step properly. It is a useful learning process even if you have a lot of experience with this stuff already. Something new always turns up.
It's interesting that I have still found many optimizations despite so many amazing tricks in there. But I would have badly screwed it up by now without properly getting every single last detail, down to a single innocent looking line of code that doesn't seem to matter much.
There isn't a line out of place. If anything they just stopped improving it when it stopped mattering for a fast PC.
Three optimizations I did use, as examples:
1) Reformatting the PVS to use a memory bitfield instruction pointing at the PVS line, instead of addressing arithmetic, shifts and bit masks on bytes. This was mentioned before, and was used in BadMood also. The main benefit is size - it can be used within other code without stealing registers or cache space.
2) The backface removal step has to test winding flags on each face to see which side of the surface plane they are attached to, before visibility can be determined. This happens within the face submission loop and therefore happens a lot. I changed this to reorganise the faces linked under each node into two sets - front and back. The result of the BSP hyperplane which-side-of-plane test indexes the correct group and just loops it. This completely removes the winding test for each face. This cost a little memory because the reorganization is indirected via more indices, to avoid messing up other parts of the engine - the resulting code is faster, smaller and easier to optimize than the original version.
3) There is an optimization which tries to shortcut the side-of-plane test for axis-aligned planes, by checking their flags. This also gets used a lot. It just happens that the checking is unnecessary since the flag codes match the indexing of vectors wanted. e.g.
Code: Select all
switch (iplane->type)
{
case PLANE_X:
dot = ccam->c[0] - plane->dist;
break;
case PLANE_Y:
dot = ccam->c[1] - plane->dist;
break;
case PLANE_Z:
dot = ccam->c[2] - plane->dist;
break;
..becomes...
Code: Select all
idot = ccam->ic[type] - iplane->dist;
There are lots of cases like this which can be used to make the code smaller and make whole processes fit in the cache. I've been trying to find as many as possible in the areas which can't be effectively DSP'd later, to raise the performance ceiling imposed by 16MHz on bulk processing and random access to big data.
Most of the 030 optimizations though on the main code are concerned with taking big sprawling algorithms and breaking them up into several smaller processes with queues between them, so each step can cache while communicating as little information to the next stage as possible (usually a compacted array of indexes or pointers, where each process puts out less than it took in).
This nearly always works well, because it's usually easy to ensure the density of data written to (and then read from) the queue is far smaller than the density of instructions fetched from RAM by the code if not cached properly. e.g. if each iteration of a routine fetches 300+ instructions and incurs 100 cache misses, that's 100 additional fetches of word-pairs per iteration. If the routine can be cached, it will fetch 0 additional words per iteration, and write just one to the queue. That's an enormous bus saving in some cases and especially in truecolour mode on F030 where the bus is much less available for instruction fetching. There are a few cases where it doesn't work so well - if the same context needs to be set up multiple times at some cost, or if a lot of info needs put in the queue - but it's usually east to tell when that's the case.
This is best done by changing the C code first, and then converting each stage into 68k while using the other stages to test the changes.