I had a little time to look at some strategies for optimizing per-face setup.
In the original code, every face gets set up on the way into the rasterization step - partly so the z-plane can be used to generate the zbuffer for scenery surfaces.
By eliminating the zbuffer, it's been possible to move face setup math to the last stage - for faces with physical spans (at least one pixel needing to be drawn). This cuts down the number of setups quite a bit. But it's still a lot of work and still FPU-bound.
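The deferral above can be sketched roughly like this - a minimal illustration, not the engine's actual code, with all names made up: the expensive per-face math only runs when the span emitter produces the face's first non-empty span, so fully occluded faces never pay for setup.

```c
/* Sketch of deferred face setup: the plane/gradient math runs only
 * when the first real span for a face is emitted. Illustrative only. */

typedef struct {
    float nx, ny, nz, dist;   /* view-space plane (placeholder fields) */
    int   setup_done;         /* 0 until the first span triggers setup */
} face_t;

static int setups_performed = 0;   /* counts how much FPU work we paid */

/* the expensive FPU work, done at most once per face */
static void setup_face_math(face_t *f)
{
    if (f->setup_done)
        return;
    /* ... real plane/texture-gradient math would go here ... */
    setups_performed++;
    f->setup_done = 1;
}

/* called by the span emitter; faces with zero spans never get set up */
static void emit_span(face_t *f, int y, int x0, int x1)
{
    (void)y;
    if (x1 <= x0)
        return;               /* empty span: nothing to draw, no setup */
    setup_face_math(f);
    /* ... rasterize pixels x0..x1-1 on scanline y ... */
}
```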
The same is done for surface cache filling, but this was always done at the last stage - I didn't change that part.
It turns out there are quite a lot of options for speeding it up - it's just that none of them are quick and easy to test.
1) just move it to the DSP (will take a while to move it all, quite a lot of it). this will create a small delay before each face is drawn - not much parallelism, but it will at least be done fast. it may however not be so accurate and cause more sparkles.
2) have the scanconvertor report to the CPU on the first span generated on any new face during scanning, so the CPU/FPU can set up the face math in parallel during scanning. this will slow down scanning a little bit, but gain parallelism. it also allows the CPU cache to be dedicated to one tight loop of code for that one task instead of fragmenting among many tasks as it is now. however it still requires a pretty optimal version on the CPU/FPU side and might still be too expensive.
3) some kind of split - do half of it in parallel during scanning, and half just before drawing. more complex to get working, might involve transferring more values to DSP per face.
4) leave it as it is, but optimize the hell out of it and recommend overclocking the FPU
but that would put another FPU-shaped dent in the 'stock machine' thing wouldn't it?
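Option 2 amounts to a single-producer/single-consumer handoff between the scanconvertor and the CPU. A minimal sketch of that idea, assuming a small ring buffer (everything here is illustrative - names, sizes, and the fallback policy are not from the engine):

```c
/* Sketch of option 2: the scan side (producer) pushes a face index the
 * first time it generates a span for that face; the CPU/FPU (consumer)
 * drains the queue and runs face setup in parallel with scanning. */

#define SETUP_QUEUE_SIZE 64   /* power of two for cheap wraparound */

typedef struct {
    int head, tail;                   /* zero-init = empty queue */
    int faces[SETUP_QUEUE_SIZE];
} setup_queue_t;

/* producer side: called on the first span of each new face */
static int queue_push(setup_queue_t *q, int face_index)
{
    int next = (q->head + 1) & (SETUP_QUEUE_SIZE - 1);
    if (next == q->tail)
        return 0;    /* full: caller falls back to setup-at-draw-time */
    q->faces[q->head] = face_index;
    q->head = next;
    return 1;
}

/* consumer side: CPU pops a face and sets it up while scanning continues */
static int queue_pop(setup_queue_t *q, int *face_index)
{
    if (q->tail == q->head)
        return 0;    /* empty: nothing pending */
    *face_index = q->faces[q->tail];
    q->tail = (q->tail + 1) & (SETUP_QUEUE_SIZE - 1);
    return 1;
}
```

Since head and tail are each written by only one side, this shape keeps the two loops mostly independent - which is also what lets the CPU cache stay resident in one tight setup loop, as mentioned above.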
The surface cache filling would definitely benefit from knowing which faces are to be drawn in advance, in order to fill multiple at a time and cache better - although if not done carefully it would be possible to kick out textures needed later in the same frame, before they get drawn (intra-frame thrashing). Another strategy could be to move a 'copy' of the surface cache filler to the outer part of the engine, iterating over the PVS a few faces per frame, round-robin, to predictively fill the cache. This would spread out the work better (instead of chugging when you look round a new corner, since many of the faces will have been pre-fetched before you get there).
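The round-robin idea is simple enough to sketch - a hypothetical outer-loop hook, with the budget, the fill function, and all names being assumptions rather than the engine's actual API:

```c
/* Sketch of round-robin PVS prefetch: each frame, visit a few faces of
 * the current PVS and pre-fill their surface cache entries, amortizing
 * the cost of turning a new corner over earlier frames. */

#define FACES_PER_FRAME 4     /* tunable prefetch budget per frame */
#define MAX_FACES 16          /* toy limit for this sketch */

static int fill_calls[MAX_FACES];   /* stand-in for the real filler */

/* placeholder for the real surface cache filler */
static void surface_cache_fill(int face_index)
{
    fill_calls[face_index]++;
}

typedef struct {
    int num_faces;            /* faces in the current PVS */
    int cursor;               /* persists across frames, wraps around */
} pvs_prefetch_t;

/* called once per frame from the outer part of the engine */
static void pvs_prefetch_tick(pvs_prefetch_t *p)
{
    int i;
    if (p->num_faces <= 0)
        return;
    for (i = 0; i < FACES_PER_FRAME; i++) {
        surface_cache_fill(p->cursor);
        p->cursor = (p->cursor + 1) % p->num_faces;  /* round-robin wrap */
    }
}
```

A real version would also need to skip faces whose cache entry is still valid, and respect cache pressure so the prefetcher itself doesn't cause the eviction problem described above.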
I also had a look at the BSP storage for two of the inputs to the surface plane math - the 3D plane array, and texinfo array. Both are indexed off each face, and combined with the view to generate the texture plane in realtime.
I was trying to figure out if there are fewer combinations of those two things, than there are faces - if so it might be worth caching the resulting planes and reusing the results between faces. e.g. if you have a wall with 10 different faces on it, but in fact made up of only 2 textures, it suggests there is 1 plane and 2 texinfos shared across 10 faces, and therefore a lot of reuse.
A quick test with this idea using just-planes or just-texinfos didn't appear to work, so either I did it wrong or it has to be done with plane/texinfo pair-patterns. When I get more time I'll try it.
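For the pair-pattern variant, one shape it could take is a small direct-mapped cache keyed on the (plane index, texinfo index) pair and stamped with the frame number. This is a speculative sketch under those assumptions - the hash, the sizes, and the names are all made up:

```c
/* Sketch of a pair-keyed texture-plane cache: faces sharing both the
 * same plane and the same texinfo reuse one derived texture plane per
 * frame instead of redoing the view-dependent FPU math. */

#define PAIR_CACHE_SIZE 256

typedef struct {
    int plane, texinfo, frame;  /* key + validity stamp */
    float result[4];            /* cached texture plane (placeholder) */
} pair_entry_t;

/* frames are numbered from 1, so the zero-initialized cache starts stale */
static pair_entry_t pair_cache[PAIR_CACHE_SIZE];
static int compute_calls = 0;

/* stand-in for the real view-dependent texture plane math */
static void compute_texture_plane(int plane, int texinfo, float out[4])
{
    compute_calls++;
    out[0] = (float)plane;
    out[1] = (float)texinfo;
    out[2] = 0.0f;
    out[3] = 0.0f;
}

static const float *lookup_texture_plane(int plane, int texinfo, int frame)
{
    unsigned slot = (unsigned)(plane * 31 + texinfo) % PAIR_CACHE_SIZE;
    pair_entry_t *e = &pair_cache[slot];
    if (e->frame != frame || e->plane != plane || e->texinfo != texinfo) {
        compute_texture_plane(plane, texinfo, e->result);   /* miss */
        e->plane = plane;
        e->texinfo = texinfo;
        e->frame = frame;
    }
    return e->result;           /* hit or freshly filled */
}
```

In the 10-faces/1-plane/2-texinfos example above, this would do the math twice per frame instead of ten times - assuming the pair counts really do come out lower than the face count, which is exactly what the BSP measurement would need to confirm.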
In the end that may not work - the Carmack/Abrash surface rasterization algorithm has the capability to rasterize non-convex and even disassociated faces (one 'face' can be made up of separated primitives, but using a single edgelist). This suggests the optimization could have been done offline already, and attempting it here is futile. But I'll soon find out if that's the case - perhaps they avoided dealing with disassociated faces for other reasons elsewhere in the engine, in which case I could take advantage of it.
No real coding time ATM but still time to think about stuff