Quake 2 on Falcon030

dml · Post by **dml** » Sat Aug 02, 2014 7:24 pm

So the 2D clipper is working now, and that means most of what's needed for the first part is already on the DSP.

There is still some crap left on the CPU side that collects clipped edges and mimics what the C version was doing, very inefficiently. It now spends half of its time just doing that. It is a bit faster than before but not by a lot, because of the rubbish that still needs cleaning up.

grab.jpg

The clipping has been set to 16 pixels from the 320x200 perimeter for testing.

While the CPU appears to be under some stress, the DSP seems to be idling despite doing all the hard work.

Code: Select all

Used cycles:
  86.12%                   123100680                       _unused_
   5.23%                     7475174                       R_IndexedEdge3DAddToFrustum
   5.10%   5.10%   5.30%     7287768   7287768   7575120   R_Edge2DAddToViewport
   3.31%                     4729922                       R_XFormProjectVertices
   0.15%   0.15%   0.15%      216792    216792    216792   R_Line2DIntersectXY
   0.05%   0.05%   0.05%       70560     70560     70560   R_Edge2DCacheCull
   0.05%   0.05%   0.05%       67200     67200     67200   R_EdgeIntersectZ
   0.00%                         256                       R_AdvanceDate

Still no real optimizations done - so far just transferring and converting PC code to Falcon DSP. Re-engineering things a bit to make things fit, and do chunks of related work in bursts etc.

dml · Post by **dml** » Sun Aug 03, 2014 5:00 pm

Today, in between painting and decorating jobs, I found a really old laptop in a cardboard box. This was the first laptop I ever bought and was used for homebrew projects.

Digging through the HD has resurfaced good stuff. Lots of code, experiments and unfinished things. Some of it half-Atari and half-pc, probably from the time when I was writing software rasterizers for PC games. There are bits and pieces of different 3D engines and data export and preprocessing tools for Maya and Max.

One of the projects was a global illumination experiment, a tiny raytracer using a trick to manage ray quantity. It was a bit irrelevant back then, running on a CPU, but maybe more interesting now. It was an early attempt at modelling light physically using a lot of rays since even at that time I thought things would *eventually* go that way.

output-colourdiffuse.png

Didn't actually implement more than spheres though

There are 2 or 3 other interesting things here which might be useful on Atari.

CiH · Post by **CiH** » Sun Aug 03, 2014 7:59 pm

Woah!

How many more hidden treasures have you got there?

dml · Post by **dml** » Sun Aug 03, 2014 8:20 pm

CiH wrote:Woah!

How many more hidden treasures have you got there?

Too many distractions, that's how many!

There are 2 things I haven't found yet but that must be getting close to most or all of it by now.

dml · Post by **dml** » Mon Aug 04, 2014 7:07 pm

More progress on this:

https://www.youtube.com/watch?v=pqVu4F1 ... UM&index=4

All of the pieces for the second 1/4 of the graphics pipeline are now in place, although not optimized.

The first 1/4 needs attention now - the C code can't feed the geometry pipe fast enough in its current form, and the two stages are synchronous which wastes time.

There's no point in looking at the next stage until the first two stages can be made fast enough.

Zamuel_a · Post by **Zamuel_a** » Mon Aug 04, 2014 8:11 pm

Really impressive. It's almost at a "playable" framerate now!

dml · Post by **dml** » Mon Aug 04, 2014 8:24 pm

Zamuel_a wrote:Really impressive. It's almost at a "playable" framerate now!

I had hoped to get it faster than this in less time, but it is a bit like knocking down weeds. All the bits which have been replaced don't take much time anymore, but the stuff that used to be 2% is now 10%, and there are a lot of those rising to the surface.

Will rewrite some of the C to feed the DSP more effectively instead of calling a function for every face edge etc. Things are getting fiddly now and the DSP RAM is overrunning, causing some lines to go a bit crazy. Later I'll have to do more work to process groups of 500-1000 edges or so at a time instead of the current cap on the scene.
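A rough sketch of what feeding edges in bursts could look like on the C side. Everything here is hypothetical: `EdgeBatch`, `edge_batch_add`, the flush callback standing in for a transfer to the DSP, and the capacity of 512 are made up for illustration and are not taken from the actual engine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EDGE_BATCH_CAPACITY 512  /* assumed; the post only says 500-1000 */

typedef struct { int16_t v0, v1; } Edge;   /* hypothetical: two vertex indices */
typedef void (*FlushFn)(const Edge *edges, size_t count, void *user);

typedef struct {
    Edge    edges[EDGE_BATCH_CAPACITY];
    size_t  count;
    FlushFn flush;   /* stands in for "send this burst to the DSP" */
    void   *user;
} EdgeBatch;

static void edge_batch_init(EdgeBatch *b, FlushFn flush, void *user)
{
    assert(flush != NULL);
    b->count = 0;
    b->flush = flush;
    b->user  = user;
}

/* Hand any pending edges to the consumer in one burst. */
static void edge_batch_flush(EdgeBatch *b)
{
    if (b->count > 0) {
        b->flush(b->edges, b->count, b->user);
        b->count = 0;
    }
}

/* Queue one edge; the consumer is only involved when the buffer is full. */
static void edge_batch_add(EdgeBatch *b, Edge e)
{
    if (b->count == EDGE_BATCH_CAPACITY)
        edge_batch_flush(b);
    b->edges[b->count++] = e;
}

/* Example consumer: just counts bursts and edges. */
typedef struct { size_t bursts, edges; } FlushStats;

static void count_flush(const Edge *edges, size_t count, void *user)
{
    FlushStats *s = (FlushStats *)user;
    (void)edges;
    s->bursts++;
    s->edges += count;
}
```

The fixed capacity also bounds how much is ever sent in one go, which is the property that matters if the receiving side has limited RAM.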

dml · Post by **dml** » Wed Aug 06, 2014 12:54 am

I got stuck while converting the Q2 BSP code tonight, trying to understand what was happening in one specific place during drawing. I was trying to use some of the same tricks in the original code but converted to work better on a slower architecture, smaller cache etc. But this means moving things around, changing the order of things. One detail didn't make sense to me so I couldn't really interfere with it and that put a halt on things.

Digging around on the internet for forum postings, FAQs etc. turned up nothing. In fact, the same wrong information kept turning up which is surprising because it's in conflict with the source, at least for the ref_soft rasterizer. This has to do with faces recorded at nodes vs faces recorded at leaves, and why the renderer refers to both (various articles suggest only the leaf faces are used for drawing - not true).

I was pretty sure JC would not refer to the same data twice for the same job, so there must be a good reason, and simplifying it would be causing damage.

As is usually the case, staring at it hoping for a clue didn't help - leaving it alone and doing something else usually gets a result in less time.

I now see what it is actually doing with both sets of faces, and I have to admit it is clever.

So I'm unstuck again for another go.

Zamuel_a · Post by **Zamuel_a** » Wed Aug 06, 2014 10:28 am

Wouldn't Wolfenstein be easy to convert to Falcon? Of course the ST version exists, but I guess it would be better to run the original code with TC graphics instead of the ST c2p routines?

dml · Post by **dml** » Wed Aug 06, 2014 10:58 am

Zamuel_a wrote:Wouldn't Wolfenstein be easy to convert to Falcon? Of course the ST version exists, but I guess it would be better to run the original code with TC graphics instead of the ST c2p routines?

It should be pretty easy to convert to the Falcon - it is more powerful than the original 286 machine targeted. You could even say the Falcon is over-specified for it

Which is a bit of a luxury because most of the time we're having to do crazy stuff to get things optimized for our Ataris.

It should also be possible to do some lighting and floor texturing with the spare cycles - but it would then be a bit less like Wolf3D so that would depend on the aims.

dml · Post by **dml** » Wed Aug 06, 2014 11:42 am

So the next thing I will try to do is get the BSP hierarchical clipping logic controlling the DSP.

The previous tests did stupid-clipping where every leaf and every face in every leaf gets clipped against 5 clipping planes before edges can be generated. The DSP is quite fast at the clipping but it's still 90% of the work the DSP is currently doing. The CPU also has to do a lot of bounding box tests which are unnecessary.

The Q2 engine uses BSP cleverness to shortcut a large number of tests and can do so down to a per-face level (although I didn't check if it actually uses face-level clipping info yet - I'll do it anyway because it will probably help).

Problem is the technique doesn't translate well to 030 because it involves a lot of code and is floating point, so it needs redone. I have it basically working but not fast, and it's not using the information yet to properly speed up clipping.

dml · Post by **dml** » Thu Aug 07, 2014 12:12 pm

Another update from last night after a couple of hours of hacking.

I highly recommend to anyone interested in 3D (or graphics programming, or any performance programming) to have a go at adapting your own version of one of these engines and try to improve them. I don't mean just assembler-optimizing the usual bits and pieces that everyone does for a port - I mean trying to change how the engine works by understanding each step properly. It is a useful learning process even if you have a lot of experience with this stuff already. Something new always turns up.

It's interesting that I have still found many optimizations despite so many amazing tricks in there. But I would have badly screwed it up by now without properly getting every single last detail, down to a single innocent looking line of code that doesn't seem to matter much.

There isn't a line out of place. If anything they just stopped improving it when it stopped mattering for a fast PC.

Three optimizations I did use, as examples:

1) Reformatting the PVS to use a memory bitfield instruction pointing at the PVS line, instead of addressing arithmetic, shifts and bit masks on bytes. This was mentioned before, and was used in BadMood also. The main benefit is size - it can be used within other code without stealing registers or cache space.

2) The backface removal step has to test winding flags on each face to see which side of the surface plane they are attached to, before visibility can be determined. This happens within the face submission loop and therefore happens a lot. I changed this to reorganise the faces linked under each node into two sets - front and back. The result of the BSP hyperplane which-side-of-plane test indexes the correct group and just loops it. This completely removes the winding test for each face. This cost a little memory because the reorganization is indirected via more indices, to avoid messing up other parts of the engine - the resulting code is faster, smaller and easier to optimize than the original version.

3) There is an optimization which tries to shortcut the side-of-plane test for axis-aligned planes, by checking their flags. This also gets used a lot. It just happens that the checking is unnecessary since the flag codes match the indexing of vectors wanted. e.g.

Code: Select all

		switch (iplane->type)
		{
		case PLANE_X:
			dot = ccam->c[0] - plane->dist;
			break;
		case PLANE_Y:
			dot = ccam->c[1] - plane->dist;
			break;
		case PLANE_Z:
			dot = ccam->c[2] - plane->dist;
			break;

..becomes...

Code: Select all

		idot = ccam->ic[type] - iplane->dist;

There are lots of cases like this which can be used to make the code smaller and make whole processes fit in the cache. I've been trying to find as many as possible in the areas which can't be effectively DSP'd later, to raise the performance ceiling imposed by 16MHz on bulk processing and random access to big data.
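For optimization 2, the front/back grouping could be sketched along these lines. This is only an illustration under assumptions: `NodeFaces`, `build_node_groups`, `submit_visible` and the `on_back` flag array are hypothetical names, and the real engine's data layout will differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: each node owns two runs in a shared index array,
   one per side of its plane. Side 0 = front group, side 1 = back group. */
typedef struct {
    uint16_t first[2];  /* start of each run in face_indices[] */
    uint16_t count[2];  /* length of each run */
} NodeFaces;

/* Partition a node's faces by a precomputed "attached to back side" flag.
   Done once at load time; returns the number of indices written. */
static uint16_t build_node_groups(NodeFaces *nf,
                                  const uint16_t *faces, const uint8_t *on_back,
                                  uint16_t nfaces,
                                  uint16_t *face_indices, uint16_t base)
{
    uint16_t out = base;
    for (int side = 0; side < 2; side++) {
        nf->first[side] = out;
        for (uint16_t i = 0; i < nfaces; i++)
            if ((on_back[i] != 0) == (side == 1))
                face_indices[out++] = faces[i];
        nf->count[side] = (uint16_t)(out - nf->first[side]);
    }
    return (uint16_t)(out - base);
}

/* At draw time the side-of-plane result picks a run directly, so no
   per-face flag test remains inside the loop. */
static uint16_t submit_visible(const NodeFaces *nf, const uint16_t *face_indices,
                               int camera_side, uint16_t *out)
{
    assert(camera_side == 0 || camera_side == 1);
    const uint16_t *run = face_indices + nf->first[camera_side];
    uint16_t n = nf->count[camera_side];
    for (uint16_t i = 0; i < n; i++)
        out[i] = run[i];
    return n;
}
```

The extra index array is the "little memory" cost mentioned above: the original face data is left untouched and only the per-node lists are reordered.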

Most of the 030 optimizations though on the main code are concerned with taking big sprawling algorithms and breaking them up into several smaller processes with queues between them, so each step can cache while communicating as little information to the next stage as possible (usually a compacted array of indexes or pointers, where each process puts out less than it took in).

This nearly always works well, because it's usually easy to ensure the density of data written to (and then read from) the queue is far smaller than the density of instructions fetched from RAM by the code if not cached properly. e.g. if each iteration of a routine fetches 300+ instructions and incurs 100 cache misses, that's 100 additional fetches of word-pairs per iteration. If the routine can be cached, it will fetch 0 additional words per iteration, and write just one to the queue. That's an enormous bus saving in some cases and especially in truecolour mode on F030 where the bus is much less available for instruction fetching. There are a few cases where it doesn't work so well - if the same context needs to be set up multiple times at some cost, or if a lot of info needs put in the queue - but it's usually easy to tell when that's the case.

This is best done by changing the C code first, and then converting each stage into 68k while using the other stages to test the changes.
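As a minimal sketch of the staged-queue idea in C, assuming a made-up `Item` type and two made-up stages (`stage_cull`, `stage_process`); the point is only the shape: one small loop writes a compacted index array, a second small loop reads it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-item data; only the index travels between stages. */
typedef struct { int16_t min_z, max_z; } Item;

/* Stage 1: a small loop that rejects items and emits a compacted
   array of surviving indices (it writes less than it reads). */
static size_t stage_cull(const Item *items, size_t n, int16_t near_z,
                         uint16_t *queue)
{
    size_t out = 0;
    assert(n <= 65536u);              /* indices must fit in 16 bits */
    for (size_t i = 0; i < n; i++)
        if (items[i].max_z >= near_z)
            queue[out++] = (uint16_t)i;
    return out;
}

/* Stage 2: a second small loop that consumes the queue. Its code is
   separate from stage 1, so each loop can be kept small on its own. */
static int32_t stage_process(const Item *items, const uint16_t *queue, size_t qn)
{
    int32_t total = 0;
    for (size_t i = 0; i < qn; i++) {
        const Item *it = &items[queue[i]];
        total += it->max_z - it->min_z;   /* placeholder for the real work */
    }
    return total;
}
```

Because each stage is an ordinary function over arrays, one stage can be swapped for a 68k version while the C version of the other stage is kept around to test against.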

dml · Post by **dml** » Thu Aug 07, 2014 12:57 pm

Here's one of those reasons you need to use 68k everywhere

Compiler sometimes is a bit stupid.

Code: Select all

$02cba6 :             clr.l     d0                         0.18% (114650, 458600, 505)
$02cba8 :             move.w    (a0)+,d0                   0.18% (114650, 918780, 0)
$02cbaa :             move.w    d1,(a2,d0.l*2)             0.18% (114650, 917828, 0)
$02cbae :             cmpa.l    a0,a1                      0.18% (114650, 459812, 265)
$02cbb0 :             bne.s     $2cba6                     0.18% (114650, 827392, 8)

Aside from the fact the loop size is known and it could have been unrolled, why is it clearing d0 on every loop? The upper word never changes. It seems to be treating the sign extension/clearing stuff as a separate thing from data flow - optimized too late.

This is GCC 4.6 with "-O3 -fomit-frame-pointer"

Eero Tamminen · Post by **Eero Tamminen** » Thu Aug 07, 2014 1:38 pm

dml wrote:This is GCC 4.6 with "-O3 -fomit-frame-pointer"

What's the C-code it optimized?

dml · Post by **dml** » Thu Aug 07, 2014 1:52 pm

Eero Tamminen wrote:
dml wrote:This is GCC 4.6 with "-O3 -fomit-frame-pointer"
What's the C-code it optimized?

I'm working now so can't do much more than c&p at the moment

Code: Select all

{
	u16* marks;
	s16* marked;

	marks = cbsp->marksurfaces2 + leaf->firstmarksurface_idx;
	marked = cbsp->surfacemarks;

	do
	{
		u16 face = *marks++;
		marked[face] = r_framecount;
	} while (--c);
}

The zero extension of 'face' shouldn't be required on each fetch, given that all results are unsigned and the upper word is therefore always zero.

However, in the real world the C compiler may not be well geared for local variables as the C++ mode should be (where it really is supposed to make use of local decls to shorten the lifespan of variables and reuse registers etc. with less derivation from the internal dataflow graph). I'll try it later but I suppose it's a minor gripe in the scheme of things.

kristjanga · Post by **kristjanga** » Thu Aug 07, 2014 3:08 pm

This thread will be mental to follow
I think that if you would implement flat shaded possibilities from the start to speed up things then it would help with speed (and awesomeness)
not like in bm where flat shades do not help at all, they look great but no gain in speed
it's just a thought, i am no programmer

----
So Doug, after this project what will it be? Duke Nukem 3D

just kidding.

dml · Post by **dml** » Thu Aug 07, 2014 3:29 pm

kristjanga wrote:This thread will be mental to follow
I think that if you would implement flat shaded possibilities from the start to speed up things then it would help with speed (and awesomeness)
not like in bm where flat shades do not help at all, they look great but no gain in speed
it's just a thought, i am no programmer

BadMood suffered from a bunch of problems which limited the usefulness of flatshading or other shortcuts.

- The project was started - and a lot of the core stuff completed - before I got a look at JC's original source. So in some areas it's *completely different*. This made it a complete dog to tie up with the game code in the end. It actually loads the same data twice in two different formats in some cases. Memory management had the game and engine fighting over the same RAM.

- It was optimized from early on to assume texturing, so a lot of the code complexity and traffic is shunting texture related info around. Including to and from DSP. Shortcutting that for the benefit of flatshading is just seriously hard (and probably involves 2 versions of the DSP core) so I didn't bother.

- The original Doom engine has some design problems which make it suck badly on an old CPU with a small cache. It's constantly flipping between jobs as it goes through the scene. You can undo some of that, but only by giving up some other things (and completely changing how it works, probably breaking stuff too). This problem still exists in BadMood, and one of the main reasons it still gets choppy with complex / outdoor maps. Fixing it means re-architecting too much stuff which involved the DSP too early and is now hard to unwind. I won't be rewriting that now. Would be easier to start from scratch.

Q2 is a bit different - the architecture doesn't have the same problems and has been better planned out. I'm also more aware of those other problems in BadMood from the start. It's also easier to shortcut texturing based work from everything else without massive changes so it's more likely that flatshading (or other methods) actually save time and are worthwhile.

kristjanga wrote:So Doug, after this project what will it be? Duke Nukem 3D just kidding.

Something - but it will definitely not be a porting project

Sadly most ports just involve minimal changes to code + a fast enough machine. The better ones (Atari, Amiga) go as far as replacing the drawing code with something very custom.

But to try something like this requires an awful lot more than that - in the end it will be seen as a 'port' like all the others so the extra effort will go unnoticed. But it's a good way to brush up skills so still worthwhile even so

mfro · Post by **mfro** » Thu Aug 07, 2014 3:50 pm

dml wrote: The zero extension of 'face' shouldn't be required on each fetch, given that all results are unsigned and the upper word is therefore always zero.

Not sure if my fake code resembles the complexity of the original correctly, but I played a little with it and to my astonishment, gcc seems to produce significantly better code if I just make face an s16 instead of a u16

dml · Post by **dml** » Thu Aug 07, 2014 3:58 pm

mfro wrote: Not sure if my fake code resembles the complexity of the original correctly, but I played a little with it and to my astonishment, gcc seems to produce significantly better code if I just make face an s16 instead of a u16

That's disconcerting because I'd expect the opposite. Seems to be losing track of the state of the register assigned to that var so it tries to keep clearing it. Making the var external to the scope might change that since it doesn't see it as a new var on each iter - but if it makes the problem go away it has other implications for what gcc is up to :-z

[EDIT]

...actually if s16 is used instead it probably just uses .w signed addressing instead of .l

That might be why the code improved so much. In this case though the face count can be >32k even if it is only 10k on that test map.
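To make the >32k point concrete, here is a small sketch of how the same 16-bit value turns into an array index under the two interpretations. The helper names are made up; the signed variant is written out by hand rather than with a cast so that it does not depend on compiler-specific conversion behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Index as the original code does: u16 zero-extends, covering 0..65535. */
static int32_t index_as_u16(uint16_t raw)
{
    return (int32_t)raw;
}

/* Index as the s16 variant would: reinterpret the same 16 bits as signed
   (what a sign-extending word load does), covering -32768..32767. */
static int32_t index_as_s16(uint16_t raw)
{
    return (raw & 0x8000u) ? (int32_t)raw - 65536 : (int32_t)raw;
}
```

For a face number of 10000 both give 10000, so the test map would not show a difference. For 40000 the signed version gives -25536, i.e. a write before the start of the array.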

mfro · Post by **mfro** » Thu Aug 07, 2014 4:06 pm

the loop body actually boils down to this:

Code: Select all

<.L2>:
     17c:	3258           	moveaw %a0@+,%a1
     17e:	d3c9           	addal %a1,%a1
     180:	35b9 0001 bd18 	movew 1bd18 <__etext>,%a2@(0000000000000000,%a1:l)
     186:	9800 
     188:	b088           	cmpl %a0,%d0
     18a:	66f0           	bnes 17c <.L2>

dml · Post by **dml** » Thu Aug 07, 2014 4:08 pm

mfro wrote:the loop body actually boils down to this:

Code: Select all

<.L2>:
     17c:	3258           	moveaw %a0@+,%a1
     17e:	d3c9           	addal %a1,%a1
     180:	35b9 0001 bd18 	movew 1bd18 <__etext>,%a2@(0000000000000000,%a1:l)
     186:	9800 
     188:	b088           	cmpl %a0,%d0
     18a:	66f0           	bnes 17c <.L2>

That is indeed very different code.

It's taking advantage of the movea sign extension. Quite a change though...

dml · Post by **dml** » Thu Aug 07, 2014 7:16 pm

So when I started working on the main BSP function (R_RecursiveWorldNode) at the core of Quake2 it was generating several kbytes of mixed 68k and '882 FPU code and was thrashing the CPU cache pretty much 100% of the time. Even the important loops didn't cache properly.

Replacing the 3D math parts with assembler and breaking the rest away into separate queues reduced that to about 450 bytes. This is still far too big to cache, and the C code still operates as a recursive function, pushing stuff on the stack, saving registers and making subroutine calls to itself.

A second attempt implements all of this in 68k, and removes the recursion. This was actually quite hard because unlike Doom this thing is performing work in between the near/far child subtrees, and that work at each node is ordered against the leaves in the subtree it just dealt with. I think this is why the result didn't benefit from tail recursion as was the case for the Doom BSP code. So it's compiled much more literally, and is less efficient.

Anyway I managed to flatten it completely now - it uses a register based stack to push the current node's context and performs the near/far child walk by updating 68k registers only. It's a slightly rotated version of the version I did for Doom but a bit harder to follow because of the weird ordering.
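For anyone wanting to try the same thing in C first, a flattened near / node / far walk might look roughly like this. It is a sketch under assumptions, not the 68k routine: the `Node` layout, the axis-aligned-only plane test, the leaf encoding and `MAX_DEPTH` are all invented to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEPTH 64  /* assumed bound on tree depth for this sketch */

/* Hypothetical node: axis-aligned split only, to keep the sketch short.
   child >= 0 is a node index; child < 0 encodes leaf number -(child+1). */
typedef struct {
    uint8_t axis;       /* 0, 1 or 2 */
    int32_t dist;       /* plane distance along that axis */
    int16_t child[2];   /* [0] = front side, [1] = back side */
} Node;

/* Walk the near subtree, then the node itself, then the far subtree,
   without recursion. Emits leaf codes (negative) and node indices
   (non-negative) into out[] in visiting order; returns the count. */
static size_t walk_flat(const Node *nodes, int16_t root,
                        const int32_t cam[3], int16_t *out, size_t cap)
{
    struct { int16_t node; uint8_t side; } stack[MAX_DEPTH];
    size_t sp = 0, n = 0;
    int16_t cur = root;

    for (;;) {
        /* Descend along the near side, remembering each node passed. */
        while (cur >= 0) {
            const Node *nd = &nodes[cur];
            uint8_t side = (cam[nd->axis] - nd->dist) < 0;  /* 1 = camera behind */
            assert(sp < MAX_DEPTH);
            stack[sp].node = cur;
            stack[sp].side = side;
            sp++;
            cur = nd->child[side];
        }
        assert(n < cap);
        out[n++] = cur;                 /* reached a leaf */

        if (sp == 0)
            break;                      /* nothing left to resume */

        /* Resume the most recent node: do its own work (here: emit it),
           then continue into its far child. */
        sp--;
        assert(n < cap);
        out[n++] = stack[sp].node;
        cur = nodes[stack[sp].node].child[stack[sp].side ^ 1];
    }
    return n;
}
```

The stack entry only needs the node and which side was taken, since the far child is simply the other one.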

The result is now 270 bytes for the whole dynamic visibility pass ***. Just need to find another 14 bytes and it will fit inside the Falcon's CPU cache!

I tested it by running the C version on top of the 68k version, and having it check that every output it attempts is matched by the same output at the same time by the 68k version. This actually helped me find an obscure bug where I treated normal nodes (coded as -1) as the only negative node type coding. It turned out that AREAPORTAL is coded as $8000 so I did occasionally get a mismatch between the two routines viewing in certain directions. Didn't take long to find the odd value and look it up in the source.... It always pays to write a decent test for something new, and avoid the 'seems to work' T-shirt.
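The lockstep check can be sketched in plain C like this. The two routines being compared are toy stand-ins (enumerating set bits in a word), not the BSP code, and all names are hypothetical; `bits_broken` is deliberately wrong so there is something to catch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*Emit)(void *ctx, int value);
typedef void (*Routine)(uint32_t input, Emit emit, void *ctx);

/* Reference: plain and obviously correct. Emits indices of set bits. */
static void bits_ref(uint32_t mask, Emit emit, void *ctx)
{
    for (int i = 0; i < 32; i++)
        if ((mask >> i) & 1u)
            emit(ctx, i);
}

/* "Optimized" stand-in: skips straight to each set bit. */
static void bits_fast(uint32_t mask, Emit emit, void *ctx)
{
    while (mask) {
        uint32_t low = mask & (~mask + 1u);   /* isolate lowest set bit */
        int idx = 0;
        while ((low >> idx) != 1u)
            idx++;
        emit(ctx, idx);
        mask &= mask - 1u;                    /* clear it */
    }
}

/* Deliberately broken variant (drops bit 15) so a mismatch can be seen. */
static void bits_broken(uint32_t mask, Emit emit, void *ctx)
{
    bits_fast(mask & ~UINT32_C(0x8000), emit, ctx);
}

typedef struct { int out[32]; size_t count; } Recorder;

static void record(void *ctx, int value)
{
    Recorder *r = (Recorder *)ctx;
    assert(r->count < 32);
    r->out[r->count++] = value;
}

typedef struct { const Recorder *rec; size_t pos; long first_bad; } Checker;

static void check(void *ctx, int value)
{
    Checker *c = (Checker *)ctx;
    if (c->first_bad < 0 &&
        (c->pos >= c->rec->count || c->rec->out[c->pos] != value))
        c->first_bad = (long)c->pos;
    c->pos++;
}

/* Record the candidate's outputs, then replay the reference against them.
   Returns -1 if the streams match, else the index of the first divergence. */
static long first_divergence(Routine candidate, uint32_t input)
{
    Recorder rec = { {0}, 0 };
    candidate(input, record, &rec);

    Checker chk = { &rec, 0, -1 };
    bits_ref(input, check, &chk);

    if (chk.first_bad < 0 && chk.pos != rec.count)
        return (long)chk.pos;   /* candidate emitted extra outputs */
    return chk.first_bad;
}
```

Knowing the index of the first divergence is what makes the odd value quick to track down, rather than just seeing that the final picture looks wrong.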

I think I know how to make it fit even if I can't scrape enough by fiddling with the instructions. There's a higher level optimization that can be made to the plane-testing math for 95% of the planes involved which will cut the code size and use fewer cycles, at the cost of emptying the cache for occasional planes. It's more complicated so I'll start by trying to make it fit without that first.

*** The static visibility pass (PVS) is done separately, and is clever enough only to do new work when the camera moves from one cluster into another one. So the cost doesn't really register. It also manages to inhibit work done by the BSP indirectly. How this is achieved moderately blew my mind when I noticed it, because it uses a favourite trick that I have used since forever - since long before I saw any ID code - but in a way that is just newly awesome.
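The "only do new work when the camera changes cluster" part could be sketched as below. This shows just the caching, with a plain byte-packed bit test; `VisCache`, `vis_update`, the bit order and the 64-cluster limit are assumptions for the example and say nothing about how the engine actually stores or propagates the result.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_CLUSTERS 64   /* hypothetical limit for this sketch */

typedef struct {
    int      last_cluster;            /* -1 until the first update */
    unsigned rebuilds;                /* counts how often real work ran */
    uint8_t  visible[MAX_CLUSTERS];   /* expanded 0/1 flag per cluster */
} VisCache;

static void vis_init(VisCache *vc)
{
    vc->last_cluster = -1;
    vc->rebuilds = 0;
    memset(vc->visible, 0, sizeof vc->visible);
}

/* Test one cluster's bit in a packed PVS row (bit 0 of byte 0 first). */
static int pvs_bit(const uint8_t *row, int cluster)
{
    return (row[cluster >> 3] >> (cluster & 7)) & 1;
}

/* Expand the row only if the camera has entered a different cluster. */
static void vis_update(VisCache *vc, int cam_cluster,
                       const uint8_t *row, int nclusters)
{
    assert(nclusters >= 0 && nclusters <= MAX_CLUSTERS);
    if (cam_cluster == vc->last_cluster)
        return;                       /* same cluster: reuse old result */
    for (int i = 0; i < nclusters; i++)
        vc->visible[i] = (uint8_t)pvs_bit(row, i);
    vc->last_cluster = cam_cluster;
    vc->rebuilds++;
}
```

Calling `vis_update` every frame is then nearly free unless the camera crosses a cluster boundary.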

dml · Post by **dml** » Thu Aug 07, 2014 8:06 pm

Ok I'm getting that problem I had with Hatari's WinUAE emulation core, where it refuses to accept that a 252-byte loop is small enough to fit in a 256-byte instruction cache.

$034c40 - $034b44 = $FC, or 252 bytes. But it is counting cache misses which are in the same ballpark as the iteration count of the loop. i.e. the instruction cache is thrashing every iteration. That's just wrong.

The last time I encountered this it was 254 bytes, where reducing the loop to 252 fixed it. But this seems worse now.

Code: Select all

[...]
$034b40 :             bra       $34c40                     0.00% (12, 96, 0)
[start]
$034b44 :             move.l    -(a6),d0                   0.08% (3144, 42600, 2404 ********)
$034b46 :             bmi.s     $34b40                     0.08% (3144, 17384, 1204 ********)
$034b48 :             movea.l   d0,a0                      0.08% (3132, 12528, 0)
$034b4a :             move.w    -(a6),d7                   0.08% (3132, 25344, 72)
$034b4c :             move.w    -(a6),d2                   0.08% (3132, 25056, 0)
...
$034c30 :             eor.w     d4,d2                      0.08% (3132, 12528, 0)
$034c32 :             move.w    d2,(a6)+                   0.08% (3132, 25056, 46)
$034c34 :             move.w    d7,(a6)+                   0.08% (3132, 25112, 0)
$034c36 :             move.l    a0,(a6)+                   0.08% (3132, 37584, 45)
$034c38 :             movea.l   $20(a0,d4.w*4),a0          0.08% (3132, 55044, 1219 ********)
$034c3c :             bra       $34b5a                     0.08% (3132, 22344, 1220 ********)
[end]
$034c40 :             move.l    a5,d0                      0.00% (12, 48, 24)

Anyway I'm going to call it a success, even if WinUAE doesn't agree.

The Quake 2 BSP algorithm fits in the Falcon's CPU cache.

Cyprian · Post by **Cyprian** » Thu Aug 07, 2014 8:09 pm

dml wrote:The result is now 270 bytes for the whole dynamic visibility pass ***. Just need to find another 14 bytes and it will fit inside the Falcon's CPU cache!

270-14=256. Is it ok? Some time ago I heard that a loop inside the 68030 cache should be max. 254 bytes in order to avoid prefetching code outside the cache (cache miss?).

dml · Post by **dml** » Thu Aug 07, 2014 8:13 pm

Cyprian wrote: 270-14=256. Is it ok? Some time ago I heard that a loop inside the 68030 cache should be max. 254 bytes in order to avoid prefetching code outside the cache (cache miss?).

The latest code is 252 bytes. And yes - you're right. I noticed last time that the instruction beyond the loop can cause thrashing even if it is not executed, because it gets fetched - so if it's a 4-byte op then it affects the loop size by fetching even more.

However I remembered that problem from before and stuck two 2-byte ops beyond the end of the loop. Still doesn't seem to fit with that and a 252-byte loop!

Weird.
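One way to poke at the arithmetic, as a sketch only. It assumes the 030 instruction cache is direct-mapped with 16 lines of 16 bytes (slot chosen by address bits 4-7), which is my reading of the MC68030 manual, and it assumes some number of bytes past the loop end get prefetched. Whether the emulator models things this way is not established here, and `slot_conflicts` is a made-up helper.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 16u   /* assumed: four longwords per cache line */
#define NUM_LINES  16u   /* assumed: 16 lines, 256 bytes in total  */

/* Count cache slots that two different lines of the range
   [start, end) plus `prefetch` trailing bytes would compete for. */
static unsigned slot_conflicts(uint32_t start, uint32_t end, uint32_t prefetch)
{
    assert(end > start);
    uint32_t first = start / LINE_BYTES;
    uint32_t last  = (end + prefetch - 1u) / LINE_BYTES;
    unsigned used[NUM_LINES] = {0};
    unsigned conflicts = 0;
    for (uint32_t line = first; line <= last; line++)
        if (used[line % NUM_LINES]++ == 1)   /* second line to claim this slot */
            conflicts++;
    return conflicts;
}
```

Under those assumptions, the range $034b44-$034c40 on its own touches 16 distinct lines and has no conflicts, but any fetch at $034c40 lands in the same slot as the line holding $034b44, because the loop does not start on a 16-byte boundary. The same 252 bytes starting at $034b40 would leave 4 bytes of slack before wrapping around.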
