calimero wrote:
what are dominant instructions in R_RecursiveWorldNode4PL_68k?
Fortunately for us, multiplies
In fact I am not using the original algorithm from Q2 because it was way too heavy for the Falcon's CPU. I factored out anything that was not 'core' to the algorithm into separate passes, so only the core part is left, and that alone helped speed it up a lot already.
It is fortunate that there was so much stuff in there which could be made sequential/serialized and therefore decoupled from the BSP algorithm.
This is mainly how I got it inside the 256-byte instruction cache. It would not have been possible without that refactoring work (ignoring the DSP, for now).
calimero wrote:
and what is basic algorithm for this?
Fundamentally it is a recursive BSP descent algorithm, but it also incorporates frustum culling for geometry nodes and some other stuff. There are important details which affect us - much more than they affect PCs in recent years. I'll explain why below.
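In rough C, the core of the walk looks something like this - just a sketch with simplified stand-in types and names, not the actual Q2 or Falcon code (the real thing folds the culling, sort keys and face marking described below into this same walk):

[code]
/* Simplified sketch of a front-to-back recursive BSP walk.
   Types and names are illustrative, not the real engine structures. */
typedef struct node_s {
    int            contents;        /* negative = leaf, otherwise internal node */
    float          normal[3];       /* splitting hyperplane */
    float          dist;
    struct node_s *children[2];     /* [0] = front side, [1] = back side */
} node_t;

static float Dot3(const float a[3], const float b[3])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

void RecursiveWorldNode(node_t *node, const float vieworg[3])
{
    if (node->contents < 0)
        return;                     /* leaf: mark/emit its surfaces here */

    /* which side of this node's splitting plane is the camera on? */
    int side = (Dot3(vieworg, node->normal) - node->dist) < 0.0f;

    RecursiveWorldNode(node->children[side], vieworg);   /* nearer half first */
    /* ...emit the faces stored on this node here...                          */
    RecursiveWorldNode(node->children[!side], vieworg);  /* then farther half */
}
[/code]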
calimero wrote:
I ask this because I stumbled on a comment at
http://www.celephais.net/board/view_thr ... 01&end=125 that says that "RecursiveWorldNode" could be skipped

(not sure what the original function does, but he proposes to "build a list of surfaces visible in the current pvs each time the view leaf changes")
Yes - and that is pretty much true, on a PC with a GPU / a fast hardware path for geometry and clipping. It has been true for some years. I did the same test with the code on PC before I did any Falcon bits - just drawing the PVS groups directly.
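That 'skip it' approach looks roughly like the sketch below - all the names (leafs, ClusterPVS and so on) are placeholders standing in for the real map structures. Whenever the camera enters a new leaf you rebuild a flat surface list from the PVS bits and hand the whole thing over unsorted, letting the ZBuffer deal with occlusion:

[code]
/* Sketch of the "skip RecursiveWorldNode" approach: when the view leaf
   changes, rebuild a flat list of surfaces from the PVS bits and submit it
   unsorted (occlusion handled by a ZBuffer). All names are placeholders. */
typedef struct {
    int  cluster;          /* -1 = not in any vis cluster          */
    int  numsurfaces;
    int *surfaces;         /* indices of the surfaces in this leaf */
} leaf_t;

extern leaf_t leafs[];
extern int    numleafs;
extern const unsigned char *ClusterPVS(int cluster);  /* decompressed vis row */

static int visible_surfaces[65536];
static int num_visible;

void BuildVisibleSurfaceList(int viewcluster)
{
    const unsigned char *pvs = ClusterPVS(viewcluster);
    num_visible = 0;
    for (int i = 0; i < numleafs; i++) {
        int c = leafs[i].cluster;
        if (c < 0 || !(pvs[c >> 3] & (1 << (c & 7))))
            continue;                       /* this leaf's cluster is not visible */
        for (int s = 0; s < leafs[i].numsurfaces; s++)
            visible_surfaces[num_visible++] = leafs[i].surfaces[s];
    }
    /* a GPU can digest this list directly; a software rasteriser has now
       lost the BSP ordering, which is exactly the problem described below */
}
[/code]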
Stepping back a bit - before going into details - I picked Q2 as a target for a few reasons. One of those reasons is that it is kind of the 'last' new engine to really put effort into software rasterisation. A lot of the complexity in the map representation and rendering is aimed at software rasterisation - so much so that it actually hurts hardware rendering performance.
It is not a coincidence that Q3 dropped software rasterisation and started supporting things like curves - and I didn't bother attacking it for Falcon. While it may be doable, it is far less likely to produce good results because no compromises were made to assist software (except, perhaps, by accident or to save dev time - but not by intent).
The PVS is responsible for finding out what can probably be seen from where you are standing. It doesn't care what direction you are looking in - it is a static lookup based on your position in 3D space. (The BSP assists with this PVS lookup, but as a separate task from its drawing responsibilities - it's just used as a quick 3D search.)
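That 'quick 3D search' is just a point-in-leaf walk down the tree, something like this (reusing the node_t sketch above; the leaf you land in gives the cluster number that indexes the static PVS table):

[code]
/* The BSP used purely as a point search: descend from the root, always
   taking the child on the camera's side of the split plane, until a leaf
   is hit. The leaf's cluster then selects a precomputed PVS row. */
node_t *PointInLeaf(const float p[3], node_t *root)
{
    node_t *node = root;
    while (node->contents >= 0)                       /* still an internal node */
        node = node->children[(Dot3(p, node->normal) - node->dist) < 0.0f];
    return node;                                      /* the leaf containing p */
}
[/code]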
There is a separate step responsible for further narrowing what can be seen, in the direction you are actually looking. That is the frustum culling step. It is rolled into the RecursiveWorldNode pass. GPUs are fast enough that this is basically a waste of time on a modern PC - it takes less time to just chuck whole meshes at the GPU and let it sort out what is inside or outside of the viewport (for this kind of content anyway).
So the first useful thing RecursiveWorldNode does is provide viewport culling for the software rasteriser. It's actually quite a nice technique because it is smart enough to turn off clipping planes early in the BSP walk once they stop contributing, so they don't hurt performance deeper in the walk across lots of smaller meshes.
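Roughly, the clip-plane switch-off works like the sketch below - frustum[], BoxOnPlaneSide() and the node bounding boxes are stand-ins for the engine's own data:

[code]
/* Sketch of the clip-flag trick: test a node's bounding box against each
   still-active frustum plane; planes the box is fully inside get switched
   off for the entire subtree, so deep (small) nodes test few or no planes. */
#define NUM_FRUSTUM_PLANES 4

typedef struct { float normal[3]; float dist; } plane_t;
extern plane_t frustum[NUM_FRUSTUM_PLANES];

/* returns 1 = box fully in front, 2 = fully behind, 3 = straddles */
extern int BoxOnPlaneSide(const float mins[3], const float maxs[3],
                          const plane_t *p);

int CullNode(const float mins[3], const float maxs[3], int *clipflags)
{
    for (int i = 0; i < NUM_FRUSTUM_PLANES; i++) {
        if (!(*clipflags & (1 << i)))
            continue;                        /* already passed higher up the tree */
        int side = BoxOnPlaneSide(mins, maxs, &frustum[i]);
        if (side == 2)
            return 1;                        /* fully outside: cull whole subtree */
        if (side == 1)
            *clipflags &= ~(1 << i);         /* fully inside: stop testing below  */
    }
    return 0;                                /* visible (possibly still clipped)  */
}
[/code]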
GPUs are equipped with a ZBuffer, so for the most part it's possible to chuck unsorted meshes at hardware and have it draw correctly. Transparency is a bit more complicated, but not that much. The software rasteriser in Q2 was equipped with a ZBuffer too, but not for sorting the scenery - it was used only for sorting fine meshes/objects with a small fill area, because ZBuffer-testing pixels in software is very slow.
So the second thing RecursiveWorldNode does is generate integer depth-sorting keys for scenery (map) polygons. These sort keys are used by the software scan-conversion step to handle occlusion and guarantee zero overdraw in the framebuffer. Without the BSP sort keys, the only alternative is to sort individual spans by their nearest z value. Some engines do this, but the z values need to be precise, and typically floating point - which we want to avoid. It also leads to drawing errors, because it's a numerical problem rather than a logical one (the BSP is mostly logical).
It is notable that Abrash mentions this in his Quake 1 tech walkthrough - sorting spans by z causes a lot of additional correctness problems, and was abandoned. It will be worse if we try to do that with integers.
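The key generation itself is almost free once the walk order is right - something like this sketch (names loosely modelled on the Q2 software renderer): because nodes are visited near-to-far, bumping an integer counter per node gives every face a key where smaller simply means nearer.

[code]
/* Sketch: integer sort keys fall out of the near-to-far walk for free.
   Each node visited bumps a counter; all faces on that node share the key.
   The later scan-conversion pass resolves occlusion with integer compares,
   no per-pixel Z. Names are illustrative. */
typedef struct {
    int sortkey;            /* lower = nearer = wins the occlusion test */
    /* ...geometry, texture info, etc... */
} face_t;

static int r_currentkey;    /* reset to 0 at the start of every frame */

/* called once per node, at the point in the walk where its faces are emitted */
void AssignNodeSortKeys(face_t **faces, int numfaces)
{
    int key = r_currentkey++;
    for (int i = 0; i < numfaces; i++)
        faces[i]->sortkey = key;
}
[/code]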
The third thing RecursiveWorldNode provides is a very nice way to perform back-face removal using the BSP, without testing every face. Again, hardware will do this for free, but for a software rasteriser there is a cost per individual face. Each BSP hyperplane can host many sub-faces, and each hyperplane is already tested in order to generate the sort keys and perform culling - the same test can be used to eliminate all faces opposing the active hyperplane side.
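In sketch form (SURF_PLANEBACK mirrors the Quake-style flag, the rest is made up for the example): each face records which side of its hyperplane it points towards, and the camera-side test already done for the traversal tells you which of the two orientations can possibly be front-facing.

[code]
/* Sketch: backface removal without a per-face dot product. The side test
   computed for the traversal gives the camera's side of the hyperplane;
   only faces oriented towards that side can be visible. */
#define SURF_PLANEBACK 2            /* face points along the plane's back side */

typedef struct {
    int flags;
    /* ...geometry... */
} nodeface_t;

void EmitFrontFacing(nodeface_t *faces, int numfaces, int camera_on_back_side)
{
    int wanted = camera_on_back_side ? SURF_PLANEBACK : 0;
    for (int i = 0; i < numfaces; i++) {
        if ((faces[i].flags & SURF_PLANEBACK) != wanted)
            continue;               /* faces away from the camera, skipped for free */
        /* ...emit faces[i] for drawing... */
    }
}
[/code]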
I built a further optimisation on top of that - by reorganising faces within each node into two groups (front, back) and 'blitting' the whole group instead of checking each face's orientation code. This provided a significant speedup on Falcon versus the original code I started with.
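Roughly, the regrouping looks like this (field names invented for the example): at load time each node's faces are partitioned into a contiguous front group followed by a contiguous back group, so at runtime the side test picks a whole range and it can be emitted in one block, with no per-face check at all.

[code]
/* Sketch: faces pre-partitioned per node into [front group | back group],
   so the traversal's side test selects a whole contiguous range at once.
   Field names are illustrative. */
typedef struct {
    int first_face;     /* index of the node's first face                 */
    int num_front;      /* count of faces oriented with the plane's front */
    int num_back;       /* count of faces oriented with the plane's back  */
} facegroups_t;

/* returns the contiguous range of potentially front-facing faces */
void NodeVisibleFaceRange(const facegroups_t *g, int camera_on_back_side,
                          int *first, int *count)
{
    if (camera_on_back_side) {
        *first = g->first_face + g->num_front;   /* back group  */
        *count = g->num_back;
    } else {
        *first = g->first_face;                  /* front group */
        *count = g->num_front;
    }
}
[/code]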
So for our machine, sidestepping the RecursiveWorldNode function creates additional correctness problems and loses some nice optimisations, which just pushes per-face costs everywhere else. This is mainly because RecursiveWorldNode was designed to help accelerate a software rasteriser with a high overhead per face, per vertex, and no ZBuffer. Those jobs are redundant on modern hardware, so there, sidestepping it is the better optimisation.
So that's why the Falcon version will keep RWN and I'll be trying to split it between CPU and DSP - it does too many useful things to give up.
