Quake 2 on Falcon030

dml · Post by **dml** » Thu Aug 07, 2014 8:44 pm

When it continued happening at 250 bytes, I started to look at it a bit more carefully.

I see what's happening now. It's not a simple loop like the rendering routines in BM where you could tell exactly what was going on and where the boundary was between fitting/not fitting in the cache (around 254 bytes).

While the whole routine does fit in the cache, it can't stop interrupts from coming in and emptying it from time to time. That doesn't register clearly for a simple loop because the cost is distributed quite evenly over all of the iterations. This routine, however, conditionally executes bits and pieces, some quite rarely. It's the rarely executed bits which have the most misses. That's the clue to what's happening.

So it's to be expected - I can ignore it.

dml · Post by **dml** » Thu Aug 07, 2014 10:02 pm

Got some profiling done and things look a bit better.

The functions in blue have had a lot of effort put into them - either DSP'd or optimized 68k, or significantly changed C to make things cache better.

The ones in red have not been dealt with yet or are known to be slow because they are interfacing with the DSP using a protocol that's based on how the old code works and not how it needs to work to be fast using DSP.

The grey ones are there just for debugging purposes and will go away.

Around 15% of total time is 'lost' to lots of small routines registering less than 1%. Those can take a lot of effort to round up - it took ages to do it for BadMood - and I won't be looking at these anytime soon.

Used cycles:
25.70% _R_AddFaceEdge_DSP56k
14.63% _R_ProcessTaggedEdges
11.41% _R_InsertFaceEdges.constprop.0
10.39% _R_RecursiveWorldNode_68k
5.89% _R_TransformAndProjectTaggedVert
5.83% _dbg_draw_edges
5.82% _dbg_undraw_edges
5.74% _DrawFaces
3.52% _R_RecoverLeadingEdges_DSP56k
3.11% _R_ProcessBSPVisitQueue.constpro
1.28% _R_BeginFace
0.97% 3010664 ROM_TOS

So the BSP routine (R_RecursiveWorldNode_68k) which composes the scene is taking 10-15% depending on the view. This is a big improvement from 20-30% before optimization but it's still up to 15% of 'quite slow' and it's probably going to need DSP'd later. This takes too much effort and concentration so it's being left for another time. Will work on the red bits next.

dml · Post by **dml** » Thu Aug 07, 2014 10:35 pm

Latest vid is clearly a bit quicker:

https://www.youtube.com/watch?v=KAk6QJT ... vpEobZzMEw

troed · Post by **troed** » Fri Aug 08, 2014 1:30 pm

(Just checking in to let you know that your updates and videos are eagerly followed)

Scarlettkitten · Post by **Scarlettkitten** » Fri Aug 08, 2014 1:35 pm

Agreed I love this project

dml · Post by **dml** » Sat Aug 09, 2014 12:04 am

Thanks for reading.

I think I figured out a way to chop away another 20% CPU time.

There are several steps which need to get done on the queues emitted from the BSP tree traversal, which comprises the visible geometry in the scene on each frame. The BSP tree traversal does an efficient job of performing an intersection-style set operation between the static PVS (what can be seen from the player's position, regardless of view direction) and the dynamic viewcone (what can be seen in the direction the camera is looking).
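As a rough illustration of that set intersection, here is a flat per-leaf sketch in C. The real traversal works hierarchically on BSP nodes; the types, names and the single-plane test below are hypothetical and not taken from the Q2 or PolyEngine sources.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical types for illustration only. */
typedef struct { float normal[3]; float dist; } plane_t;
typedef struct { float mins[3], maxs[3]; int cluster; } leaf_t;

/* Static test: is this cluster marked in the (decompressed) PVS bit vector? */
static bool pvs_contains(const uint8_t *pvs, int cluster)
{
    return cluster >= 0 && (pvs[cluster >> 3] & (1u << (cluster & 7))) != 0;
}

/* Dynamic test: is the box entirely behind the plane?
   Assumes plane normals point into the view volume. */
static bool box_outside_plane(const leaf_t *leaf, const plane_t *p)
{
    float d = -p->dist;
    /* Use the box corner that lies furthest along the plane normal. */
    for (int i = 0; i < 3; i++)
        d += p->normal[i] * (p->normal[i] >= 0.0f ? leaf->maxs[i] : leaf->mins[i]);
    return d < 0.0f;
}

/* A leaf is queued only if it passes both the PVS and the viewcone test. */
static bool leaf_visible(const leaf_t *leaf, const uint8_t *pvs,
                         const plane_t *frustum, int num_planes)
{
    if (!pvs_contains(pvs, leaf->cluster))
        return false;
    for (int i = 0; i < num_planes; i++)
        if (box_outside_plane(leaf, &frustum[i]))
            return false;
    return true;
}
```

The PVS test is checked first because it is a single bit lookup, so most of the map is rejected before any plane math is done.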

So the queue of work being emitted from the BSP every frame is a narrow subset of the whole map - only what's in front of the camera, and not behind some big obstacle - but it's still quite a lot of stuff.

This queue of stuff is made up of leaf meshes which break down into polygons, edges and vertices. These are linked to each other using either pointers or indices - both of them able to reach across the whole pool of RAM. Unfortunately this stuff needs to be sent to the DSP, and there's no room there to store all the geometry from the whole map, so the pointers and indices are useless. The DSP can't reach vertex #14205 if it only has room for 1000 vertices in the first place.

So one of the things done in this queue is reindexing the vertices and edges, linked off the visible faces. The reindexing compacts the links so the largest index is based on the number actually onscreen (instead of the number in the whole map, which is huge). This lets everything fit on the DSP for one frame.
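A minimal sketch of what such a compaction could look like, assuming a per-frame stamp table so nothing has to be cleared between frames. The pool sizes and all names are made up for illustration (the 1000 figure just echoes the example above); this is not the actual routine.

```c
#include <stdint.h>

#define MAX_MAP_VERTS   65536   /* assumed size of the whole-map vertex pool */
#define MAX_FRAME_VERTS 1000    /* assumed DSP-side capacity for one frame */

typedef struct {
    uint16_t local_of[MAX_MAP_VERTS];     /* map index -> compact index */
    uint32_t stamp[MAX_MAP_VERTS];        /* frame in which local_of[] was set */
    uint32_t global_of[MAX_FRAME_VERTS];  /* compact index -> map index */
    uint32_t frame;                       /* current frame stamp */
    int      count;                       /* vertices used this frame */
} reindex_t;

/* Must be called once before each frame; invalidates all old mappings. */
static void reindex_begin_frame(reindex_t *r)
{
    r->frame++;
    r->count = 0;
}

/* Returns the compact index for a map vertex, assigning a new one the
   first time it is seen this frame. Returns -1 if out of range or full. */
static int reindex_vertex(reindex_t *r, uint32_t global)
{
    if (global >= MAX_MAP_VERTS)
        return -1;
    if (r->stamp[global] == r->frame)
        return r->local_of[global];       /* shared vertex, already sent */
    if (r->count >= MAX_FRAME_VERTS)
        return -1;
    r->stamp[global] = r->frame;
    r->local_of[global] = (uint16_t)r->count;
    r->global_of[r->count] = global;
    return r->count++;
}
```

Calling `reindex_vertex` for vertex #14205 and then #77 would hand back 0 and 1, and asking for #14205 again returns 0 rather than allocating a new slot, which is where the sharing pays off. The table is large, so it wants to live in static storage rather than on the stack.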

The reason indexing is needed at all (rather than just send the geometry raw) is the amount of sharing going on - vertices are shared by edges where they connect, and edges are shared by polygons where they join. Sharing means far fewer vertices need processed in total and that's a significant cost on its own.

But this reindexing needs to be done on the whole queue every frame, and costs about 20% total time. This is pretty annoying. It might be possible later to get the DSP to reindex stuff as it is sent over but it involves sending over the 'real names' for the vertices when they get rotated/projected etc. so things tie up properly later, and the whole thing is pretty complicated. Trying to avoid.

However I noticed that I had set things up in such a way that these work queues coming out of the BSP tree are self-contained and always complete, even if the BSP traversal is paused. So deliberately stopping the BSP while walking around just leaves you with a room which doesn't extend through the corridor when you walk through the door. The geometry gets frozen and no new visible stuff gets added for the next frame. But the view still updates and you can still move the camera around because the queue is still being transformed and drawn. It's just not being edited.

Anyway I figured a neat way to cut the cost of the reindexing (and maybe some other stages), is to figure out if the BSP actually produced anything different between subsequent frames. Use a revision counter, hash or something to track pieces of the map which turned on since the previous frame, and force the reindexing stage to be refreshed only then.

This would mean waving the camera around causes occasional extra work as bits of map move into view, which would be every few frames (since most map pieces are quite big) instead of exactly every frame.

I haven't tried a full version of this yet - only a hack to pause reindexing - but I think it should work. It certainly can't cost more than it does now, and must cost less at least most of the time. It doesn't even need done during the BSP algorithm, it can be done as part of processing the top level queue of leaf meshes which get emitted, before it deals with faces.
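One way the change detection could be sketched, assuming the top level queue can be viewed as a list of 16-bit leaf ids. The hash is plain 32-bit FNV-1a; the tracker struct and function names are hypothetical. A hash can in principle collide and miss a change, so a real version might prefer an exact revision counter.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* 32-bit FNV-1a over the list of visible leaf ids, low byte first. */
static uint32_t hash_leaf_queue(const uint16_t *leaf_ids, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= (uint32_t)(leaf_ids[i] & 0xFFu);
        h *= 16777619u;
        h ^= (uint32_t)(leaf_ids[i] >> 8);
        h *= 16777619u;
    }
    return h;
}

/* Remembers what the previous frame's queue looked like. */
typedef struct {
    uint32_t last_hash;
    size_t   last_count;
    bool     valid;       /* false until the first frame has been seen */
} visset_tracker_t;

/* Returns true when the queue differs from the previous frame, i.e. when
   the reindexing pass has to be rerun. */
static bool visible_set_changed(visset_tracker_t *t,
                                const uint16_t *leaf_ids, size_t n)
{
    uint32_t h = hash_leaf_queue(leaf_ids, n);
    bool changed = !t->valid || n != t->last_count || h != t->last_hash;
    t->last_hash = h;
    t->last_count = n;
    t->valid = true;
    return changed;
}
```

The frame loop would then only call the reindexing stage when `visible_set_changed` returns true, and reuse the previous frame's compact indices otherwise.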

I'll need to do a test and see if it's going to work.

dml · Post by **dml** » Sat Aug 09, 2014 1:01 am

Well the test worked, but maybe not as well as hoped. The reindex pass dropped from 15-20% to around 7-10% while moving the camera around, or 5% while moving forward, or 0% if the camera remains still.

Since it was 'free' I won't complain, but 10% wobble isn't great either so I'll see if there's another way to get the peak down a bit later on.

AtariZoll · Post by **AtariZoll** » Sat Aug 09, 2014 11:18 am

Well, I don't want to be negative, but I'm surprised that Quake 2 on Falcon 30 project is started. Why - because it is much more demanding than Doom. Will not go in speed issues. What for me seems as impossible is that it will work on Falcon with 14 MB RAM. There are some complex maps, levels, which will eat more RAM, I'm sure. Any calculations, tests made in that direction ?

dml · Post by **dml** » Sat Aug 09, 2014 11:23 am

I had another idea this morning which has accelerated the vertex reindexing by 2x. The problem was that I'd spent too long focusing on literally what the code was doing before and not enough on what needed done. It was mainly working in terms of edges, tagging new edges as they are seen and reindexing vertices as they are seen. In fact it works better by reindexing the vertices independently, in winding order.

The vertices need reindexed early because this decides how many need sent to the DSP and that's a limiting factor on everything. However the edges don't need reindexed at that time. It's also not necessary to reindex both ends of each edge because the faces always have a winding order so it's only necessary to deal with the 1st vertex on each edge, providing the winding flag is respected (some edges are flipped for sharing purposes).

The Q2 engine uses signed edges to indicate their direction and this was getting in the way so I translated this into a flag bit which is easy to test and mask off / ignore, instead of having to conditionally negate it everywhere. The result is a vertex reindexing pass which takes under 10% CPU even without the previous optimization (I have turned that off for testing, treated as a last-resort speedup).
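A small sketch of the sign-to-flag translation, assuming the usual Quake-style convention where a negative surface edge means "use edge -n, reversed". Putting the flag in the top bit and the function names are my own choices for illustration, not the layout used in the project.

```c
#include <stdint.h>

#define EDGE_FLIP_BIT 0x80000000u   /* hypothetical flag position */

typedef struct { uint16_t v[2]; } edge_t;   /* start and end vertex */

/* Convert a signed edge reference into index + flip flag, done once at
   load time so the per-frame code never has to negate anything.
   Assumes the value is never INT32_MIN (edge numbers are far smaller). */
static uint32_t pack_surfedge(int32_t signed_edge)
{
    if (signed_edge < 0)
        return (uint32_t)(-signed_edge) | EDGE_FLIP_BIT;
    return (uint32_t)signed_edge;
}

/* Mask the flag off when only the edge itself matters. */
static uint32_t surfedge_index(uint32_t packed)
{
    return packed & ~EDGE_FLIP_BIT;
}

/* First vertex of the edge in the face's winding order: a flipped edge
   starts at v[1], an unflipped one at v[0]. */
static uint16_t surfedge_first_vertex(const edge_t *edges, uint32_t packed)
{
    const edge_t *e = &edges[packed & ~EDGE_FLIP_BIT];
    return e->v[packed >> 31];
}
```

Walking a face's edge list and taking only `surfedge_first_vertex` for each entry visits every vertex of the face once, in winding order, which is all the vertex reindexing pass needs.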

This time the 'optimized' code is green, the debug stuff is grey. I have marked the top two in a different colour because they are partly there for debugging/test-drawing purposes, and I also realized that most of this can now be folded into the vertex reindexing pass providing the DSP is just asked to buffer the vertex indices until later. I'm not sure what works best yet but depending on how fast the DSP can process the edges it might be better to fold all the CPU stuff into one small loop and make the DSP work in big blocks while the CPU does something else, rather than have the reindexing and transmitting parts separate and try to overlap stuff at a more fine grain.

The grey items will disappear, later to be replaced with filling costs.

It is for certain though that a big chunk of cost for the top two will go away when it is implemented properly.

Used cycles:
27.77% _dbg_R_AddFaceEdge_DSP56k
15.90% _dbg_R_InsertFaceEdges
11.95% _R_RecursiveWorldNode_68k
9.61% _R_ReindexFaceVertices_68k
6.94% _dbg_undraw_edges
6.93% _dbg_draw_edges
6.76% _R_XFormProjectTaggedVertices_DSP56k
4.01% _dbg_R_RecoverLeadingEdges_DSP56k
3.70% _R_ProcessBSPVisitQueue

AdamK · Post by **AdamK** » Sat Aug 09, 2014 11:35 am

Great work

One question: how complete is the engine in your youtube vids? Is that full game logic, or just video engine?

Btw. please add fps counter to any future videos if possible

dml · Post by **dml** » Sat Aug 09, 2014 11:35 am

AtariZoll wrote:Well, I don't want to be negative, but I'm surprised that Quake 2 on Falcon 30 project is started. Why - because it is much more demanding than Doom. Will not go in speed issues. What for me seems as impossible is that it will work on Falcon with 14 MB RAM. There are some complex maps, levels, which will eat more RAM, I'm sure.

In fact the aim is not to port the Quake 2 game to Falcon030. I mentioned this a bit earlier. The aim is to experiment with the Quake 2 map rendering technology, to see if the principles can be translated in a way that makes it usable for stuff.

If that results in 'stuff' being turned into a Quake 2 game (or a cut down 2-player thing) as a result then I don't mind at all.

But that's a bit far fetched at the moment. I'll re-evaluate that when the experiments are done.

AtariZoll wrote: Any calculations, tests made in that direction ?

It is itself an experiment, a test and an ongoing series of calculations. But yes, of course. I never begin things I haven't estimated first.

There are lots of compromises built into that, but as I said - the aim isn't currently to port the game but to focus on rendering for now. There will of course be maps that are just too big, or too slow to run on F030 no matter what I do but that doesn't reduce my interest in being able to draw the maps.

There are also better 68k (and probably even DSP) coders out there in Atari land. If I get this working and release the source then somebody might be able to make a faster version and something else interesting out of it.

It's also true that I try to pick projects that are too difficult for me at the start, so I have to learn to catch up.

dml · Post by **dml** » Sat Aug 09, 2014 11:52 am

AdamK wrote:how complete is the engine in your youtube vids? Is that full game logic, or just video engine?
Btw. please add fps counter to any future videos if possible

Thanks!

That's a good question with a complicated answer.

I started this project back in late 2012 or maybe early '13 using the official Q2 sources. I quickly ran into grief just getting a direct port to boot with the memory requirements especially since I'm mainly using Hatari and cross-tools for convenience (I don't always get time/space to work with a real machine). Also the files are big to transfer/update on each build which makes matters worse. This was bad enough with Doom, but Q2 is much bigger.

Since Hatari has no TT-ram support, it meant doing lots of bad things to the Q2 source to even get it loading up the menus. The startup time was awesome as well. So I got fed up with that quickly.

I was also conversing with NovaCoder on the Amiga060 port, and we exchanged sources (really, he sent me the 060 sources he was working on for Q2, while I sent him my old 040/assembler version of Q1 from '97 or so - not that I figure it's of any use these days but it did at least contain original code not transferred from anywhere else!). So I now have the original ID code, and the Amiga code for reference. Really the Amiga one attacks the drawing and span primitives but is essentially the original ID code in other respects.

Since this causes me the same problems for turnaround/development I started looking for other starting points. Turns out there are several Q2 engine derivatives around including the unfinished PolyEngine (by Alexey Goloshubin). While these are not exactly the same as ID's version they have a massive overlap, and some of the comments even match.

The main benefit is getting a map onscreen with much less demand for ram.

So what I have now is a hybrid soup mixture of PolyEngine (data loading, collision detection, player movement) and ID's sources (BSP, geometry, rasterization) and Atari stuff (BadMood).

This is actually quite difficult to cope with because the Q2 and PolyEngine data types are very similar but not always exactly the same. So I have to constantly check type sizes when I pull more Q2 code over. Moving the BSP code (which was almost completely absent from PolyEngine - it used to just scan over a linear list of map clusters) was the hardest part so far because it needs the original tree structure which is not properly loaded by PolyEngine.

So what you see in the videos is basically some core functionality.

- map loading
- game object/entity management
- collision detection
- movable brushes (doors etc. open and close but can't be seen yet in my version)
- other stuff not yet working, like networking

So it's far from the full game, but it loads and runs and the main game functions appear to be cheap. So it's a candidate.

AtariZoll · Post by **AtariZoll** » Sat Aug 09, 2014 11:54 am

Dml, thanx for your detailed answer. Good luck with this project

Omikronman · Post by **Omikronman** » Sat Aug 09, 2014 1:14 pm

I really love to see the progress of your work in the youtube videos.

Zamuel_a · Post by **Zamuel_a** » Sat Aug 09, 2014 4:26 pm

Do you know how Quake 2 on the Nintendo 64 is made? You don't have much memory for anything on that machine so it must be very tight. The N64 version is very different from the PC version. The basic graphics are the same but the levels are totally different.

dml · Post by **dml** » Sat Aug 09, 2014 4:46 pm

Zamuel_a wrote:Do you know how Quake 2 on the Nintendo 64 is made? You don't have much memory for anything on that machine so it must be very tight. The N64 version is very different from the PC version. The basic graphics are the same but the levels are totally different.

No I haven't looked at it. If the map data format is close it might be interesting. Certainly the maps will be less complicated.

The PC version is mainly useful because I can actually run it here - I can alternate between Q2 in VStudio with breakpoints set, and the hybrid engine in another VStudio, and the hybrid engine under GCC for Atari. So I have a lot of visibility of what's going on inside.

That's probably a good bit more difficult with the N64 version, although it's probably useful to read through it and look at the data structures and counts for things to compare.

[EDIT]

Almost forgot - the PC version had software rasterization as a min spec. This means a lot of stuff is organised to work without needing a ZBuffer. The area I'll be focusing on is related to that. The N64 has a ZBuffer, so aside from the alphablending headache, all it has to do is precalculate UV coordinates, update lights and draw a list of faces. That's a gross oversimplification, but hardware does require less complexity than a software version.

There is a Zbuffer in the software renderer but it's not used for the world - it's used for drawing entities against the world. i.e. the world polygons only write to the zbuffer. The entities read/write it. I expect N64 will read/write it in all cases unless they tried hard to retain the ordering stuff from the software side (unlikely).
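To make the write-only versus read/write distinction concrete, here is a simplified sketch. It assumes a 16-bit depth value where larger means nearer, and uses a constant depth per span to keep it short (a real span would step depth per pixel). The function names and layout are illustrative, not the actual renderer code.

```c
#include <stdint.h>

/* World polygons arrive already ordered, so their spans never test the
   zbuffer. They only record depth for the entities drawn later. */
static void world_span_write_z(uint16_t *zrow, int x0, int count,
                               uint16_t depth)
{
    for (int i = 0; i < count; i++)
        zrow[x0 + i] = depth;
}

/* Entity pixels read the stored depth, compare, and only then write both
   colour and depth. Returns 1 if the pixel was drawn, 0 if it was hidden. */
static int entity_pixel(uint8_t *crow, uint16_t *zrow, int x,
                        uint16_t depth, uint8_t colour)
{
    if (depth < zrow[x])
        return 0;          /* something nearer is already there */
    zrow[x] = depth;
    crow[x] = colour;
    return 1;
}
```

The world path has no branch in its inner loop at all, which is the part that matters on a machine where every per-pixel test is expensive.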

Zamuel_a · Post by **Zamuel_a** » Sat Aug 09, 2014 5:51 pm

I guess it's difficult to get the source for the N64 version. Would be nice if the level data could be used, if it's simpler.
I also liked DOOM 64 more than the PC version. It was more advanced at the time so it looked better.

dml · Post by **dml** » Sat Aug 09, 2014 6:26 pm

Zamuel_a wrote:I guess it's difficult to get the source for the N64 version. Would be nice if the level data could be used, if it's simpler.
I also liked DOOM 64 more than the PC version. It was more advanced at the time so it looked better.

Another issue with the console engines is availability of tools - the data isn't always in the same format so it's hard to make levels for them. The JagDoom WAD used a different format for some of the lumps so the only one I could test with BadMood is one which had been translated into PC format. I don't even remember how it was translated - maybe by hand (yikes!). At least the JagDoom source is available - but it's a near total rewrite and a bit strange in places (big chunks just disappear off into DSP land with just a vague symbol in the C code - and the result is not very readable at all).

I think the biggest problem I expect to see with N64 QII is the fact it's aimed at hardware, and probably got the same treatment as JagDoom - i.e. it's interesting reference but can't really work on anything but an N64 (or Jaguar). I don't know this for sure - but they probably had to do a lot of N64-specific things to get it to work. I remember the N64 being a lot cleaner to code than the PS2 but all of the consoles from that era were basket cases

dml · Post by **dml** » Sat Aug 09, 2014 8:04 pm

I made a bit more progress today, although I've been busy with other stuff so there are plenty of loose ends needing fixed.

The fact that (unmodified) _dbg_draw_edges has jumped from 7% to nearly 10% suggests a generic 40% speedup since the last test. This was achieved by writing a decent version of the code which sends the faces to the DSP. It's done in winding order now, 2 transfers per edge instead of 5+ and sent in compact groups. Around 5% of total is spent blocking on the DSP because it's not overlapping work properly with the CPU, and with the BSP edge clipping flags not available yet the DSP is doing too much work anyway clipping every edge against 5 planes.
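The arithmetic behind that estimate can be spelled out as below. It assumes the untouched routine costs the same absolute time in both runs, so any change in its percentage comes purely from the total shrinking. The 6.93% and 9.53% figures are the `_dbg_draw_edges` entries from the previous and current profiles; the function names are just for illustration.

```c
/* If a routine's absolute cost is unchanged while its share of the
   profile rises from old_share to new_share, the total frame cost has
   been scaled by old_share / new_share. */
static double total_cost_ratio(double old_share, double new_share)
{
    return old_share / new_share;
}

/* The same change expressed as a throughput factor (frames per unit time). */
static double implied_speedup(double old_share, double new_share)
{
    return new_share / old_share;
}
```

With 6.93 and 9.53 this gives a cost ratio of about 0.73 and a speedup factor of about 1.375, so "40%" reads as roughly 37% more frames in the same time, or about 27% less time per frame.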

The test is running reasonably fast now in simple rooms and approaching sensible in other places. It's possible to go outside now, which was way too slow before and I didn't bother to include much of that in the earlier vids. It's only recovering leading edges (rightmost edges of polys) for drawing at the moment, although all edges are processed. This is to stop the CPU getting bogged down drawing lines behind walls which can get quite bad - and which won't happen in future versions.

I coloured stuff in green which has been rewritten or introduced by me. The other stuff is part of Q2/PolyEngine.

Used cycles:
23.64% _R_SubmitFaceGeometry_DSP56k
17.21% _R_RecursiveWorldNode_68k
10.06% _R_XFormProjectTaggedVertices_DS
9.54% _dbg_undraw_edges
9.53% _dbg_draw_edges
8.88% _R_ReindexFaceVertices_68k
5.92% _R_RecoverLeadingEdges_DSP56k
5.54% _R_ProcessBSPVisitQueue
0.96% _Clip_CheckBrushes
0.76% _check_bsp_model
0.76% _check_node_r
0.75% _BSP_FindLeaf
0.70% _check_boundbox
0.51% _BSP_MakeVis
0.49% _Clip_Begin
0.45% _Server_MakeFrame

The next areas to attack will be:

- DSP version of BSP plane/BBox math, to try to max that routine out. it's still too slow in places despite executing from the i-cache, and it's often 1st or 2nd at the top of the profile results
- change the geometry pipeline again, this time to batch geometry in blocks of 256 or so faces at a time, to stop the DSP ram overflowing (which causes random lines to appear in some places on the map)
- use the batched geometry approach to try to overlap DSP edge processing with vertex reindexing (or whatever else can be overlapped) on the CPU, and to get rid of the synchronous blocking between CPU/DSP happening at the moment
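The batching and overlap in the last two points could be ordered roughly as in the sketch below. The three stage functions are stand-ins that only log a letter (P = CPU prepares a batch, K = DSP is kicked off on it, W = CPU waits for the DSP result), since the real host-port code isn't shown here. Everything apart from the 256-face block size is hypothetical.

```c
#include <stddef.h>

#define BATCH_FACES 256   /* block size suggested in the list above */

/* Tiny event log so the ordering of the stages can be inspected. */
typedef struct {
    char   log[64];
    size_t len;
} trace_t;

static void trace_put(trace_t *t, char c)
{
    if (t->len + 1 < sizeof t->log) {
        t->log[t->len++] = c;
        t->log[t->len] = '\0';
    }
}

/* Stand-ins for the real work: reindex/pack on the CPU, start the DSP,
   and block until the DSP has finished a batch. */
static void cpu_prepare_batch(trace_t *t, size_t batch) { (void)batch; trace_put(t, 'P'); }
static void dsp_kick_batch(trace_t *t, size_t batch)    { (void)batch; trace_put(t, 'K'); }
static void dsp_wait_batch(trace_t *t, size_t batch)    { (void)batch; trace_put(t, 'W'); }

/* Process a frame's faces in blocks, preparing batch k+1 on the CPU
   while the DSP is still busy with batch k. Returns the batch count. */
static size_t run_batched_frame(trace_t *t, size_t total_faces)
{
    size_t batches = (total_faces + BATCH_FACES - 1) / BATCH_FACES;
    if (batches == 0)
        return 0;
    cpu_prepare_batch(t, 0);
    for (size_t k = 0; k < batches; k++) {
        dsp_kick_batch(t, k);
        if (k + 1 < batches)
            cpu_prepare_batch(t, k + 1);   /* overlaps with DSP work */
        dsp_wait_batch(t, k);
    }
    return batches;
}
```

For 600 faces this makes 3 batches and logs `PKPWKPWKW`: every wait except the last comes after the CPU has already done useful work. Preparing the next batch while the DSP still reads the current one would need two buffers on the CPU side.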

Once those things are done, I think it will be as close to a workable design on F030 as I'm likely to get. Any other decisions on what to do with it will be based on performance after that. Optimization might claim another 5-10% but I think not much more.

dml · Post by **dml** » Sat Aug 09, 2014 8:23 pm

Will post a new vid soon showing 'outside' but need a beer first

dml · Post by **dml** » Sat Aug 09, 2014 9:15 pm

Here, last version: https://www.youtube.com/watch?v=VLizCAk ... UM&index=6

DarkLord · Post by **DarkLord** » Sat Aug 09, 2014 9:41 pm

dml wrote:Will post a new vid soon showing 'outside' but need a beer first

Hear, hear!

dml · Post by **dml** » Sun Aug 10, 2014 12:26 am

New video:

https://www.youtube.com/watch?v=zGS9L4q ... UM&index=7

Added:

- mouselook control for viewing the scenery
- framerate independence (so quakeperson doesn't run faster when the framerate goes up etc)
- ...and a rare Falcon bonus: 640x400 highres mode. ok it's a bit slower and counterproductive for optimizations but so what, lines look nicer that way.
- it's drawing all of the edges now, instead of just the right edge of each face. in fact it will draw some edges twice until edge caching is working fully. I changed this mainly to get a better view of the map. The DSP is doing the same amount of work but the CPU has to extract and draw more.

There's an area in the last room (in fact, the primary startpoint in singleplayer Q2 beside the crashed shuttle/pod and wrecked wall) which was so slow it virtually stopped the original version. It's still not quick but it is at least 'realtime' now especially with framerate independence working. It's basically a soup of lines which seems to contain half the map.
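Framerate independence generally comes down to scaling movement by the time the last frame actually took. A minimal sketch, with a made-up player struct and velocities in units per second (not the actual movement code):

```c
/* Hypothetical player state for illustration; vel is in units per second. */
typedef struct {
    float pos[3];
    float vel[3];
} player_t;

/* Advance by the measured frame time instead of a fixed step per frame,
   so distance covered per second doesn't depend on the framerate. */
static void player_advance(player_t *p, float dt_seconds)
{
    for (int i = 0; i < 3; i++)
        p->pos[i] += p->vel[i] * dt_seconds;
}
```

One slow frame of 0.1s and two quick frames of 0.05s then move the player the same distance, apart from floating point rounding.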

That's probably it for a while on this project. Will be busy for at least a week and not much time for this stuff. Will return to it later.

AdamK · Post by **AdamK** » Sun Aug 10, 2014 10:04 am

This is awesome

Btw. Did your Q1 port for 040 use DSP?

dml · Post by **dml** » Sun Aug 10, 2014 11:56 am

AdamK wrote: Btw. Did your Q1 port for 040 use DSP?

I started on a DSP module for it aimed at 030 - there were 2 or 3 experiments which could be enabled in the build to test that - but didn't put enough effort into it and was put off by the amount of floating point stuff I'd have to understand and convert. So it really only worked properly on a fast 040.

IIRC I had replaced the primitive drawing routines and the span processing - both of which already had asm implementations on x86. I may have done some other bits but it was so slow to turn builds around on a native setup under mint (and that version of gcc silently bombing out on me) so I lost patience with it after the main 040 parts were done.

The DSP experiments were not focused so much on texture mapping at pixel level but rather on texture coordinate subdivision, so it would use spanlets only just small enough to achieve perspective correction for a given delta-z. The plan was to reduce the number of spanlets in the scene. The idea worked, but it wobbled like hell because the subdivision points moved around depending on the view - they were not static, every 16 pixels as they were in the original, and the eye picks it up on the switching effect. oh well

Atari-Forum
