Quake 2 on Falcon030

christos · Post by **christos** » Tue Oct 28, 2014 10:18 pm

Only amiga makes it (and some other people) possible!!!!

dml · Post by **dml** » Thu Oct 30, 2014 10:22 am

christos wrote:Only amiga makes it (and some other people) possible!!!!

So it seems.

Last night I fixed another bug which affected colour estimation, so the pipes and stuff look correct now and are lit properly, instead of day-glow.

before:

grab0018.png

after:

grab0024.png

I quickly reviewed some changes that would be necessary to make dynamic object drawing work. The game objects are in fact already there, colliding and moving (mostly, spinning on the spot) and the doors open and close, but I can't actually draw the meshes until I rewrite the BSP traversal again and add new routines to manage objects through the BSP - I don't think this is present at all in the original code because it's not needed. It doesn't make sense to do that immediately, but I will roll it into the BSP rework which is due next.

So I will either start on textures next, or the BSP part. (Both will probably take some time and I only have a small slice of free time to play with it in the evening or over lunch).

calimero · Post by **calimero** » Thu Oct 30, 2014 10:56 am

I am looking at the crazy polygon counts in Quake 2 maps and comparing them to the number of polygons in the best demos (Eko / Exa demos have the scenes with the most flat polygons).
It is not a fair comparison since demos are a "controlled" environment

but the number of polygons in Quake is insane!

if you try to compare Doug's Quake to the best ST 3D games... there are like 10 times more polygons than in any ST game.
it is amazing how much you manage to pull out of the F030!

dml · Post by **dml** » Thu Oct 30, 2014 11:24 am

calimero wrote: I am looking at the crazy polygon counts in Quake 2 maps and comparing them to the number of polygons in the best demos (Eko / Exa demos have the scenes with the most flat polygons). It is not a fair comparison since demos are a "controlled" environment but the number of polygons in Quake is insane!

Some of the Q2 maps are extremely dense, yes. This is partly because textures never truly 'repeat', so flat surfaces must be broken into unique tiles, creating many more vertices than would be needed for a flat-filled version. This is why the floors of big rooms seem to be made of randomly sized tiles - it is necessary in order to texture each 'world pixel' uniquely. Unique texturing is required for unique pixel lighting (lightmaps). So it's really the Quake lighting that forces the polycount to be much higher than the 'geometric surface' count - which is already fairly high to give the game a decent look.

I think this was somehow sidestepped in the PSX port, but I'm not sure of the details. They did a very, very good job of that port, but a significant portion of the savings came from changing the content (e.g. the maps) to better suit the HW polygon engine in that box. They didn't need to retain anything that was designed with software rast in mind.

The Q2 tech is very close to Q1, with some algorithm optimizations, added capabilities and other improvements; the approach, though, is the same. The Q2 maps are however more complex than Q1 because the machine spec went up quite a lot in the years between releases.

One of the 'other' reasons I picked Q2 is simply that Q1 maps would also potentially be usable with smallish changes, and would probably run faster since they were aimed at slower processors.

calimero wrote: if you try to compare Doug's Quake to the best ST 3D games... there are like 10 times more polygons than in any ST game.
it is amazing how much you manage to pull out of the F030!

To be fair, the ST didn't have a DSP to help parallel work and this makes a big difference on the Falcon - but still I am doing my best to squeeze everything out of both chips and in this case the DSP is struggling, needing a lot of hand optimization in many places.

It is very difficult to approach performance of a processor at least 2 generations on from the 030, and probably 10x the clock rate. I am not finding it easy at all.

But I don't give up easily so long as it is interesting, and there are still things waiting to be done to help the speed.

dml · Post by **dml** » Thu Oct 30, 2014 2:13 pm

Posted a quick demo of the latest code over lunch - mainly speed improvements and fixes:

https://www.youtube.com/watch?v=Tp965ZL9Uvs

EvilFranky · Post by **EvilFranky** » Thu Oct 30, 2014 2:15 pm

Great!

You can see a clear performance jump since the last video.

dml · Post by **dml** » Thu Oct 30, 2014 2:21 pm

Sky colour was also fixed, plus some fixes for lighting etc. This screenie was taken before the vid.

grab0025.png

FPS in the corner (for some reason 'fps' is not printing properly in this engine, will fix another time).

AdamK · Post by **AdamK** » Thu Oct 30, 2014 4:32 pm

It looks like the screen counter now counts FPS and not VBLs/frame, am I right?

dml · Post by **dml** » Thu Oct 30, 2014 4:34 pm

AdamK wrote: It looks like the screen counter now counts FPS and not VBLs/frame, am I right?

Yes I changed it to show FPS properly for one of the earlier vids - was a build mis-configuration issue. For some reason the 'FPS' text has disappeared but it will get fixed.

[EDIT]

While uploading the vid I had also collected some updated profiling info which looks quite promising - the optimizations are really making an impact in the right areas, so a lot of the time is now being spent in the remaining code - stuff which still needs to be rewritten or was otherwise expected to move up the list again. That means there's definitely room to make it faster.

In one of the slowest areas of a particularly bad map (under 4fps), converting the scene from polygons into spans takes only 11% now - that's about 1.5 VBLs.

For the same scene, filling takes only about 20% - around 3 VBLs.

So it is really input-bound now - the output side is fast enough. The remaining C code is also beginning to matter again.

dml · Post by **dml** » Fri Oct 31, 2014 9:02 am

calimero wrote: It is not a fair comparison since demos are a "controlled" environment but the number of polygons in Quake is insane!

BTW I meant to add, on the subject of comparisons with games, demos...

Demos which pre-calculate the view (no keyboard/mouse input possible - always run the same sequence) can obviously trade memory and setup time for faster drawing. Not all do that, but the fastest 3D demos on ST/Amiga used it a lot. Taken to the extreme, there is not really any 3D going on at all - more like a clever movie decoder using spans or specially coded 2D primitives.

For the demos which really perform realtime 3D work, they can still pre-calculate which content is visible at which points in the sequence - no need to gather content dynamically. In this case it's a 'movie of objects' but with realtime output stages to render the view from 3D data.

And for those which do everything dynamically (on Falcon - I believe many do because the DSP is fast enough not to be the limiting factor for average sized objects), they can still store geometry, textures on the DSP from frame to frame and eliminate the 'input side' costs. So there's lots of room for demos to do clever stuff which saves lots of time but will never work in a game environment.

Demos do some really clever (sometimes, just brilliant) things to obscure the true amount of work needing done to solve a given problem.

Unfortunately I can't really do any of that stuff here - not even caching of geometry. All 32kwords of DSP space is consumed by buffering of input/output geometry, which must be fed from main RAM continuously due to the size of it. Trying to use DSP ram for frame-to-frame caching just kills the parts which need fed big data - for scenery at least. There might be opportunities later to cache small game objects / pickups drawn several times in a row, but I expect the sorting/sequencing issue versus the map polygons will interfere with that for other reasons.

So I don't have much room to move with demo tricks, except to make the system itself as fast and efficient as possible. Any tricks I know can only be used to accelerate details within stages but the stages must still ultimately do what Quake did...

So I expect that this system will not (can not!) be as fast as the best demos on Falcon, on a per-polygon or per-pixel basis... but it should hold up very well against anything which attempts to run a full game engine. Especially anything that tries to run a very high poly count... I will make sure of that.

Zamuel_a · Post by **Zamuel_a** » Fri Oct 31, 2014 9:28 pm

Q1/Q2 use a write-only zbuffer mode for the scenery, and a read/write zbuffer mode for objects.

How did they manage to get the texture routines so fast? Even if only writing to the zbuffer, it feels like it should take a lot of time and not be so fast. I remember when I tried to make a Quake type engine back in the 90s and used most of the "tricks" I could find regarding texture mapping. All written in ASM: subdivision, doing the div with the FPU while the CPU does other stuff in parallel, even using the MMX registers to get more registers and so on, but I was never able to get close to the FPS rate of Quake and I didn't use a Z buffer at all.

calimero · Post by **calimero** » Sat Nov 01, 2014 12:00 am

dml wrote:Demos which pre-calculate the view (no keyboard/mouse input possible - always run the same sequence) can obviously trade memory and setup time for faster drawing.

and yet you manage to put the most polygons onscreen with a really good framerate!
really astonishing! something unseen on F030!

how do you fill flat polygons?
there was a lot of talk about the F030 blitter here on AF... do you use the blitter for filling polygons in this engine?

dml · Post by **dml** » Sat Nov 01, 2014 9:53 am

Zamuel_a wrote: How did they manage to get the texture routines so fast? Even if only writing to the zbuffer, it feels like it should take a lot of time and not be so fast. I remember when I tried to make a Quake type engine back in the 90s and used most of the "tricks" I could find regarding texture mapping. All written in ASM: subdivision, doing the div with the FPU while the CPU does other stuff in parallel, even using the MMX registers to get more registers and so on, but I was never able to get close to the FPS rate of Quake and I didn't use a Z buffer at all.

Are you sure it was a fillrate problem? Writing the pixels on PC wasn't a major overhead iirc, especially if to the right kind of memory and with textures on page/cache boundaries etc. There are lots of other things which could hurt speed if rendering the same types of scenes.

It is much harder to compare performance of un-alike scenes.

For example - Quake generates a lot of tall, thin or near-degenerate polygons as a result of overdraw management, so optimizing for fillrate isn't necessarily the best option. You also need a fast path for ultrashort spans and small faces.

So if you do crazy things to the polygon filler to maximize the divide scheduling, you can actually make this worse for tall thin polys without realizing.

There are many other things involved which can cause performance problems - overdraw itself, clipping performance (early rejection of scene content) and the grain of the stuff being drawn (face-at-a-time, or somehow optimized e.g. shared edges, triangle strips etc.).

It's hard to compare two different engines but if you did all the same things in the same way it might be possible to find the answer by just looking at their code?

Zamuel_a · Post by **Zamuel_a** » Sat Nov 01, 2014 10:04 am

dml wrote:
Zamuel_a wrote: How did they manage to get the texture routines so fast? Even if only writing to the zbuffer, it feels like it should take a lot of time and not be so fast. I remember when I tried to make a Quake type engine back in the 90s and used most of the "tricks" I could find regarding texture mapping. All written in ASM: subdivision, doing the div with the FPU while the CPU does other stuff in parallel, even using the MMX registers to get more registers and so on, but I was never able to get close to the FPS rate of Quake and I didn't use a Z buffer at all.
Are you sure it was a fillrate problem? Writing the pixels on PC wasn't a major overhead iirc, especially if to the right kind of memory and with textures on page/cache boundaries etc. There are lots of other things which could hurt speed if rendering the same types of scenes.

It is much harder to compare performance of un-alike scenes.

For example - Quake generates a lot of tall, thin or near-degenerate polygons as a result of overdraw management, so optimizing for fillrate isn't necessarily the best option. You also need a fast path for ultrashort spans and small faces.

So if you do crazy things to the polygon filler to maximize the divide scheduling, you can actually make this worse for tall thin polys without realizing.

There are many other things involved which can cause performance problems - overdraw itself, clipping performance (early rejection of scene content) and the grain of the stuff being drawn (face-at-a-time, or somehow optimized e.g. shared edges, triangle strips etc.).

It's hard to compare two different engines but if you did all the same things in the same way it might be possible to find the answer by just looking at their code?

Difficult to say what the biggest issue was. I was also using light maps, like Quake did, and I never got it up to speed. Not so long after this I got a 3dfx card and could do everything in hardware, so I never tried to solve the software problems. But it was hard to keep up with the commercial engines, and doing a real game got harder since people soon wanted something more than "simple" Quake graphics, so I never finished it.
That's why I like to program for the Atari: the hardware is fixed, so even if a project takes time, at least the target hardware hasn't changed.

dml · Post by **dml** » Sat Nov 01, 2014 2:39 pm

calimero wrote: and yet you manage to put the most polygons onscreen with a really good framerate!

Thanks! Although keep in mind most of the techniques were originally devised by others, not me.

But I have a good feel for (and experience with) which techniques are 'right' for the target and which to discard, and how to make it translate to the Atari and still be effective. And sometimes I put something new or different in there which is more appropriate for this machine (like reworking the BSP technique or reorganizing data). Still, a lot of the method is based on reinterpretation of others' work (and of course the whole no-floats and low-MHz thing is one of the main challenges here!).

calimero wrote: how do you fill flat polygons?
there was a lot of talk about the F030 blitter here on AF... do you use the blitter for filling polygons in this engine?

You can toggle between CPU / Blitter with the 'c'/'b' keys but it doesn't make a big difference tbh. The fill rate is less important than the amortization of edge overheads in this engine. The only reason the blitter can be faster in some cases is that there is a tiny bit of parallelism with the CPU cache.

dml · Post by **dml** » Sat Nov 01, 2014 2:50 pm

I did a quick experiment to pack the vertices again (they were originally 16.16 fixedpoint, then 16.8 for DSP).

The Quake world is apparently -4096->4095 so that's a relative range of 8192 from the camera origin - so the verts can be packed to 13.3 fixedpoint and retain just enough precision for scenery detail. Ingame objects can use higher precision if needed since it's just a storage/transfer format - the 3D operations remain the same.

This doesn't represent a huge saving in time but does halve the current storage cost and moves vertex transfer back down the profiling list.
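A minimal sketch of that packing step, assuming the source coordinates are 16.16 values held in an `int32_t` and the packed form is 13.3 in an `int16_t` (13 signed integer bits cover -4096..4095, 3 fraction bits give 1/8 unit). The names `pack_13_3` and `unpack_13_3` are made up for illustration and are not the engine's own routines.

```c
#include <stdint.h>

/* Assumed formats: 16.16 fixed point in, 13.3 fixed point out. */
#define SRC_FRAC_BITS    16
#define PACKED_FRAC_BITS 3
#define PACK_SHIFT       (SRC_FRAC_BITS - PACKED_FRAC_BITS)   /* 13 */

/* Pack one 16.16 coordinate to 13.3: round to the nearest 1/8 unit,
   then clamp so out-of-range input cannot wrap around.
   Relies on >> of a negative value being an arithmetic shift, which is
   implementation-defined in C but true on the usual compilers. */
static int16_t pack_13_3(int32_t v16_16)
{
    int64_t rounded = ((int64_t)v16_16 + (1 << (PACK_SHIFT - 1))) >> PACK_SHIFT;
    if (rounded > INT16_MAX) rounded = INT16_MAX;
    if (rounded < INT16_MIN) rounded = INT16_MIN;
    return (int16_t)rounded;
}

/* Expand a packed coordinate back to 16.16 for the 3D maths. */
static int32_t unpack_13_3(int16_t packed)
{
    return (int32_t)packed * (1 << PACK_SHIFT);
}
```

Only the storage/transfer width changes here: a vertex goes from three 32-bit values to three 16-bit values, and the transform code still sees 16.16 after unpacking.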

[EDIT]

13.3 will probably be a temporary measure too since 12.4 is possible once the camera origin is moved to the DSP side - it's still on CPU side because I didn't check the world extents when I first wrote the code.

Zamuel_a · Post by **Zamuel_a** » Sat Nov 01, 2014 2:56 pm

dml wrote: You can toggle between CPU / Blitter with the 'c'/'b' keys but it doesn't make a big difference tbh. The fill rate is less important than the amortization of edge overheads in this engine. The only reason the blitter can be faster in some cases is that there is a tiny bit of parallelism with the CPU cache.

How do you use the blitter for filling? There was a thread here a couple of years ago about using the blitter to fill polygons on an STE, but the conclusion was that there is not any good way of doing it.
I guess you can copy data with a solid color in it to use as a span, but that's a bit slow. But as far as I know, there isn't any way to define a fill color to use without copy? Or do you do it bitplane by bitplane and use the "all set to zero or one" operation?

One problem with your DOOM and Quake projects is that some of the FPS games that were made for the Falcon, which I once thought were nice, are not impressive to look at anymore.

EvilFranky · Post by **EvilFranky** » Sat Nov 01, 2014 2:57 pm

Maybe something to do with running the game in a chunky mode?

dml · Post by **dml** » Sat Nov 01, 2014 3:04 pm

Zamuel_a wrote: How do you use the blitter for filling? There was a thread here a couple of years ago about using the blitter to fill polygons on an STE, but the conclusion was that there is not any good way of doing it.

I used to do this on STE by code-generated blitter initializers, with a lookup based on xo:xw, so the span setup time was very, very low. Most of the table entries mapped to a single routine for longer spans, but short spans had special treatment. This was really fast IMO. However there are so many ways to do things - I don't claim to know what the best way is on STE.

Zamuel_a wrote: I guess you can copy data with a solid color in it to use as a span, but that's a bit slow. But as far as I know, there isn't any way to define a fill color to use without copy? Or do you do it bitplane by bitplane and use the "all set to zero or one" operation?

For fills, halftone is your friend. And for bitplane mode, endmasks are your friend.

For this case where edgecount matters, truecolour has a flatter cost than bitplane mode, because no matter what you do to accelerate bitplanes you can't avoid writing the same words multiple times where edges meet. So if you know you have a lot of computation overheads in the engine, you might as well pick the higher (but flat) cost, over the potentially lower (but rapidly rising) cost...

Zamuel_a wrote: One problem with your DOOM and Quake projects is that some of the FPS games that were made for the Falcon, which I once thought were nice, are not impressive to look at anymore.

Somebody is always ready to raise the bar again.

dml · Post by **dml** » Sat Nov 01, 2014 3:10 pm

dml wrote: I used to do this on STE by code-generated blitter initializers, with a lookup based on xo:xw, so the span setup time was very, very low. Most of the table entries mapped to a single routine for longer spans, but short spans had special treatment. This was really fast IMO. However there are so many ways to do things - I don't claim to know what the best way is on STE.

Actually this is a bit misleading - I remember now why I was using that, and it involved a source. Otherwise it's probably better not to use the blitter if it's a plain fill to bitplanes. Sometimes I forget the context more easily than the code...

Zamuel_a · Post by **Zamuel_a** » Sat Nov 01, 2014 3:17 pm

dml wrote: For fills, halftone is your friend. And for bitplane mode, endmasks are your friend.

But how to use the halftone for the colors? You don't draw the span in one pass?

dml · Post by **dml** » Sat Nov 01, 2014 3:46 pm

Zamuel_a wrote:
dml wrote: For fills, halftone is your friend. And for bitplane mode, endmasks are your friend.
But how to use the halftone for the colors? You don't draw the span in one pass?

The demo uses truecolour. So just load the face colour into the halftone file, load the xcount=1, load ycount=width (i.e. swap them around). On each span you just reload the dest pointer and ycount, and trigger the blitter-go flag in HOG mode. The span is done in one blit.

Since the spans are of limited size and write-only, HOG doesn't cause problems with IKBD. Since HOG and the line number can be written at once, you can get colour dithering from the halftone nearly for free, just by alternating the colours in the halftone file and loading the alternated X-start bit into the line register. I didn't do this yet but probably will at some point, since 16bit colour causes some clear banding.

So there isn't much to that side of it really.

(The only extra thing I did is overlap the blit with the next span setup, as Anima did with his sprite blitter)

A.F. seems very broken today. Taking a full 15 minutes to post one message. Best left for now.

[EDIT] of course you also need to set HOP,OP correctly to assign halftone as the source
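The span fill described above can be modelled on a host machine roughly like this. It is a software model only, not real register programming: `BlitModel` and its field names are hypothetical stand-ins, HOP/OP are assumed to be already set so that halftone is the source, and folding the scanline parity into the start line (so the dither forms a checkerboard) is an assumption added for the sketch.

```c
#include <stdint.h>

/* Hypothetical model of the few blitter settings the span fill touches. */
typedef struct {
    uint16_t  halftone[16]; /* halftone file: one 16-bit truecolour pixel per line */
    uint16_t *dst;          /* destination pointer, reloaded per span */
    uint16_t  x_count;      /* words per blitter line, fixed at 1 */
    uint16_t  y_count;      /* blitter lines per blit = span width in pixels */
    uint8_t   line;         /* current halftone line number, 0..15 */
} BlitModel;

/* One-off setup per face: alternate two colours through the halftone file.
   Passing the same colour twice gives a plain solid fill. */
static void blit_setup_fill(BlitModel *b, uint16_t colour_a, uint16_t colour_b)
{
    for (int i = 0; i < 16; i++)
        b->halftone[i] = (i & 1) ? colour_b : colour_a;
    b->x_count = 1;         /* xcount and ycount are swapped around */
}

/* Per span: reload dest pointer, ycount and start line, then "go".
   The loop stands in for the hardware: each blitter line writes one word
   and advances the halftone line, so the colours alternate per pixel. */
static void blit_span(BlitModel *b, uint16_t *row, int x0, int width, int y)
{
    b->dst = row + x0;
    b->y_count = (uint16_t)width;
    b->line = (uint8_t)((x0 + y) & 1);   /* assumed: x-start bit plus row parity */
    while (b->y_count > 0) {
        *b->dst++ = b->halftone[b->line & 15];
        b->line = (uint8_t)((b->line + 1) & 15);
        b->y_count--;
    }
}
```

In this model the per-span cost is three stores plus the trigger, which is the point of the arrangement: everything else is set once per face.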

dml · Post by **dml** » Sun Nov 02, 2014 9:54 am

Before attempting to pack the vertices to 16 bits (which now seems to work), I collected a lot of info on the current state of engine performance. The notes are below.

DSP has 50% utilization, but is under stress during that 50% - some waiting at CPU side.

About 15% DSP time is wasted trying to send span descriptors to CPU. When the horizontal resolution is configured below 256 pixels, the span exchanges will start using bytes instead of words, which is a bit faster across the DSP host port.

Converting polygons into spans now takes 11% in the worst areas (~1.5VBLs) so it's probably ok as it is now. Won't be a bottleneck anytime soon.

Face submission & edge clipping however causes some stalls on both sides and needs reworked. Several optimizations possible but big speedup might need radical change to stored BSP leaf records on CPU side, aiming for ideal mesh grain.

[DSP] Used cycles:
49.55% [idle]
19.07% recover_surface (== CPU:R_RecoverSurfacesCPU)
7.71% R_AETGenerateSpans <- optimization complete?
5.14% R_Edge2DAddToViewport_ <- optimization due
4.72% R_XFormProjVertices <- mostly CPU vertex transfer load
4.02% R_SubmitFaceGeometry <- optimization due
2.29% R_IndexedWoundEdge3DAddToFrustum <- optimization due
1.78% R_ProjectVertices <- optimization due
1.42% R_LinkGlobalEdge <- optimization due
1.20% R_AETStepActive <- optimization complete?
1.04% R_SpanEmit <- inlining due
0.64% R_AETIntegratePending <- optimization due
0.29% R_Edge2DAddToViewport <- optimization due
0.23% R_AETRemoveInactive <- optimization due
0.16% R_Edge2DCacheCull <- inlining due
0.13% R_Line2DIntersectY <- optimization due
0.13% R_ScanConvertGET <- optimization complete?
0.13% R_ClippedEdge2DAddToViewport <- optimization due

...not worth touching at all

0.09% R_RecoverSurfaces
0.09% R_VerticalEdge2DAddToViewport
0.07% R_EdgeIntersectZ
0.07% R_GSCleanupSpan
0.02% R_BeginFrame
0.00% R_BeginScan
0.00% R_BeginGeometry

CPU side still needs quite a bit of work.

R_RecursiveWorldNode needs rewritten to share load with DSP. minmax bounding box extents could be quantized to 16 or 32 units, if repacked to bytes (upper 8 bits of existing 12bit integer range for vertices). Would halve loading time into DSP per node later on. Clipping flags could conservative-terminate early when child minmax box is smaller than some sensible limit, could reduce number of frustum plane tests needed during BSP.
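A rough sketch of the byte-packed bounding box idea, assuming integer world coordinates in -4096..4095 and a step of 32 units so that each extent fits a signed byte. `PackedBox`, `pack_box` and `unpack_box` are hypothetical names. The packing is conservative: the unpacked box always contains the original one, so culling against it can never reject visible content.

```c
#include <stdint.h>

#define BOX_STEP 32   /* assumed quantization step in world units */

typedef struct {
    int8_t min[3];
    int8_t max[3];
} PackedBox;

/* Floor division that also rounds negative values downward. */
static int floor_div(int v, int d)
{
    int q = v / d;
    if ((v % d) != 0 && v < 0)
        q--;
    return q;
}

static int8_t quantize(int v)
{
    int q = floor_div(v, BOX_STEP);
    if (q < -128) q = -128;
    if (q > 127)  q = 127;
    return (int8_t)q;
}

/* Both extents store the index of the step they fall in. */
static PackedBox pack_box(const int mins[3], const int maxs[3])
{
    PackedBox b;
    for (int i = 0; i < 3; i++) {
        b.min[i] = quantize(mins[i]);
        b.max[i] = quantize(maxs[i]);
    }
    return b;
}

/* min expands to the start of its step, max to the end of its step,
   so the result can only be larger than the original box. */
static void unpack_box(const PackedBox *b, int mins[3], int maxs[3])
{
    for (int i = 0; i < 3; i++) {
        mins[i] = b->min[i] * BOX_STEP;
        maxs[i] = (b->max[i] + 1) * BOX_STEP;
    }
}
```

The cost of the coarser box is a few extra nodes passing the frustum test; the gain is six bytes per node instead of six words when loading into the DSP.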

R_SubmitFaceGeometry & R_ReindexFaceVertices may be recombined to amortize work and overlap better with DSP edge processing / clipping. Might lose overlap with vertex projection step which is 100% DSP side and mostly free just now. Still, these are a bottleneck. Using GPU approach to leaf meshes which are already baked/pre-indexed would remove this cost, but lead to more vertices & faces needing transmitted/classified/clipped/processed on DSP side.

R_RecoverSurfacesCPU just implements span recovery and filling, and really just needs work on the host port exchanges (try to use bytes where possible)

A bunch of C still needs rewritten as 68k - both my code and existing engine code.

Vertices need packed into 16 bits (+/-4096 with 3-bit fraction) which suits the Q2 universe, and speeds up vertex transmission. (done)

[CPU] Used cycles:
20.03% _R_RecursiveWorldNode4PL_68k <- to replace with DSP:CPU hybrid
19.15% _R_RecoverSurfacesCPU_DSP56k (== DSP:recover_surface)
15.46% _R_SubmitFaceGeometry_DSP56k <- try to merge with R_ReindexFaceVertices
11.04% _R_ScanConvert_DSP56k <- optimization complete?
9.09% _R_ReindexFaceVertices_68k <- try to merge with R_SubmitFaceGeometry
7.47% _Clip_CheckBrushes <- rewrite as 68k + FPU
5.90% _R_ProcessBSPVisitQueue.constpro <- rewrite as 68k
4.77% _R_XFormProjectTaggedVertices_DS <- optimize 8.16 vertices to 12.4 / split camera load

...most of the rest is C framework & collision code from various reference engines incl. Q2, Q1, polyengine and my own inclusions. A lot of stuff not yet inlined, causing time fragmentation across lots of symbols.

Some are called many times per frame, especially AI and collision code. These need to be studied and optimized later...

1.91% _check_node_r
1.58% _R_MarkLeaves
0.57% _BSP_MakeVis
0.55% _check_leaf
0.51% _BSP_FindLeaf
0.32% print_char
0.26% _FrustumSetup
0.14% _Test_Event
0.13% _Door1_Update
0.13% _shifter_vbl_asm2_cb
0.12% _memset
0.12% _Door1_Event
0.12% _Prog_GetFlags
0.10% _R_ProcessGeometry
0.09% _PR_BeginScene
0.07% _FMatrix_PreZRot

The 'input' side of the renderer takes about 50% for BSP/culling/submitting geometry. This part offers the most optimization opportunities now.

The 'output' side takes 11% for scanning and 20% for drawing. This part is becoming quite rigid, not easily made faster from here onwards.

That leaves approx 20% for collisions, AI and background stuff. So these areas are gradually moving up the list.

[CPU] Cumulative/callgraph cycles
29.47% _R_ProcessGeometry
(R_XFormProjectTaggedVertices+R_ReindexFaceVertices+R_SubmitFaceGeometry)
10.23% _Server_UpdateWorld
(_Prog_Event+...)
10.06% _Prog_Event
(lots of AI stuff)
7.67% _Clip_MoveSlide
1.42% _Clip_Begin.constprop.0
1.01% _Server_MakeFrame

Clip_CheckBrushes causes a massive 40% instruction cache miss rate. A rewrite could double its speed or better.

Instruction cache misses:
41.14% _Clip_CheckBrushes <- rewrite as 68k
11.89% _R_ProcessBSPVisitQueue.constpro
16.94% _check_node_r
2.44% _check_leaf

Zamuel_a · Post by **Zamuel_a** » Sun Nov 02, 2014 1:23 pm

What is the best way to draw polygons?

Calculate the edges and save the left and right X in a table and after this, loop through the table and draw the spans, or calculate and draw at the same time? It's much easier to save in a table first since the edges can be treated as individual lines without thinking about other lines at the same time, but of course there is one more loop involved.
Since you seem to be the expert on this, you probably have a good answer.

I guess it might depend on whether it's done on an ST or a Falcon because of the cache.

dml · Post by **dml** » Sun Nov 02, 2014 2:13 pm

Zamuel_a wrote:What is the best way to draw polygons?
Calculate the edges and save the left and right X in a table and after this, loop through the table and draw the spans, or calculate and draw at the same time? It's much easier to save in a table first since the edges can be treated as individual lines without thinking about other lines at the same time, but of course there is one more loop involved.
Since you seem to be the expert on this, you probably have a good answer.

I think it depends on the aim. If you just want to draw a cube with flat fill, there are lots of approaches which will work well and it might be hard to choose between them. The method I use here is unlikely to be as fast as a simple left-edge buffer (scan upward edges into the left edge buffer, scan downward edges and close the gap between the new x and the left edge x for that y).
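For reference, a small sketch of the table-based approach from the question: every edge is scanned on its own into per-scanline left/right extents, and a second loop fills the spans. This is the min/max variant for a single convex polygon rather than the exact upward/downward left-edge buffer, and it assumes on-screen integer vertices with no clipping. `Point`, `scan_edge` and `fill_convex` are names invented for the example.

```c
#include <stdint.h>
#include <limits.h>

#define MAX_H 240                 /* assumed maximum screen height */

typedef struct { int x, y; } Point;

static int left_x[MAX_H];
static int right_x[MAX_H];

/* Walk one edge top to bottom with a 16.16 DDA and widen the
   left/right extents of every scanline it touches. */
static void scan_edge(Point a, Point b)
{
    if (a.y == b.y)
        return;                            /* horizontal: no spans */
    if (a.y > b.y) { Point t = a; a = b; b = t; }
    int32_t x  = a.x * 65536;
    int32_t dx = ((b.x - a.x) * 65536) / (b.y - a.y);
    for (int y = a.y; y < b.y; y++) {
        int xi = x >> 16;
        if (y >= 0 && y < MAX_H) {
            if (xi < left_x[y])  left_x[y]  = xi;
            if (xi > right_x[y]) right_x[y] = xi;
        }
        x += dx;
    }
}

/* Fill a convex polygon into a 16-bit framebuffer; returns pixels written. */
static int fill_convex(const Point *p, int n, uint16_t *fb, int pitch,
                       uint16_t colour)
{
    int ymin = MAX_H, ymax = 0, filled = 0;
    for (int i = 0; i < n; i++) {
        if (p[i].y < ymin) ymin = p[i].y;
        if (p[i].y > ymax) ymax = p[i].y;
    }
    if (ymin < 0) ymin = 0;
    if (ymax > MAX_H) ymax = MAX_H;
    for (int y = ymin; y < ymax; y++) {
        left_x[y]  = INT_MAX;
        right_x[y] = INT_MIN;
    }
    for (int i = 0; i < n; i++)
        scan_edge(p[i], p[(i + 1) % n]);
    for (int y = ymin; y < ymax; y++)
        for (int x = left_x[y]; x < right_x[y]; x++) {
            fb[y * pitch + x] = colour;
            filled++;
        }
    return filled;
}
```

Note that an edge shared by two polygons is scanned once for each of them here.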

If you want to draw dense meshes, the left-edge buffer will perform very poorly because you will be scanning most edges twice (or worse, clipping them multiple times). Still, you can couple edge scanning like this into a shared edge mesh representation so you only scan each edge once, and it gets reused where polygons join. You need to be able to allocate the edgebuffers from a pool.

Triangle-specific rasterization can scan both left and right concurrently, by splitting the scan into two sets of edges (by the middle vertex). This can be done for any convex polygon but is a bit more complicated. Trianglestrips aim to share edges to some degree while keeping the rules simple for hardware. Indexed meshes with shared edges are probably always better in software.

If you want to use affine texturing you can scan the uvs at the same time as you scan the span x coords.

If you want to draw dense meshes with p-correct texturing then you definitely don't want to be doing any of those things (!), not least because the number of attributes you must scan and store is large, and the edge scanning costs quickly become very big. It is better to adopt the scan-coherence reverse-paint algorithm and just compute span x intercepts for all polygons at once, a line at a time. Scan-coherence means that you take advantage of the fact that edges cross only occasionally on the display, so 95% of the time you're just incrementing x-gradients (x+=dx) and checking for crossings. When edges cross each other, you need to swap them in the chain - but it's a lowish frequency event. The main advantage of this is that you can emit spans without overlaps, using whatever criteria you need, and you're only dealing with x1,x2 for a given y.
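The per-scanline step of that scheme can be sketched as below. This assumes the active edges sit in a plain array rather than a linked chain, and that x is 16.16 fixed point; `ActiveEdge`, its `surf` field and `aet_step` are hypothetical. The insertion sort is what exploits the coherence: on a nearly sorted list it does close to no work, and each element move corresponds to two edges crossing.

```c
#include <stdint.h>

typedef struct {
    int32_t x;     /* current x intercept, 16.16 fixed point */
    int32_t dx;    /* x change per scanline, 16.16 fixed point */
    int     surf;  /* hypothetical id of the surface owning this edge */
} ActiveEdge;

/* Step every active edge down one scanline (x += dx), then restore
   left-to-right order. Returns how many crossings were resolved. */
static int aet_step(ActiveEdge *edges, int count)
{
    int crossings = 0;

    for (int i = 0; i < count; i++)
        edges[i].x += edges[i].dx;

    for (int i = 1; i < count; i++) {
        ActiveEdge key = edges[i];
        int j = i - 1;
        while (j >= 0 && edges[j].x > key.x) {
            edges[j + 1] = edges[j];   /* edge j crossed over the key edge */
            j--;
            crossings++;
        }
        edges[j + 1] = key;
    }
    return crossings;
}
```

After each step the sorted x values are the span boundaries for that scanline; deciding which surface owns each gap between neighbouring edges is the part this sketch leaves out.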

When it comes to filling, you use a surface plane to efficiently solve any attribute (z,u,v.. whatever) from screen x,y. They never need scanned or stored. It also avoids one more potentially massive cost - the need to clip/intercept attributes, especially if you're trying to generate non-overlapping spans. And you never deal with attributes for spans you can't see.
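As a sketch of what "solving from a surface plane" means: any attribute that varies linearly across the screen can be written as a*x + b*y + c, with the three coefficients found once per surface from three known screen points. For perspective-correct texturing the linear quantities are 1/z, u/z and v/z rather than u and v themselves. The example uses doubles purely for readability, which is an assumption for illustration (a no-floats target would do the same in fixed point), and `AttrPlane`, `plane_from_points` and `plane_eval` are made-up names.

```c
typedef struct {
    double a;   /* change in attribute per pixel in x */
    double b;   /* change in attribute per pixel in y */
    double c;   /* attribute value at screen (0,0) */
} AttrPlane;

/* Fit attr = a*x + b*y + c through three screen points using
   Cramer's rule. Returns 0 if the points are collinear. */
static int plane_from_points(const double x[3], const double y[3],
                             const double v[3], AttrPlane *out)
{
    double dx1 = x[1] - x[0], dy1 = y[1] - y[0], dv1 = v[1] - v[0];
    double dx2 = x[2] - x[0], dy2 = y[2] - y[0], dv2 = v[2] - v[0];
    double det = dx1 * dy2 - dx2 * dy1;
    if (det == 0.0)
        return 0;
    out->a = (dv1 * dy2 - dv2 * dy1) / det;
    out->b = (dx1 * dv2 - dx2 * dv1) / det;
    out->c = v[0] - out->a * x[0] - out->b * y[0];
    return 1;
}

/* Evaluate the attribute at the start of any visible span. */
static double plane_eval(const AttrPlane *p, double sx, double sy)
{
    return p->a * sx + p->b * sy + p->c;
}
```

A span starting at (x1, y) needs one `plane_eval` per attribute, and from there each pixel just adds `a`. Nothing is interpolated along edges, so nothing needs clipping when a span is cut short by another surface.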

The downside of this approach is that spans are not generated in surface order, but rather scanline order - so they need linked to a span chain, which is in turn linked to a surface chain, which in turn is associated with texture and transparency coding - before you can draw the surfaces. However the savings in all other areas can be very significant, and overall efficiency rises as the polycount goes up.

Zamuel_a wrote: I guess it might depend on whether it's done on an ST or a Falcon because of the cache.

Yes, what is good for 2 processors and a cache is not good for 1 processor and no cache.

On the ST it's mainly an exercise in minimizing the number of total operations, and avoiding expensive ones. It's hard to do when the problem is complex and typically needs lots of dynamic control flow, but the approach is usually clear. Decrease execution length, increase use of lookups.

When you have a complicated hardware landscape - which the Falcon is - the approach is not always clear. You need to prepare to look at less obvious solutions - especially those which involve doing *more* operations to save time in some other way.

A good example is the BSP search implementation in BadMooD. The obvious way to optimize it was to move all the calculations onto DSP, and leave the main memory access, state stack and conditional logic on the CPU side. This works, but it doesn't work as well as it might because there is a lot of decision making going on and lots of data going back and forth.

In the final version, it was actually faster to run part of the algorithm concurrently on *both* chips - both doing the same things at the same time, racing each other, in order to avoid the need to send questions and answers back and forth between the two. The primary reason this ended up faster is that transfer turnaround costs more than the transfers themselves - bidirectional exchange bottlenecks - it creates dependencies between the two processors and causes idle time while the other side responds. By keeping enough info (e.g. a stack) on both sides, each side depends less on the other in order to proceed a few more steps, and in one case (node bbox culling) the DSP can actually compute two competing results ahead of knowing which one the CPU actually wants next.

The gain from this arrangement was quite significant in the end. It seems like a good example of a non-obvious-optimization.
