Quake 2 on Falcon030

Eero Tamminen
Atari God
Posts: 1558
Joined: Sun Jul 31, 2011 1:11 pm

Re: Quake 2 on Falcon030

Postby Eero Tamminen » Wed Jun 24, 2015 6:43 pm

Finally had time to try this out and besides looking amazing, it's faster & more reactive to user input than I had hoped for. Works fine in latest Hatari from Mercurial (WinUAE CPU core build) with its 030 data cache emulation.

Below is the slowest place I could find in the ikdm4 map; normally the speed was 7-14 FPS:
ikdm4-slowest.png

DrTypo
Atari freak
Posts: 73
Joined: Sat Apr 09, 2011 12:57 pm
Location: Paris, France

Re: Quake 2 on Falcon030

Postby DrTypo » Wed Jun 24, 2015 7:36 pm

I tested this public build on my Falcon. No lightmap weirdness, it runs fine!

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Quake 2 on Falcon030

Postby dml » Thu Jun 25, 2015 9:45 am

Eero Tamminen wrote:Finally had time to try this out and besides looking amazing, it's faster & more reactive to user input than I had hoped for. Works fine in latest Hatari from Mercurial (WinUAE CPU core build) with its 030 data cache emulation.


Thanks for trying - I forgot that I left the mouse filtering on for these releases. It should only be on while recording vids, to absorb host-side judder from my PC, but I don't think anyone noticed the laggy mouse too much :)

Eero Tamminen wrote:Below is the slowest place I could find in the ikdm4 map; normally the speed was 7-14 FPS:


It's not too hard to find some slow corners - there are still plenty :) I'll go back and do some profiling with the newer Hatari (and with my new PARCP remote d/l & execute setup for real HW!) to see if any progress can still be made on the worst parts. When I last left it, there were no particularly dominant costs (other than pixels) when standing still. Just lots of separate but very busy stages which add up.

I have made a few optimizations since the previous build - DSP ops removed from the spans, and a redundant dbra removed on the CPU side on every 8th span (done simply by offsetting the count and the pixel jumptowers). Not much, but it was free :)
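
A quick aside on the "jumptower" idea for anyone following along: it is a computed jump into an unrolled run of pixel writes, so the loop counter is only touched once per group of pixels instead of once per pixel. A minimal C illustration of the same principle (Duff's device; this is not dml's 68k code and the names are made up):

Code:

#include <stddef.h>

/* Hypothetical span copy: compute the entry point from the pixel count once,
   then fall through the unrolled writes; the counter is decremented only once
   per 8 pixels. */
static void copy_span(unsigned char *dst, const unsigned char *src, size_t count)
{
    if (count == 0)
        return;

    size_t groups = (count + 7) / 8;     /* number of 8-pixel groups */
    switch (count % 8) {                 /* jump into the unrolled body */
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--groups > 0);      /* one test per 8 pixels */
    }
}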

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Quake 2 on Falcon030

Postby dml » Thu Jun 25, 2015 10:47 am

DrTypo wrote:I tested this public build on my Falcon. No lightmap weirdness, it runs fine!


Good news! :cheers:

I think I remember some versions of Hatari around 1.7-1.8 which had weird FPU issues that later got fixed, so perhaps it was just that. Unfortunately I was working a lot with built-from-source and nightly-build versions (none of which contain an identifiable source control revision for diagnostics), so tracking these kinds of issues was too much of a headache for me. If I had stuck to the formal releases this would have been easier, but I often needed stuff for dev/profiling that wasn't released yet!

AxisOxy
Atariator
Posts: 23
Joined: Tue Jun 23, 2015 2:00 pm

Re: Quake 2 on Falcon030

Postby AxisOxy » Thu Jun 25, 2015 4:15 pm

dml wrote:I had some quite painful experiences with the earlier Doom project because the rendering part had been created before I got anywhere near ID's source, and worked quite differently from the original.


Yeah Doug, I can imagine your pain. I had similar issues with the Doom renderer. I started it completely in assembly language - not a good idea at all! It took ages to get a working version that halfway looked like the original Doom, and debugging was nearly impossible.

So the conclusion for the Quake renderer was to have a working version in C++ in Visual Studio as a starting point, then get it compiled and running with SAS/C on the Amiga without major changes to the renderer code. Only the surrounding framework was exchanged, mostly wrappers for IO stuff. So I was relatively quick to get a Quake 2 renderer running on my Amiga - but unfortunately at about 1 fps on my target platform, a 68030/50MHz. A profiling session showed that the major problem was that the old compiler, in combination with the 68030, is extremely bad at function calls. So the first optimization steps were: replace functions with macros, remove the recursion. Still having a version that works in Visual Studio is very nice for debugging reasons.

dml wrote:It's now at the point though where further progress would need to involve an offline step (or a lot of waiting at runtime! already the case with coloured lightmaps...).


I'm pretty pleased with my approach of converting the maps offline. It massively helps with loading times, because some of the processes (e.g. calculating the shading/dithering tables for colored lightmaps, or the minimum-fit bounding spheres) take really long to execute - even a second or two on a PC, and that's not a good sign. Another nice benefit of this approach is that I don't need multiple loaders for importing maps from different engines (Quake 3, Half-Life). Not that I'm taking any advantage of this yet. But you never know.

When I see your algorithm descriptions, I have to admit I didn't expect the architecture of the 68030/DSP to have such a big influence on the algorithmic choices. The methods are really extreme, but they make total sense under these conditions. Partially they remind me of the good old days, when we just rotated wireframe cubes. I'm not sure if this is what you are doing, but it sounds pretty promising to just collect the data in the BSP recursion and then do it oldschool: transform all vertices -> cull all backfaces -> render all remaining faces. This would also make life much easier when porting parts of the code from C to ASM, because the communication protocol is much simpler - like, "here are 1000 vertices in a linear array, transform them!"
When I start to think about it, this would also be a very good structure for an engine on modern hardware, like a PC or game consoles. It would make it much easier to parallelize work (SIMD, threads, GPU programming).
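
A rough C sketch of that linear-array flow (illustrative types and names only, not code from either project) - each pass consumes one flat array and produces the next, which is exactly the shape that is easy to hand over to ASM, a DSP or SIMD later:

Code:

#include <stddef.h>

typedef struct { float x, y, z; } Vec3;
typedef struct { int v0, v1, v2; } Face;   /* indices into the vertex array */

/* Pass 1: transform every gathered vertex with a 3x4 object-to-view matrix. */
static void transform_vertices(const float m[12], const Vec3 *in, Vec3 *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        Vec3 p = in[i];
        out[i].x = m[0]*p.x + m[1]*p.y + m[2] *p.z + m[3];
        out[i].y = m[4]*p.x + m[5]*p.y + m[6] *p.z + m[7];
        out[i].z = m[8]*p.x + m[9]*p.y + m[10]*p.z + m[11];
    }
}

/* Pass 2: drop faces pointing away from the eye (eye at the view-space origin).
   The winding convention is an assumption. */
static size_t cull_backfaces(const Face *in, size_t n, Face *out, const Vec3 *v)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        Vec3 a = v[in[i].v0], b = v[in[i].v1], c = v[in[i].v2];
        Vec3 e1  = { b.x - a.x, b.y - a.y, b.z - a.z };
        Vec3 e2  = { c.x - a.x, c.y - a.y, c.z - a.z };
        Vec3 nrm = { e1.y*e2.z - e1.z*e2.y,          /* face normal via cross product */
                     e1.z*e2.x - e1.x*e2.z,
                     e1.x*e2.y - e1.y*e2.x };
        if (nrm.x*a.x + nrm.y*a.y + nrm.z*a.z < 0.0f)
            out[kept++] = in[i];                     /* normal faces the eye: keep */
    }
    return kept;
}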

calimero wrote:just to say hi to Axis!
- C64 demos that you worked on are AMAZING! Brilliant and mind blowing code :)

btw I did not know that you code on Amiga too...


Hi calimero, and thanks!
I did a lot of ugly AGA demos back in the '90s.
Lately I've done some stuff for A500 OCS machines (the real Amiga). Our latest release there is Planet Rocklobster.
Unluckily the graphics and design are not as good as intended. The team split up shortly before the Revision party and we didn't want to wait another year for the release, so we had to come up with and implement a completely new design concept in just 5 weeks. Crunch time!
The YouTube quality is also pretty crappy, so better to watch it on real hardware or an emulator: http://www.pouet.net/prod.php?which=65355.

Another thing on my todo list is an Atari ST demo. I have ported a handful of A500 effects to the ST and have a lot of good friends in the Atari scene. So there's something to come there, too. Someday!

Always the same problem: so many ideas for projects and so little time to do them.

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Quake 2 on Falcon030

Postby dml » Sat Jun 27, 2015 6:33 am

Hi!

I'll reply more properly when I get more time (we have sunny weather! :D ), but for now I have attached updated snapshots of the engine architecture and a typical execution map, showing the various components and how they overlap and connect up.

The time map is reasonably up-to-date but the architecture one has become stale, not showing the parts related to surface caching. But it's near enough for now!

:cheers:

Falcon Q2_ architecture.pdf


Falcon Q2_ execution time map - Sheet1.pdf


Note: the percentage times are wrong - they haven't been updated since the flat-shaded version, so pixel drawing is now the dominant cost at 50-70%.

AxisOxy
Atariator
Posts: 23
Joined: Tue Jun 23, 2015 2:00 pm

Re: Quake 2 on Falcon030

Postby AxisOxy » Thu Jul 02, 2015 2:18 pm

Yeah, the weather here in northern Germany is also awesome, so I'm actually doing less coding and more BBQing. ;o)
But I need to code something in the next few days. I just got a great idea for speeding up the perspective texture mapping, and it's begging to be tested.
It's mostly based on rendering the polys in 8x8 blocks and doing the expensive work for the perspective correction only on an 8x8 grid over the area the polygon covers. I'm not sure how this looks quality-wise compared with 16-pixel spans. I also have no clue how to integrate it with the coverage buffer. Perhaps I'll try to handle the coverage in 8x8 blocks as well, allowing a little bit of overdraw along the edges. We will see.
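
A hedged sketch of how such an 8x8-block scheme could be arranged (the general idea only, not AxisOxy's implementation; all names are assumptions). The perspective terms u/z, v/z and 1/z are linear in screen space, so the divide is only paid at block corners and the pixels inside a block are stepped linearly:

Code:

typedef struct { float uoz, voz, ooz; } PerspTerms;   /* u/z, v/z, 1/z at a corner */

/* Exact texture coordinates at one grid corner: one divide per corner. */
static void corner_uv(PerspTerms t, float *u, float *v)
{
    float z = 1.0f / t.ooz;
    *u = t.uoz * z;
    *v = t.voz * z;
}

/* Fill one 8-pixel row of a block by stepping u,v between the exact values
   for that row at the block's left and right edges (unclamped point sampling). */
static void block_row(unsigned char *dst, const unsigned char *tex, int tex_pitch,
                      float u0, float v0, float u1, float v1)
{
    float du = (u1 - u0) * (1.0f / 8.0f);
    float dv = (v1 - v0) * (1.0f / 8.0f);
    for (int x = 0; x < 8; x++) {
        dst[x] = tex[(int)v0 * tex_pitch + (int)u0];
        u0 += du;
        v0 += dv;
    }
}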

Interesting to see your benchmark. I'm wondering a bit why there's so much time spent in _R_GeomReindexBatch, _R_GeomSubmitBatch and _R_CommitFaceGeometry. My first educated guess is data transfer to and from the DSP.

I attached a benchmark of my engine. Not as detailed as yours, but it shows the bottlenecks.
It uses the worst-case frame of "Outer Base"/Quake 2 on an Amiga 1200/Blizzard 1230 @ 50 MHz with flat-shaded polys.
I normally use flat shading for benchmarks because it shows the problems more obviously. Rendering a frame with a fullscreen wall or floor is easy to do fast, but a frame with 500+ faces is a totally different story.

A few posts ago you mentioned that you did optimizations for the 68030 data cache, so I guess the data cache of the Falcon030 actually works. Amiga 1200 68030 boards all suffer from the same issue: the data cache doesn't work at all. Or better said: it works (and causes problems with SMC), but there's not a single cycle it saves. Darn it!

quake 2 benchmark.txt

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Quake 2 on Falcon030

Postby dml » Thu Jul 02, 2015 3:55 pm

...found some time to answer this properly!

AxisOxy wrote:I had similar issues with the Doom renderer. I started it completely in assembly language - not a good idea at all! It took ages to get a working version that halfway looked like the original Doom, and debugging was nearly impossible.


Reminds me of early projects - the fun of debugging big complicated assembly programs!

Sometimes I'm amazed at early demos and games just because they worked at all, before even taking the effects into account. Especially some of the big 3D games or games with complicated logic. Debugging those must have been hell, with the tools that you had no choice but to use back then...

AxisOxy wrote:But unfortunately at about 1 fps on my target platform, a 68030/50MHz. A profiling session showed that the major problem was that the old compiler, in combination with the 68030, is extremely bad at function calls. So the first optimization steps were: replace functions with macros, remove the recursion. Still having a version that works in Visual Studio is very nice for debugging reasons.


Sounds like a sensible approach. My first tests with the Q2 renderer on Falcon were in the sub-fps domain, even without textures :) It made me wonder if I was being too silly.

BTW I do remember the old SAS/C compiler. It was used on a work project long, long, long ago! We had an A4000 with a hard disk, which was quite a luxury compared with the A500s we were using until then.

GCC is much more forgiving with calls so long as they can be static/inline, so I didn't have to take the macro route - but inter-translation-unit calls are very expensive, so decent organization of the code is still essential to get decent performance. (The original game code was not organized that way and would have been a pain to fix.)


AxisOxy wrote:I'm pretty pleased with my approach of converting the maps offline. It massively helps with loading times, because some of the processes (e.g. calculating the shading/dithering tables for colored lightmaps, or the minimum-fit bounding spheres) take really long to execute - even a second or two on a PC, and that's not a good sign. Another nice benefit of this approach is that I don't need multiple loaders for importing maps from different engines (Quake 3, Half-Life). Not that I'm taking any advantage of this yet. But you never know.


Yes, that's true - it keeps some complexity out of the business end of the code.

I had put quite a bit of effort into managing loading times while retaining support for maps in their native format, so the loading time is currently sensible - but I think it can only get worse now as I start doing other things with it. At some point converting the maps will just need to be done.

With the Doom project I used a local cache mechanism, so the first time you run the game it would do the conversion for each instance of data and write it back to disk in optimized form, which it would then use. On the next start, it would just use the cached version.

This works really well and sidesteps the need for users to explicitly convert stuff (and sidesteps the need to distribute 'copyrighted stuff' in preconverted form), but it also increases the executable size and can cause a lot of confusion for someone not expecting an initial wait. So it has some nice properties but also some bad ones. :-)
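
A minimal sketch of that convert-on-first-run cache, assuming a simple "<name>.cache" naming scheme (the Blob type, the file handling and the trivial convert step are placeholders, not the actual Doom/Quake loader code):

Code:

#include <stdio.h>
#include <stdlib.h>

typedef struct { unsigned char *data; size_t size; } Blob;

static Blob read_file(const char *path)
{
    Blob b = { NULL, 0 };
    FILE *f = fopen(path, "rb");
    if (!f) return b;
    fseek(f, 0, SEEK_END);
    b.size = (size_t)ftell(f);
    fseek(f, 0, SEEK_SET);
    b.data = malloc(b.size);
    if (b.data && fread(b.data, 1, b.size, f) != b.size) {
        free(b.data);
        b.data = NULL;
        b.size = 0;
    }
    fclose(f);
    return b;
}

static void write_file(const char *path, Blob b)
{
    FILE *f = fopen(path, "wb");
    if (!f) return;
    fwrite(b.data, 1, b.size, f);
    fclose(f);
}

/* Placeholder for the real preprocessing (lightmap tables, reindexing, ...). */
static Blob convert_to_native(Blob original)
{
    return original;
}

Blob load_asset(const char *name)
{
    char cache_path[512];
    snprintf(cache_path, sizeof cache_path, "%s.cache", name);

    Blob cached = read_file(cache_path);
    if (cached.data)                 /* fast path: preconverted copy exists */
        return cached;

    /* slow path, first run only: convert and write back for next time */
    Blob native = convert_to_native(read_file(name));
    if (native.data)
        write_file(cache_path, native);
    return native;
}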

AxisOxy wrote:When I see your algorithm descriptions, I have to admit that I didnt expect the architecture of the 68030/DSP having such a big influence on the algorithmic choices. The methods are really extreme, but totally make sense under these conditions. Partially they remind me of the good old days, when we just rotated wireframe cubes.


When I started I wasn't really planning on making such drastic changes in those areas - but it became clear very quickly that the project would die a death of 1000 cuts without going 'to the metal' in lots of places.

It was sobering to notice that while there were very obvious bottlenecks, there were also a lot of different processes which were not individual bottlenecks - and which still represented a cumulative cost-blob of epic proportions.

AxisOxy wrote:I'm not sure if this is what you are doing, but it sounds pretty promising to just collect the data in the BSP recursion and then do it oldschool: transform all vertices -> cull all backfaces -> render all remaining faces. This would also make life much easier when porting parts of the code from C to ASM, because the communication protocol is much simpler - like, "here are 1000 vertices in a linear array, transform them!"


That is a pretty accurate description of what it's doing. The BSP pass has been turned into a kind of gather, which outputs a more general displaylist of stuff, which gets digested by the renderer.

This was done for the reasons you describe, but also to make the renderer (which consumes a lot of DSP code space) more general so processes could be reused. The skybox is just another 'map' but formatted like raw BSP output - after gathering.

The dynamic models will also use this, but with a different type of gather. Same applies to doors/submodels.

This is quite different from the way in which ID's engines work - but it is a much better fit for this machine (and as you point out below, other machines which are suited for drawing pipelines vs ad-hoc code).

AxisOxy wrote:When I start to think about it, this would also be a very good structure for an engine on modern hardware, like a PC or game consoles. It would make it much easier to parallelize work (SIMD, threads, GPU programming).


Yes it is very helpful for overlapping work between the two processors. There are some problems with this too but I made an attempt at mitigating most of them. Some extra complexity and overheads persist because of limitations within the Falcon - which could be simplified/dropped on another platform - but they don't get in the way too much.

These compromises are mostly to do with buffer space on the DSP and trading global storage vs temporary storage.

AxisOxy wrote:Lately I've done some stuff for A500 OCS machines (the real Amiga). Our latest release there is Planet Rocklobster.


That's a very nice demo. I like the terrain voxels and the intersecty-cube parts (and the vector-to-point morphing bits bring back memories of Hardwired from my ancient OCS demo collection!). But I won't pick and choose too much between parts :) Nice work.

AxisOxy wrote:Another thing on my todo list is an Atari ST demo. I have ported a handful of A500 effects to the ST and have a lot of good friends in the Atari scene. So there's something to come there, too. Someday!


...look forward to seeing that! :)

AxisOxy wrote:Always the same problem: so many ideas for projects and so little time to do them.


Ha! A very familiar problem!

:cheers:

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Quake 2 on Falcon030

Postby dml » Thu Jul 02, 2015 4:42 pm

AxisOxy wrote:Yeah, the weather here in northern Germany is also awesome, so I'm actually doing less coding and more BBQing. ;o)


Good plan. It's cloudy & thundery-looking here but at least it is warm and not raining. Bonus!

AxisOxy wrote:But I need to code something in the next few days. I just got a great idea for speeding up the perspective texture mapping, and it's begging to be tested. It's mostly based on rendering the polys in 8x8 blocks and doing the expensive work for the perspective correction only on an 8x8 grid over the area the polygon covers. I'm not sure how this looks quality-wise compared with 16-pixel spans. I also have no clue how to integrate it with the coverage buffer. Perhaps I'll try to handle the coverage in 8x8 blocks as well, allowing a little bit of overdraw along the edges. We will see.


That seems like a worthy experiment. Looking beyond the obvious coding difficulty, it probably will produce decent visual results and some gain at the same time.

Long ago I tried an adaptive (subdivision) approach to horizontal spans (again with DSP) but quickly found that it did not map well to perspective curves - it either required an unreasonable amount of subdivision (small error tolerance), which defeated the gains, or caused visual popping which was distracting. Methods which don't use aligned intervals for perspective correction also cause visual weirdness (e.g. perspective samples on adjacent scans moving around relative to each other = looks bad). Using a regular grid doesn't have any of those problems...
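
For reference, the aligned-interval scheme being contrasted here is the classic one: u/z, v/z and 1/z are stepped linearly across a span, and one divide per 16 pixels recovers exact texture coordinates at the interval ends. A generic C sketch (not the Falcon CPU/DSP split; the names and point-sampled fetch are assumptions):

Code:

/* duoz/dvoz/dooz are the per-pixel steps of u/z, v/z and 1/z along the span. */
static void draw_span16(unsigned char *dst, const unsigned char *tex, int tex_pitch,
                        int len, float uoz, float voz, float ooz,
                        float duoz, float dvoz, float dooz)
{
    float z  = 1.0f / ooz;                 /* exact u,v at the span start */
    float u0 = uoz * z, v0 = voz * z;

    while (len > 0) {
        int run = len < 16 ? len : 16;

        /* exact u,v at the far end of this interval: one divide per 16 pixels */
        float uoz1 = uoz + duoz * run;
        float voz1 = voz + dvoz * run;
        float ooz1 = ooz + dooz * run;
        float z1   = 1.0f / ooz1;
        float u1   = uoz1 * z1, v1 = voz1 * z1;

        /* linear steps inside the interval */
        float du = (u1 - u0) / run;
        float dv = (v1 - v0) / run;
        for (int i = 0; i < run; i++) {
            *dst++ = tex[(int)v0 * tex_pitch + (int)u0];
            u0 += du;
            v0 += dv;
        }

        u0 = u1; v0 = v1;                  /* snap to the exact end values */
        uoz = uoz1; voz = voz1; ooz = ooz1;
        len -= run;
    }
}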

Coverage/zbuffer - yep good luck with that one :)

In fact I don't think low resolution depth arbitration is such a bad thing, as it mostly just affects dynamic models. It will affect transparency too but I developed an alternate solution to that one, so the transparent surfaces are handled as depth-sorted spans like the rest of the scenery. Still I will have problems with models here soon.

There won't be any perfect way to render dynamic models without perfect z-buffering, so I will opt for sub-perfect and just try to manage the worst-case error. A low-res coverage or depth implementation will probably be better than that either way :)

AxisOxy wrote:Interesting to see your benchmark. I'm wondering a bit why there's so much time spent in _R_GeomReindexBatch, _R_GeomSubmitBatch and _R_CommitFaceGeometry. My first educated guess is data transfer to and from the DSP.


Yes, that's right. _R_GeomReindexBatch / _R_GeomSubmitBatch is the CPU side of geometry preparation and transmission. _R_CommitFaceGeometry is the DSP processing of the geometry (mainly faces/edges).


This is actually one of the less fortunate / more complicated aspects of the Falcon architecture. The DSP is deeply involved in the geometry pipeline, but it has very little memory, and can't access main memory. So it has to be treated mainly as a pipe with very limited persistent storage.

I organized things such that the engine submits batches of no more than 128 faces, where edge/vertex sharing exists only within a batch - like an independent mesh for each batch. In this way, I can commit edges to the GET in a compact form, as the scene content is built up, while the edges and vertices themselves are discarded after each batch. This minimizes persistent storage.

Unfortunately there are 2 implications here, and a trade to make:

1) The map is pre-indexed into independent meshes of <= 128 faces, which must be submitted whole, and trimmed (backface culled etc) after transmitting into the geometry pipe.

2) Or the map is left as it is, but each batch is re-indexed on the fly using a vertex/edge cache, after the BSP has backface-culled the geometry for free (as an implicit part of the BSP plane-side logic).

The trade here is that option 1) can be done offline by preprocessing the map, but greatly increases the amount of stuff you need to transmit to the DSP (about 50% extra). Transmitting is extremely expensive. It's an 8-bit port to the DSP. :(

Option (2) costs more to do at runtime, on the CPU. However, it somewhat trivially reduces the amount needing to be transmitted, and it can be magically overlapped with the DSP's geometry work if there is more than one batch - so it can largely be done for free. Not always, but most of the time it is free, especially on dense scenery.
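
A small C sketch of the on-the-fly re-indexing that option (2) describes: a direct-mapped remap cache from global map vertex indices to compact per-batch indices. The sizes and layout are assumptions, not the engine's actual structures; a cache collision only costs a duplicated vertex, never a wrong one, and the caller is assumed to flush/reset the batch before it fills up:

Code:

#include <string.h>

#define BATCH_MAX_VERTS 256
#define REMAP_SLOTS     512          /* power of two, direct-mapped */

typedef struct {
    int local[REMAP_SLOTS];          /* local index for the slot, or -1 */
    int global[REMAP_SLOTS];         /* which global index owns the slot */
    int verts[BATCH_MAX_VERTS];      /* global indices, in transmission order */
    int vert_count;
} Batch;

static void batch_reset(Batch *b)
{
    memset(b->local, -1, sizeof b->local);
    b->vert_count = 0;
}

/* Map one global vertex index into the current batch, appending it if unseen.
   The caller flushes the batch (and calls batch_reset) before it can overflow. */
static int batch_remap(Batch *b, int global_index)
{
    int slot = global_index & (REMAP_SLOTS - 1);
    if (b->local[slot] >= 0 && b->global[slot] == global_index)
        return b->local[slot];               /* cache hit: vertex already in batch */

    int local = b->vert_count++;             /* miss (or collision): append it */
    b->verts[local]  = global_index;
    b->local[slot]   = local;
    b->global[slot]  = global_index;
    return local;
}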

AxisOxy wrote:I attached a benchmark of my engine. Not as detailed as yours, but it shows the bottlenecks.
It uses the worst-case frame of "Outer Base"/Quake 2 on an Amiga 1200/Blizzard 1230 @ 50 MHz with flat-shaded polys.
I normally use flat shading for benchmarks because it shows the problems more obviously. Rendering a frame with a fullscreen wall or floor is easy to do fast, but a frame with 500+ faces is a totally different story.


Thanks - that is quite interesting to look at. It's actually quite a similar picture to what I saw, in relation to the HW of course. The edge setup in particular looks to be very fast.

Agreed - flat shading is a good way to inspect hidden costs in other parts of the engine. I keep a switch handy for that.

I took a couple of shortcuts in my own edge scanner which can potentially fault on nonconvex polygons, but I arranged things such that the faults would be restricted and mostly go unnoticed. It was very uncomfortable using a state counter for active surfaces, so I now use a single bit instead. I'm not recommending this, because it does create specific headaches - but with some care they can be safely absorbed.

One other thing I forgot to mention is that I introduced a 5th clipping plane (nearclip), which isn't present in Q2. At first this just added to overall cost, but after many attempts at optimizing the BSP routines it began to save more than it cost.

(Incidentally this was not true of the C version on PC - when I measured, it seemed always faster to stick with 4 planes).

AxisOxy wrote:A few posts ago you mentioned that you did optimizations for the 68030 data cache, so I guess the data cache of the Falcon030 actually works. Amiga 1200 68030 boards all suffer from the same issue: the data cache doesn't work at all. Or better said: it works (and causes problems with SMC), but there's not a single cycle it saves. Darn it!


That's also interesting to know. I had assumed that it would work better on Amigas with 030 CPUs.

It is safe to say that it does suck on the Falcon, but for obvious reasons - word fetches become longword fetches, over a 16bit bus. On average, it incurs a small but measurable overhead most of the time. Most demos turn it off because most code doesn't refer to the same data twice (or if it does, it also involves some sort of random-access, which just kills the gain with more wasted reads).

There are some specific cases where it works and is useful - but usually not by accident, and definitely requires confirmation in each case.

The 3 main places where I try to use it each behave a bit differently.

- The BSP routine uses it in write-allocate mode, to absorb the cost of popping the stack. Write-allocate mode is not very well documented and is usually referred to as 'slower', but in fact that has more to do with tags and collisions between super/user mode switching. If you're in super mode all the time, you only need to watch what you are writing to, since writes will allocate or destroy existing cache entries. If you reprogram the PMMU, it's possible to prevent specific memory sources from emptying your cached writes, and then it becomes useful.

- The vertex/edge reindexing step uses it in a more typical manner - repeated fetches to the same cache records when edges share an index. The hit rate on vertices is the main gain, since lots of edges can share a vertex. Typically an edge is referenced twice at most, so the hit rate there is poor.

- For certain texturing routines, in conjunction with careful mipmap selection, repeated reads from nearby pixels can be cached. The hit rate is unreliable though, and rotation-sensitive. However, the lighting CLUT tables have a decent hit rate, especially if the maps are reindexed for locality - this is the main gain. Unfortunately my texturing routine has a fixed-time aspect, so if the CPU is too quick the system faults - so I can't use it everywhere I might otherwise :)

However, overall the gains are small relative to other kinds of optimization. It works, but probably not the most effective use of programming time! :D

AxisOxy
Atariator
Posts: 23
Joined: Tue Jun 23, 2015 2:00 pm

Re: Quake 2 on Falcon030

Postby AxisOxy » Wed Jul 08, 2015 3:20 pm

dml wrote:and sidesteps the need to distribute 'copyrighted stuff' in preconverted form

Oh yeah, that's a problem I haven't really thought about. It's no problem yet, because I am a gazillion light years away from a release, but it will need some thought in the future.

dml wrote:That seems like a worthy experiment. Looking beyond the obvious coding difficulty, it probably will produce decent visual results and some gain at the same time.

Did some quick tests. The texture mapping itself works quite well (quality- and speed-wise) compared to the span-based approach. But I still have no idea how to solve the coverage problem, because the block-wise rendering is totally incompatible with the typical coverage buffer implementations. So, for the moment it's moved back onto the stack of ideas and prototypes until I get the right inspiration.

dml wrote:Long ago I tried an adaptive (subdivision) approach to horizontal spans (again with DSP) but quickly found that it did not map well to perspective curves - it either required an unreasonable amount of subdivision (small error tolerance), which defeated the gains, or caused visual popping which was distracting.

Yeah, I did the same back in the '90s, for a PC 3D engine (FPU+MMX), when I worked for Software 2000. I know the popping and wiggling you get when you use this kind of tech on typical Doom-/Quake-style maps. But for our use case (a soccer game) it worked pretty well: a big grass surface at the right angle, while everything else on the screen has rather small polygons or is pretty far away.

dml wrote:It is safe to say that it does suck on the Falcon, but for obvious reasons - word fetches become longword fetches, over a 16bit bus.

Uh, that sounds really bad. So your data cache even slows things down. *LOL*
After a little bit of research I found the reason for the data cache problem on the Amiga. It looks like it only works in write-allocate mode, due to the way the data bus on the Amiga works. I did a bit of testing and found some edge cases where it saves 1 cycle, but these cases are so rare that I'm pretty sure they will never happen in real-world scenarios - and definitely not by accident. So, let's take the Amiga and Falcon hardware designers, put them in a bag and use a big club on them. You won't hit the wrong person. ;)

dml wrote:One other thing I forgot to mention is that I introduced a 5th clipping plane (nearclip), which isn't present in Q2. At first this just added to overall cost, but after many attempts at optimizing the BSP routines it began to save more than it cost.

Ah, yes. I also clip against the near plane. I mostly added that for accuracy reasons: without it, I only had the choice between clipping bugs on very big faces or ugly jittering in the subpixel-accurate edges, due to extremely big values in the edge setup. In the end it even saved some cycles. While we are at it, I only clip against the near plane in 3D. The other 4 clip planes are only used for frustum culling, and the x/y clipping is done in 2D during the edge setup (in y) and rasterization (in x).
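
For readers following along, a minimal sketch of that kind of single-plane near clip (one Sutherland-Hodgman pass against z = near; the constant and types are assumptions, not code from either engine):

Code:

typedef struct { float x, y, z; } Vec3;

#define NEAR_Z 0.25f   /* assumed near distance (plus margin) */

/* Clip a convex polygon against z >= NEAR_Z. 'out' must hold in_count + 1
   vertices. Returns the clipped vertex count (0 if fully behind the plane). */
static int clip_near(const Vec3 *in, int in_count, Vec3 *out)
{
    int out_count = 0;
    for (int i = 0; i < in_count; i++) {
        Vec3 a = in[i];
        Vec3 b = in[(i + 1) % in_count];
        int a_in = a.z >= NEAR_Z;
        int b_in = b.z >= NEAR_Z;

        if (a_in)
            out[out_count++] = a;            /* keep vertices in front of the plane */

        if (a_in != b_in) {                  /* edge crosses the plane: emit the cut */
            float t = (NEAR_Z - a.z) / (b.z - a.z);
            Vec3 p = { a.x + t * (b.x - a.x),
                       a.y + t * (b.y - a.y),
                       NEAR_Z };
            out[out_count++] = p;
        }
    }
    return out_count;
}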

By the way, I recently tested and benchmarked a handful of C/C++ compilers for the Amiga. Speed-wise, SAS/C is still in the lead. Damn, I had hoped to find a nice cross-dev tool producing efficient code, because the compile times with SAS/C are starting to kill the fun.
While testing the compilers I found some really funny code snippets in the disassembly. They are a good reason to write less C and more ASM code. So here are some of them:

Code:

; Address an interleaved int* linebuffer. Can be done for free with (a1,d3.l*8)
move.l d3,d0
lsl.l  #1,d0
lsl.l  #2,d0
add.l  d0,a1

; Looks like compilers don't know about 16- or even 32-bit moves.
move.b (a1)+,(a2)+
move.b (a1)+,(a2)+
move.b (a1)+,(a2)+
move.b (a1)+,(a2)+
move.b (a1)+,(a2)+
move.b (a1)+,(a2)+

; WTF! Didn't know that this even works.
movem.l (a7)+,d0

AxisOxy
Atariator
Posts: 23
Joined: Tue Jun 23, 2015 2:00 pm

Re: Quake 2 on Falcon030

Postby AxisOxy » Fri Jul 10, 2015 8:54 am

Compiler benchmark aftermath:

I started to port more code from C to ASM, and SAS/C is starting to impress me. It's not easy to get a big gain compared to the output of SAS. Most of the gain comes from structural changes that become obvious in ASM, not from low-level code optimization - things like indexed access to non-pow2 structure arrays, or loops that need more than 7 pointers; I'd say the C compiler has no chance of making good code out of those. The only real gain comes from opcodes that are not accessible in C, like ROL/ROR or carry-flag trickery, but that's only a very small margin.
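
A tiny, hypothetical illustration of the non-pow2 structure point: a 12-byte element needs a multiply or a shift-and-add sequence to form the address of arr[i], while padding the element to 16 bytes turns the indexing into a single shift - the kind of structural change that only becomes obvious once you read the generated code:

Code:

typedef struct { float x, y, z; }      Vert12;   /* 12 bytes: address = base + i*12   */
typedef struct { float x, y, z, pad; } Vert16;   /* 16 bytes: address = base + (i<<4) */

float get_x_12(const Vert12 *arr, int i) { return arr[i].x; }
float get_x_16(const Vert16 *arr, int i) { return arr[i].x; }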

I see some hard times coming up...

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Quake 2 on Falcon030

Postby dml » Fri Jul 10, 2015 9:55 am

Hi! I've been doing 3 things at once, so I've not had much time to respond until now. Looks like you've been doing experiments :)

AxisOxy wrote:I started to port more code from C to ASM, and SAS/C is starting to impress me.


TBH I had similar experiences with GCC 4.6.x - while not as good as what you describe here, it still seemed pretty good considering the 68k backend probably hasn't been updated in several decades (any recent benefits are from the mid-high level IR optimizations if they even work at all).

While it is nearly always possible to beat the compiled code - sometimes it is trivial to do so - at other times it is actually quite difficult if the code is even slightly convoluted. A few trivial cases also produced nice surprises (if unrolling is ignored - gcc can't seem to figure out what to do about that on 68k). For sprawling glue code there's no competition of course, because relentless automation always beats patience :)

I sometimes set up my makefiles with two source profiles - optimize for speed or for size - and all of the glue code, loaders and other junk get the 'size' treatment. Anything I haven't had the patience/willpower to convert to asm gets the 'speed' treatment.

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Quake 2 on Falcon030

Postby dml » Fri Jul 10, 2015 10:16 am

AxisOxy wrote:Oh yeah, that's a problem I haven't really thought about. It's no problem yet, because I am a gazillion light years away from a release, but it will need some thought in the future.


It will be interesting to see what you've done when it's ready - I didn't realize this was going on alongside my effort on the Atari. I certainly saw the 060 Amiga port but wasn't aware of anything else. Sounds really cool :)

AxisOxy wrote:Did some quick tests. The texture mapping itself works quite well (quality- and speed-wise) compared to the span-based approach. But I still have no idea how to solve the coverage problem, because the block-wise rendering is totally incompatible with the typical coverage buffer implementations. So, for the moment it's moved back onto the stack of ideas and prototypes until I get the right inspiration.


I'm not sure what to do about that either. It might be necessary to use subspans for the edges (and short spans), leaving only the tiled areas empty, then fill those separately. Or have some sort of crazy edgemask permutation system for tiles containing edges, and render only the tiles containing vertices 'properly'. It seems difficult anyway :-)

AxisOxy wrote:Yeah, I did the same back in the '90s, for a PC 3D engine (FPU+MMX), when I worked for Software 2000. I know the popping and wiggling you get when you use this kind of tech on typical Doom-/Quake-style maps. But for our use case (a soccer game) it worked pretty well: a big grass surface at the right angle, while everything else on the screen has rather small polygons or is pretty far away.


It was quite difficult back then with that hardware to decide on acceptable compromises, while trying to make engines general enough for the next game (especially not knowing what the next game should look like!). I remember cases where the concept art for the next project started appearing and realizing the last iteration of code wasn't anything like a good fit for what we had. :)

AxisOxy wrote:Uh, that sounds really bad. So your data cache even slows things down. *LOL*


Yep :)

AxisOxy wrote:After a little bit of research I found the reason for the data cache problem on the Amiga. It looks like it only works in write-allocate mode, due to the way the data bus on the Amiga works. I did a bit of testing and found some edge cases where it saves 1 cycle, but these cases are so rare that I'm pretty sure they will never happen in real-world scenarios - and definitely not by accident. So, let's take the Amiga and Falcon hardware designers, put them in a bag and use a big club on them. You won't hit the wrong person. ;)


Actually this makes some sense on a machine with fast 32bit memory. It's probably going to save a cycle here and there... I hadn't really given it much thought.

On the Falcon the bus is quite painful - IIRC 4 cycles per word, and 2 words for any 32bit fetch. So the d-cache is worth using if you can realize a good enough hit rate. By default though it tends to just get in the way.


AxisOxy wrote:Ah, yes. I also clip against the near plane. I mostly added that for accuracy reasons: without it, I only had the choice between clipping bugs on very big faces or ugly jittering in the subpixel-accurate edges, due to extremely big values in the edge setup. In the end it even saved some cycles. While we are at it, I only clip against the near plane in 3D. The other 4 clip planes are only used for frustum culling, and the x/y clipping is done in 2D during the edge setup (in y) and rasterization (in x).


I'm actually doing much the same - although the 5-plane thing I refer to is the node culling/classification step around the BSP. I still clip against the near plane (+ some margin) regardless, for the same reasons you offer - serious problems with integer range and jitter.

I'm also using a 2D clipper instead of a 3D winding-plane clip, which is indeed fast for the typical case. Although I ran into problems when I found there were nonconvex polys in the maps which broke the 2D clip exit/entry ordering logic, so it ended up more complicated than planned :( The Falcon is quite well suited to the 3D clipper, so I could have saved myself some trouble and code by going that way sooner, but it's too late to be redoing all of that now....

AxisOxy wrote:; WTF! Didn't know that this even works.
movem.l (a7)+,d0


...that is indeed a gem! :D

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Quake 2 on Falcon030

Postby dml » Fri Jul 10, 2015 12:13 pm

Not had much coding time lately but have been playing with a couple of interesting things in-between other stuff.

I found a small oversight in the generation of the tables for my texture-mapping implementation; fixing it reduces the average error rate. I'm not sure if it is enough of a fix to affect the texture sparkles (probably not), but since the error is measurably lower it's worth including in a new build.

I figured out something else as well - a problem which has been bugging me for a while. It is a small thing on its own but I'm sure it will have many uses and enable new retro techniques! I'll explain later when it's properly implemented and finished...

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Quake 2 on Falcon030

Postby dml » Sat Jul 11, 2015 11:17 am

So the other thing I had been working on is an alternate way to perform square-root operations for realtime 3D.

These are very expensive to perform via the 882 FPU, and even more so using algorithms on the CPU. Tables help, but they need to consume a lot of space to make a real difference, and that is useless on a DSP56k with very little RAM. The excellent Carmack/3DLabs sqrt() trick - which exploits the floating-point bit representation - deserves a mention, but it requires the FPU and is therefore still expensive (and limiting) on a Falcon, and useless on the DSP.
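
For reference, the widely circulated inverse-square-root form of that bit-representation trick looks like this in C (rewritten here with memcpy instead of the original's pointer cast): it guesses 1/sqrt(x) from the float's bit pattern and refines the guess with one Newton-Raphson step.

Code:

#include <stdint.h>
#include <string.h>

static float fast_rsqrt(float x)
{
    float    half = 0.5f * x;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);      /* reinterpret the float as an integer */
    bits = 0x5f3759df - (bits >> 1);     /* magic-constant initial guess */
    memcpy(&x, &bits, sizeof x);
    x = x * (1.5f - half * x * x);       /* one Newton-Raphson refinement */
    return x;
}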

(I will point out that square-root is highly valuable for 3D graphics. Having access to a fast sqrt() makes a real difference to what is possible!)


So far I had been using a modified/improved bitwise algorithm on the DSP, in both integer and fixed-point versions. This works quite well but requires 23 iterations of a 5-instruction sequence. That's 23*5*2 = 230+ cycles (!!!). I tried translating other algorithms to the DSP, but this remained the general winner for speed/accuracy. There is a partial-table solution which should be faster, but it didn't save much and consumed a lot more space and registers. In any case, all of the methods tried are either so slow or so inaccurate that they have limited use.
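
For comparison, the textbook one-result-bit-per-iteration integer square root - a close relative of that 23-iteration DSP loop, though not the fixed-point DSP code itself - looks like this in C:

Code:

#include <stdint.h>

static uint32_t isqrt32(uint32_t x)
{
    uint32_t root = 0;
    uint32_t bit  = 1u << 30;            /* highest power of four in 32 bits */

    while (bit > x)                      /* skip leading zero result bits */
        bit >>= 2;

    while (bit != 0) {
        if (x >= root + bit) {           /* trial: does this bit fit? */
            x    -= root + bit;
            root  = (root >> 1) + bit;   /* accept the bit */
        } else {
            root >>= 1;                  /* reject the bit */
        }
        bit >>= 2;
    }
    return root;                         /* floor(sqrt(x)) */
}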


But I didn't give up!

After some experiments I developed a solution which closely approximates a 23bit fixedpoint sqrt() in just 10 cycles.

A modified/compound version can also approximate 1.0 / sqrt(x) - albeit less accurately - which can then be used to normalize 3D vectors very very quickly. I wouldn't use this for important math (!) but I think it should suffice for most graphics uses.


The fun part - this method is continuous, accurate enough to replace other methods and fast enough to use per-pixel.


There is some other stuff going on which ties in with this, but it is at an early stage and I'm not close to describing it yet. ;)


Below is a dump of random samples using this integer-only sqrt() approximation. Only result deviations >= 0.01% vs. the expected value are reported. It shows that accuracy decreases with small source values, which turns out to be OK for the most common uses of sqrt() in graphics problems, and isn't too much of a surprise for integer-based formulas anyway, as fewer bits are available for smaller numbers (unlike floats).

Code:

[r=real value] [ye=expected] [ya=actual] [e=error, shown when >= 0.01%]
r:0.4277 ye:0.6540 ya:0.653809 e:0.02%
r:0.0586 ye:0.2421 ya:0.241943 e:0.06%
r:0.1701 ye:0.4124 ya:0.412109 e:0.08%
r:0.2395 ye:0.4893 ya:0.489258 e:0.02%
r:0.3707 ye:0.6088 ya:0.608398 e:0.07%
r:0.3865 ye:0.6217 ya:0.621826 e:0.02%
r:0.1175 ye:0.3427 ya:0.342529 e:0.06%
r:0.3985 ye:0.6313 ya:0.631104 e:0.03%
r:0.4556 ye:0.6750 ya:0.675293 e:0.04%
r:0.3674 ye:0.6061 ya:0.605957 e:0.03%
r:0.2812 ye:0.5302 ya:0.530029 e:0.04%
r:0.1479 ye:0.3846 ya:0.384521 e:0.01%
r:0.3074 ye:0.5544 ya:0.554199 e:0.04%
r:0.0485 ye:0.2203 ya:0.219971 e:0.16%
r:0.4198 ye:0.6479 ya:0.647705 e:0.03%
r:0.0109 ye:0.1045 ya:0.104248 e:0.19%
r:0.4058 ye:0.6370 ya:0.636475 e:0.08%
r:0.1180 ye:0.3434 ya:0.343262 e:0.05%
r:0.3377 ye:0.5812 ya:0.580811 e:0.06%
r:0.2195 ye:0.4685 ya:0.468262 e:0.05%
r:0.3707 ye:0.6088 ya:0.608398 e:0.07%
r:0.0187 ye:0.1369 ya:0.136719 e:0.12%
etot: 0.000649

troed
Atari God
Posts: 1212
Joined: Mon Apr 30, 2012 6:20 pm
Location: Sweden

Postby troed » Sat Jul 11, 2015 3:37 pm

o_O

dhedberg
Atari Super Hero
Posts: 558
Joined: Mon Aug 30, 2010 8:36 am

Re: Quake 2 on Falcon030

Postby dhedberg » Mon Jul 13, 2015 12:28 am

dml wrote:So the other thing I had been working on is an alternate way to perform square-root operations for realtime 3D.
...
After some experiments I developed a solution which closely approximates a 23bit fixedpoint sqrt() in just 10 cycles.
...

What?! 8O
Come on... don't be shy! I think there are more people than just me dying to know how you do it! :wink:
Daniel, New Beat - http://newbeat.atari.org

AdamK
Captain Atari
Posts: 240
Joined: Wed Aug 21, 2013 8:44 am

Re: Quake 2 on Falcon030

Postby AdamK » Mon Jul 13, 2015 8:08 am

First he needs to fill in a patent form ;)
Atari: FireBee, Falcon030 + CT60e + SuperVidel + SvEthlana, TT, 520ST + 4MB ST RAM + 8MB TT RAM + CosmosEx + SC1435, 1040STFM + UltraSatan + SM124, 1040STE 4MB ST RAM + 8MB TT RAM + CosmosEx + NetUSBee + SM144 + SC1224, 65XE + U1MB + VBXE + SIDE2, Jaguar, Lynx II, 2 x Portfolio (HPC-006)

Adam Klobukowski [adamklobukowski@gmail.com]

shoggoth
Nature
Posts: 854
Joined: Tue Aug 01, 2006 9:21 am
Location: Halmstad, Sweden

Re: Quake 2 on Falcon030

Postby shoggoth » Mon Jul 13, 2015 6:53 pm

... wait for it.... waaaiit fooor iiiiit....
Ain't no space like PeP-space.

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Quake 2 on Falcon030

Postby dml » Tue Jul 14, 2015 10:15 am

Sorry, I've just been too busy to post much. :)

The sqrt() thing is another sampled nonlinear approximation, like the divide operation used in the Falcon Q2 texture mapper. The solver just needed to be generalized for different functions, and the expression and normalization of the terms are different, so the ASM code sequence looks different. Generally it's the same idea though!
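
As a generic illustration of what a sampled nonlinear approximation can look like (this is not the Falcon Q2 solver, just one common shape of the idea, with assumed names and sample count): sample the function over a normalized interval at table-build time, then pick the surrounding samples at runtime and apply a linear correction.

Code:

#include <math.h>

#define SQRT_SAMPLES 64

static float sqrt_table[SQRT_SAMPLES + 1];   /* sqrt() sampled on [0.25, 1.0] */

static void build_sqrt_table(void)
{
    for (int i = 0; i <= SQRT_SAMPLES; i++) {
        float x = 0.25f + 0.75f * (float)i / SQRT_SAMPLES;
        sqrt_table[i] = sqrtf(x);
    }
}

/* Approximate sqrt(x) for x in [0.25, 1.0); the caller is assumed to have
   range-reduced x by an even power of two and to rescale the result. */
static float approx_sqrt_normalized(float x)
{
    float f = (x - 0.25f) * (SQRT_SAMPLES / 0.75f);
    int   i = (int)f;
    if (i > SQRT_SAMPLES - 1) i = SQRT_SAMPLES - 1;
    float t = f - (float)i;                  /* position between the two samples */
    return sqrt_table[i] + t * (sqrt_table[i + 1] - sqrt_table[i]);
}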

I have been working away on some new stuff that will probably use the sqrt(), although it's just a small part of a bigger problem. I'll be a bit quiet until then, since free time is pretty thin.

There are some other Falcon goodies on the way but more experiments to do first...

alexh
Fuji Shaped Bastard
Posts: 2593
Joined: Wed Oct 20, 2004 1:52 pm
Location: UK - Oxford

Re: Quake 2 on Falcon030

Postby alexh » Tue Jul 14, 2015 3:15 pm

Very impressive.

A quick google shows that some of the traditional but optimised routines are 3 instructions per bit.

http://www.finesse.demon.co.uk/steven/sqrt.html

Presumably these methods are similar to the 5 DSP-instructions per bit routine you started with?

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Quake 2 on Falcon030

Postby dml » Tue Jul 14, 2015 4:24 pm

alexh wrote:Presumably these methods are similar to the 5 DSP-instructions per bit routine you started with?


Yes, it looks like the same algorithm, although quite different asm code. The ARM has mind-bending conditional and compound shift operations. Not that the DSP isn't also mind-bending at times, but in a different kind of way...

Maybe the DSP one can be improved, but 5 ops and a 23-bit result was as near as I got the traditional way. Note that it operates on 23-bit fractions, so in/out values are shifted by 1 bit, as is typical for the DSP.

Code:

sqrt    macro   xysqr,xyroot,Txy

        move    y:<cy_point5,b
        tfr     b,a         #<0,xyroot      ;                   : pattern-accumulator
        do      #<23,_loop
        lsr     b           a,Txy           ; shift trial bit   : new trial pattern
        mpy     Txy,Txy,a                   ; trial (x*x)
        cmp     xysqr,a     xyroot,a        ; (x*x)>a?          : restore pattern-acc for update
        tle     Txy,a                       ; condition update pattern-acc
        add     b,a         a,xyroot        ; combine bit       : save updated pattern-acc
_loop:
        endm


Sorry about the tabs.

It is possible to get rid of the lsr shift, but seemingly not the parallel move - so 5 ops it is for now. BTW, unrolling it a bit can remove a few ops from the final iteration I think, but I didn't bother - I kept it small. It takes forever anyway.

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Quake 2 on Falcon030

Postby dml » Tue Jul 14, 2015 6:22 pm

I just got the DSP version of approximate sqrt() working, have tested it and I am now pretty certain that it will be accurate enough to replace the other one in most cases.

The body of the calculation is 8 cycles (4 ops) but there is a 12bit normalizing shift involved afterwards and the fastest I could do this was 10 cycles (5 ops - beating my previous impl for a 48bit dynamic shift by 2 ops). So the full arithmetic takes 18 cycles on DSP after all...

There is also some addressing setup code - which can be amortized into a loop (the same as was done for the texture mapper), but standalone it's another bunch of cycles. So let's say the first pass on the DSP is 18 cycles best case, up to 30 worst case if it's just called once. That's more than the 10 cycles I had sketched out, but I won't complain. Definitely better than 230 though :)

For the texture mapper I was able to play with the normalization of each term, at the expense of accuracy, and removed nearly all of the shifting from the original version. I'm not sure I can do that here, but it's only the first iteration. Maybe another day.

It's definitely nice to see my test running waaaaay faster with this upgrade :)

AxisOxy
Atariator
Posts: 23
Joined: Tue Jun 23, 2015 2:00 pm

Re: Quake 2 on Falcon030

Postby AxisOxy » Wed Jul 15, 2015 9:15 am

dml wrote:While it is nearly always possible to beat the compiled code - sometimes it is trivial to do so - at other times it is actually quite difficult if the code is even slightly convoluted.

Looks like my biggest problem in beating the compiler is the number of calls between C and ASM. If I call an ASM function per face or per edge, there's no chance to optimize away the overhead involved (function calls from the compiler side, register saving, ...). To get rid of that problem, my structure is starting to look a bit like yours: an array of nodes processed by an ASM function to output a sorted array of leafs, which then gets processed by an ASM function that outputs an array of culled faces, ...

dml wrote:I didn't realize this was going on alongside my effort on the Atari.

I'm sure there are a lot of similar projects out there in the dark - like 5-10 on every retro platform. Games like Doom, Quake and Duke Nukem were so famous back in the day that every coder tried to do stuff like that on his own platform, most of us without success. Nowadays all the old game engines are open source and there are books and websites available that explain the algorithms and trickery inside, so the chances of succeeding with a new try are rising. E.g. I know a guy who is working on a pretty impressive Duke Nukem port for the Amiga. Chances are good that now, when the first person starts to make his work public, others will also come out of the shadows and show their progress. I experienced something similar with the Amiga demoscene: the scene looked pretty dead, but after I made my demo open source, lots of other coders started to send me their WIP frameworks and tools. So obviously there are people working on something.

dml wrote:Although I ran into problems when I found there were nonconvex polys in the maps which broke the 2D clip exit/entry ordering logic.

Ugh, non-convex polys. Now I know where those ugly bugs come from.

calimero
Fuji Shaped Bastard
Posts: 2060
Joined: Thu Sep 15, 2005 10:01 am
Location: STara Pazova, Serbia

Re: Quake 2 on Falcon030

Postby calimero » Wed Jul 15, 2015 8:36 pm

@AxisOxy

can you share a link to that Duke Nukem port for the Amiga?
using Atari since 1986 ・ http://wet.atari.org ・ http://milan.kovac.cc/atari/software/ ・ Atari Falcon030/CT63/SV ・ Atari STe ・ Atari Mega4/MegaFile30/SM124 ・ Amiga 1200/PPC ・ Amiga 500 ・ C64 ・ ZX Spectrum ・ RPi ・ MagiC! ・ MiNT 1.18 ・ OS X

