Bad Mood : Falcon030 'Doom'
Moderators: Zorro 2, Moderator Team
-
- Fuji Shaped Bastard
- Posts: 3988
- Joined: Sat Jun 30, 2012 9:33 am
Re: Bad Mood : Falcon030 'Doom'
I found a large number of optimizations which could be made to the DSP code; some I have applied now and others I have just noted. While it's tempting to apply all of them just now, it's not making any difference at all to the framerate - more evidence that the DSP is idling a lot of the time and isn't getting in the way of the CPU (this could be seen from the profile results anyway, but it's a kind of confirmation by other means).
At some point later it will be worth completing these optimizations just to get the code size down and make more room for extensions or other improvements. For now it's more useful to look at CPU load, bus activity, host port exchange points and concurrency.
I now have a list of areas where time is being wasted in complex scenes. It won't be very easy to rework the whole thing but there are several stages needing changed and in an obvious order. It can be done incrementally. It should start by moving linedef & segment processing to the DSP and reducing the amount of traffic for each wall segment to just world x/z, and floor/ceiling heights for the current and adjacent sectors (or something similar). Apart from reducing the amount of CPU code involved (especially big mul/div patterns) and reducing the amount of stuff being transmitted, it also removes some bidirectional CPU/DSP operations which interfere with concurrency, and removes redundant/repeat z-divisions. The only thing the CPU should read back, if anything, is a shortcut code for the wall and even that can be deferred, and then optionally wall columns for drawing.
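As a rough illustration of the reduced per-segment traffic described above, the sketch below packs one wall segment into a handful of 24-bit host-port words. The field list (two endpoints in world x/z plus floor/ceiling heights for the current and adjacent sector), the signed 24-bit encoding and every name here are assumptions made for the example, not BadMooD's actual wire format.

```python
from dataclasses import astuple, dataclass
from typing import List

WORD_BITS = 24                      # DSP56k word size
WORD_MASK = (1 << WORD_BITS) - 1
SIGN_BIT = 1 << (WORD_BITS - 1)


def to_word(value: int) -> int:
    """Encode a signed integer as a 24-bit two's-complement word."""
    if not -SIGN_BIT <= value < SIGN_BIT:
        raise ValueError("value does not fit in 24 bits")
    return value & WORD_MASK


def from_word(word: int) -> int:
    """Decode a 24-bit two's-complement word back to a signed integer."""
    return word - (1 << WORD_BITS) if word & SIGN_BIT else word


@dataclass(frozen=True)
class WallSegment:
    # Hypothetical layout: endpoints in world space, then sector heights.
    x1: int
    z1: int
    x2: int
    z2: int
    floor_cur: int
    ceil_cur: int
    floor_adj: int
    ceil_adj: int

    def pack(self) -> List[int]:
        """Words the CPU would write to the host port for this segment."""
        return [to_word(v) for v in astuple(self)]

    @classmethod
    def unpack(cls, words: List[int]) -> "WallSegment":
        """What the DSP side would reconstruct from the same words."""
        return cls(*(from_word(w) for w in words))
```

With this layout a segment costs eight words in one direction and nothing has to come back before the next one is sent.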
The walls also need mipmaps and a prelighting cache (for the smaller mips at least), as well as a few other rendering optimizations.
Still have a few naming issues to tidy up and will start on the next pass, mainly for walls.
d:m:l
Home: http://www.leonik.net/dml/sec_atari.py
AGT project https://bitbucket.org/d_m_l/agtools
BadMooD: https://bitbucket.org/d_m_l/badmood
Quake II p/l: http://www.youtube.com/playlist?list=PL ... 5nMm10m0UM
-
- Atari Super Hero
- Posts: 926
- Joined: Thu Sep 11, 2003 10:49 pm
- Location: UK
Re: Bad Mood : Falcon030 'Doom'
At the risk of sounding like a clueless end user...
If the DSP had DMA access to ST-RAM, would it have made the whole thing easier and faster?
From my limited knowledge the CPU feeds data to the DSP SRAM through the host port? Then the DSP does its thing and feeds the results back to the CPU through this host port? But the port is slow, very slow and only 8-bit wide?
Just interested in how the architecture works

-
- Fuji Shaped Bastard
- Posts: 3988
- Joined: Sat Jun 30, 2012 9:33 am
Re: Bad Mood : Falcon030 'Doom'
EvilFranky wrote: At the risk of sounding like a clueless end user...
Don't worry - valid questions. But the answers are more complicated.
EvilFranky wrote: If the DSP had DMA access to ST-RAM, would it have made the whole thing easier and faster?
There is DMA access, in the sense that the CPU can DMA stuff to the DSP instead of direct to the DAC (which is what happens for sample audio), but the DMA transport operates at audio bandwidth (50kHz/16bit/stereo) so at maximum speed it can only send a few kbytes per video refresh.
DMA is efficient, but not very fast for quantities of data. It's good for continuous 'background' transfers at low bandwidth since the CPU doesn't have to sync very often, but the low rate limits its use for rendering or short timely transfers.
It might be good for streaming small batches of self-contained 3D geometry - vertexbuffers, mesh data. But it's hard to use effectively in Doom because there is actually very little geometry per thing drawn, and constant feedback between what's just been drawn, and what's permitted to be drawn next (occlusion side effects) - results of drawing each wall actually affect which route the BSP-walk takes next while composing a scene, which would mean cancelling (or adding redundancy to) DMA transfers started ahead of time. You could trade occlusion accuracy for streaming but I don't think it would pay for itself (pixel-precision occlusion saves a lot of wasted processing).
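To make the feedback problem concrete, here is a toy front-to-back walk. It is only a sketch: the `Node` layout, the pre-projected screen column ranges and the set-based occlusion record are all invented for illustration and are not how BadMooD (or Doom) stores things. The point is just that whether a subtree gets visited depends on what was drawn before it, so data for later nodes can't safely be streamed ahead of time.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Node:
    name: str
    x0: int                       # first screen column covered (assumed pre-projected)
    x1: int                       # last screen column covered, inclusive
    solid: bool = False           # a leaf whose walls fully close its columns
    front: Optional["Node"] = None
    back: Optional["Node"] = None


def walk(node: Optional[Node], closed: Set[int], drawn: List[str]) -> None:
    """Visit nearer children first; skip subtrees hidden by earlier drawing."""
    if node is None:
        return
    columns = set(range(node.x0, node.x1 + 1))
    if columns <= closed:
        return                    # every column already occluded: prune
    if node.front is None and node.back is None:
        drawn.append(node.name)   # leaf: "draw" it
        if node.solid:
            closed.update(columns)  # drawing feeds back into later decisions
        return
    walk(node.front, closed, drawn)
    walk(node.back, closed, drawn)


def visible_leaves(root: Node) -> List[str]:
    """Names of the leaves that actually get drawn, in draw order."""
    drawn: List[str] = []
    walk(root, set(), drawn)
    return drawn
```

If the near leaf is solid, the far leaf is never touched; if it isn't, the far leaf is drawn too. Any transfer queued for the far leaf before the near one finished would have been wasted in the first case.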
The host port is much, much faster than DMA but it needs to be used with care. It is 24 bits wide, broken into 8-bit registers. It has been buffered (with some kind of FIFO/pipe) in such a way that normal 16-bit/word transfers are very quick (apparently quicker than the same kind of access to ST-RAM). This is why the DSP texturing works so well - copying each pixel direct from the host port to the screen.
So the host port is a good way to get stuff in/out in bursts (or small pieces) with a high transfer rate, at the cost of tying up the CPU for the transfer time. Fortunately the transfer time can be tiny if you're just sending a few words. If you want to exchange a large amount of data in a short time, you have no choice but to tie up the CPU. So you need to make it worthwhile (ideally this is done only where the CPU would have been tied up anyway - e.g. drawing).
EvilFranky wrote: From my limited knowledge the CPU feeds data to the DSP SRAM through the host port? Then the DSP does it's thing and feeds the results back to the CPU through this host port? But the port is slow, very slow and only 8-bit wide?
It feeds data to the DSP via a register, and the DSP picks it up on the other side from an equivalent register. The other direction works the same way. So there's no direct access to ram on either side. DMA is the closest thing to 'direct' ram access but even that isn't exactly direct, esp. on the DSP side where the stuff gets intercepted.
By designing your program to avoid the pattern above (send, wait, receive) you can get good concurrency and a lot of work out of the DSP. I did a reasonable job with this the first time round, but not perfect - and the split between CPU and DSP is too low/fine-grained for what's going on. Will be fixing most of that this time round.
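The cost of the send/wait/receive pattern can be shown with a very small timing model. This is a sketch under made-up assumptions: time is in arbitrary units, every job costs the same, and the CPU has some independent work of its own per job that it can do while the DSP is busy. None of the numbers come from BadMooD.

```python
def blocking_time(jobs: int, send: int, dsp: int, recv: int, cpu_work: int) -> int:
    """CPU sends, sits idle until the DSP finishes, reads back, then does its own work."""
    return jobs * (send + dsp + recv + cpu_work)


def overlapped_time(jobs: int, send: int, dsp: int, recv: int, cpu_work: int) -> int:
    """CPU sends, does its own work while the DSP runs, and only then reads back."""
    now = 0
    for _ in range(jobs):
        now += send
        dsp_done = now + dsp          # DSP starts as soon as the data arrives
        now += cpu_work               # CPU keeps itself busy meanwhile
        now = max(now, dsp_done)      # waits only if the DSP is still running
        now += recv
    return now
```

For example, with 100 jobs, 2 units to send, 10 of DSP work, 2 to receive and 10 of CPU work per job, the blocking version takes 2400 units and the overlapped one 1400. With no CPU work to overlap, the two are identical, which is why the split between CPU and DSP matters as much as the transfer speed.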
d:m:l
-
- 10 GOTO 10
- Posts: 3362
- Joined: Fri Oct 04, 2002 11:23 am
- Location: Warsaw, Poland
Re: Bad Mood : Falcon030 'Doom'
dml wrote: the DMA transport operates at audio bandwidth (50kHz/16bit/stereo)
True, but please multiply your calculations by a factor of 8 - the maximum number of audio channels transferred by DMA - 49170Hz/16bit/8 channels. That is more or less 784 kbytes/sec.
dml wrote: so at maximum speed it can only send a few kbytes per video refresh.
Therefore at a frame rate of 10 FPS, DMA can transfer about 78 kbytes to the DSP per frame.
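The arithmetic behind both estimates is easy to check. The sketch below uses the 49170 Hz sample rate and 16-bit samples quoted in the thread; the function names, and the 50 Hz refresh used for the stereo case, are assumptions for illustration.

```python
SAMPLE_RATE_HZ = 49170      # sample rate quoted in the thread
BYTES_PER_SAMPLE = 2        # 16-bit samples


def dma_bytes_per_second(channels: int) -> int:
    """Raw DMA payload rate for a given number of audio channels."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels


def dma_bytes_per_frame(channels: int, frames_per_second: float) -> float:
    """How much can be moved in the time of one frame (or one refresh)."""
    return dma_bytes_per_second(channels) / frames_per_second
```

Stereo at a 50 Hz refresh comes to roughly 3.9 kbytes per refresh, which matches "a few kbytes". Eight channels give 786720 bytes/sec, close to the figure quoted above, and about 78.7 kbytes per frame at 10 FPS.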
BTW, I love AtariForum for technical threads like this, and I'm waiting for the new dedicated DSP thread you mentioned.

thanks
ATW800/2 / V4sa / Lynx I / Mega ST 1 / 7800 / Portfolio / Lynx II / Jaguar / TT030 / Mega STe / 800 XL / 1040 STe / Falcon030 / 65 XE / 520 STm / SM124 / SC1435
DDD HDD / AT Speed C16 / TF536 / SDrive / PAK68/3 / Lynx Multi Card / LDW Super 2000 / XCA12 / SkunkBoard / CosmosEx / SatanDisk / UltraSatan / USB Floppy Drive Emulator / Eiffel / SIO2PC / Crazy Dots / PAM Net
http://260ste.atari.org
-
- Fuji Shaped Bastard
- Posts: 3988
- Joined: Sat Jun 30, 2012 9:33 am
Re: Bad Mood : Falcon030 'Doom'
Cyprian wrote: true, but please multiply your calculations by factor 8 - max number of audio channels transferred by DMA - 49170/16bit/8channels. It is more or less 784 kbytes/sec.
That's a good point - I completely forgot about the multitrack thing on the Falcon. 78k/frame is definitely more usable for streaming blocks of stuff. Might be good for compressed textures for example if they can be scheduled early enough.
Cyprian wrote: BTW. I love AtariForum for a such technical threads, and I'm waiting for mentioned by you a new DSP dedicated thread thanks
Yes, sometimes the best way to fix your assumptions is to say them out loud.

d:m:l
-
- Hardware Guru
- Posts: 4725
- Joined: Sat Sep 10, 2005 11:11 am
- Location: Kosice, Slovakia
Re: Bad Mood : Falcon030 'Doom'
Cyprian wrote: true, but please multiply your calculations by factor 8 - max number of audio channels transferred by DMA - 49170/16bit/8channels. It is more or less 784 kbytes/sec.
EDIT: Ahem. I thought this calculation was wrong but no, it's right. I must be getting senile. For some reason I felt the host port was about 1 MB/s, which is nonsense.
Btw Doug, while reading about your progress I keep wondering whether you aren't over-complicating things here. Literally everyone has reached the same point in DSP coding after some time - squeeze as much data into the DSP as possible, let the 030 clear the damn screen buffer and then render textures directly from the DSP. Of course, this doesn't apply to you 100%, as you don't need to clear the buffer but rather draw directly from the DSP + you've got sprites + a "little" more textures, but maybe you can render it partially (for each texture / screen area etc - no clue how the Doom rendering code works).
Doing complicated synchronizations, a lot of CPU<->DSP transfers, ... that always only ends up in a slowdown. CPU ain't good for anything but transferring data from DSP :))
-
- 10 GOTO 10
- Posts: 3362
- Joined: Fri Oct 04, 2002 11:23 am
- Location: Warsaw, Poland
Re: Bad Mood : Falcon030 'Doom'
mikro wrote: Cyprian wrote: true, but please multiply your calculations by factor 8 - max number of audio channels transferred by DMA - 49170/16bit/8channels. It is more or less 784 kbytes/sec. EDIT: Ehem. I thought this is wrong calculation but no, it's right. I'm getting senile or what. For some reason I felt host port is about 1 MB/s what is nonsense.
That's a good point; the question is how fast the host port REALLY is.
DML mentioned on the Hatari list that the host port is faster than ST-RAM. As far as I remember the CPU has access to ST-RAM every 4th clock, which gives between 6 and 8 MB/s (after deducting what is stolen by Videl and memory refresh).
-
- Hardware Guru
- Posts: 4725
- Joined: Sat Sep 10, 2005 11:11 am
- Location: Kosice, Slovakia
Re: Bad Mood : Falcon030 'Doom'
Cyprian wrote: that's good point, question how REALLY fast is host port. DML mentioned on Hatari list that Host Port is faster than ST RAM. As far as I remember CPU has access every 4th clock to ST-RAM, it is between 6~8 MB/s (if you deduct what was stolen by Videl and memory refresh)
Unfortunately, it's much slower. Access to ST-RAM isn't 8 MB/s ;) It's about 5 MB/s and goes drastically down with better (320x240x16bit for example) resolutions. Reading (!) from ST-RAM is faster than from the DSP, i.e. host port transfer speed without additional handshake is about 3 MB/s.
EDIT: to explain, the slowdown happens not only because of Videl and refresh. Don't forget that to read a value from RAM you need an instruction. And these instructions are really costly (13 cycles for a simple move outside the instruction cache). Plus a couple of other things; you can find a document about it on my website.
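A back-of-envelope ceiling shows why instruction cost alone limits throughput. The sketch assumes the stock 16 MHz 68030 clock and one transfer per 13-cycle move as stated above; it ignores cache hits, loop overhead and bus contention, so it is not a reconstruction of the measured figures, and the function name is made up.

```python
CPU_CLOCK_HZ = 16_000_000   # stock Falcon030 CPU clock (assumption for this estimate)
CYCLES_PER_MOVE = 13        # uncached simple move, figure taken from the post


def bytes_per_second(clock_hz: int, cycles_per_move: int, bytes_per_move: int) -> float:
    """Upper bound on copy throughput if every move costs a fixed number of cycles."""
    return clock_hz / cycles_per_move * bytes_per_move
```

Under these assumptions a loop of uncached longword moves tops out near 4.9 MB/s and word moves near 2.5 MB/s, before Videl or refresh steal anything.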
-
- Fuji Shaped Bastard
- Posts: 3988
- Joined: Sat Jun 30, 2012 9:33 am
Re: Bad Mood : Falcon030 'Doom'
mikro wrote: Btw Doug, while reading about your progress I'm still thinking whether you don't over-complicate things here. Literally everyone has reached the same point in DSP coding after some time - squeeze as much data in DSP as possible, let 030 clear the damn screen buffer and then render textures directly from DSP. Of course, this doesn't apply to you for 100% as you don't need to clear the buffer but rather draw directly from DSP + you've got sprites + a "little" more textures but maybe you can render it partially (for each texture / screen area etc - no clue how Doom rendering code works).
The primary problem with Doom on the DSP is random access to a large dataset. It is this alone which makes a 'fully DSPd' implementation very difficult. So the end result is something more complicated than would be nice.
Doom is very view-dependent by nature and there can be a lot of data referenced in one image. The WADs alone are a few hundred K for a level - some are 20 MB in size for a level set - that includes texture sharing, and excludes texture patch generation at runtime! So at some level, the CPU has to deal with that mess and provide the right/filtered data to the DSP (either with or without the help of the DSP - pros and cons involved with that). Ideally this step is only done once too, because the scene walk is expensive enough to count if repeated.
It's difficult to have the DSP 'drive' that process completely because at best it would have to ask the CPU for things without itself making any random access to the BSP & other data - using the CPU as a sort of query server. Unfortunately having the CPU respond to a range of DSP queries isn't ideal either, as the CPU is slow to respond with anything.
There are also some difficulties with textures - floors use relatively few textures per scene and they tile like crazy. A small amount of data goes a long way and direct texturing works well. Walls are the opposite - it's a huge random access dataset once again, using a 'patch cache' which makes unique surface textures out of smaller patches for individual walls (and the textures are much bigger too). Mips would help reduce the amount of data needing sent - but having the CPU guess which mips to use and when, is again expensive and complicated and may not reduce it enough.
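For a sense of how much data mips could save, here is a minimal sketch of picking a mip level from the texel-to-pixel ratio and the resulting texture size. The selection rule (halve until fewer than two texels map to one pixel), the one-byte texels and the function names are assumptions for illustration, not the scheme used in BadMooD.

```python
def mip_level(texels_per_pixel: float, num_levels: int) -> int:
    """Pick the smallest mip that still gives under two texels per screen pixel."""
    level = 0
    while level < num_levels - 1 and texels_per_pixel >= 2.0:
        texels_per_pixel /= 2.0
        level += 1
    return level


def mip_bytes(width: int, height: int, level: int, bytes_per_texel: int = 1) -> int:
    """Size of one mip level; each level halves both dimensions."""
    return max(1, width >> level) * max(1, height >> level) * bytes_per_texel
```

A 128x128 texture is 16384 bytes at level 0 but only 1024 bytes at level 2, so a distant wall needs a sixteenth of the data. The catch described above remains: something still has to evaluate the ratio per wall before the right level can be sent.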
So the best balance seems to be having the CPU deal with top level sceneview duties - the BSP walk and sector wall iteration, which involves a few hundred to maybe 1k operations per scene frame. The DSP does as much of the rest as possible - tens to hundreds of k operations per scene frame. Walls columns probably still need to be done by the CPU, at least simplified texture plotting from the large dataset. The DSP can do the texture addressing or just use the time between columns (as it does now).
The main bottlenecks are living in the 'middle layer' places which haven't been properly converted to DSP yet - lots of projection, side testing etc. per portion of wall (x2 typically since walls come in sets of 1-3 sections at a time) and then shoving too much at the DSP per wall, with which it quickly fills & defrags its spanbuffer. The CPU code for that stuff is large, doesn't cache and is cycle-heavy with muls & divs. Fixing it should make a big difference, and would achieve the plan above - the CPU is left just doing the scene walk, at least until rendering.
If the massive texture set was put aside, the DSP could compose a complete scene being fed only BSP-walk map data, and probably render it out as tiles quite easily.
Will see how it goes, and will keep an eye on other avenues to exclude the CPU and involve more DSP (I have a few lined up already which will definitely cut wall rendering cost even without the fixes mentioned above).
d:m:l
-
- Atari Super Hero
- Posts: 926
- Joined: Thu Sep 11, 2003 10:49 pm
- Location: UK
Re: Bad Mood : Falcon030 'Doom'
Thanks for your explanation Doug, appreciate you taking the time to answer my 'numpty' questions 

-
- Fuji Shaped Bastard
- Posts: 3999
- Joined: Sun Jul 31, 2011 1:11 pm
Re: Bad Mood : Falcon030 'Doom'
dml wrote: If the massive texture set was put aside, the DSP could compose a complete scene being fed only BSP-walk map data, and probably render it out as tiles quite easily.
This is probably a stupid question as I haven't done any of that kind of graphics stuff, but why does the DSP need to handle the texture data? Wouldn't texture ID, size and position in addition to the scene geometry be enough for DSP calculations? Could it use those to provide the CPU with some kind of generalized lighting values, along with texture ID & offsets, which the CPU would then plot to screen at the indicated place using the actual texture data?
Alpha / partial transparency would of course be a problem with that, but maybe one could just decide not to support them...

-
- Fuji Shaped Bastard
- Posts: 3988
- Joined: Sat Jun 30, 2012 9:33 am
Re: Bad Mood : Falcon030 'Doom'
Eero Tamminen wrote: Wouldn't texture ID, size and position in addition to the scene geometry be enough for DSP calculations? Could it with those provide CPU some kind of generalized lighting values, with texture ID & offsets which CPU would then plot to screen on indicated place using the actual texture data?
That's basically what it does at the moment, for walls anyway. Actually there are 2 schemes - one just has the DSP pass span data and gradients to the CPU and the CPU does the actual addressing and plotting - and the other has the DSP do the addressing, leaving only plotting to the CPU (I call this the 'hybrid' texturing method in BM).
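The difference between the two schemes can be sketched for a single wall column. Everything here is an assumption made for the example: the 16.16 fixed-point format, the wrap-around texture addressing and the function names. Lists stand in for the texture column and the screen column.

```python
from typing import List

FRAC_BITS = 16   # 16.16 fixed point, chosen only for this sketch


def plot_column(texture: List[int], screen: List[int], y_start: int,
                count: int, v_start: int, dv: int) -> None:
    """Scheme 1: the DSP sends span data and a gradient, the CPU addresses and plots."""
    v = v_start
    for y in range(y_start, y_start + count):
        screen[y] = texture[(v >> FRAC_BITS) % len(texture)]
        v += dv


def hybrid_offsets(count: int, v_start: int, dv: int, height: int) -> List[int]:
    """Scheme 2 ('hybrid'): the DSP side turns the gradient into texel offsets."""
    return [((v_start + i * dv) >> FRAC_BITS) % height for i in range(count)]


def plot_from_offsets(texture: List[int], screen: List[int], y_start: int,
                      offsets: List[int]) -> None:
    """Scheme 2, CPU side: only a lookup and a store per pixel are left."""
    for i, offset in enumerate(offsets):
        screen[y_start + i] = texture[offset]
```

Both routes put the same pixels on screen; the second moves the per-pixel address arithmetic off the CPU at the cost of sending one offset per pixel instead of one gradient per span.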
It originally did this for floors too, but with recent changes it was possible to keep the active floor texture on the DSP and made quite a difference.
Eero Tamminen wrote: Alpha / partial transparency would of course be a problem with that, but maybe one could just decide not to support them...
It's possible to mix schemes and select the best one for each task. So you don't need to give up anything, just a little performance for some specific surfaces. The lava stuff worked that way too - expensive shader, but not used everywhere.
d:m:l
-
- Fuji Shaped Bastard
- Posts: 3988
- Joined: Sat Jun 30, 2012 9:33 am
Re: Bad Mood : Falcon030 'Doom'
mikro wrote: Btw Doug, while reading about your progress I'm still thinking whether you don't over-complicate things here.
I meant to add - it's true that it is too complex in places but I expect much of that complexity will vanish as more of the middle stuff is moved to the DSP.
The reason for some of this complexity is related to where the project was left, having converted chunks of it (mostly bottom-up) from 68k to DSP over a few sessions - with a few higher level operations DSP'd as separate functions. The job wasn't really completed so the bit in the middle is messy and becomes a bottleneck as the surface count increases.
It's difficult to draw a clean diagram of the pipeline just now - which hints at the complexity of it. The intended, final layout however is much easier to draw and mostly flows in one direction (pixel plotting, and BSP node visibility checking are the only exceptions).
d:m:l
-
- Fuji Shaped Bastard
- Posts: 3999
- Joined: Sun Jul 31, 2011 1:11 pm
Re: Bad Mood : Falcon030 'Doom'
dml wrote: It's difficult to draw a clean diagram of the pipeline just now - which hints at the complexity of it.
Hatari's new callgraph code could maybe help here?

-
- Fuji Shaped Bastard
- Posts: 3988
- Joined: Sat Jun 30, 2012 9:33 am
Re: Bad Mood : Falcon030 'Doom'
Eero Tamminen wrote: dml wrote: It's difficult to draw a clean diagram of the pipeline just now - which hints at the complexity of it. Hatari's new callgraph code could maybe help here?
Hehe. It's not that I don't know how it works (that would probably be bad) - it's more that it's difficult to lay out in a tidy way.

However the hatari profiler has been pretty good at finding things so far - nothing ground breaking but a few things I didn't know (either about BM, or about the Falcon HW itself).
I have attached a very, very approximate diagram of the main stages. The size of the arrows indicates the rough amount of traffic per frame for a simple scene. The arrow colours show how badly the traffic scales with scene complexity. The big box indicates the bits which are still CPU based, which will become DSP based.
I wouldn't take the diagram too seriously - it was very quickly done and doesn't really do it justice but it does help illustrate why framerate drops excessively with complex views, even if the pixel plotting cost is approximately constant (not quite but it scales relatively well). It also shows that key areas need to be moved to the DSP.
d:m:l
-
- Fuji Shaped Bastard
- Posts: 3988
- Joined: Sat Jun 30, 2012 9:33 am
Re: Bad Mood : Falcon030 'Doom'
A rewrite of the early DSP based BSP-node bounding box visibility test (the first operation to get DSP'd in the original project) has reduced the cost of visibility determination quite significantly.
For a really slow scene (400 visible ssectors from around 700 visited), the visibility testing time has been cut by 40%. For the e1m1 startpoint it's now making up about 2.5% of total time.
The visibility test wasn't one of the major bottlenecks, but was the only one involving the DSP to any degree and is right at the top of the chain for all other drawing. It involved a projection step for node bounding boxes which worked in terms of vertex pairs for each bounding edge with z-clipping. The new one has a noclip fast path and works with single vertices, only generating extra vertices in the slower path and only where needed (i.e. a lot of expensive divides removed - almost halved).
The visibility test now has several fast paths before the final occlusion buffer test - which is expensive. It's only used as the final step for otherwise visible nodes. The occlusion test now checks 4 display columns at a time instead of 1, with a much lower cost per column.
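One way to test several columns at once is to keep one "fully occluded" bit per display column and compare aligned groups of four bits in a single step. The sketch below is a deliberately simplified stand-in: the class, its one-bit-per-column layout and the method names are assumptions, and a real occlusion buffer for Doom-style walls would track more than a single flag per column.

```python
class OcclusionBuffer:
    """One bit per screen column; a set bit means the column is fully occluded."""

    def __init__(self, width: int) -> None:
        self.width = width
        self.closed = 0

    def close_columns(self, x0: int, x1: int) -> None:
        """Mark columns x0..x1 (inclusive) as fully occluded."""
        self.closed |= ((1 << (x1 - x0 + 1)) - 1) << x0

    def any_open(self, x0: int, x1: int) -> bool:
        """True if any column in x0..x1 can still receive pixels."""
        x = x0
        while x <= x1:
            if x % 4 == 0 and x + 3 <= x1:
                # Fast path: four aligned columns checked with one compare.
                if (self.closed >> x) & 0xF != 0xF:
                    return True
                x += 4
            else:
                # Unaligned head or tail: fall back to one column at a time.
                if not (self.closed >> x) & 1:
                    return True
                x += 1
        return False
```

A node whose projected column range returns False from `any_open` contributes nothing to the scene and can be skipped along with everything beneath it.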
The whole visibility checking step is pixel (column) accurate so only nodes which contribute something to the scene end up having their walls processed. This test and the feedback it gets from the occlusion buffer is what makes it possible to draw a scene with 700+ sectors in it (thousands of wall segments) without freezing up.
Turning this thing off in stages - the occlusion buffer test - the 2D projection test - the fast octant test - results in more and more stuff being processed - the framerate craters with it fully off (almost frozen). Get this bit even slightly wrong and there's almost no point in messing with the rest of it

d:m:l
-
- Fuji Shaped Bastard
- Posts: 3988
- Joined: Sat Jun 30, 2012 9:33 am
Re: Bad Mood : Falcon030 'Doom'
Here's an interim build which is relatively stable and has the recent changes rolled in.
The main visible differences (compared with v3.07) are:
- speed, particularly with large floor/ceiling surface areas, and in complex scenes
- mipmaps reducing noise in the floor/ceiling
- sky, lava and sprites are disabled pending more work
- mouse input is disabled for now, mainly to stop it interfering with profiling
It seems to run faster on a real machine for simpler scenes, but faster in Hatari for complex scenes. Probably related to the small amount of FPU code still in there affecting walls.
This is the last update before a big chunk of stuff gets reimplemented.
d:m:l
-
- Atari Super Hero
- Posts: 559
- Joined: Wed Dec 01, 2004 12:13 am
- Location: Germany
Re: Bad Mood : Falcon030 'Doom'
It is a shame you are doing all this now, 20 years after I bought my Falcon, because I gave it away a few days ago. I never thought the Bad Mood project would be going again after so many years of sleep.

-
- Captain Atari
- Posts: 269
- Joined: Mon Apr 24, 2006 10:00 pm
- Location: Netherlands
Re: Bad Mood : Falcon030 'Doom'
Omikronman wrote: It is a shame you do all this now, 20 years after I bought my Falcon because I gave it away some days ago. Never thought the bad mood project would ever been going on after so many years of sleep.
Good things come to those who wait!

-
- Fuji Shaped Bastard
- Posts: 3988
- Joined: Sat Jun 30, 2012 9:33 am
Re: Bad Mood : Falcon030 'Doom'
Omikronman wrote: It is a shame you do all this now, 20 years after I bought my Falcon because I gave it away some days ago. Never thought the bad mood project would ever been going on after so many years of sleep.
All I can say to this is: 'maybe this time', and 'real life takes no prisoners'.

d:m:l
Home: http://www.leonik.net/dml/sec_atari.py
AGT project https://bitbucket.org/d_m_l/agtools
BadMooD: https://bitbucket.org/d_m_l/badmood
Quake II p/l: http://www.youtube.com/playlist?list=PL ... 5nMm10m0UM
-
- Atari God
- Posts: 1223
- Joined: Wed Nov 20, 2002 11:22 pm
- Location: France
Re: Bad Mood : Falcon030 'Doom'
Omikronman wrote: It is a shame you do all this now, 20 years after I bought my Falcon, because I gave it away some days ago. Never thought the Bad Mood project would ever be going on after so many years of sleep.
And that seems to put you in a reaeaeally bad mood.

(yeah, that's an easy one)
-= Personal pages hub = YM-Rockerz =-
-
- Fuji Shaped Bastard
- Posts: 3988
- Joined: Sat Jun 30, 2012 9:33 am
Re: Bad Mood : Falcon030 'Doom'
I found 3 more strange things in the DSP code that seem pointlessly expensive. I can't say that it will help performance in a big way but I'm going to fix them anyway.
1) It seems I came up with a weird independent solution to perspective correction, which happens to be more costly than doing it properly (for setup anyway, the main bit is similar). I guess it was a result of experiments at the time, instead of reference.
2) There's a redundant division in the wall init code. No explanation for that.
3) The linear (?) fastpath for perspective correction still has a divide in it. I probably set up the fastpath, simplified it a bit and forgot to convert it fully to affine mapping.
I doubt fixing this stuff will speed things up immediately since the costs are almost completely hidden behind CPU drawing, but in scenes with lots of walls and very short columns of a few pixels this might start to matter, as excess DSP work isn't hidden so well there. Strange that I'd leave it like that but I guess I was doing lots of different things at once in spare time and wasn't very organised about it.
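For anyone following along, the difference between the perspective-correct path and an affine fastpath (points 1 and 3) can be sketched roughly as below. This is only an illustration in Python, not the DSP code: the function names, the floating-point maths and the per-column layout are all assumptions made for the example. The perspective version interpolates 1/z and u/z linearly across the screen and pays one divide per column; the affine version just steps u linearly and has no divide in the loop.

```python
def perspective_columns(u0, z0, u1, z1, n):
    """Texture u for n screen columns (n >= 2), perspective-correct.

    1/z and u/z are linear in screen space, so both are stepped
    linearly and u is recovered with one divide per column.
    """
    iz0, iz1 = 1.0 / z0, 1.0 / z1      # setup: reciprocal depth at each end
    uz0, uz1 = u0 * iz0, u1 * iz1      # setup: u/z at each end
    step_iz = (iz1 - iz0) / (n - 1)
    step_uz = (uz1 - uz0) / (n - 1)
    out = []
    iz, uz = iz0, uz0
    for _ in range(n):
        out.append(uz / iz)            # the per-column divide
        iz += step_iz
        uz += step_uz
    return out


def affine_columns(u0, u1, n):
    """Texture u for n screen columns (n >= 2), affine: no divide in the loop."""
    step = (u1 - u0) / (n - 1)
    return [u0 + i * step for i in range(n)]
```

When both ends of the wall are at the same depth the two functions give the same result, which is the case where a fastpath can safely drop the divide. With different depths (say z going from 1 to 4) the perspective version compresses the texture towards the far end, while the affine version keeps stepping evenly.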
[EDIT]
...I also figured out a way to almost completely hide the cost of visibility tests for nodes, by doing it predictively and reordering the results after classification. Probably won't try this until next week.
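One possible reading of "predictive" here, sketched in Python purely as an illustration: start the visibility tests for both children of a node in a fixed order before the viewer's side of the partition line is known, then swap the results into near/far order once classification is done. The node layout, the `visible` callback and the sign convention in `classify` are all assumptions for the example, and Python runs this sequentially, so it only shows the reordering and not the actual overlap between CPU and DSP.

```python
def classify(node, px, py):
    """Return 0 if the viewer is on the front side of the partition line, else 1.

    The sign convention here is an assumption for the sketch.
    """
    left = node["dy"] * (px - node["x"])
    right = (py - node["y"]) * node["dx"]
    return 0 if right < left else 1


def visit_children(node, px, py, visible):
    """Test both child boxes first, then reorder the results near-to-far."""
    # Predictive step: both tests are issued in fixed child order,
    # before we know which child is the near one.
    vis = [visible(node["bbox"][0]), visible(node["bbox"][1])]
    # Classification: in a concurrent setup this would overlap the tests above.
    side = classify(node, px, py)
    near, far = side, side ^ 1
    # Reorder: hand back (child index, visible flag) pairs in traversal order.
    return [(near, vis[near]), (far, vis[far])]
```

The point of the fixed order is that the tests don't have to wait for the classification result, so nothing stalls at that exchange point.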
d:m:l
Home: http://www.leonik.net/dml/sec_atari.py
AGT project https://bitbucket.org/d_m_l/agtools
BadMooD: https://bitbucket.org/d_m_l/badmood
Quake II p/l: http://www.youtube.com/playlist?list=PL ... 5nMm10m0UM
-
- Fuji Shaped Bastard
- Posts: 3988
- Joined: Sat Jun 30, 2012 9:33 am
Re: Bad Mood : Falcon030 'Doom'
I couldn't remember what Doom looked like on my old 386SX machine (long since scrapped) but I remembered it being pretty painful until I got hold of a 486. Anyway I've seen some comparisons between Falcon and 386SX/16MHz so I thought I'd look it up. The results are pretty amusing. Especially the first one.
The 33MHz version seems to run reasonably fast... until something happens.
386SX @ 16MHz / fullscreen
http://www.youtube.com/watch?v=qFG6dqQZzbU
386SX @ 25MHz / reduced window
http://www.youtube.com/watch?v=Jf21G6beseg
386SX @ 33MHz / unknown resolution
http://www.youtube.com/watch?v=vETjEH0Vo9I
A couple of other things spring to mind about the 386 - the SX didn't have a cache but the DX variants had big caches on the mainboard. Doom did run quite a bit faster on the 386DX machines as a result.
PCs never had shared video memory - they had some annoying framebuffer layouts but they were optimized for writing even back then (although not so much for reading). The video hardware didn't fetch from the system bus as it does on the Falcon, so it didn't interfere with CPU performance.
So the comparison isn't very direct: the architectures are completely different, from the CPU to RAM to video, which makes it hard to compare fairly. Still, it is funny to see how slow it really was without reducing the resolution.

d:m:l
Home: http://www.leonik.net/dml/sec_atari.py
AGT project https://bitbucket.org/d_m_l/agtools
BadMooD: https://bitbucket.org/d_m_l/badmood
Quake II p/l: http://www.youtube.com/playlist?list=PL ... 5nMm10m0UM
-
- Atari God
- Posts: 1266
- Joined: Wed Feb 11, 2004 4:34 pm
- Location: Middle Earth (Npton) UK
Re: Bad Mood : Falcon030 'Doom'
Ouch, that first video in particular was painful. Would it be reasonable to talk of frames per minute there?
And I found something that might be a closer match to what you're doing. This video link is for a modestly accelerated Amiga, which would be a 68030-based booster at 28 MHz. So a faster CPU than ours, but no DSP. It also has a shedload (32MB) of RAM in comparison.
Bizarrely, the poster takes it on herself to tell us that she's got other applications running concurrently in multitasking mode.
http://www.youtube.com/watch?v=fhI60ahCvDc
EDIT:
From the same poster, another Amiga-based Doom session that's also of interest.
Note that it's a faster Miggy than the first one, with a 50 MHz 68030 and FPU. She's running it in a window on a 1280 x 720 desktop, but it's dithered down to sixteen colours.
http://www.youtube.com/watch?v=RKixK1gx ... bA&index=3

"Where teh feck is teh Hash key on this Mac?!"
-
- Atari God
- Posts: 1291
- Joined: Wed Dec 19, 2007 8:36 pm
- Location: Sweden
Re: Bad Mood : Falcon030 'Doom'
I used to play Doom on my 386SX 20MHz and it worked fine. I remember I had to use the double pixel mode and of course I had the status bar up so it wasn't fullscreen. It was a laptop PC and since the LCD screens were so slow back then I got a natural motion blur effect, so even though the framerate wasn't so high, it wasn't so noticeable.

ST / STFM / STE / Mega STE / Falcon / TT030 / Portfolio / 2600 / 7800 / Jaguar / 600xl / 130xe