Bad Mood : Falcon030 'Doom'

All 680x0 related coding posts in this section please.

Moderators: Zorro 2, Moderator Team

User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:My head will explode if I have to do any more with mipmaps. Will finish off the loose ends, tidy it up and move onto something else.
I would suggest looking into optimizing audio and PointOnDivlineSide, as they've been most time consuming in several of the "worst" frame profiles; audio for both sides, PointOnDivlineSide (just) for the thinking part:

Code: Select all

Executed instructions:
  15.98%  15.99%  16.24%       65136     65156     66199   _subframe_block
  14.13%  14.16%  26.67%       57575     57721    108671   _P_RunThinkers
  11.37%  11.43%  13.11%       46332     46587     53447   _P_PointOnDivlineSide
Or maybe you could look into caching to reduce extra sprite loads?

After those are optimized I could do some more "worst frame" profiling, it could then show more interesting DSP side stuff as these less frequent, but heavy parts don't contribute so much to worst frames.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote: I would suggest looking into optimizing audio and PointOnDivlineSide, as they've been most time consuming in several of the "worst" frame profiles; audio for both sides, PointOnDivlineSide (just) for the thinking part:
Yes these are high on my list.

I've been focusing on the wall texture prep/drawing part for an extended period because wall drawing has been a huge singular cost - and once the work began here again it broke *lots* of stuff (sprites, transparencies, texture tiling etc.). It has recently been tricky to commit changesets for other fixes because I have so much other code held back which you really don't want to get :-)

Fortunately the sprites are now working again and most of the other problems are fixed. Wall transparencies are still broken but they don't really get in the way of testing and I'll be able to fix those once I implement a 'transparency detect' routine for the loader so it knows what texture formats/flags to apply.

Aside: I retested performance with the new index packing arrangement, and texture count limited to 16 colours. The FPS @ e1m1 rose to 8.45 - and it looks surprisingly ok for 16-colour textures! (mainly because each texture gets its own 16 cols and the original textures were sharing a 256col palette, so didn't use many each anyway) I've still to test with other more sensible values ranging up to to the default 256cols to see how that impacts performance. I expect 64cols is the lower bound for decent looking results.

Anyway the big-list-o-wall-changes is nearly complete and I'm reviewing the changeset documentation so I can push it :)

Eero Tamminen wrote: Or maybe you could look into caching to reduce extra sprite loads?
After those are optimized I could do some more "worst frame" profiling, it could then show more interesting DSP side stuff as these less frequent, but heavy parts don't contribute so much to worst frames.
In fact I didn't realize this was getting in your way until you brought it up just now. Obviously you can't do much with worst-frame profiling if it's loading even one sprite during play. Sorry about that - I'll prioritize this one next!
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

It appears the sprite precache list is complete, because it is using the same list used to register sprite definitions for all other purposes. i.e. if there were names missing from that list, they wouldn't even be drawn when needed.

However there's an ugly hack in the sprite precache routine which walks through the 'thinker' list and only precaches sprite frames from thinkers associated with the P_MobjThinker general purpose thinker module. I suspect the 'missing' sprites don't use that thinker module, and are getting skipped.

Code: Select all

[r_data.c]

    for (th = thinkercap.next ; th != &thinkercap ; th=th->next)
    {
	if (th->function.acp1 == (actionf_p1)P_MobjThinker)  // <---- ugly hack, probable cause!
	    spritepresent[((mobj_t *)th)->sprite] = 1;
    }
	
    spritememory = 0;
    for (i=0 ; i<numsprites ; i++)
    {
		if (!spritepresent[i])   // <----- test is causing some sprites to get missed
			continue;
It's also possible the missing sprites aren't even in the thinker list. Not sure how to fix this yet because the above test is probably there for a reason - mainly to limit precaching of items just from that map, but perhaps as a safety check because some thinkers don't have any sprites associated. Will take a closer look later.

Eero, if you feel like fiddling with it while I'm cleaning up the other stuff you'll now know where to find it :)
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

The latest changes are in HG now.

Mipmaps are configured to operate only on non-transparent textures. Since I can't reliably detect these yet, it assumes any single-patch texture can be transparent, and prepares the texture accordingly (multipatch textures should not be transparent - it causes a well known Doom bug if attempted - 'medusa effect' http://doom.wikia.com/wiki/Medusa_effect).

A bunch of single-patch textures which are *not* transparent will get handled incorrectly. Unfortunately, while these surfaces do still render ok they are not unrolled for tiling. This means tiling can cause brightly coloured spots/flecks on those walls. This is normal. It is ugly, but will go away when I figure out how to detect the transparent textures at loadtime and classify everything correctly.

I can hack the code to mipmap all textures but then transparencies stop working and there's a small - but real - risk of a crash while drawing so leaving things this way is better for now.

Since mipmap generation is a loading process, and is very expensive, loading takes longer than before. As explained previously, all of that work will be offlined later and should only happen on the first run, cached for future use.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero, a fix is in for the uncached sprites.

It's just a 'fixed list' of sprite enums we know of, applied on top of the existing rules. It's not ideal but should solve the problem for now. See if it works for you.

I applied the fix close to the last revision you were working with (#409a25cb587c), but also merged it with the tip following the latest changes. Just in case the new code causes any problems.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Working on a DSP replacement for PIT_AddLineIntercepts which seems to be where most of the performance pain is coming from re: P_PointOnDivlineSide/P_PointOnLineSide.

The code is written already but I'm preparing myself for debugging woe.

[EDIT]

Not too much debugging... in fact none. Now getting a stream of intercepts from the DSP which match the CPU version of the same, within the reduced precision expected of 24bit math (the original uses 32bit fixedpoint).

Code: Select all

        frac: cpu: 0x00000e53
        frac: dsp: 0x00001000
        frac: cpu: 0xfff2d9bf
        frac: dsp: 0xfff2de00
        frac: cpu: 0xfff8c2a0
        frac: dsp: 0xfff8c500
        frac: cpu: 0x008246bd
        frac: dsp: 0x00823100
        frac: cpu: 0xfff2d9bf
        frac: dsp: 0xfff2de00
        frac: cpu: 0xfff8c2a0
        frac: dsp: 0xfff8c500
        frac: cpu: 0xfffeab76
        frac: dsp: 0xfffeac00
        frac: cpu: 0x0004dd0c
        frac: dsp: 0x0004da00
        frac: cpu: 0x00000969
        frac: dsp: 0x00000b00
        frac: cpu: 0xffffa3b1
        frac: dsp: 0xffffa400
        frac: cpu: 0x000009be
        frac: dsp: 0x00000900
        frac: cpu: 0x000005bb
        frac: dsp: 0x00000500
        frac: cpu: 0x00006a02
        frac: dsp: 0x00006a00
        frac: cpu: 0x000061fc
        frac: dsp: 0x00006100
        frac: cpu: 0x00008213
        frac: dsp: 0x00008200
        frac: cpu: 0x00008a18
        frac: dsp: 0x00008a00
        frac: cpu: 0x0000d64e
        frac: dsp: 0x0000d600
        frac: cpu: 0x0000e659
        frac: dsp: 0x0000e600
        frac: cpu: 0x00010cea
        frac: dsp: 0x00010c00
        frac: cpu: 0x0000f02b
        frac: dsp: 0x0000f000
        frac: cpu: 0x0001037f
        frac: dsp: 0x00010300
        frac: cpu: 0xffffa6d9
        frac: dsp: 0xffffa700
        frac: cpu: 0x000009dd
        frac: dsp: 0x00000900
        frac: cpu: 0x000005cd
        frac: dsp: 0x00000500
        frac: cpu: 0x00006b4e
        frac: dsp: 0x00006b00
        frac: cpu: 0x0000728a
        frac: dsp: 0x00007200
        frac: cpu: 0x0000632f
        frac: dsp: 0x00006300
        frac: cpu: 0x000083aa
        frac: dsp: 0x00008300
        frac: cpu: 0x00008bc9
        frac: dsp: 0x00008b00
        frac: cpu: 0x0000d8ed
        frac: dsp: 0x0000d800
        frac: cpu: 0x0000e92b
        frac: dsp: 0x0000e900
        frac: cpu: 0x0000eff7
        frac: dsp: 0x0000ef00
        frac: cpu: 0x000121a5
        frac: dsp: 0x00012100
        frac: cpu: 0xffff9f56
        frac: dsp: 0xffff9f00
        frac: cpu: 0x000009b8
        frac: dsp: 0x00000900
        frac: cpu: 0x000005b8
        frac: dsp: 0x00000500
        frac: cpu: 0x000061bf
        frac: dsp: 0x00006100
        frac: cpu: 0x000069c0
        frac: dsp: 0x00006900
        frac: cpu: 0x000081c1
        frac: dsp: 0x00008100
        frac: cpu: 0x00009c5d
        frac: dsp: 0x00009c00
        frac: cpu: 0x0000e0d7
        frac: dsp: 0x0000e000
        frac: cpu: 0x0000edca
        frac: dsp: 0x0000ed00
        frac: cpu: 0xffffd0ca
        frac: dsp: 0xffffd100
        frac: cpu: 0x000004dc
        frac: dsp: 0x00000400
        frac: cpu: 0x000002dc
        frac: dsp: 0x00000200
        frac: cpu: 0x000034df
        frac: dsp: 0x00003400
        frac: cpu: 0x000030df
        frac: dsp: 0x00003000
        frac: cpu: 0x000040e0
        frac: dsp: 0x00004000
        frac: cpu: 0x000075f0
        frac: dsp: 0x00007500
        frac: cpu: 0x00007a10
        frac: dsp: 0x00007a00
        frac: cpu: 0x0000890e
        frac: dsp: 0x00008900
        frac: cpu: 0x0000a25b
        frac: dsp: 0x0000a200
        frac: cpu: 0x0000c737
        frac: dsp: 0x0000c700
        frac: cpu: 0x0000cec6
        frac: dsp: 0x0000ce00
        frac: cpu: 0x0000e2ea
        frac: dsp: 0x0000e200
The most important point is that the values are paired properly, so the DSP only yields an intercept result when the CPU does - and that means the test is working. An occasional mismatch is to be expected and the results are 8 bits less precise, but I'm reasonably happy with this so far.

Unfortunately, this is based on a simplified version of Doom's intercept test which makes sense to me. The real one uses a 2-way test which I'm still trying to understand (because it looks wrong). Another one of those. So I can't finish the code until I understand how that can work the way it is. :-z

[EDIT]

It appears that any loss of precision in this function has a bad effect on demo replay. Which is annoying, because it's very difficult to replicate the arithmetic on the DSP 100%. The DSP version is working in-game but the demos drift after 10 seconds or so, loses sync, player dies. The same happens if I use the original version and apply similar precision limits to inputs/outputs.

Some options:

- place it under TIMEBASE_CONTROL and treat it as one more incompatibility with 35hz PC demos, but probably lose out on profiling the improvement properly with current demos (Sorry Eero!)
- put in multiple code paths so it only uses the DSP path when the number ranges look ok. not sure that will work and it's not very satisfying to have it use the slow version at random times.
- try to do something silly on the DSP to emulate the CPU more closely (not tempting - it involves a 64/32bit divide)
- drop the DSP version and do a 68k version - which definitely won't be as fast but won't cause desync
- optimize the iterator/visitor instead of the vector parts, and hope the compiler does a better job of those when it gets to inline stuff properly (semi-hand-wavey option)
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

I made a hybrid version of the op which does the P_PointOnDivlineSide parts on the DSP, but the P_InterceptVector on the CPU. This should make some measurable impact without breaking the demos so much. They still break but they run long enough for profiling.

However... my profiling runs don't show those functions at all because gcc 4.6 seems to be collapsing them with the current opts. This is interesting because the PIT_AddLineIntercepts is called through a function pointer from P_BlockLinesIterator - not something I'd normally expect to see optimized away. But I suppose it's all in the one source file so it could happen...

The e1m5 demo gave me this profile using the DSP'd version of PIT_AddLineIntercepts. Since it doesn't show anywhere, I'm assuming it happens inside P_BlockLinesIterator and that's showing 1.22%.

Eero, I have committed a version of this code - can you let me know if you see any difference? Keep in mind the demo will drift a bit so it will likely end sooner than usual.

Code: Select all

Used cycles:
   9.62%   9.62%   9.64%    89191312  89276432  89377548   _audio_mux_frame
   9.41%                    87238052                       R_VisPlaneShader
   6.00%   6.30%  11.14%    55638600  58422784 103362056   _BM_P_CrossBSPNode
   5.94%                    55083236                       R_DrawSurface_NoTileNoMip
   5.67%                    52614980                       R_AdvanceSurface_TMip0
   5.38%   5.40%  32.63%    49907744  50085460 302626132   _P_RunThinkers
   4.26%                    39548340                       R_AdvanceSurface_TMip1
   3.74%   3.75%   4.09%    34650748  34795068  37973116   _memmove
   3.70%   3.71%   4.09%    34329496  34447604  37906568   stream_texture
   3.36%                    31179104                       R_SpriteColumnShader_Masked2
   3.33%   3.34%   4.14%    30881632  30989920  38404040   BM_P_CrossSubsector
   2.65%   2.66%   3.30%    24571224  24675928  30641960   _V_DrawPatch
   2.37%   2.38%   2.71%    21937096  22037704  25159320   _R_PointInSubsector
   1.98%   1.98%   2.20%    18327520  18408128  20366784   _R_DrawColumn
   1.68%                    15576024                       R_AdvanceSurface_NMip0
   1.66%                    15407280                       _BM_P_CheckSight
   1.40%   1.47%   1.65%    12998884  13645156  15289444   stack_visplane_area
   1.32%   1.33%   1.47%    12268284  12329496  13678876   _P_UpdateSpecials
   1.23%   1.24%   7.18%    11398724  11463440  66596944   _P_CheckPosition
   1.22%   1.22%   2.36%    11283612  11339416  21897724   _P_BlockLinesIterator
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

I have now gone through the game code and re-applied all the optimizations that might affect demo replay - mainly the math, vector, ray-test/point-classify areas. Each change is controlled by several build flags which divide them into categories, so they can be turned on/off individually or by some axis (e.g. area, math, vector, asm, dsp etc..).

This has helped find a few areas which were causing some desync, and have been fixed. One was a divide overflow clamp which is missing in some ports with asm optimizations, but present in the original. There were a few others.

Now I'm getting no desync at all with the 030 asm opts in place, but with the DSP opts disabled.

I'm also getting the *same* desync cases with the DSP version of PIT_AddLineIntercept enabled, providing I simplify the CPU version to use the same procedure (otherwise they are not the same - the original version does a precision/range classify before selecting one or other route to suit the number range - the DSP one uses only one route).

This seems like a good startpoint to work on the DSP opts a bit more to see if they can be made to match the originals more closely. This should make profiling easier with those enabled.

I'm also going to try using the floating point divide procedure on the DSP to get closer to the 64bit div used in the original.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Pain, pain, pain.

PIT_AddLineIntercepts is now working on the DSP for both long and short raytests without causing demo desync - and without adding much new DSP code. It turns out the precision of the inputs wasn't as important as flipping the linesegment/ray test around for long rays. This doesn't match Doom's code but it yields the same results for the demos I tested.

PIT_AddThingIntercepts looks very similar but less complicated. It can be converted too if needed.

This leaves P_CheckSight as the last DSP vector co-pro optimization which causes desync. If I can do something about that it will mean decent profiling of demos so long as the TIMEBASE_CONTROL mods are kept off (Those are game optimizations which will always cause desync and that's not going to change).
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Another painfully annoying thing.

"Bug with East-West walls in Line of Sight Calculation" http://prboom.sourceforge.net/mbf-bugs.html

I accidentally 'fixed' this C typo while converting one function to 030 asm some time ago, and it causes a minor amount of demo desync. Just enough to hide behind other kinds (Ultimate Doom e4m2/demo4 desyncs half way due to a pinkie monster getting distracted from a fight).

Of course turning that bit of asm code off reinstates the original bug, causes the desync to stop (since the demo was recorded using the buggy code). The source of the problem wasn't obvious because the asm looks right and works correctly when tested on its own.

I noticed the typo while comparing the asm and C versions, and then went googling for bug reports...
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Last night I had everything going pretty well, with all of the (so far noticed) demo desync-inducing optimizations either identified or fixed.

The two which are not fixed are PIT_AddLineIntercepts operating in 'short raycast' mode (close-combat / melee scenario), and the much more complicated P_CheckSight. The former (plus a few other minor optimizations) are disabled when building with the new OPTI_DEMOSAFE build flag. However P_CheckSight is an important optimization so I'm reluctant to treat that the same way.

The level of desync caused by PIT_AddLineIntercepts is manageable and should work fine for normal play (non-demo-safe builds), and demos recorded on the Falcon. I know the reason (fixedpoint precision / dropped fractions) and solving it isn't worth the trouble.

P_CheckSight is a bit more interesting. The level of desync seems to be also manageable but comparing results with the C version shows significant deviation for the very last part where rays are checked against blockages from floor/ceiling height differences between sectors. I'm not sure what's causing this yet. Could just be more numerical problems but I need to spend more time with it to be sure.

Something like 1/3rd of the rays classified as passing through sector junctions by the C version are passing on the DSP version, from approximately the same number of rays tested. This is a bit strange because the DSP version seems to be casting the right number of rays (+/- 2%) at walls despite processing 1/2 as many subsectors as the C version to find which walls need tested... i.e. it must be doing a lot of things right to get that far and to maintain a low level of desync, but still something strange going on.

Won't be able to look again until late this evening but once that one is explained/solved I'll commit the changes and move on.

[EDIT]

Deviation in ray count was a bug in metrics - the C and DSP versions do match up, even with my modified (ray splitting) version of the BSP algorithm. Was also able to replace a very expensive divide op with a faster method and no noticable increase in desync.

Spent a bit of time also on the CPU side and have shrunk the P_CheckSight loop code from 308 bytes to 260 bytes. If I can scrape back an extra 4 bytes it will begin to cache properly and speed things up. There is a second part which occasionally calculates vertical ray blockages from floor height differences. This is another 100 bytes or so and can't be cached but it doesn't happen so often.

[EDIT]

...256 bytes

[EDIT]

...252 bytes

Still works, not touching it any more.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

So I've done pretty much all I usefully can with the cost of P_CheckSight and to a lesser extent PIT_AddLineIntercepts without radically changing the game code. Both of these still register as significant costs when running a 35Hz tick (much of this is cache misses during AI caused by random, frequent calls to P_CheckSight from the game AI).

Since we won't be supporting that scenario, working more on these is just burning a lot of my time for not much gain so I'm considering them final versions.

I will look at P_PointInSubsector next because there are decent analogues already in 68k and/or DSP form and it's also a relatively high cost.

As for other game side stuff - not sure I will convert any more to DSP. Will rely mainly on 'activity management' techniques to keep cost down.

I just want to finish the user-facing side of the game, finish the effects and audio, fix bugs and get it 'released'.


I'm currently struggling with a weird IKBD problem which limits playability - the IKBD seems to track only 2 simultaneous alphanumeric-class keys, plus one arrow-class key plus one control/shift-class key. So a maximum of 4 keys held at once if all 3 classes of key are used together, or a maximum of 2 at once if only alphanumeric keys are used.

Worse still - the IKBD actually loses track of additional keys and responds with erratic keypresses or releases, causing 'stuck keys' and other errors or aberrations. This is pretty bad when you're in a fight. It seems to be caused by hard limits/bugs in the 6301 IKBD software as no such problems are seen under emulation.

It's notable that while the 'obvious solution' is to map all 3 classes of key to increase the limit to something usable (i.e. 4 keys), it is in this mode that IKBD responds with the most bugs. Using only alphanumeric keys causes no bugs - but is useless for gameplay.

Not sure what to do about this yet. Doom was actually best played with a good mouse/keyboard combination e.g.

(e.g. for a right-handed player)

Code: Select all

W: forward
S: backward
A: strafe left
D: strafe right

mouse: left/right turn a.k.a 'mouse look'
mouse leftbutton: shoot
mouse rightbutton: 'use' or 'run'
There are variations on this but it's a typical performance deathmatch play arrangement. Players using default keys tended not to last long in DM unless they were particularly gifted. :)

However playtesting with mouse in Hatari is also causing me various accuracy, mouse-warping and responsiveness problems (which appear to be Hatari/host-build/host-pc induced more than anything Atari related) and my real F030 mouse isn't working so well, so I can't be sure it's actually a viable arrangement on Falcon.

This is a frustrating problem because it's not one I can just fix in software, unless the 6301 code is replaced or I find some other magic fix which isn't yet obvious...
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:Working on a DSP replacement for PIT_AddLineIntercepts which seems to be where most of the performance pain is coming from re: P_PointOnDivlineSide/P_PointOnLineSide.
Sorry, I haven't had time to test this yet, and might not have for several next days either, but hopefully soon.

(The audio thing might be more interesting from performance point of view though as it can happen both during thinking and rendering...)
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:However playtesting with mouse in Hatari is also causing me various accuracy, mouse-warping and responsiveness problems (which appear to be Hatari/host-build/host-pc induced more than anything Atari related)
If I remember correctly, Hatari's checking SDL input events only at VBL, at which point it also does screen output. Your host OS and its mouse probably also have different mouse DPI and acceleration than Atari mouse.

If you see unexplained mouse warping in fullscreen, that sounds like a bug (Hatari should warp mouse only when Falcon resolution is changed, to center mouse).
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote:
dml wrote:Working on a DSP replacement for PIT_AddLineIntercepts which seems to be where most of the performance pain is coming from re: P_PointOnDivlineSide/P_PointOnLineSide.
Sorry, I haven't had time to test this yet, and might not have for several next days either, but hopefully soon.

(The audio thing might be more interesting from performance point of view though as it can happen both during thinking and rendering...)
Yeah I've been busy and distracted also.

BTW the audio is disabled by a build flag now so it doesn't have to get in the way. I'll rewrite the code soon to speed it up.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote: If you see unexplained mouse warping in fullscreen, that sounds like a bug (Hatari should warp mouse only when Falcon resolution is changed, to center mouse).
Hatari doesn't work too well in fullscreen mode on my machine so I suppose I'm having a different kind of problem in windowed mode (two mouse pointers - both of them in different places but both also warping! :) However the ALT-M shortcut does at least release the mouse back to the host so the warping stops.

In any case I think testing of mouse accuracy issues has to be done on a real box - accurate mouse in the emulator is a convenient thing but not the whole story.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:BTW the audio is disabled by a build flag now so it doesn't have to get in the way.
I prefer using full game features for "worst frame" profiling, to see what's most problematic for user. And automating testing of something else would also require some support in Makefile & profile.sh...
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote:
dml wrote:BTW the audio is disabled by a build flag now so it doesn't have to get in the way.
I prefer using full game features for "worst frame" profiling, to see what's most problematic for user. And automating testing of something else would also require some support in Makefile & profile.sh...
Ok, sure - but it may be useful to know that the (expensive) audio part which can be disabled has a very simple upper bound in terms of cost. It has only 3 modes, and all 3 have a fixed amount of work-per-second (fixed loopsize, no control flow). The part of the code which *doesn't* have fixed cost has deliberately been left running so any misbehaviour would show in the profiles.

So until the expensive bit is optimized, I don't think there is much interesting to be discovered in it. After it gets optimized, then it will still be useful to have an overall picture of load.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:Hatari doesn't work too well in fullscreen mode on my machine
Is that both with "--desktop" set to "yes" and "no"?

(With "yes" desktop resolution isn't changed, Falcon display is zoomed to integer multiple of Falcon resolution closest to desktop resolution.)

dml wrote:it may be useful to know that the (expensive) audio part which can be disabled has a very simple upper bound in terms of cost. It has only 3 modes, and all 3 have a fixed amount of work-per-second (fixed loopsize, no control flow). The part of the code which *doesn't* have fixed cost has deliberately been left running so any misbehaviour would show in the profiles.

So until the expensive bit is optimized, I don't think there is much interesting to be discovered in it. After it gets optimized, then it will still be useful to have an overall picture of load.
Sounds reasonable. Have had yet time to look into the expensive audio part?

Below is profile for Doom II timedemo from latest code (last commit 22nd of July) I got from your repo, with TIMEBASE_CONTROL = 1.

From first player shot until player dies, there are 164 frames:

Code: Select all

$ grep r_begin badmood-all-frames-CPU.txt
r_begin:
0x49362: 0x4935c = 164 n, r_begin
(74s/164frames = 2.2 FPS with WinUAE CPU core, GCC 2.x, TIMEBASE_CONTROL=1)

And on average it's spend on CPU side:

Code: Select all

Time spent in profile = 74.44504s.

Visits/calls:
- max = 228099, in R_BSPHyperPlane_RHS at 0x508da, on line 10777
- 1824964 in total
Executed instructions:
- max = 916160, in _subframe_block+268 at 0x1edfa, on line 624
- 133230582 in total
Used cycles:
- max = 22003548, in _subframe_block+278 at 0x1ee04, on line 628
- 1194284064 in total
Instruction cache misses:
- max = 213675, in _PIT_CheckThing+218 at 0x3104a, on line 3268
- 17110869 in total
...
Executed instructions:
   7.98%   7.98%   7.98%    10626773  10632373  10632373   _subframe_block
   5.83%   5.84%   6.29%     7764088   7775516   8382794   _R_PointInSubsector
   5.43%   5.43%   5.75%     7228956   7239562   7663211   D_AllocPreview_Dynamic
   5.26%                     7005327                       R_DrawSurface_NW
   4.35%                     5797668                       R_AdvanceSurface_T4
   4.26%                     5672999                       R_VisPlaneSkyShader
   4.25%   4.25%  32.65%     5658399   5662520  43503911   _P_RunThinkers
   4.11%   4.20%   8.58%     5474808   5596019  11435698   _BM_P_CrossBSPNode
   3.68%                     4907678                       R_AdvanceSurface_T2
   3.59%                     4781961                       R_AdvanceSurface_T8
   3.35%   3.35%   3.96%     4460510   4462508   5278601   BM_P_CrossSubsector
   2.54%                     3380360                       R_VisPlaneShader
   2.42%   2.42%   2.55%     3218664   3223277   3392055   D_CachePreview
   2.30%                     3068905                       R_SpriteColumnShader_Masked2
   2.27%   2.28%   2.45%     3026134   3031873   3270484   _R_DrawColumn
   2.14%   2.15%  13.87%     2854430   2862938  18480270   _P_CheckPosition
   2.00%   2.01%   2.28%     2665428   2671964   3032068   _PIT_CheckThing
   1.89%   1.89%   4.37%     2519156   2524680   5818174   _P_BlockThingsIterator
   1.87%   1.86%   2.03%     2493620   2483560   2705661 * D_AllocPreview_App
   1.85%   1.85%   1.99%     2458524   2463077   2657742   D_AllocPreview_Fixed
   1.84%   1.84%   1.98%     2448028   2452871   2636755   D_AllocPreview_Unknown
   1.66%   1.66%   1.90%     2210188   2217985   2533207   stream_texture
   1.45%   1.45%   1.68%     1927170   1930735   2234970   _V_DrawPatch
   1.33%   1.35%   1.51%     1774127   1798118   2018166   stack_visplane_area
   1.15%   1.15%   1.25%     1530120   1533695   1669126   D_AllocPreview_Free
   0.99%   1.00%   1.13%     1321888   1326252   1500847   _P_UpdateSpecials
   0.87%   0.87%   1.95%     1154043   1156168   2603846   _P_BlockLinesIterator
   0.85%   0.85%   3.40%     1125804   1128723   4535160   _R_DrawVisSprite
   0.83%   0.83%   0.94%     1100218   1102860   1257298   R_AddSpriteSpans
   0.78%                     1034759                       R_AddLine_loop
   0.75%                      994858                       R_BSPHyperPlane
   0.65%   0.65%   9.35%      860611    863842  12450850   _BM_P_CheckSight
   0.63%   0.63%   0.70%      843094    845257    929700   R_ViewTestSpriteLines
   0.55%   0.56%   0.99%      738673    740488   1318311   _PIT_CheckLine
   0.54%                      726021                       R_AdvanceSurface_T1
   0.54%   0.54%  14.51%      718790    721199  19327321   _P_Move
...
Instruction cache misses:
   8.88%   8.90%  28.53%     1519803   1523486   4880936   _P_CheckPosition
   7.47%   7.50%  61.38%     1278616   1282700  10502600   _P_RunThinkers
   6.23%   6.24%  10.38%     1065553   1068233   1776806   _P_BlockThingsIterator
   4.02%   4.04%   4.11%      688382    691890    702770   _PIT_CheckThing
   3.96%   4.00%   4.09%      676886    683668    699281   _R_PointInSubsector
   3.42%   3.43%   3.54%      585016    586358    606562   R_AddSpriteSpans
   2.85%   3.24%   7.03%      487319    554093   1202738   _BM_P_CrossBSPNode
   2.78%   2.79%   9.84%      475333    476760   1683018   _BM_P_CheckSight
   2.54%   2.55%  33.33%      434468    436061   5703712   _P_TryMove
   2.43%   2.45%   3.70%      415047    418502    633805   BM_P_CrossSubsector
   2.29%   2.30%  31.21%      392542    393777   5340179   _P_Move
   2.23%   2.24%   3.93%      382320    383235    671684   _PIT_CheckLine
   1.88%   1.89%   5.83%      322295    323554    998077   _P_BlockLinesIterator
   1.87%   1.88%   3.21%      320092    321499    549982   _R_DrawVisSprite
   1.81%                      309054                       build_ssector
   1.81%   1.81%   1.82%      308935    310036    312199   R_ViewTestSpriteLines
   1.67%   1.67%  48.41%      285561    286325   8282643   _P_SetMobjState
   1.49%   1.49%   1.50%      254773    255221    256183   _P_BoxOnLineSide
   1.46%   1.46%   9.64%      249599    250318   1649133   _P_LookForPlayers
   1.27%                      217792                       R_AddLine_loop
   1.23%   1.24%  26.64%      211019    211552   4558289   _P_NewChaseDir
   1.23%   1.24%   1.28%      209824    212895    218971   _R_DrawColumn
   1.20%   1.20%   1.21%      204715    205795    207128   BM_P_CrossSubsector_vdelta
   1.10%   1.10%   6.60%      187845    188470   1129712   R_FlushDeferredSurfaces
   1.09%   1.09%  33.93%      185843    186072   5806507   _A_Chase
Largest single CPU user is audio handling. Attached is a callgraph of CPU usage from which I've taken out also calls to _P_Random, _R_PointToAngle2, _S_StartSound, _Z_Malloc & _memset because they were called from so many places (i.e. made graph messy) and used only little CPU.

DSP side:

Code: Select all

Visits/calls:
   7.56%   7.56%      131936    131936   TestLineSegVectorBisection
   7.19%              125377             R_EndAddSurface
   6.20%              108241             lowerwall_skip
   6.01%              104893             R_DoColumnTextureUV
   5.90%              102896             P_CrossSubsector_body
   4.80%   4.80%       83826     83826   R_DoColumnPerspCorrect
   4.79%               83594             command_base
   4.65%   9.26%       81190    161481   InterceptVectors
   4.61%   4.61%       80410     80410   Divs48_Real
   4.51%               78739             upperwall_skip
   3.29%               57345             midwall_skip
   3.05%   3.05%       53141     53141   VPRenderSpanDT
   3.05%               53141             VPRenderPlane_continue
   2.78%               48543             R_SetSSectorLuma
   1.93%               33724             R_CheckBBox_end
   1.93%   3.28%       33724     57291   R_CheckBBox
   1.85%               32261             project_node
   1.54%               26915             R_CheckBBox_pass
   1.53%               26658             trans_skip
   1.40%   1.40%       24500     24500   divs_x1_a
   1.22%   1.22%       21266     21266   new_luminance
   1.04%   1.04%       18226     18226   R_DoColumnConstantClone
   1.01%               17628             oct_4
...
Used cycles:
  58.79%                  1404211290                       command_base
  12.47%  13.32%  13.32%   297953994 318090474 318090474   R_DoColumnPerspCorrect
   4.22%                   100706150                       P_CrossSubsector_body
   4.10%                    97959576                       ALGO_P_CrossBSPNode
   3.52%                    84047130                       R_VPRenderSky
   3.02%                    72052422                       R_VPLoadTexture
   2.92%   2.92%   2.92%    69802324  69802324  69802324   VPRenderSpanDT
   1.57%   1.57%   1.57%    37428682  37428682  37428682   extract_subvisplane
   1.37%                    32714136                       R_ViewTestAddLine
   1.03%                    24617858                       R_DoColumnTextureUV
   0.80%                    19063302                       project_node
   0.68%   0.87%   4.00%    16170566  20811708  95590682   AddLowerWall
   0.65%   0.65%   0.65%    15629604  15629604  15629604   Divs48_Real
   0.54%                    12920406                       R_CheckBBoxPair
Most of time in R_DoColumnPerspCorrect goes to:

Code: Select all

p:045e  0aa981 00045e  (06 cyc)  jclr #1,x:$ffe9,p:$045e                           9.86%
You do not have the required permissions to view the files attached to this post.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

As to Doom II startup, symbols on top of instruction and cycle counts lists differ interestingly:

Code: Select all

Time spent in profile = 129.22795s.

Visits/calls:
- max = 777632, in strcmp_8 at 0x4c026, on line 11806
- 2601042 in total
Executed instructions:
- max = 7757869, in _V_DrawPatch+244 at 0x4322c, on line 7760
- 279149256 in total
Used cycles:
- max = 124220016, in _V_DrawPatch+246 at 0x4322e, on line 7761
- 2073138536 in total
Instruction cache misses:
- max = 582062, in mip_plot_16bit at 0x4da30, on line 12868
- 11207806 in total

Visits/calls:
  29.89%  29.90%      777488    777734   strcmp_8
  26.22%  26.26%      681873    683013   correct_element
  11.81%  11.82%      307311    307572   mip_plot_16bit
...
Executed instructions:
  19.00%  19.02%  20.12%    53032380  53093733  56164519   D_GeneratePostMap
  11.96%  11.98%  12.52%    33386278  33452598  34941538   _V_DrawPatch
   8.27%   8.27%   8.33%    23072063  23088300  23254410   strcmp_8
   8.10%   8.11%   9.22%    22604895  22637307  25751198   D_FlatMipGen_8_16
   6.74%   5.51%   5.66%    18804376  15374288  15799738 * D_PatchRender_8_8
   5.51%   5.52%   6.39%    15375613  15397997  17831929   D_TextureMipGen_8_16
   4.64%   4.65%   4.93%    12957696  12990477  13760357   correct_element
   3.66%   2.05%   2.88%    10226679   5710105   8046128 * D_TextureRemapPixelBlock_16_8
   2.89%   2.89%   2.99%     8061092   8073307   8344857   D_TextureRemapInit
   2.88%   2.88%   2.88%     8026920   8028966   8028966   _subframe_block
   2.68%   2.67%  10.02%     7475207   7462937  27969601 * hrsc_locate_q
...
Used cycles:
  14.10%  14.16%  14.75%   292231244 293532400 305814424   _V_DrawPatch
  12.75%  12.81%  14.13%   264339676 265545036 292852172   D_GeneratePostMap
   7.64%   7.67%   7.98%   158440024 159085840 165429768   correct_element
   6.70%   6.73%   8.03%   138812856 139447824 166466988   D_FlatMipGen_8_16
   6.47%   6.49%   6.56%   134209604 134573416 135941956   strcmp_8
   5.02%   4.13%   4.29%   104029088  85524728  89030396 * D_PatchRender_8_8
   4.94%   2.87%   3.65%   102498876  59511592  75714580 * D_TextureRemapPixelBlock_16_8
   4.65%   4.67%   5.69%    96460944  96902296 118001040   D_TextureMipGen_8_16
   3.16%   3.16%   3.16%    65421728  65466836  65466836   _subframe_block
...
Instruction cache misses:
  23.30%  23.34%  23.39%     2611895   2615513   2621465   mip_plot_16bit
  15.36%  15.51%  28.05%     1721608   1737941   3143794   D_FlatMipGen_8_16
   6.32%   6.39%   6.44%      708793    716034    721646   strcmp_8
   5.65%                      632922                       ROM_TOS
   4.36%   4.37%   9.88%      488891    489318   1107574   hrsc_locate_wd_q
   4.13%   4.13%  11.64%      462857    463124   1304677   ___timesub
   3.77%   3.87%  14.20%      422354    433804   1591500   D_TextureMipGen_8_16
   3.27%   3.27%   3.27%      366152    366750    366886   ___udivsi3
   2.11%   2.12%   5.28%      236960    237493    592149   ___divsi3
   1.48%   1.48%   3.90%      165768    165929    436884   ___modsi3
Attached are callgraphs of Doom II startup instructions, cycles and i-cache misses (with following generic, small impact functions removed from graph _Z_Malloc, _memset, _memcpy, _strlen, _strncpy, _strncmp, _getenv, ___malloc). TIMEBASE=1, GCC 2.x.
You do not have the required permissions to view the files attached to this post.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote:
Is that both with "--desktop" set to "yes" and "no"?

(With "yes" desktop resolution isn't changed, Falcon display is zoomed to integer multiple of Falcon resolution closest to desktop resolution.)
I tried it recently with "yes" for the desktop arg and got the same result - which is to say the fullscreen mode won't enter fullscreen 'properly' but instead blits the image onto the desktop with 2x zoom, with a black border around it. The OS menu bar is still visible etc. In this mode its much slower than windowed mode with the usual 2x zoom. However it could be a result of the way I built it, haven't tried an official release to see if it's the same.

Eero Tamminen wrote: Sounds reasonable. Have had yet time to look into the expensive audio part?
I looked at it briefly but remembered that the left/right panning needs done and tested before optimizing the mixer - in case I had to change it or drop it for some reason. This is why the 'slow' mixer is enabled just now (there are other semi-optimized versions but they are turned off).

I've been working on some other crazy thing for a few days, and repairing some mainboards before that. Will return to it next week.
Eero Tamminen wrote: Executed instructions:
7.98% 7.98% 7.98% 10626773 10632373 10632373 _subframe_block
5.83% 5.84% 6.29% 7764088 7775516 8382794 _R_PointInSubsector
5.43% 5.43% 5.75% 7228956 7239562 7663211 D_AllocPreview_Dynamic
I see the memory debugger is left on again :-/ I'll need to turn that off. Was enabled for tracking mipmap allocation.
Eero Tamminen wrote: DSP side:
...
Most of time in R_DoColumnPerspCorrect goes to:

Code: Select all

p:045e  0aa981 00045e  (06 cyc)  jclr #1,x:$ffe9,p:$045e                           9.86%
That's fine - it means the DSP is mostly ahead of things, waiting for the CPU to finish a column and fetch a new one.


This is actually one of the things I was interested to track in a better way than I have been so far - it's quite effort intensive right now.

I'm wondering if it's possible for the profiler to analyze spinloops - any pair of instructions which result in a tight loop, on either CPU or DSP.

Originally I was interested in detecting just DSP host port spinloops because they are relatively easy to decode on the DSP - they always look the same. On the CPU side however, they can vary a bit and it's harder.

I then realized that any 2-opcode loop is a spinloop, and the semantics are going to be similar so it's probably easier/better to just detect all of them and profile them in the same way (blitter spinloops could also benefit from this). So on the CPU side any pair of ops where the 2nd is a branch back to the first - doesn't matter what the first opcode happens to be.

The general idea is to track the activity of spinloops one level higher than the instruction counts. Specifically, recording the minimum, maximum and average iteration count for each spinloop recognized. A digest of the spinloop sites with these metrics is immensely useful because it becomes possible to spot a stall which occurs only infrequently but for a significant duration. It also helps pinpoint those spinloops which never iterate due to a favourable performance ratio, and can probably be removed.

I think the most valuable side to watch is the CPU side, because the CPU should not be waiting for the DSP except in some rare cases for vector operations where inputs and outputs follow each other closely. Watching the DSP side is also useful though - it can give some indication of where there are unused/idle cycles which could absorb nearby work if some code is moved. etc.

Anyway see what you think :) The way I did this before involved a very large and unresponsive spreadsheet with the profiler disasm pasted into it and some simple column calcs to spot the stalls. It only handled the DSP side but it 'inferred' the CPU side by counting any DSP spinloops with low iteration counts (relative to neighbour ops), as a DSP bottleneck (i.e. CPU side is spinning). It's better though to actually track the CPU side since buffering between the two sides makes 'inferring' a bit less reliable.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:
Eero Tamminen wrote: Is that both with "--desktop" set to "yes" and "no"?

(With "yes" desktop resolution isn't changed, Falcon display is zoomed to integer multiple of Falcon resolution closest to desktop resolution.)
I tried it recently with "yes" for the desktop arg and got the same result - which is to say the fullscreen mode won't enter fullscreen 'properly' but instead blits the image onto the desktop with 2x zoom, with a black border around it. The OS menu bar is still visible etc. In this mode its much slower than windowed mode with the usual 2x zoom. However it could be a result of the way I built it, haven't tried an official release to see if it's the same.
Did you try with "no"? "yes" is Hatari default for Videl.

(Because Hatari supports only doubling for ST/E screen modes, --desktop-st option isn't enabled by default them, only the Videl option.)

I'll address the spinloop detection feature on the Hatari mailing list.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote:
Did you try with "no"? "yes" is Hatari default for Videl.

(Because Hatari supports only doubling for ST/E screen modes, --desktop-st option isn't enabled by default them, only the Videl option.)

I'll address the spinloop detection feature on the Hatari mailing list.
I'll give it another try with both settings to see if it makes any difference. I think I tried both but I don't remember now. I'll check the Hatari list for your answer. Don't worry too much about the spinloop thing - it can still be done manually and doesn't have to be done *that* often - just a nice thing for a DSP coder to have :-).

BTW I'm working on a new 3x2 volume panning audio mixer to replace the 'debug' version in there now. It's basically written but untested. I didn't manage a satisfying solution on the first attempt, ran out of cpu registers - but 2nd try was good. :-) 3 passes are now one.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

The new audio mixer is in. It works but I haven't profiled it. If it still takes too long the next remedy is to configure 2x2 channels by default and leave 3x2 as an option, or to make panning optional (that's the expensive bit).

Hopefully though it will be fast enough now to leave it alone.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:The new audio mixer is in. It works but I haven't profiled it. If it still takes too long the next remedy is to configure 2x2 channels by default and leave 3x2 as an option, or to make panning optional (that's the expensive bit).

Hopefully though it will be fast enough now to leave it alone.
Thanks, it's not anymore so visible in profiles.

This is from same Doom II profile:

Code: Select all

Time spent in profile = 73.75542s.

Visits/calls:
- max = 229154, in R_BSPHyperPlane_RHS at 0x51972, on line 12487
- 2154698 in total
Executed instructions:
- max = 1520000, in hash_insert+6 at 0x4afba, on line 10081
- 128970221 in total
Used cycles:
- max = 24342652, in hash_insert+10 at 0x4afbe, on line 10082
- 1183220920 in total
Instruction cache misses:
- max = 230678, in _P_BlockThingsIterator+72 at 0x332d8, on line 5324
- 17776430 in total
...
Executed instructions:
   6.94%   7.04%   7.42%     8953784   9076826   9574936   _BM_P_CrossBSPNode
   6.02%   6.03%   6.28%     7764088   7775089   8100853   _R_PointInSubsector
   5.95%   5.95%   6.09%     7669248   7679644   7852902   hash_insert
   5.41%                     6975614                       R_DrawSurface_NoTileNoMip
   4.63%                     5969195                       R_AdvanceSurface_T4
   4.43%                     5715478                       R_VisPlaneSkyShader
   4.39%   4.40%  31.50%     5658399   5669726  40629044   _P_RunThinkers
   3.85%                     4969042                       R_AdvanceSurface_T2
   3.81%                     4912829                       R_AdvanceSurface_T8
   2.90%   2.90%   3.00%     3735552   3745183   3869341   correct_element
   2.66%                     3427699                       R_VisPlaneShader
   2.41%   2.41%   2.55%     3107106   3112695   3290873   _R_DrawColumn
   2.36%                     3045463                       R_SpriteColumnShader_Masked2
   2.21%   2.22%  13.82%     2854430   2862988  17826988   _P_CheckPosition
   2.07%   2.07%   2.21%     2665428   2671702   2853076   _PIT_CheckThing
   1.95%   1.96%   4.31%     2519156   2525487   5563746   _P_BlockThingsIterator
   1.88%   1.89%   1.89%     2430161   2431501   2431501   _BM_A_Mux3x2
   1.70%   1.70%   1.84%     2194008   2198121   2368618   stream_texture
   1.52%   1.52%   1.67%     1964769   1966433   2150022   _V_DrawPatch
   1.52%   1.52%   1.56%     1957573   1962068   2017156   D_TextureRemapPixelSubBlock_16_8
   1.39%   1.39%   1.48%     1792810   1794837   1914011   D_GeneratePostMap
   1.39%   1.39%   1.39%     1792615   1793595   1793595   _BM_A_Mux2x2
   1.38%   1.40%   1.51%     1782674   1807094   1943699   stack_visplane_area
   1.34%   1.34%   1.37%     1731707   1734086   1766229   D_TextureMipGen_8_16
   1.07%   1.07%   1.07%     1380034   1380954   1380954   _BM_A_Mux1x2
   1.02%   1.03%   1.10%     1321888   1324138   1416278   _P_UpdateSpecials
   0.97%   0.97%   3.99%     1248024   1250600   5148074   create_local_palette_64levels
   0.89%   0.90%   1.95%     1154043   1156450   2515150   _P_BlockLinesIterator
   0.89%   0.89%   3.52%     1144731   1148409   4534792   _R_DrawVisSprite
...
And DSP side:

Code: Select all

Used cycles:
  46.01%                  1088721690                       command_base
  26.03%  26.94%  26.94%   616004332 637494024 637494024   R_DoColumnPerspCorrect
   3.96%                    93754098                       ALGO_P_CrossBSPNode
   3.47%                    82143284                       R_VPRenderSky
   2.94%                    69559530                       R_VPLoadTexture
   2.94%   2.94%   2.94%    69466634  69466634  69466634   VPRenderSpanDT
   2.88%                    68246456                       P_CrossSubsector_body
   1.54%   1.54%   1.54%    36537408  36537408  36537408   extract_subvisplane
   1.33%                    31538582                       R_ViewTestAddLine
   1.09%                    25687338                       R_DoColumnTextureUV
   0.81%                    19167630                       project_node
   0.79%   0.79%   0.79%    18643908  18652596  18652596   InterceptVectorsUF
   0.70%   0.90%   4.00%    16632652  21209604  94598926   AddLowerWall
   0.55%   0.00%   0.00%    13062878     32890     32890 * Divs48_Real
   0.54%                    12705474                       R_CheckBBoxPair
   0.51%   0.63%  20.49%    12108966  14861104 485001286   AddMidWall

Visits/calls:
   7.66%   7.66%      149450    149450   InterceptVectorsUF
   7.03%   7.03%      137209    137209   TestLineSegVectorBisection
   5.62%              109714             lowerwall_skip
   5.44%              106117             R_DoColumnTextureUV
   5.38%              104997             R_EndAddSurface
   5.27%              102886             P_CrossSubsector_body
   4.59%               89470             command_base
   4.37%   4.37%       85187     85187   R_DoColumnPerspCorrect
   4.09%               79765             upperwall_skip
   3.72%               72510             VECOP_return
   3.60%   3.60%       70212     70212   InterceptVectors
   3.57%   0.01%       69603       126   Divs48_Real
   2.98%               58087             midwall_skip
   2.74%   2.74%       53530     53530   VPRenderSpanDT
   2.74%               53530             VPRenderPlane_continue
   2.22%               43362             R_SetSSectorLuma
   1.74%               33942             R_CheckBBox_end
   1.74%   2.97%       33942     57849   R_CheckBBox
   1.66%               32477             project_node
   1.39%               27076             R_CheckBBox_pass
   1.37%               26803             trans_skip
...
Audio went from 8% CPU usage to 4.3%, which improved the average FPS from 2.20 to 2.26, a 2.7% FPS improvement. It's clear, but a bit modest improvement. I'll hopefully have time tomorrow to check how large impact that and your earlier optimizations had on worst frame numbers.

Attached is a callgraph of where the CPU instructions are spent in these 167 timedemo frames (with calls to P_Random, R_PointToAngle2 and to leaf & intermediate symbols removed).

Profile for hash insert which is 3rd on list and has instructions with highest instruction & cycle counts, looks like this:

Code: Select all

hash_insert:
$04afb4 :             move.l    d0,d2                      0.03% (34624, 138552, 934)
$04afb6 :             moveq     #$c,d3                     0.03% (34624, 138828, 317)
$04afb8 :             lsr.l     d3,d2                      0.03% (34624, 276552, 2)
$04afba :             and.w     #$1fff,d2                  1.18% (1520000, 6082076, 0)
$04afbe :             tst.l     (a2,d2.w*4)                1.18% (1520000, 24342652, 0)
$04afc2 :             beq.s     $4afc8                     1.18% (1520000, 6084052, 603)
$04afc4 :             addq.w    #1,d2                      1.15% (1485376, 5963684, 42)
$04afc6 :             bra.s     $4afba                     1.15% (1485376, 11899688, 633)
$04afc8 :             rts                                  0.03% (34624, 416408, 11)
You do not have the required permissions to view the files attached to this post.

Return to “680x0”