Bad Mood : Falcon030 'Doom'

All 680x0 related coding posts in this section please.

Moderators: Zorro 2, Moderator Team

User avatar
shoggoth
Nature
Nature
Posts: 1447
Joined: Tue Aug 01, 2006 9:21 am
Location: Halmstad, Sweden

Re: Bad Mood : Falcon030 'Doom'

Post by shoggoth »

Tried v0.5 just now. I couldn't reproduce the random-pixel issue I described earlier (it was floor/ceiling, btw). Again amazing progress :)

I'm hoping for a low detail mode which doubles pixels in hardware on VGA!!! I guess you'll need to split the screen using a Timer B HBL interrupt in such case (has anyone tried that, ever?). Text output will look funny I guess, but then again it could be moved to some other place on screen which has square pixels.

Hugely impressive, Doug :) Looking forward to the next release, whenever that is.
Ain't no space like PeP-space.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Thanks again for testing it.

I am looking at various video mode options including a raster split but I'm wary that Hatari can't reproduce this (it doesn't translate any intra-frame Videl events) so I'll probably focus on a 'clean' 320x mode with menus and resizable window, and a fullscreen mode where anything goes - probably SCP based (and it should flip between the two easily, when the menu is activated).

A special splitscreen mode can still be done if it works reliably but it will have to be real-HW-only. I expect it could work, providing it is stable enough without 68k sync tricks.

I have made a few more improvements to the game code to speed it up but the main thing I have been working on is fast 2D overlay drawing and fast 3D sprite drawing. The new sprite routine works quite differently and is much faster when sprites near the viewer. It should also work on transparent walls with a bit of effort.

The existing method will remain near-optimal for distant sprites because it is area-limited but the new one will be much better for big stuff as it doesn't require masking tests and it isn't as expensive as the post-based method in Doom. Pixel columns can be drawn directly with a classic addx DDA for downscaling or carry method for upscaling and column skips require a lookup and multiply only.

[EDIT]

Just got it working a few minutes ago, after fixing a stupid bug. It's not quick yet but the method & precalc is mostly working so the fast version should work also.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Tonight I got the new masked/transparent surface routine working with sprites and transparent walls.

The very slow room at the end of e1m1 is a lot faster now esp. with a reduced size render window.

There's a divide operation per column that only needs done once per surface but once that's fixed the drawing side of it should be ready.

The routine needs a precalculated buffer and it's only necessary for textures which need to be masked. Unfortunately the 'masked' flag in the WAD texture records seems to be unused/unset so I'll have to work out which textures are used with masking some other way - probably by scanning segs and their flags, and tracking un-covered pixels in the patches used to build each texture. That's pretty annoying to have to do at loadtime, when a stored flag would have been great - but it will get solved. Currently it just uses extra memory for a lot of unused buffers.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Another minor update.

By reimplementing the transparency renderer to use a postmap for skips and wrap prediction, the instruction count per pixel has been reduced from:-

5, including a branch for mode change (for transparent texels)
6, including a branch for mode change (for opaque texels)

..to..

0 (for transparent texels)
4 (for opaque texels)

There is still an overhead for mode change but it is relatively small compared with filling time. So the new method is a lot faster than the old one (I did always say it was a crap effort and needed replaced! :-) )


I have also been looking at similar changes to wall rendering. There will be no secondary postmap available for normal, opaque walls to assist with wrap prediction but it is still possible by another method and the instruction count per pixel can therefore probably be reduced from 5 to 4, which should be the lower limit for lighting lookups via CPU, or 3 for prelit surfaces if that can be employed effectively. (The current code uses 6 ops instead of 5 but that was an old memory scheduling optimization and isn't part of the fragment itself).
User avatar
bullis1
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 2301
Joined: Tue Dec 12, 2006 2:32 pm
Location: Canada

Re: Bad Mood : Falcon030 'Doom'

Post by bullis1 »

That is a significant improvement! Does that apply for sprites and masked walls, or just one? I gathered it was for walls based on your previous post.
Member of the Atari Legend team
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Hi!
bullis1 wrote:That is a significant improvement! Does that apply for sprites and masked walls, or just one? I gathered it was for walls based on your previous post.
It works now for both, although sprites get an extra speed kick due to constant-z, constant-luma, constant-dv. Walls unfortunately need unique term for each column so not quite as quick. I might try a constant-luma version for walls (both ends of the wall lie in the same luma 'band') but the saving for luma indexing will be very small so it's maybe not worth the trouble.


The main problem with the walls was wrapping the texture v term efficiently (which sprites don't use), and figuring out which textures actually happened to be transparent (whereas all sprites are transparent) in order to generate the postmap data. Apart from that the same method works for both so it should be a big help.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Here's an outline of the postmap method now used in BM for masked sprites and walls.

The technique is quite simple, although somewhat painful to get working well in all cases - loss of precision in calculations can cause hysteresis or pathological looping problems. I found a tidy way to stop that though without needing to raise precision (in fact all postmap operations are done in very limited 7:9 fixedpoint).

The method involves generating a secondary 'postmap' texture which is a 1:1 mapping to the texture itself. The 'texels' in that map encode the number of texels remaining in the corresponding real texture, in that run mode (opaque/transparent), for that texture column. If the column is opaque, the counts are encoded as positive. If the column is transparent, the counts are negative.

e.g.

Code: Select all

-3  <- start of transparent run 
-2
-1
05 <- start of opaque run
04
03
02
01
-1 <- single texel gap
02 <- short opaque run
01
-2 <- last run is again transparent
-1
The renderer looks up the postmap each time a pixel run exhausts in screenspace and a new one must begin, finds the number of texels remaining in the *next* run using the current V address, and predicts the number of pixels to plot in screenspace (which must also be clipped against the screenspace column size, since Doom can occlude columns by stuff sitting in front of them). The key word here is 'remaining' because advancing V may skip texels and the lookup may happen one or two pixels into the postmap entry.

For transparent runs, just the pixel and texel address need advanced before the next postmap lookup. For opaque runs, the pixel plotting loop is activated and V is left in the correct state for the next lookup so there is no extra cost.

What's the point of using it over just drawing scaled columns? Two reasons - clipping is easier since you just address the postmap for any given V and any given pixel, and you know what to do, so you don't need to process columns which lie outside of the clip range. Second reason is, it's faster for transparent runs which collapse as they scale down, since no V advance is required - the postmap lookup says just keep plotting pixels. For normal scaled posts, every opaque run must be offset and corresponding rounding error compensated in screenspace. It's also worth noting that posts which get scaled down can't overlap, since the texture is advanced in screenspace, not post-space. So overheads within a fragmented sprite with lots of posts (e.g. some of those fireball/exploding barrel puff sprites) will get cheaper as it shrinks on the screen.


Optimization: A count of zero indicates no more data for the remainder of the column, so when the last 'skip' is encountered rendering can stop early without having to perform the dv skip.

Code: Select all

-3  <- start of transparent run 
-2
-1
05 <- start of opaque run
04
03
02
01
-1 <- single texel gap
02 <- short opaque run
01
00 <- last run encodes a STOP condition
00
Hysteresis fix: The vast majority of postmap lookups at 'normal' scaling values are at the start of new runs (or where precision fails, the very end of existing runs). This means it could accidentally draw only 9 of 10 intended pixels, then perform a new postmap lookup for the 10th pixel instead of switching mode to transparency as it should have. This causes overhead for an unwanted pixel run, even if the result looks correct (watch out - there are several ways for this algorithm to look right, but operate wrong :-) it always needs verified ). An easy 'fix' is to modify the postmap so it is conservative - each run in the postmap starts one texel early, so the chance of looking up the *next* run is 100%. The cost is slightly early termination of a run in screenspace (but not texel space) which is quite tolerable.

Code: Select all

-3  <- start of transparent run 
-2
06 <- extend start of opaque run over end of transparent run
05 <- 'real' start of opaque run
04
03
02
-2 <- extend start of transparent run over end of opaque run 
-1 <- 'real' start of single texel transparent run
02 <- opaque run can be extended into short transparent run (but not other way around - preserving gaps is less important than pixels)
00 <- STOP condition extended into end of last run
00 <- start of last run encodes a STOP condition
00
This is best combined with deliberately under-estimating pixel run lengths so the postmap does all the precision corrections. Over-estimating (or 'correct' rounding) can lead to the opaque run drawing more pixels than necessary, and posts terminate with colour index 0 a bit too late. Not a disaster but can be mostly avoided even with low precision estimates.

How to verify the algorithm? Draw the last postmap lookup instead of real texels. You'll see fragmentation in the columns, or extra pixels hanging on the ends, if it's not working properly. If your fixedpoint range breaks down you'll see pixel noise and horrible slowdown :)
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

...cont'd.

A fast path has been added for walls which are predicted to need no texture wraps on the v-axis. This uses 4 ops per pixel. The result is a 3%-5% improvement for most scenes (or a bigger improvement when close to one of the affected walls or when the scene doesn't contain many edges).

I'm also looking at a faster version of the wrapping case, which predicts post heights before a wrap event and wraps when needed. The post height prediction is best done on the DSP and will take longer to complete, but I'll test the idea on the CPU first.

[EDIT]

Wrap prediction is working on CPU, but slowly. I think the textures would also need unrolled an extra 2 pixels or so at load time since the wrap prediction has a small degree of error which can reveal undefined pixels near the texture bottom. By doing that and moving the prediction calcs to the DSP it should be fast enough to replace the old code.

[EDIT]

Scrap that idea - there's another much simpler solution which is probably also quicker. Doing that instead.

{EDIT}

All walls now benefit from texture wrap optimization and there's a measurable speedup of around 7% on e1m1 startpoint in Hatari (more where walls use more fill area, less where edges and floor are dominates).
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

So I rewrote the CPU wall drawing code and was surprised to find a buildup of rubbish in that area, probably resulting from many separate edits or changes in storage or DSP stuff.

The wall column renderer is now broken into multiple paths which are DV-rate specific, and can switch from one path to another mid-surface on a DV threshold change (exit/entrypoint switch in the middle of the column setup). The point is to minimize vertical wrapping/tiling events and context switches on surfaces which still need tiled. e.g. most walls don't need tiling events any more frequently than 1 in 16 pixels, this being determined by DV (z) and the amount of unroll performed on the texture lower edge.

Textures which don't have tilable dimensions at loadtime (e.g. not 2^n, or in Doom's case just 128 high) are not unrolled at all and use the fastest (no-wraps, no-switches) path.

Results are pretty good but it has broken the postmap generation code I just did for fast transparency and the HD texture upgrades, which assume the texture isn't unrolled and use the WAD texture dimensions - I'll need to revisit that and repair them to cope with unrolled textures.

One thing that might work out well from this - the same paths are also mip-level specific, so I could add mipmaps for vertical DV rates and have the same paths implement the DV scaling and table lookup for that mip, allowing mipmaps to be selected per column instead of per-surface, probably for free. I haven't tried to do this yet so I may have missed a detail but it looks like it should work.

There is always a slight downside to adding more setup code or funny cases - increased area gets faster but increased item counts get slower. Since Doom has to run a fairly low-res window for other reasons, focusing on area fillrate is probably a waste of time beyond some point. i.e. there's no point in speeding up the case where the player is facing a single wall, while slowing down a typical view across an open courtyard. It might be at/beyond that point already so these may be the last wall filling optimizations done on the project and I'm trying to apply them with the minimum of extra setup or tests.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Since it's easier to explain nearly anything with a picture... this screenie was captured with the texture unrolling disabled so the unroll area is left black. You can then see where the texture tiling is performed too late and shows the unroll border. It also shows that more distant textures have a higher DV rate and therefore need more frequent tiling events in screenspace (and different code to optimize for that).

[EDIT] I measured the gaps in doubled-up 640x image so the numbers I scribbled on the image are wrong by 2x :-) but the point remains the same...

[EDIT] I decided to update the original image with a fixed one since the errors probably add more confusion than anything!

[EDIT] If it's interesting at all, the reason I dropped the DSP wrap-prediction method for this, is the added complication for any wall pixel shaders which might want to use the DSP later. The DSP method would minimize the number of wrap events to the exact number needed but the saving probably isn't significant compared with the headaches it would create for DSP shading.
You do not have the required permissions to view the files attached to this post.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

I'm working on support in Hatari and Doom profiling script that will automatically find the worst frame, based on how many instructions were executed between successive "r_begin" calls. Each time such interval took more instructions than earlier, profile save file will be overwritten with the profile between last two "r_begin" calls.

For some reason Doom timedemos behave now randomly, they differ slightly on each run. This makes profiling results not re-producible and therefore they cannot be anymore used in verifying optimizations. :-/

For the beginning of Doom1 timedemo (until player starts shooting at barrels), the worst frame info looks like this:

CPU side:

Code: Select all

Time spent in profile = 0.50364s.

Visits/calls:
- max = 741, in R_BSPHyperPlane_RHS at 0x4a5b2, on line 6701
- 8406 in total
Executed instructions:
- max = 12160, in _subframe_block+456 at 0x1ef2c, on line 758
- 1024240 in total
Used cycles:
- max = 243288, in _subframe_block+470 at 0x1ef3a, on line 763
- 8079684 in total
Instruction cache misses:
- max = 1677, in _PIT_CheckLine at 0x2eff6, on line 2407
- 94198 in total

Visits/calls:
   8.82%                 741             R_BSPHyperPlane_RHS
   7.09%                 596             R_BSPHyperPlane_LHS
   6.65%   7.35%         559       618   _PIT_CheckLine
   5.88%                 494             copy16_d
   5.83%   6.06%         490       509   _PIT_CheckThing
   4.16%  13.67%         350      1149   _P_BlockThingsIterator
   2.11%   2.43%         177       204   BM_P_CrossSubsector
   1.87%   1.87%         157       157   _R_PointInSubsector
   1.71%  29.13%         144      2449   _P_SetMobjState
   1.56%   8.91%         131       749   _P_BlockLinesIterator
...
Executed instructions:
  31.24%                      319993                       R_TransparentColumnShader_Masked
  18.68%                      191337                       render_wall_1x1
  12.07%  12.08%  12.24%      123663    123703    125329   _subframe_block
   3.95%   3.95%  26.85%       40430     40504    275053   _P_RunThinkers
   3.64%   3.64%   3.64%       37250     37286     37286   _R_PointInSubsector
   3.12%   3.19%   5.91%       31949     32715     60567   _BM_P_CrossBSPNode
   2.25%                       23079                       R_VisPlaneShader
   1.95%   1.95%   2.06%       19936     19970     21064   BM_P_CrossSubsector
   1.58%   1.58%   1.58%       16180     16225     16225   stream_texture
   1.42%   1.42%   8.84%       14530     14534     90516   _P_CheckPosition
...
Instruction cache misses:
  10.61%  10.65%  72.59%        9994     10031     68382   _P_RunThinkers
   6.97%   6.97%  28.36%        6563      6566     26714   _P_CheckPosition
   6.00%   6.03%  12.32%        5654      5680     11602   _P_BlockLinesIterator
   5.28%   5.28%   6.29%        4973      4976      5922   _PIT_CheckLine
   3.74%   3.77%  10.19%        3522      3554      9602   _BM_P_CheckSight
   3.56%   3.57%   3.57%        3349      3367      3367   _R_PointInSubsector
   3.43%   3.44%  11.50%        3228      3239     10831   _P_BlockThingsIterator
   3.06%   3.10%   3.55%        2881      2920      3340   _PIT_CheckThing
   3.04%   3.04%  32.14%        2860      2860     30273   _P_TryMove
   2.83%   2.83%  47.22%        2666      2666     44482   _P_SetMobjState
DSP side:

Code: Select all

Visits/calls:
  84.37%   1.47%       49227       860   P_CrossSubsector
   2.24%                1305             R_EndAddSurface
   1.56%                 913             R_DoColumnTextureUV
   1.14%                 663             trans_skip
...
Used cycles:
  62.36%                    10077558                       command_base
  15.84%  16.43%  16.43%     2560152   2654584   2654584   R_DoColumnPerspCorrect
   4.08%                      659022                       ALGO_P_CrossBSPNode
   3.91%   3.91%   3.91%      631666    631666    631666   VPRenderSpanDT
   3.02%                      488622                       R_VPLoadTexture
   1.88%   2.53%   2.75%      303330    409176    444738   P_CrossSubsector
   1.24%                      200928                       R_ViewTestAddLine
   1.20%                      193940                       R_DoColumnTextureUV
   0.68%   0.68%   0.68%      110048    110048    110048   extract_subvisplane
   0.65%                      105252                       P_CrossSubsector_doseg
PS. Your latest code isn't in repository yet, so I don't know how useful above is anymore.

PPS. I'm not sure whether this should be separate script used mainly for checking interactive doom playing:
* Using it means skipping several of the profiles done by the script earlier, which might still be useful.
* In automated usage timedemo restarts, differences in levels and lenght of timedemos between different Doom versions complicate making it work nicely across all Doom versions i.e. getting it to profile all relevant frames (not slow ones at start, not stuff where next level is loaded etc).
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote:I'm working on support in Hatari and Doom profiling script that will automatically find the worst frame, based on how many instructions were executed between successive "r_begin" calls. Each time such interval took more instructions than earlier, profile save file will be overwritten with the profile between last two "r_begin" calls.
This is probably good, because trying to do this in realtime on the emulated side is difficult. It's probably best tracked a frame at a time and this can't be measured accurately without a high precision timer (I considered using the audio DMA counter as a precision profiling clock this but then sound doesn't work of course and it is something which itself needs profiled). The best I could come up with was a smoothed measurement over several frames using a low frequency timer e.g. 200hz. This would catch the same approximate areas of the game but not the single worst frame.
Eero Tamminen wrote: For some reason Doom timedemos behave now randomly, they differ slightly on each run. This makes profiling results not re-producible and therefore they cannot be anymore used in verifying optimizations. :-/
I'll need to look at this. I did add something recently to manage sound event propagation between sectors and this does cause at least one new bug/problem I noticed afterwards. If I did check that in already, it is probably the reason.

I thought it had been safely tied to (TIMEBASE_CONTROL>=2) but maybe not (demos should work at <= 1).
Eero Tamminen wrote: For the beginning of Doom1 timedemo (until player starts shooting at barrels), the worst frame info looks like this:
Time spent in profile = 0.50364s.
I assume that means the profiling started very late during execution? How did it establish this is the worst frame from just 0.5s of sampling?
Eero Tamminen wrote: Visits/calls:
6.65% 7.35% 559 618 _PIT_CheckLine
5.88% 494 copy16_d
I haven't looked at profiles for a week or two but is copy16_d one of the ones we identified with C patch drawing, V_CopyRect and status bar updates? Or is it something new?
Eero Tamminen wrote: 31.24% 319993 R_TransparentColumnShader_Masked
This should be much improved by the time I put in recent code. It won't happen immediately because I'm reworking some source files and have some other things to fix first.
Eero Tamminen wrote: 12.07% 12.08% 12.24% 123663 123703 125329 _subframe_block
This is from the audio mixer - it's registering a bit more than I've seen it while testing here (around 10%) but it's something which still needs a second pass to optimize. The previous aim was just to get it in working.
Eero Tamminen wrote: 3.95% 3.95% 26.85% 40430 40504 275053 _P_RunThinkers
I think I checked in some changes for this one already, so it should be taking considerably less time than before. It's still one of the main costs but some of the smaller utility functions using by the thinker code have been inlined or implemented in asm, which reduced cache misses, cleaned up the callgraph a lot and made the total cost easier to account for.
Eero Tamminen wrote: 3.64% 3.64% 3.64% 37250 37286 37286 _R_PointInSubsector
3.12% 3.19% 5.91% 31949 32715 60567 _BM_P_CrossBSPNode
2.25% 23079 R_VisPlaneShader
Hmmm. More time spent checking line of sight and finding which sectors things live in, than actually filling floor/ceiling pixels...

BM_P_CrossBSPNode has a bit more work to be done on it but is highly optimized already.

R_PointInSubsector still need a complete reimplementation.

It is worth reminding that the cost of many of these game-side problems is greatly inflated when built with (TIMEBASE_CONTROL < 3)
Eero Tamminen wrote:

Code: Select all

Cache misses:
  10.61%  10.65%  72.59%        9994     10031     68382   _P_RunThinkers
   6.97%   6.97%  28.36%        6563      6566     26714   _P_CheckPosition
   6.00%   6.03%  12.32%        5654      5680     11602   _P_BlockLinesIterator
   5.28%   5.28%   6.29%        4973      4976      5922   _PIT_CheckLine
   3.74%   3.77%  10.19%        3522      3554      9602   _BM_P_CheckSight
   3.56%   3.57%   3.57%        3349      3367      3367   _R_PointInSubsector
   3.43%   3.44%  11.50%        3228      3239     10831   _P_BlockThingsIterator
   3.06%   3.10%   3.55%        2881      2920      3340   _PIT_CheckThing
   3.04%   3.04%  32.14%        2860      2860     30273   _P_TryMove
   2.83%   2.83%  47.22%        2666      2666     44482   _P_SetMobjState
It is reassuring to know that only one of these were implemented by me for BM :-)
Eero Tamminen wrote:

Code: Select all

DSP side:
Visits/calls:
  84.37%   1.47%       49227       860   P_CrossSubsector
   2.24%                1305             R_EndAddSurface
   1.56%                 913             R_DoColumnTextureUV
   1.14%                 663             trans_skip
This one is a little distracting because the P_CrossSubsector label and entrypoint is inside its own loop. I might look for a way to remove the label from there for profiling builds, if it helps make the other stuff clearer.
Eero Tamminen wrote:

Code: Select all

...
Used cycles:
  62.36%                    10077558                       command_base
  15.84%  16.43%  16.43%     2560152   2654584   2654584   R_DoColumnPerspCorrect
   4.08%                      659022                       ALGO_P_CrossBSPNode
   3.91%   3.91%   3.91%      631666    631666    631666   VPRenderSpanDT
   3.02%                      488622                       R_VPLoadTexture
   1.88%   2.53%   2.75%      303330    409176    444738   P_CrossSubsector
   1.24%                      200928                       R_ViewTestAddLine
   1.20%                      193940                       R_DoColumnTextureUV
   0.68%   0.68%   0.68%      110048    110048    110048   extract_subvisplane
   0.65%                      105252                       P_CrossSubsector_doseg
This looks ok, but it reminds me I had a question relating to DSP/CPU profiling and revealing some extra info. I'll talk to you separately about it.
Eero Tamminen wrote: PPS. I'm not sure whether this should be separate script used mainly for checking interactive doom playing:
* Using it means skipping several of the profiles done by the script earlier, which might still be useful.
* In automated usage timedemo restarts, differences in levels and lenght of timedemos between different Doom versions complicate making it work nicely across all Doom versions i.e. getting it to profile all relevant frames (not slow ones at start, not stuff where next level is loaded etc).
There is some value in an interactive version because it makes it easier to find new cases to focus on for automated profiling. However, that is also subject to free time by whoever happens to be profiling, so I don't know how much one would be used versus the other :)
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:
Eero Tamminen wrote: For the beginning of Doom1 timedemo (until player starts shooting at barrels), the worst frame info looks like this:
Time spent in profile = 0.50364s.
I assume that means the profiling started very late during execution? How did it establish this is the worst frame from just 0.5s of sampling?
That profiling was started after:
- Level start is detected by "P_RunThinkers" breakpoint hit
- Several frames at start are skipped as they're very slow (currently 20)
- Specific Doom symbol that signifies extra action was hit (currently either P_LineAttack or A_Explode)

Worst frame is determined by how many cycles its profile took.

How it works:
- breakpoint that saves profile data triggers at "r_begin" if frame took more cycles than some earlier frame. New save overwrites earlier save.
- at every "r_begin" profiling data is cleared and instruction counter is reset by a second breakpoint.

Both of these required small change to Hatari debugger.

Currently Hatari is quit after fixed number of frames (currently 100), to prevent next level loading giving bogus "worst frame" profile. Is there some symbol which could be used to determine when player dies, or timedemo ends otherwise?


I use CPU instruction count because cycles count is reliable only for DSP (and CPU cycles counting would be more code which I don't want to spread around Hatari).

Btw. maybe "r_end" would be better trigger point than "r_begin"? I assume that would be closer to whole frame start.

dml wrote: I haven't looked at profiles for a week or two but is copy16_d one of the ones we identified with C patch drawing, V_CopyRect and status bar updates? Or is it something new?
Looking at the callgraph for calls, it's calling itself. Calls are better checked from the callgraph, it's probably best if I leave them out from the ASCII profile data I paste here.

dml wrote: subframe_block s is from the audio mixer - it's registering a bit more than I've seen it while testing here (around 10%) but it's something which still needs a second pass to optimize. The previous aim was just to get it in working.
This was for Doom1, in Doom2 its percentage should be less as it uses more CPU.
dml wrote: Hmmm. More time spent checking line of sight and finding which sectors things live in, than actually filling floor/ceiling pixels...
I think the profile was from place where Imp and fireball were largely in front of the player.

(You can monitor the data being written to profile in another window while Hatari runs to see when it saves the data.)

dml wrote: I thought it had been safely tied to (TIMEBASE_CONTROL>=2) but maybe not (demos should work at <= 1).
...
It is worth reminding that the cost of many of these game-side problems is greatly inflated when built with (TIMEBASE_CONTROL < 3)
...
There is some value in an interactive version because it makes it easier to find new cases to focus on for automated profiling. However, that is also subject to free time by whoever happens to be profiling, so I don't know how much one would be used versus the other :)
I hadn't thought about the TIMEBASE_CONTROL stuff, but now that you mentioned it, the current profiling stuff works well only for build where timedemos run OK. "worst frame" profiling stuff can then be used for the best optimized BM, when doing things interactively.

I'll provide a separate option in the profiler script for using "worst frame" profiling. Something like:

Code: Select all

profile.sh -w
Then when you want profiling for the worst frame to start, you just do in debugger:

Code: Select all

parse profile-worst.ini
And it will do worst frame profiling [1] while you play Doom and quit Hatari automatically when you die (or you can quit Hatari yourself), after which the script will post-process the data for last saved (i.e. worst frame) profile.

[1] Note: this has noticeable extra overhead on Hatari because profiling will be restarted on every frame. Profile save on worst frames may also make gameplay extra laggy. :)

[EDIT] I commited above discussed features to Hatari and BM profiling script. Option is now just "-w".

I also changed BM Makefile to add TIMEBASE_CONTROL value to doom binary & symbol file names that are copied to profiling directory so that profiling script can pick a suitable binary automatically.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

Worst frame from Doom2 level beginning, from BM built with TIMEBASE_CONTROL = 3.

I'm not sure whether lack of hectic action or build option caused worst frame to be better than with Doom1 TIMEBASE_CONTROL=1 build. I died pretty soon due to extra profiling slowdown making things awkward...

CPU side:

Code: Select all

Time spent in profile = 0.38140s.

Visits/calls:
- max = 1130, in R_BSPHyperPlane_RHS at 0x4a6c6, on line 7670
- 8004 in total
Executed instructions:
- max = 9600, in _subframe_block+456 at 0x1ef2c, on line 765
- 774686 in total
Used cycles:
- max = 192068, in _subframe_block+470 at 0x1ef3a, on line 770
- 6118580 in total
Instruction cache misses:
- max = 826, in _P_BlockLinesIterator+116 at 0x306e4, on line 3269
- 66773 in total
...
Executed instructions:
  30.50%                      236243                       render_wall_1x1
  23.76%                      184040                       R_TransparentColumnShader_Masked
  12.59%  12.60%  12.95%       97545     97585    100359   _subframe_block
   1.93%   1.93%   2.46%       14934     14948     19043   get_ssector
   1.91%   1.92%   2.82%       14788     14846     21848   _R_PointInSubsector
   1.87%   1.88%   2.76%       14464     14526     21400   _R_DrawColumn
   1.84%   1.84%  15.16%       14232     14246    117420   _P_RunThinkers
   1.77%                       13685                       R_VisPlaneShader
   1.41%                       10953                       R_StackTransparentSurface
   1.26%   1.26%   4.03%        9784      9798     31198   _R_DrawMaskedColumn
   1.26%   1.26%   1.26%        9756      9774      9774   _P_PointOnLineSide
   1.12%   1.12%   4.58%        8670      8674     35445   _P_BlockLinesIterator
   1.11%   1.11%   2.62%        8580      8598     20265   _PIT_AddLineIntercepts
   1.03%   1.03%   1.12%        7948      7968      8706   _V_DrawPatch
   1.02%   1.02%   1.02%        7905      7932      7932   _P_UpdateSpecials
   0.84%   0.84%   0.84%        6472      6510      6510   stream_texture
   0.81%   0.81%   5.38%        6247      6258     41680   _P_CheckPosition
   0.60%   0.60%   0.61%        4627      4634      4710   _I_SetPalette
   0.52%   0.52%   0.91%        4044      4064      7041   _P_BlockThingsIterator
   0.52%   0.53%   0.53%        4040      4095      4095   stack_visplane_area
...
Instruction cache misses:
   7.36%   7.38%  14.67%        4916      4925      9794   _PIT_AddLineIntercepts
   7.20%   7.21%  27.58%        4808      4811     18414   _P_BlockLinesIterator
   5.97%   5.98%  57.65%        3984      3990     38494   _P_RunThinkers
   5.42%   5.44%   5.44%        3622      3631      3631   _P_PointOnLineSide
   4.55%   4.56%  19.65%        3041      3047     13121   _P_CheckPosition
   4.36%   4.37%   7.36%        2914      2920      4916   _R_DrawMaskedColumn
   3.50%   3.50%   5.70%        2337      2337      3809   _PIT_CheckLine
   2.79%   2.84%   2.99%        1862      1895      1996   _R_DrawColumn
   2.35%   2.39%   2.58%        1569      1599      1725   _R_PointInSubsector
   2.34%   2.34%   9.74%        1565      1565      6505   _R_DrawVisSprite
   2.16%   2.17%   4.36%        1439      1450      2912   _P_BlockThingsIterator
   1.85%   1.85%   1.85%        1234      1234      1234   _PIT_CheckThing
   1.51%   1.51%  22.22%        1006      1009     14835   _P_TryMove
   1.50%   1.50%   1.50%        1003      1003      1003   _P_InterceptVector
   1.47%   1.47%   1.54%         980       980      1028   R_AddSpriteSpans
   1.46%   1.46%  22.71%         977       977     15167   _P_SetMobjState
   1.45%   1.48%  22.08%         970       986     14746   _P_PathTraverse
   1.38%   1.38%   1.95%         920       920      1304   _P_BoxOnLineSide
   1.17%   1.19%   1.65%         784       795      1101   _V_DrawPatch
   1.09%                         725                       render_wall_1x1
   1.01%   1.05%   1.43%         677       699       957   _subframe_block
   1.01%   1.01%   1.01%         672       672       672   _P_PointOnDivlineSide
   0.93%                         622                       build_ssector
DSP side:

Code: Select all

Used cycles:
  56.99%                     6974460                       command_base
  25.63%  26.40%  26.40%     3136488   3230166   3230166   R_DoColumnPerspCorrect
   2.56%                      313276                       R_ViewTestAddLine
   2.25%   2.25%   2.25%      275282    275282    275282   VPRenderSpanDT
   1.60%                      196398                       R_VPLoadTexture
   1.40%   1.80%   3.12%      171132    220076    382188   AddLowerWall
   1.28%                      156810                       R_DoColumnTextureUV
   1.23%   1.63%   2.01%      150234    199200    246442   AddUpperWall
   1.18%                      144416                       R_VPRenderPlane
   0.65%   0.65%   0.65%       79664     79664     79664   extract_subvisplane
   0.61%                       74764                       project_node
   0.60%   0.72%  25.77%       73810     88244   3152980   AddMidWall
Btw. Because of different user action, these kind of profiles can show wildly different data. One needs to check frame time first and know that it was really from gameplay, to know whether data in the profile is relevant compared to data in some other profile. :-)
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Thanks for the profiling changes. I'll avoid changing the makefile for now until I update from the repo. I'm not doing much with the game code at the moment so it shouldn't lead to any conflicts.

The profiles I have taken from gameplay usually involve a cheat (all weapons and ammo + invuln.) and then running across the map with chaingun or plasma gun on until the ammo is used up, then save the profile immediately. This maxes out the audio, sprites/things (esp. plasma gun) and attack raytests as well as keeping monsters busy. It does get quite slow while doing this, the profiling overhead begins to show quite a lot on my poor laptop.

It's not exactly representative of gameplay but it does hit some peak cost areas easily. The results are quite different from your 'worst frame' results but that's to be expected.

TBH the worst case peak cost is likely to be seen with a few big sprites near the viewplane using the old sprite system. It would probably be more valuable to find the worst frame in terms of 'large numbers of things' rather than simple fillrate costs which we know about (and hopefully will register much less in future). The invisible costs caused by many small things under strange conditions are more difficult to diagnose. This probably means repeating the same process but somehow keeping some distance from the monsters :-) easier said than done!

Perhaps recording a demo with the monsters visible, then replay with sprites disabled? Or just modifying the sprite code to draw every n'th line and with masking off.. something to remove the fillrate factor from the peak cost tracking.


I still can't commit any code yet but some big changes are on the way for the engine soon.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

The hint about using cheats was good. Now the first level in Doom 2 got this as worst frame:

Code: Select all

Time spent in profile = 0.41633s.
...
Executed instructions:
  54.82%                      517483                       R_TransparentColumnShader_Masked
  16.32%                      154026                       render_wall_1x1
  11.02%  11.03%  11.17%      104048    104088    105464   _subframe_block
   3.49%                       32921                       R_VisPlaneShader
   1.53%   1.53%   1.64%       14409     14413     15479   _V_DrawPatch
   1.52%                       14303                       R_StackTransparentSurface
   1.48%   1.48%   2.81%       13946     13964     26555   _P_RunThinkers
   1.30%   1.30%   1.30%       12294     12312     12312   _R_DrawColumn
   0.84%   0.84%   0.84%        7905      7919      7919   _P_UpdateSpecials
   0.69%   0.69%   0.69%        6472      6483      6483   stream_texture
   0.68%   0.69%   0.69%        6447      6491      6491   stack_visplane_area
   0.51%   0.51%   1.28%        4817      4848     12064   R_AddSpriteSpans
...
Instruction cache misses:
  11.27%  11.30%  21.01%        3247      3256      6053   _P_RunThinkers
   8.98%   9.03%  10.03%        2587      2603      2891   R_AddSpriteSpans
   5.14%                        1481                       R_TransparentColumnShader_Masked
   4.33%   4.33%   7.11%        1249      1249      2048   _R_DrawMaskedColumn
   3.76%   3.77%   5.30%        1083      1086      1528   _V_DrawPatch
   3.19%   3.19%  10.38%         918       918      2990   _R_DrawVisSprite
   3.02%   3.06%   3.06%         869       883       883   R_ViewTestSpriteLines
   2.74%   2.77%   2.77%         790       799       799   _R_DrawColumn
   2.60%   2.60%  13.93%         749       749      4014   _STlib_drawNum
   2.48%   2.55%   3.40%         714       736       980   _subframe_block
   2.21%                         637                       render_wall_1x1
   1.97%   1.97%   9.27%         568       568      2672   _P_SetMobjState
   1.88%   1.92%   6.03%         541       552      1737   _V_CopyRect
   1.77%                         511                       init_font
   1.77%   1.77%   1.93%         510       510       555   _G_BuildTiccmd
   1.50%   1.54%   1.54%         432       443       443   _frame_event
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

This is probably from pressing "TAB" key (which didn't show map) first time during game play:

Code: Select all

Time spent in profile = 4.53467s.
...
Visits/calls:
- max = 57300, in copy16_d at 0x4d53c, on line 7572
- 305036 in total
Executed instructions:
- max = 139354, in _V_DrawPatch+216 at 0x3e71c, on line 4352
- 4919752 in total
Used cycles:
- max = 2231180, in _V_DrawPatch+218 at 0x3e71e, on line 4353
- 72747424 in total
Instruction cache misses:
- max = 30187, in _V_CopyRect_+84 at 0x3e446, on line 4168
- 821387 in total
...
Executed instructions:
  26.41%  26.46%  29.36%     1299400   1301568   1444533   _V_DrawPatch
   6.99%   6.99%   7.03%      343824    343984    345724   _subframe_block
   6.91%   6.94%   7.35%      340068    341202    361532   _AM_drawWalls
   6.08%                      299334                       copy16_d
   4.08%                      200532                       copy16
   3.96%                      194790                       render_wall_1x1
   3.28%   3.29%   3.44%      161555    161834    169267   _W_CheckNumForName
   3.25%                      159867                       copy256_d
   3.13%                      154132                       copy256
   2.40%   2.41%   3.82%      118255    118477    188077   _P_RunThinkers
   2.14%                      105080                       less256_d
   2.08%   2.09%  28.29%      102374    102680   1391930   _V_CopyRect_
   1.98%                       97318                       common2
   1.92%                       94572                       top_down
   1.77%   1.77%   1.90%       86955     87192     93285   _P_UpdateSpecials
   1.62%   1.63%   8.61%       79840     80250    423704   _V_CopyRect
   1.39%   1.39%   1.47%       68187     68342     72404   _M_AddToBox
   1.25%   1.26%  10.96%       61530     61760    539093   _TryRunTics
   1.24%   1.24%  27.75%       60892     61027   1365073   _STlib_drawNum
   1.02%                       50260                       less256
   0.98%   0.98%   2.58%       48110     48378    126851   _V_MarkRect
   0.96%                       47189                       both_even_d
   0.95%                       46674                       R_VisPlaneShader
   0.95%  28.71%  32.59%       46602   1412567   1603581   _memcpy
   0.78%   0.78%   0.78%       38199     38285     38285   _AM_clipMline
   0.63%                       31068                       exit_4
   0.60%   0.60%   9.99%       29529     29585    491250   _HUlib_drawTextLine
   0.57%   0.57%   1.05%       28056     28133     51895   _I_GetTime
   0.54%   0.54%   6.22%       26354     26604    306187   _NetUpdate
   0.51%                       25076                       both_even
...
Instruction cache misses:
  11.59%  11.66%  11.74%       95210     95746     96411   _AM_drawWalls
  11.01%  11.14%  15.91%       90448     91539    130705   _V_DrawPatch
   4.99%   5.01%  22.51%       41002     41119    184882   _V_CopyRect_
   4.83%   4.84%  26.21%       39678     39741    215263   _STlib_drawNum
   4.47%                       36681                       both_even_d
   3.55%   3.57%  11.54%       29176     29338     94827   _V_CopyRect
   3.13%   3.14%   5.94%       25680     25797     48824   _P_RunThinkers
   3.11%   3.12%   3.13%       25515     25608     25738   _M_AddToBox
   3.10%   3.12%  19.87%       25495     25615    163240   _TryRunTics
   3.07%                       25250                       less256
   2.82%                       23188                       top_down
   2.76%   2.77%   5.93%       22648     22774     48698   _V_MarkRect
   2.70%   2.70%   2.70%       22138     22181     22181   _AM_clipMline
   2.62%  23.78%  24.53%       21485    195302    201489   _memcpy
   2.48%                       20344                       copy256
   2.27%   2.27%   8.44%       18620     18648     69294   _HUlib_drawTextLine
   2.13%                       17505                       copy16_d
   1.96%   1.98%   7.01%       16133     16269     57604   _NetUpdate
   1.85%   1.86%   2.36%       15236     15252     19366   _I_GetTime
   1.62%   1.63%   2.89%       13331     13390     23728   _GetPackets
   1.53%                       12562                       common2
   1.41%                       11586                       less2_d
   1.24%                       10188                       less256_d
4.5s during hectic game play is pretty long just to show a short text message...
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

I don't know much about the code activated when you hit TAB for AutoMap - but I expect its a lot of stuff that's *only* used in that state, and therefore very slow. Certainly looks like it in the profile.

(Since I'm not doing anything with AutoMap at the moment, it can probably be ignored for now).

I may try to get the current source into a decent state so you can retry with the newer sprite code for the extreme case above. I'm not finished with the other stuff I'm doing but it could be turned off temporarily.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

I have committed the most recent code with some work-in-progress disabled.

There are some important notes/caveats though:

- The very last makefile changeset contains edits to link 68000 libs instead of 68030+FPU, for the sake of FPU-less Falcons. this might not work with the native GCC 2.95 env (which I can't test) so you might have to fix it. I don't have a plain 68030 build of the gcc libs yet so this can be considered a temporary hack.

- New wall rendering code is active but texture unrolling is disabled (it breaks postmapped transparency shading) - so there will be small pixel-level glitches seen in most walls. please ignore those for now - they are not memory corruption and don't hurt anything.

- Transparency filling will be quicker than before, but profiling in Hatari will be inaccurate. there is a (deliberate) nonlinear speed gain with large sprites from rising data cache hitrate and the effect is quite significant. this effect is not seen in Hatari so it will appear much slower when sprites are large. still, some noticeable improvement should be found anyway.

- The new sprite system is probably slightly slower for very small sprites. I haven't checked this, but it's reasonable to assume. So profile runs which avoid drawing large sprites might register an overall rise in sprite cost. This doesn't have to be a permanent situation since different sprite code could be used for faraway sprites but I won't do anything with that until/unless I see a problem. If there is a difference its probably very small anyway.

- Wall transparencies (e.g. end room in e1m1) use the new method but could do with a bit more work since the number of sub-spans per column in these textures is quite large compared with typical sprites. I'm considering DSP-assisted postmap addressing to help speed up that (worst-) case kind of texture. In any case the current code is still much faster than the old one.

- This code hasn't been much tested and some of it is still work-in-progress. Worst case it will bomb out somewhere on an untested map :)
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Among other things (like speeding up transparency and replacing 2D overlay drawing), the problem I've been working on over the last few evenings is mipmap filtering for wall textures.

In fact the real aim is not mipmap filtering for nicer visuals (as is normally the case) but rather employing CPU datacache as much as possible for texture addressing and lighting indirection to make wall rendering faster - something which depends on a combination of mipmaps and colour index packing.

In other words, increasing the probability that the next texture sample *and* it's lighting table counterpart are both already in the datacache, for any given z-distance.

One added benefit is that rendering should look better - the textures will be filtered according to distance, and some truecolour capability will show a bit more without necessarily replacing textures.

I'm completely aware that a lot of this can be sidestepped by just not lighting stuff properly, or dropping depth-cue and prelighting whole walls at a time into a surface cache with only a few permitted light levels vs cache size. But Doom lighting is part of the game and look, and the TBH the real saving isn't massive - it gains you something like one window size step increase for about the same cost (if even that). This isn't a significant win versus what gets lost from the look so I'm trying to keep the visuals true to what was intended even if it makes the coding a bit harder.


It is taking some time to get everything working because mipmaps are a b***h when you have multiple texture formats (e.g. raw8, masking/postmaps, raw16), some of them are odd sizes (never mind not square) and some textures are virtually padded from their original size with some added unroll border...
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero, I can see the problem with demo replay having re-enabled demos again in my build.

It doesn't appear to be the problem I first thought (sound propagation optimizations not being disabled with TIMEBASE_CONTROL==1) but still have to narrow it down properly - may be related to P_AproxDistance & P_MobjThinker optimizations applied shortly afterwards. Those mods are limited to a couple of changesets so should be solved when I get time to go through them and test.

[EDIT]

A regression occurred at rev #093e4a182492 in the core.asm (DSP) module around 3 weeks ago. It might not be the only cause as it could be hiding other faults since then but the problem started there. Demos prior to that rev seem ok but this rev is clearly broken.

Fixed.

The level I had trouble with looks correct now with (TIMEBASE_CONTROL==1), but there may still be issues with the recursive sound propagation optimizations I made more recently with (TIMEBASE_CONTROL>=2). It's difficult to tell since demos will desync anyway but I'll try to check this in isolation later. For now your auto-profiling should work ok.

ALGO_P_CrossBSPNode probably needs re-profiling since this code was misbehaving and was one of the higher DSP costs.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

This case is from the very beginning of the 4th Doom1 level (TIMEBASE_CONTROL=3):

Code: Select all

Time spent in profile = 0.37274s.

Visits/calls:
- max = 794, in _P_PointOnDivlineSide at 0x30388, on line 3260
- 8745 in total
Executed instructions:
- max = 9600, in _subframe_block+456 at 0x1ef2c, on line 766
- 649621 in total
Used cycles:
- max = 192068, in _subframe_block+470 at 0x1ef3a, on line 771
- 5979720 in total
Instruction cache misses:
- max = 1110, in _P_PointOnDivlineSide+74 at 0x303d2, on line 3271
- 92849 in total
...
Executed instructions:
  24.59%                      159739                       R_DrawSurface_NW
  15.03%  15.04%  15.32%       97651     97711     99496   _subframe_block
   8.62%                       55986                       MARKER
   5.17%                       33576                       R_VisPlaneShader
   4.32%   4.33%  20.96%       28050     28133    136167   _P_RunThinkers
   3.62%   3.64%   5.78%       23529     23639     37535   _P_PointOnDivlineSide
   2.29%   2.29%   2.29%       14888     14906     14906   _R_DrawColumn
   2.05%   2.05%   8.00%       13308     13343     51953   _P_BlockLinesIterator
   1.99%   2.00%   3.03%       12944     12986     19698   stream_texture
   1.80%   1.80%   5.51%       11688     11709     35789   _PIT_AddLineIntercepts
   1.50%   1.50%   2.68%        9718      9753     17396   _V_DrawPatch
   1.26%   1.27%   1.27%        8176      8257      8257   stack_visplane_area
   1.21%   1.21%   1.23%        7844      7844      7991   R_AddSpriteSpans
   1.14%   1.14%   1.14%        7416      7416      7416   _R_PointInSubsector
   0.98%                        6392                       R_VisPlaneSkyShader
   0.94%                        6114                       R_StackTransparentSurface
   0.91%                        5913                       R_AddLine_loop
   0.88%   0.88%  12.83%        5691      5698     83350   _P_PathTraverse
   0.82%   0.84%   1.43%        5327      5459      9275   _BM_P_CrossBSPNode
   0.81%   0.81%   0.81%        5276      5283      5283   _P_UpdateSpecials
   0.78%                        5091                       R_SwitchSurface_T2
   0.75%   0.75%   4.48%        4894      4901     29086   _P_BlockThingsIterator
   0.73%   0.74%   0.74%        4760      4787      4787   _P_InterceptVector
   0.71%   0.71%   3.01%        4642      4642     19548   _R_DrawMaskedColumn
...
Instruction cache misses:
   9.68%   9.75%  10.04%        8990      9050      9320   _P_PointOnDivlineSide
   7.93%   7.94%  28.88%        7360      7375     26816   _P_BlockLinesIterator
   7.87%   7.92%  47.84%        7311      7352     44416   _P_RunThinkers
   7.73%   7.74%  18.85%        7178      7187     17501   _PIT_AddLineIntercepts
   3.62%   3.62%   3.70%        3357      3357      3437   R_AddSpriteSpans
   2.93%   2.95%   2.95%        2723      2737      2737   _P_InterceptVector
   2.91%   2.91%  35.73%        2701      2704     33174   _P_PathTraverse
   2.36%                        2194                       R_DrawSurface_NW
   2.12%   2.12%   4.64%        1972      1972      4309   _PIT_AddThingIntercepts
   2.08%   2.08%  34.58%        1931      1934     32107   _P_SetMobjState
   1.98%                        1838                       MARKER
   1.84%   1.84%   2.09%        1709      1712      1940   _PIT_CheckLine
   1.72%   1.74%   1.86%        1600      1616      1727   R_ViewTestSpriteLines
   1.65%   1.65%   6.84%        1533      1536      6353   _P_BlockThingsIterator
   1.48%   1.48%   2.43%        1378      1378      2252   _R_DrawMaskedColumn
   1.38%   1.39%   6.89%        1281      1292      6397   _P_CheckPosition
   1.34%                        1245                       build_ssector
   1.07%   1.08%   6.97%         995      1001      6475   R_FlushDeferredSurfaces
   1.03%   1.03%   3.48%         958       958      3234   _R_DrawVisSprite

subframe_block code using most cycles:

Code: Select all

...
$01ee06 :             bra       $1ee70                     0.00% (16, 128, 0)
[...]
$01ee70 :             move.l    d0,d4                      0.74% (4800, 19200, 32)
$01ee72 :             swap      d4                         0.74% (4800, 19264, 16)
$01ee74 :             move.b    (a0,d4.w),d6               0.74% (4800, 57728, 16)
$01ee78 :             add.l     d1,d0                      0.74% (4800, 68, 0)
$01ee7a :             move.l    (a1,d6.w*4),(a4)+          0.74% (4800, 115200, 0)
$01ee7e :             dbra      d7,$1ee70                  0.74% (4800, 19264, 0)
$01ee82 :             bra       $1ee86                     0.00% (16, 128, 0)
$01ee86 :             movea.w   #1,a6                      0.00% (16, 64, 0)
$01ee8a :             move.w    $2e(sp),d0                 0.00% (16, 128, 0)
$01ee8e :             cmp.w     a6,d0                      0.00% (16, 64, 16)
$01ee90 :             ble       $1ef50                     0.00% (16, 128, 16)
$01ee94 :             move.w    d0,$2c(sp)                 0.00% (16, 192, 16)
$01ee98 :             movea.w   a6,a0                      0.00% (32, 0, 0)
$01ee9a :             lea       $b6340,a1                  0.00% (32, 320, 16)
$01eea0 :             movea.l   (a1,a0.l*4),a5             0.00% (32, 576, 16)
$01eea4 :             move.l    $30(sp),d7                 0.00% (32, 448, 16)
$01eea8 :             subq.w    #1,d7                      0.00% (32, 64, 0)
$01eeaa :             movea.l   $34(sp),a4                 0.00% (32, 384, 0)
$01eeae :             move.l    4(a5),d0                   0.00% (32, 384, 0)
$01eeb2 :             move.l    8(a5),d1                   0.00% (32, 384, 0)
$01eeb6 :             movea.l   (a5),a0                    0.00% (32, 448, 16)
$01eeb8 :             movea.l   $14(a5),a1                 0.00% (32, 448, 16)
$01eebc :             clr.l     d6                         0.00% (32, 0, 0)
$01eebe :             bra       $1ef2c                     0.00% (32, 256, 0)
[...]
$01ef2c :             move.l    d0,d4                      1.48% (9600, 38336, 36)
$01ef2e :             swap      d4                         1.48% (9600, 38468, 17)
$01ef30 :             move.b    (a0,d4.w),d6               1.48% (9600, 115324, 16)
$01ef34 :             add.l     d1,d0                      1.48% (9600, 68, 2)
$01ef36 :             move.l    (a1,d6.w*4),d4             1.48% (9600, 153656, 0)
$01ef3a :             add.l     d4,(a4)+                   1.48% (9600, 192068, 18)
$01ef3c :             dbra      d7,$1ef2c                  1.48% (9600, 38660, 16)
$01ef40 :             bra       $1ef44                     0.00% (32, 320, 16)
...
Code doing most cache misses:

Code: Select all

_P_PointOnDivlineSide:
$030388 :             movem.l   d2-d4,-(sp)                0.12% (794, 27116, 397)
$03038c :             move.l    $10(sp),d0                 0.12% (794, 11120, 398)
$030390 :             move.l    $14(sp),d1                 0.12% (794, 11120, 398)
$030394 :             movea.l   $18(sp),a0                 0.12% (794, 11120, 398)
$030398 :             move.l    8(a0),d4                   0.12% (794, 12456, 718)
$03039c :             bne.s     $303b6                     0.12% (794, 4776, 2)
[...]
$0303b6 :             move.l    $c(a0),d2                  0.12% (794, 9528, 0)
$0303ba :             bne.s     $303d2                     0.12% (794, 9280, 763)
[...]
$0303d2 :             move.l    d0,d3                      0.12% (794, 1604, 1110)
$0303d4 :             sub.l     (a0),d3                    0.12% (794, 9528, 0)
$0303d6 :             sub.l     4(a0),d1                   0.12% (794, 9528, 0)
$0303da :             move.l    d2,d0                      0.12% (794, 2784, 696)
$0303dc :             eor.l     d4,d0                      0.12% (794, 392, 0)
$0303de :             eor.l     d3,d0                      0.12% (794, 3176, 696)
$0303e0 :             eor.l     d1,d0                      0.12% (794, 3176, 0)
$0303e2 :             blt.s     $30402                     0.12% (794, 6992, 696)
$0303e4 :             asr.l     #8,d2                      0.08% (551, 2208, 2)
$0303e6 :             asr.l     #8,d3                      0.08% (551, 4156, 488)
$0303e8 :             muls.l    d3,d2,d3                   0.08% (551, 27640, 298)
$0303ec :             move.w    d3,d2                      0.08% (551, 2204, 0)
$0303ee :             swap      d2                         0.08% (551, 3396, 298)
$0303f0 :             asr.l     #8,d1                      0.08% (551, 2204, 0)
$0303f2 :             asr.l     #8,d4                      0.08% (551, 3396, 298)
$0303f4 :             muls.l    d4,d1,d4                   0.08% (551, 27640, 298)
$0303f8 :             move.w    d4,d1                      0.08% (551, 2204, 0)
$0303fa :             swap      d1                         0.08% (551, 2656, 113)
$0303fc :             cmp.l     d1,d2                      0.08% (551, 2204, 0)
$0303fe :             sle       d0                         0.08% (551, 2656, 113)
$030400 :             bra.s     $30406                     0.08% (551, 4408, 0)
$030402 :             eor.l     d3,d2                      0.04% (243, 972, 138)
$030404 :             slt       d0                         0.04% (243, 972, 0)
$030406 :             extb.l    d0                         0.12% (794, 3812, 272)
$030408 :             neg.l     d0                         0.12% (794, 3176, 0)
$03040a :             movem.l   (sp)+,d2-d4                0.12% (794, 31816, 0)
$03040e :             rts
Attached is a callgraph of how the cache misses pile up from it.
You do not have the required permissions to view the files attached to this post.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Thanks.

I recommend filtering out symbols like R_SwitchSurface[*] and MARKER as these are hiding or breaking up real functions in the profile view. I still have some things to tidy up.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote: - max = 794, in _P_PointOnDivlineSide at 0x30388, on line 3260
This one is a bit interesting, because it's on my 'list'.
Eero Tamminen wrote: Instruction cache misses:
- max = 1110, in _P_PointOnDivlineSide+74 at 0x303d2, on line 3271
- 92849 in total
It's also registering multiple offences :)
Eero Tamminen wrote: ...
Executed instructions:
24.59% 159739 R_DrawSurface_NW
15.03% 15.04% 15.32% 97651 97711 99496 _subframe_block
8.62% 55986 MARKER
5.17% 33576 R_VisPlaneShader
I think 'MARKER' is part of the new sprite routine, IIRC. I used the label for size calculations and it's still in there. Filter that label and the real name will show I think.
Eero Tamminen wrote: subframe_block code using most cycles:

Code: Select all

...
$01ee06 :             bra       $1ee70                     0.00% (16, 128, 0)
[...]
$01ee70 :             move.l    d0,d4                      0.74% (4800, 19200, 32)
$01ee72 :             swap      d4                         0.74% (4800, 19264, 16)
$01ee74 :             move.b    (a0,d4.w),d6               0.74% (4800, 57728, 16)
$01ee78 :             add.l     d1,d0                      0.74% (4800, 68, 0)
$01ee7a :             move.l    (a1,d6.w*4),(a4)+          0.74% (4800, 115200, 0)
$01ee7e :             dbra      d7,$1ee70                  0.74% (4800, 19264, 0)
[/quote]

subframe_block is the inner loop of the audio mixer, and it is temporary - not at all optimized. We can ignore it for now.

[quote="Eero Tamminen"]
Code doing most cache misses:[code]
_P_PointOnDivlineSide:
$030388 :             movem.l   d2-d4,-(sp)                0.12% (794, 27116, 397)
$03038c :             move.l    $10(sp),d0                 0.12% (794, 11120, 398)
$030390 :             move.l    $14(sp),d1                 0.12% (794, 11120, 398)
$030394 :             movea.l   $18(sp),a0                 0.12% (794, 11120, 398)
$030398 :             move.l    8(a0),d4                   0.12% (794, 12456, 718)
$03039c :             bne.s     $303b6                     0.12% (794, 4776, 2)
[/quote]

This is one for the DSP. In fact it's already on the DSP for LOS tests but it's not being used much by the game yet. Same goes for InterceptVector() variants.

[quote="Eero Tamminen"]
Attached is a callgraph of how the cache misses pile up from it.[/quote]

hehe. I'll take a look - fortunately it will be easy to 'disappear' that one.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3989
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Note that 'R_DrawSurface_NW' is the new 'no-wrapping' path for the old 'render_wall_1x1' which we've been seeing in profiles for a long time now. There are other paths but if most of the walls are fairly near the viewer they won't activate.

...and it seems 'MARKER' is hiding the new transparency/sprite (postmapping) shader.

Code: Select all

*-------------------------------------------------------*
R_SpriteColumnShader_Masked2:
MARKER: ;25d58
*-------------------------------------------------------*
			rsreset
*-------------------------------------------------------*
.palette:		rs.l	1
.postmap:		rs.l	1
.source:		rs.l	1
*-------------------------------------------------------*
.frame_:		rs.b	0	
*-------------------------------------------------------*

Return to “680x0”