Bad Mood : Falcon030 'Doom'

All 680x0 related coding posts in this section please.

Moderators: Zorro 2, Moderator Team

User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote: Thanks, it's not anymore so visible in profiles.
Good news.
Eero Tamminen wrote: Audio went from 8% CPU usage to 4.3%, which improved the average FPS from 2.20 to 2.26, a 2.7% FPS improvement.
It's clear, but a bit modest improvement.
Sure - it's a modest improvement in the whole scheme of costs, because it was just a single thing - there are lots of other modest/small costs elsewhere which are adding up to drown it.

I think this is probably good enough to move on. It's doing exactly the same amount of work as before, just taking half the time.
Eero Tamminen wrote: I'll hopefully have time tomorrow to check how large impact that and your earlier optimizations had on worst frame numbers.
Those other changes are more closely related to game/thinking time so that's where to look if you get the opportunity.


Eero Tamminen wrote: Profile for hash insert which is 3rd on list and has instructions with highest instruction & cycle counts, looks like this:
hash_insert:
Yep I'm aware there is a *LOT* of loading/initialization oriented code which takes its time (like building the WAD directory hashtable ^^^^). I never spent any time optimizing that stuff and I may never do it - there are too many runtime costs to work on, which get in the way of gameplay.

The most important loading cost is the texture preprocessing because it turns seconds into minutes of waiting - and I will deal with that - but the other stuff, it depends. I may be working on some other project first :)
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

Eero Tamminen wrote: Audio went from 8% CPU usage to 4.3%, which improved the average FPS from 2.20 to 2.26, a 2.7% FPS improvement. It's clear, but a bit modest improvement. I'll hopefully have time tomorrow to check how large impact that and your earlier optimizations had on worst frame numbers.
Things don't add up correctly. As these went also away:

Code: Select all

   5.43%   5.43%   5.75%     7228956   7239562   7663211   D_AllocPreview_Dynamic
   2.42%   2.42%   2.55%     3218664   3223277   3392055   D_CachePreview
   1.87%   1.86%   2.03%     2493620   2483560   2705661 * D_AllocPreview_App
   1.85%   1.85%   1.99%     2458524   2463077   2657742   D_AllocPreview_Fixed
   1.84%   1.84%   1.98%     2448028   2452871   2636755   D_AllocPreview_Unknown
   1.15%   1.15%   1.25%     1530120   1533695   1669126   D_AllocPreview_Free
There should be much larger change to FPS than 2.7%, more like 10-20%.

Or are these debug things and some part of audio done while CPU would be waiting for DSP i.e. free?
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

I did worst-frame profiling for the TIMERBASE_CONTROL=1 Doom II timedemo, from first attack until player dies.

Results were "interesting".

There are still sprite and texture loads:

Code: Select all

grep -A1 "m a0" hatari.log |cut -b 67-|grep -v '^$'|tr -d .|sed 's/^/- /' 
- PUFFB0
- PUFFA0
- BLUDB0
- BLUDA0
- PUFFC0
- PUFFD0
- TROOD4D6
- TROOA3A7
- TROOE1
- TROOF1
- RW20_2
- TROOG1
- BAL1B0
- DOOR9_2
- BAL1A0
- BLUDC0
- TROOA5
- TROOB4B6
- TROOC2C8
- TROOD2D8
- TROOE2E8
- TROOB3B7
- TROOI0
- TROOJ0
- TROOK0
- TROOL0
- BAL2A0
- RW21_4
- SW2_7
- DOOR9_1
- W46_37
- TROOD1
- TROOC1
- TROOB5
- TROOA4A6
- W46_38
- POSSC1
- POSSD1
- TROOC5
- TROOD5
- POSSD2D8
- POSSB3B7
- POSSC3C7
- POSSD3D7
- POSSE1
- POSSF1
There were also extra reads from W_CacheLumpNum().

Worst frame rendering takes actually 8.5s, in middle of the game, and it takes that long because several HD textures are loaded within that single frame:
- RW21_4
- SW2_7
- DOOR9_1
- W46_37

Rendering time goes on CPU side to:

Code: Select all

Time spent in profile = 8.48290s.
...
Visits/calls:
  55.10%  55.22%      147432    147764   correct_element
  10.30%               27564             repair_start
  10.30%               27558             repair_next
  10.27%  10.44%       27470     27940   hash_insert
  10.26%               27461             repair_found
   0.63%                1697             init_font

Executed instructions:
  40.75%  40.81%  41.63%     6528015   6537659   6669137   hash_insert
  17.49%  17.53%  18.11%     2801664   2808880   2901734   correct_element
  11.97%  12.00%  12.33%     1917412   1921829   1975683   D_TextureRemapPixelSubBlock_16_8
   9.82%   9.83%  10.00%     1572725   1574909   1601761   D_TextureMipGen_8_16
   5.84%   5.85%  24.10%      936018    937792   3861037   create_local_palette_64levels
   2.07%   2.07%   2.07%      332212    332432    332432   _BM_A_Mux1x2
   1.88%   1.88%   1.95%      301398    301754    312610   D_GeneratePostMap
   1.84%   1.84%  34.95%      294954    295480   5600062   D_TextureRemapInit
   1.64%                      263336                       D_PatchRender_16_16
   1.54%                      247149                       repair_found
   0.83%   0.83%   0.85%      132216    132503    136762   D_TexturePatchLoad
   0.75%                      120373                       D_PatchRender_8_8
   0.56%                       89306                       R_DrawSurface_NoTileNoMip
   0.52%                       82710                       repair_next
...
Instruction cache misses:
  12.37%  12.37%  20.74%       10915     10915     18304   _subframe_block
  10.10%                        8914                       init_font
   8.24%   8.37%   8.37%        7268      7389      7389   _BM_A_Mux1x2
   7.19%   7.20%   7.20%        6344      6355      6355   _frame_event
   6.43%  10.28%  18.24%        5674      9068     16097   correct_element
   5.80%   8.22%  12.91%        5120      7250     11392   D_TextureRemapPixelSubBlock_16_8
   4.07%                        3588                       _audio_mux_asm
   3.74%   4.95%   7.25%        3298      4367      6401   D_TextureMipGen_8_16
   3.69%   3.69%  31.63%        3257      3257     27916   _audio_mux_frame
   2.74%   2.77%   3.08%        2415      2440      2716   R_AddSpriteSpans
   2.63%   3.50%  23.72%        2318      3090     20930   create_local_palette_64levels
   1.72%   5.83%  17.33%        1516      5147     15289   hash_insert
   1.68%                        1481                       build_ssector
   1.57%   1.57%   1.57%        1389      1389      1389   R_ViewTestSpriteLines
   1.31%                        1155                       unlink_chunk
   1.17%                        1035                       R_AddLine_loop
   1.13%   1.16%  26.66%         995      1026     23529   D_CacheFlushPartial
   1.10%   1.11%  25.66%         968       982     22647   deallocate_chunk
   0.96%   0.96%  83.59%         846       846     73762   R_FlushDeferredSurfaces
   0.96%   0.96%   1.05%         845       851       924   internal_allocate
   0.93%                         825                       repair_found
   0.77%   5.10%  24.55%         678      4501     21665   internal_deallocate
   0.76%   0.76%  85.73%         675       675     75651   R_SubSectorTryFlush
Hash usage may need a serious look. Attached is a callgraph. If I'm reading it correctly, the hash usage comes mostly through following chain: D_TextureRemapInit -> allocate_scratch -> deallocate_chunk -> internal_deallocate -> insert_hash.

Would it be possible to use (without bloating memory usage too much) some structure that you can blow away in one go?


EDIT: On DSP side R_DoColumnPerspCorrect() is just waiting for data:

Code: Select all

p:0474  0aa981 000474  (06 cyc)  jclr #1,x:$ffe9,p:$0474                          97.88%
You do not have the required permissions to view the files attached to this post.
Last edited by Eero Tamminen on Tue Aug 06, 2013 9:57 pm, edited 1 time in total.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote:
Things don't add up correctly. As these went also away:

There should be much larger change to FPS than 2.7%, more like 10-20%.

Or are these debug things and some part of audio done while CPU would be waiting for DSP i.e. free?
Those are debug graphs for checking memory allocation/leaks. They must have been checked in enabled in an earlier revision you were testing. They are expensive and should normally be off.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:
Eero Tamminen wrote:
Things don't add up correctly. As these went also away:

There should be much larger change to FPS than 2.7%, more like 10-20%.

Or are these debug things and some part of audio done while CPU would be waiting for DSP i.e. free?
Those are debug graphs for checking memory allocation/leaks. They must have been checked in enabled in an earlier revision you were testing. They are expensive and should normally be off.
Yes. That's why I was wondering why there wasn't larger difference in FPS.

My previous build might have been from defersurfs branch and therefore miss the large mipmaps changes you did. If mipmap commit caused the single 8.5s frame (I don't remember such freeze from earlier run), it might explain where the FPS went.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote:I did worst-frame profiling for the TIMERBASE_CONTROL=1 Doom II timedemo, from first attack until player dies.

Results were "interesting".

There are still sprite and texture loads:
Given that the earlier list was made from a profiling run, it's probably not exhaustive, no. We should probably expect a few more to turn up depending on the level etc.

However some of those look like they were already added to the list (?) so I'm not sure what's going on there, unless the precaching is being fiddled with in a game-specific way.
Eero Tamminen wrote: Worst frame rendering takes actually 8.5s, in middle of the game, and it takes that long because several HD textures are loaded within that single frame:
- RW21_4
- SW2_7
- DOOR9_1
- W46_37
Yes I'm going to ignore that for now as the local texture caching should deal with the slow parts.

Eero Tamminen wrote: Rendering time goes on CPU side to:

Code: Select all

Time spent in profile = 8.48290s.
...
Visits/calls:
  55.10%  55.22%      147432    147764   correct_element
  10.30%               27564             repair_start
  10.30%               27558             repair_next
  10.27%  10.44%       27470     27940   hash_insert
  10.26%               27461             repair_found
   0.63%                1697             init_font
[/quote]

That all looks loading-specific. Texture gamma correction and memory management.

[quote="Eero Tamminen"]
Hash usage may need a serious look.  Attached is a callgraph.  If I'm reading it correctly, the hash usage comes mostly through following chain: D_TextureRemapInit -> allocate_scratch -> deallocate_chunk -> internal_deallocate -> insert_hash.
[/quote]

That particular hash is for memory defragmentation. Again, it's all related to texture loading and remapping. It should just go away once the textures have been generated once (probably as an exhaustive batch, by starting doom with a commandline switch).

[quote="Eero Tamminen"]
DSP is busy for whole 8.5s?[/quote]

I think the DSP is just stalled, waiting for the CPU to finish messing with textures in the middle of a render. 

I have just fixed a few problems remaining in the mipmapping and texture loading so the glitches are gone and all textures except transparencies now mip properly. I'm still trying to fix one issue with 2-sided middle textures that aren't actually transparent as this is now temporarily broken. Should be a new version in by tomorrow.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote:
Yes. That's why I was wondering why there wasn't larger difference in FPS.

My previous build might have been from defersurfs branch and therefore miss the large mipmaps changes you did. If mipmap commit caused the single 8.5s frame (I don't remember such freeze from earlier run), it might explain where the FPS went.
That would make sense, yes. The mipmapping can be turned off and the original textures used as before, I'll try to add a switch or something for that - although it's a bit more hassle as it's in the assembler code and I'd need to pass something through the API to toggle it, or something. But the mipmapping loading times are probably skewing things a lot yes.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:
Eero Tamminen wrote: Worst frame rendering takes actually 8.5s, in middle of the game, and it takes that long because several HD textures are loaded within that single frame:
- RW21_4
- SW2_7
- DOOR9_1
- W46_37
Yes I'm going to ignore that for now as the local texture caching should deal with the slow parts.
This frame was when player was shooting at the orb across the open area.

There were also 4.5s, 2.6s, 1.6s 1.3s and 1.1s frames. I guess they were slow for the same reason. Rest of the frames were 0.35s or less.

I commited change that does worst frame profiling automatically for timedemo & TIMERBASE_CONTROL=1, you just need to give "-w" option to profile.sh (and "-s" to get info on resource loads to hatari.log).
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Ok I finished fixing the mipmap/tiling/transparency detection stuff to set up the texture flags properly for rendering, so everything should look correct again. Code is checked in.


If you want to disable mipmaps to reduce the texture loading times, you can open up this file:

bmengine\include\bldconfig.inc

...and place a semicolon in front of this label:

bldcfg_wall_mipgen

(You should find it on line #71 of the file)


As for the reason its loading textures in the middle of the level - the most obvious explanation is the texture cache fragmenting or filling up. Fragmentation would also explain why the memory allocator is going a bit nuts. This *will* happen if the demo is allowed to run more than one level with all the texture preprocessing stuff happening - but I wouldn't expect it to happen *within* a single level, unless there's a leak or the level is using a huge number of textures. Since I haven't been testing Doom II with the latest code, it might just be bigger levels. In any case, it won't happen when the textures are loaded locally.

Just in case, I'll take a look at Doom II here tomorrow to make sure things are working properly. Which level were you using?
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote: I commited change that does worst frame profiling automatically for timedemo & TIMERBASE_CONTROL=1, you just need to give "-w" option to profile.sh (and "-s" to get info on resource loads to hatari.log).
Ok I'll give that a try. Thanks.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote: There are still sprite and texture loads:

Code: Select all

grep -A1 "m a0" hatari.log |cut -b 67-|grep -v '^$'|tr -d .|sed 's/^/- /' 
- PUFFB0
- PUFFA0
- BLUDB0
- BLUDA0
- PUFFC0
- PUFFD0
- TROOD4D6
- TROOA3A7
- TROOE1
- TROOF1
- RW20_2
- TROOG1
- BAL1B0
- DOOR9_2
- BAL1A0
- BLUDC0
- TROOA5
- TROOB4B6
- TROOC2C8
- TROOD2D8
- TROOE2E8
- TROOB3B7
- TROOI0
- TROOJ0
- TROOK0
- TROOL0
- BAL2A0
- RW21_4
- SW2_7
- DOOR9_1
- W46_37
- TROOD1
- TROOC1
- TROOB5
- TROOA4A6
- W46_38
- POSSC1
- POSSD1
- TROOC5
- TROOD5
- POSSD2D8
- POSSB3B7
- POSSC3C7
- POSSD3D7
- POSSE1
- POSSF1
I did a quick check just now using DOOMU.WAD and there are no resource cache leaks - there is a lot of fragmentation due to the multi-stage texture conversion process, especially for highcolour external textures but it takes a couple of levels for the 14MB ram pool to get used up by that and the cache begins kicking stuff out to make space.

In other words, the first level played doesn't use up the cache so all of the precached material should still be in there. i.e. the precache list should be effective for at least the first level (probably not subsequent levels, currently).

I also checked the precache list and PUFF???? is definitely in there.

I haven't tried this with Doom II yet, I'll do that next.

So my next question would be - are you sure only one level was played before the PUFF sprites got 'reloaded'? If this is the case, it can only be due to two things - either the Doom II level is just too big with texture conversion running 'live', or the precaching just isn't working at all (for sprites).

I'll see what else I can find.


P.S. I was going to look at sprite clipping issues next, but it seems like texture conversion is a big pain for profiling etc. and it's annoying to wait, so I may just look at local caching / offline pre-conversion for that first. This will also help get a better understanding of true memory footprint and loading times.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:So my next question would be - are you sure only one level was played before the PUFF sprites got 'reloaded'? If this is the case, it can only be due to two things - either the Doom II level is just too big with texture conversion running 'live', or the precaching just isn't working at all (for sprites).
I just let Doom II start without any args and worst frame profiling begins automatically when player shoots for the first time in timedemo. Profiling ends when player dies at which point Hatari is quit. I.e. there should be only single level, unless Doom loads one at startup without showing anything on screen.

I've updated "eero.conf" to do above, you can use that as your own profile.conf example.

dml wrote: P.S. I was going to look at sprite clipping issues next, but it seems like texture conversion is a big pain for profiling etc. and it's annoying to wait, so I may just look at local caching / offline pre-conversion for that first. This will also help get a better understanding of true memory footprint and loading times.
Sounds good to me, the wait indeed is a bit annoying. :-)
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

I now have BM loading pre-conditioned 'render ready' textures from a local directory (and obviously - generating them as part of the loading process!). So the first time you start it up it does take its time and uses quite a lot of RAM to process all the image data, but the second time it loads really fast and uses much less ram.

The textures will persist for as long as you keep the cache folder around. If you delete the cache folder, it will generate it again next time.

This cache deals with engine resources directly so all the high-definition textures and mipmapping/colour remapping stuff gets collapsed into it. Level loading time (for textures anyway) is now just a function of disk speed and the size of the level.


This is currently working for wall textures. I'm hoping to cover sprites next, and then floor/ceiling textures (which are a bit different, but simpler).

As mentioned before, I plan to add a commandline switch at some point which converts the whole WAD into locally cached data so it runs quickly on the first try.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Code for that is now checked in. Last change creates the cache directory automatically, which I was having to do by hand previously.

Will look at sprites next.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Sprites are now locally cached also.

Should be possible to profile the game now without nasty pauses happening during gameplay. Floor textures not locally cached yet but they have always been perfectly precached by BM so this won't get in the way of profiling, even if it does mean loading times are still longer than they should be.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

Much better now!

Here's the worst frame from Doom II timedemo.

Worst thinking, CPU side:

Code: Select all

Time spent in profile = 0.28942s.
...
Executed instructions:
  16.76%  17.00%  18.91%       76562     77616     86381   _BM_P_CrossBSPNode
  13.69%  13.71%  14.26%       62538     62610     65144   _R_PointInSubsector
   7.63%   7.65%  70.16%       34848     34950    320403   _P_RunThinkers
   5.05%   5.07%  30.15%       23049     23141    137669   _P_CheckPosition
   4.47%   4.48%   4.48%       20423     20461     20461   _PIT_CheckThing
   4.36%   4.38%   8.86%       19931     19983     40444   _P_BlockThingsIterator
   4.18%   4.20%  12.50%       19080     19167     57063   _P_PathTraverse
   3.69%   3.70%   3.70%       16860     16880     16880   _BM_A_Mux3x2
   3.31%   3.32%  14.14%       15131     15145     64590   _P_RecursiveSound
   3.30%   3.31%   3.31%       15060     15114     15114   _PIT_AddLineIntercepts_L
   3.18%   3.19%   3.30%       14521     14559     15087   _V_DrawPatch
   3.09%   3.09%   3.09%       14106     14106     14106   _BM_A_Mux2x2
   1.84%   1.84%   4.60%        8388      8419     20986   _P_BlockLinesIterator
   1.77%   1.77%   1.77%        8080      8094      8094   _P_UpdateSpecials
   1.54%   1.55%  21.25%        7038      7062     97067   _BM_P_CheckSight
   1.41%   1.41%   1.41%        6452      6459      6459   _R_DrawColumn
   1.35%   1.35%   2.20%        6176      6183     10033   _PIT_CheckLine
   1.23%   1.23%  30.53%        5612      5612    139436   _P_Move
   1.19%   1.19%  34.30%        5414      5428    156629   _P_TryMove
   1.17%   1.17%  17.76%        5360      5360     81094   _P_LookForPlayers
   1.07%   1.08%   3.06%        4908      4943     13968   _R_DrawVisSprite
   0.88%   0.88%   1.56%        4002      4002      7141   _PIT_AddThingIntercepts
   0.78%   0.78%  55.52%        3561      3575    253535   _P_SetMobjState
   0.72%   0.72%   0.72%        3310      3310      3310   _P_BoxOnLineSide
   0.69%                        3157                       copy16_d
   0.69%   0.69%   0.69%        3135      3139      3139   _P_PointOnDivlineSide
   0.57%   0.57%  26.53%        2621      2621    121153   _P_NewChaseDir
...
Instruction cache misses:
  10.27%  10.31%  37.44%       12138     12186     44268   _P_CheckPosition
  10.10%  10.12%  17.32%       11943     11969     20478   _P_BlockThingsIterator
   7.18%   7.20%   7.20%        8489      8509      8509   _PIT_CheckThing
   7.12%   7.63%   7.94%        8419      9016      9383   _BM_P_CrossBSPNode
   6.00%   6.04%  72.50%        7088      7137     85714   _P_RunThinkers
...
Visits/calls:
  11.40%  11.40%         860       860   _PIT_CheckThing
   7.70%                 581             copy16_d
   7.53%  18.92%         568      1428   _P_BlockThingsIterator
   4.25%  19.90%         321      1502   _P_RecursiveSound
   4.24%   5.41%         320       408   _PIT_CheckLine
   3.17%   3.22%         239       243   _R_PointInSubsector
   3.10%   3.10%         234       234   _PIT_AddLineIntercepts_L
   2.49%  35.03%         188      2643   _P_TryMove
   2.49%  30.86%         188      2329   _P_CheckPosition
   2.17%  32.59%         164      2459   _P_Move
...
DSP side:

Code: Select all

Used cycles:
  78.52%                     7291176                       command_base
  10.51%                      975986                       ALGO_P_CrossBSPNode
   6.07%                      563742                       P_CrossSubsector_body
   1.74%   1.74%   1.74%      161518    161630    161630   InterceptVectorsUF
   1.11%                      102852                       Divs48_Real
   1.07%                       99426                       ALGO_P_LineIntercept
   0.68%   0.68%   0.68%       62882     62882     62882   TestLineSegVectorBisection
...
Visits/calls:
  22.18%  22.18%        1367      1367   TestLineSegVectorBisection
  21.11%  21.11%        1301      1301   InterceptVectorsUF
  14.44%                 890             P_CrossSubsector_body
   9.44%                 582             VECOP_return
   8.99%   8.99%         554       554   InterceptVectors
   8.88%                 547             Divs48_Real
   5.58%                 344             command_base
   3.80%                 234             ALGO_P_LineInterceptL
   3.80%                 234             ALGO_P_LineIntercept
   1.66%                 102             ALGO_P_CrossBSPNode
Worst rendering, CPU side:

Code: Select all

Time spent in profile = 0.23794s.
...
Executed instructions:
  25.22%                      116472                       R_SpriteColumnShader_Masked2
  12.47%                       57580                       R_AdvanceSurface_NMip0
  10.47%                       48341                       R_VisPlaneSkyShader
   7.07%                       32651                       R_AdvanceSurface_TMip3
   6.57%   6.57%   6.57%       30348     30368     30368   _BM_A_Mux3x2
   3.50%   3.51%   3.51%       16180     16211     16211   stream_texture
   3.50%                       16164                       R_VisPlaneShader
   3.34%                       15421                       R_AdvanceSurface_TMip2
   2.90%                       13416                       R_AdvanceSurface_TMip0
   2.88%                       13320                       R_AdvanceSurface_NMip3
   2.30%   2.33%   2.33%       10628     10762     10762   stack_visplane_area
   1.88%                        8673                       R_StackTransparentSurface
   1.81%   1.81%   1.90%        8371      8378      8777   R_AddSpriteSpans
   1.44%                        6647                       R_AddLine_loop
   1.41%                        6490                       R_BSPHyperPlane
   1.32%   1.32%   1.32%        6087      6107      6107   R_ViewTestSpriteLines
   0.99%                        4563                       R_AdvanceSurface_TMip4
   0.93%   0.94%   2.95%        4300      4321     13614   get_ssector
   0.89%                        4094                       build_ssector
   0.87%                        4024                       R_DrawSurface_NMip
   0.77%   0.77%   0.77%        3569      3569      3569   add_wall_segment
   0.72%                        3308                       R_BSPHyperPlane_LHS
   0.67%   0.67%   0.67%        3079      3086      3086   init_stategroups
   0.63%                        2892                       R_BSPHyperPlane_RHS
...
Instruction cache misses:
  15.67%  15.68%  16.38%        4337      4340      4533   R_AddSpriteSpans
   7.31%   7.35%   7.35%        2023      2034      2034   R_ViewTestSpriteLines
   6.87%                        1900                       build_ssector
   5.89%                        1631                       R_SpriteColumnShader_Masked2
   4.94%                        1367                       R_AddLine_loop
   4.58%   4.58%  25.43%        1268      1268      7037   R_FlushDeferredSurfaces
   3.17%   3.17%  32.97%         878       878      9122   R_SubSectorTryFlush
...
DSP side:

Code: Select all

Used cycles:
  33.26%                     2539446                       command_base
  25.87%  27.67%  27.67%     1975150   2112702   2112702   R_DoColumnPerspCorrect
   9.52%                      726436                       R_VPRenderSky
   6.39%                      487776                       R_VPLoadTexture
   5.11%   5.11%   5.11%      390420    390420    390420   VPRenderSpanDT
   2.68%   2.68%   2.68%      204410    204410    204410   extract_subvisplane
   2.62%                      199692                       R_DoColumnTextureUV
   2.47%                      188906                       R_ViewTestAddLine
   1.64%                      125394                       project_node
   1.49%   1.83%   6.89%      113984    139714    526364   AddLowerWall
   1.03%                       78942                       R_CheckBBoxPair
   0.94%   1.14%  22.31%       71788     86794   1703498   AddMidWall
   0.92%   1.17%   3.56%       70128     89114    271488   AddUpperWall
   0.62%   0.62%   0.62%       47156     47156     47156   R_BufferSurface
   0.54%   0.78%   2.32%       40966     59924    177176   AddTransWall
...
Visits/calls:
  11.02%                 886             R_DoColumnTextureUV
   8.07%                 649             lowerwall_skip
   6.96%   6.96%         560       560   R_DoColumnPerspCorrect
   5.92%                 476             upperwall_skip
   5.87%                 472             command_base
   5.14%                 413             trans_skip
   4.20%                 338             midwall_skip
   3.72%   3.72%         299       299   R_DoColumnConstantClone
   3.51%                 282             R_SetSSectorLuma
   2.74%                 220             oct_5
   2.74%                 220             R_CheckBBox_end
   2.74%   4.10%         220       330   R_CheckBBox
   2.65%                 213             project_node
   2.39%   2.39%         192       192   VPRenderSpanDT
   2.39%                 192             VPRenderPlane_continue
   2.14%                 172             R_CheckBBox_pass
   1.74%   1.74%         140       140   divs_x1_a
   1.69%   1.69%         136       136   new_luminance
At least with TIMERBASE_CONTROL=1, GCC 2.x and Hatari WinUAE CPU core timings, thinking can take more instructions in Doom II timedemo than rendering.

As getting worst frame timings for Doom II timedemo is now trivial, you might want to do it also for GCC 4.x build... :-)
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote:Much better now!
Good :)
Eero Tamminen wrote: 16.76% 17.00% 18.91% 76562 77616 86381 _BM_P_CrossBSPNode
13.69% 13.71% 14.26% 62538 62610 65144 _R_PointInSubsector
7.63% 7.65% 70.16% 34848 34950 320403 _P_RunThinkers
5.05% 5.07% 30.15% 23049 23141 137669 _P_CheckPosition
4.47% 4.48% 4.48% 20423 20461 20461 _PIT_CheckThing
4.36% 4.38% 8.86% 19931 19983 40444 _P_BlockThingsIterator
4.18% 4.20% 12.50% 19080 19167 57063 _P_PathTraverse
That bunch is quite annoying. Especially the first one which is *raytracing* all over the map, and gives me a headache. It's not really worth trying to optimize it much more as it already half on the DSP and half in the 030 instruction cache. There is one last bit that might be DSP'd but I'm pretty sure the biggest problem with it is just being over-used from too many random places.

The best fix is the one that's already implemented through TIMEBASE_CONTROL >= 3 (which you obviously can't profile since it breaks demo replay), in effect managing the AI's use of it to a sensible level. No code optimization is going to beat that.

Some of the others I still have to look at but as you can see the distribution is very flat, around 4-5% per item - so once again they are best squashed through management at level above. I'll rewrite P_PointInSubsector and inspect the others for opportunities.

Eero Tamminen wrote: Worst rendering, CPU side:
Executed instructions:
25.22% 116472 R_SpriteColumnShader_Masked2
12.47% 57580 R_AdvanceSurface_NMip0
10.47% 48341 R_VisPlaneSkyShader
7.07% 32651 R_AdvanceSurface_TMip3
6.57% 6.57% 6.57% 30348 30368 30368 _BM_A_Mux3x2
3.50% 3.51% 3.51% 16180 16211 16211 stream_texture
3.50% 16164 R_VisPlaneShader
3.34% 15421 R_AdvanceSurface_TMip2
2.90% 13416 R_AdvanceSurface_TMip0
2.88% 13320 R_AdvanceSurface_NMip3
2.30% 2.33% 2.33% 10628 10762 10762 stack_visplane_area
Well that list is a bit easier to handle - because the only one I can really optimize more is the one at the bottom, and the sky shader :) the rest are all near limits already for the way they currently work.

Unfortunately several of the more recent optimizations to those areas are data-cache oriented (mipmaps + packed palettes -> locality of reference), so you won't see that in your profiles until Hatari supports data cache! This is especially true for sprites near the player - the bigger they get the more the data cache speeds up drawing. So Hatari is exaggerating the cost for these in the 'worst' frame profile. Not much we can do about that one.

Eero Tamminen wrote: ...
Instruction cache misses:
15.67% 15.68% 16.38% 4337 4340 4533 R_AddSpriteSpans
7.31% 7.35% 7.35% 2023 2034 2034 R_ViewTestSpriteLines
Those still need attention but not before I fix the bugs.

Eero Tamminen wrote: At least with TIMERBASE_CONTROL=1, GCC 2.x and Hatari WinUAE CPU core timings, thinking can take more instructions in Doom II timedemo than rendering.

As getting worst frame timings for Doom II timedemo is now trivial, you might want to do it also for GCC 4.x build... :-)
Yes I'll do a test with gcc4 to see what's different.

Thanks for the results
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Uh, there is one other thing I'll need to do with the file caching stuff - there may be collisions between WAD files using the same name for different resources, so I'll have to add another level of directory specialization e.g.

BMC/DOOM2/etc..

I'll report any differences in 'worst frame' results from Doom2 next.

BTW Eero, I assume the profiling script is *not* using the --timedemo arg? It looks too slow for that - looks like a full 35Hz tick to me (PC mode). That is the best way to profile it - but it would help explain the 2-3fps and the reason rendering isn't always dominating...

The --timedemo arg skips the title screen and replays at a ratio of 1 tick : 1 render, so in effect something like a 4-5Hz tick (8x less thinking work!)
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Results from Doom II, using gcc4 (thinking part). Rendering part not so interesting to compare between compilers, mostly assembly language now except for some leftover patch drawing in the game code.

This one:

17.41% 17.65% 20.68% 77320 78385 91854 _BM_P_CrossBSPNode

...is one of the few 'thinking' components which is also assembly language. The fact the % cost has apparently risen, given that it's a fixed cost from the compiler point of view, means all the rest must have dropped a bit. Although not much. Most of what has changed seems to be reordering of things and more cache misses in more areas. An imperfect trade.

Code: Select all

CPU profile information from 'badmood-worst-think-CPU.txt':
- Hatari v1.7.0, WinUAE CPU core

Time spent in profile = 0.27448s.

Calls:
- max = 739, in _PIT_CheckThing at 0x3105a, on line 2852
- 5638 in total
Executed instructions:
- max = 4248, in _BM_P_CrossBSPNode+76 at 0x53252, on line 7441
- 444136 in total
Used cycles:
- max = 46716, in _BM_P_CrossBSPNode+54 at 0x5323c, on line 7431
- 4403332 in total
Instruction cache misses:
- max = 2217, in _PIT_CheckThing at 0x3105a, on line 2853
- 104932 in total

Calls:
  13.11%  13.11%         739       739   _PIT_CheckThing
   8.73%  21.89%         492      1234   _P_BlockThingsIterator
   4.77%   7.95%         269       448   _PIT_CheckLine
   4.24%   5.89%         239       332   _PIT_AddLineIntercepts_L
   4.04%  17.88%         228      1008   _P_BlockLinesIterator
   3.51%   3.51%         198       198   _R_PointInSubsector
   2.98%   3.03%         168       171   _P_PointOnDivlineSide.constprop.
   2.87%   2.91%         162       164   _memmove
   2.87%                 162             _bcopy
   2.78%  37.09%         157      2091   _P_CheckPosition
   2.78%  41.88%         157      2361   _P_TryMove
   2.39%  36.80%         135      2075   _P_Move
   2.36%   7.34%         133       414   _P_BlockThingsIterator.constprop
   2.20%  50.11%         124      2825   _P_SetMobjState
   2.13%  33.98%         120      1916   _P_TryWalk
   2.08%   2.08%         117       117   _P_InterceptVector.constprop.7

Executed instructions:
  17.41%  17.65%  20.68%       77320     78385     91854   _BM_P_CrossBSPNode
  14.65%  14.69%  14.69%       65083     65260     65260   _R_PointInSubsector
   7.14%   7.16%  69.26%       31710     31815    307589   _P_RunThinkers
   5.33%   5.33%   5.33%       23654     23674     23674   _BM_A_Mux3x2
   4.45%   4.47%  11.20%       19784     19864     49734   _P_BlockLinesIterator
   4.34%   4.35%  28.94%       19258     19305    128535   _P_CheckPosition
   3.98%   4.00%   4.00%       17666     17744     17744   _PIT_CheckThing
   3.78%   3.79%   3.99%       16803     16821     17724   _V_DrawPatch
   3.58%   3.59%   8.41%       15878     15962     37354   _P_BlockThingsIterator
   3.33%   3.35%  15.56%       14797     14875     69095   _P_PathTraverse
   2.70%   2.71%   4.49%       11999     12055     19942   _PIT_AddLineIntercepts_L
   2.23%                        9926                       _bcopy

Used cycles:
  18.07%  18.59%  21.34%      795484    818716    939544   _BM_P_CrossBSPNode
   9.91%   9.99%   9.99%      436444    439680    439680   _R_PointInSubsector
   6.01%   6.06%  66.55%      264632    266672   2930212   _P_RunThinkers
   4.76%   4.77%   4.77%      209616    209920    209920   _BM_A_Mux3x2
   4.64%   4.66%  25.89%      204360    205112   1139944   _P_CheckPosition
   4.00%   4.04%   8.77%      176160    177828    386272   _P_BlockThingsIterator
   3.96%   3.99%   3.99%      174488    175840    175840   _PIT_CheckThing
   3.87%                      170336                       _bcopy
   3.78%   3.81%  11.92%      166456    167788    524944   _P_BlockLinesIterator
   3.25%   3.28%   5.44%      143108    144268    239360   _PIT_AddLineIntercepts_L
   2.77%   2.78%   3.06%      121892    122276    134780   _V_DrawPatch
   2.42%   2.43%   2.43%      106428    106876    106876   _P_UpdateSpecials
   2.37%   2.41%  16.15%      104548    105904    710968   _P_PathTraverse
   2.07%   2.07%  23.41%       91224     91224   1030768   _BM_P_CheckSight


Instruction cache misses:
   8.18%   8.21%  34.76%        8588      8613     36475   _P_CheckPosition
   7.52%   7.56%  18.03%        7888      7933     18915   _P_BlockLinesIterator
   7.32%   7.35%   7.35%        7678      7714      7714   _PIT_CheckThing
   6.95%   7.53%   8.05%        7290      7902      8451   _BM_P_CrossBSPNode
   6.32%   6.36%  13.86%        6629      6676     14539   _P_BlockThingsIterator
   4.73%   4.81%   4.81%        4968      5047      5047   _R_PointInSubsector
   4.02%   4.02%  12.07%        4215      4215     12666   _BM_P_CheckSight
   3.48%   3.51%  20.75%        3648      3684     21776   _P_PathTraverse
   2.94%   2.94%   2.94%        3087      3087      3087   _P_InterceptVector.constprop.7
   2.69%   2.72%   5.12%        2826      2852      5368   _PIT_AddLineIntercepts_L
   2.61%   2.62%   5.35%        2738      2747      5614   _PIT_CheckLine
   2.13%   2.14%  33.68%        2234      2243     35341   _P_Move
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

I got temporarily confused when the performance of some functions appeared to vary wildly with tiny optimizations or other changes, before I remembered that 'worst frame' profiling can pick a completely different frame to report if anything at all is changed in the code.

>doh<
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Pushed more optimizations for P_CheckSight related, and fast paths for small sprites which use pixel testing and quick column setup instead of postmap rendering, which is better for big sprites with big holes in them.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Sprite clipping is now fixed.

The improved seg/sprite classify rules I had figured out on paper seem to work in all the Doom II map11 cases I noticed before. Haven't checked other maps but I expect most or all are now gone.

Hopefully that's the last of the floating corpses etc.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:I got temporarily confused when the performance of some functions appeared to vary wildly with tiny optimizations or other changes, before I remembered that 'worst frame' profiling can pick a completely different frame to report if anything at all is changed in the code.
It's better to start checking by looking at the time taken by the whole worst frame, maybe also time taken by all frames in the timedemo (i.e. running profile.sh without -w option and checking all frames profile). And only then looking at timings for individual functions. If total time has decreased, things should have improved...

There are two issues with this measure:
- Hatari cycle accuracy for Falcon emulation
- worst frame profiling using instruction count as measure for worst frame, not cycles
(for DSP I could use cycles, those are easy to count and accurate, for CPU cycles it would be annoying amount of ifdef code)

Because interrupts can trigger at different times, that can also change things around a bit, especially in worst frame, so running profile couple of times could also be a good idea. For the thinking side and node traversing, best is probably looking at the callgraphs to see how the costs accumulate. Are the callgraphs now readable enough?

PS. I should have time on the weekend to implement the spinloop profiling code. That will hopefully help if/when you move some functionality to DSP side. :-)
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote: It's better to start checking by looking at the time taken by the whole worst frame, maybe also time taken by all frames in the timedemo (i.e. running profile.sh without -w option and checking all frames profile). And only then looking at timings for individual functions. If total time has decreased, things should have improved...
This is what I started doing when I realized it wasn't a good idea to do comparative profiling with the worst frame data. I actually forgot I was looking at worst frame data - I got used to reading the summed info in the same format :)
Eero Tamminen wrote: Are the callgraphs now readable enough?
Yes I think the callgraphs are fine. All the profiling stuff works very well now.
Eero Tamminen wrote: PS. I should have time on the weekend to implement the spinloop profiling code. That will hopefully help if/when you move some functionality to DSP side. :-)
Great :)

I think the DSP is dangerously short of RAM for moving much more stuff on there. But there is plenty of code already there which could use better optimization of this kind once the spin sites are 'documented' in context, in some helpful way.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Animated textures are now working. The type that cycles through texture IDs for certain walls every fraction of a second.

There aren't many of these (took me some wandering around to find one) but it means there's a mechanism now for animating the switches when they are activated, which is more important. The switches don't work yet but it should be trivial now.

The same can be used for animated floors but this doesn't seem to be used for much more than making water/lava look like it's moving (with only 3 frames!) so this isn't really necessary once the water shader stuff is attached.

Return to “680x0”