Bad Mood : Falcon030 'Doom'

All 680x0 related coding posts in this section please.

Moderators: Zorro 2, Moderator Team

User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:I recommend filtering out symbols like R_SwitchSurface[*] and MARKER as these are hiding or breaking up real functions in the profile view. I still have some things to tidy up.
As you know what the bmengine symbols are for, I think it's better if you add that filtering. :-)
MARKER seems to be most of R_DrawTSurface_Masked2().

Here's another profile from Doom1, from longer gameplay and another level (and of course C-code built with gcc v2.x):

Code: Select all

Time spent in profile = 0.49573s.

Visits/calls:
- max = 592, in R_BSPHyperPlane_RHS at 0x4ad82, on line 6692
- 6823 in total
Executed instructions:
- max = 20008, in D_GeneratePostBuffer+78 at 0x496d2, on line 5909
- 1099371 in total
Used cycles:
- max = 161216, in D_GeneratePostBuffer+110 at 0x496f2, on line 5923
- 7952784 in total
Instruction cache misses:
- max = 1080, in _P_RunThinkers+376 at 0x35d42, on line 3340
- 68909 in total
...
Executed instructions:
  32.33%  32.36%  35.22%      355372    355800    387240   D_GeneratePostBuffer
  11.51%  11.52%  11.92%      126483    126625    131015   D_PatchRender_8_8
   8.08%                       88782                       R_SwitchSurface_T8
   7.95%   7.95%   8.04%       87349     87369     88376   _subframe_block
   5.24%   5.25%  10.00%       57643     57743    109918   _P_RunThinkers
   3.69%                       40515                       R_VisPlaneShader
   2.72%                       29918                       MARKER
   2.25%                       24702                       R_DrawSurface_NW
   2.21%                       24249                       R_SwitchSurface_T4
   2.10%   2.10%   2.50%       23039     23099     27469   turbo_memclr
   1.91%   1.91%   1.91%       20948     20986     20986   _R_DrawColumn
   1.76%                       19377                       R_SwitchSurface_T2
   1.62%   1.63%   2.35%       17858     17934     25835   _V_DrawPatch
   1.18%   1.18%   1.18%       12944     12994     12994   stream_texture
   1.04%                       11465                       R_VisPlaneSkyShader
   0.95%   0.95%   0.95%       10407     10466     10466   stack_visplane_area
   0.85%   0.85%   2.76%        9354      9361     30347   _R_DrawMaskedColumn
   0.75%   0.75%   0.75%        8241      8248      8248   _R_PointInSubsector
   0.56%                        6102                       R_StackTransparentSurface
   0.55%   0.56%   4.08%        6089      6151     44806   _P_SetMobjState
   0.51%   0.51%   1.13%        5620      5651     12425   _P_UpdateSpecials
D_GeneratePostBuffer() and D_PatchRender_8_8() are new ones.

DSP side:

Code: Select all

Used cycles:
  67.72%                    10770918                       command_base
  13.74%  14.54%  14.54%     2184912   2312488   2312488   R_DoColumnPerspCorrect
   6.10%   6.10%   6.10%      969762    969762    969762   VPRenderSpanDT
   2.46%                      391488                       R_VPLoadTexture
   1.27%   1.27%   1.27%      201962    201962    201962   extract_subvisplane
   1.04%                      165012                       R_ViewTestAddLine
   1.02%                      162458                       R_DoColumnTextureUV
   0.93%                      148550                       R_VPRenderSky
   0.57%                       89928                       ALGO_P_CrossBSPNode
   0.53%                       84776                       R_VPRenderPlane
Btw. Would you like separately worst render and thinking parts (i.e. them not necessarily being from same frame)?

I need separate Hatari counters for that, but I could e.g. use DSP instruction counter for determining worst rendering part and CPU instruction counter for determining worst thinking part.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote: As you know what the bmengine symbols are for, I think it's better if you add that filtering. :-)
MARKER seems to be most of R_DrawTSurface_Masked2().
Ok, but I'm still working on live changes so a fix won't appear in HG immediately. I'll do something about them for the next checkin.
Eero Tamminen wrote: D_GeneratePostBuffer() and D_PatchRender_8_8() are new ones.
These are 'offline' operations used to prepare textures after loading. It must have pulled in new textures while playing, possibly for sprites. The wall textures should be fully precached, I'm less sure sprites are all precached properly. Seems like Doom provides a 'magic list' and it's probably not perfect.
Eero Tamminen wrote: DSP side:

Code: Select all

Used cycles:
  67.72%                    10770918                       command_base
  13.74%  14.54%  14.54%     2184912   2312488   2312488   R_DoColumnPerspCorrect
   6.10%   6.10%   6.10%      969762    969762    969762   VPRenderSpanDT
   2.46%                      391488                       R_VPLoadTexture
   1.27%   1.27%   1.27%      201962    201962    201962   extract_subvisplane
   1.04%                      165012                       R_ViewTestAddLine
   1.02%                      162458                       R_DoColumnTextureUV
   0.93%                      148550                       R_VPRenderSky
   0.57%                       89928                       ALGO_P_CrossBSPNode
   0.53%                       84776                       R_VPRenderPlane
Hmm, I don't see the CrossBSPNode hotspot in there now. Coincidence? Or did I just fix a horrible performance bug too :)
Eero Tamminen wrote: Btw. Would you like separately worst render and thinking parts (i.e. them not necessarily being from same frame)?
That would be cool.
Eero Tamminen wrote: I need separate Hatari counters for that, but I could e.g. use DSP instruction counter for determining worst rendering part and CPU instruction counter for determining worst thinking part.
Tracking worst case on either DSP or CPU or both is a good thing generally (but I'll let you be the judge of what's good for Hatari).
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

The call subgraph represented by P_BlockLinesIterator and PIT_AddThingsIterator is using a delegation pattern in order to reuse some code. While this is a tidy arrangement, it's breaking up often-used functions and caching poorly.

Since the amount of reuse is very limited it should just require specialization of two operations and some inlined DSP co-pro ops to fix the poor cache hit rate and clean up this part of the profile graph.

Taken together, this stuff seems to register quite high in most profile runs so fixing it is probably a worthwhile task.

BTW Eero, is there a nice way to mark symbols as interrupt/exception types which can't be treated as a call or branch target? i.e. no lines linking those nodes to background code - but they do remain in the graph along with their own subgraphs? There does appear to be more than the simple 'ignore' switch but I'm not sure of the details yet

[EDIT]

It looks like the '--ignore-to' option according to the docs and this is used in the profiling script - but does that exclude return links from the interrupt to background code as well? And does it completely ignore the interrupt or just place it in an isolated subgraph?
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

I made a few changes which clean up profiling results, particularly CPU and DSP callgraph.

Some of the misleading stuff (like the funny P_CrossSubsector mid-function entrypoint with 95%+ visits/calls) has been fixed using bounding labels, local labels and equates to specify funny entrypoints. Things look a bit more sensible and familiar to me now.

(from 'badmood-1-think-DSP')

Code: Select all

Calls:
  29.72%  29.72%         496       496   TestLineSegVectorBisection
  21.45%                 358             P_CrossSubsector_body
  19.83%  39.25%         331       655   InterceptVectors
  19.41%  19.41%         324       324   Divs48_Real
   4.79%                  80             command_base
   4.79%                  80             ALGO_P_CrossBSPNode

Executed instructions:
  87.56%                     1585941                       command_base
   6.76%                      122509                       ALGO_P_CrossBSPNode
   3.32%                       60114                       P_CrossSubsector_body
   1.41%   1.41%   1.41%       25513     25513     25513   Divs48_Real
   0.60%   0.60%   0.60%       10912     10912     10912   TestLineSegVectorBisection

Used cycles:
  90.95%                     9513726                       command_base
   5.07%                      530672                       ALGO_P_CrossBSPNode
   3.02%                      315482                       P_CrossSubsector_body
   0.58%   0.58%   0.58%       60872     60872     60872   Divs48_Real
Hatari profile data processor

I can also see a notable cache miss rate (2x normal surfaces) in the new sprite routine, which I was pretty sure I had measured under 256 bytes for the whole surface operation. It may just be that it's closer to filling the cache than normal wall code and visited many times, with the iterator emptying the cache each time. Or there could be a code alignment problem, or I measured it wrong. I'll look into this later.

I haven't tried the worst-frame profiling system yet but so far the sprite code doesn't seem to as dominant (instructions, cycles) as before which is good.

Code: Select all

Instruction cache misses:
  19.04%  19.15%  20.33%        1917      1928      2047   R_AddSpriteSpans
  16.64%                        1675                       R_SpriteColumnShader_Masked2
  11.88%  11.88%  11.88%        1196      1196      1196   R_ViewTestSpriteLines
   8.00%                         805                       R_DrawSurface_NW
   4.67%   4.67%   4.67%         470       470       470   _audio_mux_frame
   4.10%                         413                       build_ssector
[EDIT]

lol... re-measured it at 258 bytes for the loop. Somehow I managed to grow it by a few bytes after checking it. doh.

I suppose its 2x good news - 1) Hatari profiler is cool, 2) there's room for even more speedy sprites...

...correction...

the 'hot' part is only 254 bytes, the extra 4 is for handling return to the iterator. The whole thing is sitting on an 8-byte boundary. I would expect that to cache safely.

So it is either iterator overhead on each visit, or it wants a 16-byte boundary for fully used cache lines. I haven't really looked at that situation before but it looks like a good time to investigate more closely.

I think the average encounter rate divided by the average miss rate will tell me if it's visits or some problem with alignment.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

This is interesting. Hatari profiler shows the cache miss rate at the very head of the sprite column loop is of the same order-of-magnitude as the number of iterations of the loop.

The sprite routine gets called 1009 times, with 13777 column operations, so approx 13 columns per sprite. The cache miss rate though is 27554 at the column loop head. The cache fetches multiple words on a miss so that's to be expected (subsequent ops show miss rate of 0, then another nonzero figure soon after, but more sensible).

I can't really explain that, unless 8-byte alignment is no good for a 254-byte loop. That's news to me, but possible I suppose. Cache lines are large and there are 2 levels of tagging/identity involved. Something to check between Hatari and real hardware perhaps?

I'll try to shuffle it onto a 16-byte boundary to see what happens.

$04fdde : ext.l d5 0.00% (1009, 8184, 1009)
$04fde0 : ext.l d4 0.00% (1009, 4036, 0)
$04fde2 : lsl.l #8,d4 0.00% (1009, 8128, 1009)
$04fde4 : bra.s $4fe28 0.00% (1009, 8128, 0)
[...]
$04fde8 : swap d7 0.02% (13777, 110272, 27554)
$04fdea : movea.l 8(sp),a2 0.02% (13777, 165324, 0)
$04fdee : lea $fff2(a3),a3 0.02% (13777, 55108, 0)
$04fdf2 : movea.l a3,a1 0.02% (13777, 55220, 968)



[EDIT]

So I moved the whole thing onto a 16 byte boundary, and the miss rate actually got worse. I can't really explain that, so I'm going to ignore it until I can compare results properly on a real machine.

$04fdde : ext.l d5 0.00% (1199, 9648, 1200)
$04fde0 : ext.l d4 0.00% (1199, 4796, 0)
$04fde2 : lsl.l #8,d4 0.00% (1199, 9592, 1199)
$04fde4 : bra.s $4fe30 0.00% (1199, 9592, 0)
[...]
$04fdf0 : swap d7 0.04% (17152, 137448, 34304)
$04fdf2 : movea.l 8(sp),a2 0.04% (17152, 206104, 0)
$04fdf6 : lea $fff2(a3),a3 0.04% (17152, 68664, 0)
$04fdfa : movea.l a3,a1 0.04% (17152, 68940, 1145)

[EDIT]

Actually I think there is maybe a sensible explanation. The prefetch mechanism is probably looking beyond the condition at the end of the loop and pulling in at least one more op, which happens to be a big one. This is indicated by the zero miss rate for the code beyond the loop. That probably means Hatari is correct and the loop plus potential prefetch needs to be squeezed into 256 bytes. I haven't run into this before but OTOH it's a strange situation and not one to get much feedback on, except measuring results or looking in the profiler.

An interesting test is to stick a nop beyond the loop to see if it influences prefetch at all (or failing that, just shrink the code an op or two).

$04fee8 : swap d7 0.04% (18351, 73428, 12)
$04feea : dbra d7,$4fdf0 0.04% (18351, 151640, 0)

$04feee : lea $c(sp),sp 0.00% (1199, 4796, 0)
$04fef2 : bra $4fc3c 0.00% (1199, 9592, 0)


[EDIT]

...well I must have been half-right, because applying the nop does seem to prevent prefetching beyond the loop. The lea/bra now shows cache misses. Unfortunately, the loop head looks exactly the same as it was - lots of misses. The whole operation is now 256 bytes, sitting on a 16-byte boundary, and still doesn't cache properly. :-z

$04fee8 : swap d7 0.05% (18617, 74488, 10)
$04feea : dbra d7,$4fdf0 0.05% (18617, 153972, 0)
$04feee : nop 0.00% (1179, 4716, 1179)

$04fef0 : lea $c(sp),sp 0.00% (1179, 9432, 1179)
$04fef4 : bra $4fc3c 0.00% (1179, 14148, 1179)

'nop' acts strangely on 68040/60 and flushes the pipeline. It also acts a little strange on 030 (doesn't schedule/parallel like other ops) and might affect prefetch in some way. I really doubt that but I'll try with another 'normal' op just in case.

...no change. At this point I'm open to ideas on what's going on here :) maybe Hatari CPU is prefetching deeper than is shown via cache miss info?

I'll just look at shrinking the code further but an explanation for this one would be good.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

I had to shrink the loop to 250 bytes *AND* put it on a cache line boundary to stop cache misses at the loop head. 250 bytes and 4-byte alignment wasn't enough.

I can understand the need for line-alignment but I can't really explain the 254 (+2 prefetch) not working on the same alignment. It looks like an uncommon case so I'll move on.
User avatar
Cyprian
10 GOTO 10
10 GOTO 10
Posts: 3362
Joined: Fri Oct 04, 2002 11:23 am
Location: Warsaw, Poland

Re: Bad Mood : Falcon030 'Doom'

Post by Cyprian »

really very interesting finding Dml.

I was wondering about that yesterday :)

as far as I know, 030 always prefetch long word, therefore whether cache loop should not fit in 252 (+4 - one long word for prefetch ) bytes instead of 254 (+2 prefetch) mentioned by you?
ATW800/2 / V4sa / Lynx I / Mega ST 1 / 7800 / Portfolio / Lynx II / Jaguar / TT030 / Mega STe / 800 XL / 1040 STe / Falcon030 / 65 XE / 520 STm / SM124 / SC1435
DDD HDD / AT Speed C16 / TF536 / SDrive / PAK68/3 / Lynx Multi Card / LDW Super 2000 / XCA12 / SkunkBoard / CosmosEx / SatanDisk / UltraSatan / USB Floppy Drive Emulator / Eiffel / SIO2PC / Crazy Dots / PAM Net
http://260ste.atari.org
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Cyprian wrote:really very interesting finding Dml.

I was wondering about that yesterday :)

as far as I know, 030 always prefetch long word, therefore whether cache loop should not fit in 252 (+4 - one long word for prefetch ) bytes instead of 254 (+2 prefetch) mentioned by you?
I just checked with a 252 byte loop (with 2x nops to pad beyond the loop) and I'm seeing the same thing at the loop head. But with the 250 byte (actually 248 for this routine) case it looks ok. It is very strange.

I looked quickly at Hatari CPU sources in this area but didn't see anything obvious. I may have to look closer.

(your suggestion was a sensible one, even if something else is also happening on top of that)
User avatar
Cyprian
10 GOTO 10
10 GOTO 10
Posts: 3362
Joined: Fri Oct 04, 2002 11:23 am
Location: Warsaw, Poland

Re: Bad Mood : Falcon030 'Doom'

Post by Cyprian »

dml wrote:I just checked with a 252 byte loop (with 2x nops to pad beyond the loop) and I'm seeing the same thing at the loop head. But with the 250 byte (actually 248 for this routine) case it looks ok. It is very strange.
from the other side, this issue can be Hatari specific, would be cool to check that on real hardware.
ATW800/2 / V4sa / Lynx I / Mega ST 1 / 7800 / Portfolio / Lynx II / Jaguar / TT030 / Mega STe / 800 XL / 1040 STe / Falcon030 / 65 XE / 520 STm / SM124 / SC1435
DDD HDD / AT Speed C16 / TF536 / SDrive / PAK68/3 / Lynx Multi Card / LDW Super 2000 / XCA12 / SkunkBoard / CosmosEx / SatanDisk / UltraSatan / USB Floppy Drive Emulator / Eiffel / SIO2PC / Crazy Dots / PAM Net
http://260ste.atari.org
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Cyprian wrote: from the other side, this issue can be Hatari specific, would be cool to check that on real hardware.
There is probably a long prefetch involved as you say, even on real HW so 252 bytes is probably the safe limit for a loop. I don't know if prefetching half-an-opcode complicates this and forces another prefetch, but even that doesn't explain my test result.

Yeah so far there's nothing to say the 'weird' part of it is real but I'll try to measure it on real HW next time I test.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Perhaps interesting...

uae_u32 get_word_ce030_prefetch (int o)

This is a function in WinUAE core responsible for prefetching the 030 i-cache, and it seems to have been completely rewritten in a recent release. There's a similar operation for 020 which has also had the rewrite treatment.

The other code in there looks largely the same (maybe edited but not radically changed). I don't know if this indicates there is a problem with the version we're using in Hatari but the change history might be worth a closer look sometime.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:It looks like the '--ignore-to' option according to the docs and this is used in the profiling script - but does that exclude return links from the interrupt to background code as well? And does it completely ignore the interrupt or just place it in an isolated subgraph?
Unlike interrupt calls, interrupt returns should be happening so that they can be reliably identified (from RTE instruction). I.e. post-processor should be able to always identify them and therefore they're by default ignored in callgraphs. Unrecognized call instructions are also ignored by post-processor by default as those should happen only due to interrupts.

However, post-processor cannot ignore all the interrupt calls because some of them can happen on BSR & JSR instructions, those you need to tell separately to the post-processor with --ignore-to option. I use that option often also for other (performance-wise) irrelevant functions that get called from all over the program.

Btw. You can actually specify what kind of "visits" are ignored in call counts and callgraphs:

Code: Select all

$ hatari_profile.py --no-calls h
Hatari profile data processor
Valid call types are:
  b -- branch/jump
  e -- exception
  n -- next instruction
  s -- subroutine call
  r -- subroutine return
  u -- unknown
  x -- exception return
ERROR: invalid call type!
By default post-processor ignores "r", "u" and "x" type of instruction address visits, but if you're interested to find in which contexts your interrupt handlers are called, you can change the set.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:These are 'offline' operations used to prepare textures after loading. It must have pulled in new textures while playing, possibly for sprites. The wall textures should be fully precached, I'm less sure sprites are all precached properly. Seems like Doom provides a 'magic list' and it's probably not perfect.
There were many frames where more data was read:

Code: Select all

GEMDOS 0x42 Fseek(7075852, 65, 0)
GEMDOS 0x3F Fread(65, 256, 0x183d6c)
GEMDOS 0x42 Fseek(7052640, 64, 0)
GEMDOS 0x3F Fread(64, 3768, 0x35300c)
0.23375s
...
GEMDOS 0x42 Fseek(7075708, 65, 0)
GEMDOS 0x3F Fread(65, 144, 0x183d6c)
GEMDOS 0x42 Fseek(7068580, 64, 0)
GEMDOS 0x3F Fread(64, 1408, 0x354044)
= 0.24971s
...
GEMDOS 0x42 Fseek(7240924, 64, 0)
GEMDOS 0x3F Fread(64, 11544, 0x3545dc)
= 0.22196s
...
GEMDOS 0x42 Fseek(7252468, 64, 0)
GEMDOS 0x3F Fread(64, 11544, 0x35730c)
GEMDOS 0x42 Fseek(7275080, 64, 0)
GEMDOS 0x3F Fread(64, 2608, 0x35a03c)
= 0.28602s
Reading debugger commands from 'profile-worst-save.ini'...
GEMDOS 0x42 Fseek(7282756, 65, 0)
GEMDOS 0x3F Fread(65, 1976, 0x183d6c)
GEMDOS 0x42 Fseek(7277688, 64, 0)
GEMDOS 0x3F Fread(64, 5068, 0x35ab38)
= 0.34877s
Reading debugger commands from 'profile-worst-save.ini'...
...
GEMDOS 0x42 Fseek(7284732, 65, 0)
GEMDOS 0x3F Fread(65, 1980, 0x183d6c)
= 0.26762s
...
GEMDOS 0x42 Fseek(7289732, 65, 0)
GEMDOS 0x3F Fread(65, 4988, 0x183d6c)
= 0.45662s
Reading debugger commands from 'profile-worst-save.ini'...
GEMDOS 0x42 Fseek(7309812, 65, 0)
GEMDOS 0x3F Fread(65, 3020, 0x183d6c)
GEMDOS 0x42 Fseek(9109784, 65, 0)
GEMDOS 0x3F Fread(65, 944, 0x183d6c)
GEMDOS 0x42 Fseek(7294720, 65, 0)
GEMDOS 0x3F Fread(65, 9276, 0x183d6c)
= 0.49573s
Reading debugger commands from 'profile-worst-save.ini'...
GEMDOS 0x42 Fseek(7314532, 65, 0)
GEMDOS 0x3F Fread(65, 1128, 0x183d6c)
GEMDOS 0x42 Fseek(7303996, 65, 0)
GEMDOS 0x3F Fread(65, 3104, 0x183d6c)
= 0.33565s
GEMDOS 0x42 Fseek(7315660, 65, 0)
GEMDOS 0x3F Fread(65, 188, 0x183d6c)
GEMDOS 0x42 Fseek(7146696, 64, 0)
GEMDOS 0x3F Fread(64, 5976, 0x35c354)
= 0.24713s
GEMDOS 0x42 Fseek(9110728, 65, 0)
GEMDOS 0x3F Fread(65, 924, 0x183d6c)
= 0.26252s
...
GEMDOS 0x42 Fseek(9111652, 65, 0)
GEMDOS 0x3F Fread(65, 1584, 0x183d6c)
= 0.25575s
...
GEMDOS 0x42 Fseek(9113236, 65, 0)
GEMDOS 0x3F Fread(65, 2696, 0x183d6c)
= 0.26066s
GEMDOS 0x42 Fseek(6999416, 64, 0)
GEMDOS 0x3F Fread(64, 8180, 0x35db78)
= 0.23008s
...
GEMDOS 0x42 Fseek(9115932, 65, 0)
GEMDOS 0x3F Fread(65, 3420, 0x183d6c)
= 0.27928s
...
GEMDOS 0x42 Fseek(7007596, 64, 0)
GEMDOS 0x3F Fread(64, 8204, 0x35fcec)
GEMDOS 0x42 Fseek(7018484, 64, 0)
GEMDOS 0x3F Fread(64, 2308, 0x361d10)
= 0.34292s
GEMDOS 0x42 Fseek(7076108, 65, 0)
GEMDOS 0x3F Fread(65, 376, 0x183d6c)
= 0.36076s
GEMDOS 0x42 Fseek(7076560, 65, 0)
GEMDOS 0x3F Fread(65, 132, 0x183d6c)
= 0.38739s
= 0.44981s
GEMDOS 0x42 Fseek(7015800, 64, 0)
GEMDOS 0x3F Fread(64, 2684, 0x3632d4)
= 0.44761s
...                   <= large number of frames
GEMDOS 0x42 Fseek(7076484, 65, 0)
GEMDOS 0x3F Fread(65, 76, 0x183d6c)
= 0.31145s
GEMDOS 0x42 Fseek(7076692, 65, 0)
GEMDOS 0x3F Fread(65, 216, 0x183d6c)
= 0.24392s
...
GEMDOS 0x42 Fseek(7070308, 65, 0)
GEMDOS 0x3F Fread(65, 320, 0x183d6c)
= 0.23899s
GEMDOS 0x42 Fseek(7069988, 65, 0)
GEMDOS 0x3F Fread(65, 320, 0x183d6c)
= 0.23934s
...
GEMDOS 0x42 Fseek(7175884, 64, 0)
GEMDOS 0x3F Fread(64, 3400, 0x3679e0)
= 0.19256s
...
GEMDOS 0x42 Fseek(7210336, 64, 0)
GEMDOS 0x3F Fread(64, 1668, 0x368c2c)
= 0.20303s
GEMDOS 0x42 Fseek(7179284, 64, 0)
GEMDOS 0x3F Fread(64, 8860, 0x3692c8)
= 0.22382s
GEMDOS 0x42 Fseek(7188144, 64, 0)
GEMDOS 0x3F Fread(64, 10368, 0x36b57c)
= 0.23405s
...
GEMDOS 0x42 Fseek(7152672, 64, 0)
GEMDOS 0x3F Fread(64, 6588, 0x36de14)
GEMDOS 0x42 Fseek(7164092, 64, 0)
GEMDOS 0x3F Fread(64, 4792, 0x36f7e8)
= 0.25359s
GEMDOS 0x42 Fseek(7175620, 65, 0)
GEMDOS 0x3F Fread(65, 264, 0x183d6c)
= 0.21647s
...
GEMDOS 0x42 Fseek(7084036, 65, 0)
GEMDOS 0x3F Fread(65, 3788, 0x183d6c)
= 0.27910s
GEMDOS 0x42 Fseek(7093800, 65, 0)
GEMDOS 0x3F Fread(65, 4260, 0x183d6c)
= 0.32255s
...
GEMDOS 0x42 Fseek(7087824, 65, 0)
GEMDOS 0x3F Fread(65, 5976, 0x183d6c)
= 0.34090s
...
GEMDOS 0x42 Fseek(9085512, 65, 0)
GEMDOS 0x3F Fread(65, 904, 0x183d6c)
= 0.24548s
...
GEMDOS 0x42 Fseek(7168884, 64, 0)
GEMDOS 0x3F Fread(64, 4496, 0x370c20)
= 0.29094s
...
GEMDOS 0x42 Fseek(7073568, 65, 0)
GEMDOS 0x3F Fread(65, 2064, 0x183d6c)
= 0.34809s
...
GEMDOS 0x42 Fseek(7212004, 64, 0)
GEMDOS 0x3F Fread(64, 4444, 0x371f30)
= 0.22579s
...
GEMDOS 0x42 Fseek(7212004, 64, 0)
GEMDOS 0x3F Fread(64, 4444, 0x371f30)
= 0.22579s
= 0.21105s
GEMDOS 0x42 Fseek(7236320, 65, 0)
GEMDOS 0x3F Fread(65, 244, 0x183d6c)
= 0.34021s
GEMDOS 0x42 Fseek(7237448, 65, 0)
GEMDOS 0x3F Fread(65, 1344, 0x183d6c)
GEMDOS 0x42 Fseek(7238792, 65, 0)
GEMDOS 0x3F Fread(65, 1176, 0x183d6c)
GEMDOS 0x42 Fseek(7236564, 65, 0)
GEMDOS 0x3F Fread(65, 256, 0x183d6c)
= 0.40793s
GEMDOS 0x42 Fseek(7239968, 65, 0)
GEMDOS 0x3F Fread(65, 848, 0x183d6c)
GEMDOS 0x42 Fseek(7240816, 65, 0)
GEMDOS 0x3F Fread(65, 108, 0x183d6c)
GEMDOS 0x42 Fseek(7216448, 64, 0)
GEMDOS 0x3F Fread(64, 9624, 0x3736f8)
= 0.32487s
...
GEMDOS 0x42 Fseek(7071944, 65, 0)
GEMDOS 0x3F Fread(65, 1624, 0x183d6c)
= 0.56478s
...
GEMDOS 0x42 Fseek(7198512, 64, 0)
GEMDOS 0x3F Fread(64, 10668, 0x37bf18)
= 0.26844s
...
GEMDOS 0x42 Fseek(7070628, 65, 0)
GEMDOS 0x3F Fread(65, 1316, 0x183d6c)
= 0.31416s
Timings are for how long the full frame took, I would assume that they would be much slower on real Falcon for frames reading disk as with GEMDOS emulation there's not really any disk read overhead, just Doom's own processing of the read data.

(I also noticed that CPU instructions count might not be best measure of worst frame, list above has one 5.6s long frame, whereas worst frame breakpoint had caught only 4.9s long frame.)

dml wrote:Hmm, I don't see the CrossBSPNode hotspot in there now. Coincidence? Or did I just fix a horrible performance bug too :)
That profile was clearly CPU bound, so it's hard to say from that.

dml wrote:
Eero Tamminen wrote: Btw. Would you like separately worst render and thinking parts (i.e. them not necessarily being from same frame)?
That would be cool.
Eero Tamminen wrote: I need separate Hatari counters for that, but I could e.g. use DSP instruction counter for determining worst rendering part and CPU instruction counter for determining worst thinking part.
Ok, I'll look into it. It will be even slower because then profiling will re-initialized twice per frame, not just once...

dml wrote:Tracking worst case on either DSP or CPU or both is a good thing generally (but I'll let you be the judge of what's good for Hatari).
That's not really feasible. CPU and DSP cycles are tied together, so just checking how much work is done them for rendering would always give the same frame for CPU & DSP. There would need to be some way to discount the part that DSP is idling and waiting for the other side.

Profiling worst rendering part of frame should give more relevant info for DSP than profiling full frame though.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

Commited the separate worst frame render & thinking part profiling.

Playing Doom1 last level for a while gave:

Thinking part, CPU:

Code: Select all

Time spent in profile = 0.19125s.

Visits/calls:
- max = 888, in _P_PointOnDivlineSide at 0x30388, on line 2582
- 5915 in total
Executed instructions:
- max = 4480, in _subframe_block+456 at 0x1ef2c, on line 765
- 290678 in total
Used cycles:
- max = 89628, in _subframe_block+470 at 0x1ef3a, on line 770
- 3068204 in total
Instruction cache misses:
- max = 1233, in _P_PointOnDivlineSide+74 at 0x303d2, on line 2593
- 85512 in total
...
Executed instructions:
  15.66%  15.66%  16.05%       45521     45521     46666   _subframe_block
   9.88%   9.91%  12.23%       28724     28796     35546   _R_DrawColumn
   9.20%   9.22%  11.58%       26730     26813     33663   _P_PointOnDivlineSide
   7.55%   7.57%  20.39%       21960     22012     59260   _P_RunThinkers
   5.54%   5.56%  24.58%       16094     16157     71436   _P_BlockLinesIterator
   4.88%   4.89%   5.25%       14185     14203     15269   _V_DrawPatch
   4.72%   4.73%  16.52%       13729     13763     48019   _PIT_AddLineIntercepts
   4.37%   4.38%  46.47%       12708     12743    135084   _P_PathTraverse
   3.92%   3.92%  16.15%       11398     11409     46955   _R_DrawMaskedColumn
   3.14%   3.15%   5.92%        9130      9150     17219   _P_TraverseIntercepts
   3.01%   3.02%  10.63%        8756      8787     30897   _P_BlockThingsIterator
   2.63%   2.63%   2.63%        7640      7647      7647   _P_InterceptVector
   1.93%   1.94%   1.94%        5620      5640      5640   _P_UpdateSpecials
   1.82%   1.82%   1.82%        5278      5278      5278   _R_PointInSubsector
   1.72%   1.74%   7.24%        5004      5053     21039   _PIT_AddThingIntercepts
   1.70%   1.71%  15.48%        4948      4959     44993   _P_RecursiveSound
   1.27%   1.27%  17.45%        3678      3705     50722   _R_DrawVisSprite
   0.93%   0.93%   3.33%        2712      2712      9693   _P_CheckPosition
...
Instruction cache misses:
  12.43%  12.48%  12.63%       10633     10675     10801   _P_PointOnDivlineSide
  10.61%  10.65%  35.52%        9073      9104     30377   _P_BlockLinesIterator
  10.00%  10.02%  24.48%        8549      8566     20931   _PIT_AddLineIntercepts
   9.39%   9.42%  61.38%        8033      8052     52487   _P_PathTraverse
   6.39%   6.42%  18.17%        5464      5489     15538   _P_RunThinkers
   5.14%   5.14%   5.14%        4393      4396      4396   _P_InterceptVector
   4.30%   4.30%   7.42%        3674      3680      6341   _R_DrawMaskedColumn
   3.70%   3.73%   8.12%        3161      3186      6947   _PIT_AddThingIntercepts
   3.20%   3.22%  11.88%        2739      2756     10163   _P_BlockThingsIterator
   2.99%   3.00%  10.47%        2554      2568      8957   _R_DrawVisSprite
   2.96%   3.00%   3.11%        2529      2565      2661   _R_DrawColumn
   2.14%   2.15%   6.30%        1831      1842      5389   _P_TraverseIntercepts
DSP side:

Code: Select all

Visits/calls:
- max = 12, in TestLineSegVectorBisection at 0xcd7, on line 132
- 51 in total
Executed instructions:
- max = 1018134, in command_base+3 at 0xd54, on line 212
- 1024849 in total
Used cycles:
- max = 6108804, in command_base+3 at 0xd54, on line 212
- 6136428 in total

Visits/calls:
  23.53%  23.53%          12        12   Divs48_Real
  23.53%  47.06%          12        24   InterceptVectors
  23.53%  23.53%          12        12   TestLineSegVectorBisection
  17.65%                   9             P_CrossSubsector_body
   5.88%                   3             command_base
   5.88%                   3             ALGO_P_CrossBSPNode

Executed instructions:
  99.35%                     1018158                       command_base
Not that much time spent on that part, on this Doom level, except for sound processing.

Btw. In these profiles I'm increasing the screen size from default RGB size so that there's no black between the graphics and statusbar at bottom. Although this is Doom1 level, that might affect how time is split between thinking and rendering.


Then, the render part, CPU side:

Code: Select all

Time spent in profile = 0.23151s.

Visits/calls:
- max = 465, in R_BSPHyperPlane_RHS at 0x4ad82, on line 1372
- 2023 in total
Executed instructions:
- max = 8755, in D_GeneratePostBuffer+78 at 0x496d2, on line 618
- 516512 in total
Used cycles:
- max = 70504, in D_GeneratePostBuffer+110 at 0x496f2, on line 632
- 3714020 in total
Instruction cache misses:
- max = 116, in R_FlushDeferredSurfaces+30 at 0x4a672, on line 873
- 14175 in total
...
Executed instructions:
  33.16%  33.19%  34.68%      171252    171449    179116   D_GeneratePostBuffer
  14.07%                       72686                       R_VisPlaneSkyShader
  10.88%                       56199                       R_AdvanceSurface_T8
   8.60%                       44420                       R_DrawSurface_NW
   7.08%                       36558                       R_VisPlaneShader
   5.37%   5.38%   5.78%       27716     27774     29839   D_PatchRender_8_8
   4.75%                       24529                       R_SpriteColumnShader_Masked2
   3.44%   3.45%   3.50%       17784     17804     18065   _subframe_block
   2.51%   2.51%   2.92%       12944     12989     15074   stream_texture
   1.95%   1.95%   1.95%       10055     10066     10066   turbo_memclr
   1.51%   1.51%   1.51%        7787      7824      7824   stack_visplane_area
   0.87%   0.87%   0.92%        4479      4479      4773   R_AddSpriteSpans
   0.84%                        4340                       R_StackTransparentSurface
   0.71%   0.71%   0.71%        3661      3675      3675   R_ViewTestSpriteLines
   0.60%   0.60%   0.60%        3079      3079      3079   init_stategroups
...
Instruction cache misses:
  17.14%  17.14%  18.29%        2429      2429      2593   R_AddSpriteSpans
  11.03%                        1564                       R_SpriteColumnShader_Masked2
   8.85%   8.89%   8.89%        1254      1260      1260   R_ViewTestSpriteLines
   6.34%                         898                       R_DrawSurface_NW
   3.41%   3.41%  21.36%         483       483      3028   R_FlushDeferredSurfaces
   3.31%                         469                       build_ssector
   2.87%   2.87%   2.87%         407       407       407   add_wall_segment
   2.71%   2.88%   9.16%         384       408      1299   cache_resource
   2.47%                         350                       R_AddLine_loop
   2.01%                         285                       R_VisPlaneSkyShader
   2.00%                         283                       R_AddSurface_loop
   1.96%                         278                       init_font
   1.93%   2.59%   4.23%         274       367       599   D_GeneratePostBuffer
   1.90%                         270                       R_AdvanceSurface_T8
   1.84%   1.92%   2.87%         261       272       407   _subframe_block
   1.64%                         233                       sector_wall
   1.62%                         229                       R_VisPlaneShader
   1.59%   1.59%   1.59%         225       225       225   _frame_event
   1.56%   1.56%  26.18%         221       221      3711   R_SubSectorTryFlush
And DSP side:

Code: Select all

Visits/calls:
- max = 682, in R_DoColumnTextureUV at 0x46d, on line 930
- 3844 in total
Executed instructions:
- max = 518769, in command_base+3 at 0xd54, on line 2299
- 1540652 in total
Used cycles:
- max = 3112614, in command_base+3 at 0xd54, on line 2299
- 7428020 in total
Largest cycle differences (= code changes during profiling):
- max = 2, in R_BeginScene+20 at 0xae5, on line 2147
- 2 in total
...
Used cycles:
  41.97%                     3117174                       command_base
  16.34%  17.96%  17.96%     1213662   1334170   1334170   R_DoColumnPerspCorrect
  13.40%                      995516                       R_VPRenderSky
  12.05%  12.05%  12.05%      895106    895106    895106   VPRenderSpanDT
   5.74%                      426480                       R_VPLoadTexture
   2.04%                      151436                       R_DoColumnTextureUV
   2.01%   2.01%   2.01%      148992    148992    148992   extract_subvisplane
   1.19%   1.38%  15.49%       88598    102872   1150632   AddMidWall
   0.86%                       63914                       R_ViewTestAddLine
   0.51%                       37662                       project_node
This frame contained a sprite load which apparently took most of the CPU in it:

Code: Select all

> profile stack
- 0x49f1c: display_engine (return = 0x43d36)
- 0x4c57a: R_DrawTransparentSurfaces (return = 0x4a0e2)
- 0x48074: cache_resource (return = 0x4c5c6)
- 0x498be: D_CacheLoad_Sprite (return = 0x48188)
- 0x467ce: read_resource (return = 0x498f8)
GEMDOS 0x3F Fread(65, 5976, 0x183d64)
I would like to get names for the sprites that are being loaded. If you would have some symbol at which address a pointer to corresponding resource name is in some register, I could print it out to Hatari log...
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero, thanks... just catching up.
Eero Tamminen wrote:Commited the separate worst frame render & thinking part profiling.
Great, I will try it when I return in the evening.

Eero Tamminen wrote: Visits/calls:
23.53% 23.53% 12 12 Divs48_Real
23.53% 47.06% 12 24 InterceptVectors
23.53% 23.53% 12 12 TestLineSegVectorBisection
17.65% 9 P_CrossSubsector_body
5.88% 3 command_base
5.88% 3 ALGO_P_CrossBSPNode

Executed instructions:
99.35% 1018158 command_base


Not that much time spent on that part, on this Doom level, except for sound processing.
Yes that looks quite 'dark'.
Eero Tamminen wrote: Btw. In these profiles I'm increasing the screen size from default RGB size so that there's no black between the graphics and statusbar at bottom. Although this is Doom1 level, that might affect how time is split between thinking and rendering.
Thinking/rendering are sort of decoupled anyway - depending on the way the code is configured the ratio changes a lot. So I don't think it matters that much if the window is resized.
Eero Tamminen wrote: Executed instructions:
33.16% 33.19% 34.68% 171252 171449 179116 D_GeneratePostBuffer
14.07% 72686 R_VisPlaneSkyShader
10.88% 56199 R_AdvanceSurface_T8
8.60% 44420 R_DrawSurface_NW
7.08% 36558 R_VisPlaneShader
5.37% 5.38% 5.78% 27716 27774 29839 D_PatchRender_8_8
4.75% 24529 R_SpriteColumnShader_Masked2
3.44% 3.45% 3.50% 17784 17804 18065 _subframe_block
2.51% 2.51% 2.92% 12944 12989 15074 stream_texture
1.95% 1.95% 1.95% 10055 10066 10066 turbo_memclr
1.51% 1.51% 1.51% 7787 7824 7824 stack_visplane_area
0.87% 0.87% 0.92% 4479 4479 4773 R_AddSpriteSpans
0.84% 4340 R_StackTransparentSurface
0.71% 0.71% 0.71% 3661 3675 3675 R_ViewTestSpriteLines
0.60% 0.60% 0.60% 3079 3079 3079 init_stategroups
There is a lot of code here which is specific to *loading* so it doesn't really belong in the playthrough/demo profiles. Will need to do something about it especially now that textures will get more processing. From your other notes it actually looks like sprite precaching may have stopped working completely?
Eero Tamminen wrote: ...
Instruction cache misses:
17.14% 17.14% 18.29% 2429 2429 2593 R_AddSpriteSpans
11.03% 1564 R_SpriteColumnShader_Masked2 <--- suspicious
8.85% 8.89% 8.89% 1254 1260 1260 R_ViewTestSpriteLines
6.34% 898 R_DrawSurface_NW
3.41% 3.41% 21.36% 483 483 3028 R_FlushDeferredSurfaces
I think the entry marked above looks like the same strange cache behaviour I have talked about earlier and on hatari-devel. I worked around this and I'm sure it had been committed... I'll need to double check.
Eero Tamminen wrote: 2.01% 285 R_VisPlaneSkyShader
2.00% 283 R_AddSurface_loop
1.96% 278 init_font
1.93% 2.59% 4.23% 274 367 599 D_GeneratePostBuffer
1.90% 270 R_AdvanceSurface_T8
I'm suspicious of 'init_font' since it's a once-only startup call. It should not be seen during a level.
Eero Tamminen wrote: This frame contained a sprite load which apparently took most of the CPU in it:

Code: Select all

> profile stack
- 0x49f1c: display_engine (return = 0x43d36)
- 0x4c57a: R_DrawTransparentSurfaces (return = 0x4a0e2)
- 0x48074: cache_resource (return = 0x4c5c6)
- 0x498be: D_CacheLoad_Sprite (return = 0x48188)
- 0x467ce: read_resource (return = 0x498f8)
GEMDOS 0x3F Fread(65, 5976, 0x183d64)
I think read_resource can be intercepted as it takes a pointer to the WAD directory entry, which will carry the name (at offset 0 probably). I'll check the details and report later.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

Got worst frame info Doom1 level 4 beginning (large open area with several marines and me using the machine gun liberally), happily this time without any extra sprite loading.

Rendering part, CPU side:

Code: Select all

Time spent in profile = 0.23652s.

Visits/calls:
- max = 1036, in R_BSPHyperPlane_RHS at 0x4ad82, on line 1448
- 4354 in total
Executed instructions:
- max = 5760, in _subframe_block+456 at 0x1ef2c, on line 175
- 431375 in total
Used cycles:
- max = 115292, in _subframe_block+470 at 0x1ef3a, on line 180
- 3794400 in total
Instruction cache misses:
- max = 293, in R_AddSpriteSpans+28 at 0x4ae4c, on line 1506
- 26284 in total
...
Executed instructions:
  13.57%  13.58%  13.88%       58527     58567     59864   _subframe_block
  12.98%                       56004                       R_AdvanceSurface_T4
  12.75%                       54982                       R_VisPlaneShader
  10.88%                       46926                       R_DrawSurface_NW
   6.17%                       26616                       R_AdvanceSurface_T8
   6.09%   6.10%   6.39%       26292     26306     27568   D_GeneratePostBuffer
   5.25%   5.27%   5.27%       22652     22720     22720   stream_texture
   3.94%                       17005                       R_SpriteColumnShader_Masked2
   3.52%                       15201                       R_AdvanceSurface_T2
   3.31%   3.32%   4.91%       14288     14326     21182   D_PatchRender_8_8
   2.76%   2.79%   2.79%       11887     12032     12032   stack_visplane_area
   2.04%   2.04%   2.09%        8785      8785      9016   R_AddSpriteSpans
   1.76%   1.76%   1.76%        7601      7605      7605   R_ViewTestSpriteLines
   1.70%                        7322                       R_VisPlaneSkyShader
   1.39%   1.40%   4.08%        6016      6023     17584   get_ssector
...
Instruction cache misses:
  16.68%  16.68%  17.08%        4385      4385      4488   R_AddSpriteSpans
  10.44%  10.45%  10.45%        2743      2746      2746   R_ViewTestSpriteLines
   5.62%                        1477                       R_SpriteColumnShader_Masked2
   5.06%                        1330                       R_DrawSurface_NW
   4.99%                        1311                       build_ssector
   4.21%   4.21%  27.31%        1107      1107      7178   R_FlushDeferredSurfaces
   3.63%                         955                       R_AddLine_loop
   3.36%   3.44%   3.80%         884       903       999   add_wall_segment
   2.55%   2.66%   4.31%         671       698      1133   cache_resource
   2.31%                         606                       R_AdvanceSurface_T4
   2.23%                         587                       R_AdvanceSurface_T2
   2.23%   2.23%  34.05%         586       586      8949   R_SubSectorTryFlush
   2.10%                         551                       R_AddSurface_loop
   1.96%   1.97%   3.74%         514       517       984   get_ssector
DSP side:

Code: Select all

Executed instructions:
  24.43%  24.43%  24.43%      435174    435174    435174   VPRenderSpanDT
  17.88%  21.62%  21.62%      318514    385213    385213   R_DoColumnPerspCorrect
  16.78%                      299006                       command_base
...
Used cycles:
  23.50%                     1783716                       command_base
  21.26%  25.07%  25.07%     1613688   1902734   1902734   R_DoColumnPerspCorrect
  19.86%  19.86%  19.86%     1506876   1506876   1506876   VPRenderSpanDT
   9.01%                      683958                       R_VPLoadTexture
   4.20%                      318784                       R_DoColumnTextureUV
   3.04%   3.04%   3.04%      230490    230490    230490   extract_subvisplane
   2.29%   2.29%   2.29%      174104    174104    174104   R_BufferSurface
   2.16%   2.68%   9.75%      163626    203150    740200   AddLowerWall
   2.04%   2.60%   7.31%      154844    197588    554800   AddUpperWall
   1.94%                      147158                       R_ViewTestAddLine
   1.29%                       98102                       R_VPRenderSky
   1.21%   1.45%  15.98%       91992    109708   1212882   AddMidWall
   1.16%                       88082                       R_VPRenderPlane
   1.12%                       84850                       project_node
When there's no(?) sprite loading, DSP seems to be fairly full, only ~20% left for command_base polling.

Thinking part, CPU:

Code: Select all

Time spent in profile = 0.26189s.

Visits/calls:
- max = 1444, in _P_PointOnDivlineSide at 0x30388, on line 2748
- 7319 in total
Executed instructions:
- max = 6400, in _subframe_block+456 at 0x1ef2c, on line 766
- 407539 in total
Used cycles:
- max = 128044, in _subframe_block+470 at 0x1ef3a, on line 771
- 4201332 in total
Instruction cache misses:
- max = 2031, in _P_PointOnDivlineSide+74 at 0x303d2, on line 2759
- 122836 in total
...
Executed instructions:
  15.98%  15.99%  16.24%       65136     65156     66199   _subframe_block
  14.13%  14.16%  26.67%       57575     57721    108671   _P_RunThinkers
  11.37%  11.43%  13.11%       46332     46587     53447   _P_PointOnDivlineSide
   7.18%   7.20%   8.85%       29276     29332     36082   _R_DrawColumn
   6.48%   6.49%  33.08%       26408     26437    134802   _P_BlockLinesIterator
   6.48%   6.50%  26.08%       26406     26493    106284   _PIT_AddLineIntercepts
   4.26%   4.28%   6.23%       17375     17437     25376   _V_DrawPatch
   2.94%   2.94%  44.93%       11968     11992    183124   _P_PathTraverse
   2.80%   2.81%  11.66%       11398     11443     47525   _R_DrawMaskedColumn
   2.57%   2.57%   2.57%       10482     10482     10482   _R_PointInSubsector
   2.40%   2.40%   2.40%        9800      9789      9789 * _P_InterceptVector
   2.37%   2.37%   4.30%        9670      9670     17514   _P_TraverseIntercepts
   2.37%   2.37%   5.02%        9642      9642     20440   _P_BlockThingsIterator
   1.49%   1.49%   8.92%        6071      6071     36367   _P_SetMobjState
   1.38%   1.38%   1.38%        5620      5627      5627   _P_UpdateSpecials
   0.90%   0.90%  12.58%        3678      3678     51265   _R_DrawVisSprite
...
Instruction cache misses:
  14.80%  14.91%  15.03%       18182     18317     18463   _P_PointOnDivlineSide
  13.15%  13.18%  31.91%       16150     16194     39200   _PIT_AddLineIntercepts
  11.85%  11.91%  24.02%       14553     14630     29506   _P_RunThinkers
  11.06%  11.07%  44.15%       13586     13601     54236   _P_BlockLinesIterator
   6.53%   6.54%  58.53%        8019      8032     71891   _P_PathTraverse
   4.58%   4.57%   4.57%        5621      5613      5613 * _P_InterceptVector
   3.16%   3.16%   7.40%        3877      3877      9090   _P_BlockThingsIterator
   3.08%   3.08%   9.41%        3780      3780     11557   _P_SetMobjState
   3.00%   3.01%   5.19%        3679      3702      6373   _R_DrawMaskedColumn
   2.08%   2.08%   7.31%        2554      2554      8975   _R_DrawVisSprite
   2.06%   2.08%   2.17%        2532      2560      2671   _R_DrawColumn
When there's no sprite loading, thinking part would seem to be taking more CPU.

Interestingly, on both sides subframe_block (audio?) seems to be taking most CPU.

DSP side:

Code: Select all

Visits/calls:
  29.85%  29.85%          40        40   TestLineSegVectorBisection
  19.40%  19.40%          26        26   Divs48_Real
  19.40%                  26             P_CrossSubsector_body
  19.40%  38.81%          26        52   InterceptVectors
   5.97%                   8             command_base
   5.97%                   8             ALGO_P_CrossBSPNode
...
Used cycles:
  99.10%                     8327202                       command_base
   0.51%                       42956                       ALGO_P_CrossBSPNode
Plenty of free DSP for CPU side thinking optimizations, but IMHO first optimization priority would be catching those pesky sprite loads though. :-)
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote:Got worst frame info Doom1 level 4 beginning (large open area with several marines and me using the machine gun liberally), happily this time without any extra sprite loading.
:)
Eero Tamminen wrote: When there's no(?) sprite loading, DSP seems to be fairly full, only ~20% left for command_base polling.
Having the game and rendering captures separated really helps to show DSP load more clearly.

I have a few ideas for narrowing further but still thinking about it.

Eero Tamminen wrote:When there's no sprite loading, thinking part would seem to be taking more CPU.
For TIMEBASE_CONTROL==1, that will probably remain true. For >= 3 I'm hoping it will start to move into the background as more changes go in.
Eero Tamminen wrote: Interestingly, on both sides subframe_block (audio?) seems to be taking most CPU.
The audio mixer is an interrupt and it occurs 'occasionally' and takes a relatively long time. The shorter your profiling run, the bigger it appears to be in the profile.

I probably won't change the interrupt rate (i.e. buffer size) but will speed the code up a bit later.
Eero Tamminen wrote: Plenty of free DSP for CPU side thinking optimizations, but IMHO first optimization priority would be catching those pesky sprite loads though. :-)
The DSP is pretty much free throughout the 'thinking' part as its only used for co-processing. Any parallelism is very local and brief. It's not difficult to apply it in more places but i need to watch code space - I don't have any dynamic modules in this project so it all has to 'fit' and is getting tight (I have made it more modular in case dynamic is needed but TBH have not scoped the work and hoping I don't have to).

Sprite loading - there is an easy way to improve what's there. LinuxDoom has a sort of nasty hack in it which loads all known sprites for all known Doom games on every level - from a big list. The list clearly has holes in it. We'd get the same effect with 100% coverage by just loading every sprite in the WAD, using the WAD sprite directory itself as the list.

However it would be better to use a list of 'dynamic' sprites used on most levels, and a list derived from things in the current map. This keeps the cache as small as possible so useful stuff doesnt get kicked out during loading.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

When playing Doom 1 "-warp 1 4" level, I noticed couple of sprite clipping bugs.

In first screenshot Imps are seen through wall & door, although they're behind those, in second screenshot a dead marine is seen through the ceiling (spooky...):
grab0002.png
grab0003.png
These are with TIMEBASE_CONTROL=3.

(In case you wonder about image quality, I've converted them to 8-bit PNGs in Gimp to save space.)
You do not have the required permissions to view the files attached to this post.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

dml wrote:
Eero Tamminen wrote: Plenty of free DSP for CPU side thinking optimizations, but IMHO first optimization priority would be catching those pesky sprite loads though. :-)
The DSP is pretty much free throughout the 'thinking' part as its only used for co-processing. Any parallelism is very local and brief. It's not difficult to apply it in more places but i need to watch code space - I don't have any dynamic modules in this project so it all has to 'fit' and is getting tight (I have made it more modular in case dynamic is needed but TBH have not scoped the work and hoping I don't have to).
Ah, right, I'm not profiling DSP RAM usage. :-)
dml wrote: Sprite loading - there is an easy way to improve what's there. LinuxDoom has a sort of nasty hack in it which loads all known sprites for all known Doom games on every level - from a big list. The list clearly has holes in it. We'd get the same effect with 100% coverage by just loading every sprite in the WAD, using the WAD sprite directory itself as the list.

However it would be better to use a list of 'dynamic' sprites used on most levels, and a list derived from things in the current map. This keeps the cache as small as possible so useful stuff doesnt get kicked out during loading.
Would pre-loading all sprites use too much memory even for 14MB configuration?
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote: Ah, right, I'm not profiling DSP RAM usage. :-)
:) It's probably not going to make matters easier for profiling if I swap the code around at runtime so hopefully a few vector primitives will be enough to optimize most of the bad bits in the 'thinking' tick.
Eero Tamminen wrote: Would pre-loading all sprites use too much memory even for 14MB configuration?
The resource management is implemented in a way that makes it difficult to run out of memory, providing the static stuff fits in RAM (plus some margin). It behaves a bit like virtual memory. However it still gets slow if it is short of reusable physical memory, and still has to load stuff mid-game if it tries to load too much via precaching (e.g. if precaching represents 1.5x available storage and 2.0x required storage, something *will* still load during play even if it would have been a safe fit).

On a 14mb machine it's not really an issue because there's plenty of margin (currently) but it makes things very tough for a 4mb fit if that becomes viable later. It also eats into margin that might be used for 'Atari-enhancements' on a 14mb box.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote:When playing Doom 1 "-warp 1 4" level, I noticed couple of sprite clipping bugs.
Yep these are pretty funny.

In fact I realized recently that some logic I used for sprite clipping is probably creating performance problems (visual bugs aside). I will change this soon and hopefully fix both problems.

Sprites commit 'fragments' for rendering when the first solid thing appears behind them. The parts of a sprite which are still waiting for a solid thing to commit them via background scenery just wait in a list until that happens. This unfortunately defers some bits of sprites for a long time and causes a lot of fragmentation (walls from distant sectors are narrower).

This was accidental, caused by removing a side-of-line test for early commits. The fix worked for the original problem case but should have been solved differently as it caused these other problems. The logic needs to be a bit more complicated and try to commit the fragments as early as possible, but with special handling for a) transparencies and b) solid walls with a max-z in front of the sprite z.

Since it involves a side-of-line test, I'll probably wait until I'm DSP'ing the other cases which need that vector op so there are several points of reference for a working op.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

For catching incorrectly precached sprite (and other) resources via Hatari...

All resources which get loaded into the BM cache from the WAD will pass through this 68k function:

Code: Select all

read_resource:
; a0 = wd_struct
; a1 = buffer
; return/d0 = bytes read (or -ve for error)
Other resources pulled from the WAD which don't end up in the cache (game-static or level-static resources e.g. palette data) will pass through this one instead:

Code: Select all

load_resource:
(It may help to know that one reads to a buffer provided by the cache, the other allocates before reading and expects the caller to deallocate later).

The input to these functions is a WAD directory entry in reg a0;

Code: Select all

*-------------------------------------------------------*
*	WAD directory structure				*
*-------------------------------------------------------*
			rsreset
*-------------------------------------------------------*
wd_offset		rs.l	1
wd_size			rs.l	1
wd_name			rs.b	8
wd_len			rs.b	0
...or the C equivalent...

Code: Select all

struct wd_struct
{
 int offset;
 int size;
 char name[8];
};
I expect that any calls to read_resource or load_resource between these two labels:

Code: Select all

BM_E_ProfilerReset() <- end of current level load/init

...no loading expected...

P_InitThinkers() <- beginning of new level
...represent a suspicious load. That might be enough to catch these events for now. Both functions are called from P_SetupLevel()


I think a good solution to this would be to have BM track these during a game level and maintain a small cache file for each map. This way the game package can provide cache files for known WADs, and unknown WADs will generate their own on first use. It means writing a small read/write operation for this cache list on each level entry/exit but it's not complicated and can be done in easily the game C code. It doesn't need to be in the asm engine (The latter can focus on the problem of drawing the scene).

For now it would just be interesting to see what's missing - to make sure existing bits are doing what they are supposed to.

[EDIT]

One more note on the sprite/texture loading time - the extra overhead caused by mipmap generation and postmap (D_GeneratePostBuffer) are offline events which will eventually all be cached into a local folder on first use. So the WAD will expand into a directory structure in Falcon format. This means the 2nd time you load a level it will load much faster. I'm hoping to provide a commandline switch specifically to process the WAD in one go, so it's all done before trying to play.

This means I won't be trying to optimize mipmap generation, colour reduction or postmap building processes. These won't happen during play even if the precache isn't working (although it's better to get it working properly anyway).
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

Now that there's separate (interactive) worst frame profiling, I I'll remove separate last frame render & think profile from timedemo profiling. I also make start and end same for both Doom I & II.

Here's corresponding Doom1 profile of playing from first shot until death (level 4):

CPU side:

Code: Select all

Time spent in profile = 284.31990s.

Visits/calls:
- max = 827225, in R_BSPHyperPlane_RHS at 0x50396, on line 12127
- 7099646 in total
Executed instructions:
- max = 3498560, in _subframe_block+268 at 0x1edfa, on line 632
- 499182620 in total
Used cycles:
- max = 84026336, in _subframe_block+278 at 0x1ee04, on line 636
- 4561200344 in total
Instruction cache misses:
- max = 931474, in _P_RunThinkers+28 at 0x39696, on line 8018
- 49357972 in total
...
Executed instructions:
  21.26%                   106126864                       R_DrawSurface_NW
   9.01%                    44981510                       R_VisPlaneShader
   8.74%                    43651191                       R_AdvanceSurface_T8
   6.57%                    32778027                       R_AdvanceSurface_T4
   5.46%   5.47%  14.18%    27252422  27298120  70787691   _P_RunThinkers
   5.22%   5.23%   5.23%    26081595  26093155  26093155   _subframe_block
   4.49%                    22388512                       R_SpriteColumnShader_Masked2
   3.80%   3.81%   4.19%    18979140  19040721  20894077   stream_texture
   3.49%   3.49%   3.81%    17406073  17434171  19038048   _V_DrawPatch
   2.53%   2.53%   2.68%    12609132  12633515  13368480   _R_DrawColumn
   1.78%                     8880625                       R_AdvanceSurface_T2
   1.66%   1.68%   1.80%     8278552   8371012   8997396   stack_visplane_area
   1.33%   1.33%   4.11%     6646713   6657607  20512531   _R_DrawVisSprite
   1.31%   1.31%   1.37%     6541715   6550526   6845598   _R_PointInSubsector
   1.21%   1.24%   2.08%     6026251   6195198  10406559   _BM_P_CrossBSPNode
   1.08%   1.09%   1.15%     5415458   5428282   5725392   R_ViewTestSpriteLines
   0.96%   0.96%   1.04%     4794984   4808410   5205370   _P_UpdateSpecials
   0.91%   0.91%   0.95%     4519972   4527370   4740739   init_stategroups
...
Instruction cache misses:
  11.55%  11.60%  36.46%     5700411   5725656  17997415   _P_RunThinkers
   4.48%   4.49%   4.66%     2211502   2216517   2301386   R_AddSpriteSpans
   3.91%   3.92%   6.65%     1929016   1934853   3283691   _R_DrawVisSprite
   3.66%   3.67%   3.69%     1807049   1813637   1823188   R_ViewTestSpriteLines
   2.87%   2.87%  10.18%     1415491   1418921   5026408   _P_CheckPosition
   2.83%   2.83%   8.55%     1394577   1397213   4221141   _V_CopyRect
   2.67%   2.70%   3.31%     1317365   1333020   1631375   _V_DrawPatch
   2.56%   2.59%   2.63%     1262479   1276166   1299438   _R_DrawColumn
   2.31%                     1139316                       R_DrawSurface_NW
   2.15%   2.16%  20.76%     1061915   1064386  10245117   _P_SetMobjState
   2.11%                     1041409                       less256_d
   1.86%   1.86%  13.46%      918238    920368   6641673   _STlib_drawNum
   1.78%   1.79%   4.19%      880240    882610   2069707   _BM_P_CheckSight
   1.74%                      859528                       build_ssector
DSP side:
Used cycles:

Code: Select all

  39.51%                  3603983298                       command_base
  25.47%  27.54%  27.54%  232357502425126436482512643648   R_DoColumnPerspCorrect
  11.21%  11.21%  11.21%  102272118810227211881022721188   VPRenderSpanDT
   6.63%                   604420884                       R_VPLoadTexture
   2.43%                   221600804                       R_DoColumnTextureUV
   1.88%   1.88%   1.88%   171472104 171472104 171472104   extract_subvisplane
   1.34%   1.65%   8.49%   122124768 150104292 774086892   AddLowerWall
   1.23%                   112446176                       ALGO_P_CrossBSPNode
   1.18%   1.53%   5.55%   108048260 139645602 506408086   AddUpperWall
   1.13%                   103175482                       R_ViewTestAddLine
   1.13%   1.34%  18.69%   102987100 1221016141704615088   AddMidWall
   0.82%                    74362088                       P_CrossSubsector_body
   0.69%                    62772466                       R_VPRenderPlane
   0.65%                    59038550                       project_node
One can get this by setting some map option to profile.conf and doing manual play (i.e. timedemo isn't started) after running profile.sh.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3988
Joined: Sat Jun 30, 2012 9:33 am

Re: Bad Mood : Falcon030 'Doom'

Post by dml »

Eero Tamminen wrote: One can get this by setting some map option to profile.conf and doing manual play (i.e. timedemo isn't started) after running profile.sh.
I did this by accident recently via a -warp switch in profile.conf :)
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3999
Joined: Sun Jul 31, 2011 1:11 pm

Re: Bad Mood : Falcon030 'Doom'

Post by Eero Tamminen »

The extra sprite lumps being loaded during Doom1 timedemo are following:

Code: Select all

PUFFA0
PUFFB0
PUFFC0
PUFFD0
BAL1A0
BAL1B0
BAL1C0
BAL1D0
BAL1E0
BLUDA0
BLUDB0
BLUDC0
BEXPA0
BEXPB0
BEXPC0
BEXPD0
BEXPE0
Doom2 timedemo loads mostly the same during gameplay, except that it didn't load the BEXP* ones, but loaded BAL2* (A, B, D, E) ones.

Return to “680x0”