Bad Mood : Falcon030 'Doom'

dml · Post by **dml** » Sat Feb 23, 2013 1:45 pm

Here are two binaries, with the DSP LOD and CPU LST files (symbols part).

The binary with 'P' in the name has the internal profiler & metrics enabled so you can compare notes with the internal profiler (It's slower than the normal build and the symbols won't match it, so don't waste time trying to use the symbol file with that). Press '5' to enable those - everything appears onscreen at once alongside the FPS reading.

Both builds have handshaking and alignment optimizations disabled and are noticably slower than a typical 'final' build but it should be fine for testing in Hatari (the 3fps I reported on e4m2 is actually rises bit higher with these things re-enabled).

https://dl.dropbox.com/u/12947585/BMSymTst.zip

Eero Tamminen · Post by **Eero Tamminen** » Sat Feb 23, 2013 3:56 pm

Thanks!

While it works with the old UAE CPU core, it doesn't seem to work with the WinUAE CPU core. Which one you're using?

With WinUAE CPU core it terminates after pressing a key and processing the texture, I guess right before the (doom1.wad) level screen should come up:

Code: Select all

GEMDOS 0x3F Fread(64, 552, 0xdf5ca)
GEMDOS 0x42 Fseek(3919532, 64, 0)
GEMDOS 0x3F Fread(64, 35080, 0xdf5ca)
B-Trap f22c at 248e8 (0x8669188)
A-Trap a000 at e01028 (0x94458c8)
A-Trap a000 at e010a0 (0x9445940)
A-Trap a00e at e010c0 (0x9445960)
A-Trap a000 at e010a0 (0x9445940)
A-Trap a00e at e010c0 (0x9445960)
A-Trap a000 at e010a0 (0x9445940)
A-Trap a00e at e010c0 (0x9445960)
A-Trap a000 at e010a0 (0x9445940)
A-Trap a00e at e010c0 (0x9445960)
A-Trap a000 at e010a0 (0x9445940)
A-Trap a00e at e010c0 (0x9445960)
A-Trap a000 at e010a0 (0x9445940)
A-Trap a00e at e010c0 (0x9445960)
A-Trap a000 at e010a0 (0x9445940)
GEMDOS 0x4C Pterm(0xFFFF)

As it works with old UAE CPU core, it's not a problem.

Note, the issue happens when fast-forwarding the startup. I try later whether it works without fast-forwarding (if yes, that's Hatari bug).

dml wrote:The CPU side isn't very complex (compared with some medium-sized C projects) but the DSP side is quite large and not very typical so it may be useful for testing.

I thought you had already coupled Bad mood renderer to Doom code, as you had some screenshots of Doom 2. Does above comment mean that they're still separate, or that Doom code isn't complex? If Doom2 is a separate binary, I would be interested also about your Doom2 build and CPU symbols for that.

dml · Post by **dml** » Sat Feb 23, 2013 4:13 pm

Eero Tamminen wrote: While it works with the old UAE CPU core, it doesn't seem to work with the WinUAE CPU core.

Strange - it has worked ok here for as long as I've been using it.

Eero Tamminen wrote: Which one you're using?

I always use the WinUAE build. The one with all the extra CPU parameters.

Eero Tamminen wrote:With WinUAE CPU core it terminates after pressing a key and processing the texture, I guess right before the (doom1.wad) level screen should come up:

I can't really explain this - I have only ever been using the WinUAE build with BadMood.

BTW I haven't tried the fast-forward feature yet, will do so next chance I get.

Eero Tamminen wrote:I thought you had already coupled Bad mood renderer to Doom code, as you had some screenshots of Doom 2.

No I'm still quite far from that stage. All I had done there was 'port' LinuxDoom 1.10 to Falcon as a startingpoint to make sure the code worked and all the HW dependent issues were fixed or stubbed out (display, audio, keyboard etc). So what you saw was the ported, but otherwise original Id code running the DOOM2 WAD file on Hatari/Falcon.

Doom2 is basically Doom with some code extensions/tweaks and a new set of otherwise compatible leveldata (IWAD). While those code tweaks will most likely be missing from this code, I'm pretty sure they can be scooped up from the many source ports floating around out there already.

dml · Post by **dml** » Sat Feb 23, 2013 4:17 pm

dml wrote: I can't really explain this - I have only ever been using the WinUAE build with BadMood.

Actually maybe I can explain it - if it turns out to be the fast-forward mode.

Not all of the DSP handshaking optimizations are disabled - the new texture mapping method depends on it quite heavily and it doesn't get turned off with the others. So if you speed up the 68030 relative to the DSP with that method enabled, it will probably die (and there isn't yet a way to control it - it gets 'compiled in').

So I recommend avoiding FF for testing BadMood for now.

dml · Post by **dml** » Sat Feb 23, 2013 5:16 pm

Eero Tamminen wrote:If Doom2 is a separate binary, I would be interested also about your Doom2 build and CPU symbols for that.

Sorry I missed that bit.

Here's the ported executable built with -g (optimized, but symbols should be present), and a 'nm' dump to a separate symbol file. It was built for 68030+68882, although I don't know if any FPU stuff is actually generated - Doom is meant to be largely float-free.

linuxdoom-1.10.zip

[EDIT] Note the above is a plain TOS executable - not Linux or MiNT

The main problem you'll have with this build is lack of keyboard input. So you'll have to force it to 'warp' to a specific level to get it to render a 3D view. I used DOOM2.WAD with '-warp map01' to force it past the titlescreen & menu (IIRC... doublecheck check the doom wikis/faqs for the commandline args).

Eero Tamminen · Post by **Eero Tamminen** » Sat Feb 23, 2013 11:13 pm

dml wrote:Not all of the DSP handshaking optimizations are disabled - the new texture mapping method depends on it quite heavily and it doesn't get turned off with the others. So if you speed up the 68030 relative to the DSP with that method enabled, it will probably die (and there isn't yet a way to control it - it gets 'compiled in').

So I recommend avoiding FF for testing BadMood for now.

It wasn't fast-forward. Turning that off doesn't make WinUAE core work for me, and Bad mood works fine with old UAE core even with fast-forwarding. Image of what happens on screen is attached, in case it tells something for you.

No need to investigate it, as the program works with other CPU core.

Note: fast-forwarding shouldn't change anything that's visible to the emulated environment. It just disables host's wait for emulator VBL end and skips related host window updates (host window is updated on each emulated VBL, by default). VBL isn't skipped inside the emulation. If fast-forward changes something, it would be a bug in Hatari.

With the old UAE core and doom1.wad... If I run to the first level exit room and turn to the right, Bad mood doesn't anymore react to user input or show the results of user input. DSP code executed after that point is just this:

Code: Select all

p:09c3  0aa980 0009c3  (06 cyc)  jclr #0,x:$ffe9,p:$09c3

So I think it's stuck. CPU side loops through 248 instructions and based on the function names (from player_movement to ikbd_hander to End_Calculate), it would seem(?) be working correctly.

PS. I had a typo on Hatari DSP caller information gathering, it's fixed now.

Eero Tamminen · Post by **Eero Tamminen** » Sat Feb 23, 2013 11:59 pm

dml wrote:Here are two binaries, with the DSP LOD and CPU LST files (symbols part).

Attached is callgraph of the CPU side (in DOT format) instructions usage, during run from starting position to exit room.

It seems that framecounter, stabilizer_b, new_vbi and ikbd_handler are some kind of interrupt handlers as they're called from all over the code. I've used "--ignore-from" option to remove callgraph edges ending in them to make the graph readable.

There's a single function that itself is a clear bottleneck as it alone consume 2/3 of the instructions:

Code: Select all

Time spent in profile = 15.84330s.

Calls:
- max = 153998, in load_real_s_a5_d16_a2
- 462271 in total
Executed instructions:
- max = 1161896, in render_wall_1x1+216
- 46417454 in total
Used cycles:
- max = 9122448, in render_wall_1x1+208
- 254166016 in total

Calls:
 33.31%    153998  load_real_s_a5_d16_a2
  9.11%     42090  loop_line_scan
  8.84%     40871  next_line
  8.63%     39898  End_Calculate
  8.63%     39898  Calculate_Distance
  8.28%     38295  Distance1
  1.70%      7854  nodeincone
  1.37%      6337  framecounter
  1.34%      6192  invisible
  1.34%      6192  segment_loop
  1.30%      6032  next_node

Executed instructions:
 67.98%  31556906  render_wall_1x1
 11.69%   5428115  render_flats_1x1
  2.92%   1355046  enable_pixel_spans
  2.75%   1276185  Calculate_Distance
  1.51%    701024  nodeincone
  1.51%    699140  stack_visplane_area
  1.36%    629902  add_wall_segment
  1.32%    613797  segment_loop
  1.20%    557600  initialise_freetable
  1.16%    537096  Distance1
  1.14%    529832  loop_line_scan

dml wrote:The binary with 'P' in the name has the internal profiler & metrics enabled so you can compare notes with the internal profiler (It's slower than the normal build and the symbols won't match it, so don't waste time trying to use the symbol file with that). Press '5' to enable those - everything appears onscreen at once alongside the FPS reading.

For now I'm not interested in that level of information, just whether post-processing works fine and the general picture of whether there are clear bottleneck even without thinking of cycles and caches.

dml wrote:Both builds have handshaking and alignment optimizations disabled and are noticably slower than a typical 'final' build but it should be fine for testing in Hatari (the 3fps I reported on e4m2 is actually rises bit higher with these things re-enabled).

It seems already usable. I'm used to low FPS.

Eero Tamminen · Post by **Eero Tamminen** » Sun Feb 24, 2013 12:18 am

As to DSP, perspected_column() takes over half the DSP, but nearly 4/5 is host port waiting.

Code: Select all

Time spent in profile = 27.11142s.
...
Used cycles:
 52.55% 457116728  perspected_column
 13.87% 120693602  VPRenderPlaneDT_
  9.80%  85286436  command_base
  6.79%  59070152  linear_column
  4.45%  38701764  SetTexture
  3.58%  31181526  finish_column

There's something fishy in DSP call info, so I'm not attaching that.

dml · Post by **dml** » Sun Feb 24, 2013 9:15 am

Eero Tamminen wrote: It wasn't fast-forward. Turning that off doesn't make WinUAE core work for me, and Bad mood works fine with old UAE core even with fast-forwarding. Image of what happens on screen is attached, in case it tells something for you. No need to investigate it, as the program works with other CPU core.

So if building hatari with this:

./configure --enable-winuae-cpu; make clean; make

...doesn't enable WinUAE core, what do I need to do?

I must have misunderstood something earlier. I thought I was using it all along with this compile flag... (and the UI does certainly look different when built with this flag - the CPU extra flags aren't available without it).

Eero Tamminen wrote: Note: fast-forwarding shouldn't change anything that's visible to the emulated environment. It just disables host's wait for emulator VBL end and skips related host window updates (host window is updated on each emulated VBL, by default). VBL isn't skipped inside the emulation. If fast-forward changes something, it would be a bug in Hatari.

Ok understood.

Eero Tamminen wrote:With the old UAE core and doom1.wad... If I run to the first level exit room and turn to the right, Bad mood doesn't anymore react to user input or show the results of user input. DSP code executed after that point is just this:
Code: Select all
p:09c3  0aa980 0009c3  (06 cyc)  jclr #0,x:$ffe9,p:$09c3

The CPU and DSP have lost sync with each other, probably due to the timing sensitivity of the DSP texturing. I expect its probably stuck in VPRenderPlaneDT on the DSP side, or renderflats_?? on the CPU side, or some functionality immediately before or after it.

I have seen this happen in Hatari in other parts of the code when I enable the handshaking optimizations, but that build should have worked. It does work here. (Are you still sure I'm using the old UAE core? I can understand this happening if the cores have different timings - but if using the same cores we should get the same results)

Eero Tamminen wrote:So I think it's stuck. CPU side loops through 248 instructions and based on the function names (from player_movement to ikbd_hander to End_Calculate), it would seem(?) be working correctly.

That will be an interrupt - the user input is sampled in realtime so framerate doesn't affect movement speed. The mainloop will have frozen though.

dml · Post by **dml** » Sun Feb 24, 2013 10:22 am

Eero Tamminen wrote: Attached is callgraph of the CPU side (in DOT format) instructions usage, during run from starting position to exit room.

Thanks! I'll study it over the next few days to see how it ties up with what I know (or thought I knew

Eero Tamminen wrote: It seems that framecounter, stabilizer_b, new_vbi and ikbd_handler are some kind of interrupt handlers as they're called from all over the code. I've used "--ignore-from" option to remove callgraph edges ending in them to make the graph readable.

Yes that's right. A few realtime functions are running on VBL and TimerC.

Eero Tamminen wrote: There's a single function that itself is a clear bottleneck as it alone consume 2/3 of the instructions:

Is that 2/3rds of instructions, or calls made? I'd expect the floor/wall rendering to consume 2/3rd of instructions but not this function.

It is interesting though partly because it's a small frequent operation but also because it's one of the DSP/CPU transfer bottlenecks which showed up in the DSP profile. One processor is trying to make an exchange and the other is not always ready.

Note: When I profiled the UD e4m2 level in the last screenshot, it was only spending about 10% of total time rendering floors - which was typically the main cost in a simple scene before - everything else was CPU bound processing map wall segments and shoving them at the DSP and dealing with the wall columns so it does correlate with your results so far.

Eero Tamminen wrote: Time spent in profile = 15.84330s.
Calls:
- max = 153998, in load_real_s_a5_d16_a2
- 462271 in total

So this one is the smallest, most frequent CPU function used. It sends a 32bit real to the DSP, as part of wall creation. It's subject to the obvious jsr/rts overhead but the main cost is the exchange itself (2 host port bus exchanges - not always ready). It exists as a call (not inline) because it's used several times in a row very close together and stands a chance of cache reuse.

Eero Tamminen wrote: Executed instructions:
- max = 1161896, in render_wall_1x1+216
- 46417454 in total
Used cycles:
- max = 9122448, in render_wall_1x1+208
- 254166016 in total

That also correlates. The wall drawing (which both hides and occasionally 'blocks for' a large portion of DSP work) consumes most of the work

Eero Tamminen wrote: Calls:
33.31% 153998 load_real_s_a5_d16_a2
9.11% 42090 loop_line_scan *
8.84% 40871 next_line *
8.63% 39898 End_Calculate *
8.63% 39898 Calculate_Distance
8.28% 38295 Distance1
1.70% 7854 nodeincone
1.37% 6337 framecounter
1.34% 6192 invisible *
1.34% 6192 segment_loop *
1.30% 6032 next_node *

This is a bit strange looking because many of those are not functions at all - I put a * star next to the ones which are not calls (either loop labels or 'gotos'). They are also not near actual call labels so they can't be aliases.

I suspect one of the following:

- wrong binary was profiled against the symbols file (profile the one without 'P' in the name)
- the symbols didn't import correctly
- the symbols didn't match the binary in the first place (?)

Eero Tamminen wrote: Executed instructions:
67.98% 31556906 render_wall_1x1
11.69% 5428115 render_flats_1x1
2.92% 1355046 enable_pixel_spans
2.75% 1276185 Calculate_Distance
1.51% 701024 nodeincone
1.51% 699140 stack_visplane_area
1.36% 629902 add_wall_segment
1.32% 613797 segment_loop
1.20% 557600 initialise_freetable
1.16% 537096 Distance1
1.14% 529832 loop_line_scan

This looks close-ish to what the built-in profiler says, except a few of these again are not functions (segment_loop, loop_line_scan - see below).

check_move:
sf no_move
move.l nextx,cor_nextx
move.l nexty,cor_nexty
bsr build_blocks_list ; checks for closest blocks
move.l Vertex_Array,a1 ; Array of points
lea blocks_scan,a3 ; list of blocks to check
move.w (a3)+,d0 ; number of blocks (-1)
line_scan:
move.w (a3)+,d1 ; block number
move.l Blockmap_Array,a2 ; Array of blocks
move.l (a2,d1.w*4),a2 ; current block address
addq.l #2,a2 ; skips first word
loop_line_scan:
move.w (a2)+,d1 ; linedef number
cmpi.w #$ffff,d1 ; = -1 ?
beq next_line_scan ; if yes, end of block...

Eero Tamminen wrote: For now I'm not interested in that level of information, just whether post-processing works fine and the general picture of whether there are clear bottleneck even without thinking of cycles and caches.

The callgraph view does show a 'program flow' bottleneck but the 'call' aspect of it doesn't translate to cost in this case.

e.g. this is what it really does:

Code: Select all

	lea		load_real_s_a5_d16_a2,a3
*-------------------------------------------------------*
*	Load y-increments and accumulators (16.16)	*
*-------------------------------------------------------*
	move.l		tmp_addseg_y1a(a6),d0
	jsr		(a3)
	move.l		tmp_addseg_y1i(a6),d0
	jsr		(a3)
	move.l		tmp_addseg_y2a(a6),d0
	jsr		(a3)
	move.l		tmp_addseg_y2i(a6),d0
	jsr		(a3)

        ....

*-------------------------------------------------------*
*	Load 16.16 realnumber into DSP as 24.24		*
*-------------------------------------------------------*
load_real_s_a5_d16_a2:
*-------------------------------------------------------*
	dspwaitwrite.0	(a5)
	move.w		d0,(a2)
	swap		d0
	dspwaitwrite.1	(a5)				; .1
	move.w		d0,(a2)
	rts

It looks terrible in the callgraph view because it's interpreted as a normal function call. But in fact it's a cache optimization for a small reusable handshake (called via register, several times in a burst). The actual time wasted in this process is small. I suppose I should turn it into a tail call style impl. (jmp/jmp) but the host bus transfer will still dominate, and there are better things I can do with it via a more general refactor of the wall generator.

So anyway - while it's tempting to remove the call aspect, the main effect it would have is hiding the operation from view and reshuffling the callgraph view. The handshake itself is the primary cost (inlined or otherwise).

It's also the only place this kind of optimization was used, which is why it appears to dominate the callgraph view in such an obvious way.

The good news is the callgraph view found it even in this unusual form

Thanks for time spent on the results - it would be nice if you could identify why some of those unexpected labels appear in the callgraph. I can provide another build if it helps. It's possible I made a mistake that day when packaging things up but I thought it was ok.

dml · Post by **dml** » Sun Feb 24, 2013 10:24 am

Eero Tamminen wrote:As to DSP, perspected_column() takes over half the DSP, but nearly 4/5 is host port waiting.
Code: Select all
Time spent in profile = 27.11142s.
...
Used cycles:
 52.55% 457116728  perspected_column
 13.87% 120693602  VPRenderPlaneDT_
  9.80%  85286436  command_base
  6.79%  59070152  linear_column
  4.45%  38701764  SetTexture
  3.58%  31181526  finish_column

This makes sense - the perspected_column routine is the one with the heavy divides - generally hidden behind the wall column drawing on the CPU.

It's nice to see the DSP still isn't a holdup the majority of the time!

[EDIT #1] note that 'finish_column' isn't a call - it's another goto/jump.

[EDIT #2] If you're associating contiguous blocks of code under labels as the primary metric for generating callgraph entries, then it probably explains why you are getting loop labels and gotos showing up in the callgraphs for both CPU and DSP. This is probably ok providing the callgraph is interpreted with care, block entries are also interesting to a degree - but it's probably only reliable for compiled programs. Assembly language code takes liberties with this kind of thing, global labels can show up within code blocks at the same graph level for various reasons. To prune that you'll need to find blocks reached specifically via jsr/bsr ops, and maybe also exited via rts.

dml · Post by **dml** » Sun Feb 24, 2013 1:50 pm

Eero Tamminen wrote: Attached is callgraph of the CPU side (in DOT format) instructions usage, during run from starting position to exit room.

I've been looking at your DOT graph and can see lots of recognizable behaviour in there, but some very strange things too... e.g.

1) the graph converges broadly on 'get_dy' with a single call from lots of places - even for the rendering pipeline, yet it is not used at all by that code? (It looks like an accidental convergence point for leaves in your graph, when they should remain open...).

2) I also see interrupt (keyboard handling) code making 'calls' to strange locations such as 'load_real..' when in fact it's probably just being interrupted there on occasion - not real callgraph activity as such. Even more strange is the fact these interrupts only appear to 'call' a select few locations in background code, when I'd expect it to be scattered broadly across many locations in the program. This needs some explanation perhaps?

3) As mentioned previously, it is unravelling flat functions into graphs, where loop labels and gotos appear inside the function - the collision detection stuff on the lower right of the graph looks quite explosive but its mostly all one function with no calls in it.

4) I also see initialization stuff in the graph (show_mips) which only happens once at startup, so the graph is more complex overall than for a typical 'display frame'. But that's more of a use case thing really, i.e. when to start profiling capture or which contexts to capture within.

I'll see what else I can find in there!

dml · Post by **dml** » Sun Feb 24, 2013 6:44 pm

Found an interesting optimization/oversight in the wall handling code today.

There is already a kind of 'fast path' for walls which are not expected to generate any (textured) columns, particularly if they are tagged with sky texture or 'null' texture. These kinds of wall are not common esp. the 'null' case but it's useful for some occasional map tricks since its main duty is to formally close open edges of a sector's floor or ceiling where a sector meets another sector - and since it doesn't generate any actual wall columns the CPU doesn't have to wait for anything to draw. A 'fast path' was made for this kind of situation before it was measured properly, but was then left un-optimized when it was found to be rarely used.

There are many cases though where walls get generated 'upside down' - the backs of raised areas of floors or lowered ceilings - or the far side of steps. These upper/lower surfaces face away from the viewer so they are inverted. They are still used to 'close' sector floors with a nice clean edge where those walls exist but they generate no wall columns. The normal wall routine was being used here in 'dummy wall' mode where the fast path could have been used for exactly the same result. 'dummy wall' mode was added mainly to deal with sector joins and map bugs, and wasn't being treated as a real fast path.

Fixing this hasn't resulted in much of a speedup yet because the DSP isn't normally causing a holdup, but having identified the best uses for this 'fast path' it's allowing a lot of unused code to be removed for those cases and the CPU will be doing less. So it is leading to some speedups indirectly. I have already seen that removing redundant stuff from the 'fast path' code is now increasing the framerate measurably.

I think the next biggest speedup could be found by moving a lot of the ssector/segment processing stuff onto the DSP side instead of trying to do it on the CPU since there's a lot of vector arithmetic, lots of flow control and floor height clamping tests etc. etc. and is too long to cache properly on the 68030 as part of the ssector seg/linedef loop. On the most complex level I tried, this CPU-side process is taking a huge (!) portion of total time. Moving this code would also reduce the number of terms being sent to the DSP for each wall segment - some of the same fields are sent for each upper/mid/lower wall, when there are common fields for the whole wall and a few overrides etc.

I think before I try to do that I'll continue where I left off trying to 'batch' bits of work on the 030 to make them more self-contained and cache better. When this runs up against limits, will then look at moving more of it onto the DSP and have the CPU just do scene walking and housekeeping. I expect it will be easier to move stages of the reworked 'batched' code, than it will be to move the meandering, complicated code that's being used now.

[EDIT]

..thought of an extension to this one - it should be easy to have the CPU find the nearest point of each upper/lower 'step' wall to the viewer, and using a lookup table find the height difference in screen space. If the step is facing the viewer but too small/faraway to see, snap the floor/ceiling heights to be equal on both sides and the fast path will end up being used instead of the wall generator. Don't know if this happens often enough at this resolution to be worthwhile but its easy to test.

Eero Tamminen · Post by **Eero Tamminen** » Sun Feb 24, 2013 10:09 pm

dml wrote:So if building hatari with this:

./configure --enable-winuae-cpu; make clean; make

...doesn't enable WinUAE core, what do I need to do? I must have misunderstood something earlier. I thought I was using it all along with this compile flag... (and the UI does certainly look different when built with this flag - the CPU extra flags aren't available without it).

WinUAE CPU core is the one with FPU selection box in system options. There may be some WinUAE specific option that is differently for us which is causing the difference. I'll check that next week.

dml wrote:[EDIT #2] If you're associating contiguous blocks of code under labels as the primary metric for generating callgraph entries, then it probably explains why you are getting loop labels and gotos showing up in the callgraphs for both CPU and DSP. This is probably ok providing the callgraph is interpreted with care, block entries are also interesting to a degree - but it's probably only reliable for compiled programs. Assembly language code takes liberties with this kind of thing, global labels can show up within code blocks at the same graph level for various reasons.

How the callgraph stuff works:

1. User loads symbols to the Hatari debugger.

2. During profiling, profiler tracks every code address that has a symbol associated with it, and it keeps records of from what PC addresses execution transferred there.

3. When saving profile, this caller PC address & call count information is saved after rest of the profile information.

4. Post-processing script adds together costs for each memory address in the profile as it reads it and when it comes across new symbol address [1], it assigns the so-far collected costs to previous symbol.

I.e. every address that has a symbol, and which gets executed, is interpreted as a "function". If user inputs as symbols goto loop labels, they're treated as "functions".

[1] Symbols are parsed from the profile itself (which is always the case when you have caller information), and they can also be given separately with symbol files.

dml wrote:To prune that you'll need to find blocks reached specifically via jsr/bsr ops, and maybe also exited via rts.

There are couple of reasons why the profiler code work currently like it does:
* Simplicity (also first stab at this code)
* That way one can e.g. find out from which contexts an interrupt routine gets called. Which can be useful when tracking some very hard bugs (profiler is very useful also for debugging).

Maybe profiler should store some flags about the caller addresses, are they jsr/bsr (at the time of call as they may change later) etc. If you have some suggestions about that, it would be best to discuss it at the hatari-devel mailing list (think also about what would be useful for debugging hard bugs...).

Another possibility is that one uses a specific calling convention for symbols in assembly, e.g. certain prefix for subroutines and another one for loop labels. They they're easy to filter out before giving them for Hatari.

Eero Tamminen · Post by **Eero Tamminen** » Sun Feb 24, 2013 11:09 pm

dml wrote:I've been looking at your DOT graph and can see lots of recognizable behaviour in there, but some very strange things too... e.g.

1) the graph converges broadly on 'get_dy' with a single call from lots of places - even for the rendering pipeline, yet it is not used at all by that code? (It looks like an accidental convergence point for leaves in your graph, when they should remain open...).

Arrow in graph means: there was point in code execution where previous PC was the one listed in the arrow text, and the current PC the symbol address in the node that the arrow points to.

Could something like that happen to get_dy()? Or could you have some macros or inlines used by rendering code that use get_dy()? get_dx() is also called from rendering code.

dml wrote:2) I also see interrupt (keyboard handling) code making 'calls' to strange locations such as 'load_real..' when in fact it's probably just being interrupted there on occasion - not real callgraph activity as such. Even more strange is the fact these interrupts only appear to 'call' a select few locations in background code, when I'd expect it to be scattered broadly across many locations in the program. This needs some explanation perhaps?

Maybe interrupt happened at the first instruction of a function? If after interrupt PC changes from interrupt handler to a first instruction in a function, that's recorded as a "call" by the profiler.

Interrupt handlers are a bit of special case. Unless one is specifically investigation their call contexts, It may be best to tell post-process to ignore "calls" to and from them. Currently I'm telling callgraphs to ignore calls *to* specific interrupt symbols as they get called from everywhere (and really mess up and slow down graph rendering). I'm not asking it to ignore calls *from* them as call "back" would be recorded only if it happened to an exact symbol address, and interrupt itself might be doing some real subroutine calls which could be interesting to see.

Profiler flagging bsr etc. would help the post-processor also in this case.

dml wrote:3) As mentioned previously, it is unravelling flat functions into graphs, where loop labels and gotos appear inside the function - the collision detection stuff on the lower right of the graph looks quite explosive but its mostly all one function with no calls in it.

As it could here.

dml wrote:4) I also see initialization stuff in the graph (show_mips) which only happens once at startup, so the graph is more complex overall than for a typical 'display frame'. But that's more of a use case thing really, i.e. when to start profiling capture or which contexts to capture within.

I started profile only after the game engine had loaded everything and the doom view had come on screen. Is some initialization stuff still being done at that point?

dml wrote:I'll see what else I can find in there!

Callgraphs are easy to get. Just load symbols for things you're interested about before doing the profile (to specify about "calls "to what addresses you're interested about), do the profile normally (with latest non-modified Hatari profiler code), save it, and then run the tools/hatari-profile.py script:

Code: Select all

hatari-profile.py -g -st --ignore-to framecounter,stabilizer_b,new_vbi,ikbd_handler bmsym.txt

(-g = generate callgraph, -st = statistics & top-lists, --ignore-to = ignore calls to given interrupt routines)

dml · Post by **dml** » Mon Feb 25, 2013 1:38 am

Eero Tamminen wrote: Arrow in graph means: there was point in code execution where previous PC was the one listed in the arrow text, and the current PC the symbol address in the node that the arrow points to.

That I think explains the unfolding of functions containing global labels into graphs. It also explains some of the other odd looking things which happen to relate to interrupts.

Eero Tamminen wrote: Could something like that happen to get_dy()? Or could you have some macros or inlines used by rendering code that use get_dy()? get_dx() is also called from rendering code.

It looks like get_dx and get_dy are a chained mouse interrupt handler - effectively interrupt routines. They should probably be marked for 'ignore' during profiling, along with some other stuff.

Eero Tamminen wrote: ...I'm not asking it to ignore calls *from* them as call "back" would be recorded only if it happened to an exact symbol address, and interrupt itself might be doing some real subroutine calls which could be interesting to see.

There are no callbacks made from interrupts - there is some chaining of interrupts - optimizations - but that's all. There is no way for the mouse, keyboard or vbl/timerc stuff to make calls into to primary systems. It looks a bit like the exit from each interrupt is recorded as a call - at least some of the time?

Eero Tamminen wrote: I started profile only after the game engine had loaded everything and the doom view had come on screen. Is some initialization stuff still being done at that point?

It turns out that 'show_mips' is a global definition, used to turn on/off some conditional code. i.e. it's a switch. However, to the assembler it's also a label, and it's an accidental alias for 'cache_resource' - the first function in that sourcefile, and probably the real function in the graph. It's the texture cache 'touch' operation.

So it's a side effect of using the first label which aliases a given address. Perhaps using the last label would be more accurate? Since it will technically always be closer to the thing it maps to. Probably not a perfect heuristic but maybe a better one.

Eero Tamminen wrote: Callgraphs are easy to get. Just load symbols for things you're interested about before doing the profile (to specify about "calls "to what addresses you're interested about), do the profile normally (with latest non-modified Hatari profiler code), save it, and then run the tools/hatari-profile.py script:

Thanks. I'll probably try this again with some extra things ignored to see how it looks. It may also be worth my time refactoring some of the older code to use local labels more so they don't pollute the callgraph too.

Eero Tamminen · Post by **Eero Tamminen** » Mon Feb 25, 2013 8:28 am

dml wrote:It looks like get_dx and get_dy are a chained mouse interrupt handler - effectively interrupt routines. They should probably be marked for 'ignore' during profiling, along with some other stuff.

Do you change the interrupt handler address as the interrupt is being processed? I.e. does being "chained" mean that instruction execution can at some point be changed to continue from get_dx or get_dy label while PC is in some non-interrupt code?

dml wrote:
Eero Tamminen wrote: ...I'm not asking it to ignore calls *from* them as call "back" would be recorded only if it happened to an exact symbol address, and interrupt itself might be doing some real subroutine calls which could be interesting to see.
There are no callbacks made from interrupts - there is some chaining of interrupts - optimizations - but that's all. There is no way for the mouse, keyboard or vbl/timerc stuff to make calls into to primary systems. It looks a bit like the exit from each interrupt is recorded as a call - at least some of the time?

Yes. If the PC address which is restored, when "returning" from the interrupt, is address for some symbol, it's recorded as "call" to that symbol address. As the probability for the restored PC being exactly on one of the symbol address is low, you don't see them that much.

Note: normal code execution where PC just advances to a next instruction is also currently considered as a call, if the next instruction is at symbol address. I think something like that could even in C-code be a (static) function call optimized by a compiler like gcc. That's why I'm not going to do any unconditional bsr/jsr heuristics, although I could consider adding that as extra info, which could be utilized some post-processor option (e.g. --ignore-non-branch).

dml wrote:It turns out that 'show_mips' is a global definition, used to turn on/off some conditional code. i.e. it's a switch. However, to the assembler it's also a label, and it's an accidental alias for 'cache_resource' - the first function in that sourcefile, and probably the real function in the graph. It's the texture cache 'touch' operation.

So it's a side effect of using the first label which aliases a given address. Perhaps using the last label would be more accurate? Since it will technically always be closer to the thing it maps to. Probably not a perfect heuristic but maybe a better one.

Post-processor already does that, it should now handle properly conflicts in symbol addresses and address names, and give warnings about things user should be aware.

However, Hatari debugger doesn't. It's symbol handling is very simplistic, it just gives warnings if there are conflicts (they're noticed when it sorts the symbols and addresses). If there are conflicts, the symbol used by debugger is whichever the binary search on that list just happens to match first.

In my case I gave the symbols just to the debugger as they're needed to get the caller information for profile file. I'm not sure whether that can be fixed afterwards by giving the symbols also to the post-processor with its more advanced symbol handling, because it can consider symbol in the profile file itself as being given "later" (it parses profile files after symbol files).

dml · Post by **dml** » Mon Feb 25, 2013 1:22 pm

Having thought about it a bit, I figure it should be easy to get surface prelighting into Doom/BadMood pretty much for free. Some of the levels suffer from having the same light level on every face of a complex structure and you can't really see the complexity until you're really close to it - everything looks pretty flat (hand-lit sectors excepted, although that touches a whole ssector at a time, it's not the same as per-wall lighting).

A simple diffuse prelight step would help the eye better pick out corners of walls, and the shape of walls which are not completely flat, making everything look more 3D. It has to be kept subtle (no more than 15-25% adjustment of intended light level for that sector) so it doesn't cause problems with hand chosen light levels and some care would be needed for exterior/sky lit areas. I think it will look better though. Worth a go.

dml · Post by **dml** » Mon Feb 25, 2013 2:03 pm

...yep - got time to hack in a test over lunch and it looks better nearly everywhere. More pics to follow.

(light levels still too dark overall, due to floor/mip lighting issues to be fixed soon - and sky still very weird for other reasons)

plbefore1.jpg

plafter1.jpg

dml · Post by **dml** » Mon Feb 25, 2013 2:32 pm

...another example comparison - same caveats apply re: overall too dark.

plbefore2.jpg

plafter2.jpg

bullis1 · Post by **bullis1** » Mon Feb 25, 2013 3:16 pm

That is a pretty good idea. The results are really nice.

Although I guess Doom purists would want an option to turn it off for various reasons.

To solve the overall darkness problem I guess you could implement a method of boosting the gamma or saturation of the brightest walls above the default levels. Could be tricky to get a nice result though.

dml · Post by **dml** » Mon Feb 25, 2013 3:40 pm

bullis1 wrote:That is a pretty good idea. The results are really nice.
Although I guess Doom purists would want an option to turn it off for various reasons.

Yes true - I turned it into a toggle bound to a key shortcut for now, just in case

bullis1 wrote:To solve the overall darkness problem I guess you could implement a method of boosting the gamma or saturation of the brightest walls above the default levels. Could be tricky to get a nice result though.

Extending the lighting range could work yes - some extra 'HDR'-like margin above the usual maximum level.

Many of the textures are quite dark even before I fiddle with shading stuff so it's difficult to light them without making lots of areas a bit darker as a side effect. Something like the above could solve that.

It looks like there's a global gamma control in the code already, set from the command line - although I don't remember much about it now. I'll need to review all the light level and depth cue stuff together so it all makes sense again...

Eero Tamminen · Post by **Eero Tamminen** » Mon Feb 25, 2013 4:56 pm

dml wrote:It looks like there's a global gamma control in the code already, set from the command line - although I don't remember much about it now.

I remember using that. I guess it's there because CRT monitors brightnesses could differ a lot, especially when they age, much more than for LCDs (which are typically too bright to start with, not too dark).

dml · Post by **dml** » Mon Feb 25, 2013 7:00 pm

Eero Tamminen wrote: I remember using that. I guess it's there because CRT monitors brightnesses could differ a lot, especially when they age, much more than for LCDs (which are typically too bright to start with, not too dark).

Yes I think it was added for large CRT/VGA gamma differences - not so much for adjusting light levels in the game. I think a palette-space saturation control will work fine alongside it.

dml · Post by **dml** » Mon Feb 25, 2013 7:25 pm

So first look at BM this evening I managed to resist the temptation to test more custom Doom2 PWADs designed for machines 100x faster than my Falcon, and relatively unimportant things like new prelighting fx and went back to fixing the ceiling/floor lighting to work with DSP textures & mipmaps - so it looks mostly correct.

The dynamic lighting in BM was never really correct - flickering lights etc. can turn into shadows, darker than surrounding areas. It was just a hack at the time, and some (wrong) guesswork - binding it to the Doom game code will result in correct light levels for those effects.

Sky still quite broken but it will have a special shader made anyway which works differently from other visplanes.

litflats.jpg

[EDIT] looks like somebody went to some map editing trouble to make a soft area lighting effect in this one!

pool.jpg

Atari-Forum

Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'

Re: Bad Mood : Falcon030 'Doom'