There is plenty of code tidying and several ugly DSP code-level optimizations that still need doing, but it should soon be in good enough shape to have some fun with the Doom code. I'm getting bored with staring at DSP code anyway.

Code: Select all
-p:03fc 0aa981 0003fc (06 cyc) jclr #1,x:$ffe9,p:$03fc 28.32% (384106, 2304636, 0)
+p:03fc 0aa981 0003fc (06 cyc) jclr #1,x:$ffe9,p:$03fc 28.27% (383462, 2300772, 0)
...
-p:072a 0aa980 00072a (06 cyc) jclr #0,x:$ffe9,p:$072a 0.01% (178, 1068, 0)
+p:072a 0aa980 00072a (06 cyc) jclr #0,x:$ffe9,p:$072a 0.03% (443, 2658, 0)
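For reference, lines in a dump like the one above can be pulled apart mechanically. A minimal Python sketch, assuming the line shape shown here (address, opcode, operand word, cycles, disassembly, percentage, and a parenthesised triplet of counters; the exact meaning of the triplet is the profiler's, not something this sketch guarantees):

```python
import re

# Matches one DSP profiler line, optionally prefixed with a diff marker (-/+).
LINE_RE = re.compile(
    r"^[-+]?p:(?P<addr>[0-9a-f]+)\s+(?P<opcode>[0-9a-f]+)\s+\S+\s+"
    r"\((?P<cyc>\d+) cyc\)\s+(?P<disasm>.+?)\s+"
    r"(?P<pct>[\d.]+)%\s+\((?P<counts>[\d, ]+)\)$"
)

def parse_dsp_line(line):
    """Return (address, percent, counter triplet) for one profiler line."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    counts = tuple(int(c) for c in m.group("counts").split(","))
    return m.group("addr"), float(m.group("pct")), counts

line = "-p:03fc 0aa981 0003fc (06 cyc) jclr #1,x:$ffe9,p:$03fc 28.32% (384106, 2304636, 0)"
print(parse_dsp_line(line))  # → ('03fc', 28.32, (384106, 2304636, 0))
```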
Code: Select all
$01fac8 : move.l #$80000000,d7 0.00% (56, 448, 0)
$01face : move.l #$40000000,d5 0.00% (56, 672, 56)
-$01fad4 : move.l d7,d3 0.02% (752, 3008, 2)
+$01fad4 : move.l d7,d3 0.02% (752, 3008, 0)
$01fad6 : mulu.l d3,d3,d4 0.02% (752, 36096, 0)
-$01fada : cmp.l d6,d4 0.02% (752, 3008, 57)
+$01fada : cmp.l d6,d4 0.02% (752, 3008, 56)
$01fadc : bgt.s $1faea 0.02% (752, 5344, 0)
Eero Tamminen wrote: CPU side profile differences are naturally larger, but probably still smaller than Hatari's Falcon emulation CPU cycle accuracy issues, so I wouldn't worry about that either.

While on the CPU side the effect is spread wider (due to interrupts happening at "random" places), at function level the effect actually isn't much larger, except for cache misses, where it can apparently sometimes exceed 0.1%:
Code: Select all
Executed instructions:
- 60.07% 237540 render_wall_1x1
+ 60.08% 237540 render_wall_1x1
13.60% 53790 clearlongs
- 10.13% 40038 render_flats_1x1
- 3.27% 3.28% 12920 stream_texture
- 2.50% 9870 dividing_node
- 1.80% 1.81% 7112 stack_visplane_area
- 1.27% 5018 segment_loop
- 1.15% 1.15% 4546 add_partition_segment
+ 10.13% 40034 render_flats_1x1
+ 3.27% 3.29% 12920 stream_texture
+ 2.48% 9814 dividing_node
+ 1.79% 1.81% 7096 stack_visplane_area
+ 1.27% 5022 segment_loop
...
Instruction cache misses:
- 15.17% 1928 render_wall_1x1
- 12.52% 1591 segment_loop
- 5.71% 5.71% 726 process_lighting
- 4.52% 574 memory_handle
- 3.87% 32.71% 492 add_wall_segment
- 3.68% 468 dividing_node
- 3.55% 451 seg_prelight_done
+ 14.90% 1890 render_wall_1x1
+ 12.54% 1591 segment_loop
+ 5.72% 5.72% 726 process_lighting
+ 4.53% 574 memory_handle
+ 3.89% 32.33% 494 add_wall_segment
+ 3.82% 485 dividing_node
+ 3.56% 451 seg_prelight_done
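The before/after comparison shown in the +/- listing above can also be scripted. A minimal sketch (the `diff_profiles` helper is hypothetical; the counts are copied from the executed-instructions listing above):

```python
def diff_profiles(before, after):
    """Return {symbol: after - before} for symbols whose counts differ."""
    syms = set(before) | set(after)
    return {s: after.get(s, 0) - before.get(s, 0)
            for s in syms
            if after.get(s, 0) != before.get(s, 0)}

# Executed-instruction counts taken from the two runs above.
before = {"render_wall_1x1": 237540, "render_flats_1x1": 40038, "dividing_node": 9870}
after  = {"render_wall_1x1": 237540, "render_flats_1x1": 40034, "dividing_node": 9814}
print(diff_profiles(before, after))  # only the symbols that changed
```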
Eero Tamminen wrote: While on the CPU side the effect is spread wider (due to interrupts happening at "random" places), on function level the effect actually isn't really larger, except for cache misses, where it can apparently sometimes be >0.1%. (This was for the 1st rendered frame, from 1st "r_begin" to next "r_begin".)

Aha - watch out for the 'clear_longs' which is clearing the framebuffer - it is configured to do this only 3 times (the first time each of the 3 backbuffers is visited), after which it will stop. I forgot to mention this.
dml wrote: Aha - watch out for the 'clear_longs' which is clearing the framebuffer - it is configured to do this only 3 times (first time each of the 3 backbuffers is visited) after which it will stop. I forgot to mention this. So it's probably better to measure from the 4th or 5th frame in, earliest.

Ok, I'm now profiling the following:
dml wrote: From the relative weight of the wall rendering, I assume this is an e4mX (e4m2?) level? Those levels have huge wall counts! Good for profiling.

Yes, the difference between Doom1 and Doom2 WAD costs is pretty radical.
Code: Select all
Executed instructions:
46.01% 48.27% 17762954 flat_generate_mips
17.76% 17.80% 6856544 flat_remap_mips
13.03% 5029082 render_patch_direct
5.75% 5.77% 2217984 correct_element
3.56% 3.59% 1376259 create_quick_alpha
2.18% 2.18% 840994 mip_plot_16bit
...
Instruction cache misses:
56.33% 73.52% 1232677 flat_generate_mips
16.47% 16.51% 360479 mip_plot_16bit
8.48% 8.74% 185577 flat_remap_mips
7.79% 170520 ROM_TOS
Code: Select all
Executed instructions:
44.83% 47.03% 25026563 flat_generate_mips
17.62% 17.65% 9833640 flat_remap_mips
9.90% 5526118 render_patch_direct
5.46% 5.48% 3050443 strcmp_8
4.91% 4.93% 2743296 correct_element
3.35% 3.87% 1869338 build_directory_hash
2.47% 2.48% 1376259 create_quick_alpha
2.12% 2.13% 1185037 mip_plot_16bit
...
Instruction cache misses:
53.89% 70.34% 1736922 flat_generate_mips
15.76% 15.80% 507944 mip_plot_16bit
8.13% 8.38% 261972 flat_remap_mips
6.07% 9.66% 195654 locate_entry_q
5.51% 177616 ROM_TOS
3.71% 3.82% 119487 strcmp_8
Code: Select all
Executed instructions:
44.01% 881624 render_wall_1x1
30.24% 605812 render_flats_1x1
5.16% 5.18% 103360 stream_texture
4.04% 4.08% 81022 stack_visplane_area
3.16% 63332 dividing_node
2.14% 2.14% 42788 add_partition_segment
1.94% 38818 segment_loop
...
Instruction cache misses:
13.04% 11316 segment_loop
12.48% 10829 render_wall_1x1
8.56% 8.62% 7432 process_lighting
4.67% 4051 dividing_node
4.31% 3744 build_ssector
4.20% 3648 memory_handle
3.80% 10.42% 3294 visplane_tryflush
3.59% 28.28% 3120 add_wall_segment
3.24% 2816 seg_prelight_done
2.93% 2541 render_flats_1x1
2.81% 3.14% 2441 stack_visplane_area
2.80% 17.03% 2428 render_wall
Code: Select all
Executed instructions:
72.85% 2187804 render_wall_1x1
9.46% 284228 render_flats_1x1
3.44% 3.46% 103360 stream_texture
2.93% 2.95% 87976 stack_visplane_area
2.78% 83582 dividing_node
1.37% 41096 segment_loop
...
Instruction cache misses:
16.21% 15737 render_wall_1x1
12.58% 12211 segment_loop
5.98% 5.99% 5808 process_lighting
4.73% 4592 memory_handle
4.05% 34.64% 3936 add_wall_segment
3.74% 3628 dividing_node
3.72% 3608 seg_prelight_done
3.14% 22.13% 3050 render_wall
Code: Select all
Used cycles:
37.88% 14009246 VPRenderPlaneDT_
24.67% 24.73% 9122644 perspected_column
10.03% 3707982 command_base
8.47% 3130786 SetTexture
4.27% 4.27% 1578496 extract_subvisplane
2.96% 1093216 AddLowerWall
2.67% 987888 AddUpperWall
1.80% 664928 AddMidWall
1.17% 433504 project_node
0.94% 347408 NodeInCone
0.91% 338014 R_ViewTestBufferSeg
Code: Select all
Used cycles:
53.62% 53.68% 24362428 perspected_column
13.55% 6155958 VPRenderPlaneDT_
10.09% 4586496 command_base
6.89% 3130834 SetTexture
3.81% 3.81% 1729756 extract_subvisplane
2.24% 1017024 AddLowerWall
1.97% 894368 AddUpperWall
1.52% 690128 AddMidWall
0.82% 373060 R_ViewTestBufferSeg
0.78% 354880 project_node
dml wrote: While looking at this stuff it's important to bear in mind that while an operation may dominate, it doesn't necessarily mean a bottleneck. It's quite complicated to profile concurrent processors - I've been focusing on the host port 'spin' loops on both sides to find the blockages, as it's the only thing that does not mislead.

A spinloop also consumes instructions, so if it's a bottleneck, it should be at least clearly visible in the profile, shouldn't it?
Eero Tamminen wrote: Spinloop also consumes instructions, so if it's a bottleneck, it should be at least clearly visible in the profile, shouldn't it?

Indeed, that's what I've been using it for.
Eero Tamminen wrote: If these spin/wait points are embedded into larger functions, it may help if you have separate labels before and after the spin points, and they clearly name what is being waited at that point.

I do have another tool which finds/highlights them automatically (a spreadsheet checks the ratio of spin-path vs exit-path encounters), but yes, labels could be used to help find them by other means or by eye.
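That spin-path/exit-path ratio check can be sketched in a few lines. A hedged example (the label names, counts, and threshold are all made up for illustration; the real tool is a spreadsheet):

```python
def spin_ratio(spin_count, exit_count):
    """Ratio of spin-path encounters to loop-exit encounters.

    A high ratio means the code mostly looped back on itself,
    i.e. it was waiting on the other processor - a likely blockage.
    """
    if exit_count == 0:
        return float("inf")
    return spin_count / exit_count

def flag_spinloops(counts, threshold=10.0):
    """counts: {label: (spin_path_count, exit_path_count)} -> suspect labels."""
    return [label for label, (spin, exit_) in counts.items()
            if spin_ratio(spin, exit_) >= threshold]

# Hypothetical encounter counts for two wait loops:
counts = {"wait_dsp_ready": (384106, 2500), "copy_done": (443, 400)}
print(flag_spinloops(counts))  # → ['wait_dsp_ready']
```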
dml wrote: I have also started working on the game code and refactoring BM to fit with that.

Sounds great! Whenever you have something runnable, I would be interested to profile it.
Eero Tamminen wrote: Yes, the difference between Doom1 and Doom2 WAD costs is pretty radical.

I forgot to mention that for some reason Doom1 shows up in BM in some "letterbox" display format (see the attached screenshot), unlike Doom2 WAD which showed up normally. That might explain some of the differences...
Eero Tamminen wrote: Sounds great! Whenever you have something runnable, I would be interested to profile it.

There are quite a few stages involved - most of it is getting the asm project reorganised for linking to C without all the 'WAD viewer' bits attached, and fixing all the interdependencies. But once I have them joined, linking and running, I'll let you know.
Eero Tamminen wrote: Are you now compiling the 68k assembly with Vasm and building the rest with gcc?

I've started reorganising for it, but I'm not there yet. I did some experiments with the scrolling STE demo to make sure it would work. I also have a strange problem loading LOD files directly at runtime, which is preventing it from working fully outside Hatari+Devpac (LOD2BIN is a TTP and must be run inside Hatari on each build of the DSP code :-z ). Probably just some XBios/DSP call ordering dependency I need to refresh my memory on - it works sometimes but not others.
Eero Tamminen wrote: For CPU code we could switch into using DRI/GST symbols in the binary itself, instead of exporting & importing the symbols separately to Hatari debugger. I guess Vasm etc. can still output a.out objects, GCC just needs to be told to link the final result into "traditional format" (haven't tried that myself yet though).

I found 'symbols prg' worked great when I last tried it the other day, using the new sources (I've started using the new profiler but haven't tried the postprocess/callgraph bits yet).
Eero Tamminen wrote: I forgot to mention that for some reason Doom1 shows up in BM in some "letterbox" display format (see the attached screenshot), unlike Doom2 WAD which showed up normally. That might explain some of the differences...

In fact it's just flat-shading the floor and ceiling in black, because the VP shader is disabled in that build. It's working ok.
dml wrote: Probably just some XBios/DSP call ordering dependency I need to refresh my memory on - it works sometimes but not others.

Hatari can give a trace of those calls with:
Code: Select all
--bios-intercept --trace xbios
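Once a trace log has been captured, it can be narrowed down to just the DSP-related XBIOS calls when hunting for an ordering dependency. A minimal sketch (the log lines here are invented for illustration; check the actual `--trace xbios` output for the real line format):

```python
def dsp_xbios_calls(trace_lines):
    """Pick out DSP-related XBIOS calls from a Hatari trace log.

    Assumes each relevant line mentions both 'XBIOS' and a 'Dsp_'
    call name; the exact log format is an assumption, not verified.
    """
    return [line for line in trace_lines
            if "XBIOS" in line and "Dsp_" in line]

# Invented log excerpt:
log = [
    "XBIOS Dsp_Lock()",
    "GEMDOS Fopen(...)",
    "XBIOS Dsp_LoadProg(...)",
]
print(dsp_xbios_calls(log))  # → ['XBIOS Dsp_Lock()', 'XBIOS Dsp_LoadProg(...)']
```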
dml wrote: fixed the code to run with the demo loops inside the commercial WAD files. This grab is from the Doom II 'clean port' running its demo loop in Hatari.

Does "clean port" mean that it doesn't yet have the BM stuff integrated into it? If yes, I wonder why you're getting frame skips (FS=1) with plain CPU code...
Eero Tamminen wrote: Does "clean port" mean that it doesn't yet have the BM stuff integrated into it?

Yes, it's just the original Doom source hacked up and built for TOS, but with some changes to allow it to run the attract mode and display graphics (which is important, since I have no keyboard input yet and it's the only other way to prove the game code is running properly on m68k - while it is portable, it is also full of thorns).
Eero Tamminen wrote: If yes, I wonder why you're getting frame skips (FS=1) with plain CPU code...

Under what circumstances should we see frame skips?
Eero Tamminen wrote: PS. You're quite naughty in posting these teaser pics, when you know how eager people are to test new versions (and in my case, profile them).
dml wrote: Yes it's just the original Doom source hacked up and built for TOS, but with some changes to allow it to run the attract mode and display graphics (which is important, since I have no keyboard input yet and it's the only other way to prove the game code is running properly on m68k - while it is portable, it is also full of thorns).

Hm. If you've had time to get rid of the gettimeofday() stuff featuring largely in the first linuxdoom TOS profile, profiling the CPU-only demo mode might also be interesting, to see whether there's something new visible, e.g. for the code adding player, gun & enemy sprites. Providing debug symbols requires just the right gcc linker flag, so getting them is now easy to automate.
Eero Tamminen wrote: If yes, I wonder why you're getting frame skips (FS=1) with plain CPU code...
dml wrote: Under what circumstances should we see frame skips?

Only when your PC is too slow to run emulation at full speed (if you were using fast-forward, it would be at maximum frame skip, which is by default 5, not at FS=1).
Eero Tamminen wrote: Hm. If you've had time to get rid of the gettimeofday() stuff featuring largely in the first linuxdoom TOS profile, profiling the CPU only demo mode might also be interesting to see whether there's something new visible, e.g. for code adding player, gun & enemy sprites.

I'll look at this later and see what's going on. It's probably being used as a portable fallback for realtime measurement, and is supposed to be replaced.
Eero Tamminen wrote: Providing debug symbols requires just the right gcc linker flag so getting them is now easy to automate.

I did try "-Wl,--traditional-format", followed by 'symbols prg TEXT DATA BSS' in the debugger, and that appeared to work.
Eero Tamminen wrote: Only when your PC is too slow to run emulation at full speed (if you would use fast-forward, it would be at maximum frame skip, which is by default 5, not at FS=1).

It could be a range of things - DSP emulation was on, this laptop isn't incredibly fast, and at the time I was using a remote login from another machine.
dml wrote: I did try "-Wl,--traditional-format", followed by 'symbols prg TEXT DATA BSS' in the debugger, and that appeared to work. However I have noticed that very few symbols from the program are visible in the debugger. I'm still looking into this.

What if you also build the code with "-g"? Probably without that you get only global symbols.
Eero Tamminen wrote: What if you also build the code with "-g"? Probably without that you get only global symbols.

In fact I get thousands of .Lxxxx local symbols and some global symbols even without -g. With -g I get an extra 500k on the executable, and I don't see any obvious difference in visible debugger symbols.
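If the flood of `.Lxxxx` compiler-generated labels gets in the way, they are easy to screen out of a symbol list before handing it to other tools. A minimal sketch (the symbol names here are invented for illustration):

```python
def drop_compiler_locals(symbols):
    """Drop GCC-generated local labels (names starting with '.L')."""
    return [s for s in symbols if not s.startswith(".L")]

syms = [".L123", "render_wall_1x1", ".L999", "main"]
print(drop_compiler_locals(syms))  # → ['render_wall_1x1', 'main']
```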
dml wrote: In fact I get thousands of .Lxxxx local symbols and some global symbols even without -g. With -g I get an extra 500k on the executable and I don't see any obvious difference in visible debugger symbols.

Could you mail the binary to me, with information about your compiler version and example(s) of some symbols that should and should not be visible?