Eero Tamminen wrote:Much better now!
Good
Eero Tamminen wrote:
16.76% 17.00% 18.91% 76562 77616 86381 _BM_P_CrossBSPNode
13.69% 13.71% 14.26% 62538 62610 65144 _R_PointInSubsector
7.63% 7.65% 70.16% 34848 34950 320403 _P_RunThinkers
5.05% 5.07% 30.15% 23049 23141 137669 _P_CheckPosition
4.47% 4.48% 4.48% 20423 20461 20461 _PIT_CheckThing
4.36% 4.38% 8.86% 19931 19983 40444 _P_BlockThingsIterator
4.18% 4.20% 12.50% 19080 19167 57063 _P_PathTraverse
That bunch is quite annoying, especially the first one, which is *raytracing* all over the map and gives me a headache. It's not really worth trying to optimize it much more, as it is already half on the DSP and half in the 030 instruction cache. There is one last bit that might be DSP'd, but I'm pretty sure the biggest problem with it is just being over-used from too many random places.
The best fix is the one that's already implemented through TIMEBASE_CONTROL >= 3 (which you obviously can't profile since it breaks demo replay), in effect managing the AI's use of it to a sensible level. No code optimization is going to beat that.
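To illustrate the general idea of managing an expensive call from the level above, here is a minimal sketch in C. It is not the actual TIMEBASE_CONTROL code: the per-tic budget, `SIGHT_CHECKS_PER_TIC`, `sight_budget_t` and every function name below are hypothetical, and the "expensive" check is only a placeholder for the real BSP traversal. Callers that arrive after the budget is spent get the last cached answer instead of triggering another traversal, which is also why this kind of throttling changes game behaviour and would not survive demo replay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIGHT_CHECKS_PER_TIC 4   /* hypothetical budget, not a real setting */

typedef struct {
    int remaining;        /* expensive checks still allowed this tic */
    int expensive_calls;  /* how many times the costly path actually ran */
} sight_budget_t;

/* Refill the budget at the start of each game tic. */
static void budget_new_tic(sight_budget_t *b)
{
    assert(b != NULL);
    b->remaining = SIGHT_CHECKS_PER_TIC;
}

/* Placeholder for the costly BSP line-of-sight traversal (hypothetical). */
static bool expensive_check_sight(sight_budget_t *b, int from, int to)
{
    b->expensive_calls++;
    return ((from + to) & 1) == 0;
}

/* Run the costly check only while budget remains; otherwise reuse the
 * caller's previously cached result. */
static bool budgeted_check_sight(sight_budget_t *b, int from, int to,
                                 bool *cached)
{
    assert(b != NULL && cached != NULL);
    if (b->remaining <= 0)
        return *cached;
    b->remaining--;
    *cached = expensive_check_sight(b, from, to);
    return *cached;
}

/* Simulate one tic in which `requests` callers all ask for a sight check.
 * Returns the number of expensive traversals that actually ran. */
static int simulate_tic(int requests)
{
    sight_budget_t budget = { 0, 0 };
    bool cached = false;

    budget_new_tic(&budget);
    for (int i = 0; i < requests; i++)
        budgeted_check_sight(&budget, i, 0, &cached);
    return budget.expensive_calls;
}
```

With this sketch, ten requests in one tic cost only four traversals, however many callers there are; the cost is bounded by the budget rather than by the number of call sites.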
Some of the others I still have to look at but as you can see the distribution is very flat, around 4-5% per item - so once again they are best squashed through management at the level above. I'll rewrite R_PointInSubsector and inspect the others for opportunities.
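A rough way to see why a flat profile resists per-function optimization is Amdahl's law. The sketch below is only illustrative arithmetic: the 5% share is taken from the "4-5% per item" figure above, while the 2x local speedup is an assumed number, and `overall_speedup` is a name made up for this example.

```c
/* Amdahl's law: overall speedup when a fraction `share` of the total
 * runtime is made `factor` times faster and the rest is unchanged. */
static double overall_speedup(double share, double factor)
{
    return 1.0 / ((1.0 - share) + share / factor);
}
```

For example, `overall_speedup(0.05, 2.0)` comes out at about 1.026, so doubling the speed of one 5% item buys only around 2.5% overall. Calling several such items less often from the level above removes their cost at the source instead.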
Eero Tamminen wrote:
Worst rendering, CPU side:
Executed instructions:
25.22% 116472 R_SpriteColumnShader_Masked2
12.47% 57580 R_AdvanceSurface_NMip0
10.47% 48341 R_VisPlaneSkyShader
7.07% 32651 R_AdvanceSurface_TMip3
6.57% 6.57% 6.57% 30348 30368 30368 _BM_A_Mux3x2
3.50% 3.51% 3.51% 16180 16211 16211 stream_texture
3.50% 16164 R_VisPlaneShader
3.34% 15421 R_AdvanceSurface_TMip2
2.90% 13416 R_AdvanceSurface_TMip0
2.88% 13320 R_AdvanceSurface_NMip3
2.30% 2.33% 2.33% 10628 10762 10762 stack_visplane_area
Well, that list is a bit easier to handle, because the only ones I can really optimize more are the one at the bottom and the sky shader. The rest are all near their limits already for the way they currently work.
Unfortunately several of the more recent optimizations to those areas are data-cache oriented (mipmaps + packed palettes -> locality of reference), so you won't see that in your profiles until Hatari supports data cache! This is especially true for sprites near the player - the bigger they get the more the data cache speeds up drawing. So Hatari is exaggerating the cost for these in the 'worst' frame profile. Not much we can do about that one.
Eero Tamminen wrote:
...
Instruction cache misses:
15.67% 15.68% 16.38% 4337 4340 4533 R_AddSpriteSpans
7.31% 7.35% 7.35% 2023 2034 2034 R_ViewTestSpriteLines
Those still need attention but not before I fix the bugs.
Eero Tamminen wrote:
At least with TIMERBASE_CONTROL=1, GCC 2.x and Hatari WinUAE CPU core timings, thinking can take more instructions in Doom II timedemo than rendering.
As getting worst frame timings for Doom II timedemo is now trivial, you might want to do it also for GCC 4.x build...

Yes I'll do a test with gcc4 to see what's different.
Thanks for the results