Eero Tamminen wrote:
I'm assuming the extra "cache misses" are extra instructions that get read into cache at the same time...?
If my assumption is right, I could also mark that in the profiler as a single cache miss for the instruction at the PC address. Which one would you prefer?
I suspect that the *average* miss rate per instruction is the most useful bit of information. If only the peak miss rate is recorded the analysis will be misleading for a programmer.
I wasn't clear about what I was asking. It was about what should happen on *each* executed instruction: should I record just that there was a miss, or how many misses executing the current instruction at this point (once) incurred? I.e. should I increase the sum of misses for the given address by 0-1 or by 0-3?
Also, does the DSP have any kind of cache / cache misses?
(At least Hatari DSP emulation doesn't support such a thing, but if it did, adding it to the profiler would be trivial.)
BTW this is something that might need attention in WinUAE. Is it possible that WinUAE is assuming burst mode all the time? In this case I would expect 0-4 cache misses (peak) but 0-1 cache miss (average) per instruction, since burst mode is a kind of extended prefetch purely for cache filling, with a limit of 4. I think the burst bits are usually off on the Falcon, but if enabled WinUAE might emulate incorrectly. It's also worth noting that a 'miss' on the 68030 implies a longword fetch, i.e. 2 individual 16-bit memory fetches on the Falcon. Normally it would be just one fetch - another thing WinUAE might not know about!
What about TT, does that also have burst mode normally off?
Latter point affects just timings, right? Anyway, that's more for Laurent than me...
Assuming burst mode is off, there should be a maximum of 1 miss per longword per instruction. So if an instruction requires 1 word, that's 1 miss max. If a lengthy instruction needs 2 longwords that could be 2 misses, but if the same instruction is not longword-aligned that would be 3 misses. This is more likely the reason. In your experiments, is it the case that most instructions have 0-1 misses, but very occasional larger instructions have 2 or 3? I think it would be worth checking the size and alignment of the 3-miss cases.
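To make the alignment argument concrete, here is a minimal sketch that counts how many longwords an instruction spans, which under the burst-off assumption above is the upper bound on misses for one execution. The function name and the example addresses are made up for illustration; this is not code from Hatari or WinUAE.

```python
def max_cache_misses(addr: int, size: int) -> int:
    """Upper bound on instruction-cache misses for one instruction.

    Assumption (from the discussion above): burst mode is off, so each
    longword (4 bytes) the instruction touches can miss at most once.
    addr is the instruction address, size its length in bytes.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    first_longword = addr // 4
    last_longword = (addr + size - 1) // 4
    return last_longword - first_longword + 1


# Hypothetical addresses, chosen only to show the alignment effect.
print(max_cache_misses(0x1000, 2))  # single word: 1
print(max_cache_misses(0x1000, 8))  # two longwords, aligned: 2
print(max_cache_misses(0x1002, 8))  # same length, not longword-aligned: 3
```

Checking the 3-miss addresses from the profile against this bound would show whether size and alignment alone can explain them.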
After running this 4kB demo for a while:
I get the following statistics:
- 0: 29282934
- 1: 43281
- 2: 3323
- 3: 6510
The first number is the number of misses and the second is the number of executed instructions that had that number of misses.
Does that look reasonable?
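As a quick sanity check on those numbers, the histogram can be reduced to a total and an overall average. This is only a sketch using the four counts quoted above; the function name is made up.

```python
# Miss-count histogram as quoted above: misses -> executed instructions.
histogram = {0: 29282934, 1: 43281, 2: 3323, 3: 6510}


def summarize(hist):
    """Return (executed instructions, total misses, average misses per instruction)."""
    total_instructions = sum(hist.values())
    total_misses = sum(misses * count for misses, count in hist.items())
    return total_instructions, total_misses, total_misses / total_instructions


instructions, misses, average = summarize(histogram)
print(instructions)  # 29336048
print(misses)        # 69457
print(f"{average:.4f} misses per executed instruction")
```

So the overall average is roughly 0.0024 misses per executed instruction, i.e. nearly everything executes from cache.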
Profile results look like this (also with the patch in earlier mail):
> profile misses 8
0x01d1b0 11.85% 8233
0x01d1c0 7.70% 5346
0xe03cac 3.94% 2737
0xe1065c 3.46% 2406
0xe1c288 3.46% 2406
0xe03ca0 2.31% 1604
0xe03ca2 2.31% 1604
0xe356f0 2.31% 1604
8 CPU addresses listed.
> d 0x01d1aa
$01d1aa : 3c3c 0031 move.w #$31,d6 0.00% (223, 0, 0)
$01d1ae : 2448 movea.l a0,a2 0.04% (11120, 0, 0)
$01d1b0 : 2649 movea.l a1,a3 0.04% (11120, 0, 8233)
$01d1b2 : 41e8 0258 lea $258(a0),a0 0.04% (11120, 0, 4)
$01d1b6 : 43e9 fda8 lea $fda8(a1),a1 0.04% (11120, 0, 0)
$01d1ba : 3e3c 0063 move.w #$63,d7 0.04% (11120, 0, 0)
> profile addresses
$e0f6b6 : add.w d4,d4 0.00% (1, 0, 1)
$e0f6b8 : movea.l a1,a2 0.00% (1, 0, 3)
$e0f6ba : adda.w d4,a2 0.00% (1, 0, 0)
$e0f6bc : move.w (a1)+,(a0)+ 0.00% (16, 0, 2)
$e0f6be : move.w (a2)+,(a0)+ 0.00% (16, 0, 0)
$e0f6c0 : move.w (a1)+,(a0)+ 0.00% (16, 0, 1)
$e0f6c2 : move.w (a2)+,(a0) 0.00% (16, 0, 0)
$e0f6c4 : adda.w d3,a5 0.00% (16, 0, 1)
$e0f6c6 : movea.l a5,a0 0.00% (16, 0, 0)
$e0f6c8 : dbra d2,$e0f6bc 0.00% (16, 0, 1)
$e0f6cc : rts 0.00% (1, 0, 0)
$e1054a : rts 0.00% (802, 0, 0)
$e1065a : subq.l #4,sp 0.00% (802, 0, 683)
$e1065c : movem.l d0-d7/a0-a6,-(sp) 0.00% (802, 0, 2406)
$e10660 : movea.l $e4d17e,a1 0.00% (802, 0, 802)
$e10666 : movea.l $ffbe(a1),a0 0.00% (802, 0, 0)
$e1066a : move.l $ffc2(a1),$3c(sp) 0.00% (802, 0, 0)
$e10670 : jsr (a0) 0.00% (802, 0, 802)
$e10672 : movem.l (sp)+,d0-d7/a0-a6 0.00% (802, 0, 1230)
$e10676 : rts 0.00% (802, 0, 0)
$e1b67e : link a6,#$fffe 0.00% (1, 0, 0)
$e1b682 : move.l #$3000,d0 0.00% (1, 0, 0)
$e1b688 : unlk a6 0.00% (1, 0, 1)
$e1b68a : rts
Numbers in parentheses are instruction count, cycle count (buggy in the WinUAE core, so always 0), and sum of misses for that address.
Anyway back to the question - for maximum usefulness I think *average* misses per instruction (0.0-3.0?) is the best overall summary (i.e. total misses for instruction / total incidence of instruction over entire session).
Total summed misses per instruction over the entire session is probably also quite useful as it would let you see whether a recognized hotspot was caused by cache misses or just by incidence at a glance. Whether it's better just to show the large 'sum' itself or to show it as a relative cost (total misses per instruction / total misses from whole program over whole session) I don't know - whatever makes it easier to see a cache-miss hotspot against the background of other code.
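Both figures are cheap to derive from the per-address counters. A rough sketch, using a few rows from the disassembly above and assuming the session total is the 69457 misses implied by the miss statistics (1*43281 + 2*3323 + 3*6510); the function and variable names are hypothetical:

```python
def miss_stats(executions, misses, total_misses):
    """Return (average misses per execution, percent of all misses)."""
    average = misses / executions if executions else 0.0
    share = 100.0 * misses / total_misses if total_misses else 0.0
    return average, share


# Assumed session total, derived from the miss statistics quoted earlier.
TOTAL_MISSES = 69457

# (address, executions, summed misses) taken from the profile output above.
rows = [
    (0x01D1B0, 11120, 8233),
    (0xE1065A, 802, 683),
    (0xE1065C, 802, 2406),
    (0xE10660, 802, 802),
]

for addr, executions, misses in rows:
    average, share = miss_stats(executions, misses, TOTAL_MISSES)
    print(f"${addr:06x}  avg {average:.2f}  {share:5.2f}% of all misses")
```

The average shows how badly a single instruction behaves each time it runs, while the share shows how much it matters over the whole session, so an instruction that rarely runs but always misses does not get confused with a real hotspot.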
The code in the patch linked above shows both the percentage of the total and the sum of the 0-3 miss values per executed instruction.