DSP profiling extension

A forum about the Hatari ST/STE/Falcon emulator - the current version is v2.1.0



DSP profiling extension

Postby dml » Thu Feb 27, 2014 2:46 pm

Some time ago I made a small patch for Hatari (v1.6) which changed the output format for profiling information gathered from the DSP. I have recently redone the patch to work with Hatari v1.7. It is linked below:

https://dl.dropboxusercontent.com/u/129 ... fext2.diff

The reason for this was a need to inspect instruction-level timings for optimization purposes. This is a bit different from the intended use case for the Hatari profiler, which generates a compact format subsequently digested by a python script, which in turn produces a classic, high-level profile view of program activity (i.e. function-level and callgraph profiling). This is very useful for large programs with lots of bits of code and complex flow control.

The original 'raw' format is however a bit lacking in detail for using the profile data directly and spot-optimizing instruction sequences, so I made some changes to make this work better for me.

What extra info does the patch show?

- more precise measurements, to 5 decimals (versus 2)
- shows execution cycles as a percentage of total runtime (versus total hits as a percentage, which is not quite the same)
- shows external access penalties (a sometimes unpredictable cost caused by conflicts when accessing external DSP RAM more than once per op)
- shows average cycle counts and average penalties over profiling session (instead of just the last cost measured)
- shows ops with varying execution time
- fields are prefixed with <columncode>: to make them easier to scan, and easier to import into Excel or parse with Python

DISCLAIMER: my changes are incompatible with the hatari_profile.py and hatari_spinloop.py scripts - so you can't generate normal callgraphs or profiled function lists with my patch applied. All you can do is look at instructions and their costs directly. If you want to use both kinds of view, you'll need to build two versions of Hatari, as I did. I don't use this special version very often but it has turned out to be very useful for some kinds of optimization work.

An example of the difference in output before/after the patch:


before (unpatched version):

p:0271  5eec79         (06 cyc)  tfr y1,b y:(r4+n4),a        0.04% (130576, 783456, 0)

after (patched version):

p:0267  4dec45         (6:2)  cmp x0,a y:(r4+n4),x1        u:00.04252% c:00.05813% (U:   52102 avC:6.00 avE:2.00 V:0)

A typical profile packet in the new format looks like this:

<address> <opwords> (X:Y) <opcode> u:00.00189% c:00.00258% (U: 2315 avC:6.00 avE:0.00 V:0)

u: instruction usage as a percentage of total profiled instructions
c: cycle usage as a percentage of total profiled time
U: actual (absolute) usage for this instruction (total hits)
avC: measured average cycles for this instruction
avE: measured average external memory penalty for this instruction
V: peak variance seen, in cycles (e.g. 2 = execution time varied by 2 cycles, worst case, during the profile run). Note that varying execution time can also produce non-integer averaged cycle measurements for an instruction.
X: cycles taken by the *last* visit to this instruction
Y: external penalty cycles taken by the *last* visit to this instruction
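
For anyone who wants to pull these fields into Python rather than Excel, here is a minimal parsing sketch. It is not part of the patch or of Hatari: the regular expression is derived only from the single patched sample line shown above, so real output may need adjustments (for example for two-word instructions), and the names `ProfileLine` and `parse_line` are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class ProfileLine(NamedTuple):
    """One instruction record in the patched output (hypothetical container)."""
    address: int        # p: address of the instruction
    opwords: str        # raw instruction word(s) as hex text
    last_cycles: int    # X: cycles taken by the last visit
    last_ext: int       # Y: external penalty cycles of the last visit
    opcode: str         # disassembled instruction text
    usage_pct: float    # u: share of total profiled instructions
    cycle_pct: float    # c: share of total profiled time
    hits: int           # U: absolute hit count
    avg_cycles: float   # avC: average cycles
    avg_ext: float      # avE: average external memory penalty
    variance: int       # V: peak variance in cycles


# Assumption: layout inferred from the one sample line in this thread.
_PATTERN = re.compile(r"""
    p:(?P<address>[0-9a-fA-F]+)\s+
    (?P<opwords>[0-9a-fA-F]+(?:\s+[0-9a-fA-F]+)*)\s+
    \((?P<last_cycles>\d+):(?P<last_ext>\d+)\)\s+
    (?P<opcode>.+?)\s+
    u:(?P<usage_pct>[\d.]+)%\s+
    c:(?P<cycle_pct>[\d.]+)%\s+
    \(U:\s*(?P<hits>\d+)\s+
    avC:(?P<avg_cycles>[\d.]+)\s+
    avE:(?P<avg_ext>[\d.]+)\s+
    V:(?P<variance>\d+)\)
""", re.VERBOSE)


def parse_line(line: str) -> Optional[ProfileLine]:
    """Return a ProfileLine, or None if the line is not in the patched format."""
    m = _PATTERN.search(line)
    if m is None:
        return None
    g = m.groupdict()
    return ProfileLine(
        address=int(g["address"], 16),
        opwords=g["opwords"],
        last_cycles=int(g["last_cycles"]),
        last_ext=int(g["last_ext"]),
        opcode=g["opcode"],
        usage_pct=float(g["usage_pct"]),
        cycle_pct=float(g["cycle_pct"]),
        hits=int(g["hits"]),
        avg_cycles=float(g["avg_cycles"]),
        avg_ext=float(g["avg_ext"]),
        variance=int(g["variance"]),
    )


if __name__ == "__main__":
    sample = ("p:0267  4dec45         (6:2)  cmp x0,a y:(r4+n4),x1        "
              "u:00.04252% c:00.05813% (U:   52102 avC:6.00 avE:2.00 V:0)")
    rec = parse_line(sample)
    print(rec)
    # hits * avg_cycles gives a rough total cycle cost for ranking hot spots
    print(rec.hits * rec.avg_cycles)
```

Lines from an unpatched build (the `(06 cyc)` style) simply return `None`, so a mixed log can be filtered without special handling.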


Re: DSP profiling extension

Postby dhedberg » Thu Feb 27, 2014 5:02 pm

Cool stuff, and really useful for some type of optimizations! Thanks for sharing!
Daniel, New Beat - http://newbeat.atari.org


Re: DSP profiling extension

Postby dml » Fri Feb 28, 2014 11:50 am

You're welcome, maybe it will find a use in other hands :)

One of the things I did with this is set up a spreadsheet with prepared rules to extract things from the output, such as automatically finding hostport handshake bottlenecks on the CPU or DSP side, and the type of bottleneck (blocked on reading or writing) (***). It also has some stuff to sum instances of instructions using a specific absolute address or immediate constant, so you can count the number of times a long immediate value is referenced, to help judge whether the item needs to be moved into short-address memory.

By using this post-process step I was able to annotate (e.g. [num_hits:num_references]) all of the common constants and var references in the disasm code and optimize the most frequently used ones into the low $00-$3f area for fast access and/or to reduce code size:


p_column_rout:      dc   R_DoColumnPerspCorrect         ; [040902:06] pointer to AddWall column routine
p_endwall_rout:      dc   R_EndAddSurface            ; [003094:03] pointer to AddWall completion routine
c_one:         dc   $000001               ; [139635:16] quick-immediate #>1,??
c_viewplane:      dc   ZMIN               ; [008676:02] clipping plane z
c_pnorm:      dc   1<<((6+7+8+8)-24-1)         ; [011593:01] evaluated constant for mapping function
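
The [num_hits:num_references] annotations in the listing above came from a spreadsheet, but the same tally could be sketched in Python roughly as follows. Everything here is an assumption for illustration: the function names are hypothetical, the input is taken to be (opcode text, hit count) pairs already extracted from the profile, and the `#$...` / `#>$...` immediate syntax in the regex is a guess at how the disassembly prints long immediates, so adjust it to whatever your output actually contains.

```python
import re
from collections import defaultdict

# Assumption: immediates appear in the disassembly as #$xxxxxx, #>$xxxxxx or #<$xx.
IMMEDIATE_RE = re.compile(r"#[<>]?\$([0-9a-fA-F]+)")


def tally_immediates(records):
    """Sum hits and count referencing instructions per immediate value.

    records: iterable of (opcode_text, hits) pairs.
    Returns {value: (total_hits, num_references)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for opcode, hits in records:
        for m in IMMEDIATE_RE.finditer(opcode):
            value = int(m.group(1), 16)
            totals[value][0] += hits   # dynamic usage: how often it executed
            totals[value][1] += 1      # static usage: how many code sites use it
    return {value: (t[0], t[1]) for value, t in totals.items()}


def annotate(hits, refs):
    """Format a tally in the [num_hits:num_references] style used in the listing."""
    return f"[{hits:06d}:{refs:02d}]"


if __name__ == "__main__":
    # Made-up records, not real profile data.
    records = [
        ("move #>$000100,x0", 10000),
        ("mpy x0,y0,a", 500),
        ("move #$000100,y1", 1593),
    ]
    for value, (hits, refs) in sorted(tally_immediates(records).items()):
        print(f"c_{value:06X}: {annotate(hits, refs)}")
```

Sorting the result by total hits gives a quick shortlist of constants worth moving into the low $00-$3f area.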

One of the nice things about this is that you can find the same constant aliased for different jobs in the disassembly and give it multiple names, while storing it only once in low memory for fast access:


rshft15:                              ; shift right 15 = ($000100 * v).high
lshft9:                              ; shift left 9 = ($000100 * v).low
c_000100:      dc   $000100               ; [011593:01]

c_0000FF:      dc   $0000FF

c_00FFFF:      dc   $00FFFF

Note for any newcomers to DSP code: this kind of thing is meaningful only because the DSP is 'funny' with its use of immediate data. It's often cheaper to read a value from a variable in low memory than to load the value directly as immediate data, because right-justified (or large) constants can cost an extra instruction word, whereas reading from a low address can use compact short addressing and costs nothing...

(***) Eero recently introduced 'spinloop' autodetection and profiling in Hatari which makes some of this redundant. There are still some small benefits to post-processing these yourself - like identifying handshakes which don't wait at all, and might be optimized away - but the rest can now be done with Hatari itself.
