Got some sensible output from the 68882-optimizer tool last night:
This is the original sequence of FPU operations generated by GCC, disassembled. It implements a 3x3 matrix by 3-vector transform.
;$0356b6 : f213 4480 fmove.s 0(a3),fp1
;$0356ba : f22b 4400 0004 fmove.s 4(a3),fp0
;$0356c0 : f22b 4600 0008 fmove.s 8(a3),fp4
;
;$0356c6 : f228 4500 005c fmove.s $5c(a0),fp2
;$0356cc : f200 0523 fmul.x fp1,fp2
;$0356d0 : f228 4580 0068 fmove.s $68(a0),fp3
;$0356d6 : f200 01a3 fmul.x fp0,fp3
;$0356da : f200 0d22 fadd.x fp3,fp2
;$0356de : f228 4580 0074 fmove.s $74(a0),fp3
;$0356e4 : f200 11a3 fmul.x fp4,fp3
;$0356e8 : f200 0d22 fadd.x fp3,fp2
;$0356ec : f22f 6500 0104 fmove.s fp2,$104(sp)
;
;$0356f2 : f228 4500 0060 fmove.s $60(a0),fp2
;$0356f8 : f200 0523 fmul.x fp1,fp2
;$0356fc : f228 4580 006c fmove.s $6c(a0),fp3
;$035702 : f200 01a3 fmul.x fp0,fp3
;$035706 : f200 0d22 fadd.x fp3,fp2
;$03570a : f228 4580 0078 fmove.s $78(a0),fp3
;$035710 : f200 11a3 fmul.x fp4,fp3
;$035714 : f200 0d22 fadd.x fp3,fp2
;$035718 : f22f 6500 0108 fmove.s fp2,$108(sp)
;
;$03571e : f228 4500 0064 fmove.s $64(a0),fp2
;$035724 : f200 0523 fmul.x fp1,fp2
;$035728 : f228 4480 0070 fmove.s $70(a0),fp1
;$03572e : f200 00a3 fmul.x fp0,fp1
;$035732 : f200 0522 fadd.x fp1,fp2
;$035736 : f228 4400 007c fmove.s $7c(a0),fp0
;$03573c : f200 1023 fmul.x fp4,fp0
;$035740 : f200 0122 fadd.x fp0,fp2
;$035744 : f22f 6500 010c fmove.s fp2,$10c(sp)
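
For reference, here is what the sequence above appears to compute, as far as I can read the offsets: the vector comes from 0/4/8(a3), and each output sums matrix entries 12 bytes apart ($5c/$68/$74, then $60/$6c/$78, then $64/$70/$7c). A minimal Python sketch under that reading; the names `transform`, `m` and `v` are made up for illustration, and Python doubles stand in for the FPU's extended precision, so the rounding will not match the real thing:

```python
def transform(m, v):
    """Reference for the disassembled sequence (illustrative only).

    m: 9 floats in the order they sit in memory from $5c(a0) onwards.
    v: 3 floats in the order they sit in memory from 0(a3) onwards.
    Output j sums v[i] * m[3*i + j], matching the 12-byte stride
    between the matrix loads feeding each result.
    """
    return [v[0] * m[j] + v[1] * m[3 + j] + v[2] * m[6 + j] for j in range(3)]


if __name__ == "__main__":
    identity = [1, 0, 0, 0, 1, 0, 0, 0, 1]
    print(transform(identity, [2.0, 3.0, 4.0]))
```

That is 9 multiplies and 6 adds, which matches the 9 fmul.x and 6 fadd.x in the listing.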
The first tool listing is its interpretation of the timing. Each row shows the documented timings (left column), the in-context timings after taking head/tail overlaps and dependencies into account (middle column), and the total in-context cycles for that operation [in square brackets]. The s=N field gives the stall cycles caused by dependencies, which could otherwise be overlapped.
fmove.s src,fp1 (21:0:0) (21+0+0) [21] s=0
fmove.s src,fp0 (21:0:0) (21+0+0) [21] s=0
fmove.s src,fp4 (21:0:0) (21+0+0) [21] s=0
fmove.s src,fp2 (21:0:0) (21+0+0) [21] s=0
fmul.x fp1,fp2 (17:4:55) (17+4+17) [38] s=0
fmove.s src,fp3 (21:0:0) (21+0+0) [21] s=0
fmul.x fp0,fp3 (17:4:55) (17+4+55) [76] s=0
fadd.x fp3,fp2 (17:4:35) (17+4+0) [21] s=17
fmove.s src,fp3 (21:0:0) (21+0+0) [21] s=0
fmul.x fp4,fp3 (17:4:55) (17+4+55) [76] s=0
fadd.x fp3,fp2 (17:4:35) (17+4+0) [21] s=17
fmove.s fp2,dst (21:0:0) (21+0+0) [21] s=21
fmove.s src,fp2 (21:0:0) (21+0+0) [21] s=0
fmul.x fp1,fp2 (17:4:55) (17+4+17) [38] s=0
fmove.s src,fp3 (21:0:0) (21+0+0) [21] s=0
fmul.x fp0,fp3 (17:4:55) (17+4+55) [76] s=0
fadd.x fp3,fp2 (17:4:35) (17+4+0) [21] s=17
fmove.s src,fp3 (21:0:0) (21+0+0) [21] s=0
fmul.x fp4,fp3 (17:4:55) (17+4+55) [76] s=0
fadd.x fp3,fp2 (17:4:35) (17+4+0) [21] s=17
fmove.s fp2,dst (21:0:0) (21+0+0) [21] s=21
fmove.s src,fp2 (21:0:0) (21+0+0) [21] s=0
fmul.x fp1,fp2 (17:4:55) (17+4+17) [38] s=0
fmove.s src,fp1 (21:0:0) (21+0+0) [21] s=0
fmul.x fp0,fp1 (17:4:55) (17+4+55) [76] s=0
fadd.x fp1,fp2 (17:4:35) (17+4+0) [21] s=17
fmove.s src,fp0 (21:0:0) (21+0+0) [21] s=0
fmul.x fp4,fp0 (17:4:55) (17+4+55) [76] s=0
fadd.x fp0,fp2 (17:4:35) (17+4+35) [56] s=17
fmove.s fp2,dst (21:0:0) (21+0+0) [21] s=21
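
The rows of the listing above are regular enough to check mechanically. In every row shown, the bracketed total equals the sum of the three middle-column terms, and the bracketed totals of the whole listing add up to 1046, the starting cost in the optimizer log. A small Python sketch of that check follows; `parse_line` and `total_cycles` are names invented here and not part of the tool, and the three fields in each column are left unnamed because their exact meaning is not spelled out:

```python
import re

# One row of the tool's output, e.g.
#   fmul.x fp1,fp2 (17:4:55) (17+4+17) [38] s=0
LINE_RE = re.compile(
    r"^(?P<op>\S+)\s+(?P<args>\S+)\s+"
    r"\((?P<doc>\d+:\d+:\d+)\)\s+"      # documented timings
    r"\((?P<ctx>\d+\+\d+\+\d+)\)\s+"    # in-context timings
    r"\[(?P<total>\d+)\]\s+"            # total in-context cycles
    r"s=(?P<stall>\d+)$"                # stall cycles
)


def parse_line(line):
    """Split one listing row into its fields (hypothetical helper)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised row: {line!r}")
    return {
        "op": m["op"],
        "args": m["args"],
        "doc": tuple(int(x) for x in m["doc"].split(":")),
        "ctx": tuple(int(x) for x in m["ctx"].split("+")),
        "total": int(m["total"]),
        "stall": int(m["stall"]),
    }


def total_cycles(lines):
    """Sum the bracketed per-row totals, checking each against its middle column."""
    total = 0
    for line in lines:
        row = parse_line(line)
        if sum(row["ctx"]) != row["total"]:
            raise ValueError(f"middle column does not sum to total: {line!r}")
        total += row["total"]
    return total


if __name__ == "__main__":
    sample = [
        "fmove.s src,fp2 (21:0:0) (21+0+0) [21] s=0",
        "fmul.x fp1,fp2 (17:4:55) (17+4+17) [38] s=0",
        "fmove.s src,fp3 (21:0:0) (21+0+0) [21] s=0",
    ]
    print(total_cycles(sample))  # 21 + 38 + 21 = 80
```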
Then the optimization pass:
cost reduced 1046 -> 1043
cost reduced 1043 -> 1040
cost reduced 1040 -> 1037
cost reduced 1037 -> 1020
cost reduced 1020 -> 1003
cost reduced 1003 -> 986
cost reduced 986 -> 969
cost reduced 969 -> 966
cost reduced 966 -> 949
cost reduced 949 -> 932
cost reduced 932 -> 929
cost reduced 929 -> 928
cost reduced 928 -> 927
cost reduced 927 -> 924
cost reduced 924 -> 921
cost reduced 921 -> 904
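
A log like the one above is what you would expect from an iterative search that keeps any reordering which lowers the modelled cost. Purely as an illustration of that general idea, here is a tiny hill-climbing sketch in Python. It is not the tool's algorithm or its cost model: the `Op` type, the `cost` function and the cycle numbers in the example are all assumptions made up for this sketch, and the toy model only knows about issue time and result latency, not 68882 head/tail overlap.

```python
from typing import NamedTuple, Tuple


class Op(NamedTuple):
    name: str
    dst: str                # register (or "mem") written
    srcs: Tuple[str, ...]   # registers read
    issue: int              # cycles before the next op may start (toy value)
    latency: int            # cycles until dst is usable (toy value)


def cost(seq):
    """Toy in-order model: an op starts once the previous op has issued
    and all of its source registers are ready."""
    ready = {}
    t = 0
    for op in seq:
        start = max([t] + [ready.get(r, 0) for r in op.srcs])
        ready[op.dst] = start + op.latency
        t = start + op.issue
    return max([t] + list(ready.values()))


def independent(a, b):
    """True if adjacent ops a, b can swap without breaking a dependency."""
    return a.dst not in b.srcs and b.dst not in a.srcs and a.dst != b.dst


def optimize(seq, log=print):
    """Greedy adjacent-swap search: keep a swap only if it lowers the cost."""
    seq = list(seq)
    best = cost(seq)
    improved = True
    while improved:
        improved = False
        for i in range(len(seq) - 1):
            if not independent(seq[i], seq[i + 1]):
                continue
            seq[i], seq[i + 1] = seq[i + 1], seq[i]
            c = cost(seq)
            if c < best:
                log(f"cost reduced {best} -> {c}")
                best = c
                improved = True
            else:
                seq[i], seq[i + 1] = seq[i + 1], seq[i]  # undo the swap
    return seq, best


if __name__ == "__main__":
    example = [
        Op("fmul", "fp1", ("fp0",), issue=1, latency=5),
        Op("fadd", "fp2", ("fp1",), issue=1, latency=1),  # waits on the fmul
        Op("fmove", "fp3", (), issue=1, latency=1),       # independent load
    ]
    order, final = optimize(example)
    print([op.name for op in order], final)
```

In the example the independent load gets pulled in front of the stalled add, so it fills part of the wait on the multiply. A search this greedy gets stuck in local minima easily, which may be one way a reordering pass ends up "too constrained".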
And the final sequence, after optimization.
fmove.s src,fp1 (21:0:0) (21+0+0) [21] s=0
fmove.s src,fp2 (21:0:0) (21+0+0) [21] s=0
fmul.x fp1,fp2 (17:4:55) (17+4+0) [21] s=0
fmove.s src,fp3 (21:0:0) (21+0+0) [21] s=0
fmove.s src,fp0 (21:0:0) (21+0+0) [21] s=0
fmul.x fp0,fp3 (17:4:55) (17+4+0) [21] s=0
fmove.s src,fp3 (21:0:0) (21+0+0) [21] s=0
fmove.s src,fp4 (21:0:0) (21+0+0) [21] s=0
fmul.x fp4,fp3 (17:4:55) (17+4+17) [38] s=0
fmove.s src,fp2 (21:0:0) (21+0+0) [21] s=0
fadd.x fp3,fp2 (17:4:35) (17+4+14) [35] s=17
fmove.s src,fp3 (21:0:0) (21+0+0) [21] s=0
fadd.x fp3,fp2 (17:4:35) (17+4+18) [39] s=14
fmul.x fp1,fp2 (17:4:55) (17+4+17) [38] s=0
fmove.s fp2,dst (21:0:0) (21+0+0) [21] s=21
fmul.x fp0,fp3 (17:4:55) (17+4+17) [38] s=0
fmove.s src,fp3 (21:0:0) (21+0+0) [21] s=0
fmul.x fp4,fp3 (17:4:55) (17+4+17) [38] s=0
fmove.s src,fp2 (21:0:0) (21+0+0) [21] s=0
fadd.x fp3,fp2 (17:4:35) (17+4+14) [35] s=17
fmove.s src,fp1 (21:0:0) (21+0+0) [21] s=0
fadd.x fp3,fp2 (17:4:35) (17+4+18) [39] s=14
fmul.x fp1,fp2 (17:4:55) (17+4+17) [38] s=0
fmove.s fp2,dst (21:0:0) (21+0+0) [21] s=21
fmul.x fp0,fp1 (17:4:55) (17+4+17) [38] s=0
fmove.s src,fp0 (21:0:0) (21+0+0) [21] s=0
fmul.x fp4,fp0 (17:4:55) (17+4+38) [59] s=0
fadd.x fp1,fp2 (17:4:35) (17+4+35) [56] s=17
fadd.x fp0,fp2 (17:4:35) (17+4+35) [56] s=17
fmove.s fp2,dst (21:0:0) (21+0+0) [21] s=21
904 cycles expressed
The basic idea seems to be working. I don't think the results are correct (or optimal) yet, for a few reasons. The register names shown are the original ones, but internally the registers have been renamed so that operations can move around more easily, and I haven't yet updated the final listing to reflect the internal names. The algorithm that reorders operations is also a bit too constrained. And I think some of the calculations are wrong in cases where overlap and dependency are involved at the same time.
Apart from that it looks decent - in this case appearing to save about 13% (1046 -> 904 cycles) on already 'optimized' code. The longer the sequence, the more opportunity it should have.
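
On the renaming point: the reason renaming helps is that the original code reuses fp3 (and later fp1/fp0) for unrelated temporaries, which creates false dependencies between otherwise independent loads and multiplies. A rough Python sketch of one way to do such renaming is below. It is a guess at the general technique, not the tool's internals: the `rename` function, the (op, src, dst) tuple format and the `v0`, `v1`, ... names are all invented for illustration, and it assumes every fmove into a register starts a fresh value while fmul/fadd continue the value already in their destination.

```python
def rename(seq):
    """Give each freshly loaded value its own virtual register name.

    seq: list of (op, src, dst) tuples; src or dst is None when that
    operand is memory. Hypothetical format, for illustration only.
    """
    current = {}  # architectural register -> virtual name it currently holds
    fresh = 0
    out = []
    for op, src, dst in seq:
        new_src = current.get(src, src)
        if dst is None:
            new_dst = None                   # store to memory
        elif op.startswith("fmove"):
            new_dst = f"v{fresh}"            # a load starts a new value
            fresh += 1
            current[dst] = new_dst
        else:
            new_dst = current.get(dst, dst)  # read-modify-write keeps its name
        out.append((op, new_src, new_dst))
    return out


if __name__ == "__main__":
    original = [
        ("fmove.s", None, "fp1"),
        ("fmove.s", None, "fp3"),
        ("fmul.x", "fp1", "fp3"),
        ("fmove.s", None, "fp3"),   # fp3 reused for an unrelated value
        ("fmul.x", "fp1", "fp3"),
    ]
    for row in rename(original):
        print(row)
```

After renaming, the two uses of fp3 carry different names, so nothing in the dependency check ties the second load to the first multiply. The missing step is the reverse mapping: assigning the virtual names back onto fp0-fp7 for the final listing.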