Quake 2 on Falcon030

dml · Post by **dml** » Fri Nov 21, 2014 6:03 pm

In the spare minutes (!) I had between then and now, I cobbled together a 68882-assembler version of the per-face setup math including my face-sharing optimization, and in terms of performance at least it definitely saves a lot of cycles.

The only problem is that it currently looks like this:

[Image: grab0035.png]

So once I figure out what is going on there things should be a bit better.

Have not done anything about the surface cache yet - it is still very, very slow and involves callbacks into the C part of the engine. I'll have to rewrite that part too, as a separate job. It probably won't be a quick thing to replace.

Zogging Hell · Post by **Zogging Hell** » Sat Nov 22, 2014 2:02 pm

Blimey Doug, I'm worried my Falcon might melt down and I'll be left with a steaming pile of molten metal and plastic in the case if I try this once it's done..

Keep up the good work

dml · Post by **dml** » Sun Nov 23, 2014 9:52 am

Zogging Hell wrote: Keep up the good work

The per-face setup math has been turned into 68k+FPU and is working properly now. Seems to be quite a lot faster.

It is still probably too slow - but a fair starting point to break it up between the different processors.

dml · Post by **dml** » Sun Nov 23, 2014 4:37 pm

Just a quick note on GCC 4.6 and code generation for FPU.

Code:

 fmove.s   (a3),fp1
 fmove.s   4(a3),fp0 <-- interleaving opportunity
 fmove.s   8(a3),fp4 <-- interleaving opportunity

Code:

 fmove.s   $5c(a0),fp2
 fmul.x    fp1,fp2  <-- dep stall
 fmove.s   $68(a0),fp3
 fmul.x    fp0,fp3  <-- dep stall
 fadd.x    fp3,fp2  <-- dep stall
 fmove.s   $74(a0),fp3
 fmul.x    fp4,fp3  <-- dep stall
 fadd.x    fp3,fp2  <-- dep stall
 fmove.s   fp2,$104(sp)  <-- dep stall

It appears this compiler doesn't know about the impact of scheduling on the 68882 FPU, either for interleaving fmoves or for avoiding dependency stalls. Interleaving of integer operations is less clear - I don't see much of it, but in many cases there isn't much opportunity to be taken anyway.

This means the code being generated is 20-30% slower than it needs to be, and makes hand-assembling worthwhile! Good news only if you're having fun with hand assembly

Zamuel_a · Post by **Zamuel_a** » Sun Nov 23, 2014 4:44 pm

Can instructions in the FPU run in parallel with the CPU to gain speed? Similar to how the Z DIV was done in Quake on PC? I guess one problem with the FPU is that not all Falcons have one, but it should be easy to add.

dml · Post by **dml** » Sun Nov 23, 2014 4:59 pm

Zamuel_a wrote: Can instructions in the FPU run in parallel with the CPU to gain speed? Similar to how the Z DIV was done in Quake on PC? I guess one problem with the FPU is that not all Falcons have one, but it should be easy to add.

Yes it runs in parallel with the CPU, and it also runs in parallel with itself if you are careful.

If you avoid reading a result before it is ready (which would create a dependency stall), the FPU can keep fetching and executing to some degree while a long operation such as mul, div or something even slower is still completing.

I think you can run any number of fmoves or integer ops in parallel with any other FPU op. The remaining FPU ops can only overlap with each other by their head/tail pipeline stages (fmove is kind of special because it has no tail, so multiple fmoves can overlap with one long tail, up to the point where an fmove's own head begins to interfere with the next op's head).

I had this idea to write a parser for FPU asm which lets you write some code and automatically re-interleaves it without breaking dependency order. Should help produce good code for long sequences of intensive math. It can always be done by hand but takes quite a long time to make it optimal if you have 50+ operations all needing resolved against each other. Looks like the compiler doesn't even try to do that.

dml · Post by **dml** » Mon Nov 24, 2014 12:09 pm

Last evening I had a go at a parser for interpreting FPU assembler. It was quick to write and successfully reads pseudo-assembler sourcefiles into an internal representation which can be used to drive a super-optimizer for 68k FPU code.

I haven't tried to do the optimizer part yet but the path seems clear enough.

1) define all the ops, timings, their head/tail info and other constraints
2) interpret the ops to figure out dependencies between them
3) define a process which can safely reorder the ops without breaking constraints
4) write some greedy algo to calculate and minimize the cost!
5) spit the final op sequence back out as ascii

There is also room to rename registers before and after optimization to help with the reordering.
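
Steps 2 and 3 of the plan above could be sketched roughly like this. It is not the tool's actual code, just a minimal Python illustration under stated assumptions: the `Op` tuple, the `"mem"` placeholder operand and the function names are all made up here, arithmetic ops are assumed to read their destination, and memory aliasing between loads and stores is ignored.

```python
from typing import NamedTuple

class Op(NamedTuple):
    name: str  # mnemonic, e.g. "fmul.x"
    src: str   # source operand: a register ("fp1") or "mem" (hypothetical placeholder)
    dst: str   # destination operand: a register or "mem"

def reads(op):
    """Registers an op reads. Arithmetic ops (fmul, fadd) also read their destination."""
    regs = set()
    if op.src.startswith("fp"):
        regs.add(op.src)
    if not op.name.startswith("fmove") and op.dst.startswith("fp"):
        regs.add(op.dst)
    return regs

def writes(op):
    """Registers an op writes (nothing when the destination is memory)."""
    return {op.dst} if op.dst.startswith("fp") else set()

def depends(earlier, later):
    """True if `later` must stay after `earlier`."""
    return bool(
        writes(earlier) & reads(later)      # read-after-write
        or reads(earlier) & writes(later)   # write-after-read
        or writes(earlier) & writes(later)  # write-after-write
    )

def can_swap(ops, i):
    """Adjacent ops i and i+1 may trade places only if they are independent."""
    return not depends(ops[i], ops[i + 1])

# The start of the GCC sequence quoted earlier in the thread.
gcc_ops = [
    Op("fmove.s", "mem", "fp2"),
    Op("fmul.x", "fp1", "fp2"),
    Op("fmove.s", "mem", "fp3"),
    Op("fmul.x", "fp0", "fp3"),
    Op("fadd.x", "fp3", "fp2"),
]
print([can_swap(gcc_ops, i) for i in range(len(gcc_ops) - 1)])
```

Only the second pair (the fmul into fp2 and the following load into fp3) is free to swap here, which is why renaming registers first gives a reordering pass much more room to work with.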

It's probably a bit of a waste of time for this specific problem since I'm trying to reduce/remove the amount of FPU code related to drawing - but it will probably be useful for other things not related to drawing, on this and other projects. It's also an interesting problem in itself

I imagine a tool like this could be used to optimize 68060 code for pairing etc, although not sure how valuable that really is, if people find it easy to do already by hand. 68882 is (maybe) a bit more tiring to do by hand because of the large timing variations and could benefit more from automation.

I used to have small super-optimizer tools for PS2 vector units which would rewrite linear assembler in vectorized, parallel form. One was supplied by Sony and one I did myself. It was quite a cool thing - but it was really needed on that platform because the vector unit assembly was so nasty, pipelined and parallel that it could take hours to hand-optimize a 5-10 instruction loop properly and have it still work. It was a bit like Falcon DSP but worse. Being able to run an automated optimizer over and over and compare output with hand-made efforts saved a lot of time even if the auto version was thrown away in the end.

Zamuel_a · Post by **Zamuel_a** » Mon Nov 24, 2014 7:55 pm

Could this be used to speed up Badmood as well? If there is anything that might benefit from being rewritten for the FPU. Maybe not for speed by itself, but for the parallelism.

dml · Post by **dml** » Mon Nov 24, 2014 8:08 pm

Zamuel_a wrote: Could this be used to speed up Badmood as well? If there is anything that might benefit from being rewritten for the FPU. Maybe not for speed by itself, but for the parallelism.

I don't think there is much in BadMood which can use it, but it will probably be useful for Q1/Q2 collision detection and maybe other stuff.

Got it reordering instructions now and 'optimizing' although the numbers look wrong. Will post results if/when it starts working.

dml · Post by **dml** » Tue Nov 25, 2014 9:40 am

Got some sensible output from the 68882-optimizer tool last night:

This is the original sequence of FPU operations generated by GCC, disassembled. It implements a 3x3 matrix by 3-vector transform.

Code:

;$0356b6 : f213 4480                            fmove.s   0(a3),fp1
;$0356ba : f22b 4400 0004                       fmove.s   4(a3),fp0
;$0356c0 : f22b 4600 0008                       fmove.s   8(a3),fp4
;
;$0356c6 : f228 4500 005c                       fmove.s   $5c(a0),fp2
;$0356cc : f200 0523                            fmul.x    fp1,fp2
;$0356d0 : f228 4580 0068                       fmove.s   $68(a0),fp3
;$0356d6 : f200 01a3                            fmul.x    fp0,fp3
;$0356da : f200 0d22                            fadd.x    fp3,fp2
;$0356de : f228 4580 0074                       fmove.s   $74(a0),fp3
;$0356e4 : f200 11a3                            fmul.x    fp4,fp3
;$0356e8 : f200 0d22                            fadd.x    fp3,fp2
;$0356ec : f22f 6500 0104                       fmove.s   fp2,$104(sp)
;
;$0356f2 : f228 4500 0060                       fmove.s   $60(a0),fp2
;$0356f8 : f200 0523                            fmul.x    fp1,fp2
;$0356fc : f228 4580 006c                       fmove.s   $6c(a0),fp3
;$035702 : f200 01a3                            fmul.x    fp0,fp3
;$035706 : f200 0d22                            fadd.x    fp3,fp2
;$03570a : f228 4580 0078                       fmove.s   $78(a0),fp3
;$035710 : f200 11a3                            fmul.x    fp4,fp3
;$035714 : f200 0d22                            fadd.x    fp3,fp2
;$035718 : f22f 6500 0108                       fmove.s   fp2,$108(sp)
;
;$03571e : f228 4500 0064                       fmove.s   $64(a0),fp2
;$035724 : f200 0523                            fmul.x    fp1,fp2
;$035728 : f228 4480 0070                       fmove.s   $70(a0),fp1
;$03572e : f200 00a3                            fmul.x    fp0,fp1
;$035732 : f200 0522                            fadd.x    fp1,fp2
;$035736 : f228 4400 007c                       fmove.s   $7c(a0),fp0
;$03573c : f200 1023                            fmul.x    fp4,fp0
;$035740 : f200 0122                            fadd.x    fp0,fp2
;$035744 : f22f 6500 010c                       fmove.s   fp2,$10c(sp)
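
For anyone not wanting to decode the addressing modes, here is a minimal Python sketch of what the disassembly above appears to compute, read off the load offsets ($5c/$68/$74 feed the first result, $60/$6c/$78 the second, $64/$70/$7c the third). The function name and the flat 9-element layout are assumptions for illustration, and Python doubles stand in for the FPU's extended precision.

```python
def transform(m, v):
    """m: the 9 floats stored at $5c(a0)..$7c(a0), in memory order.
    v: the 3 floats stored at (a3), 4(a3), 8(a3).
    Each result sums three products taken 3 floats (12 bytes) apart in m."""
    return [v[0] * m[i] + v[1] * m[3 + i] + v[2] * m[6 + i] for i in range(3)]

identity = [1.0, 0.0, 0.0,
            0.0, 1.0, 0.0,
            0.0, 0.0, 1.0]
print(transform(identity, [2.0, 3.0, 4.0]))
```

Nine multiplies and six adds, with each of the three results depending only on its own three products, so the three rows are independent of each other and can be freely interleaved.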

The listing below is the tool's interpretation of that sequence. Each line shows the documented timings (left column), the in-context timings taking into account head/tail overlaps and dependencies (middle column), and the total in-context cycles for that operation [square brackets]. s=? indicates stalls caused by dependencies which could otherwise be overlapped.

Code:

                fmove.s src,fp1 (21:0:0)        (21+0+0)        [21] s=0
                fmove.s src,fp0 (21:0:0)        (21+0+0)        [21] s=0
                fmove.s src,fp4 (21:0:0)        (21+0+0)        [21] s=0
                fmove.s src,fp2 (21:0:0)        (21+0+0)        [21] s=0
                fmul.x  fp1,fp2 (17:4:55)       (17+4+17)       [38] s=0
                fmove.s src,fp3 (21:0:0)        (21+0+0)        [21] s=0
                fmul.x  fp0,fp3 (17:4:55)       (17+4+55)       [76] s=0
                fadd.x  fp3,fp2 (17:4:35)       (17+4+0)        [21] s=17
                fmove.s src,fp3 (21:0:0)        (21+0+0)        [21] s=0
                fmul.x  fp4,fp3 (17:4:55)       (17+4+55)       [76] s=0
                fadd.x  fp3,fp2 (17:4:35)       (17+4+0)        [21] s=17
                fmove.s fp2,dst (21:0:0)        (21+0+0)        [21] s=21
                fmove.s src,fp2 (21:0:0)        (21+0+0)        [21] s=0
                fmul.x  fp1,fp2 (17:4:55)       (17+4+17)       [38] s=0
                fmove.s src,fp3 (21:0:0)        (21+0+0)        [21] s=0
                fmul.x  fp0,fp3 (17:4:55)       (17+4+55)       [76] s=0
                fadd.x  fp3,fp2 (17:4:35)       (17+4+0)        [21] s=17
                fmove.s src,fp3 (21:0:0)        (21+0+0)        [21] s=0
                fmul.x  fp4,fp3 (17:4:55)       (17+4+55)       [76] s=0
                fadd.x  fp3,fp2 (17:4:35)       (17+4+0)        [21] s=17
                fmove.s fp2,dst (21:0:0)        (21+0+0)        [21] s=21
                fmove.s src,fp2 (21:0:0)        (21+0+0)        [21] s=0
                fmul.x  fp1,fp2 (17:4:55)       (17+4+17)       [38] s=0
                fmove.s src,fp1 (21:0:0)        (21+0+0)        [21] s=0
                fmul.x  fp0,fp1 (17:4:55)       (17+4+55)       [76] s=0
                fadd.x  fp1,fp2 (17:4:35)       (17+4+0)        [21] s=17
                fmove.s src,fp0 (21:0:0)        (21+0+0)        [21] s=0
                fmul.x  fp4,fp0 (17:4:55)       (17+4+55)       [76] s=0
                fadd.x  fp0,fp2 (17:4:35)       (17+4+35)       [56] s=17
                fmove.s fp2,dst (21:0:0)        (21+0+0)        [21] s=21

Then the optimization pass....

cost reduced 1046 -> 1043
cost reduced 1043 -> 1040
cost reduced 1040 -> 1037
cost reduced 1037 -> 1020
cost reduced 1020 -> 1003
cost reduced 1003 -> 986
cost reduced 986 -> 969
cost reduced 969 -> 966
cost reduced 966 -> 949
cost reduced 949 -> 932
cost reduced 932 -> 929
cost reduced 929 -> 928
cost reduced 928 -> 927
cost reduced 927 -> 924
cost reduced 924 -> 921
cost reduced 921 -> 904

And the final sequence, after optimization.

Code:

                fmove.s src,fp1 (21:0:0)        (21+0+0)        [21] s=0
                fmove.s src,fp2 (21:0:0)        (21+0+0)        [21] s=0
                fmul.x  fp1,fp2 (17:4:55)       (17+4+0)        [21] s=0
                fmove.s src,fp3 (21:0:0)        (21+0+0)        [21] s=0
                fmove.s src,fp0 (21:0:0)        (21+0+0)        [21] s=0
                fmul.x  fp0,fp3 (17:4:55)       (17+4+0)        [21] s=0
                fmove.s src,fp3 (21:0:0)        (21+0+0)        [21] s=0
                fmove.s src,fp4 (21:0:0)        (21+0+0)        [21] s=0
                fmul.x  fp4,fp3 (17:4:55)       (17+4+17)       [38] s=0
                fmove.s src,fp2 (21:0:0)        (21+0+0)        [21] s=0
                fadd.x  fp3,fp2 (17:4:35)       (17+4+14)       [35] s=17
                fmove.s src,fp3 (21:0:0)        (21+0+0)        [21] s=0
                fadd.x  fp3,fp2 (17:4:35)       (17+4+18)       [39] s=14
                fmul.x  fp1,fp2 (17:4:55)       (17+4+17)       [38] s=0
                fmove.s fp2,dst (21:0:0)        (21+0+0)        [21] s=21
                fmul.x  fp0,fp3 (17:4:55)       (17+4+17)       [38] s=0
                fmove.s src,fp3 (21:0:0)        (21+0+0)        [21] s=0
                fmul.x  fp4,fp3 (17:4:55)       (17+4+17)       [38] s=0
                fmove.s src,fp2 (21:0:0)        (21+0+0)        [21] s=0
                fadd.x  fp3,fp2 (17:4:35)       (17+4+14)       [35] s=17
                fmove.s src,fp1 (21:0:0)        (21+0+0)        [21] s=0
                fadd.x  fp3,fp2 (17:4:35)       (17+4+18)       [39] s=14
                fmul.x  fp1,fp2 (17:4:55)       (17+4+17)       [38] s=0
                fmove.s fp2,dst (21:0:0)        (21+0+0)        [21] s=21
                fmul.x  fp0,fp1 (17:4:55)       (17+4+17)       [38] s=0
                fmove.s src,fp0 (21:0:0)        (21+0+0)        [21] s=0
                fmul.x  fp4,fp0 (17:4:55)       (17+4+38)       [59] s=0
                fadd.x  fp1,fp2 (17:4:35)       (17+4+35)       [56] s=17
                fadd.x  fp0,fp2 (17:4:35)       (17+4+35)       [56] s=17
                fmove.s fp2,dst (21:0:0)        (21+0+0)        [21] s=21

904 cycles expressed

The basic idea seems to be working. I don't think the results are correct yet (or optimal), for a few reasons - the register names are the original ones, but they have been internally renamed to allow stuff to move around more easily. I haven't updated the final names to reflect the internal names. The algorithm that reorders stuff is also a bit too constrained. Also I think there are some incorrect calculations for some cases where overlap + dependency are involved at the same time.

Apart from that it looks decent - in this case appearing to save roughly 14% (1046 -> 904 cycles) on already 'optimized' code. The longer the sequence, the more opportunity it should have.

dml · Post by **dml** » Tue Nov 25, 2014 10:50 am

I think once the optimizer tool works properly and can deal with cycle counts for effective addresses etc. it would be worth trying to extend it to do the full CPU instruction set with the 68030's separate head/tail pipe info and write buffer overlapping. There isn't as much opportunity for head/tail stuff on 030 but hand optimizing for write buffer scheduling can be quite tricky.

However I won't go there just now - something for later. Need to finish other stuff first.

dml · Post by **dml** » Tue Nov 25, 2014 12:56 pm

A few more tweaks later, performance gain increased from roughly 14% to 18% (1046 -> 853 cycles) on the same input code.

[EDIT] The depstall column was reporting stale info - fixed. The rest was unaffected.

cost reduced 1046 -> 1040
cost reduced 1040 -> 1037
cost reduced 1037 -> 1023
cost reduced 1023 -> 1020
cost reduced 1020 -> 1010
cost reduced 1010 -> 1006
cost reduced 1006 -> 989
cost reduced 989 -> 986
cost reduced 986 -> 976
cost reduced 976 -> 973
cost reduced 973 -> 969
cost reduced 969 -> 966
cost reduced 966 -> 963
cost reduced 963 -> 952
cost reduced 952 -> 946
cost reduced 946 -> 929
cost reduced 929 -> 928
cost reduced 928 -> 925
cost reduced 925 -> 908
cost reduced 908 -> 907
cost reduced 907 -> 904
cost reduced 904 -> 898
cost reduced 898 -> 894
cost reduced 894 -> 893
cost reduced 893 -> 890
cost reduced 890 -> 880
cost reduced 880 -> 877
cost reduced 877 -> 873
cost reduced 873 -> 856
cost reduced 856 -> 853

Code:

                fmove.s src,fp11    (21:0:0-0=0)     [total=21, depstall=0]
                fmove.s src,fp8     (21:0:0-0=0)     [total=21, depstall=0]
                fmul.x  fp8,fp11    (17:4:55-55=0)   [total=21, depstall=0]
                fmove.s src,fp9     (21:0:0-0=0)     [total=21, depstall=0]
                fmove.s src,fp12    (21:0:0-0=0)     [total=21, depstall=0]
                fmul.x  fp9,fp12    (17:4:55-55=0)   [total=21, depstall=0]
                fmove.s src,fp13    (21:0:0-0=0)     [total=21, depstall=0]
                fmove.s src,fp10    (21:0:0-0=0)     [total=21, depstall=0]
                fmul.x  fp10,fp13   (17:4:55-38=17)  [total=38, depstall=0]
                fmove.s src,fp14    (21:0:0-0=0)     [total=21, depstall=0]
                fmul.x  fp8,fp14    (17:4:55-38=17)  [total=38, depstall=0]
                fmove.s src,fp15    (21:0:0-0=0)     [total=21, depstall=0]
                fadd.x  fp12,fp11   (17:4:35-17=18)  [total=39, depstall=0]
                fmul.x  fp9,fp15    (17:4:55-38=17)  [total=38, depstall=0]
                fmove.s src,fp16    (21:0:0-0=0)     [total=21, depstall=0]
                fadd.x  fp13,fp11   (17:4:35-17=18)  [total=39, depstall=0]
                fmul.x  fp10,fp16   (17:4:55-38=17)  [total=38, depstall=0]
                fmove.s src,fp17    (21:0:0-0=0)     [total=21, depstall=0]
                fadd.x  fp15,fp14   (17:4:35-17=18)  [total=39, depstall=0]
                fmul.x  fp8,fp17    (17:4:55-38=17)  [total=38, depstall=0]
                fmove.s src,fp18    (21:0:0-0=0)     [total=21, depstall=0]
                fmul.x  fp9,fp18    (17:4:55-38=17)  [total=38, depstall=0]
                fmove.s src,fp19    (21:0:0-0=0)     [total=21, depstall=0]
                fmul.x  fp10,fp19   (17:4:55-38=17)  [total=38, depstall=0]
                fmove.s fp11,dst    (21:0:0-0=0)     [total=21, depstall=0]
                fadd.x  fp18,fp17   (17:4:35-17=18)  [total=39, depstall=0]
                fadd.x  fp16,fp14   (17:4:35-17=18)  [total=39, depstall=0]
                fadd.x  fp19,fp17   (17:4:35-21=14)  [total=35, depstall=0]
                fmove.s fp17,dst    (21:0:0-0=0)     [total=21, depstall=21]
                fmove.s fp14,dst    (21:0:0-0=0)     [total=21, depstall=0]

This optimizer is a simple genetic algorithm, which mutates a pool of cloned code sequences looking for opportunities and selecting / breeding the winners from the pool. The 'tweak' here was to randomize the maximum number of mutations per instance, instead of exactly 1 at a time. This helps escape local minima (e.g. where a problem can only be solved by resequencing several ops at once), and results in the additional 5% kick.
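
The mutate/select loop described above might look something like the following in Python. This is only a sketch under loud assumptions, not the tool itself: the op encoding, the toy `cost` function (one penalty per op that reads the result of the op directly before it, instead of real 68882 timings) and every name in it are invented for illustration. The one idea taken from the post is that each child gets a random number of mutations rather than exactly one.

```python
import random

# Toy encoding (an assumption): (text, registers read, registers written).
OPS = [
    ("fmove.s m0,fp2",  set(),          {"fp2"}),
    ("fmul.x  fp1,fp2", {"fp1", "fp2"}, {"fp2"}),
    ("fmove.s m1,fp3",  set(),          {"fp3"}),
    ("fmul.x  fp0,fp3", {"fp0", "fp3"}, {"fp3"}),
    ("fadd.x  fp3,fp2", {"fp3", "fp2"}, {"fp2"}),
    ("fmove.s m2,fp5",  set(),          {"fp5"}),
    ("fmul.x  fp4,fp5", {"fp4", "fp5"}, {"fp5"}),
    ("fadd.x  fp5,fp2", {"fp5", "fp2"}, {"fp2"}),
]

def independent(a, b):
    """No read-after-write, write-after-read or write-after-write between a and b."""
    return not (a[2] & b[1] or a[1] & b[2] or a[2] & b[2])

def cost(seq):
    """Toy stand-in for the cycle count: ops that read what the previous op wrote."""
    return sum(1 for prev, cur in zip(seq, seq[1:]) if prev[2] & cur[1])

def mutate(seq, rng, max_mutations):
    """Apply a random number of adjacent swaps, skipping any that break a dependency."""
    seq = list(seq)
    for _ in range(rng.randint(1, max_mutations)):
        i = rng.randrange(len(seq) - 1)
        if independent(seq[i], seq[i + 1]):
            seq[i], seq[i + 1] = seq[i + 1], seq[i]
    return seq

def optimize(seq, generations=200, pool_size=16, max_mutations=4, seed=1):
    """Breed a pool of clones, keeping the cheapest sequences each generation."""
    rng = random.Random(seed)
    pool = [list(seq) for _ in range(pool_size)]
    for _ in range(generations):
        children = [mutate(parent, rng, max_mutations) for parent in pool]
        pool = sorted(pool + children, key=cost)[:pool_size]
    return pool[0]

best = optimize(OPS)
print(cost(OPS), "->", cost(best))
for text, _, _ in best:
    print(text)
```

Because only adjacent independent ops are ever swapped, every sequence in the pool computes the same result as the input; the selection step just decides which orderings survive.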

There are a bunch of other related optimizations which can be made but unless the code sequence is really long it's probably not worth spending time on those. I'll be happy with an average of 20% gain for minimum expended effort

Keep in mind the input code was already optimized by GCC, so that's quite a healthy speedup.

Register renaming still not complete but it is at least A) naming the output registers with the internally renamed ones and B) constraining to a maximum of 8 registers active at any one time. I still need to do a final rename pass to renumber them back into the 0-7 range for valid assembly.
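
One way the final rename pass could work is a simple live-range scan. The sketch below is an assumption about the approach rather than what the tool does, and the function name, the tuple format and the memory labels (`m0`, `v0`, `out0`) are hypothetical. It hands each virtual register the lowest free physical register at its first appearance and releases it after its last use.

```python
import heapq

def rename_to_physical(ops, num_regs=8):
    """Map virtual registers (fp8, fp11, ...) onto fp0..fp{num_regs-1}.

    ops: list of (mnemonic, src, dst); operands starting with "fp" are
    registers, anything else is treated as a memory operand and left alone.
    """
    def is_reg(x):
        return x.startswith("fp")

    # Index of the last op that touches each virtual register.
    last_use = {}
    for i, (_, src, dst) in enumerate(ops):
        for r in (src, dst):
            if is_reg(r):
                last_use[r] = i

    free = list(range(num_regs))  # a sorted list is already a valid min-heap
    mapping, out = {}, []
    for i, (name, src, dst) in enumerate(ops):
        # Allocate a physical register the first time a virtual one appears.
        for r in (src, dst):
            if is_reg(r) and r not in mapping:
                if not free:
                    raise ValueError(f"more than {num_regs} registers live at op {i}")
                mapping[r] = heapq.heappop(free)
        out.append((name,
                    f"fp{mapping[src]}" if is_reg(src) else src,
                    f"fp{mapping[dst]}" if is_reg(dst) else dst))
        # Release registers whose live range ends at this op.
        for r in {src, dst}:
            if is_reg(r) and last_use[r] == i:
                heapq.heappush(free, mapping[r])
    return out

virtual = [
    ("fmove.s", "m0", "fp11"),
    ("fmove.s", "v0", "fp8"),
    ("fmul.x", "fp8", "fp11"),
    ("fmove.s", "m1", "fp12"),
    ("fmove.s", "v1", "fp9"),
    ("fmul.x", "fp9", "fp12"),
    ("fadd.x", "fp12", "fp11"),
    ("fmove.s", "fp11", "out0"),
]
for op in rename_to_physical(virtual):
    print(*op)
```

The `ValueError` is the check that the "maximum of 8 registers active at any one time" constraint actually held for the reordered sequence.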

I have only validated the output code briefly by eye, so it could well be broken but from what I can tell it looks like it is doing the right thing at least most of the time.

I will try later this evening with some real source sequences and try them in the Q2 project to see if it's producing valid code.

exxos · Post by **exxos** » Tue Nov 25, 2014 4:43 pm

So the 20% is with the FPU then? That's pretty good going really! I take it you didn't go down the DSP route then

dml · Post by **dml** » Tue Nov 25, 2014 6:50 pm

exxos wrote: So the 20% is with the FPU then? That's pretty good going really! I take it you didn't go down the DSP route then

The tool appears to offer a ~20% improvement on arbitrary FPU code sequences as generated by GCC.

I don't generally lift GCC-generated code for use with an assembler - that was just a testcase for measuring the tool. The actual FPU assembler used in the Q2 project was done by hand. However the tool probably still offers a 5%+ gain over handcoded FPU assembler and takes a lot less time to do so - and should allow a more tidy version to be maintained and edited, and an optimized/interleaved version produced on demand.

There are two FPU sequence candidates left in the per-face drawing setup code of Q2 which could benefit from this, but one of them will just be converted to DSP soon. The other one I'm less sure about, and may just use this tool on it instead.

The collision detection code however is probably not so easy to hand-assemble - there will be a lot of floating point math and collision events are notoriously hard to visualize and debug, so having an optimizer tool is more likely to be useful there later on.

The latest version seems to produce working code. I added a couple of fake instructions 'fpin, fpout' which should help constrain input and output registers so they don't get renamed in an uncomfortable way. These just book-end the sequence and lock register names temporarily - they can be discarded afterwards. The locking part doesn't actually work yet but it's probably the last significant thing needing done before it's usable.

Code:

               fpin    fp6              (0:0:0-0=0)      [total=0, depstall=0]
               fpin    fp3              (0:0:0-0=0)      [total=0, depstall=0]
               fpin    fp5              (0:0:0-0=0)      [total=0, depstall=0]

               fmove.s [cam_fm00],fp2   (21:0:0-0=0)     [total=21, depstall=0]
               fmul.x  fp5,fp2          (17:4:55-38=17)  [total=38, depstall=0]
               fmove.s [cam_fm10],fp4   (21:0:0-0=0)     [total=21, depstall=0]
               fmul.x  fp3,fp4          (17:4:55-38=17)  [total=38, depstall=0]
               fmove.s [cam_fm20],fp0   (21:0:0-0=0)     [total=21, depstall=0]
               fmul.x  fp6,fp0          (17:4:55-38=17)  [total=38, depstall=0]
               fmove.s [cam_fm01],fp1   (21:0:0-0=0)     [total=21, depstall=0]
               fmul.x  fp5,fp1          (17:4:55-38=17)  [total=38, depstall=0]
               fmove.s [cam_fm11],fp7   (21:0:0-0=0)     [total=21, depstall=0]
               fadd.x  fp4,fp2          (17:4:35-17=18)  [total=39, depstall=0]
               fmul.x  fp3,fp7          (17:4:55-38=17)  [total=38, depstall=0]
               fmove.s [cam_fm21],fp4   (21:0:0-0=0)     [total=21, depstall=0]
               fadd.x  fp0,fp2          (17:4:35-17=18)  [total=39, depstall=0]
               fmul.x  fp6,fp4          (17:4:55-38=17)  [total=38, depstall=0]
               fmove.s [cam_fm02],fp0   (21:0:0-0=0)     [total=21, depstall=0]
               fadd.x  fp7,fp1          (17:4:35-17=18)  [total=39, depstall=0]
               fmul.x  fp5,fp0          (17:4:55-38=17)  [total=38, depstall=0]
               fmove.s [cam_fm12],fp5   (21:0:0-0=0)     [total=21, depstall=0]
               fmul.x  fp3,fp5          (17:4:55-38=17)  [total=38, depstall=0]
               fmove.s [cam_fm22],fp3   (21:0:0-0=0)     [total=21, depstall=0]
               fmul.x  fp6,fp3          (17:4:55-17=38)  [total=59, depstall=0]
               fadd.x  fp5,fp0          (17:4:35-17=18)  [total=39, depstall=0]
               fadd.x  fp4,fp1          (17:4:35-17=18)  [total=39, depstall=0]
               fadd.x  fp3,fp0          (17:4:35-0=35)   [total=56, depstall=0]

               fpout   fp2              (0:0:0-0=0)      [total=0, depstall=0]
               fpout   fp1              (0:0:0-0=0)      [total=0, depstall=0]
               fpout   fp0              (0:0:0-0=0)      [total=0, depstall=0]

803 cycles expressed
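
The per-op totals in the listing above seem to follow a simple rule, which can be checked with a few lines of Python. This is a model inferred from the printed numbers, not the tool's source: the names `head`, `mid` and `tail` for the three timing fields are a guess, and the model ignores dependency stalls entirely. It assumes an op's tail can be hidden behind any following tail-less ops (fmove) plus the head of the next op that has a tail, so each op costs head + mid + (tail - overlap).

```python
# Timing triples as printed in the listings, e.g. fmul.x shows (17:4:55).
TIMING = {"fmove": (21, 0, 0), "fmul": (17, 4, 55), "fadd": (17, 4, 35)}

def in_context_cost(seq):
    """Total cycles for a sequence of mnemonics under the inferred overlap rule."""
    total = 0
    for i, op in enumerate(seq):
        head, mid, tail = TIMING[op]
        overlap = 0  # cycles of following work that run under this op's tail
        for nxt in seq[i + 1:]:
            n_head, n_mid, n_tail = TIMING[nxt]
            if n_tail == 0:
                overlap += n_head + n_mid  # a tail-less op can overlap completely
            else:
                overlap += n_head          # only the head of the next long op overlaps
                break
            if overlap >= tail:
                break
        overlap = min(overlap, tail)
        total += head + mid + (tail - overlap)
    return total

# Mnemonics of the 24 real ops in the listing (fpin/fpout cost nothing).
SEQ = ("fmove fmul fmove fmul fmove fmul fmove fmul fmove fadd fmul fmove "
       "fadd fmul fmove fadd fmul fmove fmul fmove fmul fadd fadd fadd").split()

print(in_context_cost(SEQ))  # 803
```

For this sequence the model lands on the same 803 cycles as the tool. It would not reproduce a listing where depstall is non-zero, since a dependent op cannot start under the tail it is waiting for.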

dml · Post by **dml** » Wed Nov 26, 2014 8:05 pm

There were a couple of bugs needing fixed but I have now incorporated 3 separate optimized FPU sequences produced by the tool and it's all working ok.

Aside from the tool, the rasterizer math is pretty well optimized now. I think it is using the theoretical minimum number of divides possible - or quite close. The texture mapping uses no more than 1 divide instruction every 3 or 4 faces needing drawn - less than one divide per face. The geometry processing uses divides for edge scissoring and edge gradient calculation - the former is quite infrequent and the latter can still be optimized a bit.

dml · Post by **dml** » Thu Nov 27, 2014 10:10 am

Some screenshots from last evening, mainly showing framerate improvements from rewriting the per-face math (since the last video).

No changes to per-pixel cost since first version - that part can wait a while.

Won't get time to do much more with it for a couple of weeks probably - will return to it then.

AdamK · Post by **AdamK** » Thu Nov 27, 2014 11:20 am

If Quake2 framerate will beat Bad Mood framerate you might be a little embarrassed

Please, do another video.

dml · Post by **dml** » Thu Nov 27, 2014 12:39 pm

I doubt it will catch up with BadMooD except in some convenient views - but it will be more efficient overall (more work done per cycle spent) and may get close in places.

There's still plenty needing fixed so we'll see what happens.

jvas · Post by **jvas** » Thu Nov 27, 2014 12:50 pm

There is a Microclub (https://www.facebook.com/pages/Csokonai ... 5952923592) here in Budapest, which is held on every Friday. It is revived from the ashes of an Amiga club. Needless to say, I'm the only Atari enthusiast there. I eagerly wait to present them Q2 running on the Falcon.

(BadMood has already been presented: have you ever seen Amiga fans' jaws drop while looking at an Atari?)

PS: Check out the facebook page, there are some cool videos recorded. You can also get used to the Hungarian language

FedePede04 · Post by **FedePede04** » Thu Nov 27, 2014 1:06 pm

AdamK wrote: Please, do another video.

+1

dml · Post by **dml** » Thu Nov 27, 2014 5:40 pm

Will try to make final edits to BadMooD for a release first. I'll need to stop messing with Q2 to get that done.

There will still be things needing improved in BadMooD even then, but it's being held up for insignificant reasons and I should just finish the obvious ones and get it done.

Eero Tamminen · Post by **Eero Tamminen** » Thu Nov 27, 2014 8:15 pm

AdamK wrote: If Quake2 framerate will beat Bad Mood framerate you might be a little embarrassed

Remember that above is just Quake rendering, in BM rendering was just 1/2 of the cost. Rest was AI, sound propagation & management etc...

AdamK · Post by **AdamK** » Thu Nov 27, 2014 8:25 pm

Eero Tamminen wrote:
AdamK wrote: If Quake2 framerate will beat Bad Mood framerate you might be a little embarrassed
Remember that above is just Quake rendering, in BM rendering was just 1/2 of the cost. Rest was AI, sound propagation & management etc...

It has about half FPS of Bad Mood atm. I remain optimistic

shoggoth · Post by **shoggoth** » Thu Nov 27, 2014 10:15 pm

Bricks have been shat.

EvilFranky · Post by **EvilFranky** » Thu Nov 27, 2014 10:16 pm

That made me genuinely laugh out loud hahaha!
