I dug up my SQRT routines and played aroud with the Newton approximations from the Stratos magazine. I got some interesting results.
There is a slight problem in the Earx routine. sqrt(1) is calculated with the wrong answer and so is sqrt(2003125025). The routine doesn't like numbers above 2^31. The routine needs some fixing.
Except for the Earx routine all other routines produce correct answers for all numbers from 0 to 2^32-1. (I tested Fast Newton ASM, Nyh ASM, Nyh C, Fast Newton C and Newton C for correctness over the whole range.)
The Fast Newton ASM is by far the fastest. It is also the biggest one. It uses the Newton approximation for a SQRT: x_n+1 = 0.5*(x_n + Value/x_n). You get a better answer very fast. Every iteration the number of digits doubles. If your first gues is correct for 1 bit, the first iteration gives you 2, the 2nd 4, the 3rd 8 and the 4th 16 bits. There we stop because from a 32 bit number you can only get 16 a 16 bit integer as sqrt.
A disadvantage from the Newton method is that it is an aproximation. The last digit may be rounded up or down. The Nyh ASM routine always rounds down, so does tobe. The Nyh, Tobe and Earx routine are basicly the same method, Only the implementation is slightly different.
Well let us cut the crap and look at the code. First the Nyh ASM. It comes from the I_A addition lib. The code in the lib has a good explanation of the algorithm. I haev added it to the zipfile.
Code: Select all
; high speed sqrt
; call: DO = x, integer
; return: D0 = SQRT(x)
sqrt_strt:
move.w #$100,a0
cmp.l a0,d0 ; d0<256?
bcs.s .8bit ; yep
move.l d0,a1 ; x
swap d0
tst.w d0
bne.s .32bit
move.l a1,d0
moveq #0,d1
.loop:
addq.w #1,d1 ;
.loop_start: ; 16 bits left
lsr.w #2,d0 ;
cmp.w a0,d0
bcc.s .loop
move.b .sqrt_table(pc,d0.w),d0 ; table lookup
lsl.w d1,d0 ; order 0
move.l a1,d1 ; x
divu d0,d1 ; x/tmp
add.w d1,d0 ; tmp+=x/tmp
lsr.w #1,d0 ; tmp/=2 order 1
move.l a1,d1
divu d0,d1 ; x/tmp
add.w d1,d0 ; tmp+=x/tmp
lsr.w #1,d0 ; tmp/=2 order 2
rts
.8bit:
move.b .sqrt_table(pc,d0.w),d0 ; table lookup
rts
.32bit:
cmp.w a0,d0 ; 24 bit?
bcc.s .24bit
move.l a1,d0
lsr.l #8,d0
moveq #5,d1
bra.s .loop_start
.24bit:
cmpi.w #$3fc0,d0 ; bignum?
bcc.s .bignum ; yep
move.l a1,d0
clr.w d0
swap d0
moveq #9,d1
bra.s .loop_start
.bignum: ; big nums cause overflows
cmpi.w #$fffd,d0 ; big bignum?
bcc.s .bigbignum ; yep!
move.l a1,-(sp) ; store x
swap d0
lsr.l #4,d0 ; x geen bignum meer
move.l d0,a1 ; x
clr.w d0
swap d0
moveq #9,d1
bsr.s .loop_start
add.w d0,d0
add.w d0,d0 ; d0 is our guess now
move.l (sp)+,d1 ; x
divu d0,d1 ; i/tmp
bvc.s .cont0
moveq #1,d1 ; overflow!
bra.s .cont1
.cont0:
swap d1
clr.w d1
.cont1:
swap d1
add.l d1,d0 ; tmp+=i/tmp
lsr.l #1,d0 ; tmp/=2
rts
.bigbignum:
moveq #0,d0
subq.w #1,d0 ; d0=$0000ffff
rts
.sqrt_table:
dc.b 0,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3
dc.b 4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5
dc.b 5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6
dc.b 6,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
dc.b 8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8
dc.b 8,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9
dc.b 9,9,9,9,10,10,10,10,10,10,10,10,10,10,10,10
dc.b 10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11
dc.b 11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11
dc.b 12,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12
dc.b 12,12,12,12,12,12,12,12,12,13,13,13,13,13,13,13
dc.b 13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13
dc.b 13,13,13,13,14,14,14,14,14,14,14,14,14,14,14,14
dc.b 14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14
dc.b 14,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15
dc.b 15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15
I included all .C and .S files in the zip file. Hope you like it. I love this kind of work!