Fastest byte-swap on MC68000

ppera

Fastest byte-swap on MC68000

Post by ppera »

What would be the fastest code to copy content byte-swapped from one memory location to another?

At the moment I see this as the fastest:

moveq #8,d4
loop move.w (a0)+,d0
rol.w d4,d0
move.w d0,(a1)+
.... repeat it some 8 times, then loop as many times as needed

Btw, using a long move and byte-swapping each of the two words gives exactly the same cycle count.

With this, a speed of some 400KB/sec can be achieved on an 8MHz 68000.
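
A rough check of that figure, as a sketch only. It assumes the standard 68000 timings (8 cycles for each `move.w`, 6+2n for a word rotate by n) and ignores loop overhead; the helper name `kb_per_sec` is made up for illustration, and 1 KB is taken as 1000 bytes.

```python
# Rough throughput estimate for the unrolled rol.w sequence on an 8 MHz 68000.
# Assumed cycle counts: standard 68000 figures, loop overhead ignored.
CLOCK_HZ = 8_000_000

def kb_per_sec(cycles_per_word, clock_hz=CLOCK_HZ):
    """KB per second (1 KB = 1000 bytes) when one 2-byte word costs cycles_per_word."""
    return clock_hz / cycles_per_word * 2 / 1000

move_in = 8            # move.w (a0)+,d0
rotate = 6 + 2 * 8     # rol.w d4,d0 with a count of 8
move_out = 8           # move.w d0,(a1)+
per_word = move_in + rotate + move_out

print(per_word, round(kb_per_sec(per_word)))   # 38 cycles, about 421 KB/sec
# Assumption: on the ST each instruction rounds up to a multiple of 4 cycles,
# so the rotate effectively costs 24 and a word costs 40.
print(round(kb_per_sec(8 + 24 + 8)))           # about 400 KB/sec
```

Both results land close to the "some 400KB/sec" quoted above.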
frost
Captain Atari
Posts: 378
Joined: Sun Dec 01, 2002 2:50 am
Location: Limoges

Post by frost »

Well, seems poor but this code may be faster because of the rol you use.

Code:

move.b (a0)+,d0
move.b (a0)+,d1
move.b d1,(a1)+
move.b d0,(a1)+     ; 8 cycles each, 32 cycles per word.
Your code should take around 40 cycles per word.
But this will not work if your reads and writes need to be atomic. Anyway, just test and see for yourself; it's been too long since I coded on an ST.
ppera

Post by ppera »

Yes, it is faster. Shifts and rotates are slow on the MC68000. It is probably better on an MC68030.

However, for IDE disk reads (or writes) it is no good: word (16-bit) or long accesses must be used, and reading by bytes will not work.
MiggyMog
Atari Super Hero
Posts: 936
Joined: Sun Oct 30, 2005 4:43 pm
Location: Scotland

Post by MiggyMog »

Just out of interest, how many cycles does SWAP use?
bod/STAX
Atari Super Hero
Posts: 508
Joined: Wed Nov 24, 2004 8:13 pm
Location: Halesowen, West Midlands, England

Post by bod/STAX »

According to my clockcycle sheet: 4
frost

Post by frost »

ppera wrote:Yes, it is faster. Shift, rotate is slow on MC68000 . Probably better is with some MC68030.

However, for IDE disk read (or write) it is not good - must use word (16-bit) or long read/write - reading by bytes will not work.
Then your routine should be the fastest I think.. It's difficult to get something better without using a 128K lookup table.
ppera

Post by ppera »

MiggyMog wrote:Just out of interest How many cycles does SWAP use?
4 cycles - the shortest execution time on the MC68000.
Unfortunately, there is no byte-swap instruction, so we are stuck with the slow rol.w: at least 22 cycles for an 8-bit shift or rotate.
ppera

Post by ppera »

frost wrote: Then your routine should be the fastest I think.. It's difficult to get something better without using a 128K lookup table.
I used this earlier:

Read into a buffer at an even address with long moves

lea 1(a1),a2
loop move.b (a3)+,(a2)
addq.l #2,a2
move.b (a3)+,(a1)
addq.l #2,a1
repeat....

It is also 32 cycles, but the data must first be read into the buffer at a3.
I don't think a table would be faster: calculating the index and reading the table take some cycles too.
frost

Post by frost »

Yes, you're right, it would take approx 40 cycles.
Shrimp
Atarian
Posts: 7
Joined: Wed Aug 22, 2007 1:36 am
Location: Gothenburg, Sweden

Post by Shrimp »

Some versions of byte swap inner loops, some might be useful for you, others won't be. =)


movep version
movep.l 0(a0),d0
movep.l 1(a0),d1
movep.l d0,1(a1)
movep.l d1,0(a1) ; tot 24/4 = 6 nops per swapped word
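
To see why the four movep's give a swap, here is a small model, offered as a sketch. It assumes only the documented behaviour of `movep.l` (four bytes transferred at every second address); the helper names `movep_l_read` and `movep_l_write` are made up, and the register is modelled as a list of four bytes.

```python
def movep_l_read(mem, addr):
    """Model of movep.l d(An),Dn: gather bytes from addr, addr+2, addr+4, addr+6."""
    return [mem[addr + 2 * i] for i in range(4)]

def movep_l_write(mem, addr, reg):
    """Model of movep.l Dn,d(An): scatter 4 bytes to addr, addr+2, addr+4, addr+6."""
    for i, value in enumerate(reg):
        mem[addr + 2 * i] = value

src = bytearray([0x11, 0x00, 0x33, 0x22, 0x55, 0x44, 0x77, 0x66])
dst = bytearray(8)

d0 = movep_l_read(src, 0)   # movep.l 0(a0),d0 -> the high bytes of 4 words
d1 = movep_l_read(src, 1)   # movep.l 1(a0),d1 -> the low bytes of 4 words
movep_l_write(dst, 1, d0)   # movep.l d0,1(a1) -> high bytes go to the low positions
movep_l_write(dst, 0, d1)   # movep.l d1,0(a1) -> low bytes go to the high positions

print(dst.hex())            # 0011223344556677
```

The model covers four words per pass; it says nothing about whether byte-wide accesses are acceptable for the hardware being read.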


a7 version, user mode or supervisor mode with interrupts disabled
move.b (a0)+,1(a7) ; 4
move.b (a0)+,(a7)+ ; 3, tot 7 nops


128k conversion table
movem.w (a0)+,d0-d7 ; (3+8)/8
add.w d0,d0 ; 1
move.w (a1,d0.l),(a2)+ ; 5, tot 7.375 nops
...


Movem version
movem.w (a0)+,d0-d6
rol.w d7,d0
...
movem.w d0-d6,12(a1) ; approx 8.86 nops


Longword version (just a sketch, unroll 3 times to fill up d0-d5 with swapped words)
move.l -(a0),d0 ; 3

rol.l d7,d0 ; 6
move.l d0,d6 ; 1
move.l d0,d1 ; 1
swap d0 ; 1
move.b d0,d1 ; 1
move.b d6,d0 ; 1, 14/2 = 7

movem.w d0-d5,-(a1) ; 8/6, tot 8.33 nop



Also the blitter should be able to do the swap in 3 nops

2 reads per line
1 write per line

8 shifts
word inc (source and dest) = 0
line inc (source and dest) = 2
ppera

Post by ppera »

Shrimp wrote:Some versions of byte swap inner loops, some might be useful for you, others won't be. =)


movep version
movep.l 0(a0),d0
movep.l 1(a0),d1
movep.l d0,1(a1)
movep.l d1,0(a1) tot 24/4 = 6 nops per swapped word


a7 version, user mode or supervisor mode with interrupt disabled
move.b (a0)+,1(a7) 4
move.b (a0)+,(a7)+ 3 tot 7 nops




Also the blitter should be able to do the swap in 3 nops

2 reads per line
1 write per line

8 shifts
word inc (source and dest) = 0
line inc (source and dest) = 2
Thanks. Some notes:
We want to swap larger amounts of data (which is why speed matters), so some additional instructions are needed, and they will reduce the speed.

After the movep's we need: addq.l #8,a1 and addq.l #8,a0

After the a7 version, an addq.l #1,a7 is required. And I don't see why a7 is used, since all address registers have the same speed.

Can you give more details on how to do it with the blitter? I have no experience with it.
frost

Post by frost »

When you push a byte to stack, the stack is increased by 2, not 1 ;)
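
A minimal model of the a7 version that relies on this, as a sketch. The function name `a7_swap` is made up, registers are plain integer offsets, and the only assumption is the documented rule that a byte-sized `(a7)+` steps the stack pointer by 2.

```python
def a7_swap(src):
    """Model of: move.b (a0)+,1(a7) / move.b (a0)+,(a7)+, repeated once per word."""
    dst = bytearray(len(src))
    a0 = 0
    a7 = 0
    while a0 < len(src):
        dst[a7 + 1] = src[a0]   # move.b (a0)+,1(a7): first byte goes to the odd address
        a0 += 1
        dst[a7] = src[a0]       # move.b (a0)+,(a7)+: second byte goes to the even address
        a0 += 1
        a7 += 2                 # a byte access through a7 steps by 2, not 1
    return dst

print(a7_swap(bytes.fromhex("11003322")).hex())   # 00112233
```

With any other address register the post-increment would be 1 and an extra add would be needed per word.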
Shrimp

Post by Shrimp »

You are welcome.

No, you don't need to increment your address registers after every swap; you unroll the loop at least 20 times and can therefore just increase the offsets instead (well, I usually skip the loop entirely and unroll the inner loop completely, which gives more free registers and less coding time).


And no, you must not add 1 to A7 in the A7 version: the stack pointer advances 2 bytes for every byte-sized increment/decrement (that is the reason I used A7).

I'm not that familiar with the blitter myself and would have to check the hardware documentation to get it working, but it is possible to add an extra read for every line, and if both reads come from the same word and you shift by 8 pixels, the result is a byte swap.
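
The arithmetic behind that idea, as a sketch only. It models a 32-bit window holding the same word twice and is not a model of the real blitter registers; the function name `swap_by_double_read` is made up, and ppera reports later in the thread that the hardware attempts did not produce a swap.

```python
def swap_by_double_read(word):
    """Byte-swap a 16-bit word by shifting a 32-bit window that holds the word twice."""
    window = (word << 16) | word      # the same source word read twice
    return (window >> 8) & 0xFFFF     # shift by 8 bits, keep the 16 bits in the middle

print(f"{swap_by_double_read(0x1100):04x}")   # 0011
print(f"{swap_by_double_read(0x3322):04x}")   # 2233
```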
Cyprian
10 GOTO 10
Posts: 2204
Joined: Fri Oct 04, 2002 11:23 am
Location: Warsaw, Poland

Post by Cyprian »

Shrimp wrote:I myself am not that familiar with the blitter, and has to check with the hardware documentation to get that working, but it is possible to add an extra read for every line, and if you are reading from the same word and shifting 8 pixels it is a byte swap.
Shifting is not needed, straight copy should be adequate:
. Source: [DATA to swap]
. Destination: [DATA to swap] - 2 bytes
. EndMask: $00ff $00ff $00ff

It takes 12 cycles per word.
But in this case the result will start at an odd address: [DATA to swap] - 1 byte.

If we need the result to start at an even address, we should do a straight copy with a shift of << 8.
That takes an additional 8 cycles per word.

Code:


	clr.w	-(sp)			; ---------
	move.w	#$20,-(sp)		; SUPER
	trap	#1			;
	addq.l	#4,sp			;


	lea	$ffff8a20.w,a0		; ---------
	move.l	#$20000,(a0)+		; BliTTER Init
	move.l	#DATATOSWAP,(a0)+	;
	move.l	#$00ff00ff,(a0)+	;
	move.w	#$00ff,(a0)+		;
	move.l	#$20000,(a0)+		;
	move.l	#DATATOSWAP-2,(a0)+	;
	move.w	#(END-DATATOSWAP)/2,(a0)+	;
	move.w	#$0001,(a0)+		;
	move.w	#$0203,(a0)+		;
	move.b	#$00,1(a0)		;
	move.b	#$c0,(a0)		; bc	test3

	nop
	nop

	dc.l	-1, -1
DATATOSWAP
	dc.w	$1100, $3322, $5544, $7766
END
	dc.l	0, 0
Shrimp

Post by Shrimp »

It takes 12 cycles per word.
Yes, but my blitter solution also takes 12 cycles and is word-aligned (which is what he wants, as he intends to write an ATAPI driver, if I got it right).

I'll borrow your blitter setup code, if it's OK with you =)

lea $ffff8a20.w,a0 ; ---------
move.l #$2,(a0)+ ; BliTTER Init xInc,yInc source
move.l #DATATOSWAP,(a0)+ ;
move.l #$ffffffff,(a0)+ ;
move.w #$ffff,(a0)+ ;
move.l #$2,(a0)+ ; xInc,yInc destination
move.l #DATATOSWAP,(a0)+ ;
move.w #$0001,(a0)+ ; nr words per line
move.w #NR_WORDS_TO_SWAP,(a0)+ ; nr lines
move.w #$0203,(a0)+ ;
move.w #$c088,(a0) ; bc test3
ppera

Post by ppera »

Shrimp wrote:...
No, you don't need to increment your adress registers after every swap, you unroll the loops at least 20 times and can therefore just increase the indexes (well. I myself usually skips the loop entierly and unrolls the innerloop completely, which gives more free registers and less coding time.)..
And no you must not add 1 to A7 in the A7-version, the stackpointer advances 2 bytes for every increment/decrement of bytesize (that is the reason I used A7)
I have an old 68000 book (from 1985). It is pretty detailed, but the 2-byte A7 increment is not mentioned (nor is the MOVEM overshoot). However, what it does cover seems very accurate, such as cycle counts for all instructions and all effects on registers. I checked in a debugger: after MOVEP the address register is unchanged, so after each pair of lines we need to advance the address register by 8.

I will check the blitter codes later...
ijor
Hardware Guru
Posts: 4067
Joined: Sat May 29, 2004 7:52 pm

Post by ijor »

Shrimp wrote:128k conversion table
movem.w (a0)+,d0-d7
add.w d0,d0
move.w (a1,d0.l),(a2)+
This obviously doesn't work as you expect.
rol.w d7,d0
Hmm, why are you all "spending" a register on the shift/rotate? rol #8,Dx is as fast and as compact.
ppera wrote:I have one old (from 1985) 68000 book. It is pretty detailed, but A7 2 byte increment is not mentioned (as not overshot by MOVEM).
Both are documented.
ppera

Post by ppera »

ijor wrote:
rol.w d7,d0
Hmm, why you are all "spending" a register with the shift/rotate? rol #8,Dx is as fast and as compact...
ppera wrote: Both are documented.
Because of speed, of course :-) When executing something many times, it is always good to have all constants in registers. rol dx,dy is faster than rol constant,dy. It is even shorter, which matters for instruction 'towers' (the same short sequence repeated 8-16 or more times), since they take less space.
Yes, I found some PDFs where the mentioned details are described. I just love my old book :-)
ijor

Post by ijor »

ppera wrote:Because of speed, of course :-) When exec. something lot of times it is always good to have all constants in registers. rol dx,dy is faster than rol constant,dy. And even is shorter - what by ..
It is not faster, and it is not shorter.
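
A sketch that puts numbers on this. It assumes the 68000 timing of 6+2n cycles for a byte/word register rotate in both count forms, and the usual register shift/rotate opcode layout; the helper names `rol_w_cycles` and `rol_w_opcode` are made up for illustration.

```python
def rol_w_cycles(count):
    """68000 register rotate, byte/word size: 6 + 2n cycles for either count form."""
    return 6 + 2 * count

def rol_w_opcode(dreg, count_imm=None, count_reg=None):
    """Encode rol.w #imm,Dn or rol.w Dm,Dn; both fit in one 16-bit opcode word."""
    if count_imm is not None:
        field, ir = count_imm & 7, 0   # an immediate count of 8 is encoded as 0
    else:
        field, ir = count_reg, 1       # the count is taken from data register Dm
    # 1110 | count/reg | dr=1 (left) | size=01 (word) | i/r | type=11 (rotate) | Dn
    return 0xE000 | (field << 9) | (1 << 8) | (1 << 6) | (ir << 5) | (3 << 3) | dreg

print(rol_w_cycles(8))                          # 22 cycles either way
print(hex(rol_w_opcode(0, count_imm=8)))        # rol.w #8,d0 -> 0xe158
print(hex(rol_w_opcode(0, count_reg=4)))        # rol.w d4,d0 -> 0xe978
```

Each form is a single opcode word, and the register form additionally needs a moveq somewhere to load the count.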
Shrimp

Post by Shrimp »

ijor wrote:
Shrimp wrote:128k conversion table
movem.w (a0)+,d0-d7
add.w d0,d0
move.w (a1,d0.l),(a2)+
This obviously doesn't work as you expect.
Of course it does work as I expect, think again... ;)
(Hint: movem.w signextends registers)
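
Spelling the hint out as a sketch: after the sign extension, add.w doubles only the low word, so d0.l ends up as twice the signed value of the word, and a1 can point at the middle of the 128K table. The names `swap16`, `MID`, `table` and `lookup` are made up for this model.

```python
def swap16(w):
    """Reference byte swap of a 16-bit word."""
    return ((w << 8) | (w >> 8)) & 0xFFFF

# 128 KB table of 65536 word entries; a1 is assumed to point at the middle.
# Entries for $8000..$FFFF sit below a1, entries for $0000..$7FFF at and above it.
MID = 0x10000                                   # byte offset of a1 inside the table
table = bytearray(0x20000)
for w in range(0x10000):
    signed = w - 0x10000 if w & 0x8000 else w
    off = MID + 2 * signed
    table[off:off + 2] = swap16(w).to_bytes(2, "big")

def lookup(w):
    d0 = w | 0xFFFF0000 if w & 0x8000 else w             # movem.w sign-extends to 32 bits
    d0 = (d0 & 0xFFFF0000) | ((d0 + d0) & 0xFFFF)        # add.w d0,d0 changes the low word only
    index = d0 - 0x100000000 if d0 & 0x80000000 else d0  # d0.l taken as a signed index
    off = MID + index                                    # effective address of (a1,d0.l)
    return int.from_bytes(table[off:off + 2], "big")

print(f"{lookup(0x1100):04x} {lookup(0xff80):04x}")      # 0011 80ff
```

A plain move.w load would leave the upper half of d0 undefined, which is why the movem.w matters here.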
Kalms
Retro freak
Posts: 13
Joined: Sat Oct 28, 2006 10:18 am
Location: Linkoping, Sweden

Post by Kalms »

Shrimp wrote:
ijor wrote:
Shrimp wrote:128k conversion table
movem.w (a0)+,d0-d7
add.w d0,d0
move.w (a1,d0.l),(a2)+



This obviously doesn't work as you expect.
Of course it does work as I expect, think again... ;)
(Hint: movem.w signextends registers)
Haha, tricky! I've never thought of that myself.
ijor

Post by ijor »

Shrimp wrote:
ijor wrote:
Shrimp wrote:128k conversion table
movem.w (a0)+,d0-d7
add.w d0,d0
move.w (a1,d0.l),(a2)+
This obviously doesn't work as you expect.
Of course it does work as I expect, think again... ;)
(Hint: movem.w signextends registers)
Wow, you are absolutely right. Very very nice!

Btw, the movep or the a7 solutions won't work for him. He needs word access at the hardware port.

Pera, can't you just swap/cross the data bus lines on the hardware?
Cyprian

Post by Cyprian »

Shrimp wrote:
It takes 12 cycles per word.
Yes, but my blitter solution also takes 12 cycles and is wordaligned (which is what he wants, as he intends to do an atapi driver if i got it correct)

I'll borrow your blitter setup code, if its ok with you =)

lea $ffff8a20.w,a0 ; ---------
move.l #$2,(a0)+ ; BliTTER Init xInc,yInc source
move.l #DATATOSWAP,(a0)+ ;
move.l #$ffffffff,(a0)+ ;
move.w #$ffff,(a0)+ ;
move.l #$2,(a0)+ ; xInc,yInc destination
move.l #DATATOSWAP,(a0)+ ;
move.w #$0001,(a0)+ ; nr words per line
move.w #NR_WORDS_TO_SWAP,(a0)+ ; nr lines
move.w #$0203,(a0)+ ;
move.w #$c088,(a0) ; bc test3

nice trick
ppera

Post by ppera »

ijor wrote: Pera, can't you just swap/cross the data bus lines on the hardware?
You were right about rol.w const,dX. We obviously used a data register instead of a constant out of inertia and habit, because not many instructions have built-in constants (moveq is another; what else?).

I did the byte swap on the IDE cable a long time ago, and it works flawlessly and fast (1400KB/sec).
http://www.ppest.org/atari/idepc.htm

But a SW solution is not a bad thing, and as I see, some people still like to push the 68000 to its limits. The current speed (with rol) is not so bad: some 360KB/sec.
ppera

Blitter codes

Post by ppera »

Unfortunately, the blitter codes given do not work well. There is no byte-swap at all, and there are some errors in Shrimp's code: for instance, putting $88 into Skew has the same effect as putting 8.

As far as I can see, it is not possible to perform a byte-swap with Atari's blitter, at least not at a decent speed. Masking does not change the byte order at all, and skew works such that the shifted-out bits are moved into the next word. That is good for shifting content 1 byte up, but a byte-swap is a different thing. Even a negative X-increment will not ensure the correct byte order at the target.
All I get with various settings is doubled bytes (every second one twice), or only every second byte transferred.
Too bad, because the speed was about 1050KB/sec (due to masking; skew did not affect it).