horizontal scrolling on ST

dml · Post by **dml** » Tue Apr 09, 2013 10:44 am

Zamuel_a wrote:
Has sync-scroll been used to assist low-CPU-load horizontal scrolling on an ST?
It must be possible, since many demo screens have perfect scrolling in all directions at 50fps, often in fullscreen too.

Sure - but I'm pretty wary of the fact that what you see happening isn't always what's actually happening, e.g. what has been unrolled into memory, or what needs to be done on every scanline (tying up the CPU) to maintain the scroll.

What I'm interested to know is whether *just* the base address can be moved with syncscroll at a fine grain, and the CPU left free.

I expect it is possible for the reason you stated - probably one or more of those menus couldn't work with the memory available unless it's possible. But finding some details would be interesting. E.g. can it go down to 16 pixels, or is it some crazy combination of horrible offsets and copies of the display? And just how horrible does it get?

SteveBagley · Post by **SteveBagley** » Tue Apr 09, 2013 11:52 am

dml wrote:I expect it is possible for the reason you stated - probably one or more of those menus couldn't work with the memory available unless it's possible. But finding some details would be interesting. E.g. can it go down to 16 pixels, or is it some crazy combination of horrible offsets and copies of the display? And just how horrible does it get?

The source code for parts of 'The Lost Boys' MindBomb demo was released via Budgie back in the day, including the sync-scroll mindbomb demo (although the presence of so many labels looking like the output of a disassembler always made me wonder about its provenance...). The source code comments and general feel of the code certainly suggest it can get to a sixteen pixel accuracy.

Steve
(PS. it looks like someone's uploaded them to the forum already — http://www.atari-forum.com/viewtopic.php?f=68&t=3123)

Dio · Post by **Dio** » Tue Apr 09, 2013 12:29 pm

dml wrote:What I'm interested to know is whether *just* the base address can be moved with syncscroll at a fine grain, and the CPU left free.

It's possible to get the shifter out of sync and start the screen at 4-, 8- or 12-pixel offsets, but it's difficult and not a very commonly used technique. (It may not be possible in one or more of the MMU/Shifter sync states that the ST can power on in.)

dml · Post by **dml** » Tue Apr 09, 2013 3:32 pm

Thanks, I'll take a look at MB when I get some time. I figured it was probably something like what I suggested up there ^^ but I wasn't really sure. Not opening the L/R border is promising, since if it needed to suck 100% CPU it would probably have done that too...

Yes Dio I saw that mentioned earlier and it is a pretty cool thing to figure out on an ST. However even just 16 pixel addressing of screen memory is seriously useful for closing the gap a bit with STE. I just hadn't seen a description of it as a separable thing from the other stuff that happens in those screens/menus.

ljbk · Post by **ljbk** » Tue Apr 09, 2013 7:57 pm

Hi !

Sync scroll was and can be used for horizontal scrolling.
It was used mostly in demos, but also in at least one game: Enchanted Lands: http://thalion.exotica.org.uk/games/enc ... nd/el.html
http://www.youtube.com/watch?v=5lnbS-ILkkI

The principle is simple: you use 1 video screen per possible shift.
If you do a 4 pixel scroll, as most have done (almost all "mini-game" menus (the biggest one should be "Nostalgic")), then you have a screen for shift 0, another for shift 4, another for shift 8 and another for shift 12. Sprites must save the background. Only the new data appearing on the left or right has to be inserted using preshifted data or shifting in real time.
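As a rough sketch of the bookkeeping this implies (not code from any of the demos mentioned): for a 4-pixel scroll step, the fine part of the scroll position picks one of the four pre-shifted screens, and the coarse part becomes a byte offset into that screen, which is what sync-scroll would have to apply to the video base address. The function name `select_view` and its parameters are hypothetical; the only hardware fact assumed is that 16 pixels occupy 8 bytes in ST low resolution (4 bitplanes of one word each).

```python
BYTES_PER_16_PIXELS = 8  # 4 bitplanes x one 16-bit word in ST low resolution


def select_view(scroll_x, step=4):
    """Return (buffer_index, byte_offset) for a horizontal scroll position.

    scroll_x is the scroll position in pixels, step the scroll granularity.
    buffer_index selects the screen pre-shifted by 0, 4, 8 or 12 pixels,
    byte_offset is the coarse offset in 16-pixel units.
    """
    fine = scroll_x % 16
    if fine % step != 0:
        raise ValueError("scroll position is not a multiple of the scroll step")
    buffer_index = fine // step
    byte_offset = (scroll_x // 16) * BYTES_PER_16_PIXELS
    return buffer_index, byte_offset


if __name__ == "__main__":
    for x in (0, 4, 12, 16, 36):
        print(x, select_view(x))
```

Only the newly exposed column at the left or right edge then needs to be drawn into each buffer, which is where the pre-shifted or real-time shifted data comes in.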

Alien with his 4 bit hardware scroller: http://www.youtube.com/watch?v=eSpyQPld7rY
found a way to force the ST Shifter to start the screen using as bitplane 0, the data meant for bitplane 0, 1, 2, or 3.
So as a consequence you can have the screen starting displaced by 0, 4, 8 or 12 pixels.
But this only works in full screen and only on STFs, not STE.
The only advantage of this technique is to reduce the memory needs for the blocks to be inserted on the left or right because you don't need to preshift or realtime shift.
You will need the same 4 screen buffers: 1 where the bitmap starts at bitplane 0, 1 for bitplane 1 ...
There are disadvantages: you must use fullscreen (no left and right border) and you must do an extra cleaning on the left border of your screen if you want to hide a part of the trick otherwise you will get a flickering bitmap.

In 2006, I was looking for a way to do something like this without fullscreen. I found a way, but so far it only works with 1 of the 2 hardware wakeup states of the STFs. Maybe I will have to revisit that research one day.

Paulo.

dml · Post by **dml** » Tue Apr 09, 2013 9:52 pm

Hi Paulo,

ljbk wrote:Hi !
The principle is simple: you use 1 video screen per possible shift.
If you do a 4 pixel scroll, as most have done (almost all "mini-game" menus (the biggest one should be "Nostalgic")), then you have a screen for shift 0, another for shift 4, another for shift 8 and another for shift 12. Sprites must save the background. Only the new data appearing on the left or right has to be inserted using preshifted data or shifting in real time.

Thanks, that's a great answer. It is exactly how I thought it would work, if it could work at all.

ljbk wrote:Alien with his 4 bit hardware scroller: http://www.youtube.com/watch?v=eSpyQPld7rY
found a way to force the ST Shifter to start the screen using as bitplane 0, the data meant for bitplane 0, 1, 2, or 3.
So as a consequence you can have the screen starting displaced by 0, 4, 8 or 12 pixels.
But this only works in full screen and only on STFs, not STE.

A nice discovery (probably one of the most surprising and unexpected too).

Still, the most interesting thing is to have the CPU free for other work.

I was interested because I considered modifying the STE stuff here...

http://www.atari-forum.com/viewtopic.php?f=68&t=24166 (demo here: https://dl.dropbox.com/u/12947585/CCDEMO3.zip)

...to work on STF with at least vertical sync scrolling. But horizontal would be more interesting so I got curious about the limits.

ljbk wrote: In 2006, I was looking for a way to do something like this without fullscreen. I found a way, but so far it only works with 1 of the 2 hardware wakeup states of the STFs. Maybe I will have to revisit that research one day.

The wakeup states are interesting but I never found any way to influence them - only to 'work around' them with margins.

It is probably worth another look yes

ljbk · Post by **ljbk** » Wed Apr 10, 2013 7:07 am

Hi !

With Alien's discovery, you must work with no left and no right border so you must use synchro code. That costs around 10% CPU for the affected lines (sync switches) and is not very flexible especially for a game.
This would be the great advantage of a solution without the mandatory no left and no right border.

As far as i know, the SW can not influence the hardware wake up states. It can detect them and possibly handle them.

Paulo.

Dio · Post by **Dio** » Wed Apr 10, 2013 9:19 am

ljbk wrote:As far as i know, the SW can not influence the hardware wake up states.

At the outside, it's possible it could be done with the RESET instruction (which some of the custom chips can see but others can't) but even then I think it very unlikely.

danorf · Post by **danorf** » Thu Apr 11, 2013 12:06 am

ljbk wrote:Hi !

The code pattern is not repeated 100% equal. The registers, ORed to D0 and D1, before storing the result, change.
Attached you will find a zip with the program and a 32K pic.
This code is as it was done in the late 80s, so it is far from perfect.
Press '+' and '-' from the keypad for pixel jumps from 1 to 8. Starting value is 2. Any other key exits to TOS.
No roxl xx(An)/move.b/movep optimization is used.

Paulo.

Hi,

Thanks for sharing and explaining !
I managed to play a little with your code.
That's a very cool piece of code, simple and efficient.

ljbk wrote:This takes 112 cycles + 2 rol.l => around 136 cycles for 16 dots and 2 pixels shift !!!

It seems you were a little optimistic there. That figure holds for the 18 middle "16-pixel blocks", but the first and last "16-pixel blocks" of the line take longer: 240+8*2 cycles and 136+4*2 cycles respectively. In addition, for odd pixel shifts there is a penalty of 16 cycles for the first "16-pixel block" and 8 cycles for each of the other "16-pixel blocks", due to the misaligned bus access produced by the rol instruction.

So here is a table of what I get when calculating how many cycles your code takes:

Code:

shift (pixels)   cycles per 16-pixel block   cycles per 320-pixel line
1                143.20                      2864
2                143.20                      2864
3                151.60                      3032
4                151.60                      3032
5                160.00                      3200
6                160.00                      3200
7                168.40                      3368
8                168.40                      3368

As you said it wasn't optimized, I tried to see what could be done without spending too much time on it. Here is what I managed to do:

Code:

loop:

;first 16 pixels block
	move.l	d6,d5		;  4
	not.l	d5		;  6 --> 8

	movem.l (a1)+,d0-d3	; 12+8n --> 44

	movea.l	d0,a2		;  4
	movea.l	d1,a6		;  4 --> 8

	rol.l	d7,d0		;  8+2n
	and.l	d5,d0		;  8 --> 16+2n

	rol.l	d7,d1		;  8+2n
	and.l	d5,d1		;  8 --> 16+2n

	rol.l	d7,d2		;  8+2n
	movea.l	d2,a3		;  4
	and.l	d6,d2		;  8
	suba.l	d2,a3		;  8
	swap.w	d2		;  4
	add.l	d0,d2		;  8
	move.l	d2,(a0)+	; 12 --> 52+2n

	rol.l	d7,d3		;  8+2n
	movea.l	d3,a4		;  4
	and.l	d6,d3		;  8
	suba.l	d3,a4		;  8
	swap.w	d3		;  4
	add.l	d1,d3		;  8
	move.l	d3,(a0)+	; 12 --> 52+2n
	;			= 12+44+8+(16+2n)*2+(52+2n)*2 = 200+8n

;18 "middle" 16 pixels block
	movem.l	(a1)+,d0-d5	; 12+8n --> 60

	rol.l	d7,d0		;  8+2n
	movea.l	d0,a5		;  4
	and.l	d6,d0		;  8
	suba.l	d0,a5		;  8
	swap.w	d0		;  4
	add.l	a3,d0		;  8 --> 40+2n

	rol.l	d7,d1		;  8+2n
	movea.l	d1,a3		;  4
	and.l	d6,d1		;  8
	suba.l	d1,a3		;  8
	swap.w	d1		;  4
	add.l	a4,d1		;  8

	rol.l	d7,d2		;  8+2n
	movea.l	d2,a4		;  4
	and.l	d6,d2		;  8
	suba.l	d2,a4		;  8
	swap.w	d2		;  4
	add.l	a5,d2		;  8

	rol.l	d7,d3		;  8+2n
	movea.l	d3,a5		;  4
	and.l	d6,d3		;  8
	suba.l	d3,a5		;  8
	swap.w	d3		;  4
	add.l	a3,d3		;  8

	rol.l	d7,d4		;  8+2n
	movea.l	d4,a3		;  4
	and.l	d6,d4		;  8
	suba.l	d4,a3		;  8
	swap.w	d4		;  4
	add.l	a4,d4		;  8

	rol.l	d7,d5		;  8+2n
	movea.l	d5,a4		;  4
	and.l	d6,d5		;  8
	suba.l	d5,a4		;  8
	swap.w	d5		;  4
	add.l	a5,d5		;  8  --> 40+2n

	movem.l	d0-d5,(a0)	;  8+8n --> 56
	lea.l	24(a0),a0	;  8
	;                       = 60+6(40+2n)+56+8 --> 364+12n ~ 121,33+4n per 16 pixels block

	movem.l	(a1)+,d0-d5	; 12+8n --> 60

	rol.l	d7,d0		;  8+2n
	movea.l	d0,a5		;  4
	and.l	d6,d0		;  8
	suba.l	d0,a5		;  8
	swap.w	d0		;  4
	add.l	a3,d0		;  8 --> 40+2n

	rol.l	d7,d1		;  8+2n
	movea.l	d1,a3		;  4
	and.l	d6,d1		;  8
	suba.l	d1,a3		;  8
	swap.w	d1		;  4
	add.l	a4,d1		;  8

	rol.l	d7,d2		;  8+2n
	movea.l	d2,a4		;  4
	and.l	d6,d2		;  8
	suba.l	d2,a4		;  8
	swap.w	d2		;  4
	add.l	a5,d2		;  8

	rol.l	d7,d3		;  8+2n
	movea.l	d3,a5		;  4
	and.l	d6,d3		;  8
	suba.l	d3,a5		;  8
	swap.w	d3		;  4
	add.l	a3,d3		;  8

	rol.l	d7,d4		;  8+2n
	movea.l	d4,a3		;  4
	and.l	d6,d4		;  8
	suba.l	d4,a3		;  8
	swap.w	d4		;  4
	add.l	a4,d4		;  8

	rol.l	d7,d5		;  8+2n
	movea.l	d5,a4		;  4
	and.l	d6,d5		;  8
	suba.l	d5,a4		;  8
	swap.w	d5		;  4
	add.l	a5,d5		;  8  --> 40+2n

	movem.l	d0-d5,(a0)	;  8+8n --> 56
	lea.l	24(a0),a0	;  8

	movem.l	(a1)+,d0-d5	; 12+8n --> 60

	rol.l	d7,d0		;  8+2n
	movea.l	d0,a5		;  4
	and.l	d6,d0		;  8
	suba.l	d0,a5		;  8
	swap.w	d0		;  4
	add.l	a3,d0		;  8 --> 40+2n

	rol.l	d7,d1		;  8+2n
	movea.l	d1,a3		;  4
	and.l	d6,d1		;  8
	suba.l	d1,a3		;  8
	swap.w	d1		;  4
	add.l	a4,d1		;  8

	rol.l	d7,d2		;  8+2n
	movea.l	d2,a4		;  4
	and.l	d6,d2		;  8
	suba.l	d2,a4		;  8
	swap.w	d2		;  4
	add.l	a5,d2		;  8

	rol.l	d7,d3		;  8+2n
	movea.l	d3,a5		;  4
	and.l	d6,d3		;  8
	suba.l	d3,a5		;  8
	swap.w	d3		;  4
	add.l	a3,d3		;  8

	rol.l	d7,d4		;  8+2n
	movea.l	d4,a3		;  4
	and.l	d6,d4		;  8
	suba.l	d4,a3		;  8
	swap.w	d4		;  4
	add.l	a4,d4		;  8

	rol.l	d7,d5		;  8+2n
	movea.l	d5,a4		;  4
	and.l	d6,d5		;  8
	suba.l	d5,a4		;  8
	swap.w	d5		;  4
	add.l	a5,d5		;  8  --> 40+2n

	movem.l	d0-d5,(a0)	;  8+8n --> 56
	lea.l	24(a0),a0	;  8

	movem.l	(a1)+,d0-d5	; 12+8n --> 60

	rol.l	d7,d0		;  8+2n
	movea.l	d0,a5		;  4
	and.l	d6,d0		;  8
	suba.l	d0,a5		;  8
	swap.w	d0		;  4
	add.l	a3,d0		;  8 --> 40+2n

	rol.l	d7,d1		;  8+2n
	movea.l	d1,a3		;  4
	and.l	d6,d1		;  8
	suba.l	d1,a3		;  8
	swap.w	d1		;  4
	add.l	a4,d1		;  8

	rol.l	d7,d2		;  8+2n
	movea.l	d2,a4		;  4
	and.l	d6,d2		;  8
	suba.l	d2,a4		;  8
	swap.w	d2		;  4
	add.l	a5,d2		;  8

	rol.l	d7,d3		;  8+2n
	movea.l	d3,a5		;  4
	and.l	d6,d3		;  8
	suba.l	d3,a5		;  8
	swap.w	d3		;  4
	add.l	a3,d3		;  8

	rol.l	d7,d4		;  8+2n
	movea.l	d4,a3		;  4
	and.l	d6,d4		;  8
	suba.l	d4,a3		;  8
	swap.w	d4		;  4
	add.l	a4,d4		;  8

	rol.l	d7,d5		;  8+2n
	movea.l	d5,a4		;  4
	and.l	d6,d5		;  8
	suba.l	d5,a4		;  8
	swap.w	d5		;  4
	add.l	a5,d5		;  8  --> 40+2n

	movem.l	d0-d5,(a0)	;  8+8n --> 56
	lea.l	24(a0),a0	;  8

	movem.l	(a1)+,d0-d5	; 12+8n --> 60

	rol.l	d7,d0		;  8+2n
	movea.l	d0,a5		;  4
	and.l	d6,d0		;  8
	suba.l	d0,a5		;  8
	swap.w	d0		;  4
	add.l	a3,d0		;  8 --> 40+2n

	rol.l	d7,d1		;  8+2n
	movea.l	d1,a3		;  4
	and.l	d6,d1		;  8
	suba.l	d1,a3		;  8
	swap.w	d1		;  4
	add.l	a4,d1		;  8

	rol.l	d7,d2		;  8+2n
	movea.l	d2,a4		;  4
	and.l	d6,d2		;  8
	suba.l	d2,a4		;  8
	swap.w	d2		;  4
	add.l	a5,d2		;  8

	rol.l	d7,d3		;  8+2n
	movea.l	d3,a5		;  4
	and.l	d6,d3		;  8
	suba.l	d3,a5		;  8
	swap.w	d3		;  4
	add.l	a3,d3		;  8

	rol.l	d7,d4		;  8+2n
	movea.l	d4,a3		;  4
	and.l	d6,d4		;  8
	suba.l	d4,a3		;  8
	swap.w	d4		;  4
	add.l	a4,d4		;  8

	rol.l	d7,d5		;  8+2n
	movea.l	d5,a4		;  4
	and.l	d6,d5		;  8
	suba.l	d5,a4		;  8
	swap.w	d5		;  4
	add.l	a5,d5		;  8  --> 40+2n

	movem.l	d0-d5,(a0)	;  8+8n --> 56
	lea.l	24(a0),a0	;  8

	movem.l	(a1)+,d0-d5	; 12+8n --> 60

	rol.l	d7,d0		;  8+2n
	movea.l	d0,a5		;  4
	and.l	d6,d0		;  8
	suba.l	d0,a5		;  8
	swap.w	d0		;  4
	add.l	a3,d0		;  8 --> 40+2n

	rol.l	d7,d1		;  8+2n
	movea.l	d1,a3		;  4
	and.l	d6,d1		;  8
	suba.l	d1,a3		;  8
	swap.w	d1		;  4
	add.l	a4,d1		;  8

	rol.l	d7,d2		;  8+2n
	movea.l	d2,a4		;  4
	and.l	d6,d2		;  8
	suba.l	d2,a4		;  8
	swap.w	d2		;  4
	add.l	a5,d2		;  8

	rol.l	d7,d3		;  8+2n
	movea.l	d3,a5		;  4
	and.l	d6,d3		;  8
	suba.l	d3,a5		;  8
	swap.w	d3		;  4
	add.l	a3,d3		;  8

	rol.l	d7,d4		;  8+2n
	movea.l	d4,a3		;  4
	and.l	d6,d4		;  8
	suba.l	d4,a3		;  8
	swap.w	d4		;  4
	add.l	a4,d4		;  8

	rol.l	d7,d5		;  8+2n
	movea.l	d5,a4		;  4
	and.l	d6,d5		;  8
	suba.l	d5,a4		;  8
	swap.w	d5		;  4
	add.l	a5,d5		;  8  --> 40+2n

	movem.l	d0-d5,(a0)	;  8+8n --> 56
	lea.l	24(a0),a0	;  8
	;                       = 60+6(40+2n)+56+8 --> 364+12n ~ 121,33+4n per 16 pixels block
	;			total for 18 "middle" 16 pixels blocks : 2184+72n	

;last 16 pixels block
	move.l	a2,d0		;  4
	move.l	a6,d1		;  4 --> 8

	rol.l	d7,d0		;  8+2n
	and.l	d6,d0		;  8
	swap.w	d0		;  4
	add.l	a3,d0		;  8
	move.l	d0,(a0)+	; 12 --> 40+2n

	rol.l	d7,d1		;  8+2n
	and.l	d6,d1		;  8
	swap.w	d1		;  4
	add.l	a4,d1		;  8
	move.l	d1,(a0)+	; 12 --> 40+2n
	;                       = 88+4n

	subi.w	#128,d7		;  8
	bpl.w	loop		; 10-->12 B / 12 NB --> 20 / 20
	rts

Here is a table of the cycles taken by this code :

Code:

shift (pixels)   cycles per 16-pixel block   cycles per 320-pixel line
1                133.00                      2660
2                133.00                      2660
3                141.40                      2828
4                141.40                      2828
5                149.80                      2996
6                149.80                      2996
7                158.20                      3164
8                158.20                      3164
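The figures in this table can be cross-checked against the per-section totals written in the listing's comments (200+8n for the first block, 2184+72n for the 18 middle blocks, 88+4n for the last block, 20 for the loop). The sketch below is only a model of that arithmetic. It assumes that each `rol.l` costs 8+2n cycles rounded up to a multiple of 4, which is one way to express the odd-shift penalty described above and which reproduces the table; the names `rol_cycles` and `line_cycles` are made up for the illustration.

```python
# Constant terms taken from the listing's comments:
# first block, 18 middle blocks, last block, loop overhead.
FIXED_CYCLES = 200 + 2184 + 88 + 20

# Number of rol.l instructions per line, matching the 8n + 72n + 4n terms
# (each rol.l contributes 2n).
ROLS_PER_LINE = 4 + 36 + 2


def rol_cycles(shift):
    """Cost of rol.l by `shift` bits, rounded up to a multiple of 4 (assumption)."""
    raw = 8 + 2 * shift
    return (raw + 3) // 4 * 4


def line_cycles(shift):
    """Cycles for one 320-pixel line; the fixed part already holds 8 per rol.l."""
    return FIXED_CYCLES + ROLS_PER_LINE * (rol_cycles(shift) - 8)


if __name__ == "__main__":
    for shift in range(1, 9):
        total = line_cycles(shift)
        print(shift, total / 20, total)
```

Dividing by 20 gives the average per 16-pixel block, which is why the per-block column is not a whole number for most shifts.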

If someone can do better, feel free to modify the code I've posted and let us know.

Attached you'll find a patched version of SCLH.PRG running the modified code listed above.

ljbk wrote:In 2006, I was looking for a way to do something like this without fullscreen. I found a way, but so far it only works with 1 of the 2 hardware wakeup states of the STFs. Maybe I will have to revisit that research one day.

I'm really interested in this if you agree to share the source code.

Dio wrote:At the outside, it's possible it could be done with the RESET instruction (which some of the custom chips can see but others can't) but even then I think it very unlikely.

Unfortunately, the RESET instruction "halts" the CPU (OK, it's more of an idle wait state than a real halt) for a number of cycles that is a multiple of 4 (124 exactly). So, from what I've understood of the wakeup states so far (I must admit I have to reread a certain topic in more detail), it won't change anything.

mc6809e · Post by **mc6809e** » Thu Apr 11, 2013 3:03 am

I like Paulo's way. Modifying Paulo's code a little I get a total of 2512 cycles per scanline for a 3 or 4 pixel shift. The two pixel shift is 2352 cycles leaving about 10,000 cycles free at 16.67 fps. Not a lot. But maybe enough for a simple scroller.

Code:

;a0, a1 point to beginning of scanline
;d7 is shift count
;d6 is mask for zeroing upper bits
;eor is used to clear lower bits of two words

;prologue
;read 1st block of 16 pixels
move.l (a0)+, d2 ;8
move.l (a0)+, d3 ;8
lsl.l d7, d2 ;16
lsl.l d7, d3 ;16
;48 cycles

;read 2nd block of 16 pixels
move.l (a0)+, d0 ;8
move.l (a0)+, d1 ;8
rol.l d7, d0 ;16
rol.l d7, d1 ;16
move.l d0, d4 ;4
move.l d1, d5 ;4
and.l d6, d4 ;8
and.l d6, d5 ;8
eor.l d4, d0 ;8
eor.l d5, d1 ;8
swap d4 ;4
swap d5 ;4
or.l d4, d2 ;8
or.l d5, d3 ;8
;write 1st block of 16 pixels
move.l d2, (a1)+ ;8
move.l d3, (a1)+ ;8
;128 cycles

;read 3rd block
move.l (a0)+, d2
move.l (a0)+, d3
rol.l d7, d2
rol.l d7, d3
move.l d2, d4
move.l d3, d5
and.l d6, d4
and.l d6, d5
eor.l d4, d2
eor.l d5, d3
swap d4
swap d5
or.l d4, d0
or.l d5, d1
;write 2nd block
move.l d0, (a1)+
move.l d1, (a1)+
;read 4th block
move.l (a0)+, d0 
move.l (a0)+, d1 
rol.l d7, d0 
rol.l d7, d1 
move.l d0, d4 
move.l d1, d5 
and.l d6, d4 
and.l d6, d5 
eor.l d4, d0 
eor.l d5, d1 
swap d4 
swap d5 
or.l d4, d2 
or.l d5, d3 
;write 3rd block
move.l d2, (a1)+ 
move.l d3, (a1)+ 
;read 5th block
move.l (a0)+, d2
move.l (a0)+, d3

...

;read 20th block
move.l (a0)+, d0 
move.l (a0)+, d1 
rol.l d7, d0 
rol.l d7, d1 
move.l d0, d4 
move.l d1, d5 
and.l d6, d4 
and.l d6, d5 
eor.l d4, d0 
eor.l d5, d1 
swap d4 
swap d5 
or.l d4, d2 
or.l d5, d3 
;write 19th block
move.l d2, (a1)+ 
move.l d3, (a1)+ 
;2432 cycles for previous blocks

;epilogue
OR.L d0, (a2)+ ;new pixels from the right
OR.L d1, (a2)+ ; 
;write 20th block
move.l d0, (a1)+ ;
move.l d1, (a1)+ ;
;32 cycles

;2512 total cycles
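The totals quoted for this listing (2512 cycles for a 3 or 4 pixel shift, 2352 for a 2 pixel shift, about 10,000 cycles free at 16.67 fps) can be rechecked with a little arithmetic. The sketch below uses the cycle counts exactly as written in the listing's comments. Everything else is an assumption made for the illustration: 200 displayed lines, a PAL ST with 512 cycles per scanline and 313 scanlines per VBL, 16.67 fps meaning one update every 3 VBLs, and shift costs of 8+2n rounded up to a multiple of 4.

```python
PROLOGUE = 48           # from the listing's comments, with 16-cycle shifts
BLOCK = 128             # one read/rotate/merge/write step, with 16-cycle shifts
EPILOGUE = 32
BLOCK_STEPS = 19        # blocks 2 to 20
SHIFTS_PER_LINE = 2 + 2 * BLOCK_STEPS   # two lsl.l in the prologue, two rol.l per step

CYCLES_PER_VBL = 512 * 313   # assumption: PAL ST, 50 Hz
LINES = 200                  # assumption: full low-resolution screen height


def shift_cycles(shift):
    """Cost of lsl.l/rol.l by `shift` bits, rounded up to a multiple of 4 (assumption)."""
    return (8 + 2 * shift + 3) // 4 * 4


def line_cycles(shift):
    """Cycles per scanline, rescaling the 16-cycle shifts assumed in the comments."""
    base = PROLOGUE + BLOCK_STEPS * BLOCK + EPILOGUE
    return base + SHIFTS_PER_LINE * (shift_cycles(shift) - 16)


def free_cycles(shift, vbls_per_update=3):
    """Cycles left over per screen update after shifting every line."""
    return vbls_per_update * CYCLES_PER_VBL - LINES * line_cycles(shift)


if __name__ == "__main__":
    print(line_cycles(4), line_cycles(2), free_cycles(2))
```

These numbers use the timings as originally posted; the `move.l` costs are revised later in the thread.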

ljbk · Post by **ljbk** » Thu Apr 11, 2013 8:31 am

danorf wrote:
ljbk wrote:In 2006, I was looking for a way to do something like this without fullscreen. I found a way, but so far it only works with 1 of the 2 hardware wakeup states of the STFs. Maybe I will have to revisit that research one day.
I'm really interested in this if you agree to share the source code.

I will share that if I resume that research, because those tests were left unfinished and are not as well organized as my 25+ year old sources.

I can't share it at the moment because I would have to work out exactly which files do that among the hundreds of sync tests I did in 2006. Doing so would amount to resuming the research, with some ideas I have in my head.

But for the moment i am ST-busy with Hextracker as i have a lot of features and ideas i would like to include.

Paulo.

Eero Tamminen · Post by **Eero Tamminen** » Thu Apr 11, 2013 8:47 am

danorf wrote:Attached you'll find a patched version of SCLH.PRG running the modified code listed above.

Is this already supposed to do something user visible? I see just a black screen.

Btw. Attached is Hatari profiler output from running it for ~8s.

ljbk · Post by **ljbk** » Thu Apr 11, 2013 9:15 am

Eero Tamminen wrote:
danorf wrote:Attached you'll find a patched version of SCLH.PRG running the modified code listed above.
Is this already supposed to do something user visible? I see just black screen.

Btw. Attached is Hatari profiler output from running it for ~8s.

You probably need the 32K pic from the first zip i posted first ...

danorf · Post by **danorf** » Thu Apr 11, 2013 9:58 am

ljbk wrote:
Eero Tamminen wrote: Is this already supposed to do something user visible? I see just black screen.

Btw. Attached is Hatari profiler output from running it for ~8s.
You probably need the 32K pic from the first zip i posted first ...

I agree. I should have added the image file in my zip. Sorry.

Eero Tamminen · Post by **Eero Tamminen** » Thu Apr 11, 2013 11:34 am

ljbk wrote:You probably need the 32K pic from the first zip i posted first ...

Ah, right. Now I could profile the part where it wraps the screen edge. Percentage-wise the result is very slightly different. The whole trace is 240 VBLs:

Code:

profile on
b vbl="vbl+240" :once
continue
...
profile save SCLH2.txt

danorf · Post by **danorf** » Thu Apr 11, 2013 12:08 pm

Eero Tamminen wrote:
ljbk wrote:You probably need the 32K pic from the first zip i posted first ...
Ah, right. Now I could profile the part where it wraps the screen edge. Percentage-wise the result is very slightly different. The whole trace is 240 VBLs:
Code:
profile on
b vbl="vbl+240" :once
continue
...
profile save SCLH2.txt

As I'm not too familiar with the Hatari profiler, may I ask what this file is useful for and how to read it?

danorf · Post by **danorf** » Thu Apr 11, 2013 12:41 pm

mc6809e wrote:I like Paulo's way. Modifying Paulo's code a little I get a total of 2512 cycles per scanline for a 3 or 4 pixel shift. The two pixel shift is 2352 cycles leaving about 10,000 cycles free at 16.67 fps. Not a lot. But maybe enough for a simple scroller.

I like it too. Don't misunderstand me, if I have modified his code it's because I've found it interesting.

About your code :

In your prologue, the lsl instruction will clear the rightmost bits of the LSW but not those of the MSW. So when it comes to "or.l d4,d2" and "or.l d5,d3" before writing the first block, you'll get graphic glitches.
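To make the problem concrete, here is a small model of the rol/and/eor/swap/or sequence from the listing, working on one plane pair (a 32-bit value holding two bitplane words per 16-pixel block). It is a sketch for illustration, not a translation of the whole routine: the helper names are invented, the right edge is simply left empty instead of receiving new pixels, and `reference` is an independent check that shifts each bitplane as one long bit string.

```python
MASK32 = 0xFFFFFFFF


def rol32(value, n):
    """Rotate a 32-bit value left by n bits (0 < n < 32), like rol.l."""
    return ((value << n) | (value >> (32 - n))) & MASK32


def swap16(value):
    """Exchange the two 16-bit halves of a 32-bit value, like swap."""
    return ((value << 16) | (value >> 16)) & MASK32


def scroll_left(longs, n, first_with_lsl=False):
    """Scroll a row of plane-pair longs left by n pixels (1 <= n <= 15)."""
    mask = ((1 << n) - 1) * 0x00010001      # low n bits of both words (d6 in the listing)
    if first_with_lsl:
        prev = (longs[0] << n) & MASK32     # lsl.l: the upper word keeps stray bits
    else:
        prev = rol32(longs[0], n) & ~mask & MASK32
    out = []
    for cur in longs[1:]:
        rotated = rol32(cur, n)
        carry = rotated & mask              # bits that belong to the previous block
        out.append(prev | swap16(carry))    # merge them into the previous block
        prev = rotated ^ carry              # clear them from the current block
    out.append(prev)
    return out


def reference(longs, n):
    """Shift each bitplane of the row left by n bits as one long bit string."""
    count = len(longs)
    result = [0] * count
    for half in (16, 0):
        bits = 0
        for value in longs:
            bits = (bits << 16) | ((value >> half) & 0xFFFF)
        bits = (bits << n) & ((1 << (16 * count)) - 1)
        for i in range(count):
            result[i] |= ((bits >> (16 * (count - 1 - i))) & 0xFFFF) << half
    return result


if __name__ == "__main__":
    row = [0x8001FFFF, 0x12345678, 0xC0000003]
    print([hex(v) for v in scroll_left(row, 4)])
    print([hex(v) for v in scroll_left(row, 4, first_with_lsl=True)])
```

With the masked first block, the result matches the reference shift. With a plain `lsl.l` on the first block, the top bits of the lower word leak into the low bits of the upper word, and the following `or.l` cannot remove them.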

In addition, i'm pretty sure that move.l (an)+,dn and move.l dn,(an)+ cost 12 cycles not 8.

One last thing I noticed: the way you handle the data stored at (a2) and the way you manage this register. Perhaps I'm missing something, or you use it in a totally different way than Paulo did, but it looks strange to me.

danorf · Post by **danorf** » Thu Apr 11, 2013 12:48 pm

ljbk wrote:I will share that if I resume that research, because those tests were left unfinished and are not as well organized as my 25+ year old sources.
I can't share it at the moment because I would have to work out exactly which files do that among the hundreds of sync tests I did in 2006. Doing so would amount to resuming the research, with some ideas I have in my head.

But for the moment i am ST-busy with Hextracker as i have a lot of features and ideas i would like to include.

Paulo.

Ok, no problem.
I'll try to be here at the right time to see what you achieved on this subject and speak about it with you.

dml · Post by **dml** » Thu Apr 11, 2013 1:01 pm

danorf wrote: As I'm not too familiar with the Hatari profiler, may I ask what this file is useful for and how to read it?

(I'm answering this, only because I've used it quite a lot!)

It's essentially a simulation of the code, with cycle and activity information collected.

The disasm is followed by 4 figures collected by the simulation:

<%age of total encounters> ( <encounter sum>, <cycle sum>, <cache miss cycle sum> )

$01271e : cmp.l $0466.w,d0 6.24% (232032, 4642800, 0)

...the 6.24% figure means 6 percent of all instruction executions took place at this opcode. The opcode was visited 232032 times in total during the simulation, and the opcode cost 4642800 cycles in total. It stands out as a spin-wait on the VBL.

$01272c : lea $12758(pc),a1 0.00% (60, 480, 0)

This was visited only 60 times - too few to show as a %age, cost 480 cycles in total. (480/60 = average cost of 8 cycles per execution).

$0127e6 : move.l d6,d5 0.32% (12000, 48000, 0)

This one is executed a lot: 48000 cycles / 12000 visits = 4 cycles each.

The last figure (cache miss cycle sum) is only relevant on 020+ CPU so it's all zeroes.
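The per-instruction averages worked out by hand above are easy to compute for every line at once. The sketch below parses lines shaped like the three examples quoted in this post and adds the average cycles per visit. The regular expression is derived only from those examples, so the layout of a real profile file may differ, and `parse_profile_line` is a made-up helper name, not part of Hatari.

```python
import re

# Pattern based solely on the example lines quoted in this post (assumption).
LINE_RE = re.compile(
    r"^\$(?P<addr>[0-9a-fA-F]+)\s*:\s*(?P<insn>.+?)\s+(?P<pct>\d+\.\d+)%\s*"
    r"\((?P<visits>\d+),\s*(?P<cycles>\d+),\s*(?P<misses>\d+)\)\s*$"
)


def parse_profile_line(line):
    """Return a dict for one profiled instruction, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    visits = int(match.group("visits"))
    cycles = int(match.group("cycles"))
    return {
        "address": int(match.group("addr"), 16),
        "instruction": match.group("insn"),
        "percent": float(match.group("pct")),
        "visits": visits,
        "cycles": cycles,
        "avg_cycles": cycles / visits if visits else 0.0,
    }


if __name__ == "__main__":
    samples = [
        "$01271e : cmp.l $0466.w,d0 6.24% (232032, 4642800, 0)",
        "$01272c : lea $12758(pc),a1 0.00% (60, 480, 0)",
        "$0127e6 : move.l d6,d5 0.32% (12000, 48000, 0)",
    ]
    for sample in samples:
        info = parse_profile_line(sample)
        print(info["instruction"], round(info["avg_cycles"], 2))
```

Sorting the parsed lines by total cycles rather than by visit percentage is one way to spot where the time actually goes.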

I'm sure Eero will offer a much more in-depth version! :-p

(I think for this sort of cycle-counting research/experiment work on ST, showing the opcode cycle counts directly would make things clearer but for finding bottlenecks in a bigger program it works well as it is).

danorf · Post by **danorf** » Thu Apr 11, 2013 1:21 pm

dml wrote:(I'm answering this, only because I've used it quite a lot!)

Many thanks !

mc6809e · Post by **mc6809e** » Thu Apr 11, 2013 2:45 pm

danorf wrote:
About your code :

In your prologue, the lsl instruction will clear the rightmost bits of the LSW but not those of the MSW. So when it comes to "or.l d4,d2" and "or.l d5,d3" before writing the first block, you'll get graphic glitches.

In addition, i'm pretty sure that move.l (an)+,dn and move.l dn,(an)+ cost 12 cycles not 8.

One last thing I noticed: the way you handle the data stored at (a2) and the way you manage this register. Perhaps I'm missing something, or you use it in a totally different way than Paulo did, but it looks strange to me.

Yeah, you're right of course. I originally had an AND.L to mask those bits out. Changed at the last minute.

move.l dn, (an)+ is 12 cycles, too, so I guess it's going to take a bit longer than I had hoped. I was thinking move.w dn, (an)+. That adds about 320 cycles to the analysis. Dang.
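The "about 320 cycles" figure follows from counting the `move.l` memory accesses in the listing and charging 4 extra cycles for each. The sketch below only counts the plain `move.l` reads and writes (two reads in the prologue, two reads and two writes per block step, two writes in the epilogue); the `or.l` to memory in the epilogue is left out, and the function name is invented for the illustration.

```python
def move_l_accesses(block_steps=19):
    """Count move.l memory reads and writes per scanline in the listing."""
    prologue_reads = 2
    per_step = 2 + 2          # two reads and two writes in each block step
    epilogue_writes = 2
    return prologue_reads + block_steps * per_step + epilogue_writes


ASSUMED_CYCLES = 8    # value used in the listing's comments
ACTUAL_CYCLES = 12    # corrected value for move.l (an)+,dn and move.l dn,(an)+

extra_cycles = move_l_accesses() * (ACTUAL_CYCLES - ASSUMED_CYCLES)

if __name__ == "__main__":
    print(move_l_accesses(), extra_cycles)
```

Added to the 2512 cycles quoted earlier, this gives 2832 cycles per scanline for a 3 or 4 pixel shift, before any correction for the `or.l` instructions.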

And the data that is at (a2) is simply whatever needs to be shifted in from the right. There are many schemes for shifting in new pixels. I just used something generic to get a good idea of the timing.

danorf · Post by **danorf** » Thu Apr 11, 2013 3:09 pm

mc6809e wrote:And the data that is at (a2) is simply whatever needs to be shifted in from the right. There are many schemes for shifting in new pixels. I just used something generic to get a good idea of the timing.

In your code, you only write at (a2)+ and you never read data pointed by this register, that's why it looks strange to me...

Eero Tamminen · Post by **Eero Tamminen** » Thu Apr 11, 2013 3:35 pm

dml wrote: (I think for this sort of cycle-counting research/experiment work on ST, showing the opcode cycle counts directly would make things clearer

It might make people trust them too much.

While Hatari emulation is pretty cycle-accurate (it runs more games & demos than any other emulator), especially its ST/e emulation, and it takes "instruction pairing" into account, it's not perfect. Some things are approximations, which might not be accurate in some specific cases (there's a discussion thread about 68000 prefetch, which probably contains some things that Hatari's 68000 emulation doesn't fully take into account).

Note that I invoked the debugger manually and that always happens at VBL (it's the point at which Hatari refreshes the screen and processes key input). If one wants profiling to start at some other point, that needs a specific breakpoint.

Eero Tamminen · Post by **Eero Tamminen** » Thu Apr 11, 2013 3:44 pm

dml wrote: $01271e : cmp.l $0466.w,d0 6.24% (232032, 4642800, 0)

...the 6.24% figure means 6 percent of all instruction executions took place at this opcode. The opcode was visited 232032 times in total during the simulation, and the opcode cost 4642800 cycles in total. It stands out as a spin-wait on the VBL.

I guess one can use that percentage directly as an average of how much CPU there's free to do other things?

If there are things that happen only occasionally, one can use the percentages to see how much CPU those occasional things take. In a larger program one would add DRI/GST symbols to the binary and use the profile post-processor to sum all instruction costs for a given subroutine/symbol together, so that one doesn't need to do such calculations manually. This helps in deciding how to split the work between frames.

danorf · Post by **danorf** » Thu Apr 11, 2013 4:12 pm

@Eero Tamminen : thanks for these explanations.
