What is "sync scrolling"?

alien
Atari maniac
Posts: 97
Joined: Sat May 01, 2004 4:01 am
Location: USA

Postby alien » Wed Apr 27, 2005 7:10 am

Gunstick wrote:
ljbk wrote:- line disabling sync switch: 60/50 switch per line that disables the display while memory is still read; works on STF and STE (used in the Overscan Demos)

I never found any use for that switch. What did you use it for?

George


Well you can use it to make the screen go up and down...
Also I think I used it in one of my experimental hardscroll routines.
IIRC you could also do it with a mono switch...
Alien / ST-Connexion

Postby alien » Wed Apr 27, 2005 7:20 am

ljbk wrote:I have a nice screen working on STF that would be impossible for the same amount of memory without that trick.
If i have time i will try to do an STE code path for it using the STE hardware, with Steem as the STE since i only have an STF, so that anyone can have a look at it and so that there is at least a second screen using this trick.


Yes post it ! I'd like to see it !
I was actually surprised people didn't use it more, given that I had published full source in ST Magazine.

There's actually another demo with it... but it's on my harddrive in Germany (if that's still alive). If you were at the ICC ... you may remember the beginnings of a PAC-MAN in isometric 3D fullscreen, with artwork by Krazy Rex. Unfortunately I never had the time to finish it.

Sengan
Alien / ST-Connexion

leonard
Moderator
Posts: 660
Joined: Thu May 23, 2002 10:48 pm

Postby leonard » Wed Apr 27, 2005 8:09 am

I never found any use for that switch. What did you use it for?


I think (maybe I'm wrong) Paulo is speaking about line disabling: the line is not shown (just black, even if the palette is set), but memory is still read. He is (to my knowledge) the only one to have used it, in some screens of the "Overscan Demos", to avoid nasty flickering pixels at the top right of the screen.

And of course, that rare effect is emulated in SainT, especially for "overscan demos" :-)
Leonard/OXYGENE.

Postby leonard » Wed Apr 27, 2005 8:11 am

There's actually another demo with it... but it's on my harddrive in Germany (if that's still alive). If you were at the ICC ... you may remember the beginnings of a PAC-MAN in isometric 3D fullscreen, with artwork by Krazy Rex. Unfortunately I never had the time to finish it.


Full-screen pacmania ? I guess with soundtracker ? :-)

Oh please, search on your harddisk to find that beauty, even if it's just a preview ! I'd love to see that gem with Krazy's gfx !! (so maybe the little "evil pacman" logo drawn by Krazy Rex in the "froggies" multipart comes from that game ??)
Leonard/OXYGENE.

ggn
Atari God
Posts: 1258
Joined: Sat Dec 28, 2002 4:49 pm

Postby ggn » Wed Apr 27, 2005 12:58 pm

Gunstick wrote:
Gunstick wrote:I still remember when I demoed a preview of the parallax dist at a party in Marseille. There was no clear screen before the scroll came in so one could very easily see the trick. What I did is I jumped in front of the TV to cover it up until all the scrollers were on screen. Funny.

Unfortunately I could not find the version I demoed at the party.


eureka! I found a box labeled ULM and in there were tons of floppy disks. One of them said: megadistorter v35 (marseille)
So here it is, the exact thing appearing on the TV screens at the NeXT convention in Marseilles. Have fun.

Georges



Wooooo! Debug symbols too ;)

George
is 73 Falcon patched atari games enough ? ^^

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Postby ljbk » Wed Apr 27, 2005 3:54 pm

alien wrote:
Gunstick wrote:
ljbk wrote:- line disabling sync switch: 60/50 switch per line that disables the display while memory is still read; works on STF and STE (used in the Overscan Demos)

I never found any use for that switch. What did you use it for?

George


Well you can use it to make the screen go up and down...
Also I think I used it in one of my experimental hardscroll routines.
IIRC you could also do it with a mono switch...


I am not talking about the 0 bytes lines: i know about that effect too both with a 60/50 switch and with a 71/50 switch but there the timing is so critical that it does not always work in the same way with my own STF.

I am talking, as Leonard said, about a line that is read by the Shifter, with or without borders, while Display Enable is not active, using a 60/50 switch.
What is this good for ?
Well, on the TV i have used as a color monitor for the STF ever since i got it (almost 20 years :!: ), i can see about 24 pixels of the left border and 52 pixels of the right border when the picture is in fullscreen mode: i can even see the difference between the common way to get stabilization and the ULM one. This means that i can see 24+320+52 = 396 pixels in total, corresponding to 396 clock cycles. I then have only 116 cycles in which i can change the 16 color palette without visible screen impact. If you don't do it with MOVEM.L (and sometimes you can't, as you don't have enough free registers or time), you will need more than 116 cycles to update the 16 colors.
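
The budget above can be written down as a small Python sketch. The 512 cycles per 50 Hz scanline is an assumption inferred from 396 + 116, and the instruction costs are the usual 68000 timing-table figures, taken here as assumptions rather than measurements on an ST.

```python
# Rough cycle budget for rewriting the 16 colour palette inside one scanline.
# Assumptions (not stated in the post): one 50 Hz scanline lasts 512 CPU
# cycles, and the instruction costs are the standard 68000 table values.

LINE_CYCLES = 512                 # assumed length of one 50 Hz scanline
VISIBLE_PIXELS = 24 + 320 + 52    # left border + screen + right border on the TV

# In low resolution one pixel lasts one CPU cycle, so pixels == cycles here.
hidden_cycles = LINE_CYCLES - VISIBLE_PIXELS

# Cost in cycles of three ways to write all 16 palette words.
palette_updates = {
    "movem.l d0-d7,$ffff8240.w": 12 + 8 * 8,   # registers already loaded
    "8 x move.l (a0)+,(a1)+": 8 * 20,
    "16 x move.w #imm,$ffff82xx.w": 16 * 16,
}


def fits_in_hidden_part(cycles: int) -> bool:
    """True if the update can finish while nothing visible is displayed."""
    return cycles <= hidden_cycles


if __name__ == "__main__":
    print(f"visible: {VISIBLE_PIXELS} cycles, hidden: {hidden_cycles} cycles")
    for name, cycles in palette_updates.items():
        verdict = "fits" if fits_in_hidden_part(cycles) else "does not fit"
        print(f"{name}: {cycles} cycles, {verdict}")
```

Under these assumptions only the MOVEM.L variant fits in the hidden part of the line, and only if the registers were loaded beforehand.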

With this trick applied on the last line of the hardscroll lines, i can set my complete palette at any time and there will not be any screen impact.
This trick is used in the Overscan demos and also with YM50K and is emulated by SainT and STEem 3.2 after Russ checked it on STF and STE.
If you look at TCB Cuddly fullscreen, one can see that they are setting their palette at the middle of the line as you have some flickering dots.
With this trick no data would have been displayed for this line and so no flicker.
You can also check the Overscan Demos in STEem 3.1 versus 3.2 and look at the first line for the 2 screens using hard scroll.


I discovered part of these switch effects myself while playing GoldRunner.
There was a F key to switch from 50 to 60 and vice versa.
I noticed that when going to 60 the screen was like extended and starting earlier.
So i built my first test program by setting 60 Hz at the VBL and getting back to 50 after the screen started. Surprise, surprise, my screen had now 229 lines: the upper border was removed. But there was a problem with the first lines as the screen was bending to the left, so i never used the first 8 lines (empty bitmap with color 0). Soon after that i found the lower border switch in a logical way. These were the tricks i used in my first ever demo, called the Spectrum 512 Scrollshow back in February 1989. That demo had 4 screens shown one after the other and the last one was a bunch of Spectrum 512 pictures with a vertical scrolling of 8 pixels that you could hold with space and some raster bars simulating a circle.
As for the remaining sync switches i only tried to find them a bit later after i received from Daryl (TEX), the Amiga-Demo and the Union-Demo.
There it was obvious that a non bending top border bitmap was possible.

Postby ljbk » Wed Apr 27, 2005 4:24 pm

alien wrote:The 6 cycle thing is interesting. On my ST they were always rounded up to the nearest multiple of 4, but mine was from 1987. I did get the odd Spectrum 512 distortion though depending on when I turned it on. Basically it sounds like an MMU problem because the MMU would have to hold the 68000 off from fetching instructions from memory during the non 4x cycles. I started coding a 68000 in verilog just to see, and a simple implementation would need to access memory on cycle 0 for instruction fetch and on cycle 2 for data fetch (16 bits). 32 bit accesses would be on cycles 2 and 6. Therefore if the 68000 accessed data on cycle 6 but not 2, the MMU should delay the response... but perhaps they forgot to do that on early MMU's.



I mean the CPU execution times are also rounded up on my ST to the next multiple of 4, unless the next instruction is also a non multiple and then they pair: two EXG will take 12 cycles.
But maybe the 68000 receives 2 memory access clocks from the MMU.

Program A:

nop               4 cycles
exg d0,d0         6 ( 8 ) cycles
move.b d0,(a0)    8 cycles   60 Hz


Program B:

nop               4 cycles
nop               4 cycles
nop               4 cycles
move.b d0,(a0)    8 cycles   60 Hz

The only way the 2 above programs may result in a different behaviour, for the same ST without Cold Reset, is if the Program A switch to 60 Hz occurs 2 cycles before it occurs with Program B.


As for my email: ljbk@oniduo.pt


If you find your extra 4 bit scroller screen(s) or game(s), please post them.
As this method saves some memory, there is more room for extra stuff like Soundtracker sound, so it should look and sound very nice :) !


Paulo
Last edited by ljbk on Wed Apr 27, 2005 8:02 pm, edited 1 time in total.

ijor
Hardware Guru
Posts: 3960
Joined: Sat May 29, 2004 7:52 pm

Postby ijor » Wed Apr 27, 2005 7:20 pm

ljbk wrote:After having read about it a lot, i found the ST magazine articles written by Alien in issues 51, 52 and 55. Scans of these magazines are available in the web at http://didier.letot.club.fr/St%20mag%20page%202.htm .

According to those articles the "power-up states" are related to the MMU deciding which memory cycle(s) is(are) for the 68000 and which is for the Shifter


Interesting, but something is missing in this description. It doesn't matter exactly which cycles are granted to the CPU and which to the Shifter. It only matters in relation to something else.

You can consider this CPU/Shifter round-robin access as a 4 MHz clock maintained by GLUE (it is GLUE, and not the MMU, that grants bus access). Then every period of this clock, bus access is rotated between the CPU and the Shifter.

It doesn't matter if the CPU is given the even periods (counting from power up), or if the Shifter is. What matters is the phase relationship to some other counter or clock. So what is missing to complete the explanation is which other counter(s) are related.

Postby ijor » Wed Apr 27, 2005 7:32 pm

alien wrote:The 6 cycle thing is interesting... I started coding a 68000 in verilog just to see, and a simple implementation would need to access memory on cycle 0 for instruction fetch and on cycle 2 for data fetch (16 bits). 32 bit accesses would be on cycles 2 and 6.


Can't be. Bus cycles on the 68000 take (minimum) 4 cycles internally. If prefetch was done on cycle 0, the CPU won't attempt to access the bus again before cycle 4.

That's why Atari chose that 4 cycle design. Normally the CPU itself will align all accesses to the bus on a 4 cycle boundary. Only when the CPU takes some extra cycles without accessing the bus, and only when those extra cycles are not divisible by four, is the CPU delayed by GLUE. This applies when accessing memory; accessing slower devices might produce extra wait states regardless of the cycle alignment.

Postby alien » Thu Apr 28, 2005 4:51 am

leonard wrote:
Full-screen pacmania ? I guess with soundtracker ? :-)

Oh please, search on your harddisk to find that beauty, even if it's just a preview ! I'd love to see that gem with Krazy's gfx !! (so maybe the little "evil pacman" logo drawn by Krazy Rex in the "froggies" multipart comes from that game ??)


Yup, it had a soundtracker. In fact most of the code in the Punish Your Machine demo was taken from the game (minus sprites etc)... unfortunately the harddrive is in Germany, I'm in the USA, and the last 2 times my wife went back to Germany she didn't bring the harddrive back as I asked. So... maybe in another year or two. Or maybe I'll just have to rewrite it. I'll ask Krazy Rex if he still has the pictures. Yes the evil pacman logo came from the game.
Alien / ST-Connexion

Postby alien » Thu Apr 28, 2005 4:54 am

ijor wrote:
alien wrote:The 6 cycle thing is interesting... I started coding a 68000 in verilog just to see, and a simple implementation would need to access memory on cycle 0 for instruction fetch and on cycle 2 for data fetch (16 bits). 32 bit accesses would be on cycles 2 and 6.


Can't be. Bus cycles on the 68000 take (minimum) 4 cycles internally. If prefetch was done on cycle 0, the CPU won't attempt to access the bus again before cycle 4.

That's why Atari chose that 4 cycle design. Normally the CPU itself will align all accesses to the bus on a 4 cycle boundary. Only when the CPU takes some extra cycles without accessing the bus, and only when those extra cycles are not divisible by four, is the CPU delayed by GLUE. This applies when accessing memory; accessing slower devices might produce extra wait states regardless of the cycle alignment.


Yes you're right. I forgot that I had tried to optimize that behaviour. I guess the way 6 + 6 cycles = 12 cycles and not 16 is that the 68000 prefetches one operation in advance (a trick used IIRC by the Union Demo to break debuggers)
Alien / ST-Connexion

Postby leonard » Thu Apr 28, 2005 8:30 am

a trick used IIRC by the Union Demo to break debuggers


and to break emulators !! :-) these prefetch tricks are a real nightmare to emulate (in some *rare* cases everything happened as if 32 bits were prefetched instead of only 16 !)
Leonard/OXYGENE.

Postby ijor » Thu Apr 28, 2005 3:04 pm

alien wrote:I guess the way 6 + 6 cycles = 12 cycles and not 16 ...


leonard wrote:and to break emulators !! these prefetch tricks are a real nightmare to emulate (in some *rare* cases everything happened as if 32 bits were prefetched instead of only 16 !)


Both issues boil down to the same undocumented aspect of the CPU: the cycle by cycle execution order of an instruction.

The prefetch trick has different effects depending exactly at which point on the instruction the prefetch cycle(s) are performed.

For a 6+6 cycles to get a net execution of 12 cycles, you also need a special positioning of the prefetch cycles. It can only happen if the first instruction prefetches on cycle 0 of the instruction, and the second prefetches on cycle 2. So that, counting from the first instruction, the first prefetch is executed on cycle 0 and the second on cycle 8.

At least this is the theory. It would be interesting to hear more insight about the 6+6 from Leonard and Russ.

Postby leonard » Thu Apr 28, 2005 3:47 pm

At least this is the theory. It would be interesting to hear more insight about the 6+6 from Leonard and Russ


Humm.... have to admit the prefetch seems to be better handled in Steem, and still bugged in SainT.

SainT originally used the Starscream library (a free 68000 emulation library). By default, Starscream does not handle cycle counting properly, does not handle interrupts properly, and does not emulate prefetch at all. James Boulton (original SainT author) and I have made many, many patches and fixes in the Starscream engine, and I worked a lot on the prefetch.
But to be honest, my prefetch routine is very "empirical". Actually, I "prefetch" 32 bits in all cases, which IS wrong (I know). There are some cases where a demo crashes just because of the 32-bit prefetch (e.g. the "flashback" screen by TCB). That's why SainT has been at version 1.99x for a long time: I would like to get the prefetch working before releasing a 2.0, but I haven't worked on the prefetch for a while.

Here is what I remember from the last time I worked on it:

simple case:

lea label(pc),a0
move.w #$4afc,(a0)
label:
nop

nop is prefetched, and no illegal occurs. This is simple (16-bit prefetch)

simple case where prefetch is 16bits only :

lea label+2(pc),a0
move.l d0,(a0) ; patch the JMP address
label:
jmp $0

the jump address is NOT prefetched, so my "hardcoded 32-bit" prefetch could fail in that case

*strange* "32bits like" prefetch case:

lea label+2(pc),a0
add.l #$12345678,(a0)
label: move.l #$00000000,d0

on a real ST, d0 will contain only $00005678 (and not $12345678 as I supposed).
Note that it only happens if the modifying instruction does a read/write ("add.l" in this case; it should also work with EOR or SUB, but not with move)

If anyone has a general rule for that, he's welcome :-)

68000 rules !
Leonard/OXYGENE.

Postby alien » Thu Apr 28, 2005 7:52 pm

leonard wrote:
At least this is the theory. It would be interesting to hear more insight about the 6+6 from Leonard and Russ


Humm.... have to admit the prefetch seems to be better handled in Steem, and still bugged in SainT.

SainT originally used the Starscream library (a free 68000 emulation library). By default, Starscream does not handle cycle counting properly, does not handle interrupts properly, and does not emulate prefetch at all. James Boulton (original SainT author) and I have made many, many patches and fixes in the Starscream engine, and I worked a lot on the prefetch.
But to be honest, my prefetch routine is very "empirical". Actually, I "prefetch" 32 bits in all cases, which IS wrong (I know). There are some cases where a demo crashes just because of the 32-bit prefetch (e.g. the "flashback" screen by TCB). That's why SainT has been at version 1.99x for a long time: I would like to get the prefetch working before releasing a 2.0, but I haven't worked on the prefetch for a while.

Here is what I remember from the last time I worked on it:

simple case:

lea label(pc),a0
move.w #$4afc,(a0)
label:
nop

nop is prefetched, and no illegal occurs. This is simple (16-bit prefetch)

simple case where prefetch is 16bits only :

lea label+2(pc),a0
move.l d0,(a0) ; patch the JMP address
label:
jmp $0

the jump address is NOT prefetched, so my "hardcoded 32-bit" prefetch could fail in that case

*strange* "32bits like" prefetch case:

lea label+2(pc),a0
add.l #$12345678,(a0)
label: move.l #$00000000,d0

on a real ST, d0 will contain only $00005678 (and not $12345678 as I supposed).
Note that it only happens if the modifying instruction does a read/write ("add.l" in this case; it should also work with EOR or SUB, but not with move)

If anyone has a general rule for that, he's welcome :-)

68000 rules !


- I expect the prefetch unit is lower priority than data reads and writes

- When the add occurs there's a read, followed by a write. If there is a 4 cycle gap between the read and the write, the prefetch unit will use that time to read in what's there (i.e. 0000), while passing its contents (move.l) into the instruction decode stage. The add completes, but the 0000's have already been prefetched.

- In the other cases, there is a write, and because there is nothing to do with the result of the read, the access is atomic: the next instruction is prefetched and then the write happens, which prevents the prefetch from getting anything further.

- I was expecting the following code not to show prefetching though:

  lea foobar(pc), a0
  move.l a0, -(a7)
  move.w #$4afc, d0
  lea label(pc), a1
  jmp (a1)
  nop
  nop
  nop

foobar:
  move.w d0, (a0)
label:
   nop
   nop
   rts

because I thought the jmp to foobar would invalidate the prefetch buffer, and the move would be executed as soon as it was read leaving no time for a prefetch to take place. But apparently it did (on a real ST):

Memory bus activity at Cycle
0: move.w is fetched
4: move.w writes
8: illegal is fetched

I don't see how illegal got read before move.w writes unless jmp reads the move.w during its time, and at cycle 0, the nop gets read.

It might be interesting to time these events versus the shifter and see if a tell-tale 4 cycles shows up for the prefetch to happen. If it doesn't, the jmp must include it. After all it's an 8 cycle instruction, but only a single word. Writing PC shouldn't take any longer than writing a0.

Sengan
Alien / ST-Connexion

Postby alien » Thu Apr 28, 2005 7:57 pm

ijor wrote:For a 6+6 cycles to get a net execution of 12 cycles, you also need a special positioning of the prefetch cycles. It can only happen if the first instruction prefetches on cycle 0 of the instruction, and the second prefetches on cycle 2. So that, counting from the first instruction, the first prefetch is executed on cycle 0 and the second on cycle 8.


I don't understand. I'll take 2 exchanges (register to register) to make it simple

Some time before Cycle 0: EXG1 is prefetched
Cycle 0: EXG1 has been prefetched, EXG2 is prefetched
Cycle 4: Nothing happens (EXG2 is still waiting in the prefetch queue, EXG1 not yet done)
Cycle 6: EXG1 done, EXG2 flushed from prefetch queue, Prefetch stalled until it can read memory
Cycle 8: Next instruction is prefetched.

Sengan
Alien / ST-Connexion

Postby ijor » Thu Apr 28, 2005 8:56 pm

alien wrote:I don't understand. I'll take 2 exchanges (register to register) to make it simple

Some time before Cycle 0: EXG1 is prefetched
Cycle 0: EXG1 has been prefetched, EXG2 is prefetched
Cycle 4: Nothing happens (EXG2 is still waiting in the prefetch queue, EXG1 not yet done)
Cycle 6: EXG1 done, EXG2 flushed from prefetch queue, Prefetch stalled until it can read memory
Cycle 8: Next instruction is prefetched.

Sengan



This is correct, and that’s why I said you need “special positioning”. For two 6 cycle instructions to “pair” they must be different. Furthermore, they must have a different 4 cycle alignment of the prefetch cycle(s) with respect to the start of the instruction execution. Two “EXG” instructions don’t pair; they have a net execution time of 8 cycles each.

Maybe you are thinking in terms of a modern processor. A modern processor has independent units, such as sequencer, bus manager, ALU, etc. And for example, if the bus manager and prefetch unit are waiting for external memory, the sequencer and ALU can keep running.

But the 68k doesn’t work like this. If the prefetch is stalled because of external wait states, then the whole CPU is stalled. And if an instruction performs the prefetch in its first cycle, then it always does. That’s why you need two different instructions to pair two 6 cycle instructions.

A small correction, but not relevant to the issue. When the first EXG starts execution, both EXG instructions are already prefetched. The prefetch is two words, not one.

Postby ijor » Thu Apr 28, 2005 9:42 pm

leonard wrote:If anyone have an general rule for that, he's welcome


From what I’ve seen in protections, and from your tests, I think I can summarize some rules (not perfect, and possibly not complete).

The prefetch queue is two words (32 bits). This is a fact, it is mentioned in the manuals. But the manuals are sometimes misleading, because the two words have different names. They are sometimes referenced as the “prefetch queue” and as the “decode register”.

The prefetch is done not only for the opcode, but for all the extension words as well. An instruction performs as many prefetch cycles as the number of words that make up the whole instruction.

The exact effect of prefetch tricks depends on the cycle by cycle execution order of each instruction. More precisely, it depends on whether the last prefetch cycle is executed before or after the last memory write.

One prefetch word can never be overwritten by the previous instruction. In other words, no prefetch trick can affect the opcode word of the next instruction.

Read/modify instructions produce a two words (32 bits) prefetch effect. More than likely, and as Alien said, this is because the last prefetch cycle is performed during the internal logical/math operation and then before the last write cycle.

Move instructions produce a one word (16-bits) prefetch effect. Suggesting that the last (or only) prefetch cycle is executed after the memory write cycle.

leonard wrote:*strange* "32bits like" prefetch case:

lea label+2(pc),a0
add.l #$12345678,(a0)
label: move.l #$00000000,d0

on real ST, d0 will contain only $00005678 ( and not 12345678 as I supposed).


This is because the prefetch queue is just two words. To get a $12345678 result you’d need a prefetch of 3 words. The first two words of the “move” instruction are prefetched during the previous (add.l) instruction. But the third word, the one holding the low word of the long immediate operand, has no choice but to be prefetched during the execution of the “move” instruction. By this time, obviously, the previous add.l instruction has already completed and overwritten this word (but not the previous two).
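
The two-word explanation can be illustrated with a toy Python model. This is only a sketch, not a 68000 emulator: `stale_words` is a hypothetical knob for how many words of the next instruction were already fetched when the write lands (2 for read/modify instructions such as add.l, 1 for move, following the rules listed above).

```python
# Toy model of the prefetch effect on a patched "move.l #imm,d0".
# Not a 68000 emulator: stale_words is a hypothetical parameter saying how
# many words of the next instruction were fetched before the write landed.

MOVE_L_IMM_D0 = 0x203C   # opcode word of move.l #imm,d0


def run_patched_move(patch_long: int, stale_words: int) -> int:
    """Return d0 after 'move.l #0,d0' whose operand was patched just before."""
    before = [MOVE_L_IMM_D0, 0x0000, 0x0000]        # memory as assembled
    after = [MOVE_L_IMM_D0,
             (patch_long >> 16) & 0xFFFF,           # high word after the write
             patch_long & 0xFFFF]                   # low word after the write
    # Words fetched before the write see old memory, the rest see new memory.
    seen = [before[i] if i < stale_words else after[i] for i in range(3)]
    return (seen[1] << 16) | seen[2]


if __name__ == "__main__":
    # add.l #$12345678,(a0): opcode and high operand word are already fetched.
    print(f"${run_patched_move(0x12345678, stale_words=2):08X}")
    # A move-style write: only the opcode word is already fetched.
    print(f"${run_patched_move(0x12345678, stale_words=1):08X}")
```

With two stale words the model gives $00005678, the value Leonard saw on a real ST; with one stale word the whole patched operand would be used.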

alien wrote:- I was expecting the following code not to show prefetching though:

  lea foobar(pc), a0
  move.l a0, -(a7)
  move.w #$4afc, d0
  lea label(pc), a1
  jmp (a1)
  nop
  nop
  nop

foobar:
  move.w d0, (a0)
label:
   nop
   nop
   rts

because I thought the jmp to foobar would invalidate the prefetch buffer, and the move would be executed as soon as it was read leaving no time for a prefetch to take place. But apparently it did (on a real ST):

Memory bus activity at Cycle
0: move.w is fetched
4: move.w writes
8: illegal is fetched

I don't see how illegal got read before move.w writes unless jmp reads the move.w during its time, and at cycle 0, the nop gets read.


No, the prefetch is two words. And the prefetch is filled before the move starts execution (the extra prefetch cycles are part of the jmp instruction). So the illegal opcode is prefetched before the move starts. The move actually prefetches the second nop, not the first one.

Gunstick
Captain Atari
Posts: 289
Joined: Thu Jun 20, 2002 6:49 pm
Location: Luxembourg

Postby Gunstick » Thu Apr 28, 2005 9:55 pm

ULM tried to optimize code by specifically searching for pairable instructions. I don't know where I got the information from, but we knew that it has to be an instruction with an "odd" execution time (6, 10, 14...) and that one instruction must have a different bus behaviour.
Not knowing much about CPUs, we just called these internal and external cycles. And an instruction ending with an internal cycle (EXG) would have to be followed by one starting with an internal cycle (LSL?).

Not many instructions behaved like that, and those that did were unusable for "cool" code, so the idea was dropped. Self modifying code is of course a completely different issue because you write where the prefetch occurs. Seems disk protections were very happy about this prefetch thing...

Motorola documentation does not help much; it's an Atari design decision to put memory accesses on 4 cycle boundaries. The Motorola schematics do not document when each instruction makes a memory access. So the whole thing is difficult to find out except by just testing.

Some more undocumented stuff... where there is no memory, some "remanence" of previous bus signals will stay (leonard had a question about this some time ago). If you put the video address outside the available memory you will see nice flicker: that's actual code and data flowing over the bus. If you do that in a fullscreen, there is no flicker as the code is synched. Test it out with the parallax distorter cheat (that's why it's B-U-S) I mentioned earlier in this thread. So what happens if you write data into non existing RAM and the next instruction is fetched from there? Will it fetch the just written data? Always or sometimes?

Georges

Postby ljbk » Thu Apr 28, 2005 10:11 pm

Well, i can confirm it.
To my surprise, 2 EXG instructions do NOT pair. I have just checked it on my STF as well as with STEem.
But i am pretty sure that some do pair because i already had fullscreen timing problems because of this.
Maybe you are right, maybe they have to be different.
Please note that probably the base instruction (base opcode) has to be different, since EXG D0,D0 also does not pair with EXG D1,D1. But there is something else, since EXG also does not pair with NEG.L Dn, NOT.L Dn, CLR.L Dn and Scc Dn, and all these cases, according to the documents i have, would lead to 6 cycles.

But as i mentioned above, the EXG will have an impact on the cycle execution of the next instruction (Program A and B example).

Paulo.


PS: just made a quick search through some of my sources and i found 2 pairing instructions:

lsr.w #2,dn takes 10 cycles but goes to 12 if followed by a nop
move.b 0(an,dn),dn takes 14 cycles but goes to 16 if followed by a nop

One after the other, they take 24 cycles and the order does not matter !

Postby Gunstick » Thu Apr 28, 2005 11:34 pm

ljbk wrote: just made a quick search through some of my sources and i found 2 pairing instructions:

lsr.w #2,dn takes 10 cycles but goes to 12 if followed by a nop
move.b 0(an,dn),dn takes 14 cycles but goes to 16 if followed by a nop

One after the other, they take 24 cycles and the order does not matter !


rule of thumb explanation: lsr does lots of work later... (shifting) but address register indirect has to work internally earlier to add an+dn.

with what we learned previously:
the move.b has 2 words, so it's prefetched entirely.
As the lsr does not access the bus to write out any result, the move can start executing as soon as lsr is finished (the glue can't block the CPU as the CPU does no access to RAM).

My conclusion: uneven instructions where the result is not output to RAM followed by uneven instructions with an opcode not longer than 32bit gain in total 4 cycles (1 nop) in execution time.

About EXG pairing: EXG is 16 bit, so during its execution there is a prefetch, RAM access, 4 cycle alignment. So another rule to add: the instruction has to be longer than the prefetch (e.g. a 1 word instruction has to last more than 8 cycles).

if I follow my reasoning and do:
move.b 0(an,dn),dn 32 bit, uneven, 14 cycles
subq #1,dn 16 bit, prefetched during move, even 8 cycles
lsr.w #2,dn 16 bit, prefetched during move, uneven, 10 cycles

does the subq effectively execute in uneven cycles so that this code uses 32 cycles and not 36?

No time to test... your turn :-)


Georges
PS: something I wrote down in my timing sheet is movem R->M, motorola says it takes 8+5n or 8+10n for W or L respectively, but I noted that it's 8+4n and 8+8n on Atari.
4n is more logical, we write 16 bit words. So why is motorola saying 5n?
Is the glue forcing the CPU to "speed up" or is the documentation wrong?
Is the ST's movem write faster than on other MC68000 architectures?

Postby ijor » Fri Apr 29, 2005 2:32 am

ljbk wrote:Maybe you are right, maybe they have to be different.
Please note that probably the base instruction (base opcode) has to be different since EXG D0,D0 also does not pair with EXG D1,D1


Of course. For two instructions to pair, the first instruction must have all bus cycles aligned to a four cycle boundary, and the second one must have all bus cycles misaligned (always in relation to the first cycle of the instruction).

The reason is very simple. That’s the only way that two instructions with an execution time that is not a multiple of 4 can pair and avoid the penalty, and still comply with the 4-cycle bus alignment imposed by Glue.

Obviously this could only happen with different types of instructions. At the very least you need a different addressing mode, and I suspect even this won’t be enough. A simple change in the operand can’t produce a different order or alignment of execution.

lsr.w #2,dn takes 10 cycles but goes to 12 if followed by a nop
move.b 0(an,dn),dn takes 14 cycles but goes to 16 if followed by a nop

One after the other, they take 24 cycles and the order does not matter !


Hmm, the order should matter in theory (otherwise my explanation above breaks) and Steem confirms this (can’t test on a real ST right now). According to Steem they pair only when the lsr goes first.

Gunstick wrote:Rule-of-thumb explanation: lsr does most of its work late (the shifting), but address register indirect with index has to do internal work early to add an+dn.


Might be. But I think it is the opcode, not the addressing mode, that makes the difference. Specifically, it is likely that the “move” instruction behaves unusually. It is natural to assume that “move” is special and has its own microcode. As a matter of fact, we already know that the prefetch order of the move instruction differs from that of the read/modify opcodes (see the tests mentioned above by Leonard regarding the prefetch).

subq #1,dn 16 bit, prefetched during move, even, 8 cycles

Does the subq effectively execute on uneven cycles, so that this code takes 32 cycles and not 36?


(I assume you mean subq.l #1,Dn; otherwise it would be 4 cycles and not 8.) Can’t be. If the subq.l #n,Dn instruction prefetched on uneven cycles, then its net execution time would normally (unless paired) be 12 cycles.

PS: something I wrote down in my timing sheet is movem R->M. Motorola says it takes 8+5n or 8+10n for W or L respectively, but I noted that it's 8+4n and 8+8n on Atari.


Hmm. Looks like you have an earlier buggy manual. Mine specifies 8+4n and 8+8n for movem reg,(Ax).

There is one last thing about this. As Gunstick and I commented here some time ago, it is possible that an instruction with even cycles (a multiple of 4) by the book would normally cost an extra nop on the ST. This would happen, for example, if an instruction that takes 8 cycles with just a single memory access (the prefetch) executes that prefetch in cycle 2 of the instruction. Such an instruction would normally take 12 cycles (unless properly paired before and after). I have no idea if such instructions exist.

ijor
Hardware Guru
Hardware Guru
Posts: 3960
Joined: Sat May 29, 2004 7:52 pm
Contact:

Postby ijor » Fri Apr 29, 2005 3:03 am

Gunstick wrote:the move.b is 2 words long, so it is prefetched entirely.
As the lsr does not access the bus to write out any result, the move can start executing as soon as the lsr is finished (Glue can't block the CPU, since the CPU makes no RAM access).

My conclusion: an uneven instruction (cycle count not a multiple of 4) whose result is not written to RAM, followed by an uneven instruction with an opcode no longer than 32 bits, gains 4 cycles (1 nop) in total execution time.


I think your reasoning is wrong, or at least incomplete.

What matters is, again, the cycle-by-cycle order of an instruction. In this case, the question is whether move.b x(Ax,Dx),Dx performs its first bus cycle in cycle 2 of the instruction. This seems to be the case according to the tests made by ljbk. And if it does, it will always do so regardless of the previous instructions, because this depends on the instruction's microcode and nothing else. Or more precisely, it will always attempt the bus cycle at cycle 2.

In turn, Glue will or will not delay that bus access, depending (here, yes) on the previous instruction and how it left the 4-cycle alignment.

But if the second instruction is a different one that attempts the first bus cycle on its very first cycle, then you won’t have a pairing.

So the pairing doesn’t depend on being completely prefetched or not. It depends on the instruction microcode and how the microcode orders the cycle-by-cycle execution. Of course, the cycle-by-cycle order must follow some logic. Some things must be executed in a certain order. But other things could be executed in a different order with the same result.
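
A toy Python model of the explanation above, as a rough sketch only. The micro-op lists are assumptions chosen to match the book timings (lsr.w #2,Dn: 10 cycles with one bus access; move.b d8(An,Xn),Dn: 14 cycles with three bus accesses) and the "first bus cycle at cycle 2" behaviour suggested by ljbk's tests. They are not taken from the real microcode, and the names `run` and `cost` are made up for this example.

```python
# Each instruction is a list of micro-ops (hypothetical ordering, see above):
#   ("bus", 4) = one 4-cycle bus access, ("int", n) = n internal cycles.
LSR_W_2 = [("bus", 4), ("int", 6)]                               # 10 by the book
MOVE_B_IDX = [("int", 2), ("bus", 4), ("bus", 4), ("bus", 4)]    # 14 by the book
NOP = [("bus", 4)]                                               # 4 by the book


def run(instructions, start=0):
    """Return the cycle at which the sequence ends, with every bus access
    held until the next 4-cycle boundary (the alignment imposed by Glue)."""
    t = start
    for instr in instructions:
        for kind, cycles in instr:
            if kind == "bus":
                t += -t % 4  # wait states until the next multiple of 4
            t += cycles
    return t


def cost(instructions):
    """Cycles consumed before a trailing nop can start its fetch."""
    return run(instructions + [NOP]) - 4


print(cost([LSR_W_2]))              # lsr alone, followed by a nop
print(cost([MOVE_B_IDX]))           # move alone, followed by a nop
print(cost([LSR_W_2, MOVE_B_IDX]))  # lsr first
print(cost([MOVE_B_IDX, LSR_W_2]))  # move first
```

Under these assumptions the model gives 12 and 16 for the single instructions, 24 when the lsr goes first and 28 when the move goes first, which is the order dependence described above. A different assumed micro-op order would give different numbers, so this only illustrates the mechanism and proves nothing about the real chip.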

User avatar
alien
Atari maniac
Atari maniac
Posts: 97
Joined: Sat May 01, 2004 4:01 am
Location: USA
Contact:

Postby alien » Fri Apr 29, 2005 3:46 am

Gunstick wrote:Some more undocumented stuff... where there is no memory, some "remanence" of previous bus signals will remain (leonard had a question about this some time ago). If you put the video address outside the available memory you will see nice flicker: that's actual code and data flowing over the bus.


Yes, I remember that. I actually got very confused at one point when I had a lower overscan and there was a clear pattern that didn't flicker, but writing to "memory" there didn't work. Then I realized that I was seeing the pattern of my code loads. I think the MMU must return the last data it read if it can't access memory at the specified address. This is different from a PC's behaviour, where pull-up resistors are used (on PCI) to make invalid memory (memory no one responds to) read as all FF's.
Alien / ST-Connexion

User avatar
alien
Atari maniac
Atari maniac
Posts: 97
Joined: Sat May 01, 2004 4:01 am
Location: USA
Contact:

More fun...

Postby alien » Fri Apr 29, 2005 4:00 am

So, I wanted to figure out whether the prefetch word is prefetched by the change-of-flow instruction. Here's my test code:

Code:

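  move.w  end_of_test, a0   ; a0 = $4afc (the opcode word of illegal, used as an address)
  sub.w    #2, a0           ; a0 = $4afc-2
  move.l  (a0),d0           ; save the long at (a0) so it can be restored later
  move.l  sp,d1             ; save the stack pointer

  move.l  (code),d7         ; d7 = opcodes of jsr (a2) / jmp (a1)
  move.l  d7, (a0)          ; copy them to $4afc-2

  lea     here(pc), a1
  lea     stack(pc),a2
  lea     2(a2), sp         ; the long pushed by jsr lands at stack-2..stack+1
  jmp     (a0)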

here:
  move.l  d0,(a0)
  move.l  d1,sp

end_of_test:
  illegal

code:
  jsr     (a2)     ; Ends up at $4afc-2, will push $4afc (ie illegal) onto the stack
  jmp     (a1)

stack:
  nop      ; Will be replaced by $4afc (ie illegal) when jsr (a2) pushes its return address onto the stack
  rts


Ok so...
- STEEM raises an illegal exception at PC = stack
- a real 68000 raises an illegal exception at PC = end_of_test

This means that a real 68000 fills the prefetch buffer before the call pushes the return address, which in turn means that the call instruction itself fills it. I suspected as much, because Bxx takes 10 cycles if the branch is taken but 8 otherwise.

STEEM does it wrong. What about SainT, Leonard?
Alien / ST-Connexion

