What is "sync scrolling"?

alien
Atari maniac
Posts: 97
Joined: Sat May 01, 2004 4:01 am
Location: USA

Postby alien » Fri Apr 29, 2005 4:21 am

ijor wrote:Might be you are thinking in terms of a modern processor. A modern processor has independent units, such as sequencer, bus manager, ALU, etc. And for example, if the bus manager and prefetch unit are waiting for external memory, the sequencer and ALU can keep running.


Well... I did work for AMD until a few weeks ago, so yes.

But the 68k doesn’t work like this. If the prefetch is stalled because of external wait states, then the whole CPU is stalled. And if an instruction performs the prefetch in its first cycle, then it always does. That’s why you need two different instructions for pairing two 6-cycle instructions.


So you seem to be saying that every prefetch is programmed into the instruction (as I seem to have figured out will call). That's interesting, and would indicate that some instructions would always have the option of pairing with whatever instruction that followed. But I don't understand why it would forbid the same instruction from pairing with itself. (Both instructions don't run at the same time, so there is no conflict if they both prefetch on the "same" cycle because it's the same relative cycle from the beginning of the instruction, but the instructions start at different cycles).

A small correction, but not relevant to the issue. When the first EXG starts execution, both EXG instructions are already prefetched. The prefetch is two words, not one.


Basically you're saying there's a prefetch register, and then there's an ID stage to identify the instruction which looks like 2 words. I agree instruction queue means everything (instruction + extensions) because prefetch wouldn't be able to know what's what until the ID stage. But... this example doesn't seem to match a 2 word always rule:

Code: Select all

  lea     smc(pc),a0
  move.l  code(pc), (a0)

smc:
  nop      ; if prefetch is 0 instructions we should hit here
  nop      ; if prefetch is 1 instruction  we should hit here
  illegal  ; If prefetch is 2 instructions we should hit here

code:
  illegal
  illegal

We hit on the second illegal corresponding to 1 instruction prefetch.
Alien / ST-Connexion

Postby alien » Fri Apr 29, 2005 4:28 am

Gunstick wrote:rule of thumb explanation: lsr does lots of work later... (shifting) but address register indirect has to work internally earlier to add an+dn.

Sorry but that doesn't make sense to me: lsr doesn't do the work later or else it wouldn't be a 6/8 + 2n instruction, and couldn't be used to synchronize the cpu with the video counter. Also, the penalty would still be there because the following instruction uses the result so nothing can be hidden.
PS: something I wrote down in my timing sheet is movem R->M, motorola says it takes 8+5n or 8+10n for W or L respectively, but I noted that it's 8+4n and 8+8n on Atari.
4n is more logical, we write 16 bit words. So why is motorola saying 5n?
Is the glue forcing the CPU to "speed up" or is the documentation wrong?
Is the ST's movem write faster than on other MC68000 architectures?


4n is right (otherwise Spectrum 512 images wouldn't work). I think your documentation is wrong. Nothing can speed a CPU up, other than higher voltage and CPU frequency. And that's not a good idea for the lifetime of the chip.
Alien / ST-Connexion

Postby alien » Fri Apr 29, 2005 6:14 am

ljbk wrote:Well, i can confirm it.
To my surprise, 2 EXG instructions do NOT pair. I have just checked it on my STF as well as with STEem.


I tried it on my ST... and

Code: Select all

  move.b  (a0), d1   ; a0 = 0xffff8209
  exg d7,d6
  exg d6,d7
  move.b  (a0), d3


the 2 exchanges add up to 12 cycles, which looks like 6 + 6.
Steem does it wrong... taking 16 cycles... so let's forget STEEM from here on. I'm not too surprised, this is weird stuff.

Now

Code: Select all

  move.b  (a0), d1   ; a0 = 0xffff8209
  nop
  exg d7,d6
  exg d6,d7
  nop
  move.b  (a0), d3


Now the total is 24 cycles: 16 + 4 + 4

Then we try:

Code: Select all

  move.b  (a0), d1   ; a0 = 0xffff8209
  exg d7,d6
  exg d6,d7
  exg d7,d6
  exg d6,d7
  move.b  (a0), d3


Now the exchanges add up to 28 which is 8*4 - 4.
Perhaps the original 12 was not 6+6, but 8*2 - 4.
We check with

Code: Select all

  move.b  (a0), d1   ; a0 = 0xffff8209
  exg d7,d6
  exg d6,d7
  exg d7,d6
  exg d6,d7
  exg d7,d6
  exg d6,d7
  move.b  (a0), d3

and we get 8*6 - 4 = 44

So it seems that there's __ONLY__ a pairing between either the first move.b and the first exg / or the second exg and the second move.

So...

Code: Select all

  move.b  (a0), d1   ; a0 = 0xffff8209
  nop
  exg d7,d6
  exg d6,d7
  move.b  (a0), d3

takes 16 cycles

while

Code: Select all

  move.b  (a0), d1   ; a0 = 0xffff8209
  exg d7,d6
  exg d6,d7
  nop
  move.b  (a0), d3

takes 20 cycles.

Which tells us that the pairing is the second exg and the second move.b:

Code: Select all

  move.b  (a0), d1   ; 8 cycles
  exg d7,d6              ; 8
  exg d6,d7              ; 6
  move.b  (a0), d3   ; 2 until read, 6 following it


It's starting to look like the GLUE synchronisation theory makes sense because if I do

Code: Select all

  move.b  (a0), d1
  exg     d6,d7
  move.b  (a0),d0 ; 8 cycles
  exg     d6,d7
  move.b  (a0),d0 ; 8 cycles
  exg     d6,d7
  move.b  (a0),d0 ; 8 cycles
  exg     d6,d7
  move.b  (a0),d3

I get 52 cycles which is 3 * 8 + 6 * 2 + 8 * 2 which looks like depending on their slot the exchanges are either pairing up or not pairing up with the move.b's.

So basically ijor seems to be right: the 68000 is being stalled by the GLUE when it tries to do a memory access on a 2 cycle boundary, but not when it's doing it on a 4 cycle boundary. Now I still don't get why 2 exchanges following each other don't end up always being 12 cycles. After all the GLUE's not involved, and by the time they execute they're both in the prefetch buffer. Oh well, time to go to bed.

If I didn't mistype it, here's the test code, for you to try at home should you want to check the results.

Code: Select all

  clr.l   -(sp)
  move.w #$20, -(sp)
  trap   #1
  addq.l #6, sp
  move.l  d0, -(sp)
  move.w #$2700,sr
  move.l  $70, -(sp)
  lea $ffff8209, a0
  lea test1(pc),a1
  move.l a1, $70
  stop #$2300
  move.w #$2300, sr

self1: jmp self1

test1:
  move.b (a0),d1
  beq test1
  move.b (a0), d1
  move.b (a0), d2
  sub.b d1,d2
  lea test2(pc), a1
  move.l a1, $70
  moveq #0,d3
  rte

test2:
  move.b (a0),d1
  beq test2
  move.b (a0), d1
  < CODE TO TEST >
  move.b (a0), d3
  sub.b d1, d3
  sub.b d2, d3
  add.w d3,d3
  move.w d3, result
  addq #6, sp
  move.l (sp)+, $70
  move.w #$2300,sr
  move.w #$20, -(sp)
  trap #1
  addq.l #6, sp
  clr.l -(sp)
  trap #1

result: dc.w 0
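
For anyone reproducing this, the bookkeeping at the end of test2 can be sketched in Python. This is only a model of the arithmetic, not something that runs on the ST: the function and parameter names are made up for illustration, and it assumes (as the final `add.w d3,d3` implies) that one count of the low video-counter byte corresponds to 2 CPU cycles.

```python
def measured_cycles(before, after, baseline):
    """Model of: sub.b d1,d3 / sub.b d2,d3 / add.w d3,d3.

    before   -- counter byte read just before the code under test (d1)
    after    -- counter byte read just after the code under test (d3)
    baseline -- difference between two back-to-back reads, from test1 (d2)
    """
    delta = (after - before - baseline) & 0xFF  # byte arithmetic wraps mod 256
    return 2 * delta                            # assumed: 1 count = 2 CPU cycles


if __name__ == "__main__":
    print(measured_cycles(0x10, 0x1E, 4))  # 20
    print(measured_cycles(0xFA, 0x08, 4))  # 20, the counter byte wrapped around
```

Subtracting the baseline removes the cost of the two reads themselves, so the result is only the time taken by the code placed between them.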
Alien / ST-Connexion

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Postby ljbk » Fri Apr 29, 2005 6:46 am

ijor wrote:
lsr.w #2,dn takes 10 cycles but goes to 12 if followed by a nop
move.b 0(an,dn),dn takes 14 cycles but goes to 16 if followed by a nop

One after the other, they take 24 cycles and the order does not matter !


Hmm, the order should matter in theory (otherwise my explanation above breaks) and Steem confirms this (can’t test on a real ST right now). According to Steem they pair only when the lsr goes first.


Well in fact i had this couple of instructions just before a dbra, so the lsr is pairing with the dbra :D :!:
After inserting a nop after the lsr, there is no more pairing.

I tried also the subq.l test on both STEem and STF like this:

a)
nop
subq.l #1,d4
move.b 0(a0,d5),d6
lsr #2,d7
nop

b)
nop
move.b 0(a0,d5),d6
subq.l #1,d4
lsr #2,d7
nop

c)
nop
move.b 0(a0,d5),d6
lsr #2,d7
subq.l #1,d4
nop


There is no pairing in any of these cases.


Paulo.

leonard
Moderator
Posts: 660
Joined: Thu May 23, 2002 10:48 pm

Postby leonard » Fri Apr 29, 2005 8:27 am

STEEM does it wrong. What about SainT, Leonard?


Don't know. Have to test it. could you provide the exact code ? ( I mean, is it

move.l #end_of_test, a0

instead of

move.w end_of_test, a0 as you write ?

Or did I miss something?
Leonard/OXYGENE.

ijor
Hardware Guru
Posts: 3960
Joined: Sat May 29, 2004 7:52 pm

Postby ijor » Fri Apr 29, 2005 3:06 pm

alien wrote:So you seem to be saying that every prefetch is programmed into the instruction (as I seem to have figured out will call).


Yes.

That's interesting, and would indicate that some instructions would always have the option of pairing with whatever instruction that followed.


Hmm, I don’t understand how you reach this conclusion.

But I don't understand why it would forbid the same instruction from pairing with itself.


Let’s see if we can agree on a few terms and concepts (to avoid semantic misunderstanding):

GLUE splits bus access into slots. Every two CPU cycles (every 250 ns), bus ownership alternates between the Shifter and the CPU. This means that all bus accesses performed by the CPU must be separated by a multiple of 500 ns (4 CPU cycles). If the CPU attempts a “misaligned” bus access, GLUE inserts wait states and forces the alignment.
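
As a rough illustration of the slot rule (a sketch in Python, not a hardware description): counting CPU cycles from a point where the CPU owns the bus, an access attempted at cycle c goes through at once when c is a multiple of 4 and otherwise waits for the next multiple of 4. The function names are invented for this example, and the 125 ns cycle is simply the nominal figure implied by "two CPU cycles = 250 ns".

```python
NS_PER_CYCLE = 125  # nominal 8 MHz, implied by "two CPU cycles = 250 ns"


def wait_states(cycle):
    """CPU cycles added to a bus access attempted at `cycle`."""
    return (-cycle) % 4


def access_time_ns(cycle):
    """Time at which the access attempted at `cycle` actually starts."""
    return (cycle + wait_states(cycle)) * NS_PER_CYCLE


if __name__ == "__main__":
    for c in (0, 2, 4, 6):
        print(c, wait_states(c), access_time_ns(c))
```

Since 68000 instruction timings are always an even number of cycles, the wait in this model is either 0 or 2 cycles.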

A couple of notes about this:

This always happens, regardless of whether the Shifter needs data or not. In other words, the Shifter still receives its bus slots even during vertical and horizontal blank.

There are additional wait states for floppy/hard disk DMA, Blitter, STE DMA sound, and when accessing slow devices (ACIAs are the slowest ones).

Bus access is exactly the same for instruction prefetch as for data memory access.

The cycle-by-cycle order (including prefetch cycles) of a specific instruction is fixed in microcode.

Let’s call “penalty” the additional 2 cycles inserted by GLUE for instructions that attempt a misaligned bus access. An instruction that takes 6 cycles (or any value that is not a multiple of 4) will generate a penalty when preceded and followed by a NOP (and by most other instructions). This is because the following NOP will attempt a misaligned prefetch. Note that the penalty is actually applied to the NOP, and not to the 6-cycle instruction, but this doesn’t matter.

We call it pairing when two instructions that each generate a penalty when bracketed by NOPs avoid the penalty when placed one after the other. There is only one way that pairing can occur (as I said in a previous message):

For two instructions to pair, the first instruction must have all bus cycles aligned to a four-cycle boundary, and the second one must have all bus cycles misaligned (always in relation to the first cycle of the instruction).

So no instruction can pair with itself (well, it is possible that there are exceptions, see below). Let’s take the case of the “EXG” instruction. It might have two possible behaviors (but the behavior is fixed: it is either always one or always the other):

1 – The prefetch is executed in cycles 0-3, and the bus is idle during the last two cycles.
2 – The bus is idle during the two first cycles and the prefetch is executed on cycles 2-5.

It seems that the actual behavior is the first one. But either way it cannot pair with itself, because either all of its bus accesses fall on a 4-cycle boundary, or none of them do.

Note that this holds as long as the two EXGs are bracketed with NOPs. Two EXGs together might still pair if properly matched with the preceding and following instructions. But the pairing will be between each EXG and the other instruction, not between the EXGs themselves.
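
The rule can be tried out in a toy model (Python, a sketch only). Each instruction is described by its by-the-book length and by the cycle offsets, relative to its first cycle, at which it starts a bus access; any access that does not land on a multiple of 4 is pushed back to the next one. The offsets below are assumptions made for illustration: NOP, EXG and lsr #2 are taken to prefetch in their first cycle (for EXG this is the first behavior listed above), and move.b 0(an,dn),dn is taken to idle for 2 cycles and then perform three back-to-back accesses.

```python
# (nominal cycles, bus-access start offsets relative to the first cycle)
NOP      = (4,  [0])
EXG      = (6,  [0])          # assumed: prefetch first, then 2 idle cycles
LSR2     = (10, [0])          # lsr #2,dn; assumed: prefetch first
MOVE_IDX = (14, [2, 6, 10])   # move.b 0(an,dn),dn; assumed offsets


def run(sequence, start=0):
    """Return the cycle at which `sequence` finishes, wait states included."""
    t = start
    for nominal, offsets in sequence:
        delay = 0
        for off in offsets:
            at = t + off + delay
            delay += (-at) % 4   # wait for the next CPU bus slot
        t += nominal + delay
    return t


if __name__ == "__main__":
    print(run([NOP, EXG, NOP]))             # 16: 14 by the book + 2 penalty
    print(run([NOP, EXG, EXG, NOP]))        # 24: two penalties, no pairing
    print(run([NOP, LSR2, MOVE_IDX, NOP]))  # 32: by the book, the two pair
    print(run([NOP, MOVE_IDX, LSR2, NOP]))  # 36: two penalties, no pairing
```

With these assumed offsets, the 24 cycles for two EXGs between NOPs is consistent with the figure Alien reported for the nop / exg / exg / nop sequence.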

As I said, it is possible that there are instructions with “weird” behavior. For example, an instruction might have two or more bus accesses that are misaligned between themselves. Or it might have extra idle cycles both at the start and at the end. Again, I have no idea if such instructions exist or not. One possible candidate is move d8(Ax,Xx),d8(Ax,Xx).

But... this example doesn't seem to match a 2 word always rule:

Code: Select all

  lea     smc(pc),a0
  move.l  code(pc), (a0)

smc:
  nop      ; if prefetch is 0 instructions we should hit here
  nop      ; if prefetch is 1 instruction  we should hit here
  illegal  ; If prefetch is 2 instructions we should hit here

code:
  illegal
  illegal

We hit on the second illegal corresponding to 1 instruction prefetch.


No, this doesn’t mean the prefetch is one word only. It just means that the last prefetch performed by the move is executed after the write. This is something we already know. See my “prefetch rules” when answering Leonard's question. Actually, you already suggested the same conclusion when you answered Leonard.

Now I still don't get why 2 exchanges following each other don't end up always being 12 cycles. After all the GLUE's not involved, and by the time they execute they're both in the prefetch buffer.


It doesn’t matter exactly which instructions are prefetched (this only matters for self modified code prefetch tricks). The EXG still must perform one word prefetch (see my prefetch rules). So GLUE will delay EXG if it attempts a misaligned prefetch.

Postby alien » Sat Apr 30, 2005 1:25 am

leonard wrote:
STEEM does it wrong. What about SainT, Leonard?


Don't know. Have to test it. could you provide the exact code ? ( I mean, is it

move.l #end_of_test, a0

instead of

move.w end_of_test, a0 as you write ?


No what I wrote is right. I'm putting code in the address $4afa so that a $4afc gets pushed onto the stack, which happens to be where my code then jumps. $4afc is illegal. I load "end_of_test"'s value (illegal) into a0.
Alien / ST-Connexion

Postby alien » Sat Apr 30, 2005 2:02 am

ijor wrote:The cycle-by-cycle order (including prefetch cycles) of a specific instruction is fixed in microcode.

Let’s call “penalty” the additional 2 cycles inserted by GLUE for instructions that attempt a misaligned bus access. An instruction that takes 6 cycles (or any value that is not a multiple of 4) will generate a penalty when preceded and followed by a NOP (and by most other instructions). This is because the following NOP will attempt a misaligned prefetch. Note that the penalty is actually applied to the NOP, and not to the 6-cycle instruction, but this doesn’t matter.

We call it pairing when two instructions that each generate a penalty when bracketed by NOPs avoid the penalty when placed one after the other. There is only one way that pairing can occur (as I said in a previous message):

For two instructions to pair, the first instruction must have all bus cycles aligned to a four-cycle boundary, and the second one must have all bus cycles misaligned (always in relation to the first cycle of the instruction).

So no instruction can pair with itself (well, it is possible that there are exceptions, see below). Let’s take the case of the “EXG” instruction. It might have two possible behaviors (but the behavior is fixed: it is either always one or always the other):

1 – The prefetch is executed in cycles 0-3, and the bus is idle during the last two cycles.
2 – The bus is idle during the two first cycles and the prefetch is executed on cycles 2-5.

It seems that the actual behavior is the first one. But either way it cannot pair with itself, because either all of its bus accesses fall on a 4-cycle boundary, or none of them do.

Note that this holds as long as the two EXGs are bracketed with NOPs. Two EXGs together might still pair if properly matched with the preceding and following instructions. But the pairing will be between each EXG and the other instruction, not between the EXGs themselves.

As I said, it is possible that there are instructions with “weird” behavior. For example, an instruction might have two or more bus accesses that are misaligned between themselves. Or it might have extra idle cycles both at the start and at the end. Again, I have no idea if such instructions exist or not. One possible candidate is move d8(Ax,Xx),d8(Ax,Xx).


Yup, this seems a good model. Its conclusions are:

Let us consider an instruction that accesses memory at cycles 4n + 2 (where n is an integer):

1. If one of those instructions is started at a 4n + 2 boundary it will take its normal time
2. If one of those instructions is started at a 4n boundary, then it will take 2 extra cycles because the GLUE will delay the 68000's memory access.

which is what we observe.

An example of case 1 is:

Code: Select all

  lsr #2,dn
  move.b 0(an,dn),dn ; starts at 4n + 2 boundary: takes 14 cycles


and an example of case 2 would be

Code: Select all

  exg d0, d0
  exg d0, d0  ; starts at 4n + 2 boundary: takes 8 cycles


Now let us consider instructions that access memory at cycles 4n:

3. If it is started at a 4n boundary it will take its normal time
4. If it is started at a 4n + 2 boundary it will take 2 extra cycles

An example of case 3 is

Code: Select all

  nop
  nop  ; starts at 4n boundary: takes 4 cycles


An example of case 4 is:

Code: Select all

  exg d0,d0  (6 cycles)
  nop        ; starts at 4n boundary: takes 6 cycles


But... this example doesn't seem to match a 2 word always rule:

Code: Select all

  lea     smc(pc),a0
  move.l  code(pc), (a0)

smc:
  nop      ; if prefetch is 0 instructions we should hit here
  nop      ; if prefetch is 1 instruction  we should hit here
  illegal  ; If prefetch is 2 instructions we should hit here

code:
  illegal
  illegal

We hit on the second illegal corresponding to 1 instruction prefetch.


No, this doesn’t mean the prefetch is one word only. It just means that the last prefetch performed by the move is executed after the write. This is something we already know. See my “prefetch rules” when answering Leonard's question. Actually, you already suggested the same conclusion when you answered Leonard.


Yes, but I was thinking of a Prefetch unit which avoided reading stuff when the bus was busy, rather than of an explicit prefetch in microcode. But I think you're right: the 68000 is a lot simpler than I thought.

So what about

Code: Select all

  move.b 0(an,dn),dn
  lsr #2,dn


Well it turns out that it does NOT pair if you surround it with NOPs. While

Code: Select all

  lsr #2,dn
  move.b 0(an,dn),dn


does pair, even when surrounded by NOPs. I guess in Paulo's code the instructions following it must have paired... and indeed that's what he said in his message: it paired with dbra.
Alien / ST-Connexion

Postby alien » Sat Apr 30, 2005 2:20 am

Damn, I got the examples wrong:

Code: Select all

  exg d0,d0  ; 6 cycles
  move.b 0(an,dn),dn ; 4n+2 instruction starting at 4n + 2 boundary: takes 14 cycles


and an example of case 2 would be

Code: Select all

  nop  ; 4 cycles
  exg d0, d0  ; 4n + 2 instruction starting at 4n boundary: takes 8 cycles


An example of case 3 is

Code: Select all

  nop  ; 4 cycles
  nop  ; 4n instruction starting at 4n boundary: takes 4 cycles


An example of case 4 is:

Code: Select all

  exg d0,d0  ; 6 cycles
  nop        ; 4n instruction starting at 4n+2 boundary: takes 6 cycles
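
The four cases boil down to one line of arithmetic. As a sketch in Python (the function and parameter names are invented here, and the "phase" of an instruction is just the offset of its bus accesses modulo 4):

```python
def cycles_taken(nominal, access_phase, start_phase):
    """Cycles an instruction takes under the 4n / 4n+2 model.

    nominal      -- by-the-book cycle count
    access_phase -- 0 if its bus accesses fall on 4n relative to its first
                    cycle, 2 if they fall on 4n + 2
    start_phase  -- 0 if it starts on a 4n boundary, 2 if on 4n + 2
    """
    misaligned = (start_phase + access_phase) % 4 != 0
    return nominal + (2 if misaligned else 0)


if __name__ == "__main__":
    print(cycles_taken(14, 2, 2))  # case 1: 14
    print(cycles_taken(6, 2, 0))   # case 2: 8
    print(cycles_taken(4, 0, 0))   # case 3: 4
    print(cycles_taken(4, 0, 2))   # case 4: 6
```

The inputs are the cycle counts and classifications used in the four examples above; the model itself says nothing about which real instruction belongs in which class.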
Alien / ST-Connexion

Postby ijor » Thu Jun 30, 2005 5:23 pm

ijor wrote:As I said, it is possible that there are instructions with “weird” behavior. For example, an instruction might have two or more bus accesses that are misaligned between themselves. Or it might have extra idle cycles both at the start and at the end. Again, I have no idea if such instructions exist or not. One possible candidate is move d8(Ax,Xx),d8(Ax,Xx).


I was right: the above move (and some other forms of move as well) has two bus accesses that are misaligned with each other. This means that one bus access will always be delayed regardless of the surrounding instructions. And depending on the previous instruction, the first bus access will be delayed as well.

In other words, these instructions take two or four more cycles than by the book. And Steem does emulate this accurately.

I also found that my prefetch rules in this thread are not exact. Some forms of move behave like non-move instructions and perform the last prefetch before the write.

Heaven/Taquart
Atarian
Posts: 7
Joined: Fri Sep 02, 2005 2:41 pm

Postby Heaven/Taquart » Fri Sep 02, 2005 6:12 pm

holy poo... all my faves from that time discussing the magic sync scrolls...

but as i had a 1040ste... were there any nice tricks as well? even with the new hardware?
Heaven/TAQUART
Coder Atari 800, 7800, Lynx, GBA, PSone
http://www.s-direktnet.de/homepages/k_nadj/main.html
http://boinxx.blogspot.com/

Watch out:

- BOINXX (former Trackball) for Atari Computer Systems
- http://numen.scene.pl/

Postby alien » Mon Nov 14, 2005 1:38 am

ijor wrote:
ijor wrote:As I said, it is possible that there are instructions with “weird” behavior. For example, an instruction might have two or more bus accesses that are misaligned between themselves. Or it might have extra idle cycles both at the start and at the end. Again, I have no idea if such instructions exist or not. One possible candidate is move d8(Ax,Xx),d8(Ax,Xx).


I was right: the above move (and some other forms of move as well) has two bus accesses that are misaligned with each other. This means that one bus access will always be delayed regardless of the surrounding instructions. And depending on the previous instruction, the first bus access will be delayed as well.

In other words, these instructions take two or four more cycles than by the book. And Steem does emulate this accurately.

I also found that my prefetch rules in this thread are not exact. Some forms of move behave like non-move instructions and perform the last prefetch before the write.


Just FYI, I found this http://linux.cis.monroeccc.edu/~paulrsm/doc/dpbm68k1.htm which says

Another way designers made the MACSS faster was to include what is called a prefetch queue. This prefetch queue is more intelligent than other microprocessor queues; its control varies according to the instruction stream contents.

The prefetch queue is a very effective means of increasing microprocessor performance; it attempts to have as much instruction information as possible available before a particular instruction begins execution. The microprocessor uses an otherwise idle data bus to prefetch from the instruction stream. This keeps the bus active more of the time, increasing performance because processing of instructions is often limited by the time it takes to get all the relevant information into the processor.

The part of memory from which instructions are fetched, the program space, contains op codes and addressing information. The prefetch queue can contain enough information to execute one instruction, decode the next instruction, and fetch the following instruction from memory -- all at the same time.

Exactly what is in the queue is very dependent upon the exact instruction sequences. The queue is intelligent enough to stay fairly full without being too wasteful.

For instance, when a conditional branching instruction is detected, the prefetch is ready to either branch or not by the time a decision is made. The queue tries to fetch both the op code following the branch instruction and the op code at the calculated branch location. Then, when the condition codes are compared and a decision is made whether to branch, the processor can begin immediate decoding of either instruction. The other unnecessary op code is ignored.

You can use the prefetch queue in many other special ways as well. One example is in speeding up the repetitious Move Multiple Registers instruction, where it is used to accelerate successive data transfers. The prefetch queue allows many frequently used instructions to execute in exactly the time it takes to fetch the op code (actually, the time to prefetch the next op code).


Sounds like it's independent from the microcode.
Alien / ST-Connexion

Postby ijor » Mon Nov 14, 2005 1:29 pm

Hi Alien,

Thanks for the pointer, interesting article.

Alien wrote:Sounds like it's independent from the microcode.


No, it is completely dictated by microcode. It is the actual microcode for each and every instruction that performs the prefetches. You are reading that article with a “modern mind”, and so you are reaching the wrong conclusion. This is partially the article author’s fault, because the article is wrong, or at least misleading.

For instance, when a conditional branching instruction is detected, the prefetch is ready to either branch or not by the time a decision is made. The queue tries to fetch both the op code following the branch instruction and the op code at the calculated branch location. Then, when the condition codes are compared and a decision is made whether to branch, the processor can begin immediate decoding of either instruction. The other unnecessary op code is ignored.


That paragraph of the article is plain wrong. The 68000 does NOT perform conditional branches like that. If that were the case then the timing would be completely different. When the CPU decides that the branch is taken, only then does it prefetch code from the new PC. If the branch is not taken then the new PC is NEVER referenced. This can be easily proved in several ways.

This prefetch queue is more intelligent than other microprocessor queues;


This doesn’t mean what you are thinking. Put that phrase in its historical context. The comparison is against other micros that preceded the M68K (which were mostly very dumb).

It doesn’t mean that the queue is smart and performs prefetch cycles according to some dynamic state (as it is the case of modern micros). It means that the prefetch cycles are “smartly” implemented in the microcode to keep the bus utilization as efficient as possible.

For example, when a simple read/modify instruction is performed, such as addq #1,(A0), the prefetch cycle is executed between the data read and data write. The idea is to keep the bus busy while the CPU internally makes the arithmetic/logical operation.

But this is not because the queue is smart. The engineers that wrote the microcode were the smart ones. Actually, the prefetch queue in the M68k can’t be smart because it has NO prefetch unit at all (this can be seen in several M68K hardware diagrams).

Since I wrote the last message on this thread I made an exhaustive analysis of the prefetch queue. You can see my article at: http://pasti.fxatari.com/68kdocs/

It is more than obvious from the tables of my article (which you are very welcome to test) that it is all executed in microcode. The article is not complete and the tables are only for instructions that perform write cycles (because those are the only ones that could affect self-modified code). But if you examine the behavior of other (read-only) instructions, then the microcode effect is even more obvious:

Take for example DIV instructions. The DIV instructions execute the prefetch cycles at the very end of the instruction, after the bus is idle for up to 156 cycles. Obviously an independent smart prefetch queue unit would have executed the prefetch cycles much earlier.

Postby ijor » Mon Nov 14, 2005 2:38 pm

Hi Alien,

I’ll take the opportunity to ask you something about overscan coding (and it is related a bit to the prefetch issue). I managed to read your excellent articles on the subject that are still not translated from French. I learned a lot! Thanks. There is however one thing that seems to be wrong, I think.

You explain that the following code:

Code: Select all

move.b    d0,(a0) ; (8 cycles) Switch to mono
nop                     ; (4 cycles) Left border Release
move.b    d1,(a0) ; (8 cycles) Switch to low


can be replaced with this:

Code: Select all

move.b    d1,(a0) ;  8 cycles
clr.b        (a0)      ;  12 cycles


However, the timing of the two sequences is not the same. In the optimized code the switch back to low rez is executed in the very last bus cycle. But in the first example, the switch is performed 4 cycles earlier. This is because “move” first writes and then prefetches, while “clr” first prefetches and then writes (clr is actually a read/modify instruction).

I don’t know why this was never noted. Maybe you always used the second form, or maybe GLUE can tolerate this 4-cycle variation. Or maybe I'm missing something?

Gunstick
Captain Atari
Posts: 289
Joined: Thu Jun 20, 2002 6:49 pm
Location: Luxembourg

Postby Gunstick » Tue Nov 15, 2005 10:33 pm

what about the ULM way, how is the timing there?
move.b #2,$ffff8260.w
move.b #0,$ffff8260.w

Postby ijor » Wed Nov 16, 2005 2:57 am

Hi Gunstick,

Gunstick wrote:what about the ULM way, how is the timing there?
move.b #2,$ffff8260.w
move.b #0,$ffff8260.w


I can't make a full comparison because the total number of cycles in your code (32 cycles) is different from both sequences in Alien's article (20 cycles). So I guess the code preceding and/or following those two instructions must be slightly different from the one used by Alien.

But the distance (in cycles) between both writes in your case is the same as the one in the second of Alien's sequences. It is 16 cycles between switching to mono and switching back to low. In comparison, the first of Alien's sequences has a distance of 12 cycles between both writes.

So I guess that 16 cycles is the correct timing.

If you are interested in the cycle by cycle execution of your sequence, then the writes are performed on cycle 8 (9th cycle) of each of your instructions. As I describe in the article pointed above, most move variants perform the write 4 cpu cycles (1 bus cycle) before the end of the instruction. And during the last 4 cycles the last (or only one) prefetch is performed. In your case it is like this:

Code: Select all

cycles 0-3: prefetch short absolute address
cycles 4-7: prefetch first word of the next instruction
cycles 8-11: perform write
cycles 12-15: prefetch second word of the next instruction
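
The distances quoted above can be reproduced with a few lines of Python. This is only a sketch: the names are invented, wait states are ignored, and each instruction is reduced to its length plus the offset at which it writes, taken from the descriptions in this thread (move writes first and then prefetches, clr writes in its last bus cycle, and the immediate-to-absolute move writes at cycle 8).

```python
# (nominal cycles, offset of the write within the instruction, or None)
MOVE_REG_IND = (8, 0)      # move.b dn,(a0): write, then prefetch
NOP          = (4, None)
CLR_IND      = (12, 8)     # clr.b (a0): write in the last bus cycle
MOVE_IMM_ABS = (16, 8)     # move.b #imm,$ffff8260.w: write at cycle 8


def write_times(sequence):
    """Return (list of cycles at which writes start, total cycles)."""
    t, writes = 0, []
    for nominal, write_offset in sequence:
        if write_offset is not None:
            writes.append(t + write_offset)
        t += nominal
    return writes, t


def write_gap(sequence):
    """Cycles between the first and the last write of the sequence."""
    writes, _ = write_times(sequence)
    return writes[-1] - writes[0]


if __name__ == "__main__":
    print(write_gap([MOVE_REG_IND, NOP, MOVE_REG_IND]))  # 12
    print(write_gap([MOVE_REG_IND, CLR_IND]))            # 16
    print(write_gap([MOVE_IMM_ABS, MOVE_IMM_ABS]))       # 16
```

Both of Alien's sequences total 20 cycles and the ULM one totals 32, yet only the first differs in the spacing of the two writes.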

Postby alien » Tue Dec 06, 2005 10:44 pm

ijor wrote:Hi Alien,

I’ll take the opportunity to ask you something about overscan coding (and it is related a bit to the prefetch issue). I managed to read your excellent articles on the subject that are still not translated from French. I learned a lot! Thanks.


Thanks for the compliment. It's nice that stuff I did so long ago is still so relevant!

ijor wrote:There is however one thing that seems to be wrong, I think.
You explain that the following code:

Code: Select all

move.b    d0,(a0) ; (8 cycles) Switch to mono
nop                     ; (4 cycles) Left border Release
move.b    d1,(a0) ; (8 cycles) Switch to low


can be replaced with this:

Code: Select all

move.b    d1,(a0) ;  8 cycles
clr.b        (a0)      ;  12 cycles


However the timing of both sequences is not the same. On the optimized code the switch back to low rez is executed in the very last bus cycle. But on the first example, the switch is perfomed 4 cycles earlier. This is because “move” first writes and then it prefetches, while “clr” first prefetches and then it writes (clr it’s actually a read/modify instruction).

I don’t know why this was never noted. Maybe you always used the second form, or maybe GLUE can tolerate this 4-cycle variation. Or maybe I'm missing something?
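
Laid out bus cycle by bus cycle, the difference between the two sequences looks roughly like this. The cycle numbers are a sketch derived from the description above (each bus access taking 4 cycles, counted from the start of each sequence), not a measurement; register names are kept as in the two listings.

Code: Select all

; Sequence 1: 20 cycles in total, writes 12 cycles apart
move.b    d0,(a0) ; cycles 0-3 write (mono), cycles 4-7 prefetch
nop               ; cycles 8-11 prefetch
move.b    d1,(a0) ; cycles 12-15 write (low), cycles 16-19 prefetch

; Sequence 2: 20 cycles in total, writes 16 cycles apart
move.b    d1,(a0) ; cycles 0-3 write, cycles 4-7 prefetch
clr.b     (a0)    ; cycles 8-11 read, cycles 12-15 prefetch, cycles 16-19 write (low)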


Good question. I didn't know anything about prefetches at the time I wrote the articles. I remember testing the clr version and it worked on my Atari, but I don't remember subjecting it to rigorous testing as I did for other things. (Writing an autoboot program and powering the computer on and off got boring after a while.) In released demos I used the move form because one could cheat: one could write some of the address registers themselves (holding ff8260 and ff820a, IIRC) to the relevant hardware registers. One got away with it because the upper bits were ignored. So I only used 2 address registers. Also, my digital sound replay code was interspersed with the "nops" in the border switching code, which simplified integrating the rest of the code.

Sengan
Alien / ST-Connexion
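
One plausible reading of the address-register trick mentioned above, as a sketch rather than Alien's original source: with a0 holding $ffff8260, a word write of a0 to (a0) puts the byte $82 into the resolution register, whose low two bits then read as 2 (mono), so no data register is needed for that value. The register assignment and the d0 = 0 setup are assumptions made for illustration.

Code: Select all

lea       $ffff8260.w,a0 ; assumed setup: a0 -> resolution register
moveq     #0,d0          ; assumed setup: d0 = 0 (low resolution)

move.w    a0,(a0) ; (8 cycles) writes $8260, byte $82 lands on $ff8260 -> mono
nop               ; (4 cycles)
move.b    d0,(a0) ; (8 cycles) back to low

The same idea would apply to an address register holding $ffff820a, where the byte $82 has bit 1 set and therefore selects 50 Hz.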

User avatar
Dbug
Atari freak
Atari freak
Posts: 52
Joined: Tue Jan 28, 2003 8:42 pm
Location: Oslo (Norway)
Contact:

Other pairing weirdness cases

Postby Dbug » Thu Dec 29, 2005 1:18 pm

Hi.

I just found this discussion by actually googling for something totally different, and I'm quite happy about it: it just proved to me that I was not crazy :)

Years ago, when doing some of the fullscreen screens of the Phaleon demo, I got very nasty timing problems. Mit was assuming at the time that I had got my timings wrong, but then I had the killing proof for him :)

If you remember, there is one screen called the FullBall, where there is a face of a manga character (a princess from Ys console game), that blinks her eyes, and then there are some blue and brown vector balls and a disting Next logo.

The screen itself is not particularly interesting or technical, but the cool thing is that I did split rasters: on one side of the head the rasters are of a particular color, and on the other side they are of another color. So if you suppress the bitmap of the screen, you can see a very neat rasters separation => practical to see if you got the timing wrong.

And well, at that time I was not cheating with precomputed sprites and stuff, so everything is fully drawn/masked/restored on screen, meaning that I have a shitload of and/or instructions, many of them not being multiples of 4 cycles and thus supposedly being rounded up to the next 4-cycle boundary.

And well I got the problem of sometimes having one less nop here and there in my scanlines thus fornicating up the fullscreen completely.

What I did then was to replace the fullscreen instruction addresses by neutral values (to disable the fullscreen while using the same amount of CPU time, and to disable the bitmap so I could just see the rasters).

Then just by switching the order of some instructions, adding a nop in between, before, or after some sequences of my sprite routine, the rasters were moving by 4 pixels to the left from that point.

I never managed to get a consistent rule to know how to win these nops, so it was more a pain in the ass than a real cool trick, because I kept getting timing errors in my fullscreen so in the end I stopped doing hardcore integration and just repeated a sequence that worked fine.

In short, boolean operations on registers are also pairable.

User avatar
rdemming
Atariator
Atariator
Posts: 27
Joined: Wed Jul 04, 2007 9:55 pm
Location: Amstelveen / The Netherlands
Contact:

Re:

Postby rdemming » Wed Feb 25, 2009 2:42 pm

Hi all,

Sorry for digging up this thread :P
It was a very interesting read, especially the translated border removal articles by Alien / ST-Connexion in Alive magazine 9 and 11. I hope to see a translation of the sync-scrolling article as well.

I started to play around with left/right border removal after reading an article about it in Maggie disk magazine by Flix of Delta Force.
It was great to see my own border removal code working for the first time. I just read the great articles by Alien and only now I understand why border removal works. Before reading this it was for me more like: "It just works" :D


ljbk wrote:I am not talking about the 0 bytes lines: i know about that effect too both with a 60/50 switch and with a 71/50 switch but there the timing is so critical that it does not always work in the same way with my own STF.


When I was experimenting with left/right border removal, I discovered 0 bytes lines by using the 70/50 Hz switch. I noticed that there were two places where you can get a 0 byte line with that switch.

The first one can be done at the beginning of a timer B interrupt or vertical blank interrupt; it is not so time critical and has a relatively long interval (28 cycles) between the switch to 70 Hz and the switch back to 50 Hz. But if you do this between two visible lines, the next few lines are bent. If you do this at the top of the screen, thus before any visible lines, you won't have any bending because there are no lines to display. But I discovered that this method shows a black/white screen when using RF output on a TV instead of RGB output.
One thing to note about this is that timer B does not count the lines you skipped (maybe because the switches happen at non-visible lines in the VBI case; I didn't check what happens when it is done at a visible line).

The second option is much more time sensitive and you need cycle-synchronised code. The switch happens later in the line than with the first method, and the switch back to 50 Hz is immediate. This doesn't bend the few lines below it like the first method does. Since it is done at a visible line, timer B counts the line you skipped.
On an STF the switch should be done 4 cycles earlier than on an STE. As I read elsewhere, the STE timings are a bit off compared to the STF timings.

I used this technique on the "Lemmings color shock" quest screen in the Nostalgic-O-Demo to bounce (a part of) the screen and in the vertical cookie scroll text.
Unfortunately 0 byte lines are not yet emulated by SainT (Leonard, can we expect an update for this? :) ). It seems that, at the moment, only Hatari emulates 0 byte lines.

For years I was wondering why it seems that I am the only one who used the 0 byte line technique (does somebody know another demo screen that uses 0 byte lines?). But now I read that Paulo and Alien state that this method is very time critical and might not work reliably on all STs/STEs. So that might be the reason. At that time I only had access to one STF and one STE, so I couldn't test it more.

So I want to question if people ran the "Lemming color shock" screen on real hardware and if that worked for them. I'm interested to hear if it doesn't work for them. If you didn't run it on real hardware, please do it now :P

Thanks and regards,


Robert


P.S. I read everywhere that there are lots of GLUE/Shifter combinations for STF computers that make it difficult to get a full screen working on every STF. Is there also so much variation with STE computers, or is it safe to assume that if it works on one STE, it works on other STEs (and Mega STEs) as well?
There are 10 kinds of people. Those who understand binary and those who don't.

User avatar
leonard
Moderator
Moderator
Posts: 660
Joined: Thu May 23, 2002 10:48 pm
Contact:

Re: What is "sync scrolling"?

Postby leonard » Wed Feb 25, 2009 10:43 pm

Hi Robert!

I just released SainT v2.13, and I added the 0 byte line support so your screen in Nostalgic is now perfectly working :-)

http://leonard.oxg.free.fr

Cheers,
Leonard
Leonard/OXYGENE.

User avatar
thothy
Hatari Developer
Hatari Developer
Posts: 428
Joined: Fri Jul 25, 2003 9:36 am
Location: Germany
Contact:

Re: Re:

Postby thothy » Thu Feb 26, 2009 8:03 am

rdemming wrote:I used this technique on the "Lemmings color shock" quest screen in the Nostalgic-O-Demo to bounce (a part of) the screen and in the vertical cookie scroll text.
Unfortunately 0 byte lines are not yet emulated by SainT (Leonard, can we expect an update for this? :) ). It seems that, at the moment, only Hatari emulates 0 byte lines.


Hehe, right, I remember this screen... I used the Nostalgic-O demo for testing the sync-scrolling emulation from Nicolas and I finally ran into that Lemming screen ... I told Nicolas that it was not working and a short time later he had implemented the 0 byte line emulation :-)

rdemming wrote:So I want to question if people ran the "Lemming color shock" screen on real hardware and if that worked for them. I'm interested to hear if it doesn't work for them. If you didn't run it on real hardware, please do it now :P


I remember running this screen on my real ST - and I was quite impressed when I saw it running ... so it worked on my machine (a german 1040 STf).

User avatar
rdemming
Atariator
Atariator
Posts: 27
Joined: Wed Jul 04, 2007 9:55 pm
Location: Amstelveen / The Netherlands
Contact:

Re: Re:

Postby rdemming » Thu Feb 26, 2009 10:23 am

leonard wrote:Hi Robert!

I just released SainT v2.13, and I added the 0 byte line support so your screen in Nostalgic is now perfectly working :-)


Wow, that is fast! Only hours after my request a new version :D I suppose 0-byte lines were already in the works :P
I tested it immediately and indeed works perfectly now. Thank you very much.

thothy wrote:I remember running this screen on my real ST - and I was quite impressed when I saw it running ... so it worked on my machine (a german 1040 STf).


Thank you very much. It is my favorite screen of the screens I've done. I think the screen is very dynamic because lots of (color) effects happen over time. I spent a great deal of time designing the color effects. I don't know how I got the idea of the Lemmings but I think it turned out to be a very nice effect.
I should finish my demo disk sometime with more or less unreleased screens. But due to the hard disk crash around 1994, the source of the main menu was lost :(.


For Leonard, thothy and STEEM authors, I've found another program for you to debug. The hardware scrolling of the game "No Buddies Land" by "Expose Software" is not working on any current ST emulator. It is a vertical scrolling game but it doesn't use any borders. On SainT the screen is flashing/distorted when scrolling. On Hatari and Steem, the screen scrolls 8 lines at a time instead of 1 line. The screen is not distorted but the status bar is moving up and down while scrolling. So it seems they use a technique not used by anybody else and I'm curious what they are doing.

Robert
There are 10 kinds of people. Those who understand binary and those who don't.

User avatar
leonard
Moderator
Moderator
Posts: 660
Joined: Thu May 23, 2002 10:48 pm
Contact:

Re: Re:

Postby leonard » Thu Feb 26, 2009 9:54 pm

Wow, that is fast! Only hours after my request a new version :D I suppose 0-byte lines were already in the works :P
I tested it immediately and indeed works perfectly now. Thank you very much.


Version 2.13 was already in the works (new STE features for More or Less Zero demo). But the 0 byte line support was made yesterday just after reading your post :-) I have a special feature in SainT which logs every shifter access, so I got the exact timing you used for the 0 byte line :-)

For Leonard, thothy and STEEM authors, I've found another program for you to debug. The hardware scrolling of the game "No Buddies Land" by "Expose Software" is not working on any current ST emulator. It is a vertical scrolling game but it doesn't use any borders. On SainT the screen is flashing/distorted when scrolling. On Hatari and Steem, the screen scrolls 8 lines at a time instead of 1 line. The screen is not distorted but the status bar is moving up and down while scrolling. So it seems they use a technique not used by anybody else and I'm curious what they are doing.


Very interesting. When you say "no border", you mean a fullscreen game? I have never heard of any fullscreen game. Could you post the disk image here so we can test it?
Leonard/OXYGENE.

User avatar
rdemming
Atariator
Atariator
Posts: 27
Joined: Wed Jul 04, 2007 9:55 pm
Location: Amstelveen / The Netherlands
Contact:

Re: Re:

Postby rdemming » Fri Feb 27, 2009 8:19 am

leonard wrote:Version 2.13 was already in the works (new STE features for More or Less Zero demo). But the 0 byte line support was made yesterday just after reading your post :-) I have a special feature in SainT which logs every shifter access, so I got the exact timing you used for the 0 byte line :-)


Clever. Thank you for implementing it so fast :)

B.T.W. It only works when SainT is in STE mode and not in STF mode. Remember that I said that the 0-byte line timings are 4 cycles earlier on an STF than on an STE. So the program detects if it is running on an STE by writing and reading from register $ff8205. If not running on an STE it modifies the 0-byte line code to execute 4 cycles earlier.
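
A sketch of the kind of check described here, under several assumptions: the routine name detect_ste and the register usage are made up for illustration, it must run in supervisor mode, and it relies on the video address counter byte at $ff8205 being writable on an STE but read-only on an STF. The actual code in the screen may well differ.

Code: Select all

detect_ste:                      ; hypothetical helper, supervisor mode assumed
move.w    sr,-(sp)               ; save the interrupt mask
ori.w     #$0700,sr              ; keep interrupts out of the test
move.b    $ffff8205.w,d0         ; d0 = current counter high byte
move.b    d0,d1
eori.b    #1,d1                  ; d1 = same value with bit 0 flipped
move.b    d1,$ffff8205.w         ; try to write it (ignored on an STF)
cmp.b     $ffff8205.w,d1         ; did the write stick?
seq       d2                     ; d2 = $ff on an STE, $00 on an STF
move.b    d0,$ffff8205.w         ; put the original value back
move.w    (sp)+,sr               ; restore the interrupt mask
rts

The caller could then test d2 and select the STF or STE variant of the 0-byte line code before the effect starts.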

Very interesting. When you say "no border", you mean a fullscreen game? I have never heard of any fullscreen game. Could you post the disk image here so we can test it?


Sorry for the confusion. With "it doesn't use any borders" I meant it doesn't display graphics in any border. So it is not full screen. Actually the visible height of the screen is about 16! lines smaller because the hardware scroll snoops away the top lines.

I attached the image for you to test.
Here is the entry at Atari Legend.


Robert
There are 10 kinds of people. Those who understand binary and those who don't.

User avatar
leonard
Moderator
Moderator
Posts: 660
Joined: Thu May 23, 2002 10:48 pm
Contact:

Re: What is "sync scrolling"?

Postby leonard » Fri Feb 27, 2009 9:19 pm

B.T.W. It only works when SainT is in STE mode and not in STF mode. Remember that I said that the 0-byte line timings are 4 cycles earlier on an STF than on an STE. So the program detects if it is running on an STE by writing and reading from register $ff8205. If not running on an STE it modifies the 0-byte line code to execute 4 cycles earlier.


Oh, I didn't know that, thanks :-) I just made a fix. It will be out in the next version, v2.14 :-)
Leonard/OXYGENE.

