Dio wrote:So firstly, the DE to shifter LOAD is variable depending on the wakeup state. Hence my notation DL3-DL6 for each wakeup state.
I see, I finally understand this DL3-DL6 notation.

This would imply that the shifter counter starts running later, depending on the WU state.
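If I read the DL3-DL6 notation right, it could be sketched like this; the mapping of wakeup state to delay is an assumption for illustration, not an established fact:

```c
#include <assert.h>

/* Hypothetical mapping of wakeup state (1..4) to the DE -> /LOAD delay
   in cycles, as I read the DL3-DL6 notation: each wakeup state shifts
   the moment the shifter counter starts running by one cycle.
   Which state maps to which delay is an assumption, not a fact. */
static int de_to_load_delay(int wakeup_state)
{
    assert(wakeup_state >= 1 && wakeup_state <= 4);
    return 2 + wakeup_state;          /* DL3 .. DL6 */
}
```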
Dio wrote:Then if you look at the traces, you can see where the writes to the registers are happening. Look at the A23 line - those indicate accesses to memory addresses with the top bit set - i.e. hardware registers or ROM. If you can also see R/W low, then it's a write. So if you look at, say, the 158-byte line, you can see a pair of writes either side of the disabling of DE, which must be the two writes to the syncmode register.
This is indeed interesting: apparently DE is very close to those writes, around cycle 372.
Dio wrote:Similarly, the 14-byte line shows two HW register writes either side of the small block of DE. So you can use those to correlate emulator time with the traces.
Dio wrote:But there's a deeper can of worms here: what is the accurate definition of 'cycle 376'?
You say "cycle 376 sync 0 -> right border off". But that's a huge simplification of what actually happens.
I didn't mean to explain it all, only to illustrate again what is meant by "emulator cycles".
There's a relation between those cycles and pixel cycles ("Paulo cycles"):
emulator cycles = Paulo cycles +83
Paulo cycle 1 = emulator cycle 84.
Here's a constant in Steem:
Code:
#define CYCLES_FROM_HBL_TO_LEFT_BORDER_OPEN 84
This means that a palette change at cycle 84 could affect pixel 1 of a normal 50Hz scanline.
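To make the relation concrete, here is a small sketch in C of the cycle arithmetic above. The constant is Steem's; the helper names are mine, for illustration only:

```c
#include <assert.h>

#define CYCLES_FROM_HBL_TO_LEFT_BORDER_OPEN 84  /* Steem's constant */
#define CYCLES_PER_SCANLINE_50HZ 512            /* one 50Hz scanline */

/* Paulo cycle 1 = emulator cycle 84, i.e. emulator = Paulo + 83. */
static int paulo_to_emulator(int paulo_cycle)
{
    return paulo_cycle + (CYCLES_FROM_HBL_TO_LEFT_BORDER_OPEN - 1);
}

/* The emulator's line cycle wraps back to 0 every 512 cycles at 50Hz. */
static int line_cycle(long total_cycles)
{
    return (int)(total_cycles % CYCLES_PER_SCANLINE_50HZ);
}
```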
In the graph I commented on, leaving WU states aside, you would have DE activating around cycle 56, then the latency, the prefetch, the latency again, and the pixel being rendered around cycle 84. I assumed that's when R2, G2, B2 change, but I'm not sure.
Dio wrote:If we consider the emulator cycle to start at the beginning of that state machine (MC1) then the CPU write hits the Glue in MC1 or MC2 (assuming it's in normal phase and not on the normally unavailable half cycles). The Glue then does the comparison - but not immediately, but at some later point (probably 1-4 cycles later depending on the wakeup state). 1-4 cycles after that the MMU sees DE, and probably 3 cycles after that it issues the first /LOAD.
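That chain of delays can be sketched as a parameterised computation. The per-stage delays (1-4 cycles each, depending on wakeup state) and the final ~3 cycles to /LOAD come from the description above, but the exact per-state values are unknown, so this only models the shape of the chain:

```c
/* Rough sketch of the write -> /LOAD pipeline described above.
   glue_delay and mmu_delay are each 1..4 cycles depending on the
   wakeup state; the real per-state values are not established,
   so they are left as parameters. */
static int first_load_cycle(int write_cycle, int glue_delay, int mmu_delay)
{
    int glue_compare = write_cycle + glue_delay;  /* Glue does its comparison */
    int mmu_sees_de  = glue_compare + mmu_delay;  /* MMU sees DE */
    return mmu_sees_de + 3;                       /* first /LOAD issued */
}
```

So in the best case the first /LOAD comes 5 cycles after the write, and in the worst case 11.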
Emulator cycles start from 0 at the very first frame, then reset to 0 every 512 cycles (one scanline at 50Hz).
In Steem, you have this simple schema for WU:
Code:
State 1
+-----+
| CPU |
+-----+
| MMU |
+-----+
State 2
+-----+
| MMU |
+-----+
| CPU |
+-----+
We generally consider WU1 when talking about cycles.
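My reading of that schema, as a sketch (not Steem's actual code): within each 4-cycle slot, one chip gets the first two cycles and the other gets the last two, and the wakeup state decides which is which:

```c
/* Sketch of the simple WU schema above: in state 1 the CPU owns the
   first 2 cycles of each 4-cycle slot and the MMU the last 2; in
   state 2 it's the other way round. Returns 1 if the CPU owns the
   given cycle. This is my reading of the diagram, for illustration. */
static int cpu_owns_cycle(int wakeup_state /* 1 or 2 */, int cycle)
{
    int first_half = (cycle % 4) < 2;   /* first 2 cycles of the slot */
    return (wakeup_state == 1) ? first_half : !first_half;
}
```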
Dio wrote:At the moment emulators perform the simplification, and it almost works because in general wakeup states are ignored, everything quantises to a nice 4-cycle boundary, the effects of the two variable delays are inverted and so cancel out, and it 'just works'. But that's not a true simulation of what's going on - it's just a HLE (High Level Emulation) that emulates the effect rather than the signals in any great detail.
In particular, it doesn't lead to any greater insights about the hardware. In order to gain those, it's necessary to unpick all the fiddly little details and properly consider all the wakeup states. Especially when it comes to considering the +2 cases where the write to the Glue happens half-way between two CPU phases.
In the current version of Steem, WU isn't generally ignored, but as we don't fully master it yet, it's optional.
The first goal is to run programs.
In all versions of Steem, cycle precision is 2, not 4.
Dio wrote:So what I'm trying to answer is the root hardware definition: what is "cycle 376"? What does zero mean in this numbering system? Is the write actually happening on cycle 376, or is it cycle 377? Where does the Glue do its comparisons? At what point does the MMU react? What is the logic - and latency - in the Shifter?
That's what I'm trying to answer, practically: it is the time when the write happens.
You yourself identified places where we could attach emulator cycles to the graph, around cycle 372 for line -2.
I already hinted at the time when the shifter counter starts running.
Dio wrote:This may lead to the ability to do a genuine low level simulation, rather than just trapping write addresses, looking up a table and seeing what's supposed to happen.
You know I like this approach, as I implemented real emulation of the PC offset in bus errors.
With one caveat: performance. The shifter trick tests themselves can already be quite taxing, especially in Steem.
Just compare CPU usage for desktop with Overscan Demos #6.
And of course, it must work.