Bad Mood : Falcon030 'Doom'

Eero Tamminen · Post by **Eero Tamminen** » Sun Mar 03, 2013 11:33 am

dml wrote:It seems to run faster on a real machine for simpler scenes, but faster in Hatari for complex scenes. Probably related to the small amount of FPU code still in there affecting walls.

Hatari FPU timings are known to be off a lot, but getting them right hasn't been a priority. If you bring this up on hatari-devel, Laurent might take a look at it.

dml · Post by **dml** » Sun Mar 03, 2013 12:10 pm

CiH wrote: And I found something that might be a closer match to what you're doing. This video link is for a modestly accelerated Amiga, which would be a 68030 based booster, at 28 MHz. So a faster CPU than ours, but no DSP. It also has a shedload (32MB) of ram in comparison.

One of the fun things about this conversion was to find out what a stock Falcon can do in unmodified form, versus its contemporaries. So pick a stock Amiga 1200 (?) and that's the comparison I think. I don't remember if the 1200 comes with FastRAM as standard (I think it was an optional, upgrade thing) - but if it does, that's in the Amiga's favour because the STRam bus is a serious problem on the Falcon and it's tough work making up for that loss. If not, then sorry Amiga

There's an argument that the DSP makes comparison 'unfair', but OTOH...

- There isn't much point in comparing between two 'artificially balanced' machines, because the result would be... identical (unless one is poorly coded compared with the other but I never really saw this as a coding competition - more about maximising this Atari box when exposed to a (then) demanding game engine).
- Using the DSP is hard work. Just having it there doesn't help by default - it requires a plan and a lot of effort and with the best intentions it can't solve every problem for you (no external ram access, bus-based interface with CPU, can't RAS @ large datasets) esp. if the problem you're trying to solve is fully defined in advance.
- At that time I didn't see many people using the DSP so it was a good excuse to make use of it for something that seemed 'out of scope' and would look favourable vs other machines at the time. Things changed since then, but it's still a challenge to make it go.

dml · Post by **dml** » Sun Mar 03, 2013 12:14 pm

Eero Tamminen wrote: Hatari FPU timings are known to be off a lot, but getting them right hasn't been a priority. If you bring this up on hatari-devel, Laurent might take a look at it.

TBH this one isn't such a big problem. Getting the CPU timings & i/d-cache interactions accurate is probably a much higher priority. I don't imagine much stuff uses the FPU and also expects timing to be sensitive (unlike DSP/CPU host exchanges which are highly optimizable only if the ratios are known).

DrTypo · Post by **DrTypo** » Sun Mar 03, 2013 2:10 pm

dml wrote: One of the fun things about this conversion was to find out what a stock Falcon can do in unmodified form, versus its contemporaries. So pick a stock Amiga 1200 (?) and that's the comparison I think. I don't remember if the 1200 comes with FastRAM as standard (I think it was an optional, upgrade thing) - but if it does, that's in the Amiga's favour because the STRam bus is a serious problem on the Falcon and it's tough work making up for that loss. If not, then sorry Amiga

A stock A1200 doesn't have Fast-RAM, only 2Mb of chip-RAM. The chip RAM bus is a 7.2MHz 32bit bus (vs 16MHz 16bit bus on the Falcon). Fast-RAM bus is 14.4MHz 32bit.
From what I've heard, adding Fast-RAM to an A1200 nearly doubles its performance.
I'm not sure an A1200+FastRAM would do better than a stock Falcon at Doom, despite the bus bandwidth advantage. The A1200 has to do an expensive c2p conversion, and there is no DSP to help the CPU.

I tried BMT401.TTP on my Falcon. With default view on DOOM1.WAD I get 6.5573 fps (vs 4.8484 for BM307.TTP). This is an impressive speed-up, and you're not done yet

At 25MHz I get 11.1888 fps.

Cyprian · Post by **Cyprian** » Sun Mar 03, 2013 4:59 pm

DrTypo wrote:A stock A1200 doesn't have Fast-RAM, only 2Mb of chip-RAM. The chip RAM bus is a 7.2MHz 32bit bus.

Chip RAM in the A1200 is a bit slower than ST-RAM in the Falcon. The CPU in the A1200 gets access to chip RAM every 8th CPU clock (1.77MHz), and in the F030 every 4th CPU clock (4MHz).
Here you can find a memory bandwidth comparison between the A1200 and the F030: http://www.atari-forum.com/viewtopic.php?f=15&t=11513
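
As a rough cross-check of the figures quoted above, the access rate is the CPU clock divided by the access interval, and multiplying by the bus width gives a theoretical ceiling. The sketch below is only back-of-envelope arithmetic: it assumes a 14.18 MHz PAL A1200 CPU clock (not stated in the thread), takes the 32-bit and 16-bit bus widths from DrTypo's post, and ignores video DMA and other bus traffic, so measured throughput will be lower. The function names are hypothetical.

```python
# Back-of-envelope CPU-to-RAM access rates from the figures quoted in the thread.
# Assumption: 14.18 MHz is the PAL A1200 CPU clock; it is not stated in the posts.

def access_rate_mhz(cpu_clock_mhz, clocks_per_access):
    """Millions of accesses per second when the CPU gets one slot every N clocks."""
    return cpu_clock_mhz / clocks_per_access

def peak_mb_per_s(cpu_clock_mhz, clocks_per_access, bus_bits):
    """Theoretical ceiling: access rate times bytes moved per access."""
    return access_rate_mhz(cpu_clock_mhz, clocks_per_access) * bus_bits / 8

a1200 = peak_mb_per_s(14.18, 8, 32)   # chip RAM, 32-bit bus
falcon = peak_mb_per_s(16.0, 4, 16)   # ST-RAM, 16-bit bus

print(f"A1200 chip RAM: {access_rate_mhz(14.18, 8):.2f} MHz, {a1200:.2f} MB/s")
print(f"Falcon ST-RAM:  {access_rate_mhz(16.0, 4):.2f} MHz, {falcon:.2f} MB/s")
```

Under these assumptions the wider Amiga bus does not make up for the longer access interval, which is consistent with the "a bit slower" remark above.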

dml · Post by **dml** » Sun Mar 03, 2013 8:02 pm

DrTypo wrote: A stock A1200 doesn't have Fast-RAM, only 2Mb of chip-RAM. The chip RAM bus is a 7.2MHz 32bit bus (vs 16MHz 16bit bus on the Falcon). Fast-RAM bus is 14.4MHz 32bit. From what I've heard, adding Fast-RAM to an A1200 nearly doubles its performance.

Cyprian wrote: Here you can find memory bandwidth comparison between A1200 and F030 viewtopic.php?f=15&t=11513

Thanks both for the details. I wasn't sure about the A1200 specs at all so this is helpful. My last contact with Amiga HW was A500

DrTypo wrote: I'm not sure an A1200+FastRAM would do better than a stock Falcon at Doom, despite the bus bandwidth advantage. The A1200 has to do an expensive c2p conversion, and there is no DSP to help the CPU.

Yes I expect the max framerate would be significantly capped by the c2p step - although map complexity might scale reasonably well with FastRAM especially if the view size is reduced.

DrTypo wrote: I tried BMT401.TTP on my Falcon. With default view on DOOM1.WAD I get 6.5573 fps (vs 4.8484 for BM307.TTP). This is an impressive speed-up, and you're not done yet. At 25MHz I get 11.1888 fps.

25MHz definitely helps

In the emulator I'm currently seeing 3.1495 FPS on the e4m2 startpoint with BM401.

Testing BM307 I get 2.1592 FPS under same conditions (or 1.8539 FPS with sprites on). So the optimizations are translating to more dense maps which is good.

I did a bit more today but not much time just now to post. More later.

dml · Post by **dml** » Sun Mar 03, 2013 9:42 pm

Over the last few days I carried out a simple (but tedious) renaming/name-spacing exercise on the DSP module, which didn't itself optimize anything. However it did allow the lower (fastest) 64 words of internal ram to be aliased multiple times in different pieces of code instead of mapping the 'hottest' references permanently. This in turn has allowed a lot of code to be rewritten more efficiently since there is no longer competition for fast memory (unlimited number of fast variables).

Apart from allowing the fastest addressable range to be used more often, it has made most of the code easier to adapt, rewrite and extend since many of the names are now temporary, permanent allocations have shrunk and space conflicts between unrelated blocks of code no longer occur. It's a fairly typical example of higher level changes opening up lower level optimizations (I should have done this from the start but was probably still getting my head around DSP programming generally).

The pattern from now on will be to transfer any directly referenced (non-register) variables to temporary fast variables before being referenced in costly code. There will be no need for directly referenced long addresses in performance sensitive code.

dml · Post by **dml** » Tue Mar 05, 2013 10:00 am

Last night I took a broad look at a number of things and did some experiments.

I went through the 68k code responsible for top level scene view composition, which hadn't had much attention since near the start (most of it was translated at some point earlier from another C project). The most obvious problem is the lack of data preparation for drawing - lots of indirection and lookups, e.g. finding sectors, nodes, linedefs for segs and vice versa, etc. This stuff should be processed & linked up at load time so the scene code can just find everything directly when needed.
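
A minimal sketch of the load-time linking idea described above, written in Python for brevity rather than 68k. The record layouts are simplified stand-ins (real WAD lumps carry more fields, and the path from a seg to its sector goes through sidedefs), and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Sector:
    floor_height: int
    ceiling_height: int

@dataclass
class Linedef:
    front_sector_index: int                  # index as stored on disk (simplified)
    front_sector: Optional[Sector] = None    # filled in at load time

@dataclass
class Seg:
    linedef_index: int                       # index as stored on disk
    linedef: Optional[Linedef] = None        # filled in at load time
    sector: Optional[Sector] = None          # shortcut: seg -> sector directly

def link_level(sectors: List[Sector], linedefs: List[Linedef], segs: List[Seg]) -> None:
    """Resolve every index once so the per-frame code never has to."""
    for line in linedefs:
        line.front_sector = sectors[line.front_sector_index]
    for seg in segs:
        seg.linedef = linedefs[seg.linedef_index]
        seg.sector = seg.linedef.front_sector

sectors = [Sector(0, 128), Sector(16, 96)]
linedefs = [Linedef(1), Linedef(0)]
segs = [Seg(0), Seg(1), Seg(0)]
link_level(sectors, linedefs, segs)

# Scene code can now follow one reference instead of two table lookups.
print(segs[0].sector.floor_height)
```

The trade-off is a little extra memory per record in exchange for removing the indexed lookups from the per-frame path.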

Two very (!) small changes here increased the FPS on e4m2 from 3.14FPS to 3.21FPS - much more than all the DSP optimizations taken together so far (texturing excluded since this is always a bottleneck in any form). This also correlates with the profiler results showing the scene view stages increasing in cost disproportionately with scene complexity (for e1m1 it's in 7th place @ 4.2% with floor drawing costs dominating, but e4m2 it's 2nd from the top @ 12%, with wall add/draw costs dominating at 1st and 3rd place).

So this is where I will be spending my time next as it's a relatively easy speedup.

The other thing I did was start looking at the Doom source. This has been interesting for a number of reasons - in many ways it works as I expected but there are some really strange things in there which led me to compare some areas in more detail.

- They went to great lengths to avoid fixed point math (no, extreme lengths). BM does use fixed point in a number of areas - although in some sense it's not much more difficult to implement rapid fixed point in assembly language than integer work - whereas fixed point in C is generally a P.I.T.A. They probably did this mainly for performance reasons (386 = not great for mul) but I wouldn't be surprised if some of it was to make things simpler overall. The code is mostly table indexing and conditional operations, not much numerical happening anywhere (absence of shifts and multiplies in all the usual places). Divides are almost non existent - not that they should be required much anyway - it's much harder to eliminate the muls!

- They only view-tested one 'side' of the BSP. I'm not sure why this is yet. I view-test both sides since material on either side can contribute to the view, even if a large portion on the near side will be behind the viewer. I tried with their single-side method and BM gets a bit slower - not by a large margin but enough for me to stick with what I had. There may be a reason for this hidden somewhere else in the code but I don't see it yet.

- They used searches for the occlusion list and for visplane lookups. This is something I'm glad BM doesn't have to do, particularly for the visplanes. BM's visplane builder is a bit smarter I think, and cost is linear with span count (or: spans have a constant cost). Note: I recently read that later versions of the Doom code changed these searches to hashtables but it still seems a bit wrong to me!

- Occlusion is done with spans ('solidsegs'), not individual columns, and walls are broken up into spans between those. This is quite different from BM, which performs occlusion testing/insertion per column on complete walls. I'm on the fence with this one - don't know if it's better to stick with BM's method or change it to a span list - there are pros and cons to each. The per-column method suits the DSP well and the cost is not showing on the CPU side at all - so my guess is that it's not worth the trouble. However if it does for some strange reason turn into a visible cost, I can always change it (but would try for a better solution than a linear span search).

- Doom appears to draw upper/middle/lower walls in vertical/sequential order (i.e. a whole seg column at a time, starting at the top wall and working downwards). BM draws the whole upper, middle and lower walls separately, one texture at a time. I suspect they did this because available cache for texturing was zero (or small) and texture switching didn't hurt that much - but scan converting the columns did hurt a lot, so sharing the scan between 3 wall portions made sense. This is my guess, not sure if that's the whole story. This is something I may change slightly in BM, but the purpose will be to reduce wall def transmissions to the DSP - not necessarily to share scan conversion, as the scanning costs practically nothing at the moment (although any opportunity for sharing will be used if it doesn't upset anything else).

- Doom does everything with angles, polar coordinates, weird reverse lookup tables. BM has an approximate analogue with each of these steps but does them in entirely different ways (more vector based). There might be value in converting some of this stuff which must remain on the CPU (but for DSP it makes no real sense to turn fast multiplies into large and potentially slower table indexing!)

- The upper/middle/lower ceiling/floor height logic (and how that translates into the need for wall generation) is quite different in BM and Doom, and it is difficult to compare them. It will take me some time and testing to see if there are any hidden secrets in the Doom version. I expect this is the really interesting bit.
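
To illustrate the visplane lookup point in the list above: a linear search compares every existing plane against the wanted key, whereas a keyed lookup finds it in one step. This is only a Python sketch of the general trade-off; it is not taken from Doom, from later source ports or from BM, and the (height, texture, light) key and the function names are assumptions for illustration.

```python
def find_plane_linear(planes, height, picnum, light):
    """Search the list front to back; cost grows with the number of planes."""
    key = (height, picnum, light)
    for index, plane in enumerate(planes):
        if plane == key:
            return index
    planes.append(key)
    return len(planes) - 1

def find_plane_keyed(planes, index_by_key, height, picnum, light):
    """Dictionary lookup; cost stays roughly constant per request."""
    key = (height, picnum, light)
    index = index_by_key.get(key)
    if index is None:
        index = len(planes)
        planes.append(key)
        index_by_key[key] = index
    return index

# Made-up plane requests; the third repeats the first.
requests = [(0, 5, 160), (128, 7, 160), (0, 5, 160), (0, 5, 144)]

linear_planes = []
keyed_planes, lookup = [], {}
linear_result = [find_plane_linear(linear_planes, *r) for r in requests]
keyed_result = [find_plane_keyed(keyed_planes, lookup, *r) for r in requests]

print(linear_result, keyed_result)
```

Both variants hand back the same plane indices; only the cost per request differs.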

Overall it's been worthwhile but it's not clear yet what this means for BM. Will take me a bit longer to figure that out.

[EDIT]

While writing that I realized why only one side of the BSP needs tested some of the time (doh) - although that isn't what the Doom code actually does. It tests only one side all of the time. Perhaps the complete version of the testing logic made so little difference they just kept the simpler version of the code in the end. I notice that the ReMooD project has an alternate path which always checks both sides and uses data recursion, and ends up slower (!) so who knows for sure

will stick with what I know works best in BM until something changes...

Eero Tamminen · Post by **Eero Tamminen** » Tue Mar 05, 2013 1:51 pm

dml wrote:The other thing I did was start looking at the Doom source. This has been interesting for a number of reasons - in many ways it works as I expected but there are some really strange things in there which led me to compare some areas in more detail.

Callgraph with instruction count information for the linuxdoom binary you posted earlier is attached.

I used a cleaned version of the symbols you included with it (no object file, local or BSS/DATA variable symbols) for the callee information:

Code:

grep -v -e '\.o$' -e ' t \.' -e ' [bBdDW] ' lindoom.sym.orig > lindoom.sym

The profile summary looks as follows:

Code:

$ hatari-profile.py -g -ts lindoom.txt
...
Time spent in profile = 52.55460s.

Calls:
- max = 624000, in pix68 at 0x20072, on line 86
- 1463607 in total
Executed instructions:
- max = 2492984, in _R_DrawColumn+124 at 0x4a7de, on line 3143
- 129919755 in total
Used cycles:
- max = 24953218, in _R_DrawColumn+150 at 0x4a7f8, on line 3148
- 843106812 in total

Calls:
 42.63%    623880  pix68
 17.21%    251915  _FixedMul
  4.45%     65066  _R_DrawColumn
  4.00%     58510  _R_GetColumn
  3.31%     48403  _P_DivlineSide
  2.51%     36733  _R_DrawSpan
  2.51%     36729  _R_MapPlane
  1.97%     28803  _Z_ChangeTag2
  1.88%     27558  _W_CacheLumpNum
  1.53%     22394  _SwapSHORT
  1.39%     20328  _SwapLONG
  1.31%     19210  _SlopeDiv
  1.31%     19206  _R_PointToAngle
  1.11%     16240  _FixedDiv

Executed instructions:
 29.81%  38733655  _strupr
 18.67%  24259096  _R_DrawColumn
 10.61%  13785876  _R_DrawSpan
  8.17%  10608234  pix68
  6.08%   7901655  _R_RenderSegLoop
  4.58%   5954052  _R_DrawPlanes
  4.29%   5576448  _W_CheckNumForName
  2.72%   3527538  _FixedMul
  1.81%   2353182  _R_MapPlane
  1.26%   1643470  ROM_TOS
  1.24%   1605890  _R_GetColumn
  1.02%   1326884  _P_DivlineSide

Used cycles:
 26.24% 221259750  _strupr
 20.46% 172460444  _R_DrawColumn
 11.89% 100216398  _R_DrawSpan
  8.74%  73702274  pix68
  6.72%  56676600  _R_RenderSegLoop
  4.04%  34057982  _R_DrawPlanes
  3.09%  26083490  _W_CheckNumForName
  2.64%  22217218  _R_MapPlane
  2.27%  19166296  _FixedMul
  1.71%  14401726  ROM_TOS
  1.38%  11634638  _R_GetColumn

Regarding Badmood, it would be nice if you could add some support for automation.

As Hatari can autostart programs, but not give autostarted programs arguments, it would be good if when Badmood gets no arguments:
- It would load some WAD file automatically (e.g. one renamed as "badmood.wad")
- Game would continue after loading the WAD without key press

With these small changes it's possible to completely automate profiling of a WAD's default view.

If you later on add a demo mode which would also start automatically, even more could be automatically profiled. There could also be some script that would link "badmood.wad" to a couple of different WADs and then automatically get profile data for all of them, diff the results against previous versions etc.

dml · Post by **dml** » Tue Mar 05, 2013 3:03 pm

The profile output looks sensible I think (not that I have much time to study it yet). I'd point out that 'pix68' isn't a function with heaps of calls - it's a local loop label (pix%=:;) defined inside a block of GCC inline asm, responsible for converting 8bit to 16bit framebuffer (and takes just under half of total time).

If you exclude that symbol the rest of the doom code shouldn't retreat so much.
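
The effect of excluding a symbol can be estimated from the numbers in the profile summary above. Each percentage is a symbol's share of the corresponding total, and dividing total cycles by the profiled time gives the effective clock rate. A small Python check using only figures copied from that output (the helper name is hypothetical):

```python
# Figures copied from the profile summary earlier in the thread.
TOTAL_CALLS = 1463607
TOTAL_CYCLES = 843106812
PROFILE_SECONDS = 52.55460

calls = {"pix68": 623880, "_FixedMul": 251915, "_R_DrawColumn": 65066}
cycles = {"_strupr": 221259750, "_R_DrawColumn": 172460444, "pix68": 73702274}

def share(count, total):
    """Percentage of the total, rounded to two decimals like the profiler output."""
    return round(100.0 * count / total, 2)

for name, count in calls.items():
    print(f"{share(count, TOTAL_CALLS):6.2f}% {count:10d}  {name}")

# Effective clock rate implied by the totals.
mhz = TOTAL_CYCLES / PROFILE_SECONDS / 1e6
print(f"{mhz:.2f} MHz")

# Cycle share of _R_DrawColumn once pix68 is left out of the total.
without_pix68 = share(cycles["_R_DrawColumn"], TOTAL_CYCLES - cycles["pix68"])
print(f"{without_pix68:.2f}%")
```

The implied clock rate comes out close to the Falcon's nominal 16 MHz, which is a quick sanity check that the cycle totals and the time figure agree.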

dml · Post by **dml** » Tue Mar 05, 2013 3:21 pm

Re: Automation - when BM is connected up with the Doom code it should allow demos to be run in the usual way from the commandline. Until then I'll just make it default to finding doom2.wad, doom.wad, doom1.wad in that order (or something along those lines).

A useful experiment we can try soon with lindoom is nullifying the R_RenderPlayerView or R_RenderBSPNode calls which sit at the root of all the stuff I'll be replacing and collecting some profiling information on the result (absolute times/cycles) so the approximate cost of the game code can be estimated in advance. It may also help to collect a callgraph from that, minus the drawing, to get a clearer view of the game code layout.

It's probably better to get the lindoom 'port' working with recorded demos first - I had problems with that (there are complicated versioning issues with Doom code, WADs and recorded demos which I haven't got my head around yet) so honest profiling info can be collected from collision detection, thing AI updates etc. All of that info is invisible at the moment.

Eero Tamminen · Post by **Eero Tamminen** » Tue Mar 05, 2013 4:36 pm

dml wrote:The profile output looks sensible I think (not that I have much time to study it yet). I'd point out that 'pix68' isn't a function with heaps of calls - it's a local loop label (pix%=:;) defined inside a block of GCC inline asm, responsible for converting 8bit to 16bit framebuffer (and takes just under half of total time).

If you exclude that symbol the rest of the doom code shouldn't retreat so much.

Having that kind of thing in the callgraph isn't IMHO a problem, now that (most of) interrupt handler interruptions are no longer interpreted as calls. It's only a problem in the top lists, but those are easier to interpret with the callgraph(s).

In callgraph you just see the function, in which the loop label resides, to be the only caller for that label. If the loop symbol is removed before profiling, then the cost is added to its parent in the callgraph, i.e. in this case to _I_FinishUpdate().

PS. Symbols are also useful with tracing. With "trace cpu_symbols" (and dsp_symbols), the debugger will output a flow of symbols that the PC passes through. It's much more readable than a CPU instruction trace.

Eero Tamminen · Post by **Eero Tamminen** » Tue Mar 05, 2013 4:46 pm

dml wrote:Here's an interim build which is relatively stable and has the recent changes rolled in.

When you start integrating doom C-code with your own 68k & DSP asm code... It would be nice if you could automate debug symbol generation, at least for the CPU side code of the resulting binary, so that each build you attach here includes symbols. I can then look into getting profiles out of them automatically.

dml · Post by **dml** » Tue Mar 05, 2013 4:49 pm

One thing the callgraph has been useful for (in the few spare minutes I've had to look through it today!) - is it reveals just how few bind points may be needed between Doom game code and BM to get them communicating (hopefully). e.g. when a thing crosses a ssector boundary - this is something BM needed to deal with for moving sprites - it should be possible to just post messages from those places in the Doom code to have BM state update properly - the BM representation is roughly analogous (thing references per sector - need to change that to per-ssector but it's minor).

Having wall textures update for thrown switches etc. (and other special case state) will probably be more annoying to find and bind up but it looks like most of that stuff is also done in one place in the Doom code.

dml · Post by **dml** » Tue Mar 05, 2013 4:51 pm

Eero Tamminen wrote: When you start integrating doom C-code with your own 68k & DSP asm code... It would be nice if you could automate debug symbol generation, at least for the CPU side code of the resulting binary, so that each build you attach here includes symbols. I can then look into getting profiles out of them automatically.

At this point I'll likely have to stop using Devpac anyway and switch to GCC+VASM, so symbol output will be part of the makefile (and output directly as 'nm' format). That should be ok.

kristjanga · Post by **kristjanga** » Tue Mar 05, 2013 6:29 pm

could anyone make a small video of the engine running and post it on youtube ?

Eero Tamminen · Post by **Eero Tamminen** » Tue Mar 05, 2013 6:42 pm

kristjanga wrote:could anyone make a small video of the engine running and post it on youtube ?

It's work in progress so it would soon be obsolete. It's also easy to test yourself. Just take the latest badmood binary attached to Douglas' comments, download the Doom v1 shareware WAD [1] and give its name as an argument to the Badmood TTP file. If you don't have a Falcon, it also works in the latest Hatari release; binaries of that are available for Linux, Windows and OSX.

[1] E.g. from here: http://doomwiki.org/wiki/DOOM1.WAD
Or the Freedoom wad: http://www.nongnu.org/freedoom/download.html

dml · Post by **dml** » Tue Mar 05, 2013 7:33 pm

One other decent experiment worth doing - trying to collect metrics from Doom in a similar fashion to BM (increment counters for various things during a frame, or just counting total hits on key functions for a single frame update). This might help show up any significant differences between Doom and BM in the most important areas of work elimination. e.g. if Doom emits 500 wall columns where BM is emitting 1500 in the same scene that would be a red flag. It's not all that easy to correlate the two codebases properly with the various differences involved but if anything is badly wrong in scene management this could help find it.

I can take screenshots from BM with the metrics up which might help with this exercise. I'll probably need to clean up the list though - take out irrelevant items and rename some of them - to make it more useful for direct comparison with Doom behaviour. Using a single startpoint probably isn't sufficient - one simple scene and one complex scene should be used.

The interesting areas will be:

- ssector count (pass visible)
- wall count (pass visible)
-- solidwall count
-- window count
-- mask/transparent count
- upper/lower/middle wall counts
- wall column count

There are plenty of others but these are the ones that would have 'implications' for any significant deviations.
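
A sketch of how such a comparison might be automated once both engines can dump per-frame counters. The metric names, the 1.5x threshold and the function name are assumptions for illustration, and the counts are made up apart from the 500 vs 1500 wall-column case, which is the hypothetical example from the post above.

```python
def flag_deviations(doom_counts, bm_counts, ratio_limit=1.5):
    """Return metrics where one engine's count exceeds the other's by ratio_limit."""
    flagged = {}
    for name in sorted(doom_counts.keys() & bm_counts.keys()):
        doom, bm = doom_counts[name], bm_counts[name]
        low, high = min(doom, bm), max(doom, bm)
        if high == 0:
            continue                      # both zero: nothing to compare
        if low == 0 or high / low > ratio_limit:
            flagged[name] = (doom, bm)
    return flagged

# Made-up per-frame counts, purely for illustration.
doom_frame = {"ssectors": 40, "walls": 95, "wall_columns": 500}
bm_frame = {"ssectors": 42, "walls": 101, "wall_columns": 1500}

print(flag_deviations(doom_frame, bm_frame))
```

A ratio test rather than an absolute difference keeps small counters (ssectors) and large ones (columns) on the same footing, though any flagged item would still need interpreting by hand.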

I expect we will see *some* differences show up - so half the task would be interpreting them and ruling out things that look strange but are actually ok/correct.

There's a scary amount of work involved in doing all this stuff so it does require some additional patience

dml · Post by **dml** » Wed Mar 06, 2013 10:03 am

Did another quick test last night with some 68030 PMMU init/restore code which runs BM under transparent translation only (default TOS PMMU tree disabled).

This seems to be working properly because I set the cache-inhibit flag on all of STRam and the resulting build runs very slowly despite having the caches enabled. The translation is mirrored every 16MB so the HW registers continue working as before. <- (doh, no it isn't - but I only masked the lower 16MB in any case) This was just a test to establish clean entry/exit before I build a local PMMU tree to create a non-cacheable shadow for display writing (and other contiguous buffer fills).

If this works as planned, it should allow the data cache to be used in write-allocate mode without being polluted by display writes (so register spills will be cached), speeding up some of the surrounding CPU code without slowing down pixel drawing or removing texels from the cache while drawing.

A shadow of STRam seems to be the simplest / most convenient way to achieve this because you can then mark any buffer non-cacheable at runtime by masking in the 'shadow bit' to the address before writing - so it's not even necessary to allocate special types of memory to do it.

Whether this provides any speed advantage at runtime is still unclear - it should, but hasn't been confirmed on 030 for this task. It certainly did the trick on the 68040 long ago and BM already contained code to interface with that (although it did this through a cookie rather than locally - the cookie was provided by my TK40 CPU/MMU driver project)

Setting up a local PMMU tree is probably the dirtiest, most evil thing I can think of to do inside an application - it's only supposed to be done by the OS and is highly incompatible with CPU and FastRAM upgrades mapping memory above STRam - but if it can speed up the base machine then it's going in - I'll make sure it can be turned on/off so it remains friendly

Of course - if it proves to offer no advantage, it's coming back out again!

dml · Post by **dml** » Wed Mar 06, 2013 3:30 pm

Earlier today I had a quick look at the TOS PMMU tree and reversed it, to help disambiguate the pretty bad documentation in the user manual. The 040 MMU seems to be a bit easier to use than the 030/68851 PMMU - although they are all pretty painful when it comes down to making something work.

I did this by grabbing/printing all of the PMMU registers on a real Falcon (from inside BM), identifying the configured bits and locating the tree (which isn't very big btw - it lives at $700) and reversed the table entries by eye, based on the UM and what I remembered from the 68040.

So it should now be fairly easy to map the non-cacheable shadow with a new tree.

I'll post my 'disassembly' of the PMMU state as soon as I've confirmed it's correct and replacing it works as expected.

One thing that confused me for a bit was the state of the SRP register - storing/restoring it caused a nasty crash. After inspecting all the registers I found the SRP had been disabled, and the SRP state itself is best described as 'a bag of random bits'. Loading CRP or SRP with invalid data causes a fancy exception and lots of bombs - even if you got that data from the registers in the first place. So if you start messing with the TOS PMMU state, save yourself some pain tripping over this one.

dml · Post by **dml** » Wed Mar 06, 2013 7:03 pm

Locally built PMMU tree is working as expected... implementing the same map as TOS. BM starts and exits cleanly.

Haven't had time to implement the shadow yet but the next update will probably have something to say about it. Will find out soon if any of this was worthwhile.

[EDIT]

Shadow memory is working - plotting pixels to ($f0000000+screenbase), and getting correct output. It's marked as cache-inhibited but I'm not seeing any significant change in performance just using it for the display. Will need to test it with buffer fills mixed in with other random access stuff, and with the write-allocate flag enabled, and make some measurements/comparisons... will post an update when I have some results.
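
The address arithmetic behind the shadow is simple enough to sketch. The $f0000000 base is taken from the post above; the 16MB window, the example framebuffer address and the function names are assumptions for illustration, and on the real machine this would of course be a single OR on an address register.

```python
SHADOW_BASE = 0xF0000000      # shadow region base, taken from the post above
STRAM_MASK = 0x00FFFFFF       # assumption: the shadow covers a 16MB window

def to_shadow(address):
    """Cache-inhibited alias of an ST-RAM address."""
    return SHADOW_BASE | (address & STRAM_MASK)

def from_shadow(address):
    """Back to the normal (cacheable) address."""
    return address & STRAM_MASK

screenbase = 0x00180000       # hypothetical framebuffer address
print(hex(to_shadow(screenbase)))
```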

Eero Tamminen · Post by **Eero Tamminen** » Wed Mar 06, 2013 9:54 pm

dml wrote:Did another quick test last night with some 68030 PMMU init/restore code which runs BM under transparent translation only (default TOS PMMU tree disabled).

On the Hatari side, MMU usage will mean that the old UAE CPU core (i.e. the default Hatari build) cannot be used with Badmood anymore. Additionally, the MMU version of the WinUAE CPU core is less cycle accurate than the default "cycle-accurate" WinUAE CPU core. The 030 MMU emulation is still work in progress (e.g. SSW register bits aren't emulated), i.e. you may encounter issues...

dml · Post by **dml** » Wed Mar 06, 2013 10:00 pm

Eero Tamminen wrote: On the Hatari side, MMU usage will mean that the old UAE CPU core (i.e. the default Hatari build) cannot be used with Badmood anymore. Additionally, the MMU version of the WinUAE CPU core is less cycle accurate than the default "cycle-accurate" WinUAE CPU core. The 030 MMU emulation is still work in progress (e.g. SSW register bits aren't emulated), i.e. you may encounter issues...

Yes I'm aware of that - I noticed BM was suddenly blistering fast when I enabled the MMU flag in the UI

I'm not doing anything which will permanently block BM from emulation - it's just an optional hardware trick that will likely have to be disabled in a varied range of circumstances, if it's kept at all.

dml · Post by **dml** » Wed Mar 06, 2013 10:22 pm

Here are my notes on the TOS PMMU tree mapping... you'll need the 030 user manual to make sense of it, but it should help fill in some gaps.

The table is only 3 levels deep and while the page descriptor size is set to 32k, there are no 'real' page descriptors in the tree. All addresses are terminated in short-form (4-byte) 'Early Termination Table Descriptors' which are basically page descriptors in the middle of one of the higher tables, with a size implied by the table depth (in this case, the used pages are 1mb - that's the finest resolution defined in the tree, and there are some much larger, higher level terminators which aren't interesting).

The top level of the tree (TA) divides the 32bit/4GB address space into 16 blocks. Only the first and last blocks are mapped properly. Entries in between marked 'ET' are early terminator pages, presumably just space fillers. (I also marked them 'P' for 'page' in the final column). The first half of the 4GB space is marked cacheable, whereas the 2nd half is marked non-cacheable/cache-inhibit (CI).

The first and last blocks TA[0,0]/TA[0,15] are normal table descriptors, pointing at the next 2 tables. One is for the low-side STRam address space, the other for the high-side/HW-register mapped space. The entire 1st table is marked cacheable, the entire 2nd table noncacheable. These 2 tables just break down the low end of the low space, and the high end of the high space once again, with final links to the 4th table TC[0] from the first and last entries.

It is the TC table that defines the 16MB STRam space in terms of 16 * 1MB pages. Note that the U bit indicates used/hit pages, and the M bit indicates modified pages - in familiar places.
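
The three-level split described above can be written out as bit arithmetic. This sketch infers 4-bit indices at each level from the block sizes in the notes (256MB, 16MB, 1MB) rather than from the actual TC register contents, so treat the field widths as an assumption; the function name is hypothetical.

```python
def split_address(address):
    """Split a 32-bit address into (TA index, TB index, TC index, page offset)."""
    ta = (address >> 28) & 0xF      # 16 blocks of 256MB
    tb = (address >> 24) & 0xF      # 16 blocks of 16MB
    tc = (address >> 20) & 0xF      # 16 pages of 1MB
    offset = address & 0xFFFFF      # byte offset within the 1MB page
    return ta, tb, tc, offset

print(split_address(0x00123456))
print(split_address(0xF0123456))
```

In the TOS tree most lookups stop early at an 'ET' entry, so the lower indices only matter on the paths that actually descend to a TB or TC table.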

As mentioned before SRP is disabled and filled with garbage. Function code matching and read/write matching is also disabled. Address limiting is disabled. Leaf pages are configured as 32k but are not actually used in the tables.

The TT0/TT1 transparent translation registers are strangely enabled despite the tree being populated and usable - and they are mapped to interesting places. They seem to map to Atari TT memory areas (VME and TT ram). Not sure of the reason for this. Maybe somebody else knows the answer.

Incidentally - I mapped my 'shadow' copy of STRam as a new TC[1] table, identical to TC[0] with the CI bit set. I linked it off TB[1,0] instead of the page terminator currently there. I built a new tree for the entire task rather than cloning the system one and editing it - mainly to be sure I was doing it all right (on the premise that if I got anything wrong, it's likely nothing would work).
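
The raw descriptor values in the listing that follows can be decoded mechanically. The sketch below follows the short-format (4-byte) descriptor layout as I read it from the 68030 user manual: DT in bits 1-0, WP in bit 2, U in bit 3, and for page descriptors M in bit 4 and CI in bit 6. The function names are hypothetical, and the layout is worth checking against the manual before relying on it.

```python
DT_NAMES = {0: "INVALID", 1: "PAGE", 2: "TABLE/SHORT", 3: "TABLE/LONG"}

def decode_short_descriptor(value):
    """Decode a 4-byte 68030 descriptor into its type, address and flag bits."""
    dt = value & 0x3
    info = {"type": DT_NAMES[dt], "U": bool(value & 0x08), "WP": bool(value & 0x04)}
    if dt == 1:
        # Page (or early termination) descriptor: address in bits 31-8.
        info["address"] = value & 0xFFFFFF00
        info["CI"] = bool(value & 0x40)
        info["M"] = bool(value & 0x10)
    elif dt in (2, 3):
        # Table descriptor: next-level table address in bits 31-4.
        info["address"] = value & 0xFFFFFFF0
    return info

def with_cache_inhibit(page_descriptor):
    """Same page descriptor with the CI bit set, as for a non-cacheable shadow."""
    return page_descriptor | 0x40

# Three values taken from the listing: a table link, a cacheable ET and a CI ET.
for raw in (0x0000074A, 0x10000001, 0x80000041):
    print(f"${raw:08x}", decode_short_descriptor(raw))
```

Setting CI on every entry of a copied page table, as `with_cache_inhibit` does for one entry, is the same operation as deriving TC[1] from TC[0] in the paragraph above.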

Code:

;	CRP				-> TA[0]	
;
;$700	TA[0,0]	($00000000->$0fffffff)	-> TB[0]	($0000074a) T/U/SHORT
;	TA[0,1]	($10000000->$1fffffff)	-> ET		($10000001) P	
;	TA[0,2]	($20000000->$2fffffff)	-> ET		($20000001) P
;	TA[0,3]	($30000000->$3fffffff)	-> ET		($30000001) P
;	TA[0,4]	($40000000->$4fffffff)	-> ET		($40000001) P
;	TA[0,5]	($50000000->$5fffffff)	-> ET		($50000001) P
;	TA[0,6]	($60000000->$6fffffff)	-> ET		($60000001) P
;	TA[0,7]	($70000000->$7fffffff)	-> ET		($70000001) P
;	TA[0,8]	($80000000->$8fffffff)	-> ET		($80000041) P/CI
;	TA[0,9]	($90000000->$9fffffff)	-> ET		($90000041) P/CI
;	TA[0,a]	($a0000000->$afffffff)	-> ET		($a0000041) P/CI
;	TA[0,b]	($b0000000->$bfffffff)	-> ET		($b0000041) P/CI
;	TA[0,c]	($c0000000->$cfffffff)	-> ET		($c0000041) P/CI
;	TA[0,d]	($d0000000->$dfffffff)	-> ET		($d0000041) P/CI
;	TA[0,e]	($e0000000->$efffffff)	-> ET		($e0000041) P/CI
;	TA[0,f]	($f0000000->$ffffffff)	-> TB[1]	($0000078a) T/U/SHORT
;
;
;$740	TB[0,0]	($00000000->$00ffffff)	-> TC[0]	($000007ca) T/U/SHORT
;	TB[0,1]	($01000000->$01ffffff)	-> ET		($01000001) P	
;	TB[0,2]	($02000000->$02ffffff)	-> ET		($02000001) P	
;	TB[0,3]	($03000000->$03ffffff)	-> ET		($03000001) P	
;	TB[0,4]	($04000000->$04ffffff)	-> ET		($04000001) P
;	TB[0,5]	($05000000->$05ffffff)	-> ET		($05000001) P
;	TB[0,6]	($06000000->$06ffffff)	-> ET		($06000001) P
;	TB[0,7]	($07000000->$07ffffff)	-> ET		($07000001) P
;	TB[0,8]	($08000000->$08ffffff)	-> ET		($08000001) P
;	TB[0,9]	($09000000->$09ffffff)	-> ET		($09000001) P
;	TB[0,a]	($0a000000->$0affffff)	-> ET		($0a000001) P
;	TB[0,b]	($0b000000->$0bffffff)	-> ET		($0b000001) P
;	TB[0,c]	($0c000000->$0cffffff)	-> ET		($0c000001) P
;	TB[0,d]	($0d000000->$0dffffff)	-> ET		($0d000001) P
;	TB[0,e]	($0e000000->$0effffff)	-> ET		($0e000001) P
;	TB[0,f]	($0f000000->$0fffffff)	-> ET		($0f000001) P
;
;$780	TB[1,0]	($f0000000->$f0ffffff)	-> ET		($f0000041) P/CI
;	TB[1,1]	($f1000000->$f1ffffff)	-> ET		($f1000041) P/CI
;	TB[1,2]	($f2000000->$f2ffffff)	-> ET		($f2000041) P/CI
;	TB[1,3]	($f3000000->$f3ffffff)	-> ET		($f3000041) P/CI
;	TB[1,4]	($f4000000->$f4ffffff)	-> ET		($f4000041) P/CI
;	TB[1,5]	($f5000000->$f5ffffff)	-> ET		($f5000041) P/CI
;	TB[1,6]	($f6000000->$f6ffffff)	-> ET		($f6000041) P/CI
;	TB[1,7]	($f7000000->$f7ffffff)	-> ET		($f7000041) P/CI
;	TB[1,8]	($f8000000->$f8ffffff)	-> ET		($f8000041) P/CI
;	TB[1,9]	($f9000000->$f9ffffff)	-> ET		($f9000041) P/CI
;	TB[1,a]	($fa000000->$faffffff)	-> ET		($fa000041) P/CI
;	TB[1,b]	($fb000000->$fbffffff)	-> ET		($fb000041) P/CI
;	TB[1,c]	($fc000000->$fcffffff)	-> ET		($fc000041) P/CI
;	TB[1,d]	($fd000000->$fdffffff)	-> ET		($fd000041) P/CI
;	TB[1,e]	($fe000000->$feffffff)	-> ET		($fe000041) P/CI
;	TB[1,f]	($ff000000->$ffffffff)	-> TC[0]	($000007ca) T/U/SHORT
;
;$7c0	TC[0,0]	($00000000->$000fffff)	-> ETP		($00000019) M/U/P
;	TC[0,1]	($00100000->$001fffff)	-> ETP		($00100019) M/U/P	
;	TC[0,2]	($00200000->$002fffff)	-> ETP		($00200019) M/U/P	
;	TC[0,3]	($00300000->$003fffff)	-> ETP		($00300019) M/U/P	(4mb ram)
;	TC[0,4]	($00400000->$004fffff)	-> ETP		($00400009) U/P
;	TC[0,5]	($00500000->$005fffff)	-> ETP		($00500009) U/P
;	TC[0,6]	($00600000->$006fffff)	-> ETP		($00600009) U/P
;	TC[0,7]	($00700000->$007fffff)	-> ETP		($00700009) U/P
;	TC[0,8]	($00800000->$008fffff)	-> ETP		($00800009) U/P
;	TC[0,9]	($00900000->$009fffff)	-> ETP		($00900009) U/P
;	TC[0,a]	($00a00000->$00afffff)	-> ETP		($00a00009) U/P
;	TC[0,b]	($00b00000->$00bfffff)	-> ETP		($00b00009) U/P
;	TC[0,c]	($00c00000->$00cfffff)	-> ETP		($00c00009) U/P
;	TC[0,d]	($00d00000->$00dfffff)	-> ETP		($00d00019) M/U/P	(14mb display ram?)
;	TC[0,e]	($00e00000->$00efffff)	-> ETP		($00e00009) U/P
;	TC[0,f]	($00f00000->$00ffffff)	-> ETP		($00f00059) M/U/P/CI	(HW regs)

;	TT0:		0xxx.xxx1:0111.1110:1000.00x1:0xxx.0111
;	$01000000-$01xxxxxx / TT ram?
;	enabled, FC ignored, RW ignored
;
;	TT1:		1xxx.xxx0:0111.1110:1000.01x1:0xxx.0111
;	$FE000000-$FExxxxxx / VME? 
;	enabled, FC ignored, RW ignored
;
;	TC:		1000.0000:1111.0000:0100.0100:0100.0101
;	enabled, SRE disabled, FCL disabled, PS=32k, IS=0, TIA=4bits, TIB=4bits, TIC=4bits, TID=5bits(unused)
;
;	CRP:		1000.0000:0000.0000:0000.0000:0000.0010
;			0000.0000:0000.0000:0000.0111:0000.0000
;	lowerlimit/limit==0: limiting off, short(4byte) descriptors	
;	tree root = $700
;
;	SRP:		xxxx.xxxx:xxxx.xxxx:xxxx.xxxx:xxxx.xxxx ...random nonsense!
;			xxxx.xxxx:xxxx.xxxx:xxxx.xxxx:xxxx.xxxx
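
The dump above can be exercised with a small table walk. This is a rough Python model, assuming the descriptor values exactly as listed and ignoring TT0/TT1, limit checks, function codes and U/M updates; the `TABLES` dictionary and `translate` are illustrative names, not part of any real tool.

```python
CI_BIT = 0x40

# Tables rebuilt from the dump, keyed by their address in memory.
TA = ([0x0000074A] + [(i << 28) | 0x01 for i in range(1, 8)]
      + [(i << 28) | 0x41 for i in range(8, 15)] + [0x0000078A])
TB0 = [0x000007CA] + [(i << 24) | 0x01 for i in range(1, 16)]
TB1 = [0xF0000000 | (i << 24) | 0x41 for i in range(15)] + [0x000007CA]
TC0 = [(i << 20) | 0x09 for i in range(16)]
for i in (0, 1, 2, 3, 13):
    TC0[i] |= 0x10            # pages shown with the M bit set
TC0[15] |= 0x10 | CI_BIT      # HW register page: M and CI
TABLES = {0x700: TA, 0x740: TB0, 0x780: TB1, 0x7C0: TC0}


def translate(logical, root=0x700):
    """Walk the 4-bit-per-level tree; return (physical address, cache_inhibit)."""
    table, shift = TABLES[root], 28
    while True:
        descriptor = table[(logical >> shift) & 0xF]
        dt = descriptor & 0x3
        if dt == 1:
            # Page descriptor (early termination above the last level):
            # the untranslated low bits pass straight through.
            offset_mask = (1 << shift) - 1
            page = descriptor & 0xFFFFFF00 & ~offset_mask
            return page | (logical & offset_mask), bool(descriptor & CI_BIT)
        if dt != 2:
            raise ValueError("invalid or long-format descriptor")
        table = TABLES[descriptor & 0xFFFFFFF0]   # follow the table pointer
        shift -= 4


if __name__ == "__main__":
    for addr in (0x00E01234, 0x00FF8240, 0xFFFF8240, 0x80001000):
        physical, ci = translate(addr)
        print(f"${addr:08x} -> ${physical:08x} CI={ci}")
```

Both $00FF8240 and $FFFF8240 end up on the same cache-inhibited physical page, since TB[0,0] and TB[1,f] share TC[0].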

dml · Post by **dml** » Thu Mar 07, 2013 9:47 am

Some initial notes:

- Data cache 'write-allocate' enabled is slightly faster than disabled, for CPU code using mixed read/write access
- Write-allocate doesn't seem to have the negative impact I expected when (locally) read-only or write-only buffers are left cacheable instead of cache-inhibited. This is perhaps because most of the contiguous memory work is in wall and floor filling, and only wall filling should be affected (although it isn't - still trying to figure out why)
- Cache-inhibiting specific items combined with write-allocate seems to give the best results, although the gain is really just a few percentage points.
- Turning the data cache off for contiguous filling is probably still faster than cache inhibiting the same memory with the data cache on. I haven't confirmed this yet but it's looking likely. This may be a Falcon bus <-> 68030 signal interfacing thing, or it may be a 68030 thing. I'll try to confirm it properly when I have time.
- Using the transparent translation (TTx) registers to map STRam seems to be faster than mapping it via page descriptors. Not that this is helpful - the result is unstable when reading textures from disk and probably some other unknown HW accesses. I configured this test at the end of the mainloop so it was only active on the 2nd and subsequent passes - the engine ran ok and measured a little faster.
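
For reference, the cache settings being compared in these notes come down to a few bits in the 68030 CACR. The Python sketch below only computes register values for the combinations mentioned; the helper name and its parameters are my own, it assumes the instruction cache and burst modes stay enabled throughout, and actually loading the value needs a supervisor-mode `movec` on the real machine.

```python
# 68030 CACR bit positions used here.
EI = 1 << 0     # enable instruction cache
IBE = 1 << 4    # instruction burst enable
ED = 1 << 8     # enable data cache
DBE = 1 << 12   # data burst enable
WA = 1 << 13    # write allocate


def cacr_value(data_cache=True, write_allocate=True):
    """Compose a CACR value; instruction cache and burst are assumed always on."""
    value = EI | IBE
    if data_cache:
        value |= ED | DBE
        if write_allocate:
            value |= WA   # WA only matters while the data cache is enabled
    return value


if __name__ == "__main__":
    print(f"data cache + write-allocate: ${cacr_value(True, True):04x}")
    print(f"data cache, no allocate:     ${cacr_value(True, False):04x}")
    print(f"data cache off:              ${cacr_value(False):04x}")
```

Per-buffer cache inhibiting is a separate mechanism: it is done through the CI bit in the page descriptors rather than through CACR.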

I'll be very busy for the next couple of weeks, so posting might be relatively sparse until later.
