DOOM on atari st

dml · Post by **dml** » Fri Oct 18, 2013 8:53 pm

Scarlettkitten wrote:Wow, I can't wait

Well it works now. It's taking time to optimize it all, but speed has definitely improved. The replaced code is about 50% faster. Average framerate rose to 8.6fps. c2p now dominates by a large margin on average frames. Not exactly sure what's going on in the worst frames yet.

28.23% c2pzoom_96_ccpairsq_dualfield
18.01% project_walls_68k
12.86% render_drawplanes_68k
11.98% raycast_world_68kv2
7.36% column_codegen_68k <- actually drawing walls
14.12% **render_columns** <- not actually drawing walls, but dispatching calls for that. needs replaced.
6.65% **scanconvert_visplanes** <- has been ok so far but beginning to stick out

There are still 2 areas of C to convert (bold), one medium-sized and one small. Will be interesting to see if replacing those makes as much of a difference.

I have made the code more friendly for native atari GCC2 so perhaps Eero will have some luck getting it to build and perhaps extract worst-frame profiles out of it.

DarkLord · Post by **DarkLord** » Sat Oct 19, 2013 5:07 am

Doug, you never fail to astound.

PS I can't wait to try a final version on my 40MHz '030 powered STacy.

PPS As well as the final Falcon version.

dma · Post by **dma** » Sat Oct 19, 2013 6:20 am

Amazing coding abilities, once again.
I can't imagine the latest engine evolutions running on my humble ST.

dml · Post by **dml** » Sat Oct 19, 2013 8:39 am

DarkLord wrote: PS I can't wait to try a final version on my 40MHz '030 powered STacy.

A few very nasty code patching tricks have been used to help 68000 which won't work on 030 but I'll make a compatible option available too.

DarkLord wrote: PPS As well as the final Falcon version.

Sidetracked as usual but it's been fun to see if the ST could do something Doomlike.

dml · Post by **dml** » Sat Oct 19, 2013 9:54 am

calimero wrote:
alexh wrote: and I just watch BadMood on 16MHz+DSP Falcon: I would say it is on a par with 40MHz machines (of course, BadMood is not a complete game but...)
and I would say that Amiga 40MHz is faster than a PC!!!
it would be nice to see a timedemo (or whatever it is called for Doom) from different machines!

Comparing these is a complicated mess of a subject. There are so many different factors involved it would be hard to make any real sense of the results.

You could probably draw some conclusions about the programs or the machines, by running the same program (e.g. DoomAttack) on the different machines (F030, A1200, 50MHz accelerators...), or running different programs (BM vs DoomAttack) on comparable machines (perhaps stock F030 vs stock A1200).

However, as it is we can't really do either.

Even the original 386 version is hard to compare - we do have the portable C code from the Linux version - probably based on the NeXT development version, but we don't have the *actual* optimized i386 code which was used in the commercial release. Also, those machines had the helpful feature of not sharing the video bus with the CPU, and therefore not starving the program of memory reads/writes in higher colour depths. They all had some kind of dedicated video card which 'displayed memory for free'.

Perhaps looking at DoomAttack on a non-accelerated A1200 would be the best place to start with a comparison. Or somebody modifying DoomAttack to run on F030. I expect the result would be really slow, simply because I've seen how the game code performs on F030 without modifications, and DoomAttack does still use the unmodified game code...

GokMasE · Post by **GokMasE** » Sat Oct 19, 2013 6:07 pm

With the speed of this engine moving ever closer to the 9 fps mark, there may be just enough juice left to power some petite game logic on top of things. Well, I sense that the demo alone is a good enough reason to dig the STe and the scart-cable out from the closet.

Btw, has this ST-adventure resulted in any additional ideas to test within Bad Mood?

dml · Post by **dml** » Sat Oct 19, 2013 8:49 pm

GokMasE wrote: With the speed of this engine moving ever closer to the 9 fps mark, there may be just enough juice left to power some petite game logic on top of things. Well, I sense that the demo alone is a good enough reason to dig the STe and the scart-cable out from the closet.

There should be enough CPU left for simple cell-based AI and collisions, or route/node based control to avoid collisions.

At 8fps it takes 6.25 vblanks to compose the frame. If you can keep the AI under a single vblank (many ST games have to work well within this budget) it won't make much visible difference to the speed.
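The vblank arithmetic above can be checked with a small sketch. It assumes a 50 Hz PAL display, which is what "8fps = 6.25 vblanks" implies; the function names are hypothetical and not taken from the engine.

```c
#include <assert.h>

/* Assumption: 50 Hz PAL refresh, which matches 8 fps = 6.25 vblanks. */
#define VBLANK_HZ 50.0

/* Number of vblank periods needed to compose one frame at a given rate. */
double vblanks_per_frame(double fps)
{
    assert(fps > 0.0);
    return VBLANK_HZ / fps;
}

/* Frame rate after adding a fixed per-frame cost measured in vblanks. */
double fps_with_extra_vblanks(double fps, double extra_vblanks)
{
    assert(extra_vblanks >= 0.0);
    return VBLANK_HZ / (vblanks_per_frame(fps) + extra_vblanks);
}
```

With a full vblank spent on AI every frame, the same arithmetic gives 50 / 7.25, roughly 6.9 fps, so that is the worst case for a one-vblank budget. AI that finishes well inside the vblank costs proportionally less.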

What might be more difficult is getting sprites in there without affecting speed. Avoiding more than one entity per map cell is probably a good idea. I think Wolf works with that constraint most of the time and DM did (although it depends on whether you want to treat 'pickups' as entities).

GokMasE wrote:Btw, has this ST-adventure resulted in any additional ideas to test within Bad Mood?

Yes there are a few things about the old floor renderer which I think merit a rewrite now. But it would be better to get a release out first and then mess with details like that afterwards.

I had some other ideas though (which aren't BM related), for an engine on the Falcon. Not sure if/when I'll get around to it but there's still plenty of stuff to try.

DarkLord · Post by **DarkLord** » Sat Oct 19, 2013 9:14 pm

dml wrote: A few very nasty code patching tricks have been used to help 68000 which won't work on 030 but I'll make a compatible option available too.

Thanks for that! I'm sure other (accelerated) Atari users will appreciate it as well.

dml wrote:

Sidetracked as usual but it's been fun to see if the ST could do something Doomlike.

Absolutely, I'm all for an ST version as well as the Falcon version.

Thanks so much for all your hard work Doug!

dml · Post by **dml** » Sun Oct 20, 2013 1:37 pm

Bit of an update on this today.

The decision to break the code up into several smaller, simpler passes with buffering of small packets in between each pass turned out to be a good one - it depends on the packets being as small and simple as possible but it does work well. e.g. the raycaster generates a small packet for every strike containing only x/z contact point and previous/current sector table index from the map. This is compact and enough to feed the next stage. Each routine digests small packets, does a lot of computation and produces small packets for the next stage.
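As a rough illustration of the packet idea, here is a minimal sketch in C. The field layout, the `SpanPacket` type, the `project_pass` name and the placeholder projection are all assumptions made for the example; the post only states that a strike packet carries the x/z contact point and the previous/current sector table indices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the post only says a strike packet holds the x/z
   contact point plus previous/current sector table indices. */
typedef struct {
    int16_t x, z;          /* contact point in map units */
    uint8_t prev_sector;   /* sector the ray was in before the strike */
    uint8_t cur_sector;    /* sector the ray entered */
} StrikePacket;

/* Hypothetical output of the next stage: one projected span per strike. */
typedef struct {
    int16_t height;        /* projected wall height in pixels */
    uint8_t sector;
} SpanPacket;

/* Second pass: digest strike packets, emit span packets. The projection
   here is a placeholder (scale / depth), not the engine's real maths. */
size_t project_pass(const StrikePacket *in, size_t count,
                    SpanPacket *out, int16_t scale)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < count; i++) {
        if (in[i].z <= 0)          /* not in front of the viewer: drop it */
            continue;
        out[n].height = (int16_t)(scale / in[i].z);
        out[n].sector = in[i].cur_sector;
        n++;
    }
    return n;
}
```

Each pass reads one small array and writes another, so the inner loop only needs a handful of live values at a time, which is the property the post credits for the register-friendly 68k versions.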

It might in fact be faster this way than converting the whole thing into a surface-order renderer as I considered previously. The resulting code for each pass can be implemented in 68k using registers very efficiently. There are far fewer memory accesses now per display column even with the extra packets - most of the computation is CPU-register-only with a few exceptions. This was the main gain over the original C code (doesn't matter how good the compiler is - it leaves you blind to CPU resources consumed by the program - unless trial-and-error with disassembled C is something you enjoy!)

So the three largest C functions (raycasting, surface projection/clipping, floor drawing) have now been pretty well optimized and together make up about 44% of all CPU time used. There are a handful of optimizations left which can still be applied but for small gains.

Drawing wall columns is still hard to measure but it's somewhere around 7-11%. All of that stuff has been optimized already.

That means around 55% of the total load can't be easily reduced more without dropping/losing something along the way. Unless I've missed something obvious.

The 6bit/dual-field c2p routine takes about 30% of total time. The single-field dither method costs about half that.

So what I'd call 'rigid' costs account for 85% of total time in dual-field mode. Any further optimization has to take place in the remaining 15%. I'm guessing that can be roughly cut in half so it's possible to stab a guess at the final framerate at this window size.

It currently runs at 8.7fps at 96x60 pixels (plus x2 zoom) in dual-field mode. That means the final framerate could be somewhere around 9.0 - 9.3fps with the same config.
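The estimate above can be reproduced with a one-line model: if a fraction of the frame time is removed, the frame rate rises by the inverse of what remains. This is only a sketch under that assumption, and `fps_after_saving` is a hypothetical helper name.

```c
#include <assert.h>

/* If a fraction `saved` of the current frame time is removed, the frame
   time shrinks to (1 - saved) of its old value and fps rises by the inverse. */
double fps_after_saving(double fps, double saved)
{
    assert(fps > 0.0 && saved >= 0.0 && saved < 1.0);
    return fps / (1.0 - saved);
}
```

Halving the non-rigid 15% means saving 7.5% of the frame, and `fps_after_saving(8.7, 0.075)` comes out at about 9.4fps. That sits just above the 9.0 - 9.3 range quoted, so the quoted figure reads as the more cautious estimate.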

I can't currently change the window size because some stuff has been hardwired in the assembly but when I fix those things some tests can be done at a bigger resolution.

I'll start looking at the last 2 unoptimized functions next to see what can be done with those.

dml · Post by **dml** » Sun Oct 20, 2013 1:44 pm

A quick test with the single-field dither c2p yields 9.77fps, so the final version in that mode could average close to 11fps at this window size.

mc6809e · Post by **mc6809e** » Sun Oct 20, 2013 4:47 pm

dml wrote:Bit of an update on this today.

The decision to break the code up into several smaller, simpler passes with buffering of small packets in between each pass turned out to be a good one - it depends on the packets being as small and simple as possible but it does work well. e.g. the raycaster generates a small packet for every strike containing only x/z contact point and previous/current sector table index from the map. This is compact and enough to feed the next stage. Each routine digests small packets, does a lot of computation and produces small packets for the next stage.

It might in fact be faster this way than converting the whole thing into a surface-order renderer as I considered previously. The resulting code for each pass can be implemented in 68k using registers very efficiently.

Awesome!

I know it's a bit off-topic but do you think the C language held back the 68K in the processor wars? I always felt that C favored processors with a small register set. Often many of the 68K's registers would be idle in compiled code.

dml · Post by **dml** » Sun Oct 20, 2013 5:27 pm

mc6809e wrote: I know it's a bit off-topic but do you think the C language held back the 68K in the processor wars? I always felt that C favored processors with a small register set. Often many of the 68K's registers would be idle in compiled code.

It's an interesting question - there is probably something in it. The gains which can be had from handcoding vs compiling on 68k are quite clear.

Both register count and instruction complexity can cause problems for code generation. It's much easier to do consistently with RISC, being more fine-grained. Register count *shouldn't* really be a problem if the compiler generates a decent data flow graph through its IL and the backend is doing things sensibly with that. But in the case of GCC and 68k, that might not be true. The backend is now much older than the compiler and probably full of heuristics and rules-of-thumb (which don't scale well).

Looking at the code produced by GCC4, it does a fair job and uses most of the registers most of the time, but it does really bizarre things on a regular basis. It also uses 'un-optimizations' which are both more complex and slower than a simpler solution. I expect a lot of that has to do with type range safety, which is always pretty lax in handwritten asm programs because the coder tends to know how many bits are active at any point in the code, and where overflows matter. If you index an array in C using an unsigned short, it looks innocent but causes the compiler grief since there is no such addressing mode - you get various kinds of mess generated to cope with those near-invisible foibles. Mul/div arithmetic has even worse problems in this area. I get caught out with that all the time and have to keep checking it.

Some of the codegen quality issues are probably due to the slow, organic way compiler technology grew from the early days and the regular change of worldview on 'ideal hardware'. On the one hand compilers were trying to make optimizations based on 'knowledge' of existing chips, while chip makers were doing the opposite - looking at the most frequently used ops and doing trades in silicon to speed up the most common ones emitted by compilers, at the cost of others. So it got especially messy during the 68020/30/40 and 386/486 era.

Another problem with 68k platform code generation is the large variety of hardware and notions about memory speeds etc. Some of that makes a mockery of optimizations in the compiler. Sticking a 68030 on a 16bit bus (or halving the clock rate of the bus) causes extra problems in that area. The compiler tries to optimize based on what it expects and ends up doing the opposite.

While there are plenty of problems with performance of compiled code when looking closely at it, I think the most significant one is the fact that high level languages don't translate structurally to optimal assembly programs, because there is no sense of resource competition. You could make one tiny tweak to a C program and suddenly lose 20% of your performance. Without studying the disassembly the reason might be out of reach. There isn't really a solution to that - they are very different ways to do something.

Having said all that - the compiler (even on 68k) can generate better code globally than anyone by hand. i.e. it doesn't get bored. A coder can focus better on obvious, intensive areas but if the program is just big and sprawling and there are no really obvious targets to optimize, the compiler can pretty much always do a better job just through automation and scale.

mc6809e · Post by **mc6809e** » Sun Oct 20, 2013 8:06 pm

dml wrote: While there are plenty of problems with performance of compiled code when looking closely at it, I think the most significant one is the fact that high level languages don't translate structurally to optimal assembly programs, because there is no sense of resource competition.

Yeah, they seem to focus on maximizing individual resources, but don't have a big-picture view of things where one resource must be balanced against others, register use versus memory accesses, for example. I've seen code that goes out of its way to keep registers available by constantly writing results back to the stack. This might have been correct for x86 but it's very wasteful of memory bandwidth on the 68K.

Some of that might be due to C's handling of function calls. Inlining can help, of course.

The other problem with C is memory aliasing. Turning every array access into the dereferencing of a pointer means the compiler can't be sure accesses to two different arrays don't access the same memory address. This again hurts the 68K with its large number of address registers.
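A minimal sketch of the aliasing problem, using the C99 `restrict` qualifier as one way to tell the compiler that arrays do not overlap. The function names and the lookup-table remap are made up for illustration, and whether a particular 68k compiler takes advantage of the hint is not something this example demonstrates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Without restrict, a store to dst[i] might change lut[] or src[], so the
   compiler has to assume their contents may differ after every store. */
void remap_plain(uint8_t *dst, const uint8_t *src, const uint8_t *lut, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = lut[src[i]];
}

/* With C99 restrict the caller promises the three arrays do not overlap,
   which leaves the compiler free to keep values and pointers in registers. */
void remap_restrict(uint8_t *restrict dst, const uint8_t *restrict src,
                    const uint8_t *restrict lut, size_t n)
{
    assert(dst != NULL && src != NULL && lut != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = lut[src[i]];
}
```

The two versions compute the same result when the arrays are distinct. Calling the `restrict` version with overlapping arrays is undefined behaviour, which is the price of the promise.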

dml · Post by **dml** » Sun Oct 20, 2013 9:01 pm

mc6809e wrote: Yeah, they seem to focus on maximizing individual resources, but don't have a big-picture view of things where one resource must be balanced against others, register use versus memory accesses, for example. I've seen code that goes out of its way to keep registers available by constantly writing results back to the stack. This might have been correct for x86 but it's very wasteful of memory bandwidth on the 68K.

Yes I can see this behaviour while browsing the disassembly even now.

Another weakness in the 'big picture view' is managing CPU caches. The compilers are terrible at balancing work for small instruction caches like those on the 020/30. Life gets easier with huge caches - the backends can just assume best case all the time.

mc6809e wrote: Some of that might be due to C's handling of function calls. Inlining can help, of course.
The other problem with C is memory aliasing. Turning every array access into the dereferencing of a pointer means the compiler can't be sure accesses to two different arrays don't access the same memory address. This again hurts the 68K with its large number of address registers.

I often resort to a strange flavour of minimal C++ and inline asm for tight work on small processors. References and templates can really help with aliasing and macro-like code specialization/expansion if used with care. It's possible to implement pointer behaviour through references too without the aliasing problems - but it doesn't make for easy reading and improved results are not guaranteed, just more likely in some cases.

However there are still the other problems (resource competition, type range safety) which are difficult to avoid without going to assembly language.

GokMasE · Post by **GokMasE** » Mon Oct 21, 2013 3:39 pm

dml wrote: What might be more difficult is getting sprites in there without affecting speed. Avoiding more than one entity per map cell is probably a good idea. I think Wolf works with that constraint most of the time and DM did (although it depends on whether you want to treat 'pickups' as entities).

I can understand the sprite area to be quite a challenge when speed is concerned alright. But what about sprite colours?

Considering the amount of pseudo colours used for shading and lighting, maybe it will also be tricky to fit sprite colours with nice contrast to background textures within the colour palette?

dml · Post by **dml** » Mon Oct 21, 2013 4:34 pm

GokMasE wrote: I can understand the sprite area to be quite a challenge when speed is concerned alright. But what about sprite colours?

Considering the amount of pseudo colours used for shading and lighting, maybe it will also be tricky to fit sprite colours with nice contrast to background textures within the colour palette?

It is definitely quite difficult to figure out suitable colour groups for textures and other content. In a sense one part is completely automatic - the source textures are truecolour and the CC palette is generated by a tool. But relying on that alone without any experience with it can give you pretty bad results. The input textures need to be relatively well chosen in the first place to yield complementary pairing colours (this is difficult to explain and even more difficult to correct when it doesn't work well).

There are lots of tricks which can be used to help select and prepare textures to work well together. One of them is to generate a synthetic superpalette in a graphics package which already reserves several shades of several reusable colours. Like the kind of thing you'd do for a 256 colour game. Remap the textures to this superpalette and then crunch those through the CC tool to get the 4bit colour pairings.

Another way is to literally colour-reduce all the source textures into a limited superpalette (limited = somewhere between 64 and 256 colours). If the superpalette looks bad, the CC palette will look worse, so you have a reference to use as feedback to select better textures or adjust existing ones.

Other tricks involve desaturating the input colours so pairing is easier (pale blue and pale brown can be interlaced to produce pseudo-grey but opportunities dry up as the colours get more saturated).
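The desaturation point can be sketched numerically. The example assumes that two interlaced colours read as roughly their per-channel average, which is a simplification, and every RGB value in the usage note is invented for illustration rather than taken from the CC tool.

```c
#include <assert.h>

typedef struct { int r, g, b; } Rgb;   /* 0..255 per channel, for illustration */

/* Assumption: two colours alternated quickly read as their average. */
Rgb blend_pair(Rgb a, Rgb b)
{
    Rgb m = { (a.r + b.r) / 2, (a.g + b.g) / 2, (a.b + b.b) / 2 };
    return m;
}

/* Spread between the largest and smallest channel: 0 means neutral grey. */
int chroma_spread(Rgb c)
{
    int hi = c.r > c.g ? c.r : c.g;
    int lo = c.r < c.g ? c.r : c.g;
    if (c.b > hi) hi = c.b;
    if (c.b < lo) lo = c.b;
    assert(hi >= lo);
    return hi - lo;
}
```

With a hypothetical pale blue (160,176,208) and pale brown (208,192,160), the blend is (184,184,184), a neutral grey with zero spread. A hypothetical saturated pair such as (0,64,255) and (150,75,0) blends to (75,69,127), which keeps a strong blue cast, matching the remark that opportunities dry up as saturation rises.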

I've been playing with some other methods as well but nothing beats just trying stuff out and seeing what happens, and responding to that in true hacking fashion.

Eero Tamminen · Post by **Eero Tamminen** » Mon Oct 21, 2013 9:42 pm

Maybe doing something like this:
http://en.wikipedia.org/wiki/The_Typing ... d#Gameplay

With monochromatic monsters? Some color variation could be done with a very slight single color cycling effect. Cycling between different colors would be garish, but maybe cycling between small variations of gray and a hint of some color would give the monster a bit of life. Large monsters would save drawing the level in the background.

Player movement through the level could be fully automated and monsters stationary if the player moves "helplessly" towards them (like in the current demo) and input would be only through the keyboard. Music / sounds should be something that evokes the feeling of a dreadful dream where you're drawn unwillingly to somewhere "where there be monsters"...

Eero Tamminen · Post by **Eero Tamminen** » Mon Oct 21, 2013 9:49 pm

Maybe the names of the "monsters" could be names of famous Atari demos? A bit like greetings, but people would need to type them to "survive".

Text could be 1-plane blitted to the area not covered by the level.

dml · Post by **dml** » Mon Oct 21, 2013 10:31 pm

Eero Tamminen wrote:Maybe the names of the "monsters" could be names of famous Atari demos? A bit like greetings, but people would need to type them to "survive".

Text could be 1-plane blitted to the area not covered by the level.

With chunky pixels you can do some different (non-planar) tricks reasonably quickly inside the rendered window. Transparent text overlays are efficient if using generated code.

It would be faster to blit single bitplanes but more trouble overall - depending on specific palette entries is difficult since the CC palette tool is controlling those. It means using rasters and keeping text somewhere specific, or fixing the palette and never regenerating it when new textures are introduced. Or drawing the text using a bitplane-based CC-aware routine.

alexh · Post by **alexh** » Tue Oct 22, 2013 7:22 am

If you were going to make a demo, then the one that lends itself to this without any more work would surely be a "Hall of Fame"? Where the artwork would be textures on the walls? Something similar to the Dungeon Master "Hall of Champions". Add some classic Atari Demo group logos?

dml · Post by **dml** » Tue Oct 22, 2013 8:50 am

alexh wrote: If you were going to make a demo, then the one that lends itself to this without any more work would surely be a "Hall of Fame"? Where the artwork would be textures on the walls? Something similar to the Dungeon Master "Hall of Champions". Add some classic Atari Demo group logos?

I think it's going in that direction. Got bored with optimizing and started working on some audio last night. Will try to pull a demo together as the 'finished' version.

CiH · Post by **CiH** » Tue Oct 22, 2013 5:58 pm

dml wrote: I think it's going in that direction. Got bored with optimizing and started working on some audio last night. Will try to pull a demo together as the 'finished' version.

If you wanted to do something a little bit 'demoish', there's always this event to aim for.

http://www.sillyventure.eu/

dml · Post by **dml** » Tue Oct 22, 2013 6:57 pm

CiH wrote:
dml wrote: I think it's going in that direction. Got bored with optimizing and started working on some audio last night. Will try to pull a demo together as the 'finished' version.
If you wanted to do something a little bit 'demoish', there's always this event to aim for.

http://www.sillyventure.eu/

Yeah, I'm trying not to look at the dates in case it turns into a race.

Maybe something decent can be done over the next few days but we'll have to see.

CiH · Post by **CiH** » Tue Oct 22, 2013 7:14 pm

dml wrote: Yeah, I'm trying not to look at the dates in case it turns into a race. Maybe something decent can be done over the next few days but we'll have to see.

On the other hand, a deadline can't half concentrate minds!

GokMasE · Post by **GokMasE** » Wed Oct 23, 2013 1:18 pm

Out of curiosity, how many FPS could the latest test binary manage on average?
(Unless I am mistaken that ought to be the first one with alternating roof height, and wall pieces bobbing up and down.)

I figure it could be interesting to get a rough idea of what to expect from the upcoming one
