Since some people asked, I'll try to explain how I coded Beats of Rage.
The whole init sequence comes from DHS's system routines (I adapted their code for my own purposes).
I run the game in 320x240 True Color (this works on both VGA and RGB monitors).
I also use a virtual screen around the physical screen (128 pixels on the left, 128 on the right, 20 at the top and 20 at the bottom).
This takes more memory (for the 2 screens), but it means no clipping is needed during the game (which is essential for the hardcoded sprites).
I also enable the instruction cache of the 68030.
The main game code:
-------------------
- I don't clear the screen (I redraw the whole background every frame)
- The whole background is drawn with dedicated movem.l functions
- All the movem.l loops must fit in the instruction cache
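Example:
; Draw a picture 100 pixels wide (200 bytes per line)
; a6        = address of the source picture (set by the caller)
; bg_y_pos  = byte offset of the first destination line on the screen
; bg_height = height of the picture minus 1 (subq/bpl loops bg_height+1 times)
; bg_offset = bytes to skip in the source after each 200-byte line
; d0-d7/a0-a4 are all used by movem, so the line counter stays in memory
draw_picture_100:
        movea.l screen_adr,a5
        add.l   bg_y_pos,a5             ; a5 = first destination line
.loop:  movem.l (a6)+,d0-d7/a0-a4       ; 13 longs = 26 pixels
        movem.l d0-d7/a0-a4,(a5)
        movem.l (a6)+,d0-d7/a0-a4
        movem.l d0-d7/a0-a4,52(a5)
        movem.l (a6)+,d0-d7/a0-a4
        movem.l d0-d7/a0-a4,52*2(a5)
        movem.l (a6)+,d0-d7/a0-a2       ; last 11 longs = 22 pixels
        movem.l d0-d7/a0-a2,52*3(a5)
        lea.l   _SCREEN_WIDTH(a5),a5    ; next screen line (line length in bytes)
        add.l   bg_offset,a6            ; next source line
        subq.w  #1,bg_height
        bpl.s   .loop
        rts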
I have the same kind of function for draw_picture_120, draw_picture_128 and draw_picture_200.
All the drawing is done with longs rather than words (on the 68030, one long access is faster than two word accesses).
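The arithmetic behind draw_picture_100 (100 pixels = 200 bytes = 50 longs = 13+13+13+11 movem.l registers) carries over to the other widths. As a minimal C sketch, assuming 13 free registers per movem.l as in the routine above, a generator could compute the transfers like this. split_line and reg_list are hypothetical names, not the author's code.

```c
#include <stdio.h>

/* Hypothetical helpers. Assumes 13 registers (d0-d7/a0-a4) per movem.l,
   as in draw_picture_100, with a5/a6 as destination/source pointers and
   2 bytes per true colour pixel. */
#define MOVEM_REGS 13

/* Splits one line of width_px pixels into movem.l transfers of at most
   MOVEM_REGS longs. Stores the register count of each transfer in regs[]
   and returns the number of transfers, or -1 if the width is odd or regs[]
   is too small. Only the last transfer can be shorter, so transfer k is
   written at offset k * 52 bytes. */
int split_line(int width_px, int regs[], int max_chunks)
{
    if (width_px % 2 != 0)
        return -1;
    int longs = width_px / 2;              /* 2 pixels per long */
    int n = 0;
    while (longs > 0) {
        if (n == max_chunks)
            return -1;
        int take = longs < MOVEM_REGS ? longs : MOVEM_REGS;
        regs[n++] = take;
        longs -= take;
    }
    return n;
}

/* Writes the movem.l register list for n registers (1..13) into buf and
   returns its length, as snprintf does. */
int reg_list(int n, char *buf, size_t size)
{
    if (n == 1)
        return snprintf(buf, size, "d0");
    if (n <= 8)
        return snprintf(buf, size, "d0-d%d", n - 1);
    if (n == 9)
        return snprintf(buf, size, "d0-d7/a0");
    return snprintf(buf, size, "d0-d7/a0-a%d", n - 9);
}
```

For a width of 100 this gives 13, 13, 13 and 11 registers (d0-d7/a0-a4 three times, then d0-d7/a0-a2), matching the four movem.l pairs at offsets 0, 52, 104 and 156.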
The engine:
-----------
- In the engine, everything is optimized so that only the needed code runs, and only when necessary.
  For example:
  - the score digits to display are recomputed only when the score changes, not every frame
  - I change a player's movement animation only when the current animation has finished, ...
- I avoid as many tests as possible
The sprite technique:
---------------------
I spent 5 months on this part.

The main idea is to hardcode the sprites, for 2 reasons:
- to get the maximum speed
- to save as much memory as possible
An enemy is made of nearly 80 different sprites (sometimes more), each of which can be up to 140x110 pixels.
Every sprite has a left and a right version depending on its direction (~160 images).
This quickly fills the Falcon's 4 MB of RAM.
Looking at the 68030 cycle tables, the fastest instruction to display 2 pixels is: move.l d0,(a6)+
where d0 (or any of d0-d7/a0-a5) holds the value of the 2 pixels and a6 points to the screen position.
I use one move.l because it is faster than 2 move.w.
Of course, if I need to display only one pixel, I use a move.w.
For transparency, I use an add.w #immediate,a6 whose value covers the transparent pixels up to the end of the line, plus the virtual screen margin to reach the next line, plus the position of the first pixel to display after the transparent area on the next line (or just the transparent gap, when it is inside the same line).
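To make that add.w value concrete, here is a small C sketch assuming the screen layout given at the top (128 + 320 + 128 pixels per line, 2 bytes per pixel, so 1152 bytes per line). skip_bytes is a hypothetical name; it covers both the in-line gap and the jump to a following line.

```c
/* Hypothetical constants matching the layout described above: 128 + 320
   + 128 pixels per screen line, 2 bytes per true colour pixel. */
#define BYTES_PER_PIXEL 2L
#define LINE_BYTES ((128L + 320L + 128L) * BYTES_PER_PIXEL)   /* 1152 */

/* Value to add to a6 to go from end_x (the column just after the last
   pixel written) to next_x, lines_down screen lines further down.
   Columns are relative to the sprite's left edge. The result must fit in
   the signed 16-bit immediate of add.w; values 1..8 can use addq.w. */
long skip_bytes(int end_x, int next_x, int lines_down)
{
    return lines_down * LINE_BYTES + (long)(next_x - end_x) * BYTES_PER_PIXEL;
}
```

skip_bytes(3, 6, 0) gives 6 (the addq.w #6 for three transparent pixels), and skip_bytes(22, 0, 1) gives 1152 - 44 = $454, which shows how a value like add.w #$0454,a6 can arise under this layout.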
The problem here is that my sprites use more than 15 colors, so they can't all be kept in registers.
So I wrote a C program under Linux that does the following:
- keeps the value of min_cycles (the lowest cycle count found so far) in memory
- counts all the different colors (as longs, i.e. #$f79cf79c counts as one color and #$f79cbdd4 as another)
- sorts them from the most frequent to the least frequent
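The counting and sorting steps can be sketched in C as follows. color_count and count_long_colors are hypothetical names, and the linear search is a simplification that is adequate for sprite-sized inputs.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t value;   /* two 16-bit pixels packed as one long */
    int count;        /* number of move.l that would write this value */
} color_count;

/* Most frequent first; ties broken by value so the order is deterministic. */
static int by_count_desc(const void *a, const void *b)
{
    const color_count *x = a, *y = b;
    if (x->count != y->count)
        return y->count - x->count;
    return (x->value > y->value) - (x->value < y->value);
}

/* Counts the distinct long values in longs[0..n-1] into out[] (capacity
   max_out) and sorts them from most to least frequent. Returns the number
   of distinct colors, or -1 if out[] is too small. */
int count_long_colors(const uint32_t *longs, int n, color_count *out, int max_out)
{
    int k = 0;
    for (int i = 0; i < n; i++) {
        int j = 0;
        while (j < k && out[j].value != longs[i])
            j++;
        if (j == k) {                      /* first time we see this color */
            if (k == max_out)
                return -1;
            out[k].value = longs[i];
            out[k].count = 0;
            k++;
        }
        out[j].count++;
    }
    qsort(out, (size_t)k, sizeof *out, by_count_desc);
    return k;
}
```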
The most important part is here: I compute the mathematical variance of the positions of every possible group of occurrences of the same color, to find the groups of colors that are used close to each other:
- loop var from 600 to 8000
    loop for each color
      loop from the max number of occurrences of this color down to 2
        compute the variance of this group
        if the computed variance is lower than var, keep it in a table
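The inner part of that loop can be sketched in C. This is a simplified illustration, not the author's program: it assumes a color's positions are the indices of its long writes in drawing order, that a "group" is a run of consecutive occurrences, and it returns only the largest group under the threshold instead of filling a table.

```c
/* Population variance of pos[start .. start+k-1]. */
double group_variance(const int *pos, int start, int k)
{
    double mean = 0.0, var = 0.0;
    for (int i = start; i < start + k; i++)
        mean += pos[i];
    mean /= k;
    for (int i = start; i < start + k; i++) {
        double d = pos[i] - mean;
        var += d * d;
    }
    return var / k;
}

/* For one color whose occurrences are at pos[0..n-1] (in increasing
   order), tries group sizes from n down to 2 and returns the size of the
   largest group of consecutive occurrences whose variance is below
   threshold, storing its first index in *best_start; -1 if none. */
int largest_tight_group(const int *pos, int n, double threshold, int *best_start)
{
    for (int k = n; k >= 2; k--)
        for (int s = 0; s + k <= n; s++)
            if (group_variance(pos, s, k) < threshold) {
                *best_start = s;
                return k;
            }
    return -1;
}
```

With occurrences at positions 0, 1, 2, 3 and 70 (a color used at the start and once much later), all five together have a variance of about 752, while the first four alone have 1.25, so a threshold of 600 keeps only the early group.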
Once the table is filled with variances, I dispatch the packets of similar colors into the registers and compute the number of cycles (move.w and move.l) used.
(There is no need to count the add cycles, as they are the same in every iteration.)
Then I compare this value with min_cycles and, if it is lower, I update min_cycles and keep the variance value used for this computation.
Then I run everything again with a new variance.
At the end of the algorithm, I regenerate the whole sprite with the best variance (i.e. the lowest number of cycles, which is also the lowest number of bytes), and I convert it to assembly (ending with an RTS).
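The outer search (loop on var, keep min_cycles) has the shape below. The expensive part, dispatching the groups into registers and counting move.l/move.w cycles, is hidden behind a hypothetical callback, and demo_cost is only a stand-in so the sketch runs. The loop step is not given in the post, so it is a parameter here.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical callback: for one variance threshold, dispatch the color
   groups into registers and return the move.l/move.w cycle count. */
typedef long (*cost_fn)(int threshold, void *ctx);

/* Tries every threshold lo, lo+step, ... up to hi (step > 0) and returns
   the one with the lowest cost; that cost is stored in *min_cycles. */
int best_threshold(cost_fn cost, void *ctx, int lo, int hi, int step,
                   long *min_cycles)
{
    int best = lo;
    *min_cycles = LONG_MAX;
    for (int var = lo; var <= hi; var += step) {
        long cycles = cost(var, ctx);
        if (cycles < *min_cycles) {        /* strictly better: remember it */
            *min_cycles = cycles;
            best = var;
        }
    }
    return best;
}

/* Stand-in cost for demonstration only: pretends 3000 is the optimum. */
long demo_cost(int threshold, void *ctx)
{
    (void)ctx;
    long d = threshold - 3000L;
    return (d < 0 ? -d : d) + 100;
}
```

best_threshold(demo_cost, NULL, 600, 8000, 100, &mc) returns 3000 with mc set to 100; a real cost function would then be called once more with that threshold to produce the final sprite.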
Then I just have to assemble the code for this sprite.
In the game, I just load a6 with the correct address (working screen + sprite position) and call the sprite with BSR addr_sprite, which is really easy to use.
An example to clarify my explanation a little:
----------
Let's say the sprite looks like this:
abc
abc   de  ff fg
abbb ee fghi
abbbb ee aeeebd
a, b, c ... are word colors (i.e. 1 pixel in the Falcon's True Color mode)
a can be any value (for example $114d)
b can be any value (for example $774f)
...
The first pixel to display is a, then b, then c; after that there is only transparency until the next line.
On the second line, the first pixel to display is again a, then b, then c; then come three transparent pixels, then d, then e, then two transparent pixels again, ...
At the end of the third line, I display the pixels f, g, h, i; then there is some transparency up to the end of this line, then more transparency at the beginning of the fourth line, and then I display abbbb ...
The algorithm:
First, I parse the image to compute the adds (for transparency) and the word and long "colors", and I build a table of the long colors:
color 1 is: ab
color 2 is: de (the ab at the beginning of the line is already the second occurrence of color 1)
color 3 is: ff
color 4 is: fg
color 5 is: bb (the ab at the beginning of the line is already the third occurrence of color 1)
color 6 is: ee
...
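A C sketch of this parsing step reproduces the table above. It uses a hypothetical text format where each letter is a pixel and a space is transparent, and it assumes each visible run is cut into pairs from its first pixel, which is what the example implies (abc gives ab plus a move.w for c).

```c
/* Hypothetical text format, for this example only: one string per sprite
   line, one letter per pixel, ' ' for a transparent pixel. Each run of
   visible pixels is cut into pairs from its first pixel (one move.l each);
   a leftover single pixel becomes a move.w and is not put in the table.
   Colors are numbered in order of first appearance. Returns the number of
   long colors, or -1 if table[] is too small. */
int build_long_color_table(const char *rows[], int n_rows,
                           char table[][3], int max_colors)
{
    int n = 0;
    for (int r = 0; r < n_rows; r++) {
        const char *p = rows[r];
        while (*p != '\0') {
            if (*p == ' ') {            /* transparent pixel: handled by an add */
                p++;
                continue;
            }
            /* start of a visible run: take it two pixels at a time */
            while (p[0] != '\0' && p[0] != ' ' && p[1] != '\0' && p[1] != ' ') {
                int j = 0;
                while (j < n && !(table[j][0] == p[0] && table[j][1] == p[1]))
                    j++;
                if (j == n) {           /* first occurrence of this long color */
                    if (n == max_colors)
                        return -1;
                    table[n][0] = p[0];
                    table[n][1] = p[1];
                    table[n][2] = '\0';
                    n++;
                }
                p += 2;
            }
            if (*p != '\0' && *p != ' ')
                p++;                    /* odd pixel: move.w, not a long color */
        }
    }
    return n;
}
```

On the four example lines this finds 9 long colors: ab, de, ff, fg, bb, ee, then hi, ae and bd.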
The final code will be something like:
move.l #color_1,(a6)+
move.w #c,(a6)+
add.w #(offset_to_next_line - current_position + transparent_area_offset_in_next_line),a6
move.l #color_1,(a6)+
move.w #c,(a6)+
addq.w #6,a6 ;(3 transparent pixels)
move.l #color_2,(a6)+
addq.w #4,a6 ;(2 transparent pixels)
move.l #color_3,(a6)+
...
...
rts
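A listing like this can be generated mechanically. The following C sketch is not the author's generator: it assumes 0x0000 as the transparent key color and the 1152-byte line from the screen layout described at the top, and it keeps every color as an immediate, i.e. the stage before any register allocation.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumptions for this sketch (not from the original post): pixel value
   0x0000 marks transparency, and a screen line is (128 + 320 + 128) * 2
   bytes long. */
#define TRANSPARENT 0x0000u
#define LINE_BYTES  1152L

/* Appends one line plus '\n' to out if it fits in cap bytes. */
static void emit(char *out, size_t cap, const char *line)
{
    if (strlen(out) + strlen(line) + 2 <= cap) {
        strcat(out, line);
        strcat(out, "\n");
    }
}

/* Generates straight-line code for a w*h sprite of 16-bit pixels: one
   move.l per pair of visible pixels, one move.w for a leftover odd pixel,
   one add to a6 per transparent gap (inside a line or across lines) and a
   final rts. a6 is assumed to point at the sprite's top-left corner. */
void emit_sprite(const uint16_t *pix, int w, int h, char *out, size_t cap)
{
    char line[64];
    int cx = 0, cy = 0;                    /* pixel that a6 points to */
    out[0] = '\0';
    for (int y = 0; y < h; y++) {
        int x = 0;
        while (x < w) {
            if (pix[y * w + x] == TRANSPARENT) {
                x++;
                continue;
            }
            /* positive as long as the sprite is narrower than a line */
            long skip = (y - cy) * LINE_BYTES + (long)(x - cx) * 2;
            if (skip != 0) {
                if (skip <= 8)
                    snprintf(line, sizeof line, "\taddq.w #%ld,a6", skip);
                else if (skip <= 32767)    /* add.w sign-extends its immediate */
                    snprintf(line, sizeof line, "\tadd.w #$%04lx,a6",
                             (unsigned long)skip);
                else
                    snprintf(line, sizeof line, "\tadd.l #$%lx,a6",
                             (unsigned long)skip);
                emit(out, cap, line);
            }
            while (x < w && pix[y * w + x] != TRANSPARENT) {
                if (x + 1 < w && pix[y * w + x + 1] != TRANSPARENT) {
                    snprintf(line, sizeof line, "\tmove.l #$%04x%04x,(a6)+",
                             (unsigned)pix[y * w + x],
                             (unsigned)pix[y * w + x + 1]);
                    x += 2;
                } else {
                    snprintf(line, sizeof line, "\tmove.w #$%04x,(a6)+",
                             (unsigned)pix[y * w + x]);
                    x += 1;
                }
                emit(out, cap, line);
            }
            cx = x;
            cy = y;
        }
    }
    emit(out, cap, "\trts");
}
```

Because there is no branch in the generated code, all the decisions about transparency are made here, once, on the PC side.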
At this stage, the sprite is hardcoded and transparency is handled by the add instructions.
Notice that there is no conditional code here (neither for clipping nor for transparency).
Only a MOVE for pixels, and an ADD for transparency or to jump to the next line.
The code does exactly what it is supposed to do, without any useless instruction.
Of course, if you need left and right sprites, you have to hardcode both of them.
(There is no way to display the left sprite from the hardcoded right sprite.)
The code of a sprite currently looks like this (real example):
move.l #$f58e72f2,(a6)+
move.l #$72f2526c,(a6)+
move.l #$294672f2,(a6)+
move.l #$526c41a6,(a6)+
addq.w #$0006,a6
move.l #$526841a6,d4
move.l #$41a60001,d0
add.w #$0454,a6
move.l #$41a641a6,(a6)+
...
The goal now is to fit as many colors as possible into the registers D0-D7 and A0-A5.
I first took the 15 most frequently used colors and loaded them into the registers.
This covers 35-40% of the pixels.
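That first, frequency-only approach is easy to measure. This hypothetical helper computes the share of long writes that a fixed set of registers would serve if each register held one color for the whole sprite; the counts are assumed to come from a frequency table sorted from most to least frequent.

```c
/* Fraction of long writes served by registers when the n_regs most
   frequent colors stay in registers for the whole sprite. counts[] must
   be sorted from most to least frequent. Returns 0.0 for an empty table. */
double coverage(const int *counts, int n_colors, int n_regs)
{
    long total = 0, covered = 0;
    for (int i = 0; i < n_colors; i++) {
        total += counts[i];
        if (i < n_regs)
            covered += counts[i];
    }
    return total ? (double)covered / (double)total : 0.0;
}
```

When a sprite has many colors with similar counts, this ratio stays low no matter how the top colors are chosen, which is why reusing registers over time pays off.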
But I thought I could do much better.

Imagine that the color "aa" is used like this in a sprite:
aaaaaaaaaaaababaccbaddeeffgghhiiiiiiijjkkllmmnnooppqqrrssttuuvvwwxxyyzzaa
Why keep the color "aa" in a register after the beginning? It is only needed again at the very end of this line.
After the first group of "aa", the same register could be reused for "ba" and then for "ii".
Using the registers as much as possible greatly improves both the display speed of the sprite and the memory it takes.
That's why I compute the mathematical variance of each color over the whole sprite.
In the previous example, "aa" has a very large variance if I compute it over the whole line, but a very low variance if I compute it only over the first part of the line.
So I scan the whole image looking for the lowest variance for each color (there may be more than one group if the color is used at the beginning and then many times later).
Then I fill the registers by filling the holes:
move.l #color_a,d0
move.l d0,(a6)+
move.l d0,(a6)+
move.l d0,(a6)+
move.l d0,(a6)+
move.l #color_i,d0
move.l d0,(a6)+
move.l d0,(a6)+
move.l d0,(a6)+
...
move.l #color_a,(a6)+
Notice that I use an immediate move for the last "aa", since d0 now holds "ii".
I use the 15 registers to fill the holes between groups of colors as much as possible.
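Filling the holes is close to interval scheduling. The C sketch below uses a swapped-in technique, a greedy linear-scan assignment similar to what compilers do, not the author's variance-driven procedure: each group of occurrences of one color is treated as an interval [first, last] in the stream of long writes and gets a register that is free again by then. The group boundaries and positions are assumed inputs (they could come from the variance step).

```c
#include <stdlib.h>

/* One group of occurrences of the same long color in the write stream:
   it would occupy a register from its first to its last use. */
typedef struct {
    int first, last;   /* positions (indices of move.l in drawing order) */
    int uses;          /* number of move.l served by this group */
    int reg;           /* assigned register index, or -1 = stays immediate */
} group;

static int by_first(const void *a, const void *b)
{
    const group *x = a, *y = b;
    return (x->first > y->first) - (x->first < y->first);
}

/* Sorts the groups by first use and gives each one (with at least 2 uses)
   a register whose previous group ended before this one starts; the
   register is then reloaded with move.l #imm,dn just before the group.
   Returns the number of move.l that read a register. */
int assign_registers(group *g, int n, int n_regs)
{
    int busy_until[16];
    if (n_regs > 16)
        n_regs = 16;
    for (int r = 0; r < n_regs; r++)
        busy_until[r] = -1;
    qsort(g, (size_t)n, sizeof *g, by_first);
    int served = 0;
    for (int i = 0; i < n; i++) {
        g[i].reg = -1;
        if (g[i].uses < 2)
            continue;                  /* a single use: immediate is cheaper */
        for (int r = 0; r < n_regs; r++) {
            if (busy_until[r] < g[i].first) {
                busy_until[r] = g[i].last;
                g[i].reg = r;
                served += g[i].uses;
                break;
            }
        }
    }
    return served;
}
```

With made-up positions, one register is enough for three successive groups (0-5, 6-8, 10-16), while a lone late use of a color, like the last "aa", stays an immediate.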
For the move.w pixels (runs with an odd number of pixels), I compare the low word of the 15 registers to check whether I can use move.w reg,(a6)+ instead of an immediate value.
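The move.w check only needs the low word, since move.w dn,(a6)+ (and move.w an,(a6)+) writes bits 0-15 of the register. A hypothetical C helper, tested with register contents taken from the real example above:

```c
#include <stdint.h>

/* Looks for a register whose current low word equals the pixel, so that
   move.w reg,(a6)+ can replace move.w #imm,(a6)+. regs[] must hold the
   register contents at this point of the sprite (they change as the
   registers are reloaded). Returns the register index, or -1. */
int find_word_register(const uint32_t *regs, int n_regs, uint16_t pixel)
{
    for (int r = 0; r < n_regs; r++)
        if ((uint16_t)(regs[r] & 0xffffu) == pixel)
            return r;
    return -1;   /* no match: fall back to the immediate move.w */
}
```

With d0 = $41a60001 and d4 = $526841a6, the pixel $41a6 can be written from d4, but $5268 cannot be written from d4, because it sits in the high word.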
By keeping immediate moves to a minimum, I make the most of the registers.
This technique saves a lot of cycles (and memory) for each sprite.
If anything is unclear, just ask.
Don't forget that it's not only the sprites that are optimized, but the whole code:
- the loops stay in the instruction cache as much as possible
- some code is called only when something has changed that requires refreshing something else
- I avoid tests in the graphics loops (only movem or hardcoded sprites there)
Best regards
Have fun
Thadoss
PS: an example of the content of a sprite (taken from one of the game's sprites)
[...]
move.l d3,(a6)+
add.w #$044c,a6
move.w d4,(a6)+
move.l #$00015268,d3
move.l d3,(a6)+
move.l d4,(a6)+
move.l d0,(a6)+
move.l d1,(a6)+
move.l #$f58e526c,(a6)+
move.l d2,(a6)+
move.l d2,(a6)+
move.l #$2946526c,(a6)+
move.l #$526cf58e,(a6)+
move.l d5,(a6)+
move.l #$736c736c,(a6)+
move.l a3,(a6)+
move.l d4,(a6)+
add.w #$0448,a6
move.w d4,(a6)+
move.l #$5268736c,d6
move.l d6,(a6)+
[...]