Oh my god, another thread XD I didn't expect this conversion to get talked about that much.
So yes, it may be considered a simple video codec, but as I'm originally a demo coder and I used my old skills to do it, you can also consider it a demo. It's up to you, I don't mind at all. As for the compo, well, I didn't ask for anything, I just sent it to the SV orgas and they did whatever they wanted with it. I wouldn't even have minded if it had been shown outside any compo, or not shown at all.
I see people comparing it to other video players or other demos (thunderdome, blabla/ppera's player). Fair enough, but the first one doesn't have that many frames (I think) and has bad audio quality. The latter is amazing too, but it's a general-purpose player that couldn't play bad apple at full framerate with this audio quality. Note that, again, I don't mind being compared; it's just that I didn't even look at those players before doing bad apple. I started from scratch without knowing what was really possible or not.
I thought about this conversion for about 2 years, but as I hadn't written any m68k code for 25 years it took me a long time to do anything. Then a conversion was released by StackDesign. I was really glad about it, so I tried it on my STE and... it *doesn't work*! It only works on emulators like Hatari, because HDD access timings are not emulated and accesses are instantaneous. And even there, it uses RLE compression, which makes the ST struggle, so it's often out of sync.
I wondered if the ST was really that slow, and whether nothing better could be done at 30fps and full resolution. I really thought it was impossible. I did a quick first test in GFA BASIC by loading a pi1 sequence from my UltraSatan and it was slow as hell (2~3fps). Then I merged all the pi1 files into one file, and it was faster (~5fps).
Of course I had opened Pandora's box: I wanted to try different solutions, so I took my best vasm (which I had used to patch R-Type DX a few months earlier) and got to work.
In C# I wrote a simple delta-packer that turns that pi1 sequence into really basic generated code (move.w/addq), which is a well-known technique that I never used myself back in the 90s. As expected this reduced each frame quite a lot, then I executed it and got... 10fps.
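For readers who want to picture the technique, here is a rough sketch of delta-packing into generated code. It is written in Python rather than C#, and it is not the actual generator: the function name `delta_to_code`, the use of `a0` as the screen pointer and the exact choice of instructions are assumptions made for illustration only.

```python
def delta_to_code(prev, curr):
    """Emit m68k-style source lines that turn frame `prev` into frame `curr`.

    Frames are equal-length lists of 16-bit words. The generated code
    assumes a0 points at the start of the screen buffer (an assumption
    of this sketch, not a documented detail of the demo).
    """
    assert len(prev) == len(curr)
    lines = []
    skip = 0  # bytes of unchanged data since the last write
    for old, new in zip(prev, curr):
        if old == new:
            skip += 2  # unchanged word: just remember to step over it
            continue
        if skip:
            if skip <= 8:
                lines.append(f"addq.w #{skip},a0")   # addq handles 1..8
            else:
                lines.append(f"lea {skip}(a0),a0")   # larger jumps (16-bit offset)
            skip = 0
        lines.append(f"move.w #${new:04x},(a0)+")    # write the changed word
    lines.append("rts")
    return lines
```

Only changed words cost anything: unchanged runs collapse into a single pointer adjustment, which is why frames with little motion become very short.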
I did various optimizations to the code generation but never got past 12fps, if I remember correctly. I have some videos of the progress, so I will do a small making-of video later.
The bottleneck was actually the GEMDOS FREAD call which is, as you know, a synchronous call. So the main loop that went like FREAD/render/FREAD/render was slowed down because of the FREAD.
That's when I got into the ACSI documentation and Ppera's website. I didn't want to go low-level: I don't have the skills, I know it can be real hell to make things work, and I wanted to stick with FREAD. I learnt that the max transfer rate is around 1.5Mbps. So why couldn't it load faster?
What I found out by doing many tests is that reading a file in big chunks (like 100kB) is way faster than doing small reads. I was actually thinking like a modern asshole developer who doesn't mind doing 2-byte reads from the HDD, but on the ST that is of course not the way to go. I don't know exactly how Ppera's driver works, but I guess it may be the same with any other driver.
So I tried to parallelize the HDD reads. When I succeeded, boom: 60fps. Sixty, not 30! It was actually faster than the video needed.
This is the thing I'm proud of: being able to use GEMDOS and render at the same time, without using the VBL (which I need to keep for counting elapsed frames). The main loop does the HDD reads, while the rendering goes into the... HBL! It's the only interrupt that doesn't stall the VBL frame counter.
There may be other ways of doing this, but it works this way.
To be exact, the HBL is triggered by Timer B. And if you run the demo on a real STE, you can press + to see the load/play pointers and left shift to see the CPU and blitter timings. You'll then see that while a 100kB chunk is loading, the blitter/CPU rendering is a lot slower. That's because the disk DMA steals cycles from the rendering when emptying the FIFO, which is what makes the loading actually asynchronous.
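The load/play pointers mentioned above suggest a ring buffer shared between the loader and the renderer. Here is a toy model of that idea in Python. It is not the m68k code, and the class name, method names and sizes are assumptions for illustration: `load_chunk` stands in for the main loop doing big FREADs, `play_frame` for the Timer B side consuming frame data.

```python
class RingBuffer:
    """Fixed-size ring buffer with separate load and play pointers."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.size = size
        self.load = 0    # next byte the loader will write
        self.play = 0    # next byte the player will read
        self.filled = 0  # bytes loaded but not yet played

    def free(self):
        return self.size - self.filled

    def load_chunk(self, chunk):
        """Loader side: copy one big chunk in, if there is room for it."""
        if len(chunk) > self.free():
            return False  # the player has not caught up yet; retry later
        for b in chunk:
            self.data[self.load] = b
            self.load = (self.load + 1) % self.size
        self.filled += len(chunk)
        return True

    def play_frame(self, n):
        """Player side: take n bytes out, or None if they are not loaded yet."""
        if n > self.filled:
            return None
        out = bytes(self.data[(self.play + i) % self.size] for i in range(n))
        self.play = (self.play + n) % self.size
        self.filled -= n
        return out
```

As long as the load pointer stays ahead of the play pointer, the player never waits, and the loader only pauses when the buffer is full. On real hardware the two sides run concurrently, so the shared state would need more care than in this single-threaded model.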
I was really amazed, because I was ready to lower the graphic quality to 3, maybe 2 bitplanes. I didn't have to do that. And when I added sound, it all ran smoothly: I could push the quality all the way up to 50kHz stereo, just because I wanted the best soundtrack possible.
That was my goal all along: the best quality ever for the animation and sound, with GEMDOS loading. It's not the codec that made it possible, but the parallelized loading.
After that, I optimized the "codec" by using the blitter. I could have released it without the blitter (it can still be disabled in my generator), but I *never* succeeded in using that bloody blitter in the 90s and I wanted my payback. At first I had to use Steem's debugger to find out what I was doing wrong, because the blitter went crazy many times. And strangely, when I used the 'tas' instruction as shown in Atari's docs, it crashed on my Mega STE. I don't know why, but now I avoid tas even if theoretically the ST has no issue with it (unlike the Amiga). My opinion is just that my code is crap.
So I used some blitter tricks, the main one being that the delta-packing is done in vertical stripes of 1, 2 or 4 bitplanes depending on the frame (the generator finds the best setup). Those stripes are stored in the file and copied back to the frame with a simple blitter copy. There is also a later optimization that detects empty stripes (black) and full stripes (white), for which no data is saved in the file since the blitter can fill with 1s and 0s without a source.
Using the blitter reduces the file size by ~40%. There is still some generated code when the stripes are too small.
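The stripe handling described above can be pictured as a small classifier. This is only a sketch in Python under stated assumptions, not the real generator: a stripe is modelled as a list of 16-bit words from one bitplane, the names `classify_stripe` and `stored_bytes` are made up, and the `MIN_BLIT_WORDS` threshold is hypothetical since the actual rule for "too small" isn't given.

```python
MIN_BLIT_WORDS = 4  # hypothetical threshold; the real rule is not documented here


def classify_stripe(words):
    """Decide how one stripe is encoded; returns (kind, payload)."""
    if not words:
        raise ValueError("empty stripe")
    if all(w == 0x0000 for w in words):
        return ("fill0", [])         # blitter fills with 0s, nothing stored
    if all(w == 0xFFFF for w in words):
        return ("fill1", [])         # blitter fills with 1s, nothing stored
    if len(words) < MIN_BLIT_WORDS:
        return ("code", list(words))  # too small: falls back to generated code
    return ("copy", list(words))      # stored in the file, copied by the blitter


def stored_bytes(stripes):
    """Bytes of stripe data that end up in the file (2 bytes per word)."""
    return sum(2 * len(classify_stripe(s)[1]) for s in stripes)
```

The saving comes from the two fill cases: however tall the stripe is, it costs no data in the file.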
And, finally, I used a modified Gray code for the palette to minimize bit changes between shades and on black/white state changes (black=0000, white=0001 instead of 1111). This is particularly efficient with the blitter.
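A small Python sketch shows why the palette order matters. The exact modified table used in the demo isn't given here, so the standard reflected Gray code below is only a stand-in, and the helper names are made up for illustration.

```python
def planes_changed(a, b):
    """Number of bitplanes that differ between two 4-bit color indices."""
    return bin(a ^ b).count("1")


def gray(i):
    """Standard reflected Gray code (a stand-in, not the demo's actual table)."""
    return i ^ (i >> 1)


# Color indices for 16 shades ordered so that neighbours differ by one bit.
shades = [gray(i) for i in range(16)]

# Black-to-white transition with the usual vs. the modified white index.
naive_cost = planes_changed(0b0000, 0b1111)     # white = 1111: every plane flips
modified_cost = planes_changed(0b0000, 0b0001)  # white = 0001: one plane flips
```

With white at 0001, a pixel going from black to white touches a single bitplane instead of four, which means fewer planes to rewrite per frame.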
Of course there must be more ways of making the file smaller. I had some other ideas for the blitter that I didn't have time to implement. I checked the talks here about packers, and none could fit my requirements since they're tailored for floppy loading rates. But maybe... after all, the rendering leaves 50-80% of the CPU time free. If you find a way to unpack 200kB/s of graphic data, be my guest.
You can try to compress the audio, but I highly doubt the STE would be able to losslessly depack 100kB of sound per second. Prove me wrong.
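To put the figures above in perspective, here is the arithmetic as a short Python sketch. It assumes 8-bit samples at the STE's top DMA sound rate of 50066 Hz (the "50kHz" above), exactly 30 frames per second, and a raw low-res frame of 32000 bytes (320x200, 4 bitplanes); the 200kB/s value is the rough figure quoted above, not a measurement.

```python
SAMPLE_RATE_HZ = 50066   # STE DMA sound, highest rate
CHANNELS = 2             # stereo
BYTES_PER_SAMPLE = 1     # 8-bit samples

audio_bytes_per_s = SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE

GFX_BYTES_PER_S = 200_000  # rough figure quoted in the text
FPS = 30                   # assumed exact frame rate
RAW_FRAME_BYTES = 320 * 200 * 4 // 8  # one uncompressed low-res screen

gfx_bytes_per_frame = GFX_BYTES_PER_S / FPS
ratio_to_raw = gfx_bytes_per_frame / RAW_FRAME_BYTES

print(f"audio: {audio_bytes_per_s} bytes/s")
print(f"graphics: about {gfx_bytes_per_frame:.0f} bytes per frame")
print(f"that is {ratio_to_raw:.0%} of a raw frame")
```

So any depacker would have to sustain about 100kB/s for audio plus 200kB/s for graphics, on top of the rendering itself.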
Anyway, it was more of an exercise for me than anything else. I had fun doing it. I did reduce the data, but first of all I wanted the best quality I could get out of my old STE. I didn't mind the file size: I have many 1GB cards that are mostly empty. An Atari STE with 1GB of mass storage! Why would I sacrifice quality when I had the space to do anything I wanted?