ScummVM/Falcon060 pre-release

Latest news in the Atari world

Moderators: Mug UK, Silver Surfer, Moderator Team

User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3449
Joined: Sun Jul 31, 2011 1:11 pm

Re: ScummVM/Falcon060 pre-release

Post by Eero Tamminen »

mikro wrote: Sun Sep 24, 2023 9:08 pm I've updated the links with a last-minute fix/feature: viewtopic.php?p=448188#p448188 so you can make another test whether / how much it helped.
Thanks! I'll check the games list scrolling and e.g. Putt-Putt goes to Moon intro [1] in few days.
[1] viewtopic.php?p=448130#p448130
mikro wrote: Sun Sep 24, 2023 9:08 pm I'm still thinking about a more invasive version (not adding but copying for the first channel) but maybe I'll let it as is and focus on asm/dsp version instead.
More invasive C++ version would be nice as it could be upstreamed and help all platforms.

As to asm, didn't you initially consider that for the rate conversions? That could help e.g. Dig and Return to Zork quite a bit (see earlier demo intro profiles).
mikro
Hardware Guru
Hardware Guru
Posts: 3708
Joined: Sat Sep 10, 2005 11:11 am
Location: Kosice, Slovakia
Contact:

Re: ScummVM/Falcon060 pre-release

Post by mikro »

Eero Tamminen wrote: Sun Sep 24, 2023 10:09 pmIs there some chat/IRC channel for ScummVM where you could ask somebody else sees the same issue with (builtin theme +) those options?
There's the General channel on Discord: https://discord.com/channels/5812240605 ... 1091446795 and of course the forums: https://forums.scummvm.org/viewforum.ph ... b713b154e4.
Have you had time to look into using LTO yet? :-)
Actually good reminder -- I started with "OK, I'll try to enable the LTO plugin and see what happens" and then bumped into the leading underscore issues and forgot about it. I'm really glad I made a simple script to download & reinstall all of my libraries otherwise I would have gone crazy by now with all those reinstalls.
Rose game play is just way too slow to be (manually) profiled in Hatari. There needs to be use-case that's at least somewhat playable on 32Mhz (emulated) Falcon.
Indeed. I'll try to think of a way to at least coarsely profile it, perhaps I'll measure game's main loop without Atari rendering in milliseconds, at the rate the game is running, even measuring in seconds could be enough. :D

Composer: nice, that's what I wanted to see, indeed the blitting is what we like here. Graphics::keyBlit always makes me sad because that code is pure C and is not so open for optimisation - it's just a loop copying bytes :( and comparing each byte to the key. Very slow and even SuperVidel can't help here because it has support only for key == 0x00. :( Perhaps I could code a specialised version comparing longs (i.e. if a long == 0x44444444 with key being '0x44' then skip otherwise compare against '0x4444' otherwise '0x44').

Mixer: we'll see which part(s) will need to be rewritten. For instance even if you tell me that small change I have just published provides a massive speedup, I can't merge it. And ironically, exactly because of the silence:

- SDL backend requires mixing something
- mixers obeys, fills its buffer with samples
- SDL backend requires mixing something
- mixer has nothing to mix so it doesn't put anything in the buffer
- however the SDL audio callback expects the buffer to be filled, so it plays whatever it has there: it could be silence or random data.

Of course, SDL backend could have been convinced to clear its buffer manually (there are only three mixers - ours, SDL and null) so if/when we have some convincing data, I'll try to ask the devs about it.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3449
Joined: Sun Jul 31, 2011 1:11 pm

Re: ScummVM/Falcon060 pre-release

Post by Eero Tamminen »

Game list scrolling

Earlier profile: viewtopic.php?p=451539#p451539

Scrolling from top to bottom, and back up, with arrow keys:

Code: Select all

Time spent in profile = 87.18809s.
...
Used cycles:
  22.57%  22.73%  22.74%   631427643 635957026 636004492   Graphics::BdfFont::drawChar(Graphics::Surface*, unsigned int, int, int, unsigned int) const
  21.38%                   598185977                       c2p1x1_8_rect_start
  16.60%                   464466460                       c2p1x1_8_rect_pix16
   7.07%                   197801805                       copy256_d
   4.69%   4.73%   4.73%   131187911 132252101 132262517   Graphics::BdfFont::getCharWidth(unsigned int) const
   3.07%                    85858582                       copy16_d
   2.90%   2.92%  32.80%    81160810  81703131 917541338   Graphics::Font::drawString(Graphics::Surface*, Common::U32String const&, int, int, int, unsigned int, Graphics::TextAlign, int, bool) const
   2.86%   2.88%   3.79%    79897197  80482312 106040412   Audio::MixerImpl::mixCallback(unsigned char*, unsigned int)
   1.71%   1.72%   3.86%    47705791  48185741 108069081   int Graphics::(anonymous namespace)::getStringWidthImpl<Common::U32String>(Graphics::Font const&, Common::U32String const&)
   1.22%   1.23%   1.54%    34195902  34347301  42989125   Common::BaseString<char32_t>::ensureCapacity(unsigned int, bool)
   1.21%   2.31%   7.12%    33798252  64728392 199214224   OSystem_Atari::update()
   1.08%   1.09%   2.63%    30286751  30518643  73512513   Common::U32String::operator+=(char32_t)
   1.06%   1.07%  42.60%    29651008  298799681191578916   GUI::ListWidget::drawFormattedText()
Results:
  • mixCallback() decreased from earlier 5.1% (own 0.4%) to 3.8% (own 2.9%), with memset() reduced from earlier 4.7% to 0.2%
  • Slightly more than half of the cycles gained with memset call reductions, was lost due to mixCallback() code itself taking now more cycles
  • About same wall-clock time used
=> Surprisingly high cost from adding one boolean check to channels loop. Meaning that while this change helps silent parts, it could degrade perf when something actually needs to be mixed.


Putt-Putt goes to Moon (demo)

Earlier profile from TT / slim version: viewtopic.php?p=448130#p448130

Intro

Profile of the demo startup until it's interactive with the new version (using Falcon / fat version):

Code: Select all

Time spent in profile = 45.07248s.
...
Used cycles:
  12.00%                   173491684                       set256
  10.25%  10.33%  12.58%   148263157 149437837 181939454   Scumm::AkosRenderer::byleRLEDecode(Scumm::BaseCostumeRenderer::ByleRLEData&)
   7.21%                   104337700                       ROM_TOS
   5.49%   5.53%   5.53%    79350987  79901208  79907352   Scumm::MajMinCodec::decodeLine(unsigned char*, int, int)
   3.15%   3.18%   3.37%    45589252  45975032  48688814   Audio::RateConverter_Impl<false, true, false>::convert(Audio::AudioStream&, short*, unsigned int, unsigned short, unsigned short)
   2.95%   2.98%  28.70%    42721664  43059361 415051433   Audio::MixerImpl::mixCallback(unsigned char*, unsigned int)
   2.68%   3.12%   3.12%    38725129  45163940  45167609   Common::StackLock::StackLock(Common::Mutex const&, char const*)
...
Executed instructions:
  11.02%  11.06%  12.89%    30327814  30438397  35460606   Scumm::AkosRenderer::byleRLEDecode(Scumm::BaseCostumeRenderer::ByleRLEData&)
   9.24%   9.26%   9.26%    25422438  25471419  25472443   Scumm::MajMinCodec::decodeLine(unsigned char*, int, int)
   4.80%                    13206247                       ROM_TOS
Game play

Playing through the whole demo, and clicking everything in all scenes:

Code: Select all

Time spent in profile = 281.60314s.
...
Visits/calls:
   8.07%             4749522             set256
   5.34%             3142325             NullMutexInternal::lock()
   5.34%             3142246             NullMutexInternal::unlock()
   5.31%   5.31%     3127072   3127466   Common::StackLock::~StackLock()
   5.31%   5.31%     3126888   3127520   Common::StackLock::StackLock(Common::Mutex const&, char const*)
...
Used cycles:
  26.42%  26.64%  32.24%  238719038224067175122913282985   Scumm::AkosRenderer::byleRLEDecode(Scumm::BaseCostumeRenderer::ByleRLEData&)
  10.73%                   969416478                       set256
   6.90%                   623361239                       ROM_TOS
   3.92%   3.96%   4.19%   354496613 357471379 378895738   Audio::RateConverter_Impl<false, true, false>::convert(Audio::AudioStream&, short*, unsigned int, unsigned short, unsigned short)
   2.78%   3.26%   3.26%   250926659 294491194 294527704   Common::StackLock::StackLock(Common::Mutex const&, char const*)
   2.67%   2.69%  27.09%   241122538 2432568152447914845   Audio::MixerImpl::mixCallback(unsigned char*, unsigned int)
   2.65%   2.67%   3.58%   239844773 241679136 323671629   Scumm::ScummEngine::drawStripToScreen(Scumm::VirtScreen*, int, int, int, int)
   2.39%   2.41%   5.22%   216318156 218130527 471994745   Scumm::ScummEngine::getResourceAddress(Scumm::ResType, unsigned short)
   2.10%   2.12%   2.12%   190077221 191588187 191608913   Scumm::MajMinCodec::decodeLine(unsigned char*, int, int)
   1.78%   2.03%   2.03%   160946484 183161854 183185027   Common::StackLock::~StackLock()
   1.78%   1.80%   3.59%   160772754 162352644 324727271   Audio::QueuingAudioStreamImpl::endOfData() const
   1.64%   1.66%   9.78%   148406419 149758427 883836980   Audio::Channel::mix(short*, unsigned int)
   1.45%                   130951800                       c2p1x1_8_rect_start
   1.44%   1.46%   3.44%   130488062 131656873 310444045   Audio::QueuingAudioStreamImpl::endOfStream() const
   1.39%   1.40%   4.62%   125289978 126318826 417383281   Scumm::ScummEngine::resetActorBgs()
   1.22%   5.55%  38.49%   110486992 5016486743477854584   OSystem_Atari::update()
   1.09%                    98559850                       c2p1x1_8_rect_pix16
   1.09%                    98511125                       AtariMixerManager::update() [clone .part.0]
   1.04%   1.05%   2.26%    93755800  94490419 204166040   Scumm::Gdi::drawStripComplex(unsigned char*, int, unsigned char const*, int, bool) const
   1.00%   1.01%   1.80%    90663012  91585293 162616068   Scumm::debugC(int, char const*, ...)
   0.86%   0.87%   5.61%    78152443  78882190 506917888   Scumm::ScummEngine::getMaskBuffer(int, int, int)
   0.78%   0.79%   0.79%    70494455  70979686  70986650   Common::DebugManager::isDebugChannelEnabled(unsigned int, bool)
Observations
  • mixCallback() is reason for ~2/3 of the memset() costs, which is down from earlier profile.
  • mixCallback() total percentage grew from 23-24% to 27-29%, despite memset() percentage very clearly dropping
  • "Common::StackLock" constructor & destructor are taking 5% of play cycles => maybe locking no-op thing should be resurrected?
  • There's a large difference between what takes most cycles, and what runs most instructions
=> Overall perf degraded marginally. Mixing really needs the more clever scheme you mentioned.

mikro wrote: Mon Sep 25, 2023 6:31 am Mixer: we'll see which part(s) will need to be rewritten. For instance even if you tell me that small change I have just published provides a massive speedup, I can't merge it. And ironically, exactly because of the silence:
- SDL backend requires mixing something
...
Of course, SDL backend could have been convinced to clear its buffer manually (there are only three mixers - ours, SDL and null) so if/when we have some convincing data, I'll try to ask the devs about it.
I see. That's quite unfortunate. Hopefully SDL backend maintainers agree on pushing memsetting down to backend.

PS. Myst seems even better test-case for memset(): viewtopic.php?p=448751#p448751
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3449
Joined: Sun Jul 31, 2011 1:11 pm

Re: ScummVM/Falcon060 pre-release

Post by Eero Tamminen »

Putt-Putt goes to Moon conclusion was premature.

As earlier profile was from GCC7 slim version on TT, I took Falcon profile also with few week earlier ScummVM version with GCC7 & GCC13 builds.

With those, the Putt-Putt intro profile looks following...

GCC13:

Code: Select all

Used cycles:
  11.71%                   167875728                       set256
   9.73%   9.81%  12.03%   139472502 140653639 172399093   Scumm::AkosRenderer::byleRLEDecode(Scumm::BaseCostumeRenderer::ByleRLEData&)
   7.41%                   106252001                       ROM_TOS
   5.50%   5.54%   5.54%    78884300  79460765  79466469   Scumm::MajMinCodec::decodeLine(unsigned char*, int, int)
   3.15%   3.18%   3.37%    45196461  45580746  48258795   Audio::RateConverter_Impl<false, true, false>::convert(Audio::AudioStream&, short*, unsigned int, unsigned short, unsigned short)
   3.10%   3.12%  28.05%    44436084  44715266 401983012   Audio::MixerImpl::mixCallback(unsigned char*, unsigned int)
   2.77%   2.79%   5.28%    39760164  40029070  75613588   Scumm::Gdi::drawStripComplex(unsigned char*, int, unsigned char const*, int, bool) const
   2.67%   3.08%   3.08%    38261638  44199619  44202701   Common::StackLock::StackLock(Common::Mutex const&, char const*)
   2.43%   2.45%   2.99%    34828956  35122553  42908064   Scumm::ScummEngine::drawStripToScreen(Scumm::VirtScreen*, int, int, int, int)
   2.02%   2.04%  10.41%    28971766  29215119 149238762   Scumm::ScummEngine::dissolveEffect(int, int)
   1.84%   1.85%   8.97%    26309213  26565272 128607527   Audio::Channel::mix(short*, unsigned int)
   1.80%   8.23%  40.47%    25728613 117919357 580136516   OSystem_Atari::update()
   1.76%   3.90%   6.15%    25186124  55884562  88156543   AtariGraphicsManager::copyRectToScreenInternal(void const*, int, int, int, int, int, Graphics::PixelFormat const&, bool, bool)
   1.74%   1.76%   3.29%    24956613  25195021  47189274   Audio::QueuingAudioStreamImpl::endOfData() const
   1.73%                    24855586                       c2p1x1_8_rect_start
   1.70%   1.97%   1.97%    24350095  28235484  28237830   Common::StackLock::~StackLock()
   1.48%                    21279756                       AtariMixerManager::update() [clone .part.0]
   1.37%   1.38%   3.61%    19666401  19781857  51736477   Scumm::ScummEngine::resetActorBgs()
   1.34%                    19158769                       c2p1x1_8_rect_pix16
   1.33%   1.35%   3.86%    19129702  19284068  55380641   Audio::QueuingAudioStreamImpl::endOfStream() const
   1.27%   0.00%   0.00%    18249270      8787      8787 * Graphics::copyBlit(unsigned char*, unsigned char const*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int)
   0.93%   0.94%   2.95%    13351568  13421382  42265576   DefaultTimerManager::checkTimers(unsigned int)
GCC7:

Code: Select all

Used cycles:
  11.43%  11.52%  13.66%   167712281 169002364 200318593   Scumm::AkosRenderer::byleRLEDecode(Scumm::BaseCostumeRenderer::ByleRLEData&)
  10.35%                   151854071                       set256
   6.58%   6.63%   6.63%    96500907  97223914  97233826   Scumm::MajMinCodec::decodeLine(unsigned char*, int, int)
   6.29%                    92225007                       ROM_TOS
   3.20%   3.22%   3.77%    46970917  47296322  55362667   Scumm::ScummEngine::drawStripToScreen(Scumm::VirtScreen*, int, int, int, int)
   2.77%   2.80%  25.57%    40693052  41046178 375026245   Audio::MixerImpl::mixCallback(unsigned char*, unsigned int)
   2.55%   2.58%   2.76%    37466100  37771212  40474563   Audio::RateConverter_Impl<false, true, false>::convert(Audio::AudioStream&, short*, unsigned int, unsigned short, unsigned short)
   2.15%   2.16%  10.92%    31487414  31741421 160224100   Scumm::ScummEngine::dissolveEffect(int, int)
   2.10%                    30806355                       Common::StackLock::lock()
   2.09%   2.11%  36.53%    30586580  30876974 535791661   OSystem_Atari::update()
   2.01%   2.03%   6.12%    29552146  29755140  89831764   Scumm::Gdi::drawStripComplex(unsigned char*, int, unsigned char const*, int, bool) const
   1.71%                    25037290                       c2p1x1_8_rect_start
   1.68%   1.70%   3.87%    24654326  24941481  56830622   Audio::QueuingAudioStreamImpl::endOfData() const
   1.61%   1.62%   6.59%    23647106  23785129  96608000   AtariGraphicsManager::copyRectToScreenInternal(void const*, int, int, int, int, int, Graphics::PixelFormat const&, bool, bool)
   1.44%   1.44%   8.42%    21076905  21167789 123437866   Audio::Channel::mix(short*, unsigned int)
   1.37%   0.00%   0.00%    20056909      8828      8828 * Graphics::copyBlit(unsigned char*, unsigned char const*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int)
   1.33%   1.34%   3.50%    19566703  19664114  51291722   Audio::QueuingAudioStreamImpl::endOfStream() const
   1.30%                    19095647                       c2p1x1_8_rect_pix16
   1.24%                    18198757                       AtariMixerManager::update() [clone .part.2]
   1.22%   1.23%   3.10%    17949562  18110680  45479877   Scumm::ScummEngine::resetActorBgs()
   1.18%   1.18%   1.18%    17234245  17339155  17340453   Scumm::Gdi::writeRoomColor(unsigned char*, unsigned char) const
   1.12%   3.57%   3.57%    16368615  52354157  52359054   Common::StackLock::StackLock(Common::Mutex const&, char const*)
   1.00%                    14609322                       Common::StackLock::unlock()
Summary

Compared to other GCC13 profile, there's no significant perf difference from the memset() optimization for Putt-Putt, perf difference was from GCC7 vs GCC13. If memset() gets called less, mixCallback() slowdown eats the improvements, but at least it's not decreasing (average) perf.

For example, GCC 13 is faster than GCC7 in byleRLEDecode(), decodeLine() and drawStripToScreen(), but slower in mixCallback(), RateConverter_Impl() and drawStripComplex().
mikro
Hardware Guru
Hardware Guru
Posts: 3708
Joined: Sat Sep 10, 2005 11:11 am
Location: Kosice, Slovakia
Contact:

Re: ScummVM/Falcon060 pre-release

Post by mikro »

Eero Tamminen wrote: Mon Sep 25, 2023 10:42 pm=> Surprisingly high cost from adding one boolean check to channels loop. Meaning that while this change helps silent parts, it could degrade perf when something actually needs to be mixed.
Are you sure about this? Is it really so easy to compare mixCallback() with memset always present vs. mixCallback() without memset?

Or to put it another way, how big performance decrease we are talking about? By definition, there has to be a few cycles per loop added as I did add the check but certainly it shouldn't be something to worry about.

EDIT: Sorry, overlooked your previous message. Ok, makes sense then. So the conclusion is that the memset() skipped did help for silence scenes, right?

Btw, about the locks... is it really ~2.5% of the all cycles while the game is running? I want to avoid another awkward situation when re-proposing the patch. ;-)
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3449
Joined: Sun Jul 31, 2011 1:11 pm

Re: ScummVM/Falcon060 pre-release

Post by Eero Tamminen »

mikro wrote: Tue Sep 26, 2023 8:08 am
Eero Tamminen wrote: Mon Sep 25, 2023 10:42 pm=> Surprisingly high cost from adding one boolean check to channels loop. Meaning that while this change helps silent parts, it could degrade perf when something actually needs to be mixed.
Are you sure about this? Is it really so easy to compare mixCallback() with memset always present vs. mixCallback() without memset?
One can always look into profiler disassembly output, e.g. for the game list scrolling.

GCC13 ScummVM from 9th of September:

Code: Select all

Audio::MixerImpl::mixCallback(unsigned char*, unsigned int):
$022cc19e             subq.l    #$8,sp                     0.00% (18959, 18953, 0, 0)
$022cc1a0             movem.l   d2-d6/a2-a6,-(sp)          0.00% (18959, 606677, 7, 0)
$022cc1a4             movea.l   $34(sp),a3                 0.00% (18959, 56877, 1, 18958)
$022cc1a8             move.l    $38(sp),d5                 0.00% (18959, 189572, 18961, 18959)
$022cc1ac             move.l    $3c(sp),d3                 0.00% (18959, 75833, 7, 18958)
$022cc1b0             clr.l     -(sp)                      0.00% (18959, 94791, 5, 0)
$022cc1b2             pea.l     $4(a3)                     0.00% (18959, 75859, 3, 0)
$022cc1b6             moveq     #$30,d6                    0.00% (18959, 0, 0, 0)
$022cc1b8             add.l     sp,d6                      0.00% (18959, 170638, 18963, 0)
$022cc1ba             move.l    d6,-(sp)                   0.00% (18959, 75836, 2, 0)
$022cc1bc             jsr       StackLock()                0.00% (18959, 246471, 18969, 0)
$022cc1c2             move.b    #$1,$12(a3)                0.00% (18959, 151667, 8, 0)
$022cc1c8             move.l    d3,-(sp)                   0.00% (18959, 227508, 18961, 0)
$022cc1ca             clr.l     -(sp)                      0.00% (18959, 56877, 0, 0)
$022cc1cc             move.l    d5,-(sp)                   0.00% (18959, 56877, 2, 0)
$022cc1ce             jsr       _memset                    0.00% (18959, 246476, 18967, 0)
$022cc1d4             adda.w    #$18,sp                    0.00% (18960, 75345, 6, 0)
$022cc1d8             tst.b     $c(a3)                     0.00% (18960, 341249, 18968, 0)
$022cc1dc             beq.w     $22cc27c   <not taken>     0.00% (18960, 56892, 12, 0)
$022cc1e0             lsr.l     #$2,d3                     0.00% (18960, 37932, 12, 0)
$022cc1e2             lea.l     $30(a3),a2                 0.00% (18960, 75840, 18, 0)
$022cc1e6             move.l    a3,d2                      0.00% (18960, 37903, 0, 0)
$022cc1e8             addi.l    #$b0,d2                    0.00% (18960, 227453, 18983, 0)
$022cc1ee             clr.l     d4                         0.00% (18960, 37899, 0, 0)
$022cc1f0             lea.l     $22cc114(pc),a6            0.00% (18960, 75832, 23, 0)
$022cc1f4             lea.l     $22cb2d4(pc),a5            0.00% (18960, 75832, 18, 0)
$022cc1f8             lea.l     $2373a7e.l,a4              0.00% (18960, 227410, 18986, 0)
$022cc1fe             movea.l   (a2)+,a3                   0.07% (606720, 2407559, 42, 436056)
$022cc200             tst.l     a3                         0.07% (606720, 1043771, 806, 0)
$022cc202             beq.b     $22cc242                   0.07% (606720, 1356475, 23447, 0)
[...]
$022cc242             cmp.l     a2,d2                      0.07% (606720, 1193018, 22, 0)
$022cc244             bne.b     $22cc1fe                   0.07% (606720, 1218395, 2240, 0)
$022cc246             move.l    d6,-(sp)                   0.00% (18960, 94710, 0, 0)
$022cc248             jsr       ~StackLock()               0.00% (18960, 246408, 18954, 0)
$022cc24e             addq.l    #$4,sp                     0.00% (18960, 18960, 0, 0)
$022cc250             move.l    d4,d0                      0.00% (18960, 37931, 11, 0)
$022cc252             movem.l   (sp)+,d2-d6/a2-a6          0.00% (18960, 383489, 20, 150272)
$022cc256             addq.l    #$8,sp                     0.00% (18960, 18960, 0, 0)
$022cc258             rts                                  0.00% (18960, 499088, 37926, 1398)
GCC13 from 25th of September:

Code: Select all

Audio::MixerImpl::mixCallback(unsigned char*, unsigned int):
$022d33ee             subq.l    #$8,sp                     0.02% (145072, 145054, 0, 0)
$022d33f0             movem.l   d2-d7/a2-a6,-(sp)          0.02% (145072, 5077505, 64, 13)
$022d33f4             movea.l   $38(sp),a3                 0.02% (145072, 435747, 18, 145000)
$022d33f8             move.l    $3c(sp),d6                 0.02% (145072, 1450577, 145102, 145067)
$022d33fc             move.l    $40(sp),d4                 0.02% (145072, 580397, 56, 145048)
$022d3400             clr.l     -(sp)                      0.02% (145072, 725366, 53, 0)
$022d3402             pea.l     $4(a3)                     0.02% (145072, 580560, 61, 1)
$022d3406             pea.l     $34(sp)                    0.02% (145072, 580304, 20, 1)
$022d340a             jsr       StackLock()                0.02% (145072, 3191622, 290228, 19)
$022d3410             move.b    #$1,$12(a3)                0.02% (145072, 1160681, 108, 4)
$022d3416             lea.l     $30(a3),a5                 0.02% (145072, 173269, 3533, 0)
$022d341a             move.l    a3,d2                      0.02% (145072, 286613, 4, 1)
$022d341c             addi.l    #$b0,d2                    0.02% (145072, 870449, 36, 3)
$022d3422             adda.w    #$c,sp                     0.02% (145072, 580288, 25, 3)
$022d3426             clr.b     d3                         0.02% (145072, 290123, 8, 4)
$022d3428             clr.l     d5                         0.02% (145072, 308666, 2724, 0)
$022d342a             lea.l     $22d3364(pc),a2            0.02% (145072, 577579, 15, 1)
$022d342e             lea.l     _memset,a4                 0.02% (145072, 870426, 25, 5)
$022d3434             movea.l   (a5)+,a6                   0.50% (4642304, 16439633, 1995, 3620598)
$022d3436             tst.l     a6                         0.50% (4642304, 8261466, 120, 25)
$022d3438             beq.b     $22d3482                   0.50% (4642304, 10316942, 152646, 36)
[...]
$022d3482             cmp.l     a5,d2                      0.50% (4642304, 9136750, 184, 37)
$022d3484             bne.b     $22d3434                   0.50% (4642304, 9291175, 3615, 24)
$022d3486             pea.l     $2c(sp)                    0.02% (145072, 1015359, 24, 8)
$022d348a             jsr       ~StackLock()               0.02% (145072, 1887882, 145322, 7)
$022d3490             addq.l    #$4,sp                     0.02% (145072, 145143, 39, 0)
$022d3492             move.l    d5,d0                      0.02% (145072, 290091, 0, 0)
$022d3494             movem.l   (sp)+,d2-d7/a2-a6          0.02% (145072, 3776591, 83, 1014068)
$022d3498             addq.l    #$8,sp                     0.02% (145072, 145060, 4, 1)
$022d349a             rts                                  0.02% (145072, 1305879, 145228, 145049)
First obvious thing is that although the use case is same, in latter case mixCallback() is called 145072 vs 18960 times = 7.7x more often.

Although the most noticeable rendering functions in the profile are still called about the same number of times:
  • drawString(): 7748 vs. 8096 = 0.96x
  • drawChar(): 359942 vs. 376787 = 0.96x
  • c2p1x1_8_rect_start(): 5626936 vs. 5760568 = 0.98x
  • c2p1x1_8_rect_pix16(): 5626496 vs.5760128 = 0.98x
Whereas in the Putt-Putt demo intro case, mixer callback is still called about same number of times in both cases (as are e.g. byleRLEDecode() and drawStripToScreen() functions from its profile).

Which explains why mixCallback() overhead increased radically in silent scrolling case although memset() calls went away, and why I did not see that same overhead in Putt-Putt case.
mikro wrote: Tue Sep 26, 2023 8:08 am Or to put it another way, how big performance decrease we are talking about? By definition, there has to be a few cycles per loop added as I did add the check but certainly it shouldn't be something to worry about.

EDIT: Sorry, overlooked your previous message. Ok, makes sense then. So the conclusion is that the memset() skipped did help for silence scenes, right?
Yes. but now the big question is how mixer callback can then end up being called nearly 8x more often in silence case after the optimization?

Optimizing audio handling does not help that much if it just gets called much more often compared to screen updates... :-)
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3449
Joined: Sun Jul 31, 2011 1:11 pm

Re: ScummVM/Falcon060 pre-release

Post by Eero Tamminen »

mikro wrote: Tue Sep 26, 2023 8:08 am Btw, about the locks... is it really ~2.5% of the all cycles while the game is running?
It's not just any locks, but StackLock constructor & destructor (+ NullMutexInternal called by it), which take (almost) 5% of all cycles (in this SCUMM engine game).

Here are things calling the constructor most (and eventually also its destructor):
stacklock-callers.png
mikro wrote: Tue Sep 26, 2023 8:08 am I want to avoid another awkward situation when re-proposing the patch. ;-)
I've now checked from profile disassembly that there's no extra included to their numbers.

It looks like this:

Code: Select all

Common::StackLock::StackLock(Common::Mutex const&, char const*):
$0233987c             move.l    a2,-(sp)                   0.18% (3127477, 12510216, 585, 0)
$0233987e             movea.l   $8(sp),a2                  0.18% (3127477, 3131669, 214, 3126744)
$02339882             move.l    $10(sp),d0                 0.18% (3127477, 12517957, 312, 3126176)
$02339886             movea.l   $c(sp),a0                  0.18% (3127477, 33378583, 2982074, 3127053)
$0233988a             movea.l   (a0),a0                    0.18% (3127477, 29626099, 538, 2820310)
$0233988c             move.l    a0,(a2)                    0.18% (3127477, 12953140, 915, 0)
$0233988e             move.l    d0,$4(a2)                  0.18% (3127477, 12514439, 914, 0)
$02339892             beq.b     $23398ac                   0.18% (3127477, 53496619, 5945102, 0)
[...]
$023398ac             movea.l   (a0),a1                    0.18% (3127477, 21020932, 757, 890874)
$023398ae             move.l    a0,$8(sp)                  0.18% (3127477, 19655475, 816, 0)
$023398b2             movea.l   $8(a1),a1                  0.18% (3127477, 16197351, 306, 1493335)
$023398b6             movea.l   (sp)+,a2                   0.18% (3127477, 5307762, 72, 2694256)
$023398b8             jmp       (a1)                       0.18% (3127477, 18616417, 1803540, 0)
Common::StackLock::~StackLock():
$023398ba             move.l    a2,-(sp)                   0.18% (3127476, 12509775, 420, 0)
$023398bc             movea.l   $8(sp),a2                  0.18% (3127476, 3132925, 202, 3126807)
$023398c0             move.l    $4(a2),d0                  0.18% (3127476, 15106326, 470, 2748060)
$023398c4             beq.b     $23398dc                   0.18% (3127476, 56198519, 6245699, 0)
[...]
$023398dc             movea.l   (a2),a0                    0.18% (3127476, 3332576, 648, 3095042)
$023398de             movea.l   (a0),a1                    0.18% (3127476, 9008435, 156, 2733342)
$023398e0             move.l    a0,$8(sp)                  0.18% (3127476, 21498403, 1146, 0)
$023398e4             movea.l   $c(a1),a1                  0.18% (3127476, 8699766, 943, 2430725)
$023398e8             movea.l   (sp)+,a2                   0.18% (3127476, 5646171, 84, 3116083)
$023398ea             jmp       (a1)                       0.18% (3127476, 25813588, 2610452, 0)
[...]
And:

Code: Select all

NullMutexInternal::lock():
$01039758             moveq     #$1,d0                     0.18% (3142495, 4467094, 102, 0)
$0103975a             rts                                  0.18% (3142495, 36476853, 3684187, 2722613)
NullMutexInternal::unlock():
$0103975c             moveq     #$1,d0                     0.18% (3142494, 4306027, 82, 0)
$0103975e             rts                                  0.18% (3142494, 16148454, 1405732, 3140582)
NullMutexInternal::~NullMutexInternal():
$01039760             bra.l     $23805a8                   0.00% (147, 1568, 119, 0)
[...]
operator delete(void*):
$023805a8             bra.l     $239d6b8                   0.00% (10293, 189275, 19648, 0)
[...]
free:
$0239d6b8             movem.l   d2-d6/a2-a6,-(sp)          0.00% (41566, 1617201, 41100, 0)
... huge amount of code...
You do not have the required permissions to view the files attached to this post.
mikro
Hardware Guru
Hardware Guru
Posts: 3708
Joined: Sat Sep 10, 2005 11:11 am
Location: Kosice, Slovakia
Contact:

Re: ScummVM/Falcon060 pre-release

Post by mikro »

To me it looks completely natural: basically nothing is happening (UI is static, nothing is animated nor played...) so the mainloop is just calling functions one by one and returning from them. Since they take really a tiny amount of cycles, even dwarfs like Common::StackLock::lock() gets space because now they matter! :) But nevertheless, with memset() gone, mixCallback is taking way less time and therefore it allows the mainloop to call it more often.

Sounds pretty reasonable, doesn't it?

EDIT: Oops, the lock is from a game but nevertheless, I think the reasoning stays even for that case.

EDIT2: Now as I'm thinking about it, we have to be careful about this. While I can imagine that the memset optimisation *is* beneficial (i.e. in a game where nobody is speaking nor music is playing), in this specific case (being in the launcher) it's pointless. It really doesn't matter whether we are clearing ~10 KB of data while *doing nothing*. As soon as real business starts, this callback gets called maybe once per frame.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3449
Joined: Sun Jul 31, 2011 1:11 pm

Re: ScummVM/Falcon060 pre-release

Post by Eero Tamminen »

Mixer callback
mikro wrote: Tue Sep 26, 2023 10:35 pm To me it looks completely natural: basically nothing is happening (UI is static, nothing is animated nor played...) so the mainloop is just calling functions one by one and returning from them. Since they take really a tiny amount of cycles, even dwarfs like Common::StackLock::lock() gets space because now they matter! :) But nevertheless, with memset() gone, mixCallback is taking way less time and therefore it allows the mainloop to call it more often.

Sounds pretty reasonable, doesn't it?
Yes.

I kept the arrow key pressed down >90% of the profile time, so in a sense it "should not" have been idle. But as the total time for going up and down the list did not improve noticeably... I guess performance was bound by key repeat rate, rather than ScummVM rendering, so main loop would had idle time for running run mixer callback more.

StackLock
mikro wrote: Tue Sep 26, 2023 10:35 pm EDIT: Oops, the lock is from a game but nevertheless, I think the reasoning stays even for that case.
From the callgraph in my previous post, you can see that mixer callback was responsible only for <5% of the StackLock() cost / calls. While most of the calls were from audio handling, there was also ~20% of calls directly from SCUMM engine.

"Idle" mainloop optimizations
mikro wrote: Tue Sep 26, 2023 10:35 pm EDIT2: Now as I'm thinking about it, we have to be careful about this. While I can imagine that the memset optimisation *is* beneficial (i.e. in a game where nobody is speaking nor music is playing), in this specific case (being in the launcher) it's pointless. It really doesn't matter whether we are clearing ~10 KB of data while *doing nothing*. As soon as real business starts, this callback gets called maybe once per frame.
Optimizing idle mainloop could improve its responsiveness to user events. So IMHO as long as they do not regress active usage, it makes sense to do trivial optimizations on it too.

Active parts profiling

As I've mentioned earlier, in Bad Mood case I had per-frame profiling (separately for DSP and CPU side), which automatically found and profiled the slowest frame while game was either playing a demo recording, or played manually.

Although in ScummVM case, engine activity, audio handling and final screen rendering are more disconnected, and there are no real "update frames" that could be measured... Are there any parts of code which could contribute most into getting given frame on screen, where that activity can differ significantly between frames (e.g. in number of objects updated, their size etc) and it would make sense to profile them separately?

Or any other parts, where it would make sense to get more exact profiles, instead of these averages over tens of secs or minutes (needed when starting/stopping things manually), that I've been doing so far?

PS. This would mean you finding / telling me names of appropriate symbols on which profiling should start and stop, i.e. ones that are called only once at start, and once at end of thing you want profiled, and not otherwise. One can also add new symbols to the code (function, label, whatever) as markers for the profiler.
mikro
Hardware Guru
Hardware Guru
Posts: 3708
Joined: Sat Sep 10, 2005 11:11 am
Location: Kosice, Slovakia
Contact:

Re: ScummVM/Falcon060 pre-release

Post by mikro »

I think the current approach is good enough. Because that's what ScummVM is: a tool which allows you to play games (simplified: to replay sprite positions) and we want to discover bottlenecks during gameplay. So if there's a sudden spike which goes away within 1 second, I don't care (in FPS like BadMood you do care because it causes a sudden jump / jerkiness in gameplay). If audio playback is stuttering all the time, that's what I care about because it makes the game unplayable. Or like Gobliiins, if loading takes 3 minutes, it's something to ask you for profile data (and we succeeded).

So no worries, my point was the opposite: so we always keep in mind that if somethings takes 20% of all spent CPU cycles but in reality it means 1% of VBL time because in absolute numbers it's just a few cycles, then we can leave it alone and focus on more interesting things.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3449
Joined: Sun Jul 31, 2011 1:11 pm

Re: ScummVM/Falcon060 pre-release

Post by Eero Tamminen »

mikro wrote: Wed Sep 27, 2023 8:59 am Or like Gobliiins, if loading takes 3 minutes, it's something to ask you for profile data (and we succeeded).
Yes, for things taking minutes, and having clear start and end points, manual profiling is fine.
mikro wrote: Wed Sep 27, 2023 8:59 am I think the current approach is good enough. Because that's what ScummVM is: a tool which allows you to play games (simplified: to replay sprite positions) and we want to discover bottlenecks during gameplay. So if there's a sudden spike which goes away within 1 second, I don't care (in FPS like BadMood you do care because it causes a sudden jump / jerkiness in gameplay). If audio playback is stuttering all the time, that's what I care about because it makes the game unplayable.
The problem with manual profiling is there's no way to find out whether something "goes away within 1 second".

For manual profiling one needs some clear visual indicator when to start and stop profiling, and the total profile time can easily be off 1-2 seconds, due to reaction times, and sometimes due to fast-forwarding too (if it takes long time to get to right place). Meaning that error can be 10% even for 20s profile, and it may include parts that are not relevant (skew things), which makes it impossible to reliably verify any of the normal optimizations. Optimizations would need to have a huge impact to show up in such a short profile. So for somewhat reliable results on normal optimizations, manual profile would need to be for something taking minutes (and having visually clear start & end points).

Whereas profile started & ended on a symbol breakpoint has no such problems. It can track as short activities as one wants...
mikro wrote: Wed Sep 27, 2023 8:59 am So no worries, my point was the opposite: so we always keep in mind that if somethings takes 20% of all spent CPU cycles but in reality it means 1% of VBL time because in absolute numbers it's just a few cycles, then we can leave it alone and focus on more interesting things.
...On top of which there's no way to measure ScummVM VBL time, to find whether something means 1% of it. :-)
mikro
Hardware Guru
Hardware Guru
Posts: 3708
Joined: Sat Sep 10, 2005 11:11 am
Location: Kosice, Slovakia
Contact:

Re: ScummVM/Falcon060 pre-release

Post by mikro »

There's only one place I can think of to measure optimisations: the main loop start / end. I.e. you know you are at a good position (like start of gameplay) and then you get profile data between those two points. But I don't think it is *that* much different from just going there and profile everything. :) Perhaps key is just starting profiling when something "bad" starts to happen, not from the start of scummvm.

But as I said, this is all good enough. Huge optimisations is what we are after, so far everything you have thrown at me made sense, i.e. it made scummvm visibly faster (startup time, audio on TT and in general (infamous "rate.o"), UI refresh, loading times, ... all of those would be very hard to find without profiling).
mikro
Hardware Guru
Hardware Guru
Posts: 3708
Joined: Sat Sep 10, 2005 11:11 am
Location: Kosice, Slovakia
Contact:

Re: ScummVM/Falcon060 pre-release

Post by mikro »

What's up, fellas?

To celebrate exactly one month of ScummVM inactivity :-) I present you latest builds:

https://mikro.naprvyraz.sk/private/scummvm-fat.zip
https://mikro.naprvyraz.sk/private/scummvm-slim.zip

Main news:

- updated to latest master (imagine my shock when I found out that keyboard is completely dead; in the end it was an upstream bug so I had to revert one commit in the meantime)
- in-house and optimised downsampling on the TT => it should be visibly faster than when letting STFA do the job
- you still need a sound XBIOS, though: STFA or X-SOUND will do (both available on https://mikrosk.github.io/xbios; no need to install and/or setup any CPX) or ... wait for it ... EmuTOS! Yes, having EmuTOS is all that's needed for playing back samples as it has X-SOUND built in!
- minor silence fix

Not much, I know but hey, the release date is nearing so we no big changes are expected. Eero, if / when you find time and mood, feel free to compare some of the games with sampled playback: the external dependency (STFA) should basically disappear in your profile stats.

I haven't tested it on my TT (installing some goodies) but when everything will be done (hopefully this weekend), quite a few games should run smoother thanks to the upgrades installed.
stormy
Atari God
Atari God
Posts: 1494
Joined: Tue Jan 26, 2016 12:39 pm

Re: ScummVM/Falcon060 pre-release

Post by stormy »

Thanks mikro. Sorry for asking a redundant question that can be quite possible found within the many pages of this thread, but could you please explain the fat/slim differences again? Thanks.
mikro
Hardware Guru
Hardware Guru
Posts: 3708
Joined: Sat Sep 10, 2005 11:11 am
Location: Kosice, Slovakia
Contact:

Re: ScummVM/Falcon060 pre-release

Post by mikro »

stormy wrote: Fri Oct 27, 2023 12:18 pm Thanks mikro. Sorry for asking a redundant question that can be quite possible found within the many pages of this thread, but could you please explain the fat/slim differences again? Thanks.
Nothing beats the readme.txt: https://github.com/mikrosk/scummvm/blob ... #L346-L366 ;-)
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3449
Joined: Sun Jul 31, 2011 1:11 pm

Re: ScummVM/Falcon060 pre-release

Post by Eero Tamminen »

TT sound
mikro wrote: Thu Oct 26, 2023 11:51 pm - in-house and optimised downsampling on the TT => it should be visibly faster than when letting STFA do the job
- you still need a sound XBIOS, though: STFA or X-SOUND will do (both available on https://mikrosk.github.io/xbios; no need to install and/or setup any CPX) or ... wait for it ... EmuTOS! Yes, having EmuTOS is all that's needed for playing back samples as it has X-SOUND built in!
- minor silence fix
Not much, I know but hey, the release date is nearing so we no big changes are expected.
Verified. Sound works now on (emulated) TT when using just EmuTOS and no extra sound SW, which is very nice!

Previous slim / TT profiles
mikro wrote: Thu Oct 26, 2023 11:51 pm Eero, if / when you find time and mood, feel free to compare some of the games with sampled playback: the external dependency (STFA) should basically disappear in your profile stats.

I haven't tested it on my TT (installing some goodies) but when everything will be done (hopefully this weekend), quite a few games should run smoother thanks to the upgrades installed.
I found only 4 TT / slim version profiles from this thread: Last two are probably most interesting to compare?

Note: they all were done with old GCC, so there will also be a GCC difference.

Issues

There's very clear difference in ITE intro sound quality between (emulated) Falcon and TT with the the slim version. On TT, ScummVM seems to mixing white noise to the output compared to Falcon version. Is such difference expected? (This was when using just EmuTOS.)

Not sure whether you're interested, but "direct" mode did not render Putt-putt screen updates correctly:
grab0004.png
You do not have the required permissions to view the files attached to this post.
mikro
Hardware Guru
Hardware Guru
Posts: 3708
Joined: Sat Sep 10, 2005 11:11 am
Location: Kosice, Slovakia
Contact:

Re: ScummVM/Falcon060 pre-release

Post by mikro »

Whether it's slim or not doesn't matter in this case, audio routines are the same. So for speedup comparison you can use also the fat one.

Direct mode issues are yeah, expected, nothing to do about it (at least not in short term).

As for the ITE ... did it happen also with STFA's emulation? (I mean in the older versions, not the latest one) My formula for downsampling is quite simple, all samples are mixed to -0x8000 ~ 0x7fff and I just take the high byte. So it's possible that it results in something wrong (?) but I'd like to see whether STFA is doing a better job there.
User avatar
Estrayk
Captain Atari
Captain Atari
Posts: 342
Joined: Mon Nov 23, 2015 2:52 pm
Location: Spain

Re: ScummVM/Falcon060 pre-release

Post by Estrayk »

I tested today last version and seems all works like a charm.
Just a warning before load the menu.

Warning: generateZipSet: Could not find 'gui-icons.dat'
You do not have the required permissions to view the files attached to this post.
・Falcon ct60e・Atari MegaSTE ・Atari STe ・
mikro
Hardware Guru
Hardware Guru
Posts: 3708
Joined: Sat Sep 10, 2005 11:11 am
Location: Kosice, Slovakia
Contact:

Re: ScummVM/Falcon060 pre-release

Post by mikro »

Yeah, that warning is fine. It would just mindlessly load a file with icons which would never get used. It's intentionally removed to speed up loading.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3449
Joined: Sun Jul 31, 2011 1:11 pm

Re: ScummVM/Falcon060 pre-release

Post by Eero Tamminen »

ITE sound

For some reason I could not reproduce the white noise issue any more.

Unlike with STFA, there's no echo in first part of ITE intro when using just EmuTOS.

But I noticed that volume is clearly lower on (emulated) TT volume than on Falcon (both with and without STFA).

ITE intro

Previous fat/Falcon & slim/TT profiles: viewtopic.php?p=449745#p449745

2-3% faster than the previous TT profile with STFA (and 6% slower than previous Falcon profile):

Code: Select all

Time spent in profile = 953.40450s.
...
Used cycles:
  13.61%  13.69%  15.79%  416373992641866052234829474278   Audio::RateConverter_Impl<true, true, false>::convert(Audio::AudioStream&, short*, unsigned int, unsigned short, unsigned short)
  10.78%  10.84%  20.85%  329793342133149722446376706990   Audio::Oki_ADPCMStream::readBuffer(short*, int)
  10.73%                  3282275429                       c2p1x1_8_tt_start
   8.38%                  2563426553                       c2p1x1_8_tt_pix16
   7.74%   7.79%  28.79%  236870521623816732678807912834   Audio::RateConverter_Impl<false, true, false>::convert(Audio::AudioStream&, short*, unsigned int, unsigned short, unsigned short)
   7.65%   7.69%   7.80%  233977775223521636002384520175   Saga::Sprite::drawOccluded(Common::Array<Saga::SpriteInfo>&, unsigned int, Common::Point const&, int, int)
   6.08%   6.11%   6.11%  185979017218697304801869892088   Saga::DefaultFont::blitGlyph(Common::Point const&, unsigned char const*, int, int, int, unsigned char)
   5.21%   5.24%   5.24%  159418002116028207461602959987   Audio::Oki_ADPCMStream::decodeOKI(unsigned char)
   3.92%                  1199682790                       copy256
   3.43%   3.45%   3.45%  104850383510542194721054311484   Saga::Surface::transitionDissolve(unsigned char const*, Common::Rect const&, int, double)
   2.58%   2.59%   5.66%   789095947 7931963361731682392   Common::MemoryReadStream::read(void*, unsigned int)
   2.37%                   725679618                       ROM_TOS
   1.66%   1.67%   2.51%   508463723 511210016 767733314   Saga::Actor::findActorPath(Saga::ActorData*, Common::Point const&, Common::Point const&)
   1.44%   1.45%   2.22%   441592342 443976387 678579242   Audio::RawStream<2, false, true>::readBuffer(short*, int)
   1.38%                   422597294                       common2
   0.96%                   294108946                       AtariMixerManager::update() [clone .part.0]
   0.74%   0.38%   4.30%   227510533 1150902521315572432 * Saga::DefaultFont::outFont(Saga::FontStyle const&, char const*, unsigned long, Common::Point const&, int, Saga::FontEffectFlags)
   0.64%   0.64%   0.64%   196174843 197251086 197268152   Common::MemoryReadStream::eos() const
   0.61%   3.33%  49.27%   186634639101788818515071720437   OSystem_Atari::update()
...
Instruction cache misses:
  22.57%  22.83%  24.67%   177991990 180037763 194603761   Audio::RateConverter_Impl<true, true, false>::convert(Audio::AudioStream&, short*, unsigned int, unsigned short, unsigned short)
  21.67%  21.87%  38.25%   170930851 172465632 301682736   Audio::Oki_ADPCMStream::readBuffer(short*, int)
  10.66%  10.81%  49.09%    84107539  85257507 387186379   Audio::RateConverter_Impl<false, true, false>::convert(Audio::AudioStream&, short*, unsigned int, unsigned short, unsigned short)
...
Visits/calls:
  16.57%            30971747             c2p1x1_8_tt_start
  16.57%            30963357             c2p1x1_8_tt_pix16
   4.87%   4.87%     9102800   9103005   Common::MemoryReadStream::pos() const
   4.87%   4.87%     9102683   9103054   Common::MemoryReadStream::eos() const
   4.87%   4.87%     9093445   9096420   Audio::Oki_ADPCMStream::decodeOKI(unsigned char)
   4.59%             8569734             common
(Earlier 17% for "rate.o" + 8 % for "TT_RAM" (STFA) changed to about same percentages for 2x "Audio::RateConverter_Impl()".)

See callgraph:
ite-intro-cycles.png
ITE play

Some random demo game play shows isometric tile drawing to be still most expensive:

Code: Select all

Time spent in profile = 504.05183s.
...
Used cycles:
  20.70%  21.06%  31.49%  334710138634058113165092176580   Saga::IsoMap::drawTile(unsigned short, Common::Point const&, Saga::Location const*)
  12.54%  12.61%  14.68%  202802729620400573592373935242   Audio::RateConverter_Impl<true, true, false>::convert(Audio::AudioStream&, short*, unsigned int, unsigned short, unsigned short)
   8.00%                  1294536008                       c2p1x1_8_tt_start
   6.25%                  1011194135                       c2p1x1_8_tt_pix16
   4.12%   4.14%   8.15%   665890229 6694643371318273385   Audio::Oki_ADPCMStream::readBuffer(short*, int)
   3.68%                   595201234                       common2
   2.94%   2.96%   4.00%   475576899 478292090 647158102   Saga::Actor::findActorPath(Saga::ActorData*, Common::Point const&, Common::Point const&)
   2.56%   2.58%  10.73%   414421596 4168035551735399271   Audio::RateConverter_Impl<false, true, false>::convert(Audio::AudioStream&, short*, unsigned int, unsigned short, unsigned short)
   2.30%   2.31%   2.31%   371715083 373913956 373955600   Saga::DefaultFont::blitGlyph(Common::Point const&, unsigned char const*, int, int, int, unsigned char)
   2.20%   2.21%   4.11%   355599156 357566270 664067048   Saga::IsoMap::drawSpritePlatform(unsigned short, Common::Point const&, Saga::Location const&, short, short, short)
   2.20%                   355183061                       copy256
   2.14%   2.16%   2.16%   346812942 348743765 348776277   Audio::Oki_ADPCMStream::decodeOKI(unsigned char)
   2.05%                   331597631                       ROM_TOS
   1.95%                   315161151                       scopy
   1.41%   1.42%  31.01%   227525459 2288802215015469265   Saga::IsoMap::drawPlatform(unsigned short, Common::Point const&, short, short, short)
   1.34%   1.35%   2.84%   216521698 217702054 459837691   Common::MemoryReadStream::read(void*, unsigned int)
   1.32%   1.33%   5.43%   213388157 214473727 878728812   Saga::IsoMap::drawSpriteMetaTile(unsigned short, Common::Point const&, Saga::Location&, short, short)
   1.21%   1.22%   2.03%   196322600 197434763 327515735   Audio::RawStream<2, false, true>::readBuffer(short*, int)
   1.14%  13.06%  13.07%   18411464921127348912112961356   memmove
...
Instruction cache misses:
  26.22%  26.83%  39.90%   129746009 132759099 197435075   Saga::IsoMap::drawTile(unsigned short, Common::Point const&, Saga::Location const*)
  17.52%  17.74%  19.29%    86712146  87787340  95439507   Audio::RateConverter_Impl<true, true, false>::convert(Audio::AudioStream&, short*, unsigned int, unsigned short, unsigned short)
   7.15%   7.22%  12.63%    35390262  35708975  62475617   Audio::Oki_ADPCMStream::readBuffer(short*, int)
You do not have the required permissions to view the files attached to this post.
mikro
Hardware Guru
Hardware Guru
Posts: 3708
Joined: Sat Sep 10, 2005 11:11 am
Location: Kosice, Slovakia
Contact:

Re: ScummVM/Falcon060 pre-release

Post by mikro »

Good, good. And that's a good observation with the volume, I wouldn't have noticed.

That poses an interesting dilemma: right now the downsampling code relies on the fact that I want to basically shift 16-bit values into 8-bit ones, i.e. taking the upper byte only. If I were to fix the volume, I would need to replace the shift from >> 8 to something like 7 or 6, losing the optimisation trick which works only for 8-bit shifts.

Before I start working on an asm version of the mixer (yeah, it's coming) I got an interesting idea: many of those sampled effects (sometimes also music, if it's not a direct rip from CD but native compressed/lower quality soundtrack) are sampled with frequencies like 11 kHz (to save space on floppy disks, naturally) while our mixer just sets 25 kHz once and that's it. Sometimes they are even mono and it is the mixer which 'expands' it to stereo even though the mixer can handle mono playback.

So I have been thinking that I could reinitialise our mixer for every new game, not only once: I'd check which frequency was requested (same of mono/stereo) and set the lowest acceptable one for our backend (11 kHz -> 12.5 kHz).

Of course, this makes sense only if one doesn't set 12.5 or even 6 kHz by default because of performance reasons, then the whole thing is pointless because it would piss of the user that it's increasing the frequency and introduces stuttering. I could implement it only one way (switching from higher to lower frequency only) but still, maybe it's not so useful as it sounded in my head. ;)

Another thing would be: what if the channels have different frequencies? So I'd set 11 kHz mono and suddenly a CD quality 44100 stereo wav file starts playing, I'd need to reinit everything. And then what if the music suddenly stops? Do I keep the frequency for the future? Or reinit again? Dilemmas, dilemmas... maybe the sanest approach would be letting the user to specify the frequency (and maybe mono/stereo) per game in scummvm.ini with some debug outputs to get an idea what playback parameters the game asks for.

One thing I'd like to also add audio-wise is detection of a DSP external clock, that would allow selection of native "PC" frequencies like 44100 (/1, /2, /4, ...) which could lead to faster resampling code (there is a code path to detect whether the src/dst frequencies aren't multiplies of each other).

Maybe this would make sense -- you, as the user, can't decide whether it's better to have 11025 (PC), 12500 (STE/TT) or 12292 Hz (Falcon) as the mixing frequency but software could make that decision for you.

I even got an evil idea: introducing an option which would PRETEND to be the right frequency (requested: 11025, set in reality: 12500) so it would always take the faster code path with some small inaccuracy. That would kill so many cycles in the mixer within a blink of eye.
User avatar
shoggoth
Nature
Nature
Posts: 1340
Joined: Tue Aug 01, 2006 9:21 am
Location: Halmstad, Sweden
Contact:

Re: ScummVM/Falcon060 pre-release

Post by shoggoth »

mikro wrote: Mon Oct 30, 2023 8:47 amI even got an evil idea: introducing an option which would PRETEND to be the right frequency (requested: 11025, set in reality: 12500) so it would always take the faster code path with some small inaccuracy. That would kill so many cycles in the mixer within a blink of eye.
Any chance you could resample sound assets and keep them in some kind of cache? Evil enough? :)
Ain't no space like PeP-space.
medmed
Captain Atari
Captain Atari
Posts: 439
Joined: Sat Apr 02, 2011 5:06 am
Location: France, Paris

Re: ScummVM/Falcon060 pre-release

Post by medmed »

I don't know the impact on performances but there's zitaresample, a realtime up/down sampling library. I used it to sync sound frames with originals video fps. It downsample 44.1Khz to 25Khz with a good rendered quality. Here a work load example.
May be the load will be negligible.

https://tho-otto.de/download/mint/zita- ... dev.tar.xz
M.Medour - 1040STF, Mega STE + Spektrum card, Milan 040 + S3Video + ES1371.
mikro
Hardware Guru
Hardware Guru
Posts: 3708
Joined: Sat Sep 10, 2005 11:11 am
Location: Kosice, Slovakia
Contact:

Re: ScummVM/Falcon060 pre-release

Post by mikro »

Whether its real-time or not, that doesn't matter that much, it's a burden. AFAIK ScummVM has a decent resampler, I doubt a new one would gain much.

Also, I don't want to preprocess anything: that way I could basically take the path of ScummST (god I how envy that guy) - i.e. select a few chosen ones and forget the rest, forget the UI, scanning, initialising, ... forget all the annoying things. ;)

Of course, I could create an elaborate cache like cache/<engine id>/<game id>/<sample filename> but that would open another can of worms: generating tens of megabytes of data, taking a long time, detecting whether user hasn't changed the desired frequency and recalculate... I think it's easier just to cheat. ;)
elliot
Captain Atari
Captain Atari
Posts: 257
Joined: Tue Mar 17, 2009 2:00 pm

Re: ScummVM/Falcon060 pre-release

Post by elliot »

Okay, just a thought and I admit I am not familiar with it all and prob do not understand the problem BUT could you not "normalise" all the samples at the start? i.e. convert all to 12.5khz for example so you do not need to resample later? Or does it not work like that?
Falcon with CT60 in rack mountable case. Two STFMs, one upgraded lots. My original STE from when I was a teen with Switchable TOS, 1.44Mb drive, 4MB RAM, Supra Hard Drive and very very yellow case. Mega STE with (currently none working) Crazy Dots 2. Atari 2600 and a Jag. And a mountain of commercial software and lots of hardware addons.
Post Reply

Return to “News & Announcements”