C compiler benchmarking

C and PASCAL (or any other high-level languages) in here please

Moderators: simonsunnyboy, Mug UK, Zorro 2, Moderator Team

ThorstenOtto
Fuji Shaped Bastard
Posts: 3329
Joined: Sun Aug 03, 2014 5:54 pm

Re: C compiler benchmarking

Post by ThorstenOtto »

I'm having a bit of trouble analyzing the profile data. What I've done is:
  • add start_prof()/stop_prof() calls (functions do nothing, and are only there to set breakpoints)
  • run hatari --machine st --cpuclock 8, using emutos
  • drop into emucon
  • enter hatari debugger with shortcut
  • set breakpoint:
    • b pc = TEXT :once
  • continue emulation
  • run coremark from emucon. Hatari stops at program start
  • set breakpoints:
    • b PC = _start_prof
    • b PC = _stop_prof
  • continue emulation
  • when hatari hits start_prof:
    • profile on
    • continue emulation
  • when hatari hits stop_prof:
    • profile save coremark.txt
But I can't figure out how to interpret the text file. Running

Code: Select all

hatari_profile.py -r coremark.sym coremark.txt
gives an error message:

Code: Select all

Parsing profile information from coremark.txt...
parsing disassembly...
parsing call info...
ERROR: unrecognized line 11:
        '000388ec 4e75                     rts  == $0003a272                  0.00% (1, 16, 0, 0)'!
Attached is a zip with all the relevant files. Any help maybe? Hatari is version 2.4.1 compiled from git
czietz
Hardware Guru
Posts: 2734
Joined: Tue May 24, 2016 6:47 pm

Re: C compiler benchmarking

Post by czietz »

IIRC you need to use the (recently removed) external disassembler, not the UAE one, for the profile post-processing to work. But I often find the raw output from "profile save" very helpful, too: you can easily spot hot instructions there.
medmed
Atari Super Hero
Posts: 985
Joined: Sat Apr 02, 2011 5:06 am
Location: France, Paris

Re: C compiler benchmarking

Post by medmed »

medmed wrote: Sat Sep 03, 2022 9:58 pm Many thanks guys - I'll try myself asap.
Hi,

I've made some tests with a sample 720p MP4 video:
st_mp4_gcc4 = 14fps
st_mp4_gcc9 = 21fps (libs recompiled with gcc9 : openh264 + fdk-aac + libyuv + zita-resampler)
M.Medour - 1040STF, Mega STE + Spektrum card, Milan 040 + S3Video + ES1371.
Eero Tamminen
Fuji Shaped Bastard
Posts: 3899
Joined: Sun Jul 31, 2011 1:11 pm

Re: C compiler benchmarking

Post by Eero Tamminen »

ThorstenOtto wrote: Sun Sep 11, 2022 6:23 am I'm having a bit of trouble analyzing the profile data. What I've done is
...
Looks good. Note that the whole chain can be fully automated with chained breakpoints.
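For the record, here's a sketch of how that chaining could look, assuming Hatari's ":file" breakpoint option (which runs debugger commands from a file when the breakpoint triggers); the .ini file names are made up for this example:

```
# contents of start.ini (hypothetical name), run when _start_prof is hit:
profile on

# contents of stop.ini, run when _stop_prof is hit:
profile save coremark.txt

# breakpoints, set once at program start:
b pc = _start_prof :file start.ini
b pc = _stop_prof :file stop.ini
```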
ThorstenOtto wrote: Sun Sep 11, 2022 6:23 am But I can't figure out how to interpret the text file. Running

Code: Select all

hatari_profile.py -r coremark.sym coremark.txt
gives an error message:
Because having multiple disassemblers with multiple different disassembling options is a pain, the post-processor expects Hatari to tell it how to parse the disassembly. That's done with the "Field regexp:" header at the start of the saved profile.

However, the Hatari debugger does not really support providing correct regexps for all these disassembling options, and the only one I've tested extensively is the old external disassembler (which Thomas removed after the v2.4.1 release).

That's easy to fix manually, though, because the regexp needs to locate only two things: the address at the start and the profiling counters at the end, ignoring everything between them. Just change the header to the following, and parsing works fine:

Code: Select all

Field regexp:   ^([0-9a-f]+) .*% \((.*)\)$
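As a quick sanity check that this regexp accepts the line the post-processor choked on, it can be tested with plain shell tools (nothing Hatari-specific here; the line is copied from the error message above):

```shell
# the line hatari_profile.py reported as unrecognized
line='000388ec 4e75                     rts  == $0003a272                  0.00% (1, 16, 0, 0)'

# the replacement Field regexp: address at the start, counters at the end,
# everything in between ignored
echo "$line" | grep -E '^([0-9a-f]+) .*% \((.*)\)$'

# roughly what the post-processor extracts from the two groups:
echo "$line" | sed -E 's/^([0-9a-f]+) .*% \((.*)\)$/addr=\1 counters=\2/'
# prints: addr=000388ec counters=1, 16, 0, 0
```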
ThorstenOtto wrote: Sun Sep 11, 2022 6:23 am Attached is a zip with all the relevant files. Any help maybe? Hatari is version 2.4.1 compiled from git
With the regexp fixed, the post-processor gives:

Code: Select all

$ hatari_profile.py -stp -r coremark.sym coremark.txt

Hatari profile data processor

Parsing TEXT relative symbol address information from coremark.sym...
WARNING: replacing '__ieeefixdsb' at 0xaa88 with '__ieeefixdsl'
WARNING: replacing '__ieeefixdsl' at 0xaa88 with '__ieeefixdsw'
133 lines with 109 code symbols/addresses parsed, 0 unknown.

Parsing profile information from coremark.txt...
parsing disassembly...
parsing call info...
2472 lines processed with 13 functions.
Ignoring 7 switches to _crcu8
Ignoring 5 switches to _calc_func
Ignoring 5 switches to _cmp_complex
Ignoring 3 switches to _cmp_idx
Ignoring 17 switches to _core_state_transition
Of all 43266 switches, ignored 37 for type(s) ['r', 'u', 'x'].

CPU profile information from 'coremark.txt':
- Hatari v2.4.1

Time spent in profile = 17.05560s.

Visits/calls:
- max = 20480, in _core_state_transition at 0x3b33e, on line 1572
- 44973 in total
Executed instructions:
- max = 122400, in _core_bench_list+180 at 0x38efe, on line 228
- 12591042 in total
Used cycles:
- max = 2827164, in _matrix_test+1350 at 0x3aecc, on line 1434
- 136807208 in total

Visits/calls:
  45.50%  45.50%       20463     20463   _core_state_transition
  25.96%  25.96%       11673     11673   _crcu8
   9.90%  71.40%        4453     32111   _calc_func
   9.24%   9.24%        4155      4155   _cmp_idx
   4.95%  75.59%        2224     33996   _cmp_complex
   1.90%                 853             __ieeefltsd
   1.90%                 853             ROM_TOS

Executed instructions:
  35.03%  35.62%  36.25%     4410120   4485323   4564839   _matrix_test
  31.85%  32.26%  32.26%     4010400   4061262   4061262   _core_state_transition
  14.05%  14.23%  99.91%     1769260   1791289  12580130   _core_bench_list
  11.32%  11.42%  11.42%     1424693   1438118   1438118   _crcu8
   2.31%   2.38%  39.64%      290640    299046   4990674   _core_bench_state
   2.03%   2.06%  81.24%      255869    259123  10228347   _core_list_mergesort
   1.40%                      176610                       ROM_TOS

Used cycles:
  41.79%  42.68%  43.11%    57165656  58391820  58976716   _matrix_test
  29.88%  30.50%  30.50%    40871644  41720412  41720412   _core_state_transition
  11.65%  11.91%  99.94%    15938196  16294936 136724564   _core_bench_list
   7.49%   7.66%   7.66%    10248836  10477244  10477244   _crcu8
   2.79%   2.87%  36.72%     3815892   3930364  50242108   _core_bench_state
   2.10%                     2878276                       ROM_TOS
   1.78%   1.81%  85.05%     2430048   2482864 116351800   _core_list_mergesort
   1.19%   1.22%  81.89%     1623872   1662320 112034228   _calc_func
M68k CPU cycle usage callgraph:

Generated with:

Code: Select all

$ hatari_profile.py -stp -r coremark.sym -g --compact coremark.txt
$ dot -Tpng -Gsize=16 coremark-2.dot > coremark.png

Looks like this:
coremark.png
Note: for interactive viewing of all 3 generated callgraphs, use "xdot" and drop the "--compact" option.

PS. Your description of the issue, and providing all the data, was super!

(I often despair at how little info people give about their issues; trying to get all the relevant details often feels like trying to pull their teeth out by hand.)
ThorstenOtto
Fuji Shaped Bastard
Posts: 3329
Joined: Sun Aug 03, 2014 5:54 pm

Re: C compiler benchmarking

Post by ThorstenOtto »

Many thanks! Just changing the disassembler option to uae in the config file did the trick, too (that one needs to be changed IMHO in ./src/debug/profilecpu.c if HAVE_CAPSTONE_M68K is not defined).

BTW, here are the results for the Pure-C executable:

Code: Select all

Parsing TEXT relative symbol address information from coremark.sym...
WARNING: replacing 'Start' at 0x0 with '__text'
WARNING: replacing '_xlcnv' at 0x31d6 with '_xwcnv'
217 lines with 174 code symbols/addresses parsed, 0 unknown.

Parsing profile information from coremark.txt...
parsing disassembly...
parsing call info...
1426 lines processed with 31 functions.
Ignoring 1 switches to cmp_complex
Ignoring 1 switches to core_list_find
Ignoring 1 switches to core_list_reverse
Ignoring 5 switches to ee_isdigit
Ignoring 1 switches to core_state_transition
Ignoring 1 switches to crcu8
Ignoring 1 switches to crcu16
Ignoring 2 switches to crcu32
Ignoring 1 switches to crc16
Ignoring 36 switches to _ulmul
Ignoring 26 switches to _lmul
Of all 376147 switches, ignored 76 for type(s) ['r', 'u', 'x'].

CPU profile information from 'coremark.txt':
- Hatari v2.4.1

Time spent in profile = 12.31478s.

Visits/calls:
- max = 239760, in _ulmul at 0x3c19e, on line 1039
- 377380 in total
Executed instructions:
- max = 239760, in _ulmul at 0x3c19e, on line 1040
- 10043153 in total
Used cycles:
- max = 10608232, in _ulmul+16 at 0x3c1ae, on line 1048
- 98779916 in total

Visits/calls:
  63.52%  63.52%      239724    239724   _ulmul
  17.16%  17.16%       64774     64774   _lmul
  10.39%  10.39%       39195     39195   ee_isdigit
   2.71%  13.10%       10239     49430   core_state_transition
   1.55%   1.55%        5839      5839   crcu8

Executed instructions:
  23.87%  24.19%  24.19%     2397600   2429700   2429700   _ulmul
  11.73%  11.90%  17.64%     1178160   1194791   1772002   core_state_transition
  11.71%  11.83%  11.83%     1176480   1188561   1188561   _lmul
  10.90%  11.07%  28.49%     1094840   1111322   2860878   matrix_mul_matrix_bitextract
   8.72%   8.80%   8.80%      875676    883870    883870   crcu8
   8.58%   8.71%  23.19%      861560    874676   2328859   matrix_mul_matrix
   5.69%   5.75%   5.75%      571040    577271    577271   ee_isdigit
   4.43%   4.49%   4.49%      445180    451363    451363   core_list_find
   3.82%   3.86%   3.86%      383520    387622    387622   core_list_reverse
   2.56%   2.59%   3.89%      256660    260118    391043   matrix_sum
   1.32%   1.34%  87.26%      132514    134300   8763776   core_list_mergesort
   1.27%                      127265                       ROM_TOS
   1.14%   1.15%  23.03%      114000    115664   2312847   core_bench_state

Used cycles:
  24.37%  24.91%  24.91%    24076848  24603520  24603520   _ulmul
  12.29%  12.56%  29.97%    12141384  12406764  29607476   matrix_mul_matrix_bitextract
  12.28%  12.55%  17.61%    12129288  12399820  17392368   core_state_transition
  10.75%  10.96%  10.96%    10621536  10825192  10825192   _lmul
   9.45%   9.67%  24.00%     9337476   9550272  23709228   matrix_mul_matrix
   6.44%   6.58%   6.58%     6362832   6497868   6497868   crcu8
   4.95%   5.05%   5.05%     4890864   4993060   4993060   ee_isdigit
   4.61%   4.71%   4.71%     4549000   4651076   4651076   core_list_find
   3.14%   3.21%   3.21%     3106244   3174784   3174784   core_list_reverse
   2.65%   2.70%   4.04%     2614904   2671064   3991768   matrix_sum
   2.10%                     2074484                       ROM_TOS
   1.27%   1.29%  22.18%     1251272   1277972  21912232   core_bench_state
   1.19%   1.22%  88.60%     1177728   1206192  87518764   core_list_mergesort
   
They are a bit difficult to compare, though. A lot of time is spent in ulmul/lmul, which is the 32*32 bit multiplication helper function for Pure-C, but vbcc does not call a function for this and generates the code inline. Other functions also seem to be inlined by vbcc, because of the -O3 optimization level.
Eero Tamminen
Fuji Shaped Bastard
Posts: 3899
Joined: Sun Jul 31, 2011 1:11 pm

Re: C compiler benchmarking

Post by Eero Tamminen »

ThorstenOtto wrote: Mon Sep 12, 2022 1:18 pm Many thanks! Just changing the disassembler option to uae in the config file did the trick, too (that one needs to be changed IMHO in ./src/debug/profilecpu.c if HAVE_CAPSTONE_M68K is not defined).
Good catch. I've changed that regexp in Hatari git to be more generic, so that the same one can be used regardless of which disassembler is configured / built-in / used in Hatari.
ThorstenOtto wrote: Mon Sep 12, 2022 1:18 pm BTW, here are the results for the Pure-C executable:

Code: Select all

...
Visits/calls:
  63.52%  63.52%      239724    239724   _ulmul
  17.16%  17.16%       64774     64774   _lmul
  10.39%  10.39%       39195     39195   ee_isdigit
   2.71%  13.10%       10239     49430   core_state_transition
...
Used cycles:
  24.37%  24.91%  24.91%    24076848  24603520  24603520   _ulmul
  12.29%  12.56%  29.97%    12141384  12406764  29607476   matrix_mul_matrix_bitextract
  12.28%  12.55%  17.61%    12129288  12399820  17392368   core_state_transition
  10.75%  10.96%  10.96%    10621536  10825192  10825192   _lmul
   9.45%   9.67%  24.00%     9337476   9550272  23709228   matrix_mul_matrix
   6.44%   6.58%   6.58%     6362832   6497868   6497868   crcu8
...
They are a bit difficult to compare, though. A lot of time is spent in ulmul/lmul, which is the 32*32 bit multiplication helper function for Pure-C, but vbcc does not call a function for this and generates the code inline. Other functions also seem to be inlined by vbcc, because of the -O3 optimization level.
One option could be to iteratively change the code, creating intermediate functions that encapsulate the differences found between compilers (e.g. things needing multiplication helpers), in the hope that those are easier to compare across compilers. Such changes could themselves trigger perf changes, which could also be of interest.

Because this code structure is currently pretty simple (compared e.g. to things like the Doom engine in Bad Mood, the Linux kernel, or EmuTOS bootup), just reading the profile-annotated assembly output could be most instructive (as Christian already commented). From the instruction counts one can see how many times each of the code branches was executed, etc.
ThorstenOtto
Fuji Shaped Bastard
Posts: 3329
Joined: Sun Aug 03, 2014 5:54 pm

Re: C compiler benchmarking

Post by ThorstenOtto »

Just tried to mark all the functions as noinline for vbcc, but that didn't work very well:

Code: Select all

Visits/calls:
   6.31%              122384             l610
   5.44%              105563             l622
   4.72%               91495             l450
   4.72%               91484             l494
   4.04%               78388             l436
   4.04%               78385             l441
   4.04%               78369             l438
   4.04%   4.04%       78348     78348   l432
   3.76%               72939             l440
   3.45%               66862             l437
   3.01%               58310             l875
   3.01%               58308             l930
   3.01%               58289             l601
   2.89%               56123             l602
   1.53%               29591             l476
   1.35%               26234             l470
   1.34%               25916             l307
   1.34%               25909             l791
   1.33%               25755             l478
   1.07%               20794             l472
   1.06%               20475             l623
   1.06%   5.09%       20467     98759   _core_state_transition

Executed instructions:
  17.21%                     2274480                       l930
   9.27%                     1224720                       l875
   6.77%                      893920                       l450
   5.56%                      734400                       l610
   4.16%                      549120                       l494
   3.74%                      494460                       l602
   2.79%  24.76%  34.38%      368640   3271880   4542818   _core_state_transition
   2.71%                      358080                       l441
   2.61%                      344960                       l622
   2.61%                      344380                       l601
   2.55%   9.63%   9.63%      336640   1271834   1271834   l432
   1.78%                      235200                       l436
   1.39%                      184203                       ROM_TOS
   1.38%                      181760                       l478
   1.27%                      167680                       l438
   1.19%  10.88%  10.88%      157350   1437128   1437128   _crcu8
   1.13%                      149920                       l472
   1.09%                      143840                       l452
   1.09%                      143520                       l791
   1.02%   1.04%  11.91%      134320    136989   1573382   _crcu16

Used cycles:
  22.22%                    31748000                       l930
  11.31%                    16157864                       l875
   4.43%                     6325220                       l450
   4.42%  23.05%  32.48%     6320940  32935196  46409484   _core_state_transition
   4.11%                     5869568                       l610
   3.23%                     4622600                       l602
   3.18%   9.44%   9.44%     4536924  13483696  13483696   l432
   2.84%                     4060200                       l494
   2.64%                     3768916                       l436
   2.57%                     3672064                       l601
   2.23%                     3187004                       l622
   2.10%                     3001844                       ROM_TOS
   1.90%                     2708292                       l623
   1.84%                     2622340                       l441
   1.24%                     1770540                       l478
   1.14%                     1635256                       l438
   1.13%                     1609348                       l767
   1.11%                     1583416                       l821
   1.06%   1.09%   8.41%     1520704   1556064  12012188   _crcu16
   1.03%                     1468740                       l472
   1.00%                     1435872                       l708
   1.00%                     1435740                       l706
Note that all those l* symbols are local labels generated by the compiler, which are filtered out neither by gst2ascii nor by hatari_profile.py (gcc uses a .L prefix IIRC).

Iterations/sec from this run was 1.12, so slightly less than without that hack, but still much better than the value for Pure-C. Hm.

PS: while running that test, I got an "invalid free()", followed by an abort when the program returned to emucon. Maybe some double free when freeing the profile data?
Eero Tamminen
Fuji Shaped Bastard
Posts: 3899
Joined: Sun Jul 31, 2011 1:11 pm

Re: C compiler benchmarking

Post by Eero Tamminen »

ThorstenOtto wrote: Tue Sep 13, 2022 12:39 pm Just tried to mark all the functions as noinline for vbcc, but that didn't work very well:

Code: Select all

Visits/calls:
   6.31%              122384             l610
   5.44%              105563             l622
 ... 
Note that all those l* symbols are local labels generated by the compiler, which are neither sorted out by gst2ascii, nor by hatari_profile.py (gcc uses a .L prefix IIRC)
Losing the symbol names for real, named functions sounds like a VBCC bug. To avoid the profiler tracking unwanted symbols, you could use gst2ascii to write the symbols to a text file, remove the unwanted ones from it (e.g. with "grep -v -E ' l[0-9]+'"), and provide the filtered symbols file to both the profiler and the profile post-processor.
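A sketch of that filtering; the symbols-file contents below are invented for illustration (the real file would come from gst2ascii), but the grep pattern is the one above:

```shell
# stand-in symbols file; the (address, type, name) format is assumed here
cat > coremark.sym <<'EOF'
000388ec T _crcu8
00038f00 T l610
00039000 T _core_state_transition
00039100 T l622
EOF

# drop the compiler-generated local labels before handing the file
# to the profiler and to hatari_profile.py
grep -v -E ' l[0-9]+' coremark.sym > coremark-filtered.sym
cat coremark-filtered.sym
# prints:
# 000388ec T _crcu8
# 00039000 T _core_state_transition
```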
ThorstenOtto wrote: Tue Sep 13, 2022 12:39 pm Iterations/sec from this run was 1.12, so slightly less than without that hack, but still much better than the value for Pure-C. Hm.

PS.: while running that test, i got an "invalid free()", followed by an abort when the program returns to emucon. Maybe some double free when freeing the profile data?
If you could run (non-stripped) Hatari under Valgrind or Gdb, or build Hatari with AddressSanitizer support ("cmake -D ENABLE_ASAN:BOOL=1"), and mail a backtrace to the devel list, that would be great!
ThorstenOtto
Fuji Shaped Bastard
Posts: 3329
Joined: Sun Aug 03, 2014 5:54 pm

Re: C compiler benchmarking

Post by ThorstenOtto »

Eero Tamminen wrote: Tue Sep 13, 2022 10:08 pm Losing the symbol names for real, named functions sounds like a VBCC bug.
The symbols for the functions are still there. It rather looks like the profiler used some of the l* symbols as function entry points instead.
Eero Tamminen wrote: Tue Sep 13, 2022 10:08 pm If you could run (non-stripped) Hatari under Valgrind or Gdb, or build Hatari with AddressSanitizer support ("cmake -D ENABLE_ASAN:BOOL=1"), and mail a backtrace to the devel list, that would be great!
Done. I just wonder why I didn't notice that in my first tries, but recompiling with the original settings now crashes too. Maybe I just quit the debugger instead of returning to emulation, dunno.

Anyway, after stripping the local symbols, I now get this (compiled with vbcc, but with all functions marked as noinline):

Code: Select all

Time spent in profile = 17.81431s.

Visits/calls:
- max = 20480, in _core_state_transition at 0x3a9a2, on line 1434
- 66571 in total
Executed instructions:
- max = 122400, in _core_list_reverse+10 at 0x395ea, on line 691
- 13212719 in total
Used cycles:
- max = 2827472, in _matrix_mul_matrix+84 at 0x3a87e, on line 1309
- 142892996 in total

Visits/calls:
  30.75%  30.75%       20473     20473   _core_state_transition
  17.53%  17.53%       11673     11673   _crcu8
   8.77%  26.29%        5837     17504   _crcu16
   7.87%  31.46%        5239     20946   _crc16
   6.69%  61.22%        4453     40758   _calc_func
   6.24%   6.24%        4152      4152   _cmp_idx
   6.19%   6.19%        4119      4119   _core_list_find
   6.13%   6.13%        4078      4078   _core_list_reverse
   3.35%  64.56%        2227     42981   _cmp_complex
   1.92%  17.28%        1279     11506   _crcu32
   1.34%                 892             __etext
   1.34%                 892             ROM_TOS

Executed instructions:
  33.94%  34.40%  34.40%     4484960   4544809   4544809   _core_state_transition
  17.82%  18.15%  18.15%     2354240   2397540   2397540   _matrix_mul_matrix_bitextract
  10.78%  10.88%  10.88%     1424693   1437162   1437162   _crcu8
   9.87%  10.04%  10.04%     1304480   1326566   1326566   _matrix_mul_matrix
   6.83%   6.92%   6.92%      902700    913825    913825   _core_list_find
   5.74%   5.80%   5.80%      758880    766757    766757   _core_list_reverse
   2.75%   2.78%   2.78%      363720    367294    367294   _matrix_sum
   1.94%   1.96%  81.29%      255869    259436  10740612   _core_list_mergesort
   1.84%   1.89%  42.01%      243760    249269   5550200   _core_bench_state
   1.39%                      184132                       ROM_TOS
   1.04%   1.05%  99.91%      136760    139046  13201382   _core_bench_list
   1.02%   1.04%  11.91%      134320    137464   1573891   _crcu16

Used cycles:
  31.80%  32.49%  32.49%    45446436  46427060  46427060   _core_state_transition
  22.66%  23.16%  23.16%    32382612  33087032  33087032   _matrix_mul_matrix_bitextract
  11.75%  12.01%  12.01%    16792500  17154720  17154720   _matrix_mul_matrix
   7.17%   7.32%   7.32%    10247228  10459912  10459912   _crcu8
   6.36%   6.49%   6.49%     9088236   9274068   9274068   _core_list_find
   4.26%   4.35%   4.35%     6081824   6213076   6213076   _core_list_reverse
   2.31%   2.37%  39.01%     3300420   3385632  55749356   _core_bench_state
   2.10%                     3001236                       ROM_TOS
   2.07%   2.11%   2.11%     2954436   3014320   3014320   _matrix_sum
   1.70%   1.74%  84.57%     2429772   2486960 120844200   _core_list_mergesort
   1.17%   1.20%   1.20%     1677568   1716032   1716032   _matrix_mul_const
   1.17%   1.19%   1.19%     1665860   1705020   1705020   _matrix_mul_vect
   1.06%   1.10%   8.41%     1521524   1565340  12019972   _crcu16
   1.02%   1.04%  99.94%     1456692   1488304 142805696   _core_bench_list
Still difficult to compare to the Pure-C version, since I don't know how often lmul was called there, and from which functions.
Eero Tamminen
Fuji Shaped Bastard
Posts: 3899
Joined: Sun Jul 31, 2011 1:11 pm

Re: C compiler benchmarking

Post by Eero Tamminen »

ThorstenOtto wrote: Wed Sep 14, 2022 11:57 am The symbols for the functions are still there. It rather looks like the profiler used some of the l* symbols as function entry points instead.
The profiler tracks all addresses that have a symbol, as symbols are how you tell the profiler under which memory addresses you want to collect the statistics.

(Having symbols for loop addresses is something you do not want for profiling, as it can slow down profiling significantly, and the results are easy to misinterpret.)
ThorstenOtto wrote: Wed Sep 14, 2022 11:57 am Still difficult to compare to the Pure-C version, since I don't know how often lmul was called there, and from which functions.
If it has a symbol, the profiler tells you that. Just ask for the full callgraph and check it with xdot.
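Concretely, that would be the earlier commands without "--compact", plus xdot for interactive viewing (file names as in the earlier runs):

```
$ hatari_profile.py -stp -r coremark.sym -g coremark.txt
$ xdot coremark-2.dot
```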
ThorstenOtto
Fuji Shaped Bastard
Posts: 3329
Joined: Sun Aug 03, 2014 5:54 pm

Re: C compiler benchmarking

Post by ThorstenOtto »

Here are some other interesting results. This is a comparison of
a) the wolfSSL vs the OpenSSL library
b) gcc 4.6.4 vs gcc 7.5.0

All tests were run on ARAnyM this time. The benchmark I used is wolfSSL's, ported to OpenSSL.

Results for gcc 4.6.4:

Code: Select all

------------------------------------------------------------------------------
 wolfSSL version 5.5.0
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
RNG                 35 MB took 1.040 seconds,   33.654 MB/s
AES-128-CBC-enc     60 MB took 1.060 seconds,   56.604 MB/s
AES-128-CBC-dec     50 MB took 1.080 seconds,   46.296 MB/s
AES-192-CBC-enc     50 MB took 1.010 seconds,   49.505 MB/s
AES-192-CBC-dec     45 MB took 1.090 seconds,   41.284 MB/s
AES-256-CBC-enc     50 MB took 1.110 seconds,   45.045 MB/s
AES-256-CBC-dec     40 MB took 1.050 seconds,   38.095 MB/s
AES-128-GCM-enc     10 MB took 1.910 seconds,    5.236 MB/s
AES-128-GCM-dec     10 MB took 1.920 seconds,    5.208 MB/s
AES-192-GCM-enc     10 MB took 1.940 seconds,    5.155 MB/s
AES-192-GCM-dec     10 MB took 1.930 seconds,    5.181 MB/s
AES-256-GCM-enc     10 MB took 1.940 seconds,    5.155 MB/s
AES-256-GCM-dec     10 MB took 1.930 seconds,    5.181 MB/s
GMAC Default         6 MB took 1.040 seconds,    5.769 MB/s
AES-128-ECB-enc     14 MB took 1.000 seconds,   13.641 MB/s
AES-128-ECB-dec     14 MB took 1.000 seconds,   13.607 MB/s
AES-192-ECB-enc     13 MB took 1.000 seconds,   13.190 MB/s
AES-192-ECB-dec     13 MB took 1.000 seconds,   13.100 MB/s
AES-256-ECB-enc     13 MB took 1.000 seconds,   12.843 MB/s
AES-256-ECB-dec     13 MB took 1.000 seconds,   12.736 MB/s
CHACHA             160 MB took 1.010 seconds,  158.418 MB/s
CHA-POLY            20 MB took 1.230 seconds,   16.260 MB/s
3DES                20 MB took 1.250 seconds,   16.000 MB/s
MD5                335 MB took 1.010 seconds,  331.683 MB/s
POLY1305            20 MB took 1.100 seconds,   18.182 MB/s
SHA                280 MB took 1.010 seconds,  277.228 MB/s
SHA-224            105 MB took 1.040 seconds,  100.962 MB/s
SHA-256            105 MB took 1.020 seconds,  102.941 MB/s
SHA-384             15 MB took 1.400 seconds,   10.714 MB/s
SHA-512             15 MB took 1.420 seconds,   10.563 MB/s
SHA3-224            30 MB took 1.160 seconds,   25.862 MB/s
SHA3-256            25 MB took 1.040 seconds,   24.038 MB/s
SHA3-384            20 MB took 1.060 seconds,   18.868 MB/s
SHA3-512            15 MB took 1.140 seconds,   13.158 MB/s
RIPEMD             175 MB took 1.020 seconds,  171.569 MB/s
HMAC-MD5           330 MB took 1.010 seconds,  326.733 MB/s
HMAC-SHA           270 MB took 1.010 seconds,  267.327 MB/s
HMAC-SHA224        105 MB took 1.020 seconds,  102.941 MB/s
HMAC-SHA256        105 MB took 1.010 seconds,  103.960 MB/s
HMAC-SHA384         15 MB took 1.390 seconds,   10.791 MB/s
HMAC-SHA512         15 MB took 1.420 seconds,   10.563 MB/s
PBKDF2              10 KB took 1.000 seconds,   10.281 KB/s
RSA     2048 public       200 ops took 1.300 sec, avg 6.500 ms, 153.846 ops/sec
RSA     2048 private      100 ops took 38.970 sec, avg 389.700 ms, 2.566 ops/sec
DH      2048 key gen        7 ops took 1.120 sec, avg 160.000 ms, 6.250 ops/sec
DH      2048 agree        100 ops took 16.000 sec, avg 160.000 ms, 6.250 ops/sec
ECC   [      SECP256R1]   256 key gen      500 ops took 1.220 sec, avg 2.440 ms, 409.836 ops/sec
ECDHE [      SECP256R1]   256 agree        400 ops took 1.140 sec, avg 2.850 ms, 350.878 ops/sec
ED     25519 key gen      381 ops took 1.000 sec, avg 2.625 ms, 381.000 ops/sec
ED     25519 sign         400 ops took 1.100 sec, avg 2.750 ms, 363.636 ops/sec
ED     25519 verify       100 ops took 1.020 sec, avg 10.200 ms, 98.039 ops/sec
Benchmark complete

------------------------------------------------------------------------------
 OpenSSL version OpenSSL 1.1.1p  21 Jun 2022 (OpenSSL 1.1.1p  21 Jun 2022)
------------------------------------------------------------------------------
openSSL Benchmark (block bytes 1048576, min 1.0 sec each)
RNG                 60 MB took 1.040 seconds,   57.692 MB/s
AES-128-CBC-enc     85 MB took 1.050 seconds,   80.952 MB/s
AES-128-CBC-dec     85 MB took 1.030 seconds,   82.524 MB/s
AES-192-CBC-enc     75 MB took 1.070 seconds,   70.094 MB/s
AES-192-CBC-dec     75 MB took 1.060 seconds,   70.755 MB/s
AES-256-CBC-enc     65 MB took 1.060 seconds,   61.321 MB/s
AES-256-CBC-dec     70 MB took 1.080 seconds,   64.815 MB/s
AES-128-GCM-enc     80 MB took 1.020 seconds,   78.431 MB/s
AES-128-GCM-dec     85 MB took 1.040 seconds,   81.731 MB/s
AES-192-GCM-enc     70 MB took 1.020 seconds,   68.628 MB/s
AES-192-GCM-dec     75 MB took 1.060 seconds,   70.755 MB/s
AES-256-GCM-enc     65 MB took 1.080 seconds,   60.185 MB/s
AES-256-GCM-dec     65 MB took 1.020 seconds,   63.726 MB/s
AES-128-ECB-enc      1 KB took 1.070 seconds,    1.241 KB/s
AES-128-ECB-dec      1 KB took 1.010 seconds,    1.315 KB/s
AES-192-ECB-enc      1 KB took 1.010 seconds,    1.083 KB/s
AES-192-ECB-dec      1 KB took 1.030 seconds,    1.138 KB/s
AES-256-ECB-enc      1 KB took 1.060 seconds,    0.958 KB/s
AES-256-ECB-dec      1 KB took 1.020 seconds,    0.996 KB/s
CHACHA             260 MB took 1.010 seconds,  257.426 MB/s
CHA-POLY            25 MB took 1.100 seconds,   22.727 MB/s
3DES                15 MB took 1.250 seconds,   12.000 MB/s
MD5                400 MB took 1.010 seconds,  396.040 MB/s
SHA1               390 MB took 1.010 seconds,  386.139 MB/s
SHA-224            105 MB took 1.040 seconds,  100.962 MB/s
SHA-256            105 MB took 1.010 seconds,  103.960 MB/s
SHA-384             25 MB took 1.070 seconds,   23.365 MB/s
SHA-512             25 MB took 1.080 seconds,   23.148 MB/s
SHA3-224            45 MB took 1.100 seconds,   40.909 MB/s
SHA3-256            40 MB took 1.020 seconds,   39.216 MB/s
SHA3-384            35 MB took 1.140 seconds,   30.702 MB/s
SHA3-512            25 MB took 1.130 seconds,   22.124 MB/s
RIPEMD             225 MB took 1.010 seconds,  222.772 MB/s
HMAC-MD5           405 MB took 1.010 seconds,  400.990 MB/s
HMAC-SHA           395 MB took 1.020 seconds,  387.255 MB/s
HMAC-SHA224        105 MB took 1.020 seconds,  102.941 MB/s
HMAC-SHA256        105 MB took 1.010 seconds,  103.960 MB/s
HMAC-SHA384         25 MB took 1.080 seconds,   23.148 MB/s
HMAC-SHA512         25 MB took 1.070 seconds,   23.365 MB/s
PBKDF2              10 KB took 1.000 seconds,   10.156 KB/s
RSA     2048 public       200 ops took 1.170 sec, avg 5.850 ms, 170.941 ops/sec
RSA     2048 private      100 ops took 21.810 sec, avg 218.100 ms, 4.585 ops/sec
ECC   [      SECP256R1]   256 key gen      100 ops took 4.560 sec, avg 45.600 ms, 21.930 ops/sec
ECDHE [      SECP256R1]   256 agree        100 ops took 4.570 sec, avg 45.700 ms, 21.882 ops/sec
ED     25519 key gen      379 ops took 1.000 sec, avg 2.639 ms, 378.999 ops/sec
ED     25519 sign         400 ops took 1.070 sec, avg 2.675 ms, 373.834 ops/sec
ED     25519 verify       100 ops took 1.020 sec, avg 10.200 ms, 98.039 ops/sec
Benchmark complete
Results for gcc 7.5.0:

Code: Select all

------------------------------------------------------------------------------
 wolfSSL version 5.5.0
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
RNG                 45 MB took 1.110 seconds,   40.541 MB/s
AES-128-CBC-enc     50 MB took 1.020 seconds,   49.020 MB/s
AES-128-CBC-dec     45 MB took 1.070 seconds,   42.056 MB/s
AES-192-CBC-enc     45 MB took 1.050 seconds,   42.857 MB/s
AES-192-CBC-dec     40 MB took 1.070 seconds,   37.383 MB/s
AES-256-CBC-enc     40 MB took 1.060 seconds,   37.736 MB/s
AES-256-CBC-dec     35 MB took 1.060 seconds,   33.019 MB/s
AES-128-GCM-enc     10 MB took 1.870 seconds,    5.348 MB/s
AES-128-GCM-dec     10 MB took 1.860 seconds,    5.376 MB/s
AES-192-GCM-enc     10 MB took 1.880 seconds,    5.319 MB/s
AES-192-GCM-dec     10 MB took 1.910 seconds,    5.236 MB/s
AES-256-GCM-enc     10 MB took 1.930 seconds,    5.181 MB/s
AES-256-GCM-dec     10 MB took 1.900 seconds,    5.263 MB/s
GMAC Default         7 MB took 1.160 seconds,    6.034 MB/s
AES-128-ECB-enc     13 MB took 1.000 seconds,   13.401 MB/s
AES-128-ECB-dec     13 MB took 1.000 seconds,   13.273 MB/s
AES-192-ECB-enc     13 MB took 1.000 seconds,   12.927 MB/s
AES-192-ECB-dec     13 MB took 1.000 seconds,   12.750 MB/s
AES-256-ECB-enc     12 MB took 1.000 seconds,   12.402 MB/s
AES-256-ECB-dec     12 MB took 1.000 seconds,   12.458 MB/s
CHACHA             165 MB took 1.010 seconds,  163.368 MB/s
CHA-POLY            20 MB took 1.260 seconds,   15.873 MB/s
3DES                20 MB took 1.330 seconds,   15.038 MB/s
MD5                360 MB took 1.020 seconds,  352.942 MB/s
POLY1305            20 MB took 1.150 seconds,   17.391 MB/s
SHA                255 MB took 1.010 seconds,  252.475 MB/s
SHA-224            110 MB took 1.030 seconds,  106.796 MB/s
SHA-256            110 MB took 1.040 seconds,  105.769 MB/s
SHA-384             25 MB took 1.010 seconds,   24.752 MB/s
SHA-512             25 MB took 1.020 seconds,   24.510 MB/s
SHA3-224            35 MB took 1.030 seconds,   33.981 MB/s
SHA3-256            35 MB took 1.070 seconds,   32.710 MB/s
SHA3-384            30 MB took 1.190 seconds,   25.210 MB/s
SHA3-512            20 MB took 1.150 seconds,   17.391 MB/s
RIPEMD             210 MB took 1.020 seconds,  205.883 MB/s
HMAC-MD5           360 MB took 1.010 seconds,  356.436 MB/s
HMAC-SHA           260 MB took 1.010 seconds,  257.426 MB/s
HMAC-SHA224        110 MB took 1.010 seconds,  108.911 MB/s
HMAC-SHA256        110 MB took 1.010 seconds,  108.911 MB/s
HMAC-SHA384         30 MB took 1.200 seconds,   25.000 MB/s
HMAC-SHA512         30 MB took 1.200 seconds,   25.000 MB/s
PBKDF2              11 KB took 1.000 seconds,   11.281 KB/s
RSA     2048 public       200 ops took 1.360 sec, avg 6.800 ms, 147.059 ops/sec
RSA     2048 private      100 ops took 42.870 sec, avg 428.700 ms, 2.333 ops/sec
DH      2048 key gen        6 ops took 1.120 sec, avg 186.666 ms, 5.357 ops/sec
DH      2048 agree        100 ops took 18.420 sec, avg 184.200 ms, 5.429 ops/sec
ECC   [      SECP256R1]   256 key gen      400 ops took 1.050 sec, avg 2.625 ms, 380.953 ops/sec
ECDHE [      SECP256R1]   256 agree        400 ops took 1.250 sec, avg 3.125 ms, 320.000 ops/sec
ED     25519 key gen      349 ops took 1.000 sec, avg 2.865 ms, 349.000 ops/sec
ED     25519 sign         400 ops took 1.130 sec, avg 2.825 ms, 353.982 ops/sec
ED     25519 verify       100 ops took 1.110 sec, avg 11.100 ms, 90.090 ops/sec
Benchmark complete

------------------------------------------------------------------------------
 OpenSSL version OpenSSL 1.1.1p  21 Jun 2022 (OpenSSL 1.1.1p  21 Jun 2022)
------------------------------------------------------------------------------
openSSL Benchmark (block bytes 1048576, min 1.0 sec each)
RNG                 55 MB took 1.040 seconds,   52.884 MB/s
AES-128-CBC-enc     75 MB took 1.050 seconds,   71.429 MB/s
AES-128-CBC-dec     80 MB took 1.060 seconds,   75.472 MB/s
AES-192-CBC-enc     70 MB took 1.080 seconds,   64.815 MB/s
AES-192-CBC-dec     70 MB took 1.060 seconds,   66.038 MB/s
AES-256-CBC-enc     55 MB took 1.020 seconds,   53.922 MB/s
AES-256-CBC-dec     60 MB took 1.030 seconds,   58.252 MB/s
AES-128-GCM-enc     75 MB took 1.010 seconds,   74.257 MB/s
AES-128-GCM-dec     75 MB took 1.030 seconds,   72.816 MB/s
AES-192-GCM-enc     70 MB took 1.080 seconds,   64.815 MB/s
AES-192-GCM-dec     65 MB took 1.050 seconds,   61.905 MB/s
AES-256-GCM-enc     60 MB took 1.060 seconds,   56.604 MB/s
AES-256-GCM-dec     60 MB took 1.070 seconds,   56.075 MB/s
AES-128-ECB-enc      1 KB took 1.040 seconds,    1.127 KB/s
AES-128-ECB-dec      1 KB took 1.010 seconds,    1.160 KB/s
AES-192-ECB-enc      1 KB took 1.060 seconds,    0.958 KB/s
AES-192-ECB-dec      1 KB took 1.010 seconds,    1.006 KB/s
AES-256-ECB-enc    960 bytes took 1.090 seconds,  880.735 bytes/s
AES-256-ECB-dec    960 bytes took 1.050 seconds,  914.287 bytes/s
CHACHA             275 MB took 1.020 seconds,  269.608 MB/s
CHA-POLY            25 MB took 1.050 seconds,   23.810 MB/s
3DES                15 MB took 1.190 seconds,   12.605 MB/s
MD5                365 MB took 1.020 seconds,  357.843 MB/s
SHA1               305 MB took 1.010 seconds,  301.980 MB/s
SHA-224            100 MB took 1.010 seconds,   99.010 MB/s
SHA-256            100 MB took 1.020 seconds,   98.039 MB/s
SHA-384             30 MB took 1.110 seconds,   27.027 MB/s
SHA-512             30 MB took 1.110 seconds,   27.027 MB/s
SHA3-224            60 MB took 1.080 seconds,   55.556 MB/s
SHA3-256            55 MB took 1.030 seconds,   53.398 MB/s
SHA3-384            45 MB took 1.070 seconds,   42.056 MB/s
SHA3-512            30 MB took 1.020 seconds,   29.412 MB/s
RIPEMD             210 MB took 1.010 seconds,  207.921 MB/s
HMAC-MD5           365 MB took 1.020 seconds,  357.843 MB/s
HMAC-SHA           310 MB took 1.020 seconds,  303.922 MB/s
HMAC-SHA224        100 MB took 1.010 seconds,   99.010 MB/s
HMAC-SHA256        100 MB took 1.010 seconds,   99.010 MB/s
HMAC-SHA384         30 MB took 1.100 seconds,   27.273 MB/s
HMAC-SHA512         30 MB took 1.100 seconds,   27.273 MB/s
PBKDF2              10 KB took 1.000 seconds,   10.031 KB/s
RSA     2048 public       200 ops took 1.150 sec, avg 5.750 ms, 173.913 ops/sec
RSA     2048 private      100 ops took 21.730 sec, avg 217.300 ms, 4.602 ops/sec
ECC   [      SECP256R1]   256 key gen      100 ops took 4.560 sec, avg 45.600 ms, 21.930 ops/sec
ECDHE [      SECP256R1]   256 agree        100 ops took 4.600 sec, avg 46.000 ms, 21.739 ops/sec
ED     25519 key gen      386 ops took 1.000 sec, avg 2.591 ms, 385.999 ops/sec
ED     25519 sign         400 ops took 1.050 sec, avg 2.625 ms, 380.955 ops/sec
ED     25519 verify       200 ops took 1.960 sec, avg 9.800 ms, 102.041 ops/sec
Benchmark complete
As you can see, most functions perform better with gcc 7, but there are also some that are worse.
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3954
Joined: Sat Jun 30, 2012 9:33 am

Re: C compiler benchmarking

Post by dml »

Eero Tamminen wrote: Mon Sep 12, 2022 9:12 pm Because this code structure is currently pretty simple (compared e.g. to things like the Doom engine in Bad Mood, the Linux kernel, or EmuTOS bootup), just reading the profile-annotated assembly output could be most instructive (like Christian already commented). From instruction counts one can see how many times each of the code branches was executed, etc.
FWIW (and maybe a bit off-topic), I found myself hopping between [trace cpu_symbols] and [profiling] for capturing visit counts, plus a relatively deep [history] dump, as different ways to determine what was live/executing and what had executed recently, e.g. before an obscure failure. Combining all of these can be very powerful. I recently found/fixed a memory corruption issue using exactly this approach.

However... I find the Windows version of the Hatari debugger is still let down by the way logging is captured to the console - particularly when using the [trace] functions - it renders Hatari nonresponsive and unusable while the console catches up with the many line prints, and Hatari can require forced termination to get out of the loop. I think this is due to the regular fflush() calls Hatari makes after each debug message. I remember hitting the same thing a few years ago when the Falcon DSP crashed with 'illegal instruction' and Hatari sent the messages to the console in an endless loop, turning nonresponsive and needing to be killed. I looked at the code to see if it would be easy to turn these flushes off, but haven't got around to trying a locally built version as I don't have the build prerequisites set up. Having a command-line switch to suppress the fflush() calls on line-by-line events would be a very nice tweak! And yes, this is almost certainly a Windows-only thing.

Anyway I've been using this stuff again over the last week or two and it has been very useful :) And I'm keeping an eye on these GCC results too.
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3899
Joined: Sun Jul 31, 2011 1:11 pm

Re: C compiler benchmarking

Post by Eero Tamminen »

dml wrote: Sat Sep 24, 2022 12:18 pm However... I find the Windows version of the Hatari debugger is still let down by the way logging is captured to the console - particularly when using the [trace] functions - it renders Hatari nonresponsive and unusable while the console catches up with the many line prints, and Hatari can require forced termination to get out of the loop. I think this is due to the regular fflush() calls Hatari makes after each debug message. I remember hitting the same thing a few years ago when the Falcon DSP crashed with 'illegal instruction' and Hatari sent the messages to the console in an endless loop, turning nonresponsive and needing to be killed. I looked at the code to see if it would be easy to turn these flushes off, but haven't got around to trying a locally built version as I don't have the build prerequisites set up. Having a command-line switch to suppress the fflush() calls on line-by-line events would be a very nice tweak! And yes, this is almost certainly a Windows-only thing.
On Linux such output fills the terminal history so fast that it hides all the useful info about what caused the situation. I haven't noticed Hatari freezing though, so that may indeed be a Windows-specific terminal scrolling speed / buffering / pipe issue.

Assuming that the main problem is identical trace lines repeating, I coded support for (optionally) compressing them, took my time-machine for a whirl to the previous month, and posted that patch-set to the Hatari mailing list: https://listengine.tuxfamily.org/lists. ... 00124.html

Patches are relative to Hatari v2.4.1 sources, latest Git head needs some trivial changes to them.

That won't help with repeated CPU or DSP core messages, only trace messages, but hopefully that's enough?

(If not, that patch-set localizes tracing by changing it from a macro to a separate va-args function, where disabling flushing would be easier.)
ThorstenOtto
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3329
Joined: Sun Aug 03, 2014 5:54 pm

Re: C compiler benchmarking

Post by ThorstenOtto »

Eero Tamminen wrote: Sat Sep 24, 2022 3:54 pm so that may indeed be Windows specific terminal scrolling speed / buffering / pipe issue.
I agree that this is mostly Windows-related. On Windows, the output goes to the console, which is actually part of the program, so for any output/scrolling the application has to wait. On Linux, output goes to a pipe or pseudo-tty, and the actual output/scrolling is done by a different process.

Maybe that could be simulated on Windows by sending the output to a thread instead, but that assumes all such output goes through a central function rather than scattered printf() calls, and it would certainly be quite some work.

Maybe you can also try running your program from mintty (part of Cygwin); IIRC it behaves similarly in that it creates pipes to catch output from programs.
czietz
Hardware Guru
Hardware Guru
Posts: 2734
Joined: Tue May 24, 2016 6:47 pm

Re: C compiler benchmarking

Post by czietz »

ThorstenOtto wrote: Sat Sep 24, 2022 4:59 pm Maybe you can also try running your program from mintty (part of cygwin), IIRC that behaves similar in that it creates pipes to catch output from programs.
Generally, running programs with a lot of console output from mintty (which uses a pipe, as you already wrote) improves performance. Iirc, the default Windows console stdout is unbuffered, i.e. it is forcefully flushed very often, which has a significant performance impact. However, I wonder if mintty would help in Hatari's special case, as Hatari under Windows opens a separate console window (-W option).
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3899
Joined: Sun Jul 31, 2011 1:11 pm

Re: C compiler benchmarking

Post by Eero Tamminen »

czietz wrote: Sat Sep 24, 2022 5:09 pm Generally, running programs with a lot of console output from mintty (which uses a pipe, as you already wrote) improves performance. Iirc, the default Windows console stdout is unbuffered, i.e. it is forcefully flushed very often, which has a significant performance impact. However, I wonder if mintty would help in Hatari's special case, as Hatari under Windows opens a separate console window (-W option).
If there's a better way to provide terminal interaction/output on Windows than this contribution from one of Hatari's Windows users: https://git.tuxfamily.org/hatari/hatari ... /opencon.c

patches are welcome! :-)
User avatar
dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3954
Joined: Sat Jun 30, 2012 9:33 am

Re: C compiler benchmarking

Post by dml »

Thanks for all the feedback on this one. I will try to get some time to look into it more closely. If I find a solution I'll post it.

It seems I don't get notifications anymore from AF in any threads - even recent posts like this one - so if I don't respond that might be why.
ThorstenOtto
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3329
Joined: Sun Aug 03, 2014 5:54 pm

Re: C compiler benchmarking

Post by ThorstenOtto »

czietz wrote: Sat Sep 24, 2022 5:09 pm as Hatari under Windows opens a separate console window (-W option).
But only if you use -W. If you already have a console, you don't need that switch. But maybe you have to patch the executable header to make it a console program.
User avatar
paulwratt
Atari freak
Atari freak
Posts: 70
Joined: Sat Dec 27, 2008 10:16 am

Re: C compiler benchmarking

Post by paulwratt »

There is a "new" Windows console that is 50+ times faster (500x?). I watched a video with the timing comparisons, and (I think) the code was on GitHub too.

It's faster to print to a file on any system too, BTW (which might be hard to do on Windows, especially if it's error messages you are trying to capture).

EDIT: I tried to find the video, but I can't remember the title ("Windows", "console" and "terminal" swamp any modern search engine). I do remember seeing timing comparisons on piping volume and throughput. If I eventually find something I'll post here again.

---

Thanks for that SSL comparison, Thorsten. It seems the biggest negatives of gcc 7 are in the (commonly used?) SHA parts (SHA-384 & SHA-512 being 2.4x+ as slow) and the RSA parts (slightly slower). I wonder if it's possible to mix and match object files to build the fastest version for any given executable.
Last edited by paulwratt on Thu Sep 29, 2022 9:43 am, edited 5 times in total.
ThorstenOtto
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3329
Joined: Sun Aug 03, 2014 5:54 pm

Re: C compiler benchmarking

Post by ThorstenOtto »

I don't know where you see a factor of 2; there are only slight differences between gcc 4 and gcc 7. Theoretically it should be possible to mix them, but it would be quite difficult to hack the Makefiles to use different compilers for different sources.
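With GNU make it might be less painful than hand-hacking, via target-specific variables. A hypothetical sketch - compiler names and the file list are assumptions, not taken from the actual wolfSSL build, and it only works if both compilers use the same ABI/multilib settings so the objects link together:

```make
# Two cross-compilers (names are hypothetical)
CC_464 := m68k-atari-mint-gcc-4.6.4
CC_7   := m68k-atari-mint-gcc-7.5.0

# Default compiler for everything
CC := $(CC_7)

# Target-specific override: build just these objects with the
# compiler that benchmarked faster for them
chacha.o poly1305.o: CC := $(CC_464)

%.o: %.c
	$(CC) $(CFLAGS) -c $< -o $@
```

Whether the wolfSSL Makefiles are close enough to plain pattern rules for this to drop in is another question, of course.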
User avatar
wongck
Ultimate Atarian
Ultimate Atarian
Posts: 13529
Joined: Sat May 03, 2008 2:09 pm
Location: Far East
Contact:

Re: C compiler benchmarking

Post by wongck »

Thanks for the SSL benchmarks.

Unfortunately for us, for TLS 1.3 the relevant numbers will be the GCM ones only, as CBC is disallowed.
I would compare AES-256-GCM-xxx and that's just 1% between the compilers.

CHA-POLY is also widely used (like on this site - at least as reported by my Atari application)
and seems better with gcc 4.6.4.

I think I will stick with gcc 4.6.4 for now.

PS: RSA key exchange is also disallowed in TLS 1.3.
Of course, if you are still using TLS 1.2, then CBC and RSA are all still relevant.
I think we can still survive with TLS 1.2 for now (2022/2023).
My Stuff: FB/Falcon CT63 CTPCI ATI RTL8139 USB 512MB 30GB HDD CF HxC_SD/ TT030 68882 4+32MB 520MB Nova/ 520STFM 4MB Tos206 SCSI
Shared SCSI Bus:ScsiLink ethernet, 9GB HDD,SD-reader @ http://phsw.atari.org
My Atari stuff that are no longer for sale due to them over 30 years old - click here for list
ThorstenOtto
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3329
Joined: Sun Aug 03, 2014 5:54 pm

Re: C compiler benchmarking

Post by ThorstenOtto »

wongck wrote: Thu Sep 29, 2022 3:13 am For CHA-POLY is also widely used (like on this site - at least reported by my Atari application )
I think that also depends on the client doing the connection. Here, for example (Windows 10, Firefox 104.0.2), I get

Code: Select all

TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, 128-Bit-Schlüssel, TLS 1.2
mikro
Hardware Guru
Hardware Guru
Posts: 4566
Joined: Sat Sep 10, 2005 11:11 am
Location: Kosice, Slovakia
Contact:

Re: C compiler benchmarking

Post by mikro »

Indeed. It's called key exchange *negotiation* for a reason. :)
User avatar
wongck
Ultimate Atarian
Ultimate Atarian
Posts: 13529
Joined: Sat May 03, 2008 2:09 pm
Location: Far East
Contact:

Re: C compiler benchmarking

Post by wongck »

Yes. The Cha-Poly algorithm is supposed to be less demanding and is mostly used by mobile devices and such.
So it looks like the handshake negotiation between the Atari client and this server works very well :mrgreen:
My Stuff: FB/Falcon CT63 CTPCI ATI RTL8139 USB 512MB 30GB HDD CF HxC_SD/ TT030 68882 4+32MB 520MB Nova/ 520STFM 4MB Tos206 SCSI
Shared SCSI Bus:ScsiLink ethernet, 9GB HDD,SD-reader @ http://phsw.atari.org
My Atari stuff that are no longer for sale due to them over 30 years old - click here for list
User avatar
Eero Tamminen
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3899
Joined: Sun Jul 31, 2011 1:11 pm

Re: C compiler benchmarking

Post by Eero Tamminen »

Yes, it looks interesting: https://en.wikipedia.org/wiki/ChaCha20-Poly1305#Use

Can somebody provide a direct link to the algorithm source?

(So that somebody else familiar with Falcon DSP programming could evaluate how feasible it looks with DSP?)

Looking at the RFC spec does not really give a very clear picture of what a real implementation would look like:
* https://datatracker.ietf.org/doc/html/rfc7905
* https://datatracker.ietf.org/doc/html/rfc7539