• Re: tcl versa python regarding performance

    From aotto1968@21:1/5 to Gerald Lester on Thu Aug 15 15:04:10 2024
    On 14.08.24 14:04, Gerald Lester wrote:
    On 8/14/24 04:16, aotto1968 wrote:
    the first analysis is quite simple:

    right now python does NOT support threads in NHI1 (will change soon) and tcl does…
    this affects the "release" build, which is NHI1 without threads in python and with
    threads in tcl.

    → the difference is that the thread-local-storage is a STATIC REFERENCE in python and a POINTER in tcl.

    → the "aggressive" build does NOT use threads at all, so the comparison between python and tcl is more
    meaningful, but the gap is still ~20%


    I think the point that androwish was making is that, without seeing the code, we cannot tell whether you did something in a
    way that takes more time than doing it slightly differently.


    I use kcachegrind to profile the performance, but there are a lot of "small" points that add up to the ~20% loss against python.

    I cannot post a "picture" because the newsgroup does NOT accept pictures …
    most of the code is in the TCL-C-Api, for example:

    Example: my "ServiceCall" function. At the end of a service call I use:

    if (ret == TCL_OK) {
        Tcl_ResetResult(interp);
        return MkErrorGetCode_0E();
    }

    and this simple "Tcl_ResetResult" eats 0,8% of the total performance → this is 75% of my "ServiceCall" performance.

    not trivial, it seems that the Python people with a lot of “manpower” have already MAXIMIZED the optimization of Python.

    If I step into Tcl_ResetResult, the highlights are:

    % of total performance -> function name
    1,09% -> ResetObjectResult
    0,54% -> FreeByteArrayInternalRep (this object is variable size, around ~1000 bytes)

    mfg ao

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aotto1968@21:1/5 to All on Thu Aug 15 20:43:28 2024
    On 15.08.24 20:27, aotto1968 wrote:

    To be more precise, I am adding an image that shows the differences, TCL versus PYTHON, for the example wrapper function

    ReadI8

    This is from the debugging environment with tcl/py & extension compiled in debug mode.

    https://i.postimg.cc/NjXccdRC/performance-check-tcl-versa-python.png



    better link → with callgraph https://i.postimg.cc/TYbNKXrn/performance-check-tcl-versa-python.png

  • From aotto1968@21:1/5 to All on Thu Aug 15 21:09:40 2024
    even better resolution: https://i.postimg.cc/wvpJV4QC/performance-check-tcl-versa-python.png

  • From aotto1968@21:1/5 to All on Thu Aug 15 20:27:50 2024
    To be more precise, I am adding an image that shows the differences, TCL versus PYTHON, for the example wrapper function

    ReadI8

    This is from the debugging environment with tcl/py & extension compiled in debug mode.

    https://i.postimg.cc/NjXccdRC/performance-check-tcl-versa-python.png

  • From aotto1968@21:1/5 to All on Thu Aug 15 21:18:29 2024
    Too bad that I cannot EDIT a news message that has already been posted… the problem is that "postimage"
    changes the resolution of the image.

    I am switching to good old Facebook to post this screenshot and will wait for your comments.

    https://www.facebook.com/share/p/wihmQPR4pBRacLLF/

  • From aotto1968@21:1/5 to All on Thu Aug 15 23:48:37 2024
    a short conclusion from Facebook …

    "If you analyze the C lib wrapper for MqReadI8, the TCL code adds about 200% wrapper load and the PYTHON code adds about 10%
    wrapper load." (ref: https://www.facebook.com/share/p/wihmQPR4pBRacLLF/)

    → I think TCL has a "performance-problem".

  • From aotto1968@21:1/5 to saito on Fri Aug 16 07:34:15 2024
    On 16.08.24 01:39, saito wrote:
    On 8/15/2024 3:09 PM, aotto1968 wrote:

    even better resolution: https://i.postimg.cc/wvpJV4QC/performance-check-tcl-versa-python.png

    Very nice screenshots. Is this some sort of debugger?

    Assuming that you wrote both tcl and python versions and that they both wrap the same core library, wouldn't the call trees look
    the same or at least bear resemblance?


    Yes, both the TCL and PYTHON extensions are wrappers for the same library and the TOOL for writing both wrappers is the NHI1/ALC
    (All-Language-Compiler), that is why both wrappers look similar.

    the profiling setup has two parts
    1) valgrind --tool=callgrind --quiet ... your sw → creates callgrind.out.*
    2) kcachegrind callgrind.out.* → creates the view

  • From Christian Gollwitzer@21:1/5 to All on Fri Aug 16 09:12:52 2024
    Am 15.08.24 um 23:48 schrieb aotto1968:
    a short conclusion from Facebook …

    "If you analyze the C lib wrapper for MqReadI8, the TCL code adds about
    200% wrapper load and the PYTHON code adds about 10% wrapper load."
    (ref: https://www.facebook.com/share/p/wihmQPR4pBRacLLF/)

    → I think TCL has a "performance-problem".

    I won't solve the problem, just to say: It's impossible to help you with
    this, because you don't explain:
    * who wrote this wrapper
    * where to find the code
    * what benchmark are you running

    It could be, e.g. that your benchmark code introduces shimmering and
    then there's lots of conversion going on. It might be something
    completely different. Or it might be that Tcl is indeed slower than
    Python (in most of my comparisons, it was the opposite - unless you
    offload work to external libraries).

    Regards,

    Christian

  • From undroidwish@21:1/5 to Christian Gollwitzer on Fri Aug 16 10:41:32 2024
    On 8/16/24 09:12, Christian Gollwitzer wrote:
    Am 15.08.24 um 23:48 schrieb aotto1968:

    ...
    → I think TCL has a "performance-problem".

    I won't solve the problem, just to say: It's impossible to help you with this, because you don't explain:
    * who wrote this wrapper
    * where to find the code
    * what benchmark are you running
    ...

    +1

    PS: Philosophically, the perpetual perception of performance problems
    is inherent to human design (and possibly inextricable even).

  • From aotto1968@21:1/5 to Christian Gollwitzer on Fri Aug 16 11:44:26 2024
    1) just the "stupid" Tcl_ObjectGetMetadata call to retrieve the pointer associated with an oo-object costs 1/3 of the wrapper
    performance → the whole header of a tcl OO wrapper costs more than everything else in the wrapper.

    if you look into the code, it is a hash-table lookup !!!
    in python it is a ZERO-time operation

    2) just to create an INT-object from an integer, TCL always creates an object from scratch, including malloc etc.
    python uses a table of pre-allocated objects for small integers, as a ZERO-time operation

    3) the set/reset-result has to free all the (stupid) objects, which adds an additional 1/3 of the wrapper cost


    analysis:

    the Tcl_ObjectGetMetadata lookup is clearly a design error
    the missing small-int-object pre-alloc is a programmer-laziness error
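
The small-integer point in 2) can be observed from the Python side. The sketch below is only an illustration and not part of the NHI1 wrapper code: `same_object` is a made-up helper, and the pre-allocated range of -5 to 256 is a CPython implementation detail, not a language guarantee. It parses the same text twice at run time and checks whether both results are the same object.

```python
import sys


def same_object(text: str) -> bool:
    """Parse text twice at run time and report whether both results are one object."""
    first = int(text)
    second = int(text)
    return first is second


# CPython keeps pre-allocated int objects for -5 .. 256 (implementation detail),
# so both parses hand back the same object without a new allocation.
print(sys.implementation.name, same_object("7"), same_object("256"))

# Outside that range each int() call normally builds a fresh object;
# the result is printed but deliberately not asserted.
print(same_object("100000"))
```

This says nothing about how much time the Tcl side spends in allocation; that would have to come from the callgrind data.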


    if someone can set up a screen-sharing session then I can explain the problem in more detail
    ( I need to test the screen-sharing first because I am not used to it )

  • From aotto1968@21:1/5 to All on Fri Aug 16 22:16:25 2024
    I spent some time on research and further optimization ... but ...

    One thing seems clear: "lang-Python" with AGGRESSIVE optimization is NOT far from "lang-C" speed.
    With aggressive optimization, Python is built with profile-guided optimization (--enable-optimizations), and that WITH
    threads, which, unlike in TCL, CANNOT be disabled in Python. Also, the runtime library is statically linked into Python.

    TCL with aggressive optimization also uses the static runtime library BUT no threads.

    → for updates check the picture in the comment. https://www.facebook.com/share/p/WYmfnRWybY1Sh42f/

    summary for aggressive …

    .../perf-aggressive/inst/sbin/c/x86_64-suse-linux-gnu-perfclient --timeout 2 --send --sec 4 @
    .../perf-aggressive/inst/sbin/c/x86_64-suse-linux-gnu-perfserver
    :PerfClientExec }: start ------------------------ : result [ count / sec ]
    :statistics }: --send : 403779.6 [ 1615234 / 4.000286 ]
    :PerfClientExec }: end: ----------------------------------------

    .../perf-aggressive/inst/sbin/c/x86_64-suse-linux-gnu-perfclient --timeout 2 --send --sec 4 @ $PYTHON
    .../perf-aggressive/inst/sbin/py/x86_64-suse-linux-gnu-perfserver.py
    :PerfClientExec }: start ------------------------ : result [ count / sec ]
    :statistics }: --send : 311506.7 [ 1246216 / 4.000608 ]
    :PerfClientExec }: end: ----------------------------------------

    .../perf-aggressive/inst/sbin/c/x86_64-suse-linux-gnu-perfclient --timeout 2 --send --sec 4 @ $TCLSH
    .../perf-aggressive/inst/sbin/tcl/x86_64-suse-linux-gnu-perfserver.tcl
    :PerfClientExec }: start ------------------------ : result [ count / sec ]
    :statistics }: --send : 227151.4 [ 908663 / 4.000253 ]
    :PerfClientExec }: end: ----------------------------------------
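
As a rough cross-check of the three results above, the following Python sketch turns the reported rates into ratios and an average time per call. Only the three rates are taken from the output above; the dictionary keys and the helper names `relative` and `usec_per_call` are made up for illustration.

```python
# Rates in calls per second, copied from the "--send" lines of the aggressive build.
rates = {"c": 403779.6, "python": 311506.7, "tcl": 227151.4}


def relative(rate: float, baseline: float) -> float:
    """Return rate as a fraction of baseline."""
    return rate / baseline


def usec_per_call(rate: float) -> float:
    """Average wall-clock time per call in microseconds."""
    return 1e6 / rate


for name, rate in rates.items():
    print(f"{name:7s} {relative(rate, rates['c']):6.1%} of C, "
          f"{usec_per_call(rate):.2f} us per call")

tcl_vs_python = relative(rates["tcl"], rates["python"])
print(f"tcl reaches {tcl_vs_python:.1%} of the python rate")
```

With these numbers the Tcl server reaches about 73% of the Python rate, which is roughly 1.2 microseconds more per call on average.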

  • From aotto1968@21:1/5 to saito on Fri Aug 16 22:19:27 2024
    On 16.08.24 20:29, saito wrote:
    What I meant was that the two images look very different. I can't make out what the boxes say, but nevertheless one is wide and
    shallow, the other narrow and deep. So this may not be an apples-to-apples comparison. As has been noted already, shimmering may
    play a role here or extra levels of abstraction via extra proc calls may skew the results in one language vs. the other.



    these two pictures are generated by the tool and not by me…
    the TCL picture is so "wide" because TCL has a lot of "overhead"
    the python picture is so "narrow" because PYTHON has much less "overhead".

  • From undroidwish@21:1/5 to All on Sat Aug 17 06:27:22 2024
    On 8/16/24 22:19, aotto1968 wrote:

    ...
    these two pictures are generated by the tool and not by me…
    the TCL picture is so "wide" because the TCL uses a lot of "overhead"
    the python picture is so "narrow" because PYTHON uses much less
    "overhead".

    Hmm, so here we are:

    a) you complain about Tcl's bad performance
    b) you seem to be unwilling to disclose enough information about the
    Python and Tcl implementations in order to get the big picture and
    see the cause of differences and to try to discuss improvements
    with you
    c) due to b) you continue to complain about Tcl's bad performance

    Not quite a fruitful cycle.

  • From aotto1968@21:1/5 to undroidwish on Sat Aug 17 07:28:59 2024
    I am "not" complaining about TCL's bad performance; I am just mentioning that PYTHON has done much more work on performance than TCL.

    Whether you get 300.000 transactions per second (PYTHON) or 200.000 transactions
    per second (TCL) only matters to someone who needs this difference.

  • From undroidwish@21:1/5 to All on Sat Aug 17 09:59:41 2024
    On 8/17/24 07:28, aotto1968 wrote:

    ...
    I'm "not" complain about TCL bad performance I just mention that PYTHON has done much more work on performance than TCL.

    Fine, then please elaborate on this claim. What exactly did Python
    do better, or do more of, in terms of performance? Any pointers welcome.

    If you have 300.000 transaction per second (PYTHON) or 200.000 transaction per second (TCL) is just an case for someone who need this difference.

    Indeed, this could be a reason to ask whether there are better ways of using
    the Tcl framework in order to get the Tcl implementation on par with
    the Python one. As stated many times before, discussing this on c.l.t.
    will require that you provide more implementation details.
