• Parallel Forth on a 44 core machine

    From mhx@21:1/5 to All on Sat Aug 17 15:34:31 2024
    My refurbished HP Z840 (from 2016) is finally running iForth.

    Initially, the HP had a problem with its fan control and was
    unbearably loud. I fixed that by replacing a failed transistor
    that was being used as a temperature sensor. It's incredible how
    large, knowledgeable, and helpful the HP community is, and how
    well engineered and documented the Z series workstations are.

    The Z840 is prepared for Linux and Windows 10. Because it came
    with Windows pre-installed, I tried that first.

    Installing iForth was the easy part; some of the other tools
    (WSL2, Octave, MATLAB, VS) took quite a bit longer.

    Although the Z840 is equipped with modern 1TB Samsung SSDs, these
    are connected to the SATA interface and run at a maximum speed of
    only 500MB/s (instead of the 12GB/s we are now used to). I was
    afraid that would become a bottleneck, but for now it will do.

    Below are the results of the first experiments with iSPICE (a
    SPICE-compatible circuit simulator written in iForth that supports
    explicit parallel processing). I gave it a circuit of an SMPS with
    44 component variations. Depending on the number of allotted cores,
    iSPICE distributes the 44 jobs over the available processors and
    stores the results in text and graphical formats. As can be seen
    below, with 44 processors the tasks finish 22x faster than
    with a single core. The CPU temperatures stay below 62 deg C.

    The maximum RAM use is 64GB (this machine has 128GB).

    Disk I/O is clearly a problem to be worked on; for now I
    fake it by spacing the benchmark runs 30 seconds apart.

    Scaling with the number of processors appears to be linear
    and quite a bit better than it is on my AMD Ryzen 5800X,
    although well below the theoretical factor of 44x.

    iSPICE> .TICKER-INFO
    Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
    TICKS-GET uses os time & PROCESSOR-CLOCK 3000MHz
    Do: < n TO PROCESSOR-CLOCK RECALIBRATE >
    ok
    iSPICE> BENCHTEST
    Starting 1 process to run 44 jobs.
    Master task (0) ready, waiting for the workers, performing FIX-UP ...
    Job `2input-boost/2input-boost.cir` finished in 52.627 seconds.
    waiting 30 seconds for flush to disk . . .

    Starting 11 processes to run 44 jobs.
    Master task (0) ready, waiting for the workers, performing FIX-UP ...
    Job `2input-boost/2input-boost.cir` finished in 6.130 seconds.
    waiting 30 seconds for flush to disk . . .

    Starting 22 processes to run 44 jobs.
    Master task (0) ready, waiting for the workers, performing FIX-UP ...
    Job `2input-boost/2input-boost.cir` finished in 3.255 seconds.
    waiting 30 seconds for flush to disk . . .

    Starting 44 processes to run 44 jobs.
    Master task (0) ready, waiting for the workers, performing FIX-UP ...
    Job `2input-boost/2input-boost.cir` finished in 2.30 seconds.
    waiting 30 seconds for flush to disk . . .

    % cpus   time [s]   performance ratio
         1    52.921     1
        11     6.521     8.115473
        22     3.668    14.427753
        44     2.431    21.76923   ok

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to All on Sun Aug 18 09:28:09 2024
    Impressive! A PCIe NVMe drive will be a boost, but don't expect
    too much, when you already have so much RAM. And electric power. ;-)

    My experiments with parallel threads were a bit sobering. You
    really need rather isolated subprocesses that require little
    synchronisation. Otherwise the slowest process plus additional
    syncing costs can eat up all the expected benefits. Nothing new.

  • From mhx@21:1/5 to minforth on Sun Aug 18 11:31:37 2024
    On Sun, 18 Aug 2024 9:28:09 +0000, minforth wrote:

    > Impressive! A PCIe NVMe drive will be a boost, but don't expect
    > too much, when you already have so much RAM. And electric power. ;-)

    I tried a RAM drive (from AMD), but it has a throughput of only 50MB/s,
    10x slower than the SATA 6Gb/s-connected Samsung SSD (500MB/s). I am a
    bit puzzled why it is so devastatingly slow.

    > My experiments with parallel threads were a bit sobering. You
    > really need rather isolated subprocesses that require little
    > synchronisation.

    Yes, that is Amdahl's law. We constantly struggled with that
    for tForth. Fine-grained parallelism never gave us good results.

    > Otherwise the slowest process plus additional
    > syncing costs can eat up all the expected benefits. Nothing new.

    A new (to me) thing was that processes slow down enormously from
    accessing shared global variables (depending on their physical
    location), even when no locks are needed/used. For iSPICE such
    variables are in OS managed shared memory (aka the swap file)
    and are used very infrequently.

    -marcel

  • From albert@spenarnc.xs4all.nl@21:1/5 to mhx on Sun Aug 18 14:47:39 2024
    In article <2df471d1ec39c22949169f8a612b780d@www.novabbs.com>,
    mhx <mhx@iae.nl> wrote:
    > On Sun, 18 Aug 2024 9:28:09 +0000, minforth wrote:
    >
    >> Impressive! A PCIe NVMe drive will be a boost, but don't expect
    >> too much, when you already have so much RAM. And electric power. ;-)
    >
    > I tried a RAM drive (from AMD), but it has a throughput of only 50MB/s,
    > 10x slower than the SATA 6GBs connected Samsung SSD (500MB/s). I am a
    > bit puzzled why that is so devastatingly slow.
    >
    >> My experiments with parallel threads were a bit sobering. You
    >> really need rather isolated subprocesses that require little
    >> synchronisation.
    >
    > Yes, that is Amdahl's law. We constantly struggled with that
    > for tForth. Fine-grained parallelism never gave us good results.
    >
    >> Otherwise the slowest process plus additional
    >> syncing costs can eat up all the expected benefits. Nothing new.
    >
    > A new (to me) thing was that processes slow down enormously from
    > accessing shared global variables (depending on their physical
    > location), even when no locks are needed/used. For iSPICE such
    > variables are in OS managed shared memory (aka the swap file)
    > and are used very infrequently.

    That agrees with my experience. Parallel processes work with the
    same image. The protocol is that one process writes to a shared
    variable and the other reads it. The last process signals the chain
    that it is ready. All processes busy-wait on the signal to stop and
    pass it down the chain.

    That was on linux with AMD.
    Was your experience MS with Intel?


    > -marcel

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -

  • From Anton Ertl@21:1/5 to mhx on Sun Aug 18 13:42:33 2024
    mhx@iae.nl (mhx) writes:
    > What I meant is severe slowdown when reading variables that are
    > physically *close* to variables that belong to another process.

    That is known as false sharing. The cache coherence protocols work at
    the granularity of a cache line (usually 64 bytes). If core A
    writes to a variable, and core B, say, reads one in the same cache
    line, the cache coherence protocol first makes that cache line
    modified by core A (and every other core has to invalidate that cache
    line), and then core B has to wait until core A sends out the data to
    the other cores.

    > It happens for both AMD and Intel on both Windows and Linux.
    > Spacing such variables farther apart has dramatic impact but
    > is quite inconvenient in most cases.

    Yes, but if you want performance, you have to rearrange your data to
    avoid false sharing.

    > I don't recall that transputers had these problems.

    Transputers have no shared memory and therefore no cache coherence
    protocols.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2024: https://euro.theforth.net

  • From minforth@21:1/5 to mhx on Sun Aug 18 14:01:04 2024
    On Sun, 18 Aug 2024 13:33:27 +0000, mhx wrote:
    > What I meant is severe slowdown when reading variables that are
    > physically *close* to variables that belong to another process.
    > It happens for both AMD and Intel on both Windows and Linux.
    > Spacing such variables farther apart has dramatic impact but
    > is quite inconvenient in most cases.

    IIRC I once read a recommendation to group shared variables in
    (larger) structs. With structs you have control over their memory
    spacing and can improve cache behaviour.

  • From mhx@21:1/5 to albert@spenarnc.xs4all.nl on Sun Aug 18 13:33:27 2024
    On Sun, 18 Aug 2024 12:47:39 +0000, albert@spenarnc.xs4all.nl wrote:

    > In article <2df471d1ec39c22949169f8a612b780d@www.novabbs.com>,
    > mhx <mhx@iae.nl> wrote:
    > [..]
    >> A new (to me) thing was that processes slow down enormously from
    >> accessing shared global variables (depending on their physical
    >> location), even when no locks are needed/used. For iSPICE such
    >> variables are in OS managed shared memory (aka the swap file)
    >> and are used very infrequently.
    >
    > That agrees with my experience. Parallel processes work with the
    > same image. The protocol is that one process write to a shared variable,
    > the other reads. The last process signals the chain that it is
    > ready. All processes are busy waiting on the signal to stop and to
    > pass it down the chain.
    >
    > That was on linux with AMD.
    > Was your experience MS with Intel?

    What you seem to describe is that processes interfere when wanting
    access to the same (multi-byte) variable. It is obviously tricky to
    read a value byte-by-byte when somebody else is updating it
    byte-by-byte.
    What I meant is severe slowdown when reading variables that are
    physically *close* to variables that belong to another process.
    It happens for both AMD and Intel on both Windows and Linux.
    Spacing such variables farther apart has dramatic impact but
    is quite inconvenient in most cases.

    I don't recall that transputers had these problems. It may have
    to do with the physical memory read/write hardware.

    -marcel

  • From mhx@21:1/5 to Anton Ertl on Sun Aug 18 14:32:16 2024
    On Sun, 18 Aug 2024 13:42:33 +0000, Anton Ertl wrote:

    > mhx@iae.nl (mhx) writes:
    >> What I meant is severe slowdown when reading variables that are
    >> physically *close* to variables that belong to another process.
    >
    > Yes, but if you want performance, you have to rearrange your data to
    > avoid false sharing.

    Do you know if shared memory as provided by the OS (or Windows)
    has these problems too?

    -marcel

  • From Anton Ertl@21:1/5 to mhx on Sun Aug 18 15:14:52 2024
    mhx@iae.nl (mhx) writes:
    > Do you know if shared memory as provided by the OS (or Windows)
    > has these problems too?

    Shared memory has false-sharing problems however the sharing is
    arranged. The slowdown comes from the hardware. See
    <https://en.wikipedia.org/wiki/False_sharing>.

    The I-cache/D-cache ping-pong when you have writable data close to
    executed code on AMD64 is also false sharing, this time within one
    core.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2024: https://euro.theforth.net

  • From mhx@21:1/5 to All on Sun Aug 18 18:37:02 2024
    Here are the results on a somewhat more modern AMD CPU with
    32GB memory and a 7GB/s SSD.

    The scaling is near perfect and much better than expected
    (based on experiments a few months ago). The 10% shortfall for
    8 cores (7x instead of 8x) might be because 32GB is just a bit
    too tight.

    Waiting for the files to flush is not really necessary now,
    but I used 5s for good measure.

    This is the same circuit as used on the HP Z840, but with
    8 instead of 44 jobs and, to compensate, a 5x longer
    simulated time period.

    iSPICE> .TICKER-INFO
    AMD Ryzen 7 5800X 8-Core Processor
    TICKS-GET uses os time & PROCESSOR-CLOCK 4192MHz
    Do: < n TO PROCESSOR-CLOCK RECALIBRATE >
    ok

    Starting 1 process to run 8 jobs.
    Master task (0) ready, waiting for the workers, performing FIX-UP ...
    Job `2input-boost/2input-boost.cir` finished in 40.169 seconds.
    waiting 5 seconds for flush to disk . . .

    Starting 2 processes to run 8 jobs.
    Master task (0) ready, waiting for the workers, performing FIX-UP ...
    Job `2input-boost/2input-boost.cir` finished in 20.157 seconds.
    waiting 5 seconds for flush to disk . . .

    Starting 4 processes to run 8 jobs.
    Master task (0) ready, waiting for the workers, performing FIX-UP ...
    Job `2input-boost/2input-boost.cir` finished in 10.206 seconds.
    waiting 5 seconds for flush to disk . . .

    Starting 8 processes to run 8 jobs.
    Master task (0) ready, waiting for the workers, performing FIX-UP ...
    Job `2input-boost/2input-boost.cir` finished in 5.675 seconds.
    waiting 5 seconds for flush to disk . . .

    % cpus   time [s]   performance ratio
         1    40.240     1
         2    20.232     1.988928
         4    10.283     3.913254
         8     5.750     6.99826   ok

    -marcel

  • From mhx@21:1/5 to minforth on Wed Aug 28 09:29:37 2024
    On Sun, 18 Aug 2024 9:28:09 +0000, minforth wrote:

    > Impressive! A PCIe NVMe drive will be a boost, but don't expect
    > too much, when you already have so much RAM. And electric power. ;-)

    I didn't catch your drift there until I found out why there are no
    really fast RAM drives. The fastest drive is no drive at all, and
    that is possible by writing the simulation data to a temp file.
    Windows has a special attribute for that ( _O_SHORT_LIVED ) and
    Linux has shm.

    This means that there is no need for iSPICE to require a fast disk
    as long as there is enough free memory. And nice: I didn't have to
    change the code much, only the file attributes for one CREATE-FILE.

    With this change there is a slight possibility of losing data
    when the OS crashes before the RAM runs out, or before there's
    a reboot.

    -marcel

  • From Paul Rubin@21:1/5 to mhx on Wed Aug 28 09:35:42 2024
    mhx@iae.nl (mhx) writes:
    > I didn't catch your drift there until I found out why there are no
    > really fast RAM drives. The fastest drive is no drive at all, and
    > that is possible by writing the simulation data to a temp file.
    > Windows has a special attribute for that ( _O_SHORT_LIVED ) and
    > Linux has shm.

    On Linux you can make a RAM disk (using some of your system RAM as
    a file system) with tmpfs.
