• Re: Memory ordering

    From Tim Rentsch@21:1/5 to Anton Ertl on Sat Nov 16 17:28:21 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    jseigh <jseigh_es00@xemaps.com> writes:

    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim?

    It isn't a claim, just an opinion.

  • From jseigh@21:1/5 to Chris M. Thomasson on Sun Nov 17 09:03:06 2024
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
    mb for alpha? Cannot remember that one right now.

    That got deprecated. Too hard for compilers to deal with. It's now
    the same as memory_order_acquire.

    Which brings up an interesting point. Even if the hardware memory
    model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect. Or maybe disable reordering or optimization altogether
    for those target architectures.
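
    For instance, a minimal sketch (variable names hypothetical):

        #include <atomic>

        int data = 0;
        int plain_ready = 0;             // plain int: no ordering guarantees
        std::atomic<int> ready{0};       // atomic: ordering as requested

        void publish_plain() {
            data = 42;
            plain_ready = 1;   // the compiler may reorder these two stores,
                               // even when targeting x86/TSO
        }

        void publish_atomic() {
            data = 42;
            ready.store(1, std::memory_order_release);  // the store to data
                                                        // may not sink below
        }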

    Joe Seigh

  • From Anton Ertl@21:1/5 to jseigh on Sun Nov 17 15:17:52 2024
    jseigh <jseigh_es00@xemaps.com> writes:
    Even if the hardware memory
    model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.

    That's something between the user of a programming language and the
    compiler. If you use a programming language or compiler that gives
    weaker memory ordering guarantees than the architecture it compiles
    to, that's your choice. Nothing forces compilers to behave that way,
    and it's actually easier to write compilers that do not do such
    reordering.

    Or maybe disable reordering or optimization altogether
    for those target architectures.

    So you want to throw out the baby with the bathwater.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Sun Nov 17 15:15:08 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 11:37 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say,
    solving ABA.

    Sure, not all problems are solved by sequential consistency, and yes,
    it won't solve race conditions like the ABA problem. But jseigh
    implied that finding it easier to write correct and efficient code for
    sequential consistency than for a weakly-consistent memory model
    (e.g., Alpha's memory model) is incompetent.

    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely
    have to? If you say something akin to "I do everything using
    std::memory_order_seq_cst", well, that is a violation right off the bat.

    Fair enough?

    Are you trying to support my point?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Mon Nov 18 07:11:04 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely
    have to? If you say something akin to "I do everything using
    std::memory_order_seq_cst", well, that is a violation right off the bat. ...
    I am trying to say you might not be hired if you only knew how to handle
    std::memory_order_seq_cst wrt C++... ?

    I am not looking to be hired.

    In any case, this cuts both ways: If you are an employer working on multi-threaded software, say, for Windows or Linux, will you reduce
    your pool of potential hires by including a requirement like the one
    above? And then pay for longer development time and additional
    hard-to-find bugs coming from overshooting the requirement you stated
    above. Or do you limit your software support to TSO hardware (for
    lack of widely available SC hardware), and gain all the benefits of
    more potential hires, reduced development time, and fewer bugs?

    I have compared arguments against strong memory ordering with those
    against floating-point. Von Neumann argued for fixed point as follows <https://booksite.elsevier.com/9780124077263/downloads/historial%20perspectives/section_3.11.pdf>:

    |[...] human time is consumed in arranging for the introduction of
    |suitable scale factors. We only argue that the time consumed is a
    |very small percentage of the total time we will spend in preparing an
    |interesting problem for our machine. The first advantage of the
    |floating point is, we feel, somewhat illusory. In order to have such
    |a floating point, one must waste memory capacity which could
    |otherwise be used for carrying more digits per word.

    Kahan writes <https://people.eecs.berkeley.edu/~wkahan/SIAMjvnl.pdf>:

    |Papers in 1947/8 by Bargman, Goldstein, Montgomery and von Neumann
    |seemed to imply that 40-bit arithmetic would hardly ever deliver
    |usable accuracy for the solution of so few as 100 linear equations in
    |100 unknowns; but by 1954 engineers were solving bigger systems
    |routinely and getting satisfactory accuracy from arithmetics with no
    |more than 40 bits.

    The flaw in the reasoning of the paper was:

    |To solve it more easily without floating–point von Neumann had
    |transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
    |doubling the number of sig. bits lost to ill-condition

    This is an example of how the supposed gains that the harder-to-use
    interface provides (in this case the bits "wasted" on the exponent)
    are overcompensated by then having to use a software workaround for
    the harder-to-use interface.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From aph@littlepinkcloud.invalid@21:1/5 to Anton Ertl on Mon Nov 18 12:03:55 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification?

    I don't know. Given the contortions that the Linux kernel people had
    to go through, maybe it really was present in hardware.

    As a programming language implementer, I don't much think about "Will
    the hardware really do this?" because new hardware arises all the
    time, and I don't want users' programs to stop working.

    Andrew.

  • From aph@littlepinkcloud.invalid@21:1/5 to jseigh on Mon Nov 18 11:56:48 2024
    jseigh <jseigh_es00@xemaps.com> wrote:
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
    mb for alpha? Cannot remember that one right now.

    That got deprecated. Too hard for compilers to deal with. It's now
    the same as memory_order_acquire.

    It's back in C++20. I think the problem wasn't so much implementing
    it, which as you say can be trivially done by aliasing with acquire,
    but specifying it. We use load dependency ordering in Java on AArch64
    to satisfy some memory model requirements, so it's not as if it's
    difficult to use.
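
    As a rough sketch of the consume-style read side (names hypothetical;
    most compilers today simply treat consume as acquire):

        #include <atomic>

        struct Node {
            int payload;
            Node* next;
        };

        std::atomic<Node*> head{nullptr};

        int read_first_payload() {
            // The consume load orders the pointer load before the
            // dependent dereference below; per the discussion above,
            // only Alpha would need an actual barrier for this.
            Node* p = head.load(std::memory_order_consume);
            return p ? p->payload : -1;
        }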

    Which brings up an interesting point. Even if the hardware memory
    model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.

    Yes, exactly. It's not as if this is an issue that affects people who
    program in high-level languages; it's about what language implementers
    choose to do.

    Andrew.

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Tue Nov 26 01:29:19 2024
    On Mon, 25 Nov 2024 23:59:02 +0000, Chris M. Thomasson wrote:

    On 11/18/2024 3:34 PM, Chris M. Thomasson wrote:


    Don't tell me you want all of std::memory_order_* to default to
    std::memory_order_seq_cst? If you're on a system that only has seq_cst and
    nothing else, okay, but not on other weaker (memory order) systems,
    right?

    defaulting a relaxed to a seq_cst is a bit much.... ;^o

    Defaulting to Strongly_Ordered is even worse.

  • From Michael S@21:1/5 to Chris M. Thomasson on Fri Nov 15 15:24:59 2024
    On Fri, 15 Nov 2024 03:17:22 -0800
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> wrote:

    On 11/14/2024 11:25 PM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack
    of guarantees in memory ordering is a bad idea, and so is ARM's:
    "It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously? Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the
    operations of each individual processor appear in this sequence in
    the order specified by its program."
    [...]


    Well, iirc, the Alpha is the only system that requires an explicit
    membar for an RCU-based algorithm. Even SPARC in RMO mode does not
    need this. Iirc, akin to memory_order_consume in C++:

    https://en.cppreference.com/w/cpp/atomic/memory_order

    data dependent loads


    Your response does not answer Anton's question.

  • From jseigh@21:1/5 to Anton Ertl on Fri Nov 15 11:08:29 2024
    On 11/15/2024 2:25 AM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack of guarantees in memory ordering is a bad idea, and so is ARM's: "It's
    only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the operations
    of each individual processor appear in this sequence in the order
    specified by its program."

    However, I don't think that the Alpha architects considered the Alpha
    memory ordering to be an error, and probably still don't, just like
    the ARM architects don't consider their memory model to be an error.
    I am pretty sure that no Alpha implementation ever made use of the
    lack of causality in the Alpha memory model, so they could have added causality without outlawing existing implementations. That they did
    not indicates that they thought that their memory model was right. An advocacy paper for weak memory models [adve&gharachorloo95] came from
    the same place as Alpha, so it's no surprise that Alpha specifies weak consistency.

    @TechReport{adve&gharachorloo95,
    author = {Sarita V. Adve and Kourosh Gharachorloo},
    title = {Shared Memory Consistency Models: A Tutorial},
    institution = {Digital Western Research Lab},
    year = {1995},
    type = {WRL Research Report},
    number = {95/7},
    annote = {Gives an overview of architectural features of
    shared-memory computers such as independent memory
    banks and per-CPU caches, and how they make the (for
    programmers) most natural consistency model hard to
    implement, giving examples of programs that can fail
    with weaker consistency models. It then discusses
    several categories of weaker consistency models and
    actual consistency models in these categories, and
    which ``safety net'' (e.g., memory barrier
    instructions) programmers need to use to work around
    the deficiencies of these models. While the authors
    recognize that programmers find it difficult to use
    these safety nets correctly and efficiently, it
    still advocates weaker consistency models, claiming
    that sequential consistency is too inefficient, by
    outlining an inefficient implementation (which is of
    course no proof that no efficient implementation
    exists). Still the paper is a good introduction to
    the issues involved.}
    }

    - anton

    Anybody doing that sort of programming, i.e. lock-free or distributed algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place. Strongly
    consistent memory won't help incompetence.

    Joe Seigh

  • From Anton Ertl@21:1/5 to jseigh on Fri Nov 15 17:27:37 2024
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim?

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to BGB on Sat Nov 16 00:51:36 2024
    On Fri, 15 Nov 2024 23:35:22 +0000, BGB wrote:

    On 11/15/2024 4:05 PM, Chris M. Thomasson wrote:
    On 11/15/2024 12:53 PM, BGB wrote:
    On 11/15/2024 11:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim?

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?


    In my case, as I see it:
       The tradeoff is more about implementation cost, performance, etc.

    Weak model:
       Cheaper (and simpler) to implement;
       Performs better when there is no need to synchronize memory;
       Performs worse when there is need to synchronize memory;
       ...
    [...]

    TSO built on top of a weak memory model is what it is. It should not
    necessarily perform "worse" than other systems that have TSO as a
    default. The weaker models give us flexibility. Any weak memory model
    should be able to give sequential consistency by using the right
    membars in the right places.


    The speed difference is mostly that, in a weak model, the L1 cache
    merely needs to fetch memory from the L2 or similar, may write to it whenever, and need not proactively store back results.

    As I understand it, a typical TSO like model will require, say:
    Any L1 cache that wants to write to a cache line, needs to explicitly
    request write ownership over that cache line;

    The cache line may have been fetched from a core which modified the
    data, and handed this line directly to this requesting core on a
    typical read. So, it is possible for the line to show up with
    write permission even if the requesting core did not ask for write
    permission. So, not all lines being written have to request
    ownership.

    Any attempt by other cores to access this line,

    You are being rather loose with your time analysis in this question::

    Access this line before write permission has been requested,
    or
    Access this line after write permission has been requested but
    before it has arrived,
    or
    Access this line after write permission has arrived.

    may require the L2 cache
    to send a message to the core currently holding the cache line for
    writing to write back its contents, with the request unable to be
    handled until after the second core has written back the dirty cache
    line.

    L2 has to know something about how L1 has the line, and likely which
    core cache the data is in.

    This would create potential for significantly more latency in cases
    where multiple cores touch the same part of memory; albeit the cores
    will see each others' memory stores.

    One can ARGUE that this is a good thing as it makes latency part
    of the memory access model. More interfering accesses=higher
    latency.


    So, initially, weak model can be faster due to not needing any
    additional handling.


    But... Any synchronization points, such as a barrier or locking or
    releasing a mutex, will require manually flushing the cache with a weak model.

    Not necessarily:: My 66000 uses causal memory consistency, yet when
    an ATOMIC event begins it reverts to sequential consistency until
    the end of the event where it reverts back to causal. Use of MMI/O
    space reverts to sequential consistency, while access to config
    space reverts all the way back to strongly ordered.

    And, locking/releasing the mutex itself will require a mechanism
    that is consistent between cores (such as volatile atomic swaps or
    similar, which may still be weak as a volatile-atomic-swap would still
    not be atomic from the POV of the L2 cache; and an MMIO interface could
    be stronger here).


    Seems like there could possibly be some way to skip some of the cache flushing if one could verify that a mutex is only being locked and
    unlocked on a single core.

    Issue then is how to deal with trying to lock a mutex which has thus far
    been exclusive to a single core. One would need some way for the core
    that last held the mutex to know that it needs to perform an L1 cache
    flush.

    This seems to be a job for Cache Consistency.

    Though, one possibility could be to leave this part to the OS scheduler/syscall/...

    The OS wants nothing to do with this.

    mechanism; so the core that wants to lock the
    mutex signals its intention to do so via the OS, and the next time the
    core that last held the mutex does a syscall (or tries to lock the mutex again), the handler sees this, then performs the L1 flush and flags the
    mutex as multi-core safe (at which point, the parties will flush L1s at
    each mutex lock, though possibly with a timeout count so that, if the
    mutex has been single-core for N locks, it reverts to single-core
    behavior).

    This could reduce the overhead of "frivolous mutex locking" in programs
    that are otherwise single-threaded or single processor (leaving the
    cache flushes for the ones that are in-fact being used for
    synchronization purposes).

    ....

  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Sat Nov 16 07:37:44 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say,
    solving ABA.

    Sure, not all problems are solved by sequential consistency, and yes,
    it won't solve race conditions like the ABA problem. But jseigh
    implied that finding it easier to write correct and efficient code for sequential consistency than for a weakly-consistent memory model
    (e.g., Alpha's memory model) is incompetent.
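
    A hypothetical Treiber-stack pop makes the ABA point concrete; the bug
    is in comparing pointer values, not in ordering, so no memory_order
    choice fixes it:

        #include <atomic>

        struct Node { Node* next; };
        std::atomic<Node*> top{nullptr};

        Node* pop() {
            Node* old_top = top.load(std::memory_order_seq_cst);
            // Between the load above and the CAS below, another thread may
            // pop this node, free it, and push a new node that happens to
            // reuse the same address. The CAS still succeeds (the pointers
            // compare equal) even though old_top->next may now be garbage.
            while (old_top &&
                   !top.compare_exchange_weak(old_top, old_top->next,
                                              std::memory_order_seq_cst)) {
                // old_top was refreshed with the current top; retry
            }
            return old_top;
        }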

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to BGB on Sat Nov 16 07:46:17 2024
    BGB <cr88192@gmail.com> writes:
    The tradeoff is more about implementation cost, performance, etc.

    Yes. And the "etc." includes "ease of programming".

    Weak model:
    Cheaper (and simpler) to implement;

    Yes.

    Performs better when there is no need to synchronize memory;

    Not in general. For a cheap multiprocessor implementation, yes. A sophisticated implementation of sequential consistency can just storm
    ahead in that case and achieve the same performance. It just has to
    keep checkpoints around in case that there is a need to synchronize
    memory.

    Performs worse when there is need to synchronize memory;

    With a cheap multiprocessor implementation, yes. In general, no: Any sequentially consistent implementation is also an implementation of
    every weaker memory model, and the memory barriers become nops in that
    kind of implementation. Ok, nops still have a cost, but it's very
    close to 0 on a modern CPU.

    Another potential performance disadvantage of sequential consistency
    even with a sophisticated implementation:

    If you have some algorithm that actually works correctly even when it
    gets stale data from a load (with some limits on the staleness), the sophisticated SC implementation will incur the latency coming from
    making the load non-stale while that latency will not occur or be less
    in a similarly-sophisticated implementation of an appropriate weak
    consistency model.

    However, given that the access to actually-shared memory is slow even
    on weakly-consistent hardware, software usually takes measures to
    avoid having a lot of such accesses, so that cost will usually be
    miniscule.


    What you missed: the big cost of weak memory models and cheap hardware implementations of them is in the software:

    * For correctness, the safe way is to insert a memory barrier between
    any two memory operations.

    * For performance (on cheap implementations of weak memory models) you
    want to execute as few memory barriers as possible.

    * You cannot use testing to find out whether you have enough (and the
    right) memory barriers. That's not only because the involved
    threads may not be in the right state during testing for uncovering
    the incorrectness, but also because the hardware used for testing
    may actually have stronger consistency than the memory model, and so
    some kinds of bugs will never show up in testing on that hardware,
    even when the threads reach the right state. And testing is still
    the go-to solution for software people to find errors (nowadays even
    glorified by continuous integration and modern fuzz testing
    approaches).

    The result is that a lot of software dealing with shared memory is
    incorrect because it does not have a memory barrier that it should
    have, or inefficient on cheap hardware with expensive memory barriers
    because it uses more memory barriers than necessary for the memory
    model. A program may even be incorrect in one place and have
    superfluous memory barriers in another one.
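
    The classic message-passing idiom is a sketch of why testing cannot be
    relied on here (names hypothetical):

        #include <atomic>

        void use(int);

        int msg = 0;
        std::atomic<int> flag{0};

        void producer() {
            msg = 42;
            flag.store(1, std::memory_order_relaxed);  // bug: should be release
        }

        void consumer() {
            if (flag.load(std::memory_order_relaxed) == 1)  // bug: should be acquire
                use(msg);  // may read 0 on ARM or POWER; testing on
                           // x86/TSO hardware will likely never catch it
        }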

    Or programmers just don't do this stuff at all (as advocated by
    jseigh), and instead just write sequential programs, or use bottled
    solutions that often are a lot more expensive than superfluous memory
    barriers. E.g., in Gforth the primary inter-thread communication
    mechanism is currently implemented with pipes, involving the system
    calls read() and write(). And Bernd Paysan who implemented that is a
    really good programmer; I am sure he would be able to wrap his head
    around the whole memory model stuff and implement something much more efficient, but that would take time that he obviously prefers to spend
    on more productive things.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Tue Dec 3 08:32:52 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/18/2024 3:20 PM, Chris M. Thomasson wrote:
    On 11/17/2024 11:11 PM, Anton Ertl wrote:
    The flaw in the reasoning of the paper was:

    |To solve it more easily without floating–point von Neumann had
    |transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
    |doubling the number of sig. bits lost to ill-condition

    This is an example of how the supposed gains that the harder-to-use
    interface provides (in this case the bits "wasted" on the exponent)
    are overcompensated by then having to use a software workaround for
    the harder-to-use interface.
    ...
    Don't tell me you want all of std::memory_order_* to default to
    std::memory_order_seq_cst? If you're on a system that only has seq_cst and
    nothing else, okay, but not on other weaker (memory order) systems, right?

    I tell anyone who wants to read it to stop buying hardware without FP
    for non-integer work, and with weak memory ordering for work that
    needs concurrent programming. There are enough affordable offerings
    with FP and TSO that we do not need to waste programming time and
    increase the frequency of hard-to-find bugs by figuring out how to get
    good performance out of hardware without FP hardware and with weak
    memory ordering.

    Those who enjoy the challenge of dealing with the unnecessary problems
    of sub-par hardware can continue to enjoy that.

    But when developing production software, as a manager don't let
    programmers with this hobby horse influence your hardware and
    development decisions. Give full support for FP and TSO hardware, and
    limited support to weakly-ordered hardware. That limited support may
    consist of using software implementations of FP (instead of designing
    software for fixed point arithmetic). In case of hardware with weak
    ordering the limited support could be to use memory barriers liberally
    (without trying to minimize them at all; every memory barrier
    elimination costs development time and increases the potential for
    hard-to-find bugs), of using OS mechanisms for concurrency (rather
    than, e.g., lock-free algorithms), or maybe even only supporting single-threaded operation.

    Efficiently-implemented sequentially-consistent hardware would be even
    more preferable, and if it was widely available, I would recommend
    buying that over TSO hardware, but unfortunately we are not there yet.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Tue Dec 3 09:01:44 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/17/2024 7:17 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Or maybe disable reordering or optimization altogether
    for those target architectures.

    So you want to throw out the baby with the bathwater.

    No, keep the weak order systems and not throw them out wrt a system that
    is 100% seq_cst? Perhaps? What am I missing here?

    Disabling optimization altogether costs a lot; e.g., look at <http://www.complang.tuwien.ac.at/anton/bentley.pdf>: if you compare
    the lines for clang-3.5 -O0 with clang-3.5 -O3, you see a factor >2.5
    for the tsp9 program. For gcc-5.2.0 the difference is even bigger.

    That's why jseigh and people like him (I have read that suggestion
    several times before) love to suggest disabling optimization
    altogether. It's a straw man that does not even need beating up. Of
    course they usually don't show results for the supposed benefits of
    the particular "optimization" they advocate (or the drawbacks of
    disabling it), and jseigh follows this pattern nicely.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Michael S@21:1/5 to Anton Ertl on Tue Dec 3 11:36:37 2024
    On Tue, 03 Dec 2024 08:32:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/18/2024 3:20 PM, Chris M. Thomasson wrote:
    On 11/17/2024 11:11 PM, Anton Ertl wrote:
    The flaw in the reasoning of the paper was:

    |To solve it more easily without floating–point von Neumann had
    |transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
    |doubling the number of sig. bits lost to ill-condition

    This is an example of how the supposed gains that the
    harder-to-use interface provides (in this case the bits "wasted"
    on the exponent) are overcompensated by then having to use a
    software workaround for the harder-to-use interface.
    ...
    Don't tell me you want all of std::memory_order_* to default to
    std::memory_order_seq_cst? If you're on a system that only has seq_cst
    and nothing else, okay, but not on other weaker (memory order)
    systems, right?

    I tell anyone who wants to read it to stop buying hardware without FP
    for non-integer work, and with weak memory ordering for work that
    needs concurrent programming. There are enough affordable offerings
    with FP and TSO that we do not need to waste programming time and
    increase the frequency of hard-to-find bugs by figuring out how to get
    good performance out of hardware without FP hardware and with weak
    memory ordering.

    Those who enjoy the challenge of dealing with the unnecessary problems
    of sub-par hardware can continue to enjoy that.

    But when developing production software, as a manager don't let
    programmers with this hobby horse influence your hardware and
    development decisions. Give full support for FP and TSO hardware, and limited support to weakly-ordered hardware. That limited support may
    consist of using software implementations of FP (instead of designing software for fixed point arithmetic). In case of hardware with weak
    ordering the limited support could be to use memory barriers liberally (without trying to minimize them at all; every memory barrier
    elimination costs development time and increases the potential for hard-to-find bugs), of using OS mechanisms for concurrency (rather
    than, e.g., lock-free algorithms), or maybe even only supporting single-threaded operation.

    Efficiently-implemented sequentially-consistent hardware would be even
    more preferable, and if it was widely available, I would recommend
    buying that over TSO hardware, but unfortunately we are not there yet.

    - anton

    If you want a capable dual-core or quad-core processor integrated with an
    FPGA then Arm Cortex-A is the only game in town right now, and probably
    for a few years going forward. Typically, old low-end cores. The FPU is
    there, TSO is not.
    Fortunately, in the majority of applications of these chips there is no
    need for concurrent programming, but one is rarely 100% sure that the
    need won't emerge after the project starts.

    BTW, does your stance mean that you are strongly against the A64FX?

    My own stance is that people should not do lockless concurrent
    programming. Period.
    Well, almost period. Something like RCU in the Linux kernel is an exception.
    Maybe atomic updates of statistical counters are another exception,
    but only when one is sure that the application will never have to scale
    above two dozen cores.

    Lockless programming is horrendously complicated and error-prone.
    Sequential consistency removes only a small part of the potential
    complications.

  • From Anton Ertl@21:1/5 to Michael S on Tue Dec 3 10:03:29 2024
    Michael S <already5chosen@yahoo.com> writes:
    BTW, does your stance mean that you are strongly against the A64FX?

    No. According to <https://lwn.net/Articles/970907/> A64FX is one of
    the few ARM A64 implementations that provides TSO.

    Lockless programming is horrendously complicated and error-prone.
    Sequential consistency removes only a small part of the potential
    complications.

    It's only a part, true, but I am not sure that the part is small.

    Interestingly, the FP analogy persists here: having FP hardware does
    not mean that numerical programming is easy, just that it is not as
    hard as with fixed point.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From jseigh@21:1/5 to Anton Ertl on Tue Dec 3 08:59:18 2024
    On 12/3/24 04:01, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/17/2024 7:17 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Or maybe disable reordering or optimization altogether
    for those target architectures.

    So you want to throw out the baby with the bathwater.

    No, keep the weak order systems and not throw them out wrt a system that
    is 100% seq_cst? Perhaps? What am I missing here?

    Disabling optimization altogether costs a lot; e.g., look at <http://www.complang.tuwien.ac.at/anton/bentley.pdf>: if you compare
    the lines for clang-3.5 -O0 with clang-3.5 -O3, you see a factor >2.5
    for the tsp9 program. For gcc-5.2.0 the difference is even bigger.

    That's why jseigh and people like him (I have read that suggestion
    several times before) love to suggest disabling optimization
    altogether. It's a straw man that does not even need beating up. Of
    course they usually don't show results for the supposed benefits of
    the particular "optimization" they advocate (or the drawbacks of
    disabling it), and jseigh follows this pattern nicely.

    That wasn't a serious suggestion.

    The compiler is allowed to reorder code as long as it knows the
    reordering can't be observed or detected. If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    If you are writing code with concurrent shared data access then
    you need to let the compiler know. One way is with locks.
    Another way, for lock-free data structures, is with
    memory barriers. Even if you had cst hardware you
    would still need to tell the compiler, so cst hardware doesn't
    buy you any less effort from a programming point of view.
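
    A sketch of the compiler-only half of that (assuming typical GCC/Clang
    behavior; atomic_signal_fence emits no instruction and only constrains
    the compiler):

        #include <atomic>

        void tell_the_compiler(int* p) {
            p[0] = 1;
            // Pure compiler barrier: no hardware fence is emitted, but
            // the compiler may not move the surrounding accesses across it.
            std::atomic_signal_fence(std::memory_order_seq_cst);
            p[1] = 2;
        }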

    If you are arguing that, since lock-free programming with memory barriers
    is hard, we should use locks for everything (disregarding that
    locks have acquire/release semantics that the compiler has
    to be aware of and programmers aren't always aware of), you
    might want to consider the following performance timings
    on some stuff I've been playing with.

    unsafe       53.344 nsecs (     0.000)     54.547 nsecs (     0.000)*
    smr          53.828 nsecs (     0.484)     55.485 nsecs (     0.939)
    smrlite      53.094 nsecs (     0.000)     54.329 nsecs (     0.000)
    arc         306.674 nsecs (   253.330)    313.931 nsecs (   259.384)
    rwlock      730.012 nsecs (   676.668)    830.340 nsecs (   775.793)
    mutex     2,881.690 nsecs ( 2,828.346)  3,305.382 nsecs ( 3,250.835)

    smr is smrproxy, something like user space rcu. smrlite is smr
    w/o thread_local access, so I have an idea how much that
    adds to overhead. arc is arcproxy, lock-free reference count
    based deferred reclamation. rwlock and mutex are what their
    names would suggest. unsafe is no synchronization, to get a
    base timing on the reader loop body.

    2nd col is per loop read lock/unlock average cpu time
    3rd col is with unsafe time subtracted out
    4th col is average elapsed time
    5th col is with unsafe time subtracted out.

    cpu time doesn't measure lock wait time so elapsed time
    gives some indication of that.

    8 reader threads, 1 writer thread

    smrproxy is the version that doesn't need the seq_cst
    memory barrier so it is pretty fast (you are welcome).

    arc, rwlock, and mutex use interlocked instructions which
    cause cache thrashing. mutex will not scale well with
    number of threads on top of that. rwlock depends on
    how much write locking is going on. With few write
    updates, it will look more like arc.

    Timings are for 8 reader threads, 1 writer thread on
    4 core/8 hw thread machine.

    There are going to be applications where that 2-to-3+ orders of
    magnitude difference in overhead is going to matter a lot.

    Joe Seigh

  • From MitchAlsup1@21:1/5 to jseigh on Tue Dec 3 19:27:47 2024
    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allowed to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

  • From Stefan Monnier@21:1/5 to All on Tue Dec 3 18:37:41 2024
    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    We could start with something like

    critical_region {
    ...
    }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections as if execution was singlethreaded.


    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed Dec 4 02:47:58 2024
    On Tue, 3 Dec 2024 23:37:41 +0000, Stefan Monnier wrote:

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    We could start with something like

    critical_region {
    ...
    }

    In the spirit of the Fortran rule about obeying parentheses, one could
    write::

    {
    ...
    }

    and the compiler would not allow code motion beyond the {s or }s.

    {{You may want to disable code motion even when the region is not critical--just dangerous in some way.}}

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections as
    if execution was singlethreaded.

    You identify a second problem. Is it that you don't want code motion
    across the boundary or you do not want code motion within the boundary??


    Stefan

  • From Stefan Monnier@21:1/5 to All on Wed Dec 4 10:29:33 2024
    You identify a second problem. Is it that you don't want code motion
    across the boundary or you do not want code motion within the boundary??

    Concurrency is hard. 🙂


    Stefan

  • From jseigh@21:1/5 to Stefan Monnier on Wed Dec 4 11:13:17 2024
    On 12/3/24 18:37, Stefan Monnier wrote:
    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    We could start with something like

    critical_region {
    ...
    }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections as if execution was singlethreaded.


    C/C++11 already defines what lock acquire/release semantics are.
    Roughly you can move stuff outside of a critical section into it
    but not vice versa.

    Java uses synchronized blocks to denote the critical section.
    C++ (the society for using RAII for everything) has scoped_lock
    if you want to use RAII for your critical section. It's not
    always obvious what the actual critical section is. I usually
    use it inside its own bracket section to make it more obvious.
    { std::scoped_lock m(mutex);
    // .. critical section
    }
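
    A comment-annotated sketch of that roach-motel rule (variable names
    hypothetical):

        #include <mutex>

        std::mutex mtx;
        int a = 0, b = 0, c = 0;

        void roach_motel() {
            a = 1;          // may legally sink down into the critical section
            {   std::scoped_lock m(mtx);   // acquire
                b = 1;      // must stay between the lock and the unlock
            }               // release
            c = 1;          // may legally float up into the critical section
        }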

    I'm not a big fan of c/c++ using acquire and release memory order
    directives on everything since apart from a few situations it's
    not intuitively obvious what they do in all cases. You can
    look at compiler assembler output but you have to be real careful
    generalizing from what you see.

    Joe Seigh

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Dec 4 16:37:41 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allowed to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    C and C++ have the 'volatile' keyword for this purpose.

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed Dec 4 19:43:48 2024
    On Wed, 4 Dec 2024 15:29:33 +0000, Stefan Monnier wrote:

    You identify a second problem. Is it that you don't want code motion
    across the boundary or you do not want code motion within the boundary??

    Concurrency is hard. 🙂

    Suggest a way to disallow code motion across the boundary
    that still allows code motion within a block ??



    Stefan

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Dec 4 19:42:11 2024
    On Wed, 4 Dec 2024 16:37:41 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allowed to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    C and C++ have the 'volatile' keyword for this purpose.

    What if you want the volatile attribute only to hold
    on an inner block::

    {
        int i = ...;
        ...        // i is not volatile here
        {
            ...    // i is volatile in here
        }
        ...        // i is not volatile here
        ...
    }

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Dec 4 19:48:21 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 4 Dec 2024 16:37:41 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allowed to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    C and C++ have the 'volatile' keyword for this purpose.

    What if you want the volatile attribute only to hold
    on an inner block::

    {
        int i = ...;
        ...        // i is not volatile here
        {
            ...    // i is volatile in here
        }
        ...        // i is not volatile here
        ...
    }

    Cast it. Linux uses the macro ACCESS_ONCE to support this:

    uint64_t value = ACCESS_ONCE(memory_location);

    #define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))
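
    A sketch of the usual use case ('flag' is hypothetical): without the
    volatile cast the compiler may load the value once, keep it in a
    register, and spin forever. Note this only constrains the compiler,
    not the hardware; Linux has since replaced ACCESS_ONCE with
    READ_ONCE/WRITE_ONCE.

        extern int flag;

        void wait_for_flag(void) {
            while (!ACCESS_ONCE(flag))
                ;  /* spin until another thread sets flag */
        }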

  • From jseigh@21:1/5 to Chris M. Thomasson on Thu Dec 5 08:00:40 2024
    On 12/5/24 02:44, Chris M. Thomasson wrote:
    On 12/4/2024 8:13 AM, jseigh wrote:
    On 12/3/24 18:37, Stefan Monnier wrote:
                                               If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    We could start with something like

         critical_region {
           ...
         }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections
    as if execution was singlethreaded.


    C/C++11 already defines what lock acquire/release semantics are.
    Roughly you can move stuff outside of a critical section into it
    but not vice versa.

    Java uses synchronized blocks to denote the critical section.
    C++ (the society for using RAII for everything) has scoped_lock
    if you want to use RAII for your critical section.  It's not
    always obvious what the actual critical section is.  I usually
    use it inside its own bracket section to make it more obvious.
       { std::scoped_lock m(mutex);
         // .. critical section
       }

    I'm not a big fan of c/c++ using acquire and release memory order
    directives on everything since apart from a few situations it's
    not intuitively obvious what they do in all cases.  You can
    look at compiler assembler output but you have to be real careful
    generalizing from what you see.

    The release on the unlock can allow some following stores and things to
    sort of "bubble up" before it?

    Acquire and release confine things to the "critical section"; the
    release can allow for some following things to go above it, so to speak.
    This is making me think of Alex over on c.p.t. !

    :^)

    Did I miss anything? Sorry Joe.


    Maybe, for thread-local non-shared data, if the compiler can make that
    determination, but I don't know if the actual specs say that.

    Joe Seigh

  • From Tim Rentsch@21:1/5 to Scott Lurndal on Thu Dec 5 07:19:24 2024
    scott@slp53.sl.home (Scott Lurndal) writes:

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allowed to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    C and C++ have the 'volatile' keyword for this purpose.

    A problem with using volatile is that volatile doesn't do what
    most people think it does, especially with respect to what
    reordering is or is not allowed.
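
    A sketch of the usual misunderstanding (names hypothetical): volatile
    accesses are ordered only against other volatile accesses, and no
    hardware barrier is emitted:

        #include <atomic>

        volatile int vflag = 0;
        std::atomic<int> aflag{0};
        int data = 0;

        void volatile_publish() {
            data = 42;   // non-volatile store: the compiler may move it
                         // past the volatile store, and the CPU can anyway
            vflag = 1;
        }

        void atomic_publish() {
            data = 42;                                  // ordered before...
            aflag.store(1, std::memory_order_release);  // ...this release store
        }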

  • From Terje Mathisen@21:1/5 to Tim Rentsch on Fri Dec 6 10:04:29 2024
    Tim, did you send me a PM to check my email? I responded but then
    silence. Could someone be pretending to be you?

    Terje

    Tim Rentsch wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allowed to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    C and C++ have the 'volatile' keyword for this purpose.

    A problem with using volatile is that volatile doesn't do what
    most people think it does, especially with respect to what
    reordering is or is not allowed.



    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Tim Rentsch@21:1/5 to Terje Mathisen on Fri Dec 6 07:36:33 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim, did you send me a PM to check my email? I responded but then
    silence. Could someone be pretending to be you?

    Yes that was me. I'm just about to send you a followup.

  • From jseigh@21:1/5 to Chris M. Thomasson on Tue Dec 17 07:33:58 2024
    On 12/16/24 16:48, Chris M. Thomasson wrote:
    On 12/5/2024 5:00 AM, jseigh wrote:


    Maybe, for thread-local non-shared data, if the compiler can make that
    determination, but I don't know if the actual specs say that.

    It would be strange to me if the compiler executed a weaker barrier than
    what I said needed to be there. If I say I need a #LoadStore |
    #StoreStore here, then the compiler better put that barrier in there.
    Humm...

    C++ concurrency was designed by a committee. They try to fit things
    into their world view even if reality is a bit more nuanced or complex
    than that world view.

    C++ doesn't use #LoadStore, etc... memory ordering terminology. They
    use acquire, release, cst, relaxed, ... While in some cases it's straightforward as to what that means, in others it's less obvious.
    Non-obvious isn't exactly what you want when writing multi-threaded
    code. There's enough subtlety as it is.

    Joe Seigh

  • From aph@littlepinkcloud.invalid@21:1/5 to jseigh on Tue Dec 17 20:38:18 2024
    jseigh <jseigh_es00@xemaps.com> wrote:

    C++ doesn't use #LoadStore, etc... memory ordering terminology. They
    use acquire, release, cst, relaxed, ... While in some cases it's straightforward as to what that means, in others it's less obvious.

    Indeed you don't know the exact mapping to instructions, but that's
    the idea: you ask for the ordering model you want, and the compiler
    chooses the instructions.

    Non-obvious isn't exactly what you want when writing multi-threaded
    code. There's enough subtlety as it is.

    There are efficiency advantages to be had from getting away from
    explicit barriers, though. AArch64 has seq-cst load and store
    instructions which don't need the sledgehammer of a full StoreLoad
    between a store and a load.
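
    As a sketch (this reflects typical AArch64 code generation, not
    anything the standard mandates):

        #include <atomic>

        std::atomic<int> x{0};

        int seq_cst_pair() {
            x.store(1, std::memory_order_seq_cst);     // usually a single STLR
            return x.load(std::memory_order_seq_cst);  // usually LDAR; the
                                                       // STLR/LDAR pair supplies
                                                       // the StoreLoad ordering
        }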

    Andrew.

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Tue Dec 17 21:45:55 2024
    On Tue, 17 Dec 2024 20:41:23 +0000, Chris M. Thomasson wrote:

    On 12/17/2024 4:33 AM, jseigh wrote:
    On 12/16/24 16:48, Chris M. Thomasson wrote:
    On 12/5/2024 5:00 AM, jseigh wrote:


    Maybe, for thread-local non-shared data, if the compiler can make that
    determination, but I don't know if the actual specs say that.

    It would be strange to me if the compiler executed a weaker barrier
    than what I said needed to be there. If I say I need a #LoadStore |
    #StoreStore here, then the compiler better put that barrier in there.
    Humm...

    C++ concurrency was designed by a committee.  They try to fit things
    into their world view even if reality is a bit more nuanced or complex
    than that world view.

    Indeed.

    A committee with no/little HW design experience ...


    C++ doesn't use #LoadStore, etc... memory ordering terminology.  They
    use acquire, release, cst, relaxed, ...  While in some cases it's
    straightforward as to what that means, in others it's less obvious.
    Non-obvious isn't exactly what you want when writing multi-threaded
    code.  There's enough subtlety as it is.

    Agreed. Humm... The CAS is interesting to me.

    atomic_compare_exchange_weak
    atomic_compare_exchange_strong

    The weak one can fail spuriously... Akin to LL/SC in a sense?

    There are architectures in which CAS* is allowed to fail spuriously.
    In one case, if the miss buffers overflow while holding a locked
    variable, the CAS would fail (preventing an ABA problem possibility).

    (*) or any multi-instruction ATOMIC sequence.

    atomic_compare_exchange_weak_explicit
    atomic_compare_exchange_strong_explicit

    A membar for the success path and one for the failure path. Oh that's
    fun. Sometimes I think its better to use relaxed for all of the atomics
    and use explicit barriers ala atomic_thread_fence for the order. Well,
    that is more in line with the SPARC way of doing things... ;^)

    :-)

  • From jseigh@21:1/5 to Chris M. Thomasson on Wed Dec 18 06:43:43 2024
    On 12/17/24 15:41, Chris M. Thomasson wrote:

    Agreed. Humm... The CAS is interesting to me.

    atomic_compare_exchange_weak
    atomic_compare_exchange_strong

    The weak one can fail spuriously... Akin to LL/SC in a sense?

    Most likely LL/SC in the implementation. If you are calling
    CAS in a loop, the weak form is safe; it will just retry. The
    strong form is for when you are not in a loop: its emulation
    code has extra logic to filter out spurious failures and retry
    internally.
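
    A minimal sketch of that retry loop (names invented for
    illustration):

      #include <atomic>

      std::atomic<long> counter{0};  // hypothetical shared counter

      void add(long delta) {
          long old = counter.load(std::memory_order_relaxed);
          // compare_exchange_weak may fail spuriously, e.g. when it is
          // implemented with LL/SC; on failure 'old' is reloaded with
          // the current value and the loop simply retries.
          while (!counter.compare_exchange_weak(old, old + delta,
                                                std::memory_order_relaxed)) {
              // retry
          }
      }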

    atomic_compare_exchange_weak_explicit
    atomic_compare_exchange_strong_explicit

    A membar for the success path and one for the failure path. Oh that's
    fun. Sometimes I think it's better to use relaxed for all of the atomics
    and use explicit barriers a la atomic_thread_fence for the order. Well,
    that is more in line with the SPARC way of doing things... ;^)

    Not sure why that is there. Possibly for non-loop usages.
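
    For reference, a minimal sketch of the two-order form (names
    invented): the success order applies when the exchange happens,
    the failure order to the load that observed the mismatch.

      #include <atomic>

      std::atomic<void*> head{nullptr};  // hypothetical shared pointer

      bool try_publish(void*& expected, void* desired) {
          // success: release, publishing writes made before the swap;
          // failure: relaxed, since we only learn the current value
          return head.compare_exchange_strong(expected, desired,
                                              std::memory_order_release,
                                              std::memory_order_relaxed);
      }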

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Thu Dec 19 18:33:36 2024
    On Thu, 5 Dec 2024 7:44:19 +0000, Chris M. Thomasson wrote:

    On 12/4/2024 8:13 AM, jseigh wrote:
    On 12/3/24 18:37, Stefan Monnier wrote:
    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    We could start with something like

         critical_region {
           ...
         }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections
    as if execution was single-threaded.


    C/C++11 already defines what lock acquire/release semantics are.
    Roughly you can move stuff outside of a critical section into it
    but not vice versa.
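
    An editorial sketch of that rule (invented names; the comments mark
    which motions a compiler may legally perform):

      #include <mutex>

      std::mutex mtx;
      int plain = 0;  // ordinary, non-atomic data

      void f() {
          int t = plain;  // may legally be sunk INTO the section below
          {
              std::scoped_lock guard(mtx);
              plain = t + 1;  // may not be hoisted above the lock or
                              // sunk below the unlock
          }
          // code here may be hoisted into the section, but nothing
          // inside may move out past the unlock
      }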

    Java uses synchronized blocks to denote the critical section.
    C++ (the society for using RAII for everything) has scoped_lock
    if you want to use RAII for your critical section.  It's not
    always obvious what the actual critical section is.  I usually
    use it inside its own bracket section to make it more obvious.
      { std::scoped_lock m(mutex);
        // .. critical section
      }

    I'm not a big fan of c/c++ using acquire and release memory order
    directives on everything since apart from a few situations it's
    not intuitively obvious what they do in all cases.  You can
    look at compiler assembler output but you have to be real careful
    generalizing from what you see.

    The release on the unlock can allow some following stores and things to
    sort of "bubble up" before it?

    Acquire and release confine things to the "critical section"; the
    release can allow some following things to go above it, so to speak.
    This is making me think of Alex over on c.p.t. !

    This sounds dangerous: if the thing allowed to go above it is unCacheable
    while the lock:release is cacheable, the cacheable lock can arrive at
    another core before the unCacheable store arrives at its destination.

    :^)

    Did I miss anything? Sorry Joe.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Thu Dec 19 23:59:01 2024
    On Thu, 19 Dec 2024 21:19:24 +0000, Chris M. Thomasson wrote:

    On 12/19/2024 10:33 AM, MitchAlsup1 wrote:
    On Thu, 5 Dec 2024 7:44:19 +0000, Chris M. Thomasson wrote:

    On 12/4/2024 8:13 AM, jseigh wrote:
    On 12/3/24 18:37, Stefan Monnier wrote:
    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.
    We could start with something like

         critical_region {
           ...
         }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections
    as if execution was single-threaded.


    C/C++11 already defines what lock acquire/release semantics are.
    Roughly you can move stuff outside of a critical section into it
    but not vice versa.

    Java uses synchronized blocks to denote the critical section.
    C++ (the society for using RAII for everything) has scoped_lock
    if you want to use RAII for your critical section.  It's not
    always obvious what the actual critical section is.  I usually
    use it inside its own bracket section to make it more obvious.
       { std::scoped_lock m(mutex);
         // .. critical section
       }

    I'm not a big fan of c/c++ using acquire and release memory order
    directives on everything since apart from a few situations it's
    not intuitively obvious what they do in all cases.  You can
    look at compiler assembler output but you have to be real careful
    generalizing from what you see.

    The release on the unlock can allow some following stores and things to
    sort of "bubble up" before it?

    Acquire and release confine things to the "critical section"; the
    release can allow some following things to go above it, so to speak.
    This is making me think of Alex over on c.p.t. !

    This sounds dangerous: if the thing allowed to go above it is unCacheable
    while the lock:release is cacheable, the cacheable lock can arrive at
    another core before the unCacheable store arrives at its destination.

    Humm... Need to ponder on that. Wrt the sparc:

    membar #LoadStore | #StoreStore

    can allow following stores to bubble up before it. If we want to block
    that then we would use a #StoreLoad. However, a #StoreLoad is not
    required for unlocking a mutex.
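
    In C++ fence terms, a sketch of that unlock path (the lock variable
    is invented):

      #include <atomic>

      std::atomic<int> lock_word{1};  // hypothetical: 1 = held, 0 = free

      void unlock() {
          // plays the role of membar #LoadStore | #StoreStore: loads
          // and stores from the critical section may not sink below
          std::atomic_thread_fence(std::memory_order_release);
          // later stores may still "bubble up" above this store;
          // no #StoreLoad is required just to release the mutex
          lock_word.store(0, std::memory_order_relaxed);
      }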

    It is the cacheable locks covering unCacheable data that got the MOESI
    protocol in trouble (SPARC V8 era). MESI does not have this kind
    of problem. {{SuperSPARC MESI did not have this problem because
    writes to memory (via SNOOP hits) were slow; Ross MOESI did have
    this problem because cache-to-cache transfers (SNOOP hit) were as
    few as 6 cycles.}}

    SO, what kind of barriers a relaxed memory model needs becomes
    dependent on the cache coherency model !?!?!?! How is software
    going to deal with that ?!? It then becomes dependent on the
    memory order model, as a cascade of
    Oh-crap-what-have-I-done-to-myself ...

    It is stuff like this that led My 66000 to alter memory models
    as it accesses memory and mandates that all critical sections
    are denoted (.lock) at the beginning and end of the ATOMIC event.
    Thus, the programmer gets the performance of the relaxed memory
    model with the sanity of sequential consistency, without programmer
    involvement.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Chris M. Thomasson on Fri Dec 20 14:39:03 2024
    Chris M. Thomasson wrote:
    On 12/19/2024 10:33 AM, MitchAlsup1 wrote:
    On Thu, 5 Dec 2024 7:44:19 +0000, Chris M. Thomasson wrote:

    On 12/4/2024 8:13 AM, jseigh wrote:
    On 12/3/24 18:37, Stefan Monnier wrote:
    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.
    We could start with something like

         critical_region {
           ...
         }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections
    as if execution was single-threaded.


    C/C++11 already defines what lock acquire/release semantics are.
    Roughly you can move stuff outside of a critical section into it
    but not vice versa.

    Java uses synchronized blocks to denote the critical section.
    C++ (the society for using RAII for everything) has scoped_lock
    if you want to use RAII for your critical section. It's not
    always obvious what the actual critical section is. I usually
    use it inside its own bracket section to make it more obvious.
      { std::scoped_lock m(mutex);
        // .. critical section
      }

    I'm not a big fan of c/c++ using acquire and release memory order
    directives on everything since apart from a few situations it's
    not intuitively obvious what they do in all cases. You can
    look at compiler assembler output but you have to be real careful
    generalizing from what you see.

    The release on the unlock can allow some following stores and things to
    sort of "bubble up" before it?

    Acquire and release confine things to the "critical section"; the
    release can allow some following things to go above it, so to speak.
    This is making me think of Alex over on c.p.t. !

    This sounds dangerous: if the thing allowed to go above it is unCacheable
    while the lock:release is cacheable, the cacheable lock can arrive at
    another core before the unCacheable store arrives at its destination.

    Humm... Need to ponder on that. Wrt the sparc:

    membar #LoadStore | #StoreStore

    can allow following stores to bubble up before it. If we want to block
    that then we would use a #StoreLoad. However, a #StoreLoad is not
    required for unlocking a mutex.

    I had an idea a few weeks back of a different way to do membars
    that should be more flexible and controllable (if that's a good thing)
    so I thought I'd toss it out there for comments.

    This hypothetical ISA has normal LD and ST instructions, to which I
    would add an LW Load-for-Write instruction to optimize moving shared
    lines between caches. There are also the Atomic Fetch-and-OP
    instructions AFADD, AFAND, AFOR, AFXOR, plus ASWAP and ACAS, and
    LL Load-Locked and SC Store-Conditional, for various sizes of
    naturally aligned data and with various address modes.

    Here is the new part:

    To the above instructions is added a 3-bit Coherence Group (CG) field.
    This allows one to specify which of eight groups each of the above
    data accesses belongs to.

    The ISA has a membar instruction: MBG Memory Barrier for Group

    MBG has three fields:
    - one 4-bit field where each bit enables which operations this barrier
    applies to, in older-younger order: Load-Load, Load-Store, Store-Load,
    and Store-Store.
    - two 8-bit fields where each bit selects which sets of Coherence Group(s)
    this barrier applies to, one field for the older (before the membar) sets,
    one for the younger (after the membar) sets.

    Also the Load Store Queue is assumed to be self coherent - that loads
    and stores to the same address by a single core are performed in order,
    and that nothing can bypass a load or store with an unresolved address.

    The CG numbers are assigned by convention, probably by the OS designers
    when they define the ABI for this ISA.
    Here I assigned CG:0 to be thread normal access, CG:1 to be atomic items,
    CG:2 to be shared memory sections. The remaining 5 CG's can be used to
    indicate different shared memory sections if their locks can overlap.

    Eg. an MBG with op bits for Load-Load and Load-Store, a before CG of 1,
    and after CGs 3 and 4, would block all younger loads and stores in
    groups 3 and 4 from starting execution until all older loads in
    group 1 have completed. Loads and stores in all other groups are
    free to reorder, within the LSQ self-coherence rules.
    An MBG with all op bits and all CG bits set is a full membar.

    Also, if one is, say, juggling multiple shared sections with multiple
    spinlocks or mutexes, then one can use multiple membars applied to
    different groups to achieve specific bypass-blocking effects.

    An MBG instruction completes and retires once all of the older loads
    and stores it selects have completed.
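
    A sketch of how those fields might pack into an instruction word
    (the encoding below is invented purely for illustration of the
    proposal):

      #include <cstdint>

      // op bits, in older-younger order
      enum : uint32_t {
          MBG_LL = 1u << 0,   // Load-Load
          MBG_LS = 1u << 1,   // Load-Store
          MBG_SL = 1u << 2,   // Store-Load
          MBG_SS = 1u << 3,   // Store-Store
      };

      // ops: 4 bits; before/after: 8-bit coherence-group masks
      constexpr uint32_t mbg(uint32_t ops, uint32_t before, uint32_t after) {
          return (ops & 0xFu) | ((before & 0xFFu) << 4)
                              | ((after & 0xFFu) << 12);
      }

      // The example above: older loads in group 1 ordered before
      // younger loads and stores in groups 3 and 4.
      constexpr uint32_t example = mbg(MBG_LL | MBG_LS,
                                       1u << 1,
                                       (1u << 3) | (1u << 4));

      // Full membar: all op bits, all groups on both sides.
      constexpr uint32_t full_membar = mbg(0xFu, 0xFFu, 0xFFu);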

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)