• Re: Computer architects leaving Intel...

    From Stephen Fuld@21:1/5 to David Brown on Fri Sep 13 13:09:00 2024
    On 9/3/2024 4:14 PM, David Brown wrote:
    On 03/09/2024 18:54, Stephen Fuld wrote:
    On 9/2/2024 11:23 PM, David Brown wrote:
    On 02/09/2024 18:46, Stephen Fuld wrote:
    On 9/2/2024 1:23 AM, Terje Mathisen wrote:

    Anyway, that is all mostly moot since I'm using Rust for this kind
    of programming now. :-)

    Can you talk about the advantages and disadvantages of Rust versus C?


    And also for Rust versus C++ ?

    I asked about C versus Rust as Terje explicitly mentioned those two
    languages, but you make a good point in general.


    I want to know about both :-)

    In my field, small-systems embedded development, C has been dominant for
    a long time, but C++ use is increasing.  Most of my new stuff in recent times has been C++.  There are some in the field who are trying out
    Rust, so I need to look into it myself - either because it is a better
    choice than C++, or because customers might want it.



    My impression - based on hearsay for Rust as I have no experience -
    is that the key point of Rust is memory "safety".  I use scare-quotes
    here, since it is simply about correct use of dynamic memory and
    buffers.

    I agree that memory safety is the key point, although I gather that it
    has other features that many programmers like.


    Sure.  There are certainly plenty of things that I think are a better
    idea in a modern programming language and that make it a good step up compared to C.  My key interest is in comparison to C++ - it is a step
    up in some ways, a step down in others, and a step sideways in many features.  But is it overall up or down, for /my/ uses?

    Examples of things that I think are good in Rust are making variables immutable by default and pattern matching.  Steps down include lack of function overloading

    Rust's generic functions are not sufficient?



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Michael S on Fri Sep 13 21:39:39 2024
    Michael S <already5chosen@yahoo.com> schrieb:
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    struct {
        char x[8];
        int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation. And
    in practice it is. Just not in theory.

    Do you mean union rather than struct? And do you mean bar.x[7]
    rather than bar.x[8]? Surely no one would expect that storing
    into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
        char x[8];
        int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think should
    be defined by the C standard but is not? And the same question
    for a struct if that is what you meant.


    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior would be
    bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
    As it actually happens in reality with all production compilers.
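    For concreteness, a minimal C sketch of the case under discussion
    (assuming sizeof(int) == 4 and no padding between x and y). The direct
    bar.x[8] store is the part the standard leaves undefined, so the probe
    below uses a well-defined byte write to the object representation to
    show the layout that the observed compiler behaviour relies on:

    #include <stdio.h>
    #include <string.h>

    struct S {
        char x[8];
        int  y;
    };

    int main(void)
    {
        struct S bar;
        bar.y = 0;

        /* bar.x[8] = 42;   -- the disputed store: one past the end of x,
           undefined behaviour per the standard, though on typical
           implementations it lands on the first byte of y.             */

        /* Well-defined stand-in: write that same byte of the object
           representation through an unsigned char pointer.             */
        unsigned char v = 42;
        memcpy((unsigned char *)&bar + 8, &v, 1);

        /* Typically prints 42 on little-endian targets and
           42 * 2**24 = 704643072 on big-endian ones.                   */
        printf("bar.y = %d\n", bar.y);
        return 0;
    }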

    Ah, you want to re-introduce Fortran's storage association and
    common blocks, but without the type safety. Good idea, that.
    That created *really* interesting bugs, and Real Programmers (TM)
    have to have something that pays their salaries, right?

    SCNR

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Sep 13 23:16:19 2024
    On Fri, 13 Sep 2024 21:39:39 +0000, Thomas Koenig wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    struct {
        char x[8];
        int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation. And
    in practice it is. Just not in theory.

    Do you mean union rather than struct? And do you mean bar.x[7]
    rather than bar.x[8]? Surely no one would expect that storing
    into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
        char x[8];
        int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think should
    be defined by the C standard but is not? And the same question
    for a struct if that is what you meant.


    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior would be
    bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
    As it actually happens in reality with all production compilers.

    Ah, you want to re-introduce Fortran's storage association and
    common blocks, but without the type safety.

    FORTRAN allowed::
    subroutine1:
    COMMON /ALPHA/i,j,k,l,m,n
    subroutine2:
    COMMON /ALPHA/x,y,z
    expecting {i,j}, which are INTEGER*4, to overlap with x, a REAL*8 ;...
    {Completely neglecting the BE/LE problems,...}

    Good idea, that.
    That created *really* interesting bugs, and Real Programmers (TM)
    have to have something that pays their salaries, right?

    SCNR

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Sat Sep 14 07:25:00 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Fri, 13 Sep 2024 21:39:39 +0000, Thomas Koenig wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Fri, 13 Sep 2024 04:12:21 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    struct {
        char x[8];
        int y;
    } bar;
    bar.y = 0; bar.x[8] = 42;

    IMHO, here behavior should be fully defined by implementation. And
    in practice it is. Just not in theory.

    Do you mean union rather than struct? And do you mean bar.x[7]
    rather than bar.x[8]? Surely no one would expect that storing
    into bar.x[8] should be well-defined behavior.

    If the code were this

    union {
        char x[8];
        int y;
    } bar;
    bar.y = 0; bar.x[7] = 42;

    and assuming sizeof(int) == 4, what is it that you think should
    be defined by the C standard but is not? And the same question
    for a struct if that is what you meant.


    No, I mean struct and I mean 8.
    And I mean that a typical implementation-defined behavior would be
    bar.y==42 on LE machines and bar.y==42*2**24 on BE machines.
    As it actually happens in reality with all production compilers.

    Ah, you want to re-introduce Fortran's storage association and
    common blocks, but without the type safety.

    FORTRAN allowed::
    subroutine1:
    COMMON /ALPHA/i,j,k,l,m,n
    subroutine2:
    COMMON /ALPHA/x,y,z
    expecting {i,j}, which are INTEGER*4, to overlap with x, a REAL*8 ;...
    {Completely neglecting the BE/LE problems,...}

    Not only that, also different FP formats...

    The only thing that was guaranteed is the storage unit. An INTEGER
    and a REAL each occupy one storage unit, a DOUBLE PRECISION occupies
    two. Through EQUIVALENCE or through different COMMON blocks in
    different procedures, an INTEGER and a REAL can occupy the same
    storage location. And if a value was assigned to a variable of
    one type (the entity became defined, in standardese) the variable
    sharing the same storage location becomes undefined (at least as far
    back as Fortran 77, I didn't check earlier).

    This was very widely ignored, people used COMMON and EQUIVALENCE
    for type punning all the time.
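    As a hedged illustration, the closest C analogue of that
    COMMON/EQUIVALENCE punning is a union; reading a member other than
    the one last stored reinterprets the bytes, with all the endianness
    and floating-point-format caveats just described (the names here are
    made up to mirror the /ALPHA/ example):

    #include <stdio.h>

    /* Two 32-bit integers sharing storage with one 64-bit double, the
       same overlap the COMMON /ALPHA/ example sets up between {i,j}
       and x.                                                          */
    union alpha {
        int    ij[2];   /* stands in for INTEGER i, j      */
        double x;       /* stands in for REAL*8 / DOUBLE x */
    };

    int main(void)
    {
        union alpha a;
        a.x = 1.0;
        /* On a little-endian IEEE-754 target this typically prints
           i = 0x00000000  j = 0x3ff00000; the order flips on BE.    */
        printf("i = 0x%08x  j = 0x%08x\n",
               (unsigned)a.ij[0], (unsigned)a.ij[1]);
        return 0;
    }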

    There also was the issue of alignment; by playing tricks with
    EQUIVALENCE, you could put a double precision variable on an
    unaligned memory location. With the advent of the RISC CPUs, which
    didn't support this, this became the most-ignored provision in the
    standard (but with a flag to restore standard-conforming behavior).

    Hmm... what were the alignment restrictions on double precision
    on the /360?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to BGB on Sat Sep 14 08:24:29 2024
    BGB <cr88192@gmail.com> schrieb:
    On 9/13/2024 10:55 AM, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    Most of the commonly used parts of C99 have been "safe" to use for 20
    years. There were a few bits that MSVC did not implement until
    relatively recently, but I think even MSVC has caught up now.

    What about VLAs?


    IIRC, VLAs and _Complex and similar still don't work in MSVC.
    Most of the rest does now at least.
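    For reference, a minimal C99 sketch touching both features in question
    (plain standard C99 that builds with gcc or clang; per the above, a VLA
    like the one below is exactly what MSVC never picked up):

    #include <stdio.h>
    #include <complex.h>

    /* C99 VLA parameter: the array bound is a run-time value. */
    static double sum(int n, const double v[n])
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += v[i];
        return s;
    }

    int main(void)
    {
        int n = 4;
        double v[n];                        /* VLA on the stack       */
        for (int i = 0; i < n; i++)
            v[i] = i + 1;

        double _Complex z = 3.0 + 4.0 * I;  /* C99 complex arithmetic */

        printf("sum = %g, |z| = %g\n", sum(n, v), cabs(z));
        return 0;
    }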

    It's only been 25 years. You have to give Microsoft a bit of
    time to catch up. I'm sure they will get there by 2099.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Fri Sep 20 09:52:32 2024
    MitchAlsup1 wrote:
    On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:


    - register specifier fields are either source or dest, never both

    I happen to be wishywashy on this


    This is deeply interesting; can you expound on why it is fine that a
    register field can be shared by loads and stores, and sometimes be
    both, like x86?

    My 66000 encodes store data register in the same field position as it
    encodes "what kind of branch" is being performed, and the same position
    as all calculation (and load) results.

    I started doing this in 1982 with Mc88100 ISA, and never found a problem
    with the encoding nor in the decoding nor with the pipelining of it.

    Let me be clear, I do not support necessarily damaging a source operand
    to fit in another destination as::

    ADD SP,SP,#0x40

    by specifying SP only once in the instruction.

    So,

    +------+-----+-----+----------------+
    | major| Rd  | Rs1 |    whatever    |
    +------+-----+-----+----------------+
    | BC   | cnd | Rs1 |  label offset  |
    +------+-----+-----+----------------+
    | LD   | Rd  | Rb  |  displacement  |
    +------+-----+-----+----------------+
    | ST   | Rs0 | Rb  |  displacement  |
    +------+-----+-----+----------------+

    Is:
    a) no burden in encoding
    b) no burden in decoding
    c) no burden in pipelining
    d) no burden in stealing the Store data port late in the pipeline
    {in particular, this saves lots of flip-flops deferring store
    data until after cache hit, TLB hit, and data has arrived at
    cache.}

    I disagree with things like::

    +------+-----+-----+----------------+
    | big OpCode | Rds |    whatever    |
    +------+-----+-----+----------------+

    Where Rds means the specifier is used as both a source and destination.

    Notice in my encoding one can ALWAYS take the register specification
    fields and wire them directly into the RF/renamer decoder ports.
    You lose this property the other way around.
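    A small C sketch of the decode property being claimed; the 32-bit
    layout and opcode values below are illustrative stand-ins, not the
    actual My 66000 encoding. Because the store-data / branch-condition
    specifier sits in the same bit positions as every result specifier,
    and the base register in the same positions as every other source,
    both fields can be wired straight to the RF/renamer ports before the
    opcode has been looked at:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative format: [31:26] major, [25:21] Rd/Rs0/cnd,
       [20:16] Rs1/Rb, [15:0] immediate or displacement.              */
    enum { MAJOR_ADD = 1, MAJOR_LD = 2, MAJOR_ST = 3, MAJOR_BC = 4 };

    static unsigned field(uint32_t insn, unsigned hi, unsigned lo)
    {
        return (insn >> lo) & ((1u << (hi - lo + 1)) - 1);
    }

    int main(void)
    {
        uint32_t insn = (MAJOR_ST << 26) | (7u << 21) | (29u << 16) | 0x40;

        /* These extractions can feed the register-file/renamer ports
           directly -- no mux, because the field positions never move
           between instruction classes.                               */
        unsigned port_a = field(insn, 25, 21);   /* Rd, Rs0 or cnd */
        unsigned port_b = field(insn, 20, 16);   /* Rs1 or Rb      */

        /* Only the *use* of port_a (read vs. write) depends on the
           opcode, and that enable arrives later from the decoder.    */
        unsigned major = field(insn, 31, 26);
        printf("major=%u  port_a=r%u (%s)  port_b=r%u\n",
               major, port_a,
               (major == MAJOR_ST || major == MAJOR_BC) ? "read" : "write",
               port_b);
        return 0;
    }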

    I assume in your examples that you want to start your register file
    read access and or rename register lookup access in the decode stage,
    and not wait to start at the end of the decode stage.
    Effectively pipelining those accesses.
    That's fine.

    But that's my point - it doesn't make a difference because in both
    cases you can wire the reg fields to the reg file or rename directly
    and start the access ASAP.
    In both cases the enable signal determining what to do shows up
    later after decode has done its thing. And the critical path for
    that decode enable signal is the same both ways.

    And if you are not doing this early access start but the traditional
    of latch the decode output THEN start your RegRd or Rename access
    it makes no timing difference at all.

    By allowing the opcode-Rds style instructions to be *CONSIDERED*
    it opens an avenue to potential instructions that cost little or
    nothing extra in terms of logic or performance.

    And this is particularly useful with fixed width 32-bit instructions
    where one is trying to pack as much function into a fixed-size space as
    possible. Even more so with 16-bit compact instructions.

    For example, a 32-bit fixed format instruction with four 5-bit registers
    could do a full width integer multiply wide-accumulate

    IMAC (Rsd_hi,Rsd_lo) = (Rsd_hi,Rsd_lo) + Rs1 * Rs2

    with little more logic than the existing MULL,MULH approach.
    It still only needs 2 read ports because Rs1,Rs2 are read first to start
    the multiply, then (Rsd_hi,Rsd_lo) second as they aren't needed until
    late in the multiply-accumulate.
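    A C sketch of the IMAC semantics as described (register names are the
    ones used above; the 128-bit arithmetic merely models the double-width
    accumulator pair and the two-phase operand read, it says nothing about
    how the hardware would actually be built):

    #include <stdint.h>
    #include <stdio.h>

    /* IMAC (Rsd_hi,Rsd_lo) = (Rsd_hi,Rsd_lo) + Rs1 * Rs2,
       modelled with unsigned __int128 (a GCC/Clang extension).       */
    static void imac(uint64_t *rsd_hi, uint64_t *rsd_lo,
                     uint64_t rs1, uint64_t rs2)
    {
        /* Rs1 and Rs2 are read first to start the multiply ...       */
        unsigned __int128 prod = (unsigned __int128)rs1 * rs2;

        /* ... and the accumulator pair is read later, when it is
           actually needed, so only 2 read ports are used at a time.  */
        unsigned __int128 acc =
            ((unsigned __int128)*rsd_hi << 64) | *rsd_lo;

        acc += prod;
        *rsd_hi = (uint64_t)(acc >> 64);
        *rsd_lo = (uint64_t)acc;
    }

    int main(void)
    {
        uint64_t hi = 0, lo = 0xFFFFFFFFFFFFFFFFull;
        imac(&hi, &lo, 0xFFFFFFFFull, 0x100000001ull);
        /* Expect hi=0000000000000001 lo=fffffffffffffffe. */
        printf("hi=%016llx lo=%016llx\n",
               (unsigned long long)hi, (unsigned long long)lo);
        return 0;
    }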

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Sep 20 17:39:34 2024
    On Fri, 20 Sep 2024 13:52:32 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 19 Sep 2024 19:12:41 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:


    - register specifier fields are either source or dest, never both

    I happen to be wishywashy on this


    This is deeply interesting; can you expound on why it is fine that a
    register field can be shared by loads and stores, and sometimes be
    both, like x86?

    My 66000 encodes store data register in the same field position as it
    encodes "what kind of branch" is being performed, and the same position
    as all calculation (and load) results.

    I started doing this in 1982 with Mc88100 ISA, and never found a problem
    with the encoding nor in the decoding nor with the pipelining of it.

    Let me be clear, I do not support necessarily damaging a source operand
    to fit in another destination as::

    ADD SP,SP,#0x40

    by specifying SP only once in the instruction.

    So,

    +------+-----+-----+----------------+
    | major| Rd  | Rs1 |    whatever    |
    +------+-----+-----+----------------+
    | BC   | cnd | Rs1 |  label offset  |
    +------+-----+-----+----------------+
    | LD   | Rd  | Rb  |  displacement  |
    +------+-----+-----+----------------+
    | ST   | Rs0 | Rb  |  displacement  |
    +------+-----+-----+----------------+

    Is:
    a) no burden in encoding
    b) no burden in decoding
    c) no burden in pipelining
    d) no burden in stealing the Store data port late in the pipeline
    {in particular, this saves lots of flip-flops deferring store
    data until after cache hit, TLB hit, and data has arrived at
    cache.}

    I disagree with things like::

    +------+-----+-----+----------------+
    | big OpCode | Rds |    whatever    |
    +------+-----+-----+----------------+

    Where Rds means the specifier is used as both a source and destination.

    Notice in my encoding one can ALWAYS take the register specification
    fields and wire them directly into the RF/renamer decoder ports.
    You lose this property the other way around.

    I assume in your examples that you want to start your register file
    read access and or rename register lookup access in the decode stage,
    and not wait to start at the end of the decode stage.
    Effectively pipelining those accesses.
    That's fine.

    But that's my point - it doesn't make a difference because in both
    cases you can wire the reg fields to the reg file or rename directly
    and start the access ASAP.

    Not when a source field and a destination field are the same
    field sometimes but not always. Your thought train adds a
    register specifier mux between the destination field and
    the overused source field in front of the destination
    rename port. It is not a BIG hindrance, but it is not
    insignificant if you are doing a "balls to the wall"
    design.

    In both cases the enable signal determining what to do shows up
    later after decode has done its thing. And the critical path for
    that decode enable signal is the same both ways.

    And if you are not doing this early access start but the traditional
    of latch the decode output THEN start your RegRd or Rename access
    it makes no timing difference at all.

    By allowing the opcode-Rds style instructions to be *CONSIDERED*
    it opens an avenue to potential instructions that cost little or
    nothing extra in terms of logic or performance.

    The actual calculations are easy, it is the routing of data
    to and from the calculation that is hard.

    And this is particularly useful with fixed width 32-bit instructions
    where one is trying to pack as much function into a fixed-size space
    as possible. Even more so with 16-bit compact instructions.

    RISC-V, because of where the various fields ARE, has a mux between
    every source field and every register port--simply because their
    positions move between non-compressed and compressed.

    I agree with the position that if the mux is already there
    that one should use it often and greatly.

    Where I disagree is that the mux HAS to be there.

    For example, a 32-bit fixed format instruction with four 5-bit registers could do a full width integer multiply wide-accumulate

    IMAC (Rsd_hi,Rsd_lo) = (Rsd_hi,Rsd_lo) + Rs1 * Rs2

    This violates the RISC tenet where each calculation instruction
    produces exactly 1 result. I get around this with the mechanical
    definition of the CARRY instruction. The MUL instruction produces
    its result, CARRY captures the other, and deposits it in RF when
    possible.

    with little more logic than the existing MULL,MULH approach.
    It still only needs 2 read ports because Rs1,Rs2 are read first to start
    the multiply, then (Rsd_hi,Rsd_lo) second as they aren't needed until
    late in the multiply-accumulate.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Sep 20 20:34:01 2024
    On Fri, 20 Sep 2024 2:09:35 +0000, BGB wrote:

    On 9/18/2024 1:42 PM, MitchAlsup1 wrote:

    One simple option would be to assume an instruction looks like:
    [Prefix Bytes]
    [REX byte]
    OP_Byte | 0F+OP_Byte
    Mod/RM + SIB + ...

    No, the simple option is that an instruction looks like:

    +------+-----+-----+----------------+
    | major| Rd  | Rs1 |     imm16      |
    +------+-----+-----+----------------+
    | mem  | Rd  | Rb  |     disp16     |
    +------+-----+-----+----------------+
    | Bcnd | cnd | Rs1 |     disp18     |
    +------+-----+-----+----------------+
    | 2OP  | Rd  | Rs1 |mods| 2op | Rs2 |
    +------+-----+-----+----------------+
    | 3OP  | Rd  | Rs1 | Rs3 | 3op| Rs2 |
    +------+-----+-----+----------------+

    And then use a heuristic to try to guess how to interpret the
    instruction stream based on "looks better" (more likely to be aligned
    with the instruction stream vs random unaligned garbage).

    Though, such a "looks good" heuristic could itself risk skewing the
    results.


    I may still consider defining an encoding for this, but not yet. It is
    in a similar boat as auto-increment. Both add resource cost with
    relatively little benefit in terms of overall performance.
    Auto-increment because if one has superscalar, the increment can usually
    be co-executed. And, full [Rb+Ri*Sc+Disp], because it is just too
    infrequent to really justify the extra cost of a 3-way adder even if
    limited mostly to the low-order bits...

    Myopathy--look it up.


    OK.

    Not sure how that is related (a medical condition involving muscle defects...).

    Myopathy is NEAR SIGHTEDNESS.

    You are not looking far enough into the future to avoid problems in your
    ISA and architecture. {I did the same in my youth. Almost everyone
    does.}


    Can also note that a worthwhile design goal is to not add significant
    cost over what would be needed for a plain RV64GC implementation, but one
    could define a [Rb+Ri*Sc+Disp] encoding or similar if it would likely be
    beneficial enough to justify its existence.

    486 showed that "[Rbase+Rindex<<scale+displacement]:segment" could all
    be performed in a single cycle at a frequency competitive with the RISC processors available at the time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to All on Sat Sep 21 10:45:47 2024
    On 2024-09-20 23:34, MitchAlsup1 wrote:

    Myopathy is NEAR SIGHTEDNESS.


    Perhaps you meant "myopia", https://en.wikipedia.org/wiki/Myopia.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Paul A. Clayton on Fri Sep 27 13:31:13 2024
    Paul A. Clayton wrote:
    On 9/22/24 6:19 PM, MitchAlsup1 wrote:
    On 9/19/24 11:07 AM, EricP wrote:
    <sound of soap box being dragged out>
    This idea that macro-op fusion is some magic solution is bullshit.

    The argument is, at best, of Academic Quality, made by a student
    at the time as a way to justify RISC-V not having certain
    calculations that are easy for HW to perform.

    The RISC-V published argument for fusion is not great, but fusion
    (and cracking/fission) seem natural architectural mechanisms *if*
    one is stuck with binary compatibility.

    As far as I know there are only 3 published articles on RV fusion.

    The Renewed Case for the Reduced Instruction Set Computer:
    Avoiding ISA Bloat with Macro-Op Fusion for RISC-V, 2016
    http://people.eecs.berkeley.edu/~krste/papers/EECS-2016-130.pdf

    is an academic paper that proposes some fusion and compares compiler
    outputs but does not consider hardware cost.

    Exploring Instruction Fusion Opportunities in
    General Purpose Processors, 2022
    https://webs.um.es/aros/papers/pdfs/ssingh-micro22.pdf

    looks at a much more difficult fusion:
    "In this paper, we propose and study techniques to increase the number of
    fused memory instructions, notably nonconsecutive and non-contiguous fusion. Non-ConSecutive Fusion (NCSF) is the operation of fusing two (or more) μ-ops that are not consecutive in the dynamic execution stream of the program. Non-ConTiguous Fusion (NCTF) is the operation of fusing two (or more)
    memory μ-ops that access non-contiguous memory bytes."

    There is a very recent paper that I have not read as it is paywalled.

    [paywalled]
    Evaluating and Enhancing Performance through Macro-Op Fusion Optimization
    with RISC-V, 2024
    https://dl.acm.org/doi/abs/10.1145/3677333.3678150

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Fri Sep 27 18:01:40 2024
    On Wed, 25 Sep 2024 2:49:07 +0000, Paul A. Clayton wrote:

    On 9/22/24 6:19 PM, MitchAlsup1 wrote:
    On Sun, 22 Sep 2024 20:43:38 +0000, Paul A. Clayton wrote:

    On 9/19/24 11:07 AM, EricP wrote:
    [snip]
    If the multiplier is pipelined with a latency of 5 and throughput of 1,
    then MULL takes 5 cycles and MULL,MULH takes 6.

    But those two multiplies still are tossing away 50% of their work.

    I do not remember how multipliers are actually implemented — and
    am not motivated to refresh my memory at the moment — but I
    thought a multiply low would not need to generate the upper bits,
    so I do not understand where your "50% of their work" is coming
    from.

        +-----------+   +------------+
        \  mplier  /     \   mcand  /        Big input mux
          +--------+       +--------+
              |                |
              |      +--------------+
              |     /               /
              |    /               /
              +-- /               /
                 /     Tree      /
                /               /--+
               /               /   |
              /               /    |
             +---------------+-----------+
                   hi             low        Products

    two n-bit operands are multiplied into a 2×n-bit result.
    {{All the rest is HOW not what}}

    So are you saying the high bits come for free? This seems
    contrary to the conception of sums of partial products, where
    some of the partial products are only needed for the upper bits
    and so could (it seems to me) be uncalculated if one only wanted
    the lower bits.

    The high order bits are free WRT gates of delay, but consume as much
    area as the lower order bits. I was answering the question of
    "I do not remember how multipliers are actually implemented".

    The high result needs the low result carry-out but not the rest of
    the result. (An approximate multiply high for multiply by
    reciprocal might be useful, avoiding the low result work. There
    might also be ways that a multiplier could be configured to also
    provide bit mixing similar to middle result for generating a
    hash?)

    I seem to recall a PowerPC implementation did semi-pipelined 32-
    bit multiplication 16-bits at a time. This presumably saved area
    and power

    You save 1/2 of the tree area, but ultimately consume more power.

    The power consumption would seem to depend on how frequently both
    multiplier and multiplicand are larger than 16 bits. (However, I
    seem to recall that the mentioned implementation only checked one
    operand.) I suspect that for a lot of code, small values are
    common.

    It is 100% of the time in FP codes, and generally unknowable in
    integer codes.
    <snip>

    My 66000's CARRY and PRED are "extender prefixes", admittedly
    included in the original architecture to compensate for encoding
    constraints (e.g., not having 36-bit instruction parcels) rather
    than for microarchitectural or architectural variation.

    Since they cast extra bits over a number of instructions, and
    although they precede the instructions they modify, they are not
    classical prefixes--so I use the term instruction-modifier instead.

    [snip]
    (I feel that encoding some of the dependency information could
    be useful to avoid some of this work. In theory, common
    dependency detection could also be more broadly useful; e.g.,
    operand availability detection and execution/operand routing.)

    So useful that it is encoded directly in My 66000 ISA.

    How so? My 66000 does not provide any explicit declaration what
    operation will be using a result (or where an operand is being
    sourced from). Register names express the dependencies so the
    dataflow graph is implicit.

    I was talking about how operand routing is explicitly described
    in ISA--which is mainly about how constants override register
    file reads by the time operands get to the calculation unit.

    I was speculating that _knowing_ when an operand will be available
    and where a result should be sent (rather than broadcasting) could
    be useful information.

    It is easier to record which FU will deliver a result; the when
    part is simply a pipeline sequencer from the end of a FU to the
    entries in the reservation station.


    Even with reduced operations per cycle, fusion could still provide
    a net energy benefit.

    Here I disagree:: but for a different reason::

    In order for RISC-V to use a 64-bit constant as an operand, it has
    to execute either::  an AUIPC+LD to an area of memory containing the
    64-bit constant, or a 6-7 instruction stream to build the constant
    inline. An ISA that directly supports 64-bit constants in ISA
    does not execute any of those.

    Thus, while it may save power seen at the "it's my ISA" level,
    when seen from the perspective of "it is directly supported in
    my ISA" it wastes power.
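    For scale, a hedged C-level sketch of the two paths: the shift/OR
    chain below stands in for a LUI/ADDI/SLLI-style build sequence (the
    exact RISC-V instruction count depends on the constant and the
    assembler), while the comment marks the literal-pool alternative:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* ISA with 64-bit constants in the instruction stream: the
           value is simply an operand, no extra instructions at all.  */
        uint64_t direct = 0x0123456789ABCDEFull;

        /* Without that, the constant is assembled piecewise; each
           line below stands in for one or two instructions of a
           LUI/ADDI/SLLI-style build sequence.                        */
        uint64_t built = 0x01234u;
        built = (built << 16) | 0x5678u;
        built = (built << 16) | 0x9ABCu;
        built = (built << 12) | 0xDEFu;

        /* ... or it is fetched from a nearby literal pool (the
           AUIPC+LD path), trading instructions for a memory access.  */

        printf("%d\n", direct == built);    /* prints 1 */
        return 0;
    }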

    Yes, but "computing" large immediates is obviously less efficient
    (except for compression), since the computation part is known to be
    unnecessary. Fusing a comparison and a branch may be a consequence
    of bad ISA design in not properly estimating how much work an
    instruction can do (and be encoded in available space) and there
    is excess decode overhead with separate instructions, but the
    individual operations seem to be doing actual work.

    I suspect there can be cases where different microarchitectures
    would benefit from different amounts of instruction/operation
    complexity such that cracking and/or fusion may be useful even in
    an optimally designed generic ISA.

    [snip]
    - register specifier fields are either source or dest, never both

    This seems mostly a code density consideration. I think using a
    single name for both a source and a destination is not so
    horrible, but I am not a hardware guy.

    All we HW guys want is that, wherever the field is specified,
    it is specified in exactly 1 field in the instruction. So, if
    field<a..b> is used to specify Rd in one instruction, there is
    no other field<!a..!b> that specifies the Rd register. RISC-V blew
    this "requirement".

    Only with the Compressed extension, I think. The Compressed
    extension was somewhat rushed and, in my opinion, philosophically
    flawed by being redundant (i.e., every C instruction can be
    expanded to a non-C instruction). Things like My 66000's ENTER
    provide code density benefits but are contrary to the simplicity
    emphasis. Perhaps a Rho (density) extension would have been
    better.☺ (The extension letter idea was interesting for an
    academic ISA but has been clearly shown to be seriously flawed.)

    The R in RISC-V does not represent REDUCED.

    16-bit instructions could have kept the same register field
    placements with masking/truncation for two-register-field
    instructions.

    The whole layout of the ISA is sloppy...

    Even a non-destructive form might be provided by
    different masking or bit inversion for the destination. However,
    providing three register fields seems to require significant
    irregularity in extracting register names. (Another technique
    would be using opcode bits for specifying part or all of a
    register name. Some special purpose registers or groups of
    registers may not be horrible for compiler register allocation,
    but such seems rather funky/clunky.)

    It is interesting that RISC-V chose to split the immediate field
    for store instructions so that source register names would be in
    the same place for all (non-C) instructions.

    Lipstick on a pig.

    Comparing an ISA design to RISC-V is not exactly the same as
    comparing to "best in class".

    I don't even know if My 66000 can or should be termed RISC since
    it is a bit closer to VAX but did not go so far as to allow all
    operands to be constants--just one; the memory unit has a sequencer
    to perform ENTER, EXIT, LDM, STM, MM, MS; the FPU has a sequencer
    to do FDIV, SQRT, Log-family, exp-family, sin-family, arc-family
    and pow; the flow control unit has a sequencer to do PIC switch-case:
    all while allowing other FUs to process instructions while those
    sequencers run.

    I postulate that My 66000 ISA is RISC because it actually IS a
    Reduced instruction set computer--currently standing at 64
    instructions including SIMD and vectors.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)