• Re: Matmul in VVM

    From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun May 10 08:58:34 2026
    From Newsgroup: comp.arch

    On 5/3/2026 3:28 PM, MitchAlsup wrote:

    big snip
    The solution to the excessive a[] traffic would be having the ability
    to index the register file Ra[#] so the array can be allocated into
    registers and indexed from the file itself. Most ISAs do not have this ability--although a few GPU ISAs do.

    A possible alternative that I have seen is to "memory map" the registers
    as an alternative accessing mechanism. This allows you to "index" the registers, similarly to indexing a memory array.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sun May 10 16:15:12 2026
    From Newsgroup: comp.arch

    Thomas Koenig [2026-05-10 06:55:39] wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    The solution to the excessive a[] traffic would be having the ability
    to index the register file Ra[#] so the array can be allocated into
    registers and indexed from the file itself. Most ISAs do not have this
    ability--although a few GPU ISAs do.
    What are its drawbacks?

    I guess the problem is that in an OoO design, this introduces a deeply problematic dependency between the in-order front-end that renames
    logical registers to physical registers and the OoO core.

    Usually such dependencies (where the front-end needs info from the
    OoO core) are handled via speculation, the classical example being
    branches.

    Do register accesses get slower?

    In order not to mess the whole pipeline, I think you'd have to predict
    the register indexing.

    In the meantime you can "simulate" it by replacing the register
    indexing by a `switch` table to various copies of the code, each one
    using the appropriate register. Of course, this wouldn't work in vVM
    since IIRC vVM doesn't support branches within the loop, and using
    predication to simulate the `switch` table would be impractical.
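
A minimal sketch of the `switch`-table workaround described above, written in Python as a model rather than real machine code. The names (`use_r0` to `use_r3`, `SWITCH_TABLE`, the four-register file) are hypothetical; each function stands in for one copy of the code with its register number hard-coded, and the run-time index picks a code copy instead of a register.

```python
from types import SimpleNamespace

# One copy of the "loop body" per register; the register name is fixed in each copy,
# just as a register number is fixed in an instruction encoding.
def use_r0(rf, x): return rf.r0 * x
def use_r1(rf, x): return rf.r1 * x
def use_r2(rf, x): return rf.r2 * x
def use_r3(rf, x): return rf.r3 * x

# The switch table: the index selects a code copy (a branch target), not a register.
SWITCH_TABLE = (use_r0, use_r1, use_r2, use_r3)

def multiply_by_indexed_register(rf, index, x):
    return SWITCH_TABLE[index](rf, x)

rf = SimpleNamespace(r0=2, r1=3, r2=5, r3=7)
print(multiply_by_indexed_register(rf, 2, 10))  # prints 50
```

The dispatch is a branch, which is why this approach would not fit inside a loop that forbids branches.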


    === Stefan
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Tue May 12 02:17:49 2026
    From Newsgroup: comp.arch

    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon May 11 20:11:43 2026
    From Newsgroup: comp.arch

    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue May 12 05:14:48 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    It would have to be implemented. How? And how does the supposed
    rareness help?

    Remember the subject: You suggested this mechanism as a way to
    eliminate the disadvantage of VMM compared to AVX-512 in 8x8 matrix
    multiplication, and the disadvantage was that VVM cannot eliminate
    some memory accesses that AVX-512 can. Turning the registers into
    memory does not solve that, and probably incurs additional costs.
    This cure is worse than the disease.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon May 11 23:11:07 2026
    From Newsgroup: comp.arch

    On 5/11/2026 10:14 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    It would have to be implemented. How?

    Let me give one possible implementation. There are certainly others.
    Say you have 32 registers. They are "memory mapped" into the first 32 addresses of memory. So programs would have to start not at zero, but
    at 32 (I know this can cause other problems - I clearly have not thought through all of the details.) So now when the CPU encounters a load (or
    store) instruction where the virtual address is less than 32, it is
    resolved not by the memory system, but by the appropriate register.
    i.e. if the virtual address was say 4, the load would be from register
    R4, not memory location 4. Yes, the virtual addressing mechanism
    would have to be sensitive to whether the address was below 32 or not,
    but that is simple within the CPU. Note that the load instruction in
    this case would not touch the memory system at all, so no cache lookups,
    no TLB lookups, etc.
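
A minimal sketch of the scheme just described, as a toy Python model. It assumes one address per register (addresses 0 to 31) and a dictionary standing in for the memory system; the `Machine` class and its `memory_accesses` counter are hypothetical names used only to show that register-mapped loads and stores never reach memory.

```python
NUM_REGS = 32  # assumption: 32 registers mapped at addresses 0..31

class Machine:
    def __init__(self):
        self.regs = [0] * NUM_REGS
        self.memory = {}            # stand-in for caches, TLB and DRAM
        self.memory_accesses = 0    # counts trips to the memory system

    def load(self, addr):
        if 0 <= addr < NUM_REGS:
            return self.regs[addr]  # resolved by the register file
        self.memory_accesses += 1
        return self.memory.get(addr, 0)

    def store(self, addr, value):
        if 0 <= addr < NUM_REGS:
            self.regs[addr] = value
        else:
            self.memory_accesses += 1
            self.memory[addr] = value

m = Machine()
for i in range(8):
    m.store(8 + i, i + 1)                      # fill R8..R15 through an index
total = sum(m.load(8 + i) for i in range(8))   # indexed reads of R8..R15
print(total, m.memory_accesses)                # prints 36 0
```

The model only captures the address decode; it says nothing about how an out-of-order pipeline would track the dependencies.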


    And how does the supposed
    rareness help?

    Lawrence said it would defeat the purpose of registers. My comment was
    that since it would be rare, i.e. most of the register references would
    be the same as before, it wouldn't defeat the purpose.

    Remember the subject: You suggested this mechanism as a way to
    eliminate the disadvantage of VMM compared to AVX-512 in 8x8 matrix
    multiplication, and the disadvantage was that VVM cannot eliminate
    some memory accesses that AVX-512 can. Turning the registers into
    memory does not solve that, and probably incurs additional costs.
    This cure is worse than the disease.

    The register accesses are not turned into memory accesses. If the
    address is less than 32, the instruction references the actual register,
    not the memory. The only advantage of this scheme is that it allows "indexing" the registers similarly to how one indexes memory today.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue May 12 07:23:37 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 10:14 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    It would have to be implemented. How?

    Let me give one possible implementation. There are certainly others.
    Say you have 32 registers. They are "memory mapped" into the first 32
    addresses of memory. So programs would have to start not at zero, but
    at 32 (I know this can cause other problems - I clearly have not thought
    through all of the details.) So now when the CPU encounters a load (or
    store) instruction where the virtual address is less than 32, it is
    resolved not by the memory system, but by the appropriate register.
    i.e. if the virtual address was say 4, the load would be from register
    R4, not memory location 4.

    How is that done? You have a load instruction in the scheduler. How
    does it know when the "memory" location it is waiting for is ready?
    The register renamer knows nothing about the address that the
    instruction accesses, because the address will usually come in at
    run-time (otherwise one would use a direct register access). So
    should every load and store wait for all physical registers to be
    ready that corresponded to logical registers when the instruction came
    through the register renamer? That would slow down all loads and
    stores. If not, how else should it be done?
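
The scheduling concern can be sketched with a toy readiness check. This is only an illustration under stated assumptions: `rename_map` is a hypothetical snapshot of logical-to-physical mappings taken when the load passed the renamer, `ready` is a hypothetical per-physical-register ready bit, and the "wait for everything" rule is the conservative option raised in the paragraph above, not a known implementation.

```python
def normal_load_ready(address_preg, ready):
    """An ordinary load waits only for the register holding its address."""
    return ready[address_preg]

def register_mapped_load_ready(address_preg, rename_map, ready):
    """Conservative rule: the address is unknown at rename time, so the load
    might read any logical register and must wait for all of them."""
    return ready[address_preg] and all(ready[p] for p in rename_map.values())

# Hypothetical snapshot: logical register Rn is mapped to physical register 40 + n.
rename_map = {logical: 40 + logical for logical in range(32)}
ready = {preg: True for preg in rename_map.values()}
ready[rename_map[17]] = False   # R17 is still being computed by an older instruction

address_preg = rename_map[5]    # the load's address is in R5, which is ready
print(normal_load_ready(address_preg, ready))                       # prints True
print(register_mapped_load_ready(address_preg, rename_map, ready))  # prints False
```

In this model a single unfinished, unrelated register stalls the load, which is the slowdown the question points at.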

    Once we know all the registers are ready, the instruction is admitted
    to the load/store unit. Now you have to add an extra data path that
    accesses the physical register. That's a completely new data path
    with an additional register port that will cost area, and possibly
    gate delays.

    When the load actually has its value, it will continue to the ROB, and
    only when it is retired will all the 32 or 64 registers that it has
    reserved lower their reference count (and of course, you have to
    add hardware for performing this many updates, whereas normal
    instructions only update a few register reference counts).

    And for stores, similar problems show up that also have to be dealt with.

    Hardware has been optimized for fast access to registers encoded
    directly in instructions, and for (in the usual case) fast access to
    memory where the address is determined at run-time.

    There are reasons why architectural features that allowed accessing
    registers with memory instructions have vanished, even before OoO
    execution became popular, and the developments since then, in
    particular OoO execution, have added more reasons.

    Yes, the virtual addressing mechanism
    would have to be sensitive to whether the address was below 32 or not,
    but that is simple within the CPU. Note that the load instruction in
    this case would not touch the memory system at all, so no cache lookups,
    no TLB lookups, etc.

    The alternative is unlikely to be any faster.

    Remember the subject: You suggested this mechanism as a way to
    eliminate the disadvantage of VMM compared to AVX-512 in 8x8 matrix
    multiplication, and the disadvantage was that VVM cannot eliminate
    some memory accesses that AVX-512 can. Turning the registers into
    memory does not solve that, and probably incurs additional costs.
    This cure is worse than the disease.

    The register accesses are not turned into memory accesses.

    Then why map the registers into memory and let load/store instructions
    (or uops) access them?

    Having separate read-indexed-reg and write-indexed-reg instructions
    would avoid burdening the load/store unit with this idea, but would
    still suffer from many of the problems mentioned above. There are
    reasons why no modern architecture has such instructions.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue May 12 09:48:07 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be used in rare circumstances.

    The PDP-10 comes to mind... with its "registers" which were just
    aliases for the first memory locations, which were usually
    implemented to be faster than main memory. You could even run
    code there. But that fell out of fashion for a reason.

    I believe that this would be particularly difficult with OoO.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue May 12 12:59:44 2026
    From Newsgroup: comp.arch

    On 12/05/2026 11:48, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    The PDP-10 comes to mind... with its "registers" which were just
    aliases for the first memory locations, which were usually
    implemented to be faster than main memory. You could even run
    code there. But that fell out of fashion for a reason.


    It is not uncommon with small microcontroller cores to have registers
    mapped to memory. You have it on the AVR, for example, where the 32
    8-bit registers all have a memory address. For some of the "tiny"
    devices, there is no other ram on the chip - memory-mapped registers
    means you can have an array that is within the registers.

    It is also not uncommon for small 8-bit devices to have very few
    registers, but be able to treat a specific part of the memory as pseudo-registers. The 6502 had zero-page memory that could be accessed
    faster than general memory, and could be used for indirect addressing.
    The COP8 had 16 bytes of memory that could be treated as registers for
    some instructions (indeed, some of the real registers were mapped to
    that area).

    I believe that this would be particularly difficult with OoO.

    I've only seen it on small devices that certainly did not have any kind
    of OoO. (The 6502 had a bit of pipelining and overlap between pipeline stages.)


  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue May 12 14:29:40 2026
    From Newsgroup: comp.arch

    Lawrence D’Oliveiro <ldo@nz.invalid> writes:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    Of course not. Memory mapping a register doesn't imply any relationship
    to DRAM. The registers are still kept in flops, but are mapped into
    the address space directly.

  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue May 12 14:33:37 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 10:14 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    It would have to be implemented. How?

    Let me give one possible implementation. There are certainly others.
    Say you have 32 registers. They are "memory mapped" into the first 32
    addresses of memory. So programs would have to start not at zero, but
    at 32 (I know this can cause other problems - I clearly have not thought
    through all of the details.) So now when the CPU encounters a load (or
    store) instruction where the virtual address is less than 32, it is
    resolved not by the memory system, but by the appropriate register.
    i.e. if the virtual address was say 4, the load would be from register
    R4, not memory location 4. Yes, the virtual addressing mechanism
    would have to be sensitive to whether the address was below 32 or not,
    but that is simple within the CPU. Note that the load instruction in
    this case would not touch the memory system at all, so no cache lookups,
    no TLB lookups, etc.

    It's really pretty straightforward - consider a SATA controller,
    for instance - it has a bunch of registers, generally implemented
    with flops that are accessible by reading and writing specific addresses
    in the system address map.

    There's no reason that a certain range of physical (or even virtual)
    addresses cannot refer to a set of flopped registers rather than the
    underlying memory architecture; such addresses being known to the
    processor as special and intercepted before reaching the
    processor interconnect fabric. Either a fixed address, or derived
    from a programmable base register.


  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue May 12 18:06:10 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Suffering at least LD pipeline delay or more, and consuming the LDed
    register.

    Doesn’t this defeat the point of how registers are supposed to work?

    Only a little.

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    It would have to be implemented. How? And how does the supposed
    rareness help?

    Remember the subject: You suggested this mechanism as a way to
    eliminate the disadvantage of VMM compared to AVX-512 in 8x8 matrix

    vVM

    multiplication, and the disadvantage was that VVM cannot eliminate
    some memory accesses that AVX-512 can. Turning the registers into
    memory does not solve that, and probably incurs additional costs.
    This cure is worse than the disease.

    - anton
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue May 12 18:12:51 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 5/11/2026 10:14 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    It would have to be implemented. How?

    Let me give one possible implementation. There are certainly others.
    Say you have 32 registers. They are "memory mapped" into the first 32 addresses of memory.

    Very PDP-6 of you.

    But you really do not want code and data (registers) in the same page,
    so just start at page[1].

    But ?!? what do you do with NULL pointers ??? now you actually can
    dereference them.

    No, I think this brings more problems than solutions.

    So programs would have to start not at zero, but
    at 32 (I know this can cause other problems - I clearly have not thought through all of the details.) So now when the CPU encounters a load (or store) instruction where the virtual address is less than 32, it is resolved not by the memory system, but by the appropriate register.
    i.e. if the virtual address was say 4, the load would be from register
    R4, not memory location 4.

    Do you really want to have a LD access span registers ?!?

    Yes, the virtual addressing mechanism
    would have to be sensitive to whether the address was below 32 or not,

    That is actually easy enough to do VAS[HOB..5] == 0
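
The check `VAS[HOB..5] == 0` can be sketched as a shift and compare. This is a minimal illustration assuming non-negative (unsigned) virtual addresses and 32 mapped addresses; the function name `targets_register_file` is hypothetical.

```python
def targets_register_file(vaddr):
    """True when every address bit from bit 5 upward is zero, i.e. vaddr < 32."""
    return (vaddr >> 5) == 0

print(targets_register_file(4))    # prints True
print(targets_register_file(31))   # prints True
print(targets_register_file(32))   # prints False
```

In hardware this amounts to a NOR over the upper address bits, which is why the decode itself is cheap.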

    but that is simple within the CPU. Note that the load instruction in
    this case would not touch the memory system at all, so no cache lookups,
    no TLB lookups, etc.


    And how does the supposed
    rareness help?

    Lawrence said it would defeat the purpose of registers. My comment was
    that since it would be rare, i.e. most of the register references would
    be the same as before, it wouldn't defeat the purpose.

    Remember the subject: You suggested this mechanism as a way to
    eliminate the disadvantage of VMM compared to AVX-512 in 8x8 matrix
    multiplication, and the disadvantage was that VVM cannot eliminate
    some memory accesses that AVX-512 can. Turning the registers into
    memory does not solve that, and probably incurs additional costs.
    This cure is worse than the disease.

    The register accesses are not turned into memory accesses. If the
    address is less than 32, the instruction references the actual register,

    LDD Rd,[31] spans R3 and R4

    not the memory. The only advantage of this scheme is that it allows "indexing" the registers similarly to how one indexes memory today.
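
The `LDD Rd,[31]` remark above can be checked with a little arithmetic. The sketch assumes byte addressing, 8-byte registers laid out back to back from address 0, and that `LDD` is an 8-byte load; the helper `registers_touched` is a hypothetical name.

```python
REG_BYTES = 8  # assumption: 64-bit registers packed from byte address 0

def registers_touched(byte_addr, size):
    """Return the register numbers covered by an access of `size` bytes."""
    first = byte_addr // REG_BYTES
    last = (byte_addr + size - 1) // REG_BYTES
    return list(range(first, last + 1))

print(registers_touched(31, 8))  # prints [3, 4]
print(registers_touched(24, 8))  # prints [3]
```

Under these assumptions an unaligned load reads the last byte of R3 and the first seven bytes of R4, so the scheme would need either an alignment rule or a defined meaning for such accesses.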


  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue May 12 18:16:44 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 10:14 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    It would have to be implemented. How?

    Let me give one possible implementation. There are certainly others.
    Say you have 32 registers. They are "memory mapped" into the first 32
    addresses of memory. So programs would have to start not at zero, but
    at 32 (I know this can cause other problems - I clearly have not thought
    through all of the details.) So now when the CPU encounters a load (or
    store) instruction where the virtual address is less than 32, it is
    resolved not by the memory system, but by the appropriate register.
    i.e. if the virtual address was say 4, the load would be from register
    R4, not memory location 4.

    How is that done?

    Exotic interlocking if you want von Neumann order, or get what you get
    if you don't. For the application at hand you don't need vN order.

    You have a load instruction in the scheduler. How
    does it know when the "memory" location it is waiting for is ready?

    The scheduler is only waiting for operand registers to be ready.

    The register renamer knows nothing about the address that the
    instruction accesses, because the address will usually come in at
    run-time (otherwise one would use a direct register access). So
    should every load and store wait for all physical registers to be
    ready that corresponded to logical registers when the instruction came through the register renamer? That would slow down all loads and
    stores. If not, how else should it be done?

    Only use it when you don't care about that stuff.

    Once we know all the registers are ready, the instruction is admitted
    to the load/store unit. Now you have to add an extra data path that
    accesses the physical register. That's a completely new data path
    with an additional register port that will cost area, and possibly
    gate delays.

    When the load actually has its value, it will continue to the ROB, and
    only when it is retired will all the 32 or 64 registers that it has
    reserved lower their reference count (and of course, you have to
    add hardware for performing this many updates, whereas normal
    instructions only update a few register reference counts).

    And for stores, similar problems show up that also have to be dealt with.

    Hardware has been optimized for fast access to registers encoded
    directly in instructions, and for (in the usual case) fast access to
    memory where the address is determined at run-time.

    There are reasons why architectural features that allowed accessing
    registers with memory instructions have vanished, even before OoO
    execution became popular, and the developments since then, in
    particular OoO execution, have added more reasons.

    Yes, the virtual addressing mechanism
    would have to be sensitive to whether the address was below 32 or not,
    but that is simple within the CPU. Note that the load instruction in
    this case would not touch the memory system at all, so no cache lookups,
    no TLB lookups, etc.

    The alternative is unlikely to be any faster.

    Remember the subject: You suggested this mechanism as a way to
    eliminate the disadvantage of VMM compared to AVX-512 in 8x8 matrix
    multiplication, and the disadvantage was that VVM cannot eliminate
    some memory accesses that AVX-512 can. Turning the registers into
    memory does not solve that, and probably incurs additional costs.
    This cure is worse than the disease.

    The register accesses are not turned into memory accesses.

    Then why map the registers into memory and let load/store instructions
    (or uops) access them?

    Having separate read-indexed-reg and write-indexed-reg instructions
    would avoid burdening the load/store unit with this idea, but would
    still suffer from many of the problems mentioned above. There are
    reasons why no modern architecture has such instructions.

    - anton
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue May 12 18:23:29 2026
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 10:14 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    It would have to be implemented. How?

    Let me give one possible implementation. There are certainly others.
    Say you have 32 registers. They are "memory mapped" into the first 32
    addresses of memory. So programs would have to start not at zero, but
    at 32 (I know this can cause other problems - I clearly have not thought
    through all of the details.) So now when the CPU encounters a load (or
    store) instruction where the virtual address is less than 32, it is
    resolved not by the memory system, but by the appropriate register.
    i.e. if the virtual address was say 4, the load would be from register
    R4, not memory location 4. Yes, the virtual addressing mechanism
    would have to be sensitive to whether the address was below 32 or not,
    but that is simple within the CPU. Note that the load instruction in
    this case would not touch the memory system at all, so no cache lookups, >no TLB lookups, etc.

    It's really pretty straightforward - consider a SATA controller,
    for instance - it has a bunch of registers, generally implemented
    with flops that are accessible by reading and writing specific addresses
    in the system address map.

    There's no reason that a certain range of physical (or even virtual)
    addresses cannot refer to a set of flopped registers rather than the
    underlying memory architecture; such addresses being known to the
    processor as special and intercepted before reaching the
    processor interconnect fabric. Either a fixed address, or derived
    from a programmable base register.

    In My 66000 architecture, each thread has its own register file, and
    when the thread is "running in a core" there is a control register
    pointing at the 1/8 page where the file is located. When control is
    transferred to the thread, HW loads the file; when control leaves
    the thread, HW stores (the modified parts of) the file.

    I did not include indexing the register file because it would "take
    a long time" possibly violating security. I did, however, allow
    supervisory (i.e. higher privilege) SW to read/write those registers
    <more or less> as supervisory SW desires. At Super layers, time does
    not matter so much.
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue May 12 17:08:23 2026
    From Newsgroup: comp.arch

    On 5/12/2026 11:12 AM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 5/11/2026 10:14 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn't this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    It would have to be implemented. How?

    Let me give one possible implementation. There are certainly others.
    Say you have 32 registers. They are "memory mapped" into the first 32
    addresses of memory.

    Very PDP-6 of you.

    :-) I did say that others had implemented it, and other posters gave
    several examples

    But you really do not want code and data (registers) in the same page,
    so just start at page[1].

    Good Point.


    But ?!? what do you do with NULL pointers ??? now you actually can dereference them.

    No, I think this brings more problems than solutions.

    If you change to use specific op codes instead of load/store semantics,
    this issue goes away.


    So programs would have to start not at zero, but
    at 32 (I know this can cause other problems - I clearly have not thought
    through all of the details.) So now when the CPU encounters a load (or
    store) instruction where the virtual address is less than 32, it is
    resolved not by the memory system, but by the appropriate register.
    i.e. if the virtual address was say 4, the load would be from register
    R4, not memory location 4.

    Do you really want to have a LD access span registers ?!?

    Of course not. I thought about the alternative (giving addresses
    instead of register numbers), but decided that having the hardware add
    three zeros to the register number was worth the tradeoff of having the instruction actually show register numbers. Of course, YMMV. I am not
    wedded to either solution. A further possibility is Anton's suggestion
    of using two additional op codes to signify "indexed" register
    moves/copies. I thought not using those op codes was worth something,
    but perhaps the confusion engendered by using load/store op codes is
    worse. Of course, the implementation within the CPU is essentially the
    same.



    Yes, the virtual addressing mechanism
    would have to be sensitive to whether the address was below 32 or not,

    That is actually easy enough to do VAS[HOB..5] == 0

    Yes, and if so, append three zeros to the value to get an address.
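
    The check discussed just above can be sketched in a few lines. This is
    only an illustration under stated assumptions, not anyone's actual
    design: it assumes 32 registers, 64-bit virtual addresses, and that an
    address below 32 is itself the register number. The names NUM_REGS,
    ADDR_BITS, is_register_address, resolve and register_byte_offset are
    made up for this example.

```python
# Hypothetical model of the "address below 32 means register" check.
# Assumptions (not taken from any real ISA): 32 registers, 64-bit
# virtual addresses, and the address value itself is the register number.

NUM_REGS = 32    # assumed register count
ADDR_BITS = 64   # assumed virtual address width


def is_register_address(vaddr: int) -> bool:
    """True when every bit from the high-order bit down to bit 5 is zero,
    which is the VAS[HOB..5] == 0 test from the thread."""
    vaddr &= (1 << ADDR_BITS) - 1
    return (vaddr >> 5) == 0


def resolve(vaddr: int):
    """Return ("reg", n) for a register reference, ("mem", vaddr) otherwise."""
    if is_register_address(vaddr):
        return ("reg", vaddr)
    return ("mem", vaddr)


def register_byte_offset(reg: int) -> int:
    """Appending three zero bits to the register number gives the byte
    offset of an 8-byte register inside a register-file image."""
    return reg << 3
```

    With these assumptions, resolve(4) names register R4 while resolve(32)
    falls through to ordinary memory, and a misaligned access cannot span
    two registers because the address is a register number, not a byte
    address.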


    but that is simple within the CPU. Note that the load instruction in
    this case would not touch the memory system at all, so no cache lookups,
    no TLB lookups, etc.


    And how does the supposed
    rareness help?

    Lawrence said it would defeat the purpose of registers. My comment was
    that since it would be rare, i.e. most of the register references would
    be the same as before, it wouldn't defeat the purpose.

    Remember the subject: You suggested this mechanism as a way to
    eliminate the disadvantage of VVM compared to AVX-512 in 8x8 matrix
    multiplication, and the disadvantage was that VVM cannot eliminate
    some memory accesses that AVX-512 can. Turning the registers into
    memory does not solve that, and probably incurs additional costs.
    This cure is worse than the disease.

    The register accesses are not turned into memory accesses. If the
    address is less than 32, the instruction references the actual register,

    LDD Rd,[31] spans R3 and R4

    No, see above.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Wed May 13 02:57:32 2026
    From Newsgroup: comp.arch

    On Tue, 12 May 2026 18:12:51 GMT, MitchAlsup wrote:

    Very PDP-6 of you.

    But you really do not want code and data (registers) in the same
    page, so just start at page[1].

    But remember, in those days, impure code was considered a feature, not
    a bug. ;)
  • From Bernd Linsel@bl1-thispartdoesnotbelonghere@gmx.com to comp.arch on Wed May 13 14:02:46 2026
    From Newsgroup: comp.arch

    On 5/12/26 08:11, Stephen Fuld wrote:
    Let me give one possible implementation. There are certainly others.
    Say you have 32 registers. They are "memory mapped" into the first 32
    addresses of memory. So programs would have to start not at zero, but
    at 32 (I know this can cause other problems - I clearly have not thought
    through all of the details.) So now when the CPU encounters a load (or
    store) instruction where the virtual address is less than 32, it is
    resolved not by the memory system, but by the appropriate register.
    i.e. if the virtual address was say 4, the load would be from register
    R4, not memory location 4. Yes, the virtual addressing mechanism
    would have to be sensitive to whether the address was below 32 or not,
    but that is simple within the CPU. Note that the load instruction in
    this case would not touch the memory system at all, so no cache lookups,
    no TLB lookups, etc.

    Why not look at the problem from another perspective?

    Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core,
    accessible in 1 or 2 clocks, and two transfer instructions

    ldqr Rd, <index>
    stqr Rd, <index>

    This should work out perfectly even in a tight vVM loop.
    --
    Bernd Linsel
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed May 13 14:12:02 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/12/2026 11:12 AM, MitchAlsup wrote:

    Let me give one possible implementation. There are certainly others.
    Say you have 32 registers. They are "memory mapped" into the first 32
    addresses of memory.

    Very PDP-6 of you.

    :-) I did say that others had implemented it, and other posters gave
    several examples

    But you really do not want code and data (registers) in the same page,
    so just start at page[1].

    Good Point.

    While true, most modern operating systems mark code pages read-only;
    which would make it impossible to have writable data registers in
    the same page as code anyway.

    The B3500 (1964) had a number of reserved fields in the first 100 digits
    of memory (stack pointer, three index registers,
    indirect field lengths, SCAN instruction storage, an array of
    'insert' characters for the EDT (edit - e.g. COBOL PIC clause
    formatting) instruction, etc). Each program had 1 million digits of
    memory, but due to the max offset of a branch instruction, code was
    limited to the first 300,000 digits.

    When we introduced a new memory management scheme in the early 1980s,
    we introduced a segmented scheme, where segment zero was application
    data (including the reserved fields in the first 100 digits) and
    stack, while segment 1 was code (and read-only). The remaining
    6 segments in an 'environment' were reserved for application data.
    To maintain backward compatibility, the segments were limited to
    1 megadigit in size. The processor supported up to 100,000
    environments per task - where memory area zero was shared across
    all environments. A non-local call instruction would switch
    environments. FWIW, the stack grew towards higher addresses.



    But ?!? what do you do with NULL pointers ??? now you actually can
    dereference them.

    On that system, a NULL pointer was encoded as the six digit
    value @EEEEEE@, which since it contained undigits, was otherwise
    an invalid address. There were instructions to search
    linked lists, and the NULL pointer value was honored by
    the hardware during searches (setting a condition flag
    when encountered by the instruction).
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed May 13 09:09:25 2026
    From Newsgroup: comp.arch

    On 5/12/2026 12:23 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 10:14 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn't this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    It would have to be implemented. How?

    Let me give one possible implementation. There are certainly others.
    Say you have 32 registers. They are "memory mapped" into the first 32
    addresses of memory. So programs would have to start not at zero, but
    at 32 (I know this can cause other problems - I clearly have not thought
    through all of the details.) So now when the CPU encounters a load (or
    store) instruction where the virtual address is less than 32, it is
    resolved not by the memory system, but by the appropriate register.
    i.e. if the virtual address was say 4, the load would be from register
    R4, not memory location 4.

    How is that done? You have a load instruction in the scheduler. How
    does it know when the "memory" location it is waiting for is ready?
    The register renamer knows nothing about the address that the
    instruction accesses, because the address will usually come in at
    run-time (otherwise one would use a direct register access). So
    should every load and store wait for all physical registers to be
    ready that corresponded to logical registers when the instruction came through the register renamer? That would slow down all loads and
    stores. If not, how else should it be done?

    Once we know all the registers are ready, the instruction is admitted
    to the load/store unit. Now you have to add an extra data path that
    accesses the physical register. That's a completely new data path
    with an additional register port that will cost area, and possibly
    gate delays.

    When the load actually has its value, it will continue to the ROB, and
    only when it is retired will all the 32 or 64 registers that it has
    reserved lower their reference count (and of course, you have to
    add hardware for performing this many updates, whereas normal
    instructions only update a few register reference counts).

    And for stores, similar problems show up that also have to be dealt with.

    Hardware has been optimized for fast access to registers encoded
    directly in instructions, and for (in the usual case) fast access to
    memory where the address is determined at run-time.

    There are reasons why architectural features that allowed accessing
    registers with memory instructions have vanished, even before OoO
    execution became popular, and the developments since then, in
    particular OoO execution, have added more reasons.

    Yes, the virtual addressing mechanism
    would have to be sensitive to whether the address was below 32 or not,
    but that is simple within the CPU. Note that the load instruction in
    this case would not touch the memory system at all, so no cache lookups,
    no TLB lookups, etc.

    The alternative is unlikely to be any faster.

    Remember the subject: You suggested this mechanism as a way to
    eliminate the disadvantage of VVM compared to AVX-512 in 8x8 matrix
    multiplication, and the disadvantage was that VVM cannot eliminate
    some memory accesses that AVX-512 can. Turning the registers into
    memory does not solve that, and probably incurs additional costs.
    This cure is worse than the disease.

    The register accesses are not turned into memory accesses.

    Then why map the registers into memory and let load/store instructions
    (or uops) access them?

    I take a good portion of the blame for your over-emphasis on issues with
    the load unit within the CPU and issues with using it for the indexed
    register moves. I repeat, the instruction will almost certainly not be
    implemented internally in the load/store functional unit(s). If the
    virtual address is below the number of registers (easily determined
    internally), it will be handled differently. I chose that syntax
    because it seemed a "natural" fit with the "indexed" nature of the move,
    and because it saved having two new opcodes.


    Having separate read-indexed-reg and write-indexed-reg instructions
    would avoid burdening the load/store unit with this idea,

    Yes. At least it makes clear that the load store unit wouldn't deal
    with this.


    but would
    still suffer from many of the problems mentioned above. There are
    reasons why no modern architecture has such instructions.

    While I agree that no modern architecture has such instructions, neither
    does any have VVM. As you point out, I suggested this as a way to provide
    the functionality that Mitch suggested as a way to overcome the issue
    with VVM that Thomas pointed out in the post that started this thread.
    The question is whether the cost of such an implementation is justified
    to improve VVM. To that question, I leave the answer to others.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed May 13 09:14:08 2026
    From Newsgroup: comp.arch

    On 5/13/2026 5:02 AM, Bernd Linsel wrote:
    On 5/12/26 08:11, Stephen Fuld wrote:
    Let me give one possible implementation. There are certainly others.
    Say you have 32 registers. They are "memory mapped" into the first 32
    addresses of memory. So programs would have to start not at zero, but
    at 32 (I know this can cause other problems - I clearly have not
    thought through all of the details.) So now when the CPU encounters a
    load (or store) instruction where the virtual address is less than
    32, it is resolved not by the memory system, but by the appropriate
    register. i.e. if the virtual address was say 4, the load would be
    from register R4, not memory location 4. Yes, the virtual
    addressing mechanism would have to be sensitive to whether the address
    was below 32 or not, but that is simple within the CPU. Note that the
    load instruction in this case would not touch the memory system at
    all, so no cache lookups, no TLB lookups, etc.

    Why not look at the problem from another perspective?

    Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core, accessible in 1 or 2 clocks, and two transfer instructions

    ldqr Rd, <index>
    stqr Rd, <index>

    This should work out perfectly even in a tight vVM loop.

    Absolutely agree about the new instructions. I was just trying to save
    the two additional op codes. I am not so sure about the extra ram, as
    you would have to maintain consistency with the "real" internal registers.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 13 17:32:39 2026
    From Newsgroup: comp.arch


    Lawrence D'Oliveiro <ldo@nz.invalid> posted:

    On Tue, 12 May 2026 18:12:51 GMT, MitchAlsup wrote:

    Very PDP-6 of you.

    But you really do not want code and data (registers) in the same
    page, so just start at page[1].

    But remember, in those days, impure code was considered a feature, not
    a bug. ;)

    And we barely survived...
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 13 17:36:09 2026
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/12/2026 11:12 AM, MitchAlsup wrote:

    Let me give one possible implementation. There are certainly others.
    Say you have 32 registers. They are "memory mapped" into the first 32
    addresses of memory.

    Very PDP-6 of you.

    :-) I did say that others had implemented it, and other posters gave
    several examples

    But you really do not want code and data (registers) in the same page,
    so just start at page[1].

    Good Point.

    While true, most modern operating systems mark code pages read-only;
    which would make it impossible to have writable data registers in
    the same page as code anyway.

    The B3500 (1964) had a number of reserved fields in the first 100 digits
    of memory (stack pointer, three index registers,
    indirect field lengths, SCAN instruction storage, an array of
    'insert' characters for the EDT (edit - e.g. COBOL PIC clause
    formatting) instruction, etc). Each program had 1 million digits of
    memory, but due to the max offset of a branch instruction, code was
    limited to the first 300,000 digits.

    When we introduced a new memory management scheme in the early 1980s,
    we introduced a segmented scheme, where segment zero was application
    data (including the reserved fields in the first 100 digits) and
    stack, while segment 1 was code (and read-only). The remaining
    6 segments in an 'environment' were reserved for application data.
    To maintain backward compatibility, the segments were limited to
    1 megadigit in size. The processor supported up to 100,000
    environments per task - where memory area zero was shared across
    all environments. A non-local call instruction would switch
    environments. FWIW, the stack grew towards higher addresses.



    But ?!? what do you do with NULL pointers ??? now you actually can
    dereference them.

    On that system, a NULL pointer was encoded as the six digit
    value @EEEEEE@,

    Integers (values in GP registers) no longer have that property.

    which since it contained undigits, was otherwise
    an invalid address. There were instructions to search
    linked lists, and the NULL pointer value was honored by
    the hardware during searches (setting a condition flag
    when encountered by the instruction).
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 13 17:40:05 2026
    From Newsgroup: comp.arch


    Bernd Linsel <bl1-thispartdoesnotbelonghere@gmx.com> posted:

    On 5/12/26 08:11, Stephen Fuld wrote:
    Let me give one possible implementation. There are certainly others.
    Say you have 32 registers. They are "memory mapped" into the first 32
    addresses of memory. So programs would have to start not at zero, but
    at 32 (I know this can cause other problems - I clearly have not thought
    through all of the details.) So now when the CPU encounters a load (or
    store) instruction where the virtual address is less than 32, it is
    resolved not by the memory system, but by the appropriate register.
    i.e. if the virtual address was say 4, the load would be from register
    R4, not memory location 4. Yes, the virtual addressing mechanism
    would have to be sensitive to whether the address was below 32 or not,
    but that is simple within the CPU. Note that the load instruction in
    this case would not touch the memory system at all, so no cache lookups,
    no TLB lookups, etc.

    Why not look at the problem from another perspective?

    Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core, accessible in 1 or 2 clocks, and two transfer instructions

    ldqr Rd, <index>
    stqr Rd, <index>

    This should work out perfectly even in a tight vVM loop.

    For the MatMul case at hand, only Read access to those registers
    (32 count) is required--but when the number of them grows to
    page-size, that is when you also need Write.

    --
    Bernd Linsel

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed May 13 17:55:53 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    I take a good portion of the blame for your over-emphasis on issues with
    the load unit within the CPU and issues with using it for the indexed
    register moves. I repeat, the instruction will almost certainly not be
    implemented internally in the load/store functional unit(s).

    If you encode it as load/store instruction, with the address (that the
    decoder knows nothing about) deciding whether it's a register or an
    address, then it will be treated as load or store and scheduled for
    the load/store unit.

    but would
    still suffer from many of the problems mentioned above. There are
    reasons why no modern architecture has such instructions.

    While I agree that no modern architecture has such instructions, neither
    does any have VVM.

    Sure, and maybe none ever will. I am somewhat sceptical of
    auto-vectorization in all forms, but at least vVM does not go totally
    against the grain of modern high-performance CPU design. Indexed
    access to registers does.

    The question is whether the cost of such an implementation is justified
    to improve VVM. To that question, I leave the answer to others.

    My take is that it isn't.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From Bernd Linsel@bl1-thispartdoesnotbelonghere@gmx.com to comp.arch on Wed May 13 21:35:02 2026
    From Newsgroup: comp.arch

    On 5/13/26 14:02, Bernd Linsel wrote:
    Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core, accessible in 1 or 2 clocks, and two transfer instructions

    ldqr Rd, <index>
    stqr Rd, <index>

    This should work out perfectly even in a tight vVM loop.


    Should of course read

    ldqr Rd, Rs // Rs indexes into ultra-fast on-chip SRAM
    stqr Rs1, Rs2 // Rs2 indexes into ultra-fast on-chip SRAM

    I think "direct addressing" with an immediate index instead of via an
    index in a register is not needed.
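
    To make the proposal concrete, here is a small behavioural sketch of
    such a scratchpad in Python. It is only a model under assumptions: the
    Core class, the 32-register file, and the wrap-around of an
    out-of-range index are choices made for this example, not something
    the proposal specifies. The operand order follows the corrected
    ldqr/stqr forms above.

```python
# Hypothetical model of a 4 KB core-local scratchpad: 512 entries of 64 bits.
# The index wrap-around (modulo 512) is an assumption made for this sketch;
# real hardware might trap or ignore the upper bits instead.

MASK64 = (1 << 64) - 1
ENTRIES = 512


class Core:
    def __init__(self):
        self.regs = [0] * 32           # architectural registers R0..R31
        self.scratch = [0] * ENTRIES   # the extra on-chip SRAM

    def ldqr(self, rd: int, rs: int) -> None:
        """ldqr Rd, Rs: Rd <- scratch[Rs]"""
        index = self.regs[rs] % ENTRIES
        self.regs[rd] = self.scratch[index]

    def stqr(self, rs1: int, rs2: int) -> None:
        """stqr Rs1, Rs2: scratch[Rs2] <- Rs1"""
        index = self.regs[rs2] % ENTRIES
        self.scratch[index] = self.regs[rs1] & MASK64
```

    Because the scratchpad is separate storage rather than a second view
    of the register file, nothing in this model has to be kept consistent
    with the renamed registers; the price is an explicit copy in each
    direction.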
    --
    Bernd Linsel
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 13 20:46:05 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    I take a good portion of the blame for your over-emphasis on issues with
    the load unit within the CPU and issues with using it for the indexed
    register moves. I repeat, the instruction will almost certainly not be
    implemented internally in the load/store functional unit(s).

    If you encode it as load/store instruction, with the address (that the
    decoder knows nothing about) deciding whether it's a register or an
    address, then it will be treated as load or store and scheduled for
    the load/store unit.

    but would
    still suffer from many of the problems mentioned above. There are
    reasons why no modern architecture has such instructions.

    While I agree that no modern architecture has such instructions, neither
    does any have VVM.

    Sure, and maybe none ever will.

    Several GPUs have such functionality. I doubt anyone could call a
    GPU non-modern.

    I am somewhat sceptical of
    auto-vectorization in all forms, but at least vVM does not go totally
    against the grain of modern high-performance CPU design. Indexed
    access to registers does.

    The question is whether the cost of such an implementation is justified
    to improve VVM. To that question, I leave the answer to others.

    My take is that it isn't.

    With 32 registers it is definitely not needed.
    But, I suspect, that at some register count it does become needed;
    possibly around 128+.

    - anton
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 13 20:52:39 2026
    From Newsgroup: comp.arch


    Bernd Linsel <bl1-thispartdoesnotbelonghere@gmx.com> posted:

    On 5/13/26 14:02, Bernd Linsel wrote:
    Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core, accessible in 1 or 2 clocks, and two transfer instructions

    ldqr Rd, <index>
    stqr Rd, <index>

    This should work out perfectly even in a tight vVM loop.


    Should of course read

    ldqr Rd, Rs // Rs indexes into ultra-fast on-chip SRAM
    stqr Rs1, Rs2 // Rs2 indexes into ultra-fast on-chip SRAM

    I think "direct addressing" with an immediate index instead of via an
    index in a register is not needed.

    How do you access a different register each loop iteration ???
    if you don't have indexing ???

  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu May 14 09:39:34 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> posted:

    On Tue, 12 May 2026 18:12:51 GMT, MitchAlsup wrote:

    Very PDP-6 of you.

    But you really do not want code and data (registers) in the same
    page, so just start at page[1].

    But remember, in those days, impure code was considered a feature, not
    a bug. ;)

    And we barely survived...

    We all remember Mel, right?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu May 14 08:05:55 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    You have a load instruction in the scheduler. How
    does it know when the "memory" location it is waiting for is ready?

    The scheduler is only waiting for operand registers to be ready.

    Let's assume that you have


    A: load r1<-mem_location
    B: load r5<-mem_location
    C: load r2<-(r1)
    D: add r5<-r5+1


    and the architectural value loaded into r1 is the address of r5. What
    "operand registers" should the instruction C wait for?

    1) Should it wait for r1? If so, how does it know when it can access r5?
    How does it know which physical register r5 lives in?

    One might forward C to another scheduler that waits for r5. But how
    does that scheduler know in which physical register r5 resides? And
    how does the register renamer know that D is not the only user of
    B:r5? Ok, in a simple OoO implementation D retires after C, and B:r5
    is not freed until then, but I can imagine optimizations that
    invalidate this principle; such optimizations would have to be
    disabled when adding such a feature.

    2) Should it wait for r1 and r5? How does it learn that it should
    wait for r5?

    3) Should it wait for r1, and when that is known and points to the
    address of r5, enter a scheduler that waits for r5? Again, how does
    it learn what the physical register for r5 is at C? And how is it
    known that D is not the only user?
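
    The renaming problem behind these three questions can be shown with a
    toy model. The sketch below is illustrative only: the Renamer class
    and its one-counter allocation are invented for this example, and a
    real renamer also has free lists, checkpoints and reference counts
    that are left out. It renames the four instructions A to D from the
    example above.

```python
# Toy register renamer, written only to illustrate the A..D example.
# All names here are hypothetical; no real microarchitecture is modelled.

class Renamer:
    def __init__(self, num_logical: int = 32):
        # Logical register i starts out mapped to physical register i.
        self.map = list(range(num_logical))
        self.next_phys = num_logical

    def rename(self, dest: int, sources: list[int]) -> tuple[int, list[int]]:
        """Look up the sources in the current map, then give dest a fresh
        physical register. Returns (dest_phys, source_phys_list)."""
        src_phys = [self.map[s] for s in sources]
        self.map[dest] = self.next_phys
        self.next_phys += 1
        return self.map[dest], src_phys


r = Renamer()
a = r.rename(dest=1, sources=[])    # A: load r1 <- mem_location
b = r.rename(dest=5, sources=[])    # B: load r5 <- mem_location
c = r.rename(dest=2, sources=[1])   # C: load r2 <- (r1)
r5_at_c = r.map[5]                  # what C would need, but never records
d = r.rename(dest=5, sources=[5])   # D: add r5 <- r5 + 1
```

    In this model C leaves the renamer knowing only the physical register
    for r1 (32). The mapping of r5 that was current at C (33, written by
    B) is not among C's operands, and by the time r1's value is known the
    map already points r5 at D's result (35). Recovering the right mapping
    would take either a per-instruction snapshot of the map or a replay.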

    If you add storing to registers in indexed ways, additional issues
    come into play:

    A: load r1<-mem_location
    B: load r5<-mem_location
    C: store r2->(r1)
    D: add r5<-r5+1

    Now the renamer has to deal with the possibility that C may or may not
    store to the architectural register that B and D access.

    One solution for both the load and the store problems would be to
    predict the accessed register (if any) and do a misprediction recovery
    if the prediction turns out to be wrong. Recovering from
    misprediction tends to be slow (I use 20 cycles for estimating the
    cost of a misprediction (including follow-on cost), and that tends to
    be a pretty good estimate in many cases). One would need a good
    (area-expensive) predictor to deal with Matmul in VVM, and to compete
    with the AVX-512 code, IIRC the hardware would need the ability to
    deal with 8 or so such accesses per cycle.
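
    A back-of-the-envelope version of that estimate, as a sketch. The 20
    cycle cost and the 8 accesses per cycle come from the paragraph above;
    the miss rates are free parameters, and the model assumes (simplistically)
    that penalties never overlap. The function names are made up for this
    example.

```python
# Rough cost model for predicting the indexed register.
# Assumption: each misprediction costs a flat 20 cycles and penalties
# do not overlap, which overstates the cost on real hardware.

MISPREDICT_COST = 20  # cycles, the estimate quoted in the post


def expected_penalty_per_access(miss_rate: float) -> float:
    """Average extra cycles charged to one indexed register access."""
    return miss_rate * MISPREDICT_COST


def penalty_per_issue_cycle(miss_rate: float, accesses_per_cycle: int = 8) -> float:
    """Average extra cycles accumulated per cycle of useful work when
    accesses_per_cycle indexed accesses are issued every cycle."""
    return accesses_per_cycle * expected_penalty_per_access(miss_rate)
```

    Under these assumptions even a predictor that is wrong on only 1% of
    accesses accumulates more penalty cycles than useful cycles at 8
    accesses per cycle, which is why the predictor would have to be very
    good.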

    The register renamer knows nothing about the address that the
    instruction accesses, because the address will usually come in at
    run-time (otherwise one would use a direct register access). So
    should every load and store wait for all physical registers to be
    ready that corresponded to logical registers when the instruction came
    through the register renamer? That would slow down all loads and
    stores. If not, how else should it be done?

    Only use it when you don't care about that stuff.

    Don't care about what stuff? If you only use it when you don't care
    what it does, why use it at all, and why implement it at all?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu May 14 18:20:58 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    You have a load instruction in the scheduler. How
    does it know when the "memory" location it is waiting for is ready?

    The scheduler is only waiting for operand registers to be ready.

    Let's assume that you have


    A: load r1<-mem_location
    B: load r5<-mem_location
    C: load r2<-(r1)
    D: add r5<-r5+1


    and the architectural value loaded into r1 is the address of r5. What
    "operand registers" should the instruction C wait for?

    The value loaded in A

    1) Should it wait for r1? If so, how does it know when it can access r5?
    How does it know which physical register r5 lives in?

    Reasons I do not like this thread accessing this thread's registers
    using memory.
    ------------------------

    2) Should it wait for r1 and r5? How does it learn that it should
    wait for r5?

    C waits only for R1
    D waits only for R5

    3) Should it wait for r1, and when that is known and points to the
    address of r5, enter a scheduler that waits for r5? Again, how does
    it learn what the physical register for r5 is at C? And how is it
    known that D is not the only user?

    In the case where this thread accesses that thread's registers, once
    the access arrives, the access then waits for the instruction stream
    to quiesce and then takes what is in R5.

    In the cases where this thread accesses this thread's registers you have
    no clean abstraction of which R5 you are accessing--and no way back to
    the register map when R1 was loaded. Thus, you might have to take a
    branch misprediction repair back to A after discovering that R1 points
    at R5--killing any performance advantage.

    If you add storing to registers in indexed ways, additional issues
    come into play:

    A: load r1<-mem_location
    B: load r5<-mem_location
    C: store r2->(r1)
    D: add r5<-r5+1

    Now the renamer has to deal with the possibility that C may or may not
    store to the architectural register that B and D access.

    Yep, backup is likely the only clean way through this.

    One solution for both the load and the store problems would be to
    predict the accessed register (if any) and do a misprediction recovery
    if the prediction turns out to be wrong.

    Compounding the complexity of getting it right.
    -----------------------

    - anton
  • From Bernd Linsel@bl1-thispartdoesnotbelonghere@gmx.com to comp.arch on Fri May 15 00:03:47 2026
    From Newsgroup: comp.arch

    On 5/13/26 22:52, MitchAlsup wrote:

    Bernd Linsel <bl1-thispartdoesnotbelonghere@gmx.com> posted:

    On 5/13/26 14:02, Bernd Linsel wrote:
    Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core,
    accessible in 1 or 2 clocks, and two transfer instructions

    ldqr Rd, <index>
    stqr Rd, <index>

    This should work out perfectly even in a tight vVM loop.


    Should of course read

    ldqr Rd, Rs // Rs indexes into ultra-fast on-chip SRAM
    stqr Rs1, Rs2 // Rs2 indexes into ultra-fast on-chip SRAM

    I think "direct addressing" with an immediate index instead of via an
    index in a register is not needed.

    How do you access a different register each loop iteration ???
    if you don't have indexing ???


    It's meant as:

    ld Rd, qregs[Rs] and
    st Rs1, qregs[Rs2],

    i.e. the second register serves as an index into the "quick regs" local
    SRAM bank. Allowing only aligned full-word accesses should be
    sufficient, so that these are really indices, not addresses.
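    A minimal sketch of these semantics follows, as a toy software model
    rather than a hardware description. The function names mirror the
    mnemonics above; the register dictionary and the wrap-around on
    out-of-range indices are assumptions made for the example (a real
    design might trap instead).

```python
# Toy model of the proposed "quick regs" bank: 512 words of 64 bits,
# addressed by a word index held in a register.

QREG_COUNT = 512
MASK64 = (1 << 64) - 1

qregs = [0] * QREG_COUNT
regs = {"R1": 0, "R2": 0, "R3": 0}

def stqr(rs1, rs2):
    """st Rs1, qregs[Rs2]: store register rs1 at the index held in rs2."""
    qregs[regs[rs2] % QREG_COUNT] = regs[rs1] & MASK64

def ldqr(rd, rs):
    """ld Rd, qregs[Rs]: load register rd from the index held in rs."""
    regs[rd] = qregs[regs[rs] % QREG_COUNT]

# The index register changes each iteration, so each pass touches a
# different quick register.
for i in range(4):
    regs["R1"], regs["R2"] = i * 10, i
    stqr("R1", "R2")

regs["R2"] = 3
ldqr("R3", "R2")
print(regs["R3"])
```

    Because the operands are word indices, there is no alignment or
    partial-word case to handle in the model.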
    --
    Bernd Linsel
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu May 14 16:35:31 2026
    From Newsgroup: comp.arch

    On 5/14/2026 3:03 PM, Bernd Linsel wrote:
    On 5/13/26 22:52, MitchAlsup wrote:

    Bernd Linsel <bl1-thispartdoesnotbelonghere@gmx.com> posted:

    On 5/13/26 14:02, Bernd Linsel wrote:
    Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core,
    accessible in 1 or 2 clocks, and two transfer instructions

    ldqr Rd, <index>
    stqr Rd, <index>

    This should work out perfectly even in a tight vVM loop.


    Should of course read

    ldqr Rd, Rs // Rs indexes into ultra-fast on-chip SRAM
    stqr Rs1, Rs2 // Rs2 indexes into ultra-fast on-chip SRAM

    I think "direct addressing" with an immediate index instead of via an
    index in a register is not needed.

    How do you access a different register each loop iteration ???
    if you don't have indexing ???


    It's meant as:

    ld Rd, qregs[Rs] and
    st Rs1, qregs[Rs2],

    i.e. the second register as index into the "quick regs" local SRAM bank,
    Only aligned full word access possible should be sufficient, so that
    these are really indices, not addresses.

    I must be missing something. Doesn't this quick regs memory have to be
    saved and restored on each context switch? If so, that is very expensive.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 15 00:17:23 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 5/14/2026 3:03 PM, Bernd Linsel wrote:
    On 5/13/26 22:52, MitchAlsup wrote:

    Bernd Linsel <bl1-thispartdoesnotbelonghere@gmx.com> posted:

    On 5/13/26 14:02, Bernd Linsel wrote:
    Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core,
    accessible in 1 or 2 clocks, and two transfer instructions

    ldqr Rd, <index>
    stqr Rd, <index>

    This should work out perfectly even in a tight vVM loop.


    Should of course read

    ldqr Rd, Rs // Rs indexes into ultra-fast on-chip SRAM
    stqr Rs1, Rs2 // Rs2 indexes into ultra-fast on-chip SRAM

    I think "direct addressing" with an immediate index instead of via an
    index in a register is not needed.

    How do you access a different register each loop iteration ???
    if you don't have indexing ???


    It's meant as:

    ld Rd, qregs[Rs] and
    st Rs1, qregs[Rs2],

    OK, that solves the indexing issue.

    i.e. the second register as index into the "quick regs" local SRAM bank,
    Only aligned full word access possible should be sufficient, so that
    these are really indices, not addresses.

    I must be missing something. Doesn't this quick regs memory have to be
    saved and restored on each context switch? If so, that is very expensive.

    qregs[] is (IS) the actual register file (or files)--so, no added state.
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu May 14 12:19:43 2026
    From Newsgroup: comp.arch

    Stephen Fuld [2026-05-11 23:11:07] wrote:
    Let me give one possible implementation. There are certainly others. Say
    you have 32 registers. They are "memory mapped" into the first 32
    addresses of memory. So programs would have to start not at zero, but at
    32 (I know this can cause other problems - I clearly have not thought
    through all of the details.) So now when the CPU encounters a load (or
    store) instruction where the virtual address is less than 32, it is
    resolved not by the memory system, but by the appropriate register. i.e.
    if the virtual address was say 4, the load would be from register R4,
    not memory location 4. Yes, the virtual addressing mechanism would have
    to be sensitive to whether the address was below 32 or not, but that is
    simple within the CPU. Note that the load instruction in this case would
    not touch the memory system at all, so no cache lookups, no TLB lookups,
    etc.
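    The address check described above could be modelled roughly as
    follows. This is only a toy sketch: the dictionary standing in for
    memory and the function names are invented for the example, and it
    models the architectural behaviour, not the timing.

```python
# Toy model of "memory-mapped" registers: loads and stores with an address
# below NUM_REGS are served by the register file, everything else by
# memory. The dict-based memory and the names are illustrative only.

NUM_REGS = 32

regs = [0] * NUM_REGS
memory = {}

def load(addr):
    if addr < NUM_REGS:
        return regs[addr]          # resolved by the register, no cache or TLB
    return memory.get(addr, 0)

def store(addr, value):
    if addr < NUM_REGS:
        regs[addr] = value
    else:
        memory[addr] = value

regs[4] = 1234
store(100, 5678)
print(load(4), load(100))
```

    The comparison against 32 is the easy part; the model says nothing
    about how an out-of-order core would find the right physical register.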

    That solves the problem of encoding an indirect register access as
    a LD/ST instruction, but I highly doubt that's the main problem
    introduced by indirect register access.

    It'd actually be easier to just add a new instruction for indirect
    register access (no need to burden the load/store unit, no need to worry
    about access size and alignment, memory remapping, and whatnot).

    The implementation problem, AFAIK, comes in with OoO: by the time your
    instruction (whether a load or a dedicated instruction) gets to know
    which register it needs to read, we're in the middle of the OoO engine,
    and the first thing it needs to do is to figure out which physical
    register corresponds to this logical register (and it needs to find out
    also if that physical register's value has already been delivered).
    The needed information is definitely out there somewhere in the CPU,
    but I'm not sure it can be made available cheaply at that time&place.

    Some GPUs offer such indirect register addressing, but they have a very
    different microarchitecture, much more like a barrel processor, with no
    OoO execution in sight.


    === Stefan
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri May 15 06:29:37 2026
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    It'd actually be easier to just add a new instruction for indirect
    register access (no need to burden the load/store unit, no need to worry
    about access size and alignment, memory remapping, and whatnot).

    Yes.

    The implementation problem, AFAIK comes in with OoO: by the time your
    instruction (whether a load or a dedicated instruction) gets to know
    which register it needs to read, we're in the middle of the OoO engine,
    and the first thing it needs to do is to figure out which physical
    register corresponds to this logical register (and it needs to find out
    also if that physical register's value has already been delivered).
    The needed information is definitely out there somewhere in the CPU,
    but I'm not sure it can be made available cheaply at that time&place.

    Moreover, even if that problem is solved, what is the supposed benefit
    compared to accessing the data in the D-cache? I guess that Stephen
    Fuld expects a shorter latency for read accesses, but I doubt that one
    would get this from such instructions. The reason why D-cache
    accesses take 3-4 cycles is mainly the size of the D-cache at 4KB per
    way (on recent AMD64 processors) or 16KB per way (on Apple Silicon),
    plus way selection from 8-12 ways. The size of the register files in
    bytes is a little smaller in total, but larger than a single way
    (e.g., on Strix Point (Zen 5 narrow), ~20KB for the ~234 zmm and ~150
    additional ymm registers), and the area is comparable (each register
    bit takes more area than a cache bit, because register files have more
    ports):

    https://www.guru3d.com/data/publish/223/54520bdd20560bcbc963979637025fd69682f6/afnviogfwsce6yxo.webp
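    The ~20KB figure can be checked from the register counts given above,
    assuming 64 bytes per zmm register and 32 bytes per ymm register (the
    counts themselves are the approximate ones quoted in this post):

```python
# Size of the vector register file from the approximate counts in the post.
ZMM_COUNT, ZMM_BYTES = 234, 64   # 512-bit registers
YMM_COUNT, YMM_BYTES = 150, 32   # 256-bit registers

total_bytes = ZMM_COUNT * ZMM_BYTES + YMM_COUNT * YMM_BYTES
print(total_bytes, "bytes =", total_bytes / 1024, "KiB")

# Per-way D-cache sizes mentioned in the post, for comparison.
for name, way_kib in (("AMD64, 4KB per way", 4), ("Apple Silicon, 16KB per way", 16)):
    print(name, "-> register file is", round(total_bytes / 1024 / way_kib, 2), "ways' worth")
```

    That comes to 19776 bytes, a little over 19 KiB, so larger than a
    single way in either case.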

    For the register-indexing instruction, no way selection is necessary,
    but first the architectural-to-physical-register translation has to be
    done, and unlike the TLB access for memory accesses, this cannot be
    done in parallel to the actual access.

    Plus, for stores the load/store units have the store buffer, which can
    also forward data to loads (if this optimization is not applicable,
    the latency of store-to-load forwarding is around 20 cycles in recent
    uarchs, see <https://www.complang.tuwien.ac.at/anton/stwlf/>). What
    kind of forwarding will there be that involves indexed register
    accesses? Current CPUs have lots of forwarding paths, but they all
    work with register numbers coming out of the register renamer.

    So overall, I don't expect the read latency of such a
    register-indexing instruction to be smaller. Why is this not an issue
    in non-indexed register accesses? Because the actual register numbers
    are known early (after register renaming), so the register reads can
    start early, and forwarding paths (also working with the register
    numbers that are known early) eliminate the result-out to operand-in
    latency.

    Some GPUs offer such indirect register addressing, but they have a very
    different microarchitecture, much more like a barrel processor, with no
    OoO execution in sight.

    I know little about GPU architectures, but they work on problems with
    lots of data parallelism, so minimizing latency is probably a lot less
    important than for CPUs. AFAIK they use in-order execution and no
    register renaming, which makes Stephen Fuld's idea a lot more
    workable.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri May 15 09:19:18 2026
    From Newsgroup: comp.arch

    On 2026-May-14 12:19, Stefan Monnier wrote:
    Stephen Fuld [2026-05-11 23:11:07] wrote:
    Let me give one possible implementation. There are certainly others. Say
    you have 32 registers. They are "memory mapped" into the first 32
    addresses of memory. So programs would have to start not at zero, but at
    32 (I know this can cause other problems - I clearly have not thought
    through all of the details.) So now when the CPU encounters a load (or
    store) instruction where the virtual address is less than 32, it is
    resolved not by the memory system, but by the appropriate register. i.e.
    if the virtual address was say 4, the load would be from register R4,
    not memory location 4. Yes, the virtual addressing mechanism would have
    to be sensitive to whether the address was below 32 or not, but that is
    simple within the CPU. Note that the load instruction in this case would
    not touch the memory system at all, so no cache lookups, no TLB lookups,
    etc.

    That solves the problem of encoding an indirect register access as
    a LD/ST instruction, but I highly doubt that's the main problem
    introduced by indirect register access.

    It'd actually be easier to just add a new instruction for indirect
    register access (no need to burden the load/store unit, no need to worry
    about access size and alignment, memory remapping, and whatnot).

    The implementation problem, AFAIK comes in with OoO: by the time your
    instruction (whether a load or a dedicated instruction) gets to know
    which register it needs to read, we're in the middle of the OoO engine,
    and the first thing it needs to do is to figure out which physical
    register corresponds to this logical register (and it needs to find out
    also if that physical register's value has already been delivered).
    The needed information is definitely out there somewhere in the CPU,
    but I'm not sure it can be made available cheaply at that time&place.

    Some GPUs offer such indirect register addressing, but they have a very
    different microarchitecture, much more like a barrel processor, with no
    OoO execution in sight.


    === Stefan

    Yes, you don't want the front end to be dependent on the back end
    or it would have to stall at decode every time it saw one of
    these dependencies.

    IIUC it needs to do something akin to macro substitution of the
    instruction source or destination register numbers, but in Decode.
    Something like having a small set of register number registers
    with copies managed by Decode,
    that can be set or added to with small constant values,
    and copied into the instruction's register fields in Decode.
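
    A toy sketch of that idea, purely to make it concrete: the ("rnr", k)
    field encoding, the two-entry rnr list and the helper names are all
    hypothetical, invented for this example.

```python
# Decode-time substitution of register numbers. A register field is either
# a literal register number or ("rnr", k), meaning "use the register number
# currently held in register number register k".

rnr = [0, 0]   # register number registers, managed by Decode

def set_rnr(k, value):
    rnr[k] = value

def add_rnr(k, delta):
    rnr[k] += delta

def decode(op, fields):
    """Replace ("rnr", k) fields by the register number held in rnr[k]."""
    return op, tuple(rnr[f[1]] if isinstance(f, tuple) else f for f in fields)

set_rnr(0, 8)
decoded = []
for _ in range(4):
    decoded.append(decode("add", (1, 1, ("rnr", 0))))  # add r1 <- r1 + r[rnr0]
    add_rnr(0, 1)

print(decoded)
```

    After Decode, each instruction carries an ordinary register number
    (r8, r9, r10, r11 here), so the renamer and the back end would not need
    to know that indexing was involved.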


  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri May 15 16:59:02 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn't this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    There could be several variants:

    a) Fixed offsets allowed only. Matmul loops would have to be
    unrolled then.

    b) Indexed relative to the pointer, possibly with offset.
    This would need value prediction for the index.
    For something repetitive such as matrix multiplication,
    this should be doable.

    This would basically extend the ISA with extra "registers".

    What about conflicts? Any load or store to that memory region is
    likely to cause problems, so it might be better to protect this
    memory so that even another thread cannot access it. I do not
    think that any current architecture supports this.

    Would this help? It would allow indexing; otherwise it would
    just be another class of almost-registers. The base ISA would
    not have to be changed, but probably the ABI (which is almost
    as bad).

    Is it a good idea? Yes. No. Maybe. I don't know.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri May 15 22:30:26 2026
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn't this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size. A certain portion
    of the virtual address space could be then mapped to, for example,
    a 4KB bank of high-speed SRAM.
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat May 16 10:22:34 2026
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn't this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    A certain portion
    of the virtual address space could be then mapped to, for example,
    a 4KB bank of high-speed SRAM.

    That could compete with cache, and still cause memory traffic.
    I am not sure how this would compare with just loading the
    values into the cache on the first iteration.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 16 18:09:17 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn't this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    That would prevent thread[k] from allowing thread[j] access to its
    thread local store via shared pointer.

    A certain portion
    of the virtual address space could be then mapped to, for example,
    a 4KB bank of high-speed SRAM.

    That could compete with cache, and still cause memory traffic.
    I am not sure how this would compare with just loading the
    values into the cache on the first iteration.

  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat May 16 18:11:56 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn't this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    That would prevent thread[k] from allowing thread[j] access to its
    thread local store via shared pointer.

    Not for all of the thread's memory, I was thinking of this as a
    separate flag, to be set only for special purposes (such as above).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 16 22:59:22 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn't this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    That would prevent thread[k] from allowing thread[j] access to its
    thread local store via shared pointer.

    Not for all of the thread's memory, I was thinking of this as a
    separate flag, to be set only for special purposes (such as above).

    How does one {programmer or OS} glean that the bit can be set ??
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun May 17 07:51:02 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn't this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    That would prevent thread[k] from allowing thread[j] access to its
    thread local store via shared pointer.

    Not for all of the thread's memory, I was thinking of this as a
    separate flag, to be set only for special purposes (such as above).

    How does one {programmer or OS} glean that the bit can be set ??

    The OS could learn by special argument to mmap(), for example.

    ABIs could specify a second stack for local variables which are
    known, by language rules, not to be accessed by other threads -
    an alloca-version, for example.

    Renaming could then be done relative to that second stack pointer.

    Drawback: This would increase calling overhead.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun May 17 18:51:12 2026
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn't this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    A certain portion
    of the virtual address space could be then mapped to, for example,
    a 4KB bank of high-speed SRAM.

    That could compete with cache, and still cause memory traffic.

    The OS can designate that page as 'noncacheable', so no
    coherency traffic necessary. It would simply be a faster
    page of memory, with access times closer to cache than DRAM
    and shared by multiple cores (with appropriate software care).


  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon May 18 09:56:04 2026
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn't this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variables addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    A certain portion
    of the virtual address space could be then mapped to, for example,
    a 4KB bank of high-speed SRAM.

    That could compete with cache, and still cause memory traffic.

    The OS can designate that page as 'noncacheable', so no
    coherency traffic necessary. It would simply be a faster
    page of memory, with access times closer to cache than DRAM
    and shared by multiple cores (with appropriate software care).

    That is of course a possibility.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 18 17:50:49 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn't this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variables addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    A certain portion
    of the virtual address space could be then mapped to, for example,
    a 4KB bank of high-speed SRAM.

    That could compete with cache, and still cause memory traffic.

    The OS can designate that page as 'noncacheable', so no
    coherency traffic necessary.

    An uncacheable page should not show up in any cache, and on most
    machines its data travels around the system in data-unit sizes
    rather than cache-line sizes.

    It would simply be a faster
    page of memory, with access times closer to cache than DRAM

    I cannot see how an uncacheable unit of data can approach L1 cache
    latency.

    and shared by multiple cores (with appropriate software care).

    That is of course a possibility.

  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon May 18 11:02:17 2026
    From Newsgroup: comp.arch

    On 5/2/2026 11:46 AM, MitchAlsup wrote:

    big snip


    Thomas Koenig <tkoenig@netcologne.de> posted:
    One problem I see is memory traffic. In the SIMD version, A is
    loaded once at the beginning of the loop. Here, it is loaded N**2
    times, with different offsets each VVM iteration, vs only once
    for the AVX512 version. Also, C is loaded and stored N**2 times,
    vs. only once. (The AVX version also loads B only once).

    The LDD using R6 as an index can be hoisted into Loop2 prologue.
    {I did miss that}.

    With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
    the loop is 6-cycles, so the 8-wide machine would run the loop in
    8-cycles of latency.

    Plus, the setup time for VVM...

    I have been thinking about this overnight and may have a solution
    that alters only the VEC instruction.

    Any progress?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 18 20:20:54 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 5/2/2026 11:46 AM, MitchAlsup wrote:

    big snip


    Thomas Koenig <tkoenig@netcologne.de> posted:
    One problem I see is memory traffic. In the SIMD version, A is
    loaded once at the beginning of the loop. Here, it is loaded N**2
    times, with different offsets each VVM iteration, vs only once
    for the AVX512 version. Also, C is loaded and stored N**2 times,
    vs. only once. (The AVX version also loads B only once).

    The LDD using R6 as an index can be hoisted into Loop2 prologue.
    {I did miss that}.

    With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
    the loop is 6-cycles, so the 8-wide machine would run the loop in
    8-cycles of latency.

    Plus, the setup time for VVM...

    I have been thinking about this overnight and may have a solution
    that alters only the VEC instruction.

    Any progress?

    A bit.

    To recover from interrupts while performing multi-memory operation*,
    there is a count register (line aligned) in Thread.Header. By using
    this register instead of the Rd supplied by VEC, exceptions and
    interrupts can be recovered--leaving me 5-bits to more fully express
    VEC functionality.

    (*) MM {memory to memory move} and MS {memory set}

    I was thinking of using some of Rd's bits to describe the width of the
    loop in lanes.

    The encoding would use 0 to mean "as many as you have", and other
    numbers to indirectly specify a loop recurrence that prevents running
    wider than Rd used as an immediate. Thus, if the compiler found a
    recurrence preventing width, it is expressed and the HW does not have
    to go looking {simplifying DECODE a bit}.
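
    A rough sketch of how such a width field might be interpreted. The
    post only says the non-zero values specify the limit "indirectly", so
    the direct mapping below (field value = maximum lanes) is an
    assumption made purely for illustration, as are the names
    effective_lanes, width_field and hw_lanes.

```python
def effective_lanes(width_field, hw_lanes):
    """Return how many lanes a VEC loop may run in this toy model.

    width_field: hypothetical 5-bit field taken from the Rd position of VEC.
    hw_lanes:    number of lanes the implementation actually has.
    """
    if not 0 <= width_field < 32:
        raise ValueError("width_field must fit in 5 bits")
    if width_field == 0:
        # 0 means "as many as you have".
        return hw_lanes
    # Assumption: a non-zero value caps the width directly, standing in for
    # a compiler-detected loop recurrence that forbids running wider.
    return min(width_field, hw_lanes)


no_limit = effective_lanes(0, 8)       # no recurrence: use the full machine
recurrence_3 = effective_lanes(3, 8)   # recurrence limits the loop to 3 lanes
small_machine = effective_lanes(16, 8) # limit is wider than the hardware
```

    The point of the sketch is only that the hardware reads the limit
    from the instruction instead of discovering the recurrence itself.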
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon May 18 14:18:17 2026
    From Newsgroup: comp.arch

    On 5/14/2026 9:19 AM, Stefan Monnier wrote:
    Stephen Fuld [2026-05-11 23:11:07] wrote:
    Let me give one possible implementation. There are certainly others. Say
    you have 32 registers. They are "memory mapped" into the first 32 addresses
    of memory. So programs would have to start not at zero, but at 32 (I know
    this can cause other problems - I clearly have not thought through all of
    the details.) So now when the CPU encounters a load (or store) instruction
    where the virtual address is less than 32, it is resolved not by the memory
    system, but by the appropriate register. i.e. if the virtual address was say
    4, the load would be from register R4, not memory location 4. Yes, the
    virtual addressing mechanism would have to be sensitive to whether the
    address was below 32 or not, but that is simple within the CPU. Note that
    the load instruction in this case would not touch the memory system at all,
    so no cache lookups, no TLB lookups, etc.
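
    The scheme described above can be sketched as a toy software model.
    This is only an illustration under the stated assumptions (32
    registers at addresses 0..31, word-granular addresses); ToyCore,
    mem_accesses and the memory size are hypothetical names and values,
    not part of any real ISA.

```python
NUM_REGS = 32  # from the post: 32 registers mapped at addresses 0..31


class ToyCore:
    """Toy model: addresses below NUM_REGS name registers, the rest name memory."""

    def __init__(self, mem_words=256):
        self.regs = [0] * NUM_REGS
        self.mem = [0] * mem_words
        self.mem_accesses = 0  # counts accesses that reach the memory system

    def load(self, addr):
        if addr < NUM_REGS:
            # Resolved by the register file: no cache or TLB lookup in this model.
            return self.regs[addr]
        self.mem_accesses += 1
        return self.mem[addr]

    def store(self, addr, value):
        if addr < NUM_REGS:
            self.regs[addr] = value
            return
        self.mem_accesses += 1
        self.mem[addr] = value


core = ToyCore()
core.regs[4] = 1234                # value sitting in R4
index = 4                          # index computed at run time
value_from_r4 = core.load(index)   # "load from address 4" reads R4
core.store(100, value_from_r4)     # an ordinary store still goes to memory
```

    In this model only the store to address 100 touches memory; the
    indexed read of R4 never leaves the register file.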

    That solves the problem of encoding an indirect register access as
    a LD/ST instruction, but I highly doubt that's the main problem
    introduced by indirect register access.

    It'd actually be easier to just add a new instruction for indirect
    register access (no need to burden the load/store unit, no need to
    worry about access size and alignment, memory remapping, and whatnot).

    Fair enough. I was motivated by saving an op code. But the confusion
    that has generated has led me to agree with you about using new op
    codes. But a note - I was assuming it wouldn't actually be executed by
    the load/store unit - the use of load/store was "syntactic sugar".


    The implementation problem, AFAIK comes in with OoO: by the time your
    instruction (whether a load or a dedicated instruction) gets to know
    which register it needs to read, we're in the middle of the OoO engine,
    and the first thing it needs to do is to figure out which physical
    register corresponds to this logical register (and it needs to find out
    also if that physical register's value has already been delivered).
    The needed information is definitely out there somewhere in the CPU,
    but I'm not sure it can be made available cheaply at that time&place.
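
    A toy illustration of the renaming problem described above. It
    assumes a simple logical-to-physical map and uses a full snapshot of
    that map as a stand-in for "the needed information"; Renamer and its
    methods are hypothetical, and a snapshot per indirect access is shown
    only to make the dependency visible, not as a proposed design.

```python
class Renamer:
    """Minimal logical-to-physical register map, as kept by the front end."""

    def __init__(self, num_logical=32):
        self.map = list(range(num_logical))  # logical i starts in physical i
        self.next_phys = num_logical

    def rename_dest(self, logical):
        # Every write to a logical register gets a fresh physical register.
        self.map[logical] = self.next_phys
        self.next_phys += 1
        return self.map[logical]

    def rename_src(self, logical):
        return self.map[logical]

    def snapshot(self):
        return list(self.map)


r = Renamer()
p_old = r.rename_dest(4)     # older instruction writes R4
snap = r.snapshot()          # front end passes the indirect read "Rd = R[Rs]"
p_new = r.rename_dest(4)     # younger instruction writes R4 again

index = 4                    # value of Rs, known only at execute time
correct = snap[index]        # mapping as of the indirect read's program position
stale_lookup = r.rename_src(index)  # current map: names the younger write
```

    By the time the index is known, the current map already points at the
    younger write, so the read needs the mapping from its own position in
    program order. The model ignores the second half of the problem,
    namely finding out whether that physical register's value has been
    delivered yet.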

    Good point. I have some ideas about how to do it, but they are not
    cheap. :-(. But if the savings in a common application of VVM is big
    enough it might be worth it. I just don't know.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon May 18 14:22:25 2026
    From Newsgroup: comp.arch

    On 5/14/2026 5:17 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 5/14/2026 3:03 PM, Bernd Linsel wrote:
    On 5/13/26 22:52, MitchAlsup wrote:

    Bernd Linsel <bl1-thispartdoesnotbelonghere@gmx.com> posted:

    On 5/13/26 14:02, Bernd Linsel wrote:
    Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core,
    accessible in 1 or 2 clocks, and two transfer instructions

    ldqr Rd, <index>
    stqr Rd, <index>

    This should work out perfectly even in a tight vVM loop.


    Should of course read

    ldqr Rd, Rs      // Rs indexes into ultra-fast on-chip SRAM
    stqr Rs1, Rs2    // Rs2 indexes into ultra-fast on-chip SRAM

    I think "direct addressing" with an immediate index instead of via an
    index in a register is not needed.

    How do you access a different register each loop iteration ???
    if you don't have indexing ???


    It's meant as:

    ld Rd, qregs[Rs] and
    st Rs1, qregs[Rs2],

    OK, that solves the indexing issue.

    i.e. the second register as index into the "quick regs" local SRAM bank.
    Only aligned full-word access should be sufficient, so that
    these are really indices, not addresses.
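
    The two transfer instructions can be sketched in software as follows.
    This is a minimal model assuming 512 64-bit words indexed by a
    register value; wrapping out-of-range indices modulo 512 is an
    assumption of this sketch, since the post does not say what happens
    to them, and QuickRegs is a hypothetical name.

```python
QREG_COUNT = 512            # 4 KB / 8 bytes per 64-bit word, as in the post
MASK64 = (1 << 64) - 1


class QuickRegs:
    """Toy model of the "quick regs" SRAM bank, addressed by word index."""

    def __init__(self):
        self.words = [0] * QREG_COUNT

    def ldqr(self, regs, rd, rs):
        # regs[rs] is an index, not a byte address (full-word access only).
        regs[rd] = self.words[regs[rs] % QREG_COUNT]

    def stqr(self, regs, rs1, rs2):
        self.words[regs[rs2] % QREG_COUNT] = regs[rs1] & MASK64


regs = [0] * 32             # ordinary architectural registers
q = QuickRegs()

# Fill four quick regs, using R1 as the index and R2 as the data.
for i in range(4):
    regs[1] = i
    regs[2] = 10 * (i + 1)
    q.stqr(regs, 2, 1)

# Read a different quick reg on each loop iteration.
total = 0
for i in range(4):
    regs[1] = i
    q.ldqr(regs, 3, 1)
    total += regs[3]
```

    The second loop is the case asked about earlier: the index register
    changes each iteration, so each pass reads a different quick reg.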

    I must be missing something. Doesn't this quick regs memory have to be
    saved and restored on each context switch? If so, that is very expensive.

    qregs[] is (IS) the actual register file (or files)--so, no added state.

    Huh? In Bernd's post above, he expressly says adding a 4K fast SRAM to
    the core. I don't think he was talking about the register file.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)