• What I did on my summer vacation

    From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Aug 21 20:49:51 2025
    From Newsgroup: comp.arch


    Greetings everyone !

    Since Google closed down comp.arch on Google Groups, I had been using
    Real World Technologies as a portal. About 8 weeks ago it crashed for
    the first time, then a couple weeks later it crashed a second time,
    apparently terminally, or Dave Kanter's interest has waned ...

    With help from Terje and SFuld, we have located this portal, and this
    is my first attempt at posting here.

    Anyone familiar with my comp.arch record over the years understands
    that I participate "a lot"; probably more than is good for my interests,
    but it is energy I seem to have on a continuous basis. My unanticipated
    down time gave me time to work on stuff that I had been neglecting
    for quite some time--that is, the non-ISA parts of my architecture.
    {{I should probably learn something from this down time and my
    productivity}}

    My 66000 ISA is in "pretty good shape" having almost no changes over
    the last 6 months, with only specification clarifications. So it was time
    to work on the non-ISA parts.
    ----------------------------
    First up was/is the System Binary Interface: which for the most part is
    "just like" the Application Binary Interface, except that it uses
    supervisor call SVC and supervisor return SVR instead of CALL, CALX,
    and RET. This method gives uniformity across the 4 privilege levels
    {HyperVisor, Host OS, Guest OS, and Application}.

    I decided to integrate exceptions, checks, and interrupts with SVC since
    they all involve a privileged transfer of control, and a dispatcher.
    Instead of having something like an <Interrupt> vector table, I decided,
    under EricP's tutelage, to use a software dispatcher. This allows the
    vector table to be positioned anywhere in memory, on any boundary, and
    be of any size; such that each Guest OS, Host OS, and HyperVisor can
    have their own table organized any way they please. Control arrives at
    the dispatcher with a re-entrant Thread State and register file, with
    R0 holding "why" and enough temporary registers for dispatch to perform
    its duties immediately.

    {It ends up that even Linux "signals" can use this means with very
    slight modification to the software dispatcher--it merely has to
    be cognizant that signal privilege == thread waiting privilege and
    thus "save the non-preserved registers".}

    Dispatch extracts R0<38:32>, compares this to the size of the table,
    and if it is within the table, CALX's the entry point in the table.
    This performs an ABI control transfer to the required "handler".
    Upon return, Dispatcher performs SVR to return control whence it came.
    The normal path through Dispatcher is 7 instructions.
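
    In rough C terms, the dispatch path looks something like this (a
    sketch only--vector_table, table_size, and handler_fn are invented
    names, and the real thing is ~7 My 66000 instructions, not C):

        #include <stdint.h>

        typedef void (*handler_fn)(uint64_t why);
        extern handler_fn vector_table[]; /* anywhere, any boundary, any size */
        extern uint64_t   table_size;

        void dispatch(uint64_t r0)        /* control arrives with R0 = "why" */
        {
            uint64_t index = (r0 >> 32) & 0x7f;   /* extract R0<38:32>       */
            if (index < table_size)               /* within the table?       */
                vector_table[index](r0);          /* CALX: ordinary ABI call */
            /* SVR follows here: return control whence it came */
        }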

    In My 66000 Architecture, SVR also checks pending interrupts of higher
    priority than where SVR is going; thus, softIRQ's are popped off the
    deferred call list and processed before control is delivered to lower
    priority levels.
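
    Conceptually (invented helper names; the HW does this as part of SVR):

        /* Before SVR delivers control to a lower priority level, any
           deferred softIRQ of higher priority than the target runs first. */
        extern int  pending_priority(void);  /* highest deferred priority   */
        extern int  target_priority;         /* where SVR is headed         */
        extern void run_next_deferred(void); /* pop the deferred-call list  */

        void svr_exit_check(void)
        {
            while (pending_priority() > target_priority)
                run_next_deferred();
        }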
    ----------------------------
    Next up was the System Programming model: I modeled Chip Resources
    after PCIe Peripherals. {{I had to use the term Peripheral because,
    with SR-IOV and MR-IOV, with physical Functions, virtual Functions,
    and base Functions, and with Bus:Device.Function being turned into a
    routing code--none of those terms made sense and each required too
    many words to describe. So, I use the term Peripheral for anything
    that performs an I/O service on behalf of the system.}}

    My 66000 uses nested paging, with Application and Guest OS using
    Level-1 translation while Host OS and HyperVisor use Level-2
    translation.

    My 66000 translation projects a 64-bit virtual address space into a
    66-bit universal address space with {DRAM, Configuration, MM I/O, and
    ROM} spaces.

    Since My 66000 comes out of reset with the MMU turned on, boot
    software accesses virtual Configuration space, which is mapped to
    {Chip, DRAM, and PCIe} configuration spaces. Resources are identified
    by Type 0 PCIe Configuration headers, and programmed the "obvious"
    way (later), assigning a page of MM I/O address space to/for each
    Resource.

    Chip Configuration headers have the Built-In Self-Test (BIST) control
    port. Chip resources use BIST to clear and initialize their internal
    stores for normal operation. Prior to writing to BIST, these resources
    can be read using the diagnostic port and dumped as desired. BIST is
    assumed to "take some time", so Boot SW might start BIST on most Chip
    resources while it goes about getting DRAM up and running.

    In all cases: Control Registers exist--it is only whether SW can
    access them that is in question. A control register that does not
    exist reads as 0 and discards any write, while a control register
    that does exist absorbs the write, and returns the last write or the
    last HW update. Configuration control registers are accessible in
    <physical> configuration space. The BAR registers in particular are
    used to assign MM I/O addresses to the rest of the control registers
    not addressable in configuration space.

    Chip resources {Cores, on-Die Interconnect, {L3, DRAM}, {HostBridge,
    I/O MMU, PCIe Segmenter}} have the first 32 DoubleWords of the
    assigned MM I/O space defined as a "file" containing R0..R31. In all
    cases:
    R0 contains the Voltage and Frequency control terms of the resource,
    R1..R27 contain any general purpose control registers of the resource,
    R28..R30 contain the debug port,
    R31 contains the Performance Counter port.
    The remaining 480 DoubleWords are defined by the resource itself
    (or not).
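
    As a C picture of one resource's MM I/O page (field names are mine,
    not the specification's):

        #include <stdint.h>

        /* One resource's 4096-byte MM I/O page: 32 architected
           DoubleWords followed by 480 resource-defined ones. */
        typedef struct {
            volatile uint64_t vf_control;    /* R0: Voltage/Frequency terms   */
            volatile uint64_t control[27];   /* R1..R27: general control      */
            volatile uint64_t debug[3];      /* R28..R30: debug port          */
            volatile uint64_t perf_port;     /* R31: Performance Counter port */
            volatile uint64_t resource[480]; /* defined by the resource       */
        } resource_regs;                     /* 512 DWs = 4096 bytes          */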

    Because My 66000 ISA has memory instructions that "touch" multiple
    memory locations, these instructions take on special significance
    when using the debug and performance counter ports. Single memory
    instructions access the control registers themselves, while multi-
    memory instructions access "through" the port to the registers
    the port controls.

    For example: each resource has 8 performance counters and 1 control
    register (R31) governing that port.
    a STB Rd,[R31] writes a selection into the PC selectors
    a STD Rd,[R31] writes 8 selections into the PC selectors
    a LDB Rd,[R31] reads a selection from the PC selectors
    a LDD Rd,[R31] reads 8 selections from the PC selectors
    while:
    a LDM Rd,Rd+7,[R31] reads 8 Performance Counters,
    a STM Rd,Rd+7,[R31] writes 8 Performance Counters,
    a MS #0,[R31],#64 clears 8 Performance Counters.

    The Diagnostic port provides access to storage within the resource.
    R28 is roughly the "address" control register
    R29 is roughly the "data" control register
    R30 is roughly the "other" control register
    For a Core; one can access the following components from this port:
    ICache Tag
    ICache Data
    ICache TLB
    DCache Tag
    DCache Data
    DCache TLB
    Level-1 Miss Buffer
    L2Cache Tag
    L2Cache Data
    L2Cache TLB
    L2Cache MMU
    Level-2 Miss Buffer

    Accesses through this port come in single-memory and multi-memory
    flavors. Accessing these control registers as single memory actions
    allows raw access to the data and associated ECC. Reads tell you
    what HW has stored, writes allow SW to write "bad" ECC, should it
    so choose. Multi-memory accesses allow SW to read or write cache
    line-sized chunks. The Core tags are configured so that every line
    has a state where the line neither hits nor participates in set
    allocation (when a line needs to be allocated on miss or replacement).
    So, a single bad line in a 16KB 4-way set associative cache loses
    64 bytes, and one set becomes 3-way set associative.
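
    Illustratively, allocation just skips the disabled way (the enum and
    names here are mine, not the actual tag encoding):

        /* A disabled way never hits and is never chosen as a victim,
           so a 4-way set with one bad line simply behaves as 3-way. */
        enum tag_state { INVALID, VALID, DIRTY, DISABLED };

        int pick_victim(const enum tag_state way[4])
        {
            for (int w = 0; w < 4; w++)
                if (way[w] == INVALID) return w;   /* prefer empty ways */
            for (int w = 0; w < 4; w++)
                if (way[w] != DISABLED) return w;  /* never a bad line  */
            return -1;                             /* all ways disabled */
        }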
    ----------------------------
    By using the fact that cores come out of reset with the MMU turned on,
    and Boot ROM supplying the translation tables, I was able to arrange
    that all resources come out of reset with all control register flip-
    flops = 0, except for Core[0].Hypervisor_Context.v = 1.

    Core[0] I$, D$, and L2$ come out of reset in the "allocated" state,
    so Boot SW has a small amount of memory from which to find DRAM,
    configure, initialize, tune the pin interface, and clear; so that
    one can proceed to walk and configure the PCIe trees of peripherals.
    ----------------------------
    Guest OS can configure its translation tables to emit {Configuration
    and MM I/O} space accesses. Now that these are so easy to recognize:
    Host OS and HyperVisor have the ability to translate Guest Physical
    {Configuration and MM I/O} accesses into Universal {Config or MM I/O}
    accesses. This requires that the PTE KNOW how SR-IOV was set up on
    that virtual Peripheral. All we really want is a) the "routing" code
    of the physical counterpart of the virtual Function, and b) whether
    the access is to be allowed (valid & present). Here, the routing code
    contains the PCIe physical Segment, whether the access is physical
    or virtual, and whether the routing code uses {Bus, Device, *},
    {Bus, *, *} or {*, *, *}. The rest is PCIe transport engines.
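
    As an illustration only (the actual PTE layout is not spelled out
    here, so these field positions and names are invented):

        /* The routing terms a Level-2 MM I/O PTE would need to carry. */
        typedef struct {
            unsigned valid_present : 1;  /* is the access allowed at all  */
            unsigned phys_virt     : 1;  /* physical or virtual access    */
            unsigned match         : 2;  /* {B,D,*}, {B,*,*}, or {*,*,*}  */
            unsigned segment       : 16; /* PCIe physical Segment         */
            unsigned bus           : 8;  /* routing code: Bus             */
            unsigned dev_fn        : 8;  /* routing code: Device.Function */
        } mmio_route;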

    Anyway: School is back in session !

    Mitch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Aug 22 01:20:40 2025
    From Newsgroup: comp.arch

    On Thu, 21 Aug 2025 20:49:51 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Greetings everyone !

    Since Google closed down comp.arch on Google Groups, I had been using
    Real World Technologies as a portal. About 8 weeks ago it crashed for
    the first time, then a couple weeks later it crashed a second time,
    apparently terminally, or Dave Kanter's interest has waned ...


    Mitch, you are mixing Usenet portal with conventional Internet Forum.

    RWT is a forum. It didn't crash in recent months. David Kanter didn't
    lose interest. Not that he has a whole lot of interest, but that is
    another story. He is interested enough to pay the bill for hosting and
    that is the most important thing as far as participants are concerned.
    I don't know why you stopped posting there. Would guess that you forgot
    the address.
    BTW, https://www.realworldtech.com/forum/?roomid=1

    For Usenet you were using the i2pn2.org server, probably via the
    www.novabbs.com web portal created with Rocksolid Light software.
    The server and portal were maintained by Retro Guy (Thom). On
    2025-04-26, Thom passed away from pancreatic cancer. His Usenet
    server and his portal continued to work without maintenance until
    late July. But eventually they stopped.
    So it goes.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Thu Aug 21 23:15:52 2025
    From Newsgroup: comp.arch

    On Fri, 22 Aug 2025 01:20:40 +0300, Michael S wrote:

    For Usenet you were using i2pn2.org server, probably via www.novabbs.com
    web portal created with Rocksolid Light software. The server and portal
    were maintained by Retro Guy (Thom). On 2025-04-26, Thom passed away
    from pancreatic cancer. His Usenet server and his portal continued to
    work
    without maintenance until late July. But eventually they stopped.
    So it goes.

    I knew that novaBBS recently stopped working, but I had no idea of the
    tragic reason why this was the case. Thank you.

    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Aug 22 01:39:38 2025
    From Newsgroup: comp.arch

    On 8/21/2025 3:49 PM, MitchAlsup wrote:

    Greetings everyone !

    Since Google closed down comp.arch on Google Groups, I had been using
    Real World Technologies as a portal. About 8 weeks ago it crashed for
    the first time, then a couple weeks later it crashed a second time,
    apparently terminally, or Dave Kanter's interest has waned ...

    With help from Terje and SFuld, we have located this portal, and this
    is my first attempt at posting here.

    Anyone familiar with my comp.arch record over the years understands
    that I participate "a lot"; probably more than is good for my interests,
    but it is energy I seem to have on a continuous basis. My unanticipated
    down time gave me time to work on stuff that I had been neglecting
    for quite some time--that is, the non-ISA parts of my architecture.
    {{I should probably learn something from this down time and my
    productivity}}

    My 66000 ISA is in "pretty good shape" having almost no changes over
    the last 6 months, with only specification clarifications. So it was time
    to work on the non-ISA parts.


    Good that you are back.

    For a while I had mostly been using 'news.eternal-september.org' via Thunderbird...

    Had once been using a different usenet server ('albasani'), but it
    seemingly stopped working around 5 years ago. Seems like the server
    still exists, but I don't see any messages more recent than 2020.




    I haven't had all that many recent core ISA changes either.

    There was BITMOV a few months ago.
    Then an instruction for packed FP8 vectors.
    Had looked at adding a Binary16 Horizontal Add instr,
    but too expensive for now.

    Found and fixed a few bugs mostly related to RISC-V and RV-C.

    BGBCC now supports RV-C.
    Determined that RV64GC + Jumbo Extensions is fairly competitive in terms
    of code density.
    XG3 is still getting worse code density than its predecessors.

    Where, RV64GC+Jx has instruction sizes: 16/32/64
    With optional additional sizes: 48 and 96.
    LI Imm32, Rn5 //48-bit
    LI Imm64, Rn //96-bit

    For now (apart from LI and SHORI) there is not much else for 48-bit
    encodings; and I would need to choose between Huawei, Qualcomm, or
    custom encodings for the rest (the 48-bit encoding space didn't
    exactly go very far with this stuff...). My encoding scheme worked
    differently in that it basically shoehorned a subset of the 64-bit
    jumbo encodings into the 48-bit space, mostly encoding Imm24/Disp24
    ops, and optionally synthesizing Imm17s forms of some 3R instructions.
    It is ugly, but does well for code density.


    I did experiment with a pair encoding for XG3, where a pair of "compact" instructions could be encoded into a 32-bit instruction word. Gains were
    very modest, not enough to change the ranking or to make a strong reason
    to have it.

    Despite being weaker on the code density front, it seems to do well for performance.
    XG2 and XG3 fight for the top in terms of speed.
    XG1 and RV64GC+Jx seem to be the top 2 for code density.

    But both XG1 and RV64GC are worse for performance as my CPU core doesn't
    favor 16-bit ops. Note that Jx does offer a notable code density and performance improvement over RV64GC by itself.

    For XG2 vs XG3 performance, there seems to be a discrepancy between
    emulator and CPU core which is faster. Emulator says XG2, CPU core says XG3.


    Otherwise:
    Was messing around a little bit in the CPU core trying to get the "use
    traps to emulate proper IEEE-754" support working.



    Partly debating formally moving FPSR from GBR[63:48] to SP[63:48].

    The current location (in GBR) results in FPSR getting stomped by GBR
    reloads; which also isn't ideal. Though, if I did move it, might still
    want some way to keep it being saved along with GP/GBR in the ABI
    (though, debatable as to whether it is better to have the FENV_ACCESS
    stuff as global or dynamically bound, traditionally global is assumed,
    with dynamic scoped FENV as a bit of an oddity, even if it makes more
    sense IMHO).

    Well, or give it its own CR, but then would still need to add additional
    logic to save/restore it (which is a non-zero added cost).

    Though, information online is a bit ambiguous as to the expected scoping behavior of FENV_ACCESS (with some stuff implying that the scoping
    behavior depends on where the pragma is used).
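
    For reference, the standard C construct whose scoping is in question:

        #include <fenv.h>
        #pragma STDC FENV_ACCESS ON   /* file scope here; the ambiguity is
                                         what block-scope use should mean */

        double add_toward_zero(double a, double b)
        {
            int old = fegetround();
            fesetround(FE_TOWARDZERO);   /* dynamic rounding mode */
            double r = a + b;
            fesetround(old);
            return r;
        }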

    Relocating it to the HOB's of SP would make sense if I want to assume
    that it is global by default (within a given thread). Which is,
    possibly, sane; and code doesn't usually stomp SP. Would need to do
    something extra if I want dynamically scoping rather than global (but,
    as-is; it isn't terribly useful if it typically just gets stomped anyways).


    Well, for now, I have gone and added a flag for it to my Verilog code
    (to await a more final decision). Relatively little impact on the
    Verilog code either way (and it doesn't appear to have much effect on
    resource cost).

    Note that RISC-V accesses it via a CSR (and with the bits in a different order), but this is orthogonal.

    Yes, I am well aware that this sort of wonk is kinda ugly...

    There is always a non-zero risk that code will notice or care about this
    stuff (say, computes a difference between two stack-derived pointers and notices that they are wildly far apart because FPU flags differ).

    Arguably a greater risk for RISC-V though which uses plain ALU ops;
    though one possible workaround could be making the HOBs of SP (and GP)
    always read as 0 in RV mode.


    Otherwise, had also been working slightly on BGBCC support for my very informal FP-SIMD extension for RISC-V (there may be reason to use it;
    beyond its existence as an implementation quirk).

    Will need to deal with a few things to make things play well with
    supporting IEEE semantics. Like, ideally, "FADD.S" should support doing IEEE-754 properly. But, the logic for this currently only exists in the
    main FPU, which (ideally) means needing some way to know that it is a
    scalar FPU operation at decode time.


    Can note that the DYN rounding mode is currently scalar only, which,
    along with the flags updates means, possibly:
    RNE/RTZ: FADD.S still gives non-IEEE behavior (FPSR still ignored);
    DYN: FADD.S goes through main FPU, gives IEEE behavior via traps if flag
    is set;
    ...

    Or, IOW, rounding mode in FADD.S/FMUL.S/... instructions:
    000: RNE, SIMD Unit, implicitly DAZ/FTZ
    001: RTZ, SIMD Unit, implicitly DAZ/FTZ
    01z: Main FPU
    100: Main FPU
    101: SIMD 128-bit RTZ, DAZ/FTZ
    110: SIMD 128-bit RNE, DAZ/FTZ
    111: DYN, Main FPU

    Where DYN is used if FENV_ACCESS is enabled in BGBCC; and DYN is the
    default rounding mode in GCC. Assumed to be always scalar...

    Can note that opt-in makes more sense, as otherwise one would need to
    have trap handlers in place before one could safely use any
    floating-point operations (vs the non-IEEE mode, which never traps).

    ...




    ----------------------------
    First up was/is the System Binary Interface: which for the most part is
    "just like" the Application Binary Interface, except that it uses
    supervisor call SVC and supervisor return SVR instead of CALL, CALX,
    and RET. This method gives uniformity across the 4 privilege levels
    {HyperVisor, Host OS, Guest OS, and Application}.

    I decided to integrate exceptions, checks, and interrupts with SVC since
    they all involve a privileged transfer of control, and a dispatcher.
    Instead of having something like an <Interrupt> vector table, I decided,
    under EricP's tutelage, to use a software dispatcher. This allows the
    vector table to be positioned anywhere in memory, on any boundary, and
    be of any size; such that each Guest OS, Host OS, and HyperVisor can
    have their own table organized any way they please. Control arrives at
    the dispatcher with a re-entrant Thread State and register file, with
    R0 holding "why" and enough temporary registers for dispatch to perform
    its duties immediately.


    Hmm.

    Admittedly still seems less complicated (from a hardware design POV) to
    just do everything with software traps; even if not the best option for performance.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Aug 22 14:57:56 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Greetings everyone !


    Chip resources {Cores, on-Die Interconnect, {L3, DRAM}, {HostBridge,
    I/O MMU, PCIe Segmenter}} have the first 32 DoubleWords of the
    assigned MM I/O space defined as a "file" containing R0..R31. In all
    cases:
    R0 contains the Voltage and Frequency control terms of the resource,
    R1..R27 contain any general purpose control registers of the resource,
    R28..R30 contain the debug port,
    R31 contains the Performance Counter port.
    The remaining 480 DoubleWords are defined by the resource itself
    (or not).

    I'd allow for regions larger than 4096 bytes. It's not uncommon
    for specialized on-board DMA engines to require 20 bits of
    address space to define the complete set of device resources,
    even for on-chip devices (A DMA engine may support a large number
    of "ring" structures, for example, and one might group the
    ring configuration registers into 4k regions (so they can be assigned
    to a guest in a SRIOV-type device)).

    I've seen devices with dozens of performance registers (both
    direct-access and indirect-access).



    Because My 66000 ISA has memory instructions that "touch" multiple
    memory locations, these instructions take on special significance
    when using the debug and performance counter ports. Single memory
    instructions access the control registers themselves, while multi-
    memory instructions access "through" the port to the registers
    the port controls.

    That level of indirection may cause difficulties when virtualizing
    a device.


    For example: each resource has 8 performance counters and 1 control
    register (R31) governing that port.
    a STB Rd,[R31] writes a selection into the PC selectors
    a STD Rd,[R31] writes 8 selections into the PC selectors
    a LDB Rd,[R31] reads a selection from the PC selectors
    a LDD Rd,[R31] reads 8 selections from the PC selectors
    while:
    a LDM Rd,Rd+7,[R31] reads 8 Performance Counters,
    a STM Rd,Rd+7,[R31] writes 8 Performance Counters,
    a MS #0,[R31],#64 clears 8 Performance Counters.

    The Diagnostic port provides access to storage within the resource.
    R28 is roughly the "address" control register
    R29 is roughly the "data" control register
    R30 is roughly the "other" control register
    For a Core; one can access the following components from this port:
    ICache Tag
    ICache Data
    ICache TLB
    DCache Tag
    DCache Data
    DCache TLB
    Level-1 Miss Buffer
    L2Cache Tag
    L2Cache Data
    L2Cache TLB
    L2Cache MMU
    Level-2 Miss Buffer

    Accesses through this port come in single-memory and multi-memory
    flavors. Accessing these control registers as single memory actions
    allows raw access to the data and associated ECC. Reads tell you
    what HW has stored, writes allow SW to write "bad" ECC, should it
    so choose. Multi-memory accesses allow SW to read or write cache
    line sized chunks. The Core tags are configured so that every line
    has a state where this line neither hits nor participates in set
    allocation (when a line needs to be allocated on miss or replacement).
    So, a single bad line in a 16KB 4-way set associative cache loses
    64 bytes, and one set becomes 3-way set associative.
    ----------------------------

    The KISS principle applies.

    By using the fact that cores come out of reset with MMU turned on,
    and BOOT ROM supplying the translation tables, I was able to achieve
    that all resources come out of reset with all control register flip-
    flops = 0, except for Core[0].Hypervisor_Context.v = 1.

    Where is the ROM? Modern SoCs have an on-board ROM, which
    cannot be changed without a re-spin and new tapeout. That
    ROM needs to be rock-solid and provide just enough capability
    to securely load a trusted blob from a programmable device
    (e.g. SPI flash device).

    I'm really leery about the idea of starting with the MMU enabled;
    I don't see any advantage to doing that.


    Core[0] I$, D$, and L2$ come out of reset in the "allocated" state,
    so Boot SW has a small amount of memory from which to find DRAM,
    configure, initialize, tune the pin interface, and clear; so that
    one can proceed to walk and configure the PCIe trees of peripherals.

    You don't need to configure peripherals before DRAM is initialized
    (other than the DRAM controller itself). All other peripheral
    initialization should be done in loadable firmware or a secure
    monitor, hypervisor or bare-metal kernel.

    ----------------------------
    Guest OS can configure its translation tables to emit {Configuration
    and MM I/O} space accesses. Now that these are so easy to recognize:

    Security. Guest OS should only be able to access resources
    granted to it by the HV.

    Host OS and HyperVisor have the ability to translate Guest Physical
    {Configuration and MM I/O} accesses into Universal {Config or MM I/O}
    accesses. This requires that the PTE KNOW how SR-IOV was set up on
    that virtual Peripheral.

    This seems unnecessarily complicated. Every SR-IOV capable device
    is different, and aside from the standard PCIe-defined configuration
    space registers, everything else is device-specific.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Aug 22 16:17:24 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Greetings everyone !


    Chip resources {Cores, on-Die Interconnect, {L3, DRAM}, {HostBridge,
    I/O MMU, PCIe Segmenter}} have the first 32 DoubleWords of the
    assigned MM I/O space defined as a "file" containing R0..R31. In all
    cases:
    R0 contains the Voltage and Frequency control terms of the resource,
    R1..R27 contain any general purpose control registers of the resource,
    R28..R30 contain the debug port,
    R31 contains the Performance Counter port.
    The remaining 480 DoubleWords are defined by the resource itself
    (or not).

    I'd allow for regions larger than 4096 bytes. It's not uncommmon
    for specialized on-board DMA engines to require 20 bits of
    address space to define the complete set of device resources,
    even for on-chip devices (A DMA engine may support a large number
    of "ring" structures, for example, and one might group the
    ring configuration registers into 4k regions (so they can be assigned
    to a guest in a SRIOV-type device)).

    This is a fair point, but none of my current on-Die resources need
    more than 6 DWs, so allocating a 4096-byte address space to them seems
    generous--remember these are NOT PCIe peripherals, but resources in a
    My 66000 implementation. They all use std PCIe config headers, so a
    STD #-1 to BAR{01} and a LDD tell how big the allocation should be;
    so even though the spaces are sparse they still follow PCIe Config
    conventions.
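
    In C, the standard BAR sizing probe looks like (a generic sketch;
    the pointer stands in for a config-space access to BAR{01}):

        #include <stdint.h>

        /* Write all-ones, read back, restore; the hardwired-zero low
           bits of the readback give the size of the decoded region. */
        uint64_t bar_size(volatile uint64_t *bar)
        {
            uint64_t saved = *bar;
            *bar = ~0ull;                   /* STD #-1 to BAR{01}         */
            uint64_t rb = *bar;             /* LDD                        */
            *bar = saved;                   /* restore the original BAR   */
            rb &= ~0xfull;                  /* mask BAR attribute bits    */
            return ~rb + 1;                 /* lowest settable bit = size */
        }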

    I've seen devices with dozens of performance registers (both
    direct-access and indirect-access).



    Because My 66000 ISA has memory instructions that "touch" multiple
    memory locations, these instructions take on special significance
    when using the debug and performance counter ports. Single memory
    instructions access the control registers themselves, while multi-
    memory instructions access "through" the port to the registers
    the port controls.

    That level of indirection may cause difficulties when virtualizing
    a device.

    These are on-Die resources not PCIe peripherals.


    For example: each resource has 8 performance counters and 1 control
    register (R31) governing that port.
    a STB Rd,[R31] writes a selection into the PC selectors
    a STD Rd,[R31] writes 8 selections into the PC selectors
    a LDB Rd,[R31] reads a selection from the PC selectors
    a LDD Rd,[R31] reads 8 selections from the PC selectors
    while:
    a LDM Rd,Rd+7,[R31] reads 8 Performance Counters,
    a STM Rd,Rd+7,[R31] writes 8 Performance Counters,
    a MS #0,[R31],#64 clears 8 Performance Counters.

    The Diagnostic port provides access to storage within the resource.
    R28 is roughly the "address" control register
    R29 is roughly the "data" control register
    R30 is roughly the "other" control register
    For a Core; one can access the following components from this port:
    ICache Tag
    ICache Data
    ICache TLB
    DCache Tag
    DCache Data
    DCache TLB
    Level-1 Miss Buffer
    L2Cache Tag
    L2Cache Data
    L2Cache TLB
    L2Cache MMU
    Level-2 Miss Buffer

    Accesses through this port come in single-memory and multi-memory
    flavors. Accessing these control registers as single memory actions
    allows raw access to the data and associated ECC. Reads tell you
    what HW has stored, writes allow SW to write "bad" ECC, should it
    so choose. Multi-memory accesses allow SW to read or write cache
    line sized chunks. The Core tags are configured so that every line
    has a state where this line neither hits nor participates in set
    allocation (when a line needs to be allocated on miss or replacement).
    So, a single bad line in a 16KB 4-way set associative cache loses
    64 bytes, and one set becomes 3-way set associative.
    ----------------------------

    The KISS principle applies.

    By using the fact that cores come out of reset with MMU turned on,
    and BOOT ROM supplying the translation tables, I was able to achieve
    that all resources come out of reset with all control register flip-
    flops = 0, except for Core[0].Hypervisor_Context.v = 1.

    Where is the ROM? Modern SoCs have an on-board ROM, which
    cannot be changed without a re-spin and new tapeout. That
    ROM needs to be rock-solid and provide just enough capability
    to securely load a trusted blob from a programmable device
    (e.g. SPI flash device).

    ROM is external FLASH in the envisioned implementations.

    I'm really leary about the idea of starting with MMU enabled,
    I don't see any advantage to doing that.


    Core[0] I$, D$, and L2$ come out of reset in the "allocated" state,
    so Boot SW has a small amount of memory from which to find DRAM,
    configure, initialize, tune the pin interface, and clear; so that
    one can proceed to walk and configure the PCIe trees of peripherals.

    You don't need to configure peripherals before DRAM is initialized
    (other than the DRAM controller itself). All other peripheral initialization should be done in loadable firmware or a secure
    monitor, hypervisor or bare-metal kernel.

    Agreed, you can't use the peripherals until they have DRAM in which
    to perform I/O and can send interrupts {thus, acting normally}.

    ----------------------------
    Guest OS can configure its translation tables to emit {Configuration
    and MM I/O} space accesses. Now that these are so easy to recognize:

    Security. Guest OS should only be able to access resources
    granted to it by the HV.

    Yes, Guest physical MM I/O space is translated by Host MM I/O
    Translation tables. A Real OS sets up its translation tables to emit
    MM I/O accesses, so Guest OS does too; then that Guest Physical
    address is considered Host virtual and is translated and protected
    again.

    As far as I am concerned, Guest OS thinks it has 32 Devices, each of
    which has 8 Functions, all on Bus 0... So, a Guest OS with fewer than
    256 Functions sees only 1 Bus and can short-circuit the virtual Config
    discovery. These virtual Guest OS accesses, then, get redistributed to
    the Segments and Busses on which the VFs actually reside by Level-2 SW.

    Host OS and HyperVisor have the ability to translate Guest Physical
    {Configuration and MM I/O} accesses into Universal {Config or MM I/O}
    accesses. This requires that the PTE KNOW how SR-IOV was set up on
    that virtual Peripheral.

    This seems unnecessarily complicated.

    So did IEEE 754 in 1982...

    What I have done is to virtualize Config and MM I/O spaces, so Guest
    OS does not even see that it is not a Real OS running on bare metal--
    and doing so without HV intervention on any of the Config or MM I/O
    accesses.

    Every SR-IOV capable device is different, and aside from the standard
    PCIe-defined configuration space registers, everything else is
    device-specific.

    Only requires 3 bits in the MM I/O PTE.
    Only requires 1 bit in Config PTE, a bit that already had to be there.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Aug 22 12:47:35 2025
    From Newsgroup: comp.arch

    On 8/22/2025 11:17 AM, MitchAlsup wrote:

    scott@slp53.sl.home (Scott Lurndal) posted:


    <snip>

    Host OS and HyperVisor have the ability to translate Guest Physical
    {Configuration and MM I/O} accesses into Universal {Config or MM I/O}
    accesses. This requires that the PTE KNOW how SR-IOV was set up on
    that virtual Peripheral.

    This seems unnecessarily complicated.

    So did IEEE 754 in 1982...


    Still is...

    Denormals, Inf/NaN, ... tend to accomplish relatively little in
    practice; apart from making FPUs more expensive, often slower, and
    requiring programmers to go through extra hoops to specify DAZ/FTZ in
    cases where they need more performance.

    Likewise, +/- 0.5 ULP accomplishes little beyond adding cost; whereas
    +/- 0.63 ULP would be a lot cheaper, and accomplishes nearly the same
    effect.

    Well, apart from the issue of being unable to fully converge the last
    few bits of N-R, which seems to depend primarily on sub-ULP bits.


    But, there is a tradeoff:
    Doing a faster FPU which uses trap-and-emulate.

    Still isn't free, as detecting cases that will require trap-and-emulate
    still has a higher cost than merely not bothering in the first place
    (and it now requires the trickery of routing FPSR bits into the
    instruction decoder, depending on whether they need to be routed in a
    way that will allow the FPU to detect violations of IEEE semantics).

    And finding some other issues in the process, ...

    ...


    What I have done is to virtualize Config and MM I/O spaces, so Guest OS
    does not even see that it is not Real OS running on bare metal--and doing
    so without HV intervention on any of the Config or MM I/O accesses.


    Still seems unnecessarily complicated.

    Could be like:
    Machine/ISR Mode: Bare metal, no MMU.
    Supervisor Mode: Full Access, MMU.
    User: Limited Access, MMU

    VM Guest OS then runs in User Mode, and generates a fault whenever a
    privileged operation is encountered. The VM can then fake the rest of
    the system in software...

    And/Or: Ye Olde Interpreter or JIT compiler (sorta like DOSBox and similar).

    Nested Translation? Fake it in software.
    Unlike real VMs, SW address translation can more easily scale to N
    levels of VM, even if this also means N levels of slow...
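
    Something like (walk_one_level being a stand-in for a single table
    walk; names invented):

        #include <stdint.h>

        /* N-level software nested translation: feed the address through
           each level's page tables in turn. Composes to any depth, at
           N times the walk cost. */
        extern uint64_t walk_one_level(uint64_t root, uint64_t addr);

        uint64_t translate(uint64_t addr, const uint64_t *roots, int n)
        {
            for (int i = 0; i < n; i++)
                addr = walk_one_level(roots[i], addr);  /* GV->GP->HP->... */
            return addr;                                /* final physical  */
        }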






    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Aug 22 18:58:59 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 8/22/2025 11:17 AM, MitchAlsup wrote:

    scott@slp53.sl.home (Scott Lurndal) posted:


    <snip>

    Host OS and HyperVisor have the ability to translate Guest Physical
    {Configuration and MM I/O} accesses into Universal {Config or MM I/O}
    accesses. This requires that the PTE KNOW how SR-IOV was set up on
    that virtual Peripheral.

    This seems unnecessarily complicated.

    So did IEEE 754 in 1982...


    Still is...

    Denormals, Inf/NaN, ... tend to accomplish relatively little in
    practice; apart from making FPUs more expensive, often slower, and
    requiring programmers to go through extra hoops to specify DAZ/FTZ in
    cases where they need more performance.

    You can (and have) argued this until the cows come home. But you cannot
    deny that IEEE 754 is here to stay, and was produced by a democratic
    process.

    Likewise, +/- 0.5 ULP, accomplishes little beyond adding cost; whereas
    +/- 0.63 ULP would be a lot cheaper, and accomplishes nearly the same effect.

    I am only partially willing to buy that argument. I was able to get
    my Transcendentals down into the 0.502-0.505 ULP range, and the only
    major difference is that the FU also does integer multiplication.

    I have also lived the messes of IBM FP and CDC FP. Luckily I missed
    out on CRAY FP.

    Well, apart from the seeming failure of being unable to fully converge
    the last few bits of N-R, which seems to depend primarily on sub-ULP bits.

    DIV and SQRT in Goldschmidt form require 57×57; in N-R form require
    56×57 in order to get IEEE 754 accuracy.

    J.-M. Muller has a chapter where he investigates how many bits prior
    to rounding are needed in order to achieve IEEE 754 accuracy. It turns
    out that EXP requires 117-118 bits to achieve 0.5 ULP, and there are
    some other nasty transcendentals. This requires something longer than
    128-bit IEEE FP in order to get properly rounded 64-bit FP
    transcendentals.

    On the other hand, if the multiplier FU can perform integer ×, then
    one can achieve 0.502-0.505 ULP with just the 64×64 multiplier, and
    have a 3-4 cycle integer multiply {you could say "for free", or you
    can say "since I× is there, transcendental accuracy came for free"}.

    But, there is a tradeoff:
    Doing a faster FPU which uses trap-and-emulate.

    All of the big guys do full speed FMUL with IEEE accuracy. Only FPGA implementations have an argument to stop short.

    Still isn't free, as detecting cases tat will require trap-and-emulate
    still has a higher cost than merely not bothering in the first place
    (and now requires trickery of routing FPSR bits into the instruction
    decoder depending on whether they need to be routed in a way that will
    allow the FPU to detect violations of IEEE semantics).

    Oh, BTW, My 66000 transcendentals detect that the rounding might not
    be to IEEE accuracy and have an Enable to trap and emulate the
    ~1-in-1273 cases that cannot be properly rounded. And, here, that
    capability is patented.

    And finding some other issues in the process, ...

    ...


    What I have done is to virtualize Config and MM I/O spaces, so Guest OS does not even see that it is not Real OS running on bare metal--and doing so without HV intervention on any of the Config or MM I/O accesses.


    Still seems unnecessarily complicated.

    The MMU does not even have a bit that allows it to be turned off.
    Indeed, turning off the Root Pointer (validity) makes that privilege
    level unable to function--so you could not even Boot.

    The MMU has to work for the CPU to do anything reasonable; I just
    extend this all the way to the exit from Reset.

    Could be like:
    Machine/ISR Mode: Bare metal, no MMU.
    Supervisor Mode: Full Access, MMU.
    User: Limited Access, MMU

    If you don't trust Boot, how do you get to Secure Boot ?!?

    But if you WANT to appear to have turned off the MMU, you can use a
    single SuperPage PTE to map 8EB {or all of potential Flash ROM}.

    VM Guest OS then runs in User Mode, and generates a fault whenever a
    privileged operation is encountered. The VM can then fake the rest of
    the system in software...

    In My 66000, Config, MM I/O and ROM spaces are not part of the DRAM
    address space. So, without an active MMU, Boot has nowhere to fetch
    instructions from, no way to access Configuration space, and no way
    to get started Booting the system.

    Benefit: DRAM can be as big as 64-bits with no config or MM I/O
    apertures.
    Benefit: Can run Real OS as a Guest OS.
    Complexity: MMU is never turned off. {I would call this a lessening of
    complexity not an increase.}

    And/Or: Ye Olde Interpreter or JIT compiler (sorta like DOSBox and similar).

    Nested Translation? Fake it in software.

    Performance sucks, and you end up having higher-privilege SW look at
    tables set up by lower-privilege software. It is cleaner to run nested
    paging from the exit of Reset.

    Unlike real VMs, SW address translation can more easily scale to N
    levels of VM, even if this also means N levels of slow...

    That is what the Host OS privilege level is for.






    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Aug 22 21:51:21 2025
    From Newsgroup: comp.arch

    BGB wrote:
    On 8/22/2025 11:17 AM, MitchAlsup wrote:

    scott@slp53.sl.home (Scott Lurndal) posted:


    <snip>

    Host OS and HyperVisor have the ability to translate Guest Physical
    {Configuration and MM I/O} accesses into Universal {Config or MM I/O}
    accesses. This requires that the PTE KNOW how SR-IOV was set up on
    that virtual Peripheral.

    This seems unnecessarily complicated.

    So did IEEE 754 in 1982...


    Still is...

    Denormals, Inf/NaN, ... tend to accomplish relatively little in
    practice; apart from making FPUs more expensive, often slower, and
    requiring programmers to go through extra hoops to specify DAZ/FTZ in
    cases where they need more performance.

    Likewise, +/- 0.5 ULP, accomplishes little beyond adding cost; whereas
    +/- 0.63 ULP would be a lot cheaper, and accomplishes nearly the same effect.

    Well, apart from the seeming failure of being unable to fully converge
    the last few bits of N-R, which seems to depend primarily on sub-ULP bits.

    Having spent ~10 years (very much part time!) working on 754
    standards, I strongly believe you are wrong:

    Yes, there are a few small issues, some related to grandfather clauses
    that might go away at some point, but the zero/subnorm/normal/inf/nan
    setup is not one of them.

    Personally I think it would have been a huge win if the original
    standard had defined inf/nan a different way:

    What we have is Inf == maximal exponent, all-zero mantissa, while all
    other mantissa values indicate a NaN.

    For binary FP it is totally up to the CPU vendor how to define Quiet
    NaN vs Signalling NaN; most common seems to be to set the top bit in
    the mantissa.

    What we have been missing for 40 years now is a fourth category:

    None (or Null/Missing)

    This would have simplified all sorts of array/matrix sw where both
    errors (NaN) and missing (None) items are possible.

    The easiest way to implement it would also make the FPU hardware simpler:

    The top two bits of the mantissa define

    11 : SNaN
    10 : QNaN
    01 : None
    00 : Inf

    The rest of the mantissa bits could then carry any payload you want,
    including optional debug info for Infinities.
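
    A classifier for the proposed encoding (binary64 assumed; this is the
    proposal above, not IEEE 754 as it stands):

        #include <stdint.h>

        enum fp_special { FP_NUMBER, FP_INF, FP_NONE, FP_QNAN, FP_SNAN };

        enum fp_special classify(uint64_t bits)
        {
            uint64_t exp  = (bits >> 52) & 0x7ff;
            uint64_t top2 = (bits >> 50) & 0x3;  /* top two mantissa bits */
            if (exp != 0x7ff)
                return FP_NUMBER;                /* zero/subnorm/normal   */
            switch (top2) {
            case 3:  return FP_SNAN;             /* 11                    */
            case 2:  return FP_QNAN;             /* 10                    */
            case 1:  return FP_NONE;             /* 01                    */
            default: return FP_INF;              /* 00 + optional payload */
            }
        }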

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Aug 22 19:55:07 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Greetings everyone !



    Because My 66000 ISA has memory instructions that "touch" multiple
    memory locations, these instructions take on special significance
    when using the debug and performance counter ports. Single memory
    instructions access the control registers themselves, while multi-
    memory instructions access "through" the port to the registers
    the port controls.

    That level of indirection may cause difficulties when virtualizing
    a device.

    These are on-Die resources not PCIe peripherals.

    If it quacks like a duck - you're making them _look_
    like PCIe peripherals (e.g. with a PCI/PCIe compatible
    configuration space, BARs, MSI-X interrupts, etc.),
    right?


    By using the fact that cores come out of reset with MMU turned on,
    and BOOT ROM supplying the translation tables, I was able to achieve
    that all resources come out of reset with all control register flip-
    flops = 0, except for Core[0].Hypervisor_Context.v = 1.

    Where is the ROM? Modern SoCs have an on-board ROM, which
    cannot be changed without a re-spin and new tapeout. That
    ROM needs to be rock-solid and provide just enough capability
    to securely load a trusted blob from a programmable device
    (e.g. SPI flash device).

    ROM is external FLASH in the envisioned implementations.

    Which begs the question of how the flash controller is initialized,
    if there is no ROM on-chip to do that; and which flash controller is
    used - SPI, embedded MMC, I2C/I3C - or do you envision something like
    the Intel low-pin-count devices that are directly exposed in the
    system address space so the processor can just start fetching
    instructions directly from a custom flash controller, like the 8086?

    Customers often wish to use specific technology in the
    boot path, targeted to their use-case.


    ----------------------------
    Guest OS can configure its translation tables to emit {Configuration
    and MM I/O} space accesses. Now that these are so easy to recognize:

    Security. Guest OS should only be able to access resources
    granted to it by the HV.

    Yes, Guest physical MM I/O space is translated by Host MM I/O
    Translation tables. A Real OS sets up its translation tables to emit
    MM I/O accesses, so Guest OS does too; then that Guest Physical
    address is considered Host virtual and is translated and protected
    again.

    As far as I am concerned, Guest OS thinks it has 32 Devices, each of
    which has 8 Functions, all on Bus 0... So, a Guest OS with fewer than
    256 Functions sees only 1 Bus and can short-circuit the virtual Config
    discovery. These virtual Guest OS accesses, then, get redistributed to
    the Segments and Busses on which the VFs actually reside by Level-2 SW.

    Generally speaking, the guest device configuration (e.g. ECAM) is
    completely emulated by the hypervisor, either by supporting the legacy
    CF8/CFC Intel peek/poke mechanism (trapping IN/OUT instructions on
    x86) or by taking a page fault to the hypervisor when the guest
    accesses a designated ECAM region (the address of which is provided
    to the guest by the HV via ACPI or Device Tree tables).
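
    (For reference, the standard ECAM layout--one 4KB page per function,
    which is what lets the HV trap guest config accesses with ordinary
    page protections:)

        #include <stdint.h>

        uint64_t ecam_addr(uint64_t base, unsigned bus, unsigned dev,
                           unsigned fn, unsigned reg)
        {
            return base | ((uint64_t)bus << 20)  /* 256 buses           */
                        | ((uint64_t)dev << 15)  /* 32 devices per bus  */
                        | ((uint64_t)fn  << 12)  /* 8 functions per dev */
                        | (reg & 0xfff);         /* 4KB config space    */
        }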


    Host OS and HyperVisor have the ability to translate Guest Physical
    {Configuration and MM I/O} accesses into Universal {Config or MM I/O}
    accesses. This requires that the PTE KNOW how SR-IOV was set up on
    that virtual Peripheral.

    This seems unnecessarily complicated.

    So did IEEE 754 in 1982...

    What I have done is to virtualize Config and MM I/O spaces, so Guest OS
    does not even see that it is not Real OS running on bare metal--and doing
    so without HV intervention on any of the Config or MM I/O accesses.

    Config accesses are quite rare, mainly for device discovery and
    initial BAR setup; there is no benefit to supporting virtualization
    of that in the hardware. All existing hypervisors provide emulated
    ECAM regions to the guest.

    MMI/O is handled efficiently by the HV using a nested page table.


    Every SR-IOV capable device
    is different and aside the standard PCIe defined configuration space
    registers, everything else is device-specific.

    Only requires 3 bits in the MM I/O PTE.
    Only requires 1 bit in Config PTE, a bit that already had to be there.

    You need My 66000 HV/kernel software to support that.

    I can tell you, from experience, that custom hardware support for
    third-party standard devices in the kernels (Windows, Linux, et al.)
    is frowned upon by the Linux community. They're called quirks, and
    supporting them complicates the driver implementations in the
    operating software.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Aug 22 20:28:56 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    BGB wrote:
    On 8/22/2025 11:17 AM, MitchAlsup wrote:

    scott@slp53.sl.home (Scott Lurndal) posted:


    <snip>

    Host OS and HyperVisor have the ability to translate Guest Physical
    {Configuration and MM I/O} accesses into Universal {Config or MM I/O}
    accesses. This requires that the PTE KNOW how SR-IOV was set up on
    that virtual Peripheral.

    This seems unnecessarily complicated.

    So did IEEE 754 in 1982...


    Still is...

    Denormals, Inf/NaN, ... tend to accomplish relatively little in
    practice; apart from making FPUs more expensive, often slower, and requiring programmers to go through extra hoops to specify DAZ/FTZ in cases where they need more performance.

    Likewise, +/- 0.5 ULP, accomplishes little beyond adding cost; whereas
    +/- 0.63 ULP would be a lot cheaper, and accomplishes nearly the same effect.

    Well, apart from the seeming failure of being unable to fully converge
    the last few bits of N-R, which seems to depend primarily on sub-ULP bits.

    Having spent ~10 years (very much part time!) working on 754 standards I strongly believe you are wrong:

    Not to mention the rest of the FP numerical community--even Posits.

    Yes, there are a few small issues, some related to grandfather clauses
    that might go away at some point, but the zero/subnorm/normal/inf/nan
    setup is not one of them.

    Personally I think it would have been a huge win if the original
    standard had defined inf/nan a different way:

What we have is Inf == Maximal exponent, all-zero mantissa, while all
other mantissa values indicate a NaN.

For binary FP it is totally up to the CPU vendor how to define Quiet NaN
vs Signalling NaN; most common seems to be to set the top bit in the mantissa.

    What we have been missing for 40 years now is a fourth category:

    None (or Null/Missing)

    This would have simplified all sorts of array/matrix sw where both
    errors (NaN) and missing (None) items are possible.

    The easiest way to implement it would also make the FPU hardware simpler:

    The top two bits of the mantissa define

    11 : SNaN
    10 : QNaN
    01 : None
    00 : Inf

    The rest of the mantissa bits could then carry any payload you want, including optional debug info for Infinities.
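
For illustration, a classifier for that proposed encoding on a binary64
bit pattern (a sketch of the proposal above, not any shipping format):

    #include <stdint.h>

    enum fp_special { FP_ORDINARY, FP_INF, FP_NONE, FP_QNAN, FP_SNAN };

    /* Maximal exponent, then mantissa<51:50>: 00 Inf, 01 None, 10 QNaN, 11 SNaN. */
    enum fp_special classify(uint64_t bits)
    {
        if (((bits >> 52) & 0x7FF) != 0x7FF)
            return FP_ORDINARY;                 /* zero/subnorm/normal */
        switch ((bits >> 50) & 3) {             /* top two mantissa bits */
        case 0:  return FP_INF;
        case 1:  return FP_NONE;
        case 2:  return FP_QNAN;
        default: return FP_SNAN;
        }
    }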

    And a way where overflow saturates below infinity and underflow
    saturates above zero. That would make it clear if the operand/
    result was an actual infinity, or whether it overflowed the
    container to become infinity.

    zero < underflowed < denorm < norm < overflowed < infinity

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Aug 23 06:05:03 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    BGB wrote:
    What we have been missing for 40 years now is a fourth category:

    None (or Null/Missing)

    My understanding has been that SNaNs are intended to be used for
    elements that should not be used as computation operands, e.g., for
    otherwise uninitialized array elements.

    This would have simplified all sorts of array/matrix sw where both
    errors (NaN) and missing (None) items are possible.

    In what ways would None behave differently from SNaN?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 23 23:43:43 2025
    From Newsgroup: comp.arch

    On 8/23/2025 5:44 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 8/23/2025 10:11 AM, Terje Mathisen wrote:
    BGB wrote:
    -------------

    Mitch and I have repeated this too many times already:

If you are implementing a current-standards FPU, including FMAC support,
then you already have the very wide normalizer which is the only
    expensive item needed to allow zero-cycle denorm cost.


    Errm, no single-rounded FMA in my case, as single rounded FMA (for
    Binary64) would also require Trap-and-Emulate...

    But, yeah, Free if you have FMA, is not the same as FMA being free.

    Partial issue is that single rounded FMA would effectively itself have
    too high of cost (and an FMA unit would require higher latency than
    separate FMUL and FADD units).

    FMA latency < (FMUL + FADD) latency
    FMA latency >= FMUL latency
    FMA latency >= FADD latency


    Yeah.

    As noted, as-is I have an FMUL unit and FADD unit.

    The FMAC op in this case basically pipes the FMUL output through FADD,
    at roughly FMUL+FADD latency. It then stalls the pipeline for twice as long.

But, it doesn't give single rounding, nor could it affordably be made to do so.


Ironically, what FMA operations exist tend to be slower for Binary32 ops
than using separate MUL and ADD ops in the default (non-IEEE) mode.
Though for Binary64, it would be slightly faster, though still
double-rounded-ish. They can mimic single-rounded behavior for Binary32
and Binary16, though, mostly by internally operating on Binary64.

    You must accept that::

    FMA Rd,Rs1,Rs2,Rs3
    FSUB Re,Rd,Rs3

    leaves all the proper bits in Re; whereas you cannot even argue::

    FMUL Rd,Rs1,Rs2
    FADD Re,Rd,Rs3
RSUB Re,Re,Rs3

    leaves all the proper bits in Re !! in all cases !!
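
In C terms this is the classic error-free product transform; a minimal
sketch using C99's fma() (illustrative, not My 66000 code):

    #include <math.h>

    /* x = rounded product, y = exact residual, so a*b == x + y exactly
       (barring overflow); fma() computes a*b - x with a single rounding. */
    void two_product(double a, double b, double *x, double *y)
    {
        *x = a * b;
        *y = fma(a, b, -*x);
    }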

    Granted...

    But, normal C works fine without FMA; and if the ISA doesn't provide it,
    then the FPU doesn't need to deal with it.

    Nominally, C rules tend to assume that every operator operates
    individually, so X*Y+Z turning into an FMA is by no means a required
    behavior in C (and doing so may actually itself introduce unexpected
    results).


    And, if one does:
    w = fma(x, y, z);

    One can also implement "fma()" in a way that doesn't depend on having
    native ISA level support.

    Though, if supporting RISC-V, there is an issue:
    RISC-V does have these instructions...


    But, there are two possibilities here:

    Double-rounded result; but this may violate IEEE semantics, and some
    programs may depend on the assumption of it being single-rounded.

    Or, trap and emulate: Little real hardware cost, but instruction now
    takes roughly 500 or so clock cycles.


    Though, was recently at least working on having full IEEE-754 semantics
    as a possibility in my CPU core; though... mostly via trapping and
    similar (sorta like the original MIPS FPUs or similar).


    But, as can be noted:
    Seemingly even high-end PC style FPUs are not immune.

    Like, if even Intel and AMD can't fully avoid FPU performance issues due
    to things like denormals and similar; it seems like everything is
    basically doomed.

    It almost seems more like to me:
    IEEE-754 aimed too high;
    And, now, everyone is paying for it.

    Nevermind if "maybe we should all just do math slightly worse" is kind
    of a weak sounding argument.


    Sometimes it is also kind of lame that fixed point also kinda sucks, but
    for different reasons...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 24 01:34:41 2025
    From Newsgroup: comp.arch

    On 8/23/2025 5:59 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 8/22/2025 2:51 PM, Terje Mathisen wrote:
    BGB wrote:
    On 8/22/2025 11:17 AM, MitchAlsup wrote:
    --------------

Often, it seemingly works well enough to either assume the 0-exponent
range is all 0, or that only the all-0's encoding is 0.

    Decided to leave out going into a detour about aggressive corner-cutting
    (or, more aggressive than I tend to use).


    But, the issue is partly asking what exactly Denormals and Inf/NaN tend
    to meaningfully offer to calculations...

    A slow and meaningful death instead of a quick and unsightly death.


    If your math goes NaN, it means math was wrong.
    But, if the math is not wrong, there are no NaN's.



    As is, it seems like a case of:
    Denormals:
    Usually too small to matter;
    Often their only visible effect is making performance worse.
    Inf/NaN:
    Often only appear if something has gone wrong.
    Used in initialization.

    Not usually:
    Usually ".bss" is initialized to 0.

    Local variables are "whatever random garbage happens to be in the
    register", except in Java, in which case 0.



    If the FPU were to behave like, say:
    Exponent 0 is 0 (DAZ/FTZ);
    Max exponent is treated like an extra level of huge values.
    Inf and NaN are then just huge values understood as Inf/NaN.

    Likely relatively little software would notice in practice.

    So, you do not follow standards, or even agree that they bring value
    to the computing community !?!


    Usual idea is that the value in following standards matters so long as
    there is a tangible benefit.

    But, as I see it, there is less issue in violating a standard if one
    notes that they do so and in which ways.


    The value of a rule isn't very large if one can throw the rule out the
    window and hardly anything seems to notice the difference.

A lot depends on the specific rule, and the costs/benefits of violating
it. Like, I think life in general works this way: people just sort
of evaluate rules, weigh the cost/benefit tradeoffs, etc, and decide what
tradeoffs will bring the most benefit (or whatever is cheapest, most
expedient, etc).

    Results may be variable, but usually stuff works OK.

    ...




    Say, for example, a video that came up not too long ago:
    https://www.youtube.com/watch?v=y-NOz94ZEOA

    Or, in effect, even with all the powers of shiny/expensive Desktop PC
    CPUs, denormal numbers still aren't free.

    Neither are seat belts or air-bags or 5MPH bumpers.


I think the idea was these are only around because law requires them.
But, law requires them to reduce traffic fatalities.
Otherwise, car makers probably wouldn't bother.

    But, they didn't go so far as requiring some way to be able to get the
    window open in the case where one drives into a body of water and the
    power windows fail (so, drive into water and car is still a death trap).

    Has anyone really been hurt by using DAZ/FTZ or similar? I doubt it.


    And, people need to manually opt out far more often than programs are
    likely to notice, and if it were a case of "opt-in for slightly more
    accurate math at the cost of worse performance"; how many people would
    make *that* choice?...




    Contrast, at least rounding has a useful/visible effect:

    If calculations were to always truncate, values tend to drift towards 0.
    In a lot of scenarios, this effect is, at least, visible.


As for strict 0.5 ULP?
Mostly it seems to make hardware more expensive.
And, determinism could have been achieved in cheaper ways.
Say, by specifying use of truncation instead.

One could achieve 0.63 ULP by having 4 bits below the ULP.
And, say, 54*54->58 FMUL is cheaper than 54*54->108 bit (or more, if
one wants to be able to support single-rounded FMAC).

But then it takes 11 instructions to get double-double accuracy
instead of 2 instructions::

{double, double} TwoProduct( double a, double b )
{   // Knuth
    x = FPMUL( a, b );
    { ahi, alo } = Split( a );
    { bhi, blo } = Split( b );
    q = FPMUL( ahi, bhi );
    r = FPMUL( alo, bhi );
    s = FPMUL( ahi, blo );
    t = FPMUL( alo, blo );
    u = FPSUB( x, q );
    v = FPSUB( u, r );
    w = FPSUB( v, s );
    y = FPSUB( w, t );
    return { x, y };
}
versus::
{double, double} TwoProduct( double a, double b )
{
    double x = FPMUL( a, b );
    double y = FPMAC( a, b, -x );   // fused a*b - x, single rounding
    return { x, y };
}


    It is a tradeoff...


Not saying that there are no downsides either, but in that case, there is
a big question over whether or not it can be done affordably, or if the
"better" version scales well (say, for example, up to more SIMD lanes, etc).

    Sometimes it is "poor" versus "nothing at all", and usually "poor" still
    beats "nothing at all".



    <snip rest>

    No need to be condescending about it.

    I consider it more of a philosophical disagreement here.

    But, most of computing has historically followed a pattern:
    "worse is better" or "perfect is the enemy of good".


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Aug 24 13:41:29 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    This would have simplified all sorts of array/matrix sw where both
    errors (NaN) and missing (None) items are possible.

    In what ways would None behave differently from SNaN?

    It would be transparently ignored in reductions, with zero overhead.
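
In software terms the difference would look something like this (is_none()
is a hypothetical test for whatever encoding None would use; in hardware
the skip would be free):

    #include <stddef.h>
    #include <stdbool.h>

    extern bool is_none(double x);   /* hypothetical None test */

    /* One NaN element poisons the whole sum; a None simply drops out,
       as if the element were absent.                                  */
    double sum_skipping_none(const double *v, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            if (!is_none(v[i]))
                s += v[i];
        return s;
    }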

    In matrix calculations I simply padded matrices with zeros.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Aug 24 17:34:42 2025
    From Newsgroup: comp.arch

    EricP wrote:
    Terje Mathisen wrote:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    This would have simplified all sorts of array/matrix sw where both
    errors (NaN) and missing (None) items are possible.

    In what ways would None behave differently from SNaN?

    It would be transparently ignored in reductions, with zero overhead.

    There is also the behavior with operators - how is it different from xNan? xNan behaves like an error and poisons any calculation it is in,
    which is also how SQL behaves wrt NULL values:

    value + xNan => xNan
    value * xNan => xNan

    whereas Null is typically thought of as a missing value:

    value + Null => value?
    value * Null => 0?

    It could also have different operator instruction options that select different behaviors similar to rounding mode or exception handling bits.
    All those option bits would take up a lot of instruction space.
    I'm used to the Mill None, where a store becomes a NOP, a mul behaves
    like x * 1 (or a NOP), same for other operations.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Aug 24 11:51:18 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Terje Mathisen wrote:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    This would have simplified all sorts of array/matrix sw where both
    errors (NaN) and missing (None) items are possible.
    In what ways would None behave differently from SNaN?
    It would be transparently ignored in reductions, with zero overhead.
There is also the behavior with operators - how is it different from xNan?
xNan behaves like an error and poisons any calculation it is in,
    which is also how SQL behaves wrt NULL values:

    value + xNan => xNan
    value * xNan => xNan

    whereas Null is typically thought of as a missing value:

    value + Null => value?
    value * Null => 0?


    I think they would want::

    value + xNaN => <same> xNaN
value - xNaN => <same> xNaN
    value / xNaN => <same> xNaN
    xNaN / value => <same> xNaN

Where the non-existent operand is treated as turning the calculation
into a copy of the xNaN. Some architectures put a payload into the
xNaN, such as a 3-bit code for why the xNaN was created; others also put
IP<low-bits> in, to help identify the instruction where the xNaN first occurred.

    That's how Nan propagation works now, to poison the calculation.
    The Nan propagation rules were designed back when people thought
that using traps for fixing individual calculations was a good idea.
    That way Nan could serve as either an error or missing value
    and your exception handler could customize the behavior you want.

    "6.2.3 NaN propagation
    An operation that propagates a NaN operand to its result and has a single
    NaN as an input should produce a NaN with the payload of the input NaN
    if representable in the destination format.

    If two or more inputs are NaN, then the payload of the resulting NaN
    should be identical to the payload of one of the input NaNs if
    representable in the destination format. This standard does not
    specify which of the input NaNs will provide the payload."

    Traps are expensive for pipelines, vectors, gpu's, so I'd want
    None to behave differently - I'm just not sure what.
    And I recognize (below) that there may be different ways that users
want None to behave, so suggest there might be control bits to select
among multiple None behaviors on each instruction.

    It could also have different operator instruction options that select
    different behaviors similar to rounding mode or exception handling bits.
    All those option bits would take up a lot of instruction space.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Sun Aug 24 11:53:05 2025
    From Newsgroup: comp.arch

    On Sun, 24 Aug 2025 01:34:41 -0500, BGB <cr88192@gmail.com> wrote:


    If your math goes NaN, it means math was wrong.
    But, if the math is not wrong, there are no NaN's.

    Not exactly. A NaN result means that the computation has failed. It
    may be due to limited precision or range rather than incorrect math.

    For most use cases, a result of INF or IND[*] similarly means the
    computation has failed and there is no point trying to continue.


    [*] IEEE 754-2008.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Aug 24 12:09:16 2025
    From Newsgroup: comp.arch

    Terje Mathisen wrote:
    EricP wrote:
    Terje Mathisen wrote:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    This would have simplified all sorts of array/matrix sw where both
    errors (NaN) and missing (None) items are possible.

    In what ways would None behave differently from SNaN?

    It would be transparently ignored in reductions, with zero overhead.

    There is also the behavior with operators - how is it different from
    xNan?
    xNan behaves like an error and poisons any calculation it is in,
    which is also how SQL behaves wrt NULL values:

    value + xNan => xNan
    value * xNan => xNan

    whereas Null is typically thought of as a missing value:

    value + Null => value?
    value * Null => 0?

    It could also have different operator instruction options that select
    different behaviors similar to rounding mode or exception handling bits.
    All those option bits would take up a lot of instruction space.

    I'm used to the Mill None, where a store becomes a NOP, a mul behaves
    like x * 1 (or a NOP), same for other operations.

    Terje


    How does Mill store a None value if they change to NOP?

    I was thinking of spreadsheet style rules for missing cells.
    Something that's compatible with dsp's, simd, vector, and gpu's,
    but I don't know enough about all their calculations to know the
    different ways calculations handle missing values.

    And there can be different None rules just like different roundings.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Aug 24 12:46:52 2025
    From Newsgroup: comp.arch

    BGB wrote:
    On 8/23/2025 5:59 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    But, the issue is partly asking what exactly Denormals and Inf/NaN tend
    to meaningfully offer to calculations...

    A slow and meaningful death instead of a quick and unsightly death.


    If your math goes NaN, it means math was wrong.
    But, if the math is not wrong, there are no NaN's.

    It depends on whether Nan is generated or propagated.
    If Nan is generated by an instruction then yes it means error.

If Nan means a missing value, it enters the
calculation as input data and currently propagates along,
which may not be what everyone wants.

    Thus the need for a separate None value to mean missing.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 24 12:08:06 2025
    From Newsgroup: comp.arch

    On 8/24/2025 8:41 AM, Waldek Hebisch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    This would have simplified all sorts of array/matrix sw where both
    errors (NaN) and missing (None) items are possible.

    In what ways would None behave differently from SNaN?

    It would be transparently ignored in reductions, with zero overhead.

    In matrix calculations I simply padded matrices with zeros.


    Yes, this is fairly standard.


But, yeah, in most normal uses, apart from edge cases, little outside of
the normal range tends to show up in practice.



    Subnormal numbers exist, but are usually infrequent.
    NaN and Inf rarely appear outside of error conditions.

    NaN could make sense for things like uninitialized values, except:
    Languages like Java use 0.0 here;
Languages like C and C++ give whatever garbage is already there.
    And, if you malloc something, it is some mix of 0s and garbage.

    NaN is sometimes used as a value encoding scheme in dynamically typed languages, but this is independent of what the FPU does with NaN.



    Looking around, it seems that some compilers and targets use or specify
    use of DAZ/FTZ as the default.
    Eg: ICC and apparently the Apple OS's default to DAZ/FTZ on ARM.
    GCC apparently enables it if compiling with "-ffast-math";
    ...

    And, on the other side, apparently lots of (not exactly hobbyist grade)
    CPUs still handle subnormal numbers in firmware or using traps (hence
    why anyone has reason to care). If it was nearly free in terms of
    performance on mainline CPUs, no one would have reason to care; and,
    people have reason to care, because the hidden traps are slow.


    Looking around, it would appear that at least my handling of FP8 and
    similar (FP8S, FP8U, and FP8A/A-Law) is similar to DEC floating point.

    Apparently, DEC (PDP-11) used a scheme where:
    All zeroes was understood as 0, everything else was normal range.
    Overflows saturated at the maximum value;
    The bias was 128 rather than 127;
    Double was basically just float with a larger mantissa.

    This differs from, say:
    Entire 0 exponent range understood as 0;
    Inf/NaN range still exists, but the behavior may differ.

    Apparently ARM had used a DEC like approach for Binary16/Half support,
    vs handling it like the other IEEE types.

    It is a tradeoff, as having values between 65536.0 and 131008.0 could be
    nice. As-is, maximum value is 65504.0, as the next value up is Inf. I
    had gone with more IEEE like handling of Binary16.

    ...


    Some other variants still have denormals for FP8 variants, but leave off Inf/NaN in favor of slightly more dynamic range (eg, NVIDIA).

    I had made the slightly non-standard feature of (sometimes) using -0 as
    a NaN placeholder. Ironically, this is partly because -0 seems to be
    rare in practice for other reasons. It is possible -0 as NaN could be
    used more, at least allowing for some level of error detection.

    In this case, it is possible that, for converters:
Everything other than +/- 0 is normal range;
    0 is a special case, mapping to 0 on widening;
    -0 maps to NaN on widening.
    Both Inf and NaN map to -0/NaN on narrowing;
    Overflow still clamps to 0x7F/0xFF on narrowing.

    Or, basically leaving the NaN scenario for "something has gone wrong on
    the Binary16 side of things".


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Aug 24 20:19:53 2025
    From Newsgroup: comp.arch

    On Sun, 24 Aug 2025 11:51:18 -0400
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Terje Mathisen wrote:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    This would have simplified all sorts of array/matrix sw where
    both errors (NaN) and missing (None) items are possible.
    In what ways would None behave differently from SNaN?
    It would be transparently ignored in reductions, with zero
    overhead.
    There is also the behavior with operators - how is it different
    from xNan? xNan behaves like an error and poisons any calculation
    it is in, which is also how SQL behaves wrt NULL values:

    value + xNan => xNan
    value * xNan => xNan

    whereas Null is typically thought of as a missing value:

    value + Null => value?
    value * Null => 0?


    I think they would want::

    value + xNaN => <same> xNaN
    value - xNaN => <same> xNaN
    value / xNaN => <same> xNaN
    xNaN / value => <same> xNaN

Where the non-existent operand is treated as turning the calculation
into a copy of the xNaN. Some architectures put a payload into the
xNaN, such as a 3-bit code for why the xNaN was created; others also put
IP<low-bits> in, to help identify the instruction where the xNaN first
occurred.

    That's how Nan propagation works now, to poison the calculation.
    The Nan propagation rules were designed back when people thought
that using traps for fixing individual calculations was a good idea.
    That way Nan could serve as either an error or missing value
    and your exception handler could customize the behavior you want.

    "6.2.3 NaN propagation
    An operation that propagates a NaN operand to its result and has a
    single NaN as an input should produce a NaN with the payload of the
    input NaN if representable in the destination format.

    If two or more inputs are NaN, then the payload of the resulting NaN
    should be identical to the payload of one of the input NaNs if
    representable in the destination format. This standard does not
    specify which of the input NaNs will provide the payload."

    Traps are expensive for pipelines, vectors, gpu's,
Exceptions themselves are inexpensive. Misuse of exceptions for the 'fixup
and continue' antipattern is expensive.
Invalid Operand exception is the only IEEE-754 exception which I do not
find misdesigned* and even consider potentially useful, despite the fact
that it has never helped me in practice.
    so I'd want
    None to behave differently - I'm just not sure what.
    And I recognize (below) that there may be different ways that users
want None to behave, so suggest there might be control bits to select
among multiple None behaviors on each instruction.

    It could also have different operator instruction options that
    select different behaviors similar to rounding mode or exception
    handling bits. All those option bits would take up a lot of
    instruction space.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 24 12:34:24 2025
    From Newsgroup: comp.arch

    On 8/24/2025 10:53 AM, George Neuner wrote:
    On Sun, 24 Aug 2025 01:34:41 -0500, BGB <cr88192@gmail.com> wrote:


    If your math goes NaN, it means math was wrong.
    But, if the math is not wrong, there are no NaN's.

    Not exactly. A NaN result means that the computation has failed. It
    may be due to limited precision or range rather than incorrect math.

    For most use cases, a result of INF or IND[*] similarly means the
    computation has failed and there is no point trying to continue.

    [*] IEEE 754-2008.

    Yeah, pretty much.

    As noted, what I was saying before probably should not have been taken
    as arguing for a complete elimination of NaN or that the formats
    themselves should change here...

    Rather that NaN typically merely signals "Well, the math has broken";
    and usually serves little role beyond this (apart from giving certain JavaScript VMs a place to store their pointers; but if a JS VM were to
    use an object pointer as an input to a computation, this itself is an
    error condition).


    Inf/NaN is usually at least more neutral (the dynamic range of Binary32
    and Binary64 being so large, that losing one level doesn't cost much).

    For Binary16, it is iffy (due to Binary16 having limited dynamic range).
    For 8-bit formats, had already stated my stance here (only a single
    optional NaN value exists).


    Though, if I were to tweak Binary16 here, possibly might be:
    Largest exponent becomes normal range again;
    Inf/NaN is folded into the -0 range.
    8000..81FF: Assume -0
    8200..8201: +Inf/-Inf
    8202..83FF: NaNs
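
Purely as illustration, that folding as a classifier (a sketch; the ranges
are exactly the ones listed above):

    #include <stdint.h>

    enum b16_kind { B16_NUMBER, B16_NEG_ZERO, B16_INF, B16_NAN };

    /* Classify a tweaked-Binary16 pattern under the folding above. */
    enum b16_kind classify16(uint16_t h)
    {
        if (h >= 0x8000 && h <= 0x81FF) return B16_NEG_ZERO;
        if (h == 0x8200 || h == 0x8201) return B16_INF;   /* +Inf / -Inf */
        if (h >= 0x8202 && h <= 0x83FF) return B16_NAN;
        return B16_NUMBER;   /* everything else, incl. old max-exponent range */
    }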

    But, debatable, as this sort of thing could create interoperability issues.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Aug 24 19:48:07 2025
    From Newsgroup: comp.arch

    EricP wrote:
    Terje Mathisen wrote:
    EricP wrote:
    Terje Mathisen wrote:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

This would have simplified all sorts of array/matrix sw where both
errors (NaN) and missing (None) items are possible.

    In what ways would None behave differently from SNaN?

It would be transparently ignored in reductions, with zero overhead.

There is also the behavior with operators - how is it different from xNan?
    xNan behaves like an error and poisons any calculation it is in,
    which is also how SQL behaves wrt NULL values:

    value + xNan => xNan
    value * xNan => xNan

    whereas Null is typically thought of as a missing value:

    value + Null => value?
    value * Null => 0?

It could also have different operator instruction options that select
different behaviors similar to rounding mode or exception handling bits.
    All those option bits would take up a lot of instruction space.

I'm used to the Mill None, where a store becomes a NOP, a mul behaves
like x * 1 (or a NOP), same for other operations.

    Terje


How does Mill store a None value if they change to NOP?
It does not!
Storing a None value behaves just like a NOP, in that nothing really happens.
It is like having predicated SIMD stores, with one or more entries (or
individual bytes) masked out.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Aug 24 16:09:40 2025
    From Newsgroup: comp.arch

    EricP wrote:
    Terje Mathisen wrote:
    EricP wrote:
    Terje Mathisen wrote:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

This would have simplified all sorts of array/matrix sw where both
errors (NaN) and missing (None) items are possible.

    In what ways would None behave differently from SNaN?

    It would be transparently ignored in reductions, with zero overhead.

    There is also the behavior with operators - how is it different from
    xNan?
    xNan behaves like an error and poisons any calculation it is in,
    which is also how SQL behaves wrt NULL values:

    value + xNan => xNan
    value * xNan => xNan

    whereas Null is typically thought of as a missing value:

    value + Null => value?
    value * Null => 0?

    It could also have different operator instruction options that select
different behaviors similar to rounding mode or exception handling bits.
All those option bits would take up a lot of instruction space.

    I'm used to the Mill None, where a store becomes a NOP, a mul behaves
    like x * 1 (or a NOP), same for other operations.

    Terje


    I was thinking of spreadsheet style rules for missing cells.
    Something that's compatible with dsp's, simd, vector, and gpu's,
    but I don't know enough about all their calculations to know the
    different ways calculations handle missing values.

    And there can be different None rules just like different roundings.

    Musing about errors...

As exceptions can be masked, one might also want to make a
    distinction between generated and propagated Nan,
    as well as the reason for Nan.

    Something like this for Nan high order fraction bits:
- 1 bit to indicate 0=Quiet or 1=Signalled
    - 1 bit to indicate 0=Generated or 1=Propagated
    - 4 bits to indicate error code with
    0 = Missing or None
    1 = Invalid operand format
    2 = Invalid operation
    3 = Divide by zero
    etc
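
As a concrete (and purely illustrative) rendering of that layout in a
binary64 fraction, the bit positions below are my choice, not a spec:

    #include <stdint.h>

    /* fraction<51>    : 0 = Quiet, 1 = Signalled (many CPUs use the
                         opposite convention, 1 = quiet)
       fraction<50>    : 0 = Generated, 1 = Propagated
       fraction<49:46> : 4-bit error code (0 = None, 1 = invalid format,
                         2 = invalid operation, 3 = divide by zero, ...)
       A real encoding must keep at least one fraction bit nonzero, or a
       quiet, generated None would alias the Inf pattern.                */
    uint64_t encode_nan64(unsigned signalled, unsigned propagated, unsigned code)
    {
        uint64_t frac = ((uint64_t)(signalled & 1) << 51)
                      | ((uint64_t)(propagated & 1) << 50)
                      | ((uint64_t)(code & 0xF) << 46)
                      | 1;                      /* low bit set: never aliases Inf */
        return (0x7FFull << 52) | frac;         /* sign 0, maximal exponent */
    }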

If any source operand is a Nan marked Generated then the result is a Nan
    with the same error code but Propagated. If multiple source operands
    are Nan then some rules on how to propagate the Nan error value
    - if any is Signalled then result is Signalled,
    - if all Nan source operands are error code None then the result is None
    otherwise the error code is one of the >0 codes.

    And (assuming instruction bits are free and infinitely available)
    instruction bits to control how each deals with Nan source values
    and how to handle each (fault, trap, propagate, substitute),
how it generates Nan (quiet, signalled) for execution errors,
    and various exception condition enable flags (denorm, overflow,
    underflow, Inexact, etc).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Aug 24 23:57:34 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    EricP wrote:
    Terje Mathisen wrote:
    EricP wrote:
    Terje Mathisen wrote:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

This would have simplified all sorts of array/matrix sw where both
errors (NaN) and missing (None) items are possible.

    In what ways would None behave differently from SNaN?

    It would be transparently ignored in reductions, with zero overhead.

    There is also the behavior with operators - how is it different from
    xNan?
    xNan behaves like an error and poisons any calculation it is in,
    which is also how SQL behaves wrt NULL values:

    value + xNan => xNan
    value * xNan => xNan

    whereas Null is typically thought of as a missing value:

    value + Null => value?
    value * Null => 0?

    It could also have different operator instruction options that select
different behaviors similar to rounding mode or exception handling bits.
All those option bits would take up a lot of instruction space.

    I'm used to the Mill None, where a store becomes a NOP, a mul behaves
    like x * 1 (or a NOP), same for other operations.

    Terje


    I was thinking of spreadsheet style rules for missing cells.
    Something that's compatible with dsp's, simd, vector, and gpu's,
    but I don't know enough about all their calculations to know the
    different ways calculations handle missing values.

    And there can be different None rules just like different roundings.

    Musing about errors...

As exceptions can be masked, one might also want to make a
    distinction between generated and propagated Nan,
    as well as the reason for Nan.

    Something like this for Nan high order fraction bits:
- 1 bit to indicate 0=Quiet or 1=Signalled
    - 1 bit to indicate 0=Generated or 1=Propagated
    - 4 bits to indicate error code with
    0 = Missing or None
    1 = Invalid operand format
    2 = Invalid operation
    3 = Divide by zero
    etc

When My 66000 FPU generates a NaN it puts a 3-bit code in fraction<50:48>
to record "why" the NaN was generated, and puts IP<47:2> in the LoB to
indicate where the NaN was generated. Since there are only 5 IEEE 754
kinds of errors, a 3-bit code suffices.
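
Rendering that layout literally for binary64 (a sketch of the described
fields; the quiet bit and masking are my assumptions, not the spec):

    #include <stdint.h>

    /* fraction<50:48> = 3-bit "why" code; fraction<45:0> = IP<47:2>.
       Bit 51 is left set so the result stays a quiet NaN.            */
    uint64_t make_why_nan(unsigned why, uint64_t ip)
    {
        uint64_t frac = (1ull << 51)
                      | ((uint64_t)(why & 7) << 48)
                      | ((ip >> 2) & ((1ull << 46) - 1));   /* IP<47:2> */
        return (0x7FFull << 52) | frac;
    }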

If any source operand is a Nan marked Generated then the result is a Nan
    with the same error code but Propagated. If multiple source operands
    are Nan then some rules on how to propagate the Nan error value
    - if any is Signalled then result is Signalled,
    - if all Nan source operands are error code None then the result is None
    otherwise the error code is one of the >0 codes.

    One can tell propagated/generated by IP in fraction.

    And (assuming instruction bits are free and infinitely available)
    instruction bits to control how each deals with Nan source values
    and how to handle each (fault, trap, propagate, substitute),
how it generates Nan (quiet, signalled) for execution errors,
    and various exception condition enable flags (denorm, overflow,
    underflow, Inexact, etc).

    Yeah, that's not gonna 'appen.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Aug 26 15:37:59 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Greetings everyone !

    --------------------

I'm really leery about the idea of starting with MMU enabled;
I don't see any advantage to doing that.

    Boot ROM is encrypted, and the MMU tables provide access to
    the keys.

    DRAM can also be encrypted, using the same mechanism(s).

    And that is all I should say about this for now.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Aug 26 16:45:47 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Greetings everyone !

    --------------------

I'm really leery about the idea of starting with MMU enabled;
I don't see any advantage to doing that.

    Boot ROM is encrypted,

    Not uncommon.

    and the MMU tables provide access to
    the keys.

    I'd consider using e-fuses for the keys; burned
    during SLT, they'll be unique per chip.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Aug 28 19:52:42 2025
    From Newsgroup: comp.arch

    On 8/21/2025 1:49 PM, MitchAlsup wrote:

    Greetings everyone !

    snip

    My 66000 ISA is in "pretty good shape" having almost no changes over
    the last 6 months, with only specification clarifications. So it was time
    to work on the non-ISA parts.

    You mention two non-ISA parts that you have been working on. I thought
    I would ask you for your thoughts on another non-ISA part. Timers and
    clocks. Doing a "clean slate" ISA frees you from being compatible with
    lots of old features that might have been the right thing to do back
    then, but aren't now.

    So, how many clocks/timers should a system have? What precision? How
    fast does the software need to be able to access them? I presume you
need some comparators (unless you use count down to zero). Should the
comparisons be one time or recurring? What about syncing with an
    external timer? There are many such decisions to make, and I am curious
    as to your thinking on the subject.

    If you haven't gotten around to working on this part of the system, just
    say so.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Aug 29 15:26:02 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 8/21/2025 1:49 PM, MitchAlsup wrote:

    Greetings everyone !

    snip

    My 66000 ISA is in "pretty good shape" having almost no changes over
    the last 6 months, with only specification clarifications. So it was time to work on the non-ISA parts.

    You mention two non-ISA parts that you have been working on. I thought
    I would ask you for your thoughts on another non-ISA part. Timers and clocks. Doing a "clean slate" ISA frees you from being compatible with
    lots of old features that might have been the right thing to do back
    then, but aren't now.

    So, how many clocks/timers should a system have?

    Lots. Every major resource should have its own clock as part of its
    performance counter set.

    Every interruptible resource should have its own timer which is programmed
    to throw interrupts to one thing or another.

    What precision?

Clocks need to be as fast as the fastest event; right now that means
16 GHz, since PCIe 5.0 and 6.0 use 16 GHz clock bases. But, realistically,
if you can count 0..16 events per ns, it's fine.

    How
    fast does the software need to be able to access them?

1 instruction, then the latency of actual access.
    2 instructions back-to-back to perform an ATOMIC-like read-update.
    LDD Rd,[timer]
    STD Rs,[timer]

    I presume you
need some comparators (unless you use count down to zero).

    You can count down to zero, count up to zero, or use a comparator.
    Zeroes cost less HW than comparators. Comparators also require
    an additional register and an additional instruction at swap time.

    Should the comparisons be one time or recurring?

    I have no opinion at this time.

    What about syncing with an
    external timer?

    A necessity--that is what the ATOMIC-like comment above is for.

    There are many such decisions to make, and I am curious
    as to your thinking on the subject.

    If you haven't gotten around to working on this part of the system, just
    say so.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Aug 29 12:05:19 2025
    From Newsgroup: comp.arch

    On 8/28/2025 9:52 PM, Stephen Fuld wrote:
    On 8/21/2025 1:49 PM, MitchAlsup wrote:

    Greetings everyone !

    snip

    My 66000 ISA is in "pretty good shape" having almost no changes over
    the last 6 months, with only specification clarifications. So it was time
    to work on the non-ISA parts.

You mention two non-ISA parts that you have been working on. I thought
I would ask you for your thoughts on another non-ISA part. Timers and
clocks. Doing a "clean slate" ISA frees you from being compatible with
lots of old features that might have been the right thing to do back
    then, but aren't now.

So, how many clocks/timers should a system have? What precision? How
fast does the software need to be able to access them? I presume you
need some comparators (unless you use count down to zero). Should the
comparisons be one time or recurring? What about syncing with an
external timer? There are many such decisions to make, and I am curious
    as to your thinking on the subject.

    If you haven't gotten around to working on this part of the system, just
    say so.


    I don't know as much about his case, but can note in my case, I ended up
    going with:
    A 1kHz timer interrupt;
    A 64-bit clock register with a 1us precision.

    Also:
    A cycle counter;
    A random noise source.

    Maybe a case could be made for a (probably virtual) 1ns clock register.
    Though, one could in theory (on a faster CPU) estimate the ns time from
    the cycle counter relative to the 1us timer; assuming the cycle counter remains at a constant speed (note that it counts raw cycles, so isn't necessarily tied to how many instruction-cycles have been performed).
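
A hedged sketch of that estimate: calibrate the cycle counter against the
1 us clock once, then scale (read_us() and read_cycles() are hypothetical
accessors for the registers described above):

    #include <stdint.h>

    extern uint64_t read_us(void);       /* hypothetical 1 us clock register */
    extern uint64_t read_cycles(void);   /* hypothetical raw cycle counter   */

    static uint64_t cyc0, cyc_per_ms;    /* calibration anchors */

    void ns_clock_init(void)
    {
        uint64_t us0 = read_us();
        cyc0 = read_cycles();
        while (read_us() - us0 < 1000)   /* spin for ~1 ms */
            ;
        cyc_per_ms = read_cycles() - cyc0;
    }

    /* Estimated ns since init; valid only while the clock speed is constant
       (a production version would also guard the multiply against overflow). */
    uint64_t now_ns(void)
    {
        return (read_cycles() - cyc0) * 1000000 / cyc_per_ms;
    }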


    Can note that MSP430 had the option of a 32kHz timer interrupt, but my
    ISA design at 50 MHz can't manage a 32kHz interrupt without it eating
    too much of the CPU.

    I suspect this may be an issue of having a higher interrupt-handling
    overhead than an MSP430.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Aug 29 17:56:22 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 8/21/2025 1:49 PM, MitchAlsup wrote:

    Greetings everyone !

    snip

    My 66000 ISA is in "pretty good shape" having almost no changes over
the last 6 months, with only specification clarifications. So it was time
to work on the non-ISA parts.

    You mention two non-ISA parts that you have been working on. I thought
    I would ask you for your thoughts on another non-ISA part. Timers and
    clocks. Doing a "clean slate" ISA frees you from being compatible with
    lots of old features that might have been the right thing to do back
    then, but aren't now.

    So, how many clocks/timers should a system have?

Lots. Every major resource should have its own clock as part of its
performance counter set.

    Every interruptible resource should have its own timer which is programmed
    to throw interrupts to one thing or another.

Guest OS needs timers synced to, but not duplicates of, the main
system timer. There should be a mechanism to apply an offset to
    the timer in the guest view of the main system timer (e.g.
    as stored in the AMD SVM VMCB).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri Aug 29 11:21:14 2025
    From Newsgroup: comp.arch

    On 8/29/2025 8:26 AM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 8/21/2025 1:49 PM, MitchAlsup wrote:

    Greetings everyone !

    snip

    My 66000 ISA is in "pretty good shape" having almost no changes over
the last 6 months, with only specification clarifications. So it was time
to work on the non-ISA parts.

    You mention two non-ISA parts that you have been working on. I thought
    I would ask you for your thoughts on another non-ISA part. Timers and
    clocks. Doing a "clean slate" ISA frees you from being compatible with
    lots of old features that might have been the right thing to do back
    then, but aren't now.

    So, how many clocks/timers should a system have?

    Lots. Every major resource should have its own clock as part of its performance counter set.

    Every interruptible resource should have its own timer which is programmed
    to throw interrupts to one thing or another.

    So if I want to keep track of how much CPU time a task has accumulated, someone has to save the value of the CPU clock when the task gets
    interrupted or switched out. Is this done by the HW or by code in the
    task switcher? Later, when the task gets control of the CPU again,
    there needs to be a mechanism to resume adding time to its saved value.
    How is this done?

    What precision?

    Clocks need to be as fast as the fastest event, right now that means
    16 GHz since PCIe 5.0 and 6.0 use 16 GHz clock bases. But, realistically,
if you can count 0..16 events per ns, it's fine.

    How
    fast does the software need to be able to access them?

1 instruction, then the latency of actual access.
    2 instructions back-to-back to perform an ATOMIC-like read-update.
    LDD Rd,[timer]
    STD Rs,[timer]

    Good.

    I presume you
need some comparators (unless you use count down to zero).

    You can count down to zero, count up to zero, or use a comparator.
    Zeroes cost less HW than comparators. Comparators also require
    an additional register and an additional instruction at swap time.

    Should the
    comparisons be one time or recurring?

    I have no opinion at this time.

    What about syncing with an
    external timer?

    A necessity--that is what the ATOMIC-like comment above is for.

    Got it.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Aug 29 19:31:58 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 8/29/2025 8:26 AM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 8/21/2025 1:49 PM, MitchAlsup wrote:

    Greetings everyone !

    snip

    My 66000 ISA is in "pretty good shape" having almost no changes over
the last 6 months, with only specification clarifications. So it was time
to work on the non-ISA parts.

    You mention two non-ISA parts that you have been working on. I thought
    I would ask you for your thoughts on another non-ISA part. Timers and
    clocks. Doing a "clean slate" ISA frees you from being compatible with
    lots of old features that might have been the right thing to do back
    then, but aren't now.

    So, how many clocks/timers should a system have?

    Lots. Every major resource should have its own clock as part of its performance counter set.

    Every interruptible resource should have its own timer which is programmed to throw interrupts to one thing or another.

    So if I want to keep track of how much CPU time a task has accumulated, someone has to save the value of the CPU clock when the task gets interrupted or switched out. Is this done by the HW or by code in the
    task switcher?

    Task switcher.

    Later, when the task gets control of the CPU again,
    there needs to be a mechanism to resume adding time to its saved value.
    How is this done?

    One reads the old performance counters:
    LDM Rd,Rd+7,[MMIO+R29]
    and saves with Thread:
    STM Rd,Rd+7,[Thread+128]

    Side Note: LDM and STM are ATOMIC wrt the 8 doublewords loaded/stored.

    To put them back:
    LDM Rd,Rd+7,[Thread+128]
    STM Rd,Rd+7,[MMIO+R29]

    I should also note that all major resources {Interconnect, L3 Cache,
    DAM, HostBridge, I/O MMU, PCIe Segmenter, PCIe Root Complexes} have performance counters; and when there are more than one of them, each
    has its own performance counter set (8 counters each with 1-byte
    selectors.) Each resource defines what encoding counts what event.
    Core currently counts 35 different events, L2 28 events, Miss Buffers
    currently counts 25 events.
    ------------------
    There might be other ways to keep a clock constantly running and sample
    it at task switch, instead of replacing it at task switch.
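
That constantly-running-clock variant is the scheme most OSes use in
software; a minimal sketch (the struct and accessor are hypothetical):

    #include <stdint.h>

    extern uint64_t read_cpu_clock(void);   /* hypothetical free-running clock */

    struct thread_acct {
        uint64_t accumulated;   /* total CPU time consumed by this thread */
        uint64_t resumed_at;    /* clock value when it last got the CPU   */
    };

    void on_switch_out(struct thread_acct *t)
    {
        t->accumulated += read_cpu_clock() - t->resumed_at;
    }

    void on_switch_in(struct thread_acct *t)
    {
        t->resumed_at = read_cpu_clock();
    }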

    What precision?

    All counters and timers are 64-bits, and can run as fast as core/PCIe

    Clocks need to be as fast as the fastest event, right now that means
16 GHz, since PCIe 5.0 and 6.0 use 16 GHz clock bases. But, realistically,
if you can count 0..16 events per ns, it's fine.

    How
    fast does the software need to be able to access them?

1 instruction, then the latency of actual access.
    2 instructions back-to-back to perform an ATOMIC-like read-update.
    LDD Rd,[timer]
    STD Rs,[timer]

    Good.

    I presume you
need some comparators (unless you use count down to zero).

    You can count down to zero, count up to zero, or use a comparator.
    Zeroes cost less HW than comparators. Comparators also require
    an additional register and an additional instruction at swap time.

    Should the
    comparisons be one time or recurring?

    I have no opinion at this time.

    What about syncing with an
    external timer?

    A necessity--that is what the ATOMIC-like comment above is for.

    Got it.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri Aug 29 23:22:42 2025
    From Newsgroup: comp.arch

    On 8/29/2025 12:31 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 8/29/2025 8:26 AM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 8/21/2025 1:49 PM, MitchAlsup wrote:

    Greetings everyone !

    snip

My 66000 ISA is in "pretty good shape" having almost no changes over
the last 6 months, with only specification clarifications. So it was time
to work on the non-ISA parts.

You mention two non-ISA parts that you have been working on. I thought
I would ask you for your thoughts on another non-ISA part. Timers and
clocks. Doing a "clean slate" ISA frees you from being compatible with
lots of old features that might have been the right thing to do back
    then, but aren't now.

    So, how many clocks/timers should a system have?

    Lots. Every major resource should have its own clock as part of its
    performance counter set.

Every interruptible resource should have its own timer which is programmed
to throw interrupts to one thing or another.

    So if I want to keep track of how much CPU time a task has accumulated,
    someone has to save the value of the CPU clock when the task gets
    interrupted or switched out. Is this done by the HW or by code in the
    task switcher?

    Task switcher.

    Later, when the task gets control of the CPU again,
    there needs to be a mechanism to resume adding time to its saved value.
    How is this done?

    One reads the old performance counters:
    LDM Rd,Rd+7,[MMIO+R29]
    and saves with Thread:
    STM Rd,Rd+7,[Thread+128]

    Side Note: LDM and STM are ATOMIC wrt the 8 doublewords loaded/stored.

    To put them back:
    LDM Rd,Rd+7,[Thread+128]
    STM Rd,Rd+7,[MMIO+R29]

I think there is a problem with that. Let's say there are two OSs
    running under a hypervisor. And I want to collect CPU time for an
    application running under one of those OSs. Now consider a timer
    interrupt to the hypervisor that causes it to switch out the OS that our program is running under and switch in the other OS. The mechanism you described takes care of getting the correct CPU time for the OS that is switched out, but I don't think it "switches out" the application
    program, so the application's CPU time is too high (it includes the time
    spent in the other OS). I don't know how other systems with hypervisors handle this situation.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sat Aug 30 08:16:04 2025
    From Newsgroup: comp.arch

    On 8/29/2025 11:22 PM, Stephen Fuld wrote:
    On 8/29/2025 12:31 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 8/29/2025 8:26 AM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 8/21/2025 1:49 PM, MitchAlsup wrote:

    Greetings everyone !

    snip

My 66000 ISA is in "pretty good shape" having almost no changes over
the last 6 months, with only specification clarifications. So it was time
to work on the non-ISA parts.

You mention two non-ISA parts that you have been working on. I thought
I would ask you for your thoughts on another non-ISA part. Timers and
clocks. Doing a "clean slate" ISA frees you from being compatible with
lots of old features that might have been the right thing to do back
then, but aren't now.

    So, how many clocks/timers should a system have?

    Lots. Every major resource should have its own clock as part of its
    performance counter set.

Every interruptible resource should have its own timer which is programmed
to throw interrupts to one thing or another.

    So if I want to keep track of how much CPU time a task has accumulated,
    someone has to save the value of the CPU clock when the task gets
interrupted or switched out. Is this done by the HW or by code in the
    task switcher?

    Task switcher.

Later, when the task gets control of the CPU again,
    there needs to be a mechanism to resume adding time to its saved value.
    How is this done?

One reads the old performance counters:
    LDM   Rd,Rd+7,[MMIO+R29]
and saves with Thread:
    STM   Rd,Rd+7,[Thread+128]

Side Note: LDM and STM are ATOMIC wrt the 8 doublewords loaded/stored.

To put them back:
    LDM   Rd,Rd+7,[Thread+128]
    STM   Rd,Rd+7,[MMIO+R29]

I think there is a problem with that. Let's say there are two OSs
running under a hypervisor. And I want to collect CPU time for an
application running under one of those OSs. Now consider a timer
interrupt to the hypervisor that causes it to switch out the OS that our
program is running under and switch in the other OS. The mechanism you
described takes care of getting the correct CPU time for the OS that is
switched out, but I don't think it "switches out" the application
program, so the application's CPU time is too high (it includes the time
spent in the other OS).

    I apologize for the self follow-up, but upon further thought, I believe
    I got this backward. The interrupt will cause the accumulated CPU time
    for the application that is running to be saved correctly, but nothing
    will touch the value for the OS, so it will continue accumulating time,
    even though the CPU is running the hypervisor, or even the other OS.


    I don't know how other systems with hypervisors
    handle this situation.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2