• Re: DMA is obsolete

    From Lars Poulsen@21:1/5 to John Levine on Sat Apr 26 16:28:47 2025
    On 2025-04-26, John Levine <johnl@taugh.com> wrote:
    Well, not entirely. This preprint argues that in environments with
    lots of cores and where latency is an issue, programmed I/O can outperform DMA.

    Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects

    Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe

    Conventional wisdom holds that an efficient interface between an OS
    running on a CPU and a high-bandwidth I/O device should use Direct
    Memory Access (DMA) to offload data transfer, descriptor rings for
    buffering and queuing, and interrupts for asynchrony between cores and
    device. In this paper we question this wisdom in the light of two
    trends: modern and emerging cache-coherent interconnects like CXL3.0,
    and workloads, particularly microservices and serverless computing.
    Like some others before us, we argue that the assumptions of the
    DMA-based model are obsolete, and in many use-cases programmed I/O,
    where the CPU explicitly transfers data and control information to and
    from a device via loads and stores, delivers a more efficient system.
    However, we push this idea much further. We show, in a real hardware
    implementation, the gains in latency for fine-grained communication
    achievable using an open cache-coherence protocol which exposes cache
    transitions to a smart device, and that throughput is competitive with
    DMA over modern interconnects. We also demonstrate three use-cases:
    fine-grained RPC-style invocation of functions on an accelerator,
    offloading of operators in a streaming dataflow engine, and a network
    interface targeting serverless functions, comparing our use of
    coherence with both traditional DMA-style interaction and a
    highly-optimized implementation using memory-mapped programmed I/O
    over PCIe.

    https://arxiv.org/abs/2409.08141

    What is the difference between DMA and message-passing to another core
    doing a CMOV loop at the ISA level?

    DMA means doing that in the micro-engine instead of at the ISA level.
    Same difference.

    What am I missing?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sat Apr 26 16:19:45 2025
    Well, not entirely. This preprint argues that in environments with
    lots of cores and where latency is an issue, programmed I/O can outperform
    DMA.

    Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects

    Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe

    Conventional wisdom holds that an efficient interface between an OS
    running on a CPU and a high-bandwidth I/O device should use Direct
    Memory Access (DMA) to offload data transfer, descriptor rings for
    buffering and queuing, and interrupts for asynchrony between cores and
    device. In this paper we question this wisdom in the light of two
    trends: modern and emerging cache-coherent interconnects like CXL3.0,
    and workloads, particularly microservices and serverless computing.
    Like some others before us, we argue that the assumptions of the
    DMA-based model are obsolete, and in many use-cases programmed I/O,
    where the CPU explicitly transfers data and control information to and
    from a device via loads and stores, delivers a more efficient system.
    However, we push this idea much further. We show, in a real hardware
    implementation, the gains in latency for fine-grained communication
    achievable using an open cache-coherence protocol which exposes cache
    transitions to a smart device, and that throughput is competitive with
    DMA over modern interconnects. We also demonstrate three use-cases:
    fine-grained RPC-style invocation of functions on an accelerator,
    offloading of operators in a streaming dataflow engine, and a network
    interface targeting serverless functions, comparing our use of
    coherence with both traditional DMA-style interaction and a
    highly-optimized implementation using memory-mapped programmed I/O
    over PCIe.

    https://arxiv.org/abs/2409.08141
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Lars Poulsen on Sat Apr 26 19:28:21 2025
    Lars Poulsen wrote:
    On 2025-04-26, John Levine <johnl@taugh.com> wrote:
    Well, not entirely. This preprint argues that in environments with
    lots of cores and where latency is an issue, programmed I/O can outperform DMA.

    Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects

    Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe
    [snip]

    https://arxiv.org/abs/2409.08141

    What is the difference between DMA and message-passing to another core
    doing a CMOV loop at the ISA level?

    DMA means doing that in the micro-engine instead of at the ISA level.
    Same difference.

    What am I missing?


    I think, in the end it all comes down to power:

    If the DMA engine can move n GB of data using less total power than
    having a regular core do it with programmed IO, then the DMA engine wins.

    OTOH, I have argued here in c.arch that for most data input streams, a
    regular core is going to look at the data eventually, and in that case
    the same core can do the work and either process it directly (in
    register-file-sized or smaller blocks) or work as a prefetcher to first
    load up $L1-sized blocks and then process that chunk.
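
    A minimal sketch of that pattern in C (dev_buf, CHUNK and consume() are
    assumptions, not anything from a real driver): the core itself acts as
    the prefetcher, pulling one roughly-L1-sized chunk at a time and then
    processing it while it is still cache-hot.

    #include <stddef.h>
    #include <stdint.h>

    #define CHUNK (32 * 1024)            /* roughly one L1D worth of data */

    void pio_consume(const volatile uint8_t *dev_buf, size_t len,
                     void (*consume)(const uint8_t *, size_t))
    {
        static uint8_t local[CHUNK];

        for (size_t off = 0; off < len; off += CHUNK) {
            size_t n = (len - off < CHUNK) ? len - off : CHUNK;

            /* "Prefetch" phase: these loads pull the lines into cache. */
            for (size_t i = 0; i < n; i++)
                local[i] = dev_buf[off + i];

            /* Process phase: the chunk is already cache-resident. */
            consume(local, n);
        }
    }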

    On the gripping hand, if this is either going out, or you only need to
    look at a small percentage of the incoming cache lines worth of data,
    then the more power-efficient DMA engine can still win.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Levine on Sat Apr 26 17:29:06 2025
    John Levine <johnl@taugh.com> writes:
    Well, not entirely. This preprint argues that in environments with
    lots of cores and where latency is an issue, programmed I/O can outperform DMA.

    Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects

    Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe

    <snip abstract>


    https://arxiv.org/abs/2409.08141

    Interesting article, thanks for posting.

    Their conclusion is not at all surprising for the operations they target
    in the paper. PCI express throughput has increased with each
    generation, and PCI express latencies have decreased with each
    generation. There are certain workloads enabled by CXL that
    benefit from reduced PCIe latency, but those are primarily
    aimed at increasing directly accessible memory.

    However, I expect there are still benefits in using DMA for bulk data transfer, particularly for network packet handling where
    throughput is more interesting than PCI MMIO latency.

    One concern that arises from the paper is the security
    implications of device access to the cache coherency
    protocol. Not an issue for a well-behaved device, but
    potentially problematic in a secure environment with
    third-party CXL-mem devices.

    At 3Leaf systems, we extended the coherency domain over
    IB or 10Gbe Ethernet to encompass multiple servers in a
    single coherency domain, which both facilitated I/O
    and provided a single shared physical address space across
    multiple servers (up to 16). CXL-mem is basically the same
    but using PCIe instead of IB.

    Granted, that was close to 20 years ago, and switch latencies
    were significant (100ns for IB, far more for Ethernet).

    CXL-mem is a similar technology with a different transport (we
    looked at Infiniband, 10Ge ethernet and "advanced switching"
    (a flavor of PCIe)). Infiniband was the most mature of the
    three technologies and switch latencies were significantly lower
    for IB than the competing transports.

    Today, my CPOE sells a couple of CXL2.0 enabled PCIe devices for
    memory expansion; one has 16 high-end ARM V cores.

    Quoting from the article (p.2)
    " As a second example: for throughput-oriented workloads
    DMA has evolved to efficiently transfer data to and from
    main memory without polluting the CPU cache. However, for
    small, fine-grained interactions, it is important that almost all
    the data gets into the right CPU cache as quickly as possible."

    Most modern CPUs support "allocate" hints on inbound DMA
    that will automatically place the data in the right CPU cache as
    quickly as possible.

    Decomposing that packet transfer into CPU loads and stores
    in a coherent fabric doesn't gain much, and burns more power
    on the "device" than a DMA engine.

    It's interesting they use one of the processors (designed in 2012) that
    we built over a decade ago (and they misspell our company name :-)
    in their research computer. That processor does have
    a mechanism to allocate data in cache on inbound DMA[*]; it's
    worth noting that the 48 cores on that processor are in-order
    cores. The text comparing it with a modern i7 at 3.6 GHz
    doesn't note that.

    [*] Although I don't recall if that mechanism was documented
    in the public processor technical documentation.

    Their description of the Thunder X-1 processor cache is not accurate;
    it's not PIPT, it is PIVT (implemented in such a way as to
    appear to software as if it were PIPT). The V drops a cycle
    off the load-to-use latency.

    It was also the only ARM64 processor chip we built with a cache-coherent interconnect until the recent CXL based products.

    Overall, a very interesting paper.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Apr 26 19:25:05 2025
    On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:

    John Levine <johnl@taugh.com> writes:
    Well, not entirely. This preprint argues that in environments with
    lots of cores and where latency is an issue, programmed I/O can
    outperform DMA.

    Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent
    Interconnects

    Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe

    <snip abstract>


    https://arxiv.org/abs/2409.08141

    Interesting article, thanks for posting.

    Their conclusion is not at all surprising for the operations they target
    in the paper. PCI express throughput has increased with each
    generation, and PCI express latencies have decreased with each
    generation. There are certain workloads enabled by CXL that
    benefit from reduced PCIe latency, but those are primarily
    aimed at increasing directly accessible memory.

    However, I expect there are still benefits in using DMA for bulk data transfer, particularly for network packet handling where
    throughput is more interesting than PCI MMIO latency.

    I would like to add a thought to the concept under discussion::

    Does the paper's conclusion hold better or worse if/when the
    core ISA contains both LDM/STM and MM instructions? LDM/STM
    allow for several sequential registers to move to/from MMI/O
    memory in a single interconnect transaction, while MM allows
    for up-to page-sized transfers in a single instruction and
    only 2 interconnect transactions.
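
    Roughly the two extremes, sketched in C (illustrative only: the group
    size is arbitrary and the memcpy merely stands in for a single MM-style
    instruction, it is not one):

    #include <stdint.h>
    #include <string.h>

    /* One interconnect transaction per word: n stores, n transactions. */
    void word_at_a_time(volatile uint64_t *mmio, const uint64_t *src, int n)
    {
        for (int i = 0; i < n; i++)
            mmio[i] = src[i];
    }

    /* Stand-in for an MM-style instruction: one source stream and one
       destination stream, up to a page in a single operation, so the
       fabric can carry the payload in a couple of transactions. */
    void bulk_move(volatile void *mmio, const void *src, size_t bytes)
    {
        memcpy((void *)mmio, src, bytes);
    }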

    One concern that arises from the paper is the security
    implications of device access to the cache coherency
    protocol. Not an issue for a well-behaved device, but
    potentially problematic in a secure environment with
    third-party CXL-mem devices.

    Citation please !?!

    Also note:: device DMA goes through the I/O MMU, which adds a
    modicum of security-fencing around device DMA accesses
    but also adds latency.

    At 3Leaf systems, we extended the coherency domain over
    IB or 10Gbe Ethernet to encompass multiple servers in a
    single coherency domain, which both facilitated I/O
    and provided a single shared physical address space across
    multiple servers (up to 16). CXL-mem is basically the same
    but using PCIe instead of IB.

    IB == InfiniBand ?!?

    Granted, that was close to 20 years ago, and switch latencies
    were significant (100ns for IB, far more for Ethernet).

    CXL-mem is a similar technology with a different transport (we
    looked at Infiniband, 10Ge ethernet and "advanced switching"
    (a flavor of PCIe)). Infiniband was the most mature of the
    three technologies and switch latencies were significantly lower
    for IB than the competing transports.

    Today, my CPOE sells a couple of CXL2.0 enabled PCIe devices for
    memory expansion; one has 16 high-end ARM V cores.

    Quoting from the article (p.2)
    " As a second example: for throughput-oriented workloads
    DMA has evolved to efficiently transfer data to and from
    main memory without polluting the CPU cache. However, for
    small, fine-grained interactions, it is important that almost all
    the data gets into the right CPU cache as quickly as possible."

    Most modern CPUs support "allocate" hints on inbound DMA
    that will automatically place the data in the right CPU cache as
    quickly as possible.

    Decomposing that packet transfer into CPU loads and stores
    in a coherent fabric doesn't gain much, and burns more power
    on the "device" than a DMA engine.

    That was my initial thought--core performing lots of LD/ST to
    MMI/O is bound to consume more power than device DMA.

    Secondarily, using 1-few cores to perform PIO is not going to
    have the data land in the cache of the core that will run when
    the data has been transferred. The data lands in the cache doing
    PIO and not in the one to receive control after I/O is done.
    {{It may still be "closer than" memory--but several cache
    coherence protocols take longer cache-cache than dram-cache.}}

    It's interesting they use one of the processors (designed in 2012) that
    we built over a decade ago (and they misspell our company name :-)
    in their research computer. That processor does have
    a mechanism to allocate data in cache on inbound DMA[*]; it's
    worth noting that the 48 cores on that processor are in-order
    cores. The text comparing it with a modern i7 at 3.6 GHz
    doesn't note that.

    [*] Although I don't recall if that mechanism was documented
    in the public processor technical documentation.

    Their description of the Thunder X-1 processor cache is not accurate;
    it's not PIPT, it is PIVT (implemented in such a way as to
    appear to software as if it were PIPT). The V drops a cycle
    off the load-to-use latency.

    Generally it's VIPT (virtual index, physical tag) with a few bits
    of virtual aliasing to disambiguate P&s.

    It was also the only ARM64 processor chip we built with a cache-coherent interconnect until the recent CXL based products.

    Overall, a very interesting paper.

    Reminds me of trying to sell a micro x86-64 to AMD as a project.
    The µ86 is a small x86-64 core made available as IP in Verilog
    where it has/runs the same ISA as main GBOoO x86, but is placed
    "out in the PCIe" interconnect--performing I/O services topo-
    logically adjacent to the device itself. This allows 1ns access
    latencies to DCRs and performing OS queueing of DPCs,... without
    bothering the GBOoO cores.

    AMD didn't buy the arguments.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sun Apr 27 14:01:22 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:

    John Levine <johnl@taugh.com> writes:
    Well, not entirely. This preprint argues that in environments with
    lots of cores and where latency is an issue, programmed I/O can
    outperform DMA.

    Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent
    Interconnects

    Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe

    <snip abstract>


    https://arxiv.org/abs/2409.08141

    Interesting article, thanks for posting.

    Their conclusion is not at all surprising for the operations they target
    in the paper. PCI express throughput has increased with each
    generation, and PCI express latencies have decreased with each
    generation. There are certain workloads enabled by CXL that
    benefit from reduced PCIe latency, but those are primarily
    aimed at increasing directly accessible memory.

    However, I expect there are still benefits in using DMA for bulk data
    transfer, particularly for network packet handling where
    throughput is more interesting than PCI MMIO latency.

    I would like to add a thought to the concept under discussion::

    Does the paper's conclusion hold better or worse if/when the
    core ISA contains both LDM/STM and MM instructions. LDM/STM
    allow for several sequential registers to move to/from MMI/O
    memory in a single interconnect transaction, while MM allows
    for up-to page-sized transfers in a single instruction and
    only 2 interconnect transactions.

    The article discusses using the Intel 64-byte store
    instructions, if I recall correctly. ARM also has
    a 64-byte store in the latest versions of the ISA.
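
    If that is MOVDIR64B, a minimal sketch of the Intel side (the
    descriptor layout and the dev_window mapping are assumptions; compile
    with -mmovdir64b; ARM's 64-byte ST64B fills the same role):

    #include <immintrin.h>
    #include <stdint.h>

    struct desc { uint64_t words[8]; };      /* exactly 64 bytes */

    /* Push one 64-byte descriptor to a device doorbell as a single,
       non-torn 64-byte write instead of eight 8-byte stores. */
    static inline void push_desc(volatile void *dev_window,
                                 const struct desc *d)
    {
        _movdir64b((void *)dev_window, d);
    }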


    One concern that arises from the paper is the security
    implications of device access to the cache coherency
    protocol. Not an issue for a well-behaved device, but
    potentially problematic in a secure environment with
    third-party CXL-mem devices.

    Citation please !?!

    Also note:: device DMA goes through the I/O MMU, which adds a
    modicum of security-fencing around device DMA accesses
    but also adds latency.

    Clearly any device that participates in the coherency
    protocol can snoop addresses - a covert channel.


    At 3Leaf systems, we extended the coherency domain over
    IB or 10Gbe Ethernet to encompass multiple servers in a
    single coherency domain, which both facilitated I/O
    and provided a single shared physical address space across
    multiple servers (up to 16). CXL-mem is basically the same
    but using PCIe instead of IB.

    IB == InfiniBand ?!?

    Yes.


    <snip>

    Most modern CPUs support "allocate" hints on inbound DMA
    that will automatically place the data in the right CPU cache as
    quickly as possible.

    Decomposing that packet transfer into CPU loads and stores
    in a coherent fabric doesn't gain much, and burns more power
    on the "device" than a DMA engine.

    That was my initial thought--core performing lots of LD/ST to
    MMI/O is bound to consume more power than device DMA.

    Secondarily, using 1-few cores to perform PIO is not going to
    have the data land in the cache of the core that will run when
    the data has been transferred. The data lands in the cache doing
    PIO and not in the one to receive control after I/O is done.

    The point of the paper is that the PIO is done into a local
    cache line (owned/exclusive) by the device. The entire cache
    line eventually makes its way to memory (or a cache associated
    with the processor processing the data) as a bulk transfer
    rather than individual stores.

    {{It may still be "closer than" memory--but several cache
    coherence protocols take longer cache-cache than dram-cache.}}

    That would reduce the benefits, to be sure.


    It's interesting they use one of the processors (designed in 2012) that
    we built over a decade ago (and they misspell our company name :-)
    in their research computer. That processor does have
    a mechanism to allocate data in cache on inbound DMA[*]; it's
    worth noting that the 48 cores on that processor are in-order
    cores. The text comparing it with a modern i7 at 3.6 GHz
    doesn't note that.

    [*] Although I don't recall if that mechanism was documented
    in the public processor technical documentation.

    Their description of the Thunder X-1 processor cache is not accurate;
    it's not PIPT, it is PIVT (implemented in such a way as to
    appear to software as if it were PIPT). The V drops a cycle
    off the load-to-use latency.

    Generally it's VIPT (virtual index, physical tag) with a few bits
    of virtual aliasing to disambiguate P&s.

    It was also the only ARM64 processor chip we built with a cache-coherent
    interconnect until the recent CXL based products.

    Overall, a very interesting paper.

    Reminds me of trying to sell a micro x86-64 to AMD as a project.
    The µ86 is a small x86-64 core made available as IP in Verilog
    where it has/runs the same ISA as main GBOoO x86, but is placed
    "out in the PCIe" interconnect--performing I/O services topo-
    logically adjacent to the device itself. This allows 1ns access
    latencies to DCRs and performing OS queueing of DPCs,... without
    bothering the GBOoO cores.

    AMD didn't buy the arguments.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sun Apr 27 14:02:04 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:


    Their description of the Thunder X-1 processor cache is not accurate;
    it's not PIPT, it is PIVT (implemented in such a way as to
    appear to software as if it were PIPT). The V drops a cycle
    off the load-to-use latency.

    Generally it's VIPT (virtual index, physical tag) with a few bits
    of virtual aliasing to disambiguate P&s.

    Yes, you are correct, I meant VIPT.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Scott Lurndal on Sun Apr 27 16:12:35 2025
    scott@slp53.sl.home (Scott Lurndal) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:

    John Levine <johnl@taugh.com> writes:
    Well, not entirely. This preprint argues that in environments with
    lots of cores and where latency is an issue, programmed I/O can
    outperform DMA.

    Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent
    Interconnects

    Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe

    <snip abstract>


    https://arxiv.org/abs/2409.08141

    Interesting article, thanks for posting.

    Their conclusion is not at all surprising for the operations they target
    in the paper. PCI express throughput has increased with each
    generation, and PCI express latencies have decreased with each
    generation. There are certain workloads enabled by CXL that
    benefit from reduced PCIe latency, but those are primarily
    aimed at increasing directly accessible memory.

    However, I expect there are still benefits in using DMA for bulk data
    transfer, particularly for network packet handling where
    throughput is more interesting than PCI MMIO latency.

    I would like to add a thought to the concept under discussion::

    Does the paper's conclusion hold better or worse if/when the
    core ISA contains both LDM/STM and MM instructions. LDM/STM
    allow for several sequential registers to move to/from MMI/O
    memory in a single interconnect transaction, while MM allows
    for up-to page-sized transfers in a single instruction and
    only 2 interconnect transactions.

    The article discusses using the Intel 64-byte store
    instructions, if I recall correctly. ARM also has
    a 64-byte store in the latest versions of the ISA.

    They also mentioned that vector instructions weren't helpful
    in the Thunder X1 case, due to internal 128-bit bus limitations.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Theo@21:1/5 to mitchalsup@aol.com on Sun Apr 27 20:13:47 2025
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:

    However, I expect there are still benefits in using DMA for bulk data transfer, particularly for network packet handling where
    throughput is more interesting than PCI MMIO latency.

    I would like to add a thought to the concept under discussion::

    Does the paper's conclusion hold better or worse if/when the
    core ISA contains both LDM/STM and MM instructions. LDM/STM
    allow for several sequential registers to move to/from MMI/O
    memory in a single interconnect transaction, while MM allows
    for up-to page-sized transfers in a single instruction and
    only 2 interconnect transactions.

    I think this depends on the scale of your core. For say a NIC <-> CPU,
    maybe the CPU has MM instructions, but perhaps the microcontroller on the
    NIC doesn't. That means eg the CPU can push packets to transmit, but the
    NIC is not set up to push packets it has received - it has to ask the CPU to pull, which will slow things down.

    You can add that feature of course, but then isn't it just becoming a DMA engine?

    ie it's about control path and datapath. A controller doesn't need a wide datapath (it isn't doing much compute) but the data transfer does need a
    wide datapath. If you size a CPU for a wide datapath then you end up
    paying costs for that (eg wide GP registers when you don't need them).

    One concern that arises from the paper is the security
    implications of device access to the cache coherency
    protocol. Not an issue for a well-behaved device, but
    potentially problematic in a secure environment with
    third-party CXL-mem devices.

    Citation please !?!

    CXL's protection model isn't very good: https://dl.acm.org/doi/pdf/10.1145/3617580
    (declaration: I'm a coauthor)

    Also note:: device DMA goes through the I/O MMU, which adds a
    modicum of security-fencing around device DMA accesses
    but also adds latency.

    Indeed, and page-based lookups are both slow (if you miss in the IOTLB) and have spatial and temporal security issues.

    Most modern CPUs support "allocate" hints on inbound DMA
    that will automatically place the data in the right CPU cache as
    quickly as possible.

    Decomposing that packet transfer into CPU loads and stores
    in a coherent fabric doesn't gain much, and burns more power
    on the "device" than a DMA engine.

    That was my initial thought--core performing lots of LD/ST to
    MMI/O is bound to consume more power than device DMA.

    Secondarily, using 1-few cores to perform PIO is not going to
    have the data land in the cache of the core that will run when
    the data has been transferred. The data lands in the cache doing
    PIO and not in the one to receive control after I/O is done.
    {{It may still be "closer than" memory--but several cache
    coherence protocols take longer cache-cache than dram-cache.}}

    I think this is an 'it depends'. If you're doing RPC type operations, it
    takes more work to warm up the DMA than it does to just do PIO. If you're
    an SSD pulling a large file from flash, DMA is more efficient. If you're moving network packets, which involve multiple scatter-gathers per packet,
    then maybe some heavy lifting is useful for the address handling.

    It was also the only ARM64 processor chip we built with a cache-coherent interconnect until the recent CXL based products.

    Overall, a very interesting paper.

    Reminds me of trying to sell a micro x86-64 to AMD as a project.
    The µ86 is a small x86-64 core made available as IP in Verilog
    where it has/runs the same ISA as main GBOoO x86, but is placed
    "out in the PCIe" interconnect--performing I/O services topo-
    logically adjacent to the device itself. This allows 1ns access
    latencies to DCRs and performing OS queueing of DPCs,... without
    bothering the GBOoO cores.

    AMD didn't buy the arguments.

    Intel tried that with the Quark line of 'microcontrollers', which appeared
    to be a warmed over P54 Pentium (whether it shared microarchitecture or RTL
    I'm not sure). They were too power hungry and unwieldy to be
    microcontrollers - they also couldn't run Debian/x86 despite having an MMU because they were too old for the LOCK CMPXCHG instruction Debian used (P54 didn't need to worry about concurrency, but we do now).

    I think at the end of the day there isn't actually a whole lot of benefit to running the same ISA on your I/O as on your CPU - there tends to be a fairly hard line between 'drivers' (on the CPU) and 'firmware' (on the device), and
    on the firmware side it's easier to throw in a small RISC (eg RISC-V
    nowadays) than anything complicated.

    Theo

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Theo@21:1/5 to Lars Poulsen on Sun Apr 27 19:35:08 2025
    Lars Poulsen <lars@cleo.beagle-ears.com> wrote:
    What is the difference between DMA and message-passing to another core
    doing a CMOV loop at the ISA level?

    DMA means doing that in the micro-engine instead of at the ISA level.
    Same difference.

    What am I missing?

    Width and specialisation.

    You can absolutely write a DMA engine in software. One thing that is troublesome is that the CPU datapath might be a lot narrower than the number
    of bits you can move in a single cycle. eg on FPGA we can't clock logic anywhere near the DRAM clock so we end up making a very wide memory bus that runs at a lower clock - 512/1024/2048/... bits wide. You can do that in a regular ISA using vector registers/instructions but it adds complexity you don't need.
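
    For concreteness, a minimal software "DMA" inner loop (assumes x86-64
    with AVX-512F, 64-byte-aligned buffers, and a length that is a multiple
    of 64; nothing here is specific to any real device): even at its widest
    the core moves 64 bytes per instruction, while the memory-side bus being
    described may be 512-2048 bits per clock.

    #include <immintrin.h>
    #include <stddef.h>

    void soft_dma_copy(void *dst, const void *src, size_t bytes)
    {
        for (size_t off = 0; off < bytes; off += 64) {
            __m512i v = _mm512_load_si512((const char *)src + off);
            /* Streaming store: bypass the cache, as a DMA write would. */
            _mm512_stream_si512((__m512i *)((char *)dst + off), v);
        }
        _mm_sfence();    /* make the streaming stores globally visible */
    }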

    The other is that there's often some degree of marshalling that needs to
    happen - reading scatter/gather lists, formatting packets the right way for PCIe, filling in the right header fields, etc. It's more efficient to do
    that in hardware than it is to spend multiple instructions per packet doing
    it. Meanwhile the DRAM bandwidth is being wasted.

    Of course you can customise your ISA with extra instructions for doing the heavy lifting, but then arguably it's not really a CPU any more, it's a 'programmable DMA engine'. The line between the two becomes very blurred.

    What you might also have is a hard datapath that's orchestrated by a tightly-coupled microcontroller. Then you get the ability to have an engine with good performance while being able to more flexibly program it, without having to make a strange vector CPU.

    It's easy to make 'a' DMA, but you want to push the maximum bandwidth that
    the memory/interconnect can achieve and all the tricks are about getting
    there.

    Theo

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Theo on Sun Apr 27 20:49:50 2025
    On Sun, 27 Apr 2025 18:35:08 +0000, Theo wrote:

    Lars Poulsen <lars@cleo.beagle-ears.com> wrote:
    What is the difference between DMA and message-passing to another core
    doing a CMOV loop at the ISA level?

    DMA means doing that in the micro-engine instead of at the ISA level.
    Same difference.

    What am I missing?

    Width and specialisation.

    You can absolutely write a DMA engine in software. One thing that is
    troublesome is that the CPU datapath might be a lot narrower than the
    number of bits you can move in a single cycle. eg on FPGA we can't
    clock logic anywhere near the DRAM clock so we end up making a very
    wide memory bus that runs at a lower clock - 512/1024/2048/... bits
    wide. You can do that in a regular ISA using vector
    registers/instructions but it adds complexity you don't need.

    With anything at 7nm or smaller, the main core interconnect should be
    1 cache line wide (512 bits = 64 bytes :: although IBM's choice of 256
    byte cache lines might be troublesome for now.)

    The other is that there's often some degree of marshalling that needs
    to happen - reading scatter/gather lists, formatting packets the right
    way for PCIe, filling in the right header fields, etc. It's more
    efficient to do that in hardware than it is to spend multiple
    instructions per packet doing it. Meanwhile the DRAM bandwidth is
    being wasted.

    SW is nadda-verrryyy guud at twiddling bits like HW is.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Theo on Sun Apr 27 20:45:32 2025
    On Sun, 27 Apr 2025 19:13:47 +0000, Theo wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:
    --------------
    One concern that arises from the paper is the security
    implications of device access to the cache coherency
    protocol. Not an issue for a well-behaved device, but
    potentially problematic in a secure environment with
    third-party CXL-mem devices.

    Citation please !?!

    CXL's protection model isn't very good: https://dl.acm.org/doi/pdf/10.1145/3617580
    (declaration: I'm a coauthor)

    Thanks for the URL !!

    Secondarily, using 1-few cores to perform PIO is not going to
    have the data land in the cache of the core that will run when
    the data has been transferred. The data lands in the cache doing
    PIO and not in the one to receive control after I/O is done.
    {{It may still be "closer than" memory--but several cache
    coherence protocols take longer cache-cache than dram-cache.}}

    I think this is an 'it depends'. If you're doing RPC type operations,
    it takes more work to warm up the DMA than it does to just do PIO.

    Yes, it takes more cycles for a CPU to tell a device to move memory
    from here to there than it takes the CPU to just move the memory from
    here to there. I was, instead, referring to a CPU that has an
    MM (move memory to memory) instruction, where the instruction is
    allowed to be sent over the interconnect (say CRM controller) and
    have DRC perform the M2M movement locally.

    That is:: all major blocks in the system have their own DMA sequencer.

    If you're an SSD pulling a large file from flash, DMA is more
    efficient. If you're moving network packets, which involve multiple
    scatter-gathers per packet, then maybe some heavy lifting is useful
    for the address handling.

    It was also the only ARM64 processor chip we built with a cache-coherent
    interconnect until the recent CXL based products.

    Overall, a very interesting paper.

    Reminds me of trying to sell a micro x86-64 to AMD as a project.
    The µ86 is a small x86-64 core made available as IP in Verilog
    where it has/runs the same ISA as main GBOoO x86, but is placed
    "out in the PCIe" interconnect--performing I/O services topo-
    logically adjacent to the device itself. This allows 1ns access
    latencies to DCRs and performing OS queueing of DPCs,... without
    bothering the GBOoO cores.

    AMD didn't buy the arguments.

    Intel tried that with the Quark line of 'microcontrollers', which
    appeared to be a warmed over P54 Pentium (whether it shared
    microarchitecture or RTL I'm not sure).

    In my case, the LBIO core ran exactly the same ISA as the big central
    cores. If the remote cores were present, they would field the interrupt,
    access the device, schedule further cleanup work, and then kick the main
    cores in their side.

    In order to be viable, the same OS SW has to run whether the LBIO cores
    are present or not.

    They were too power hungry and unwieldy to be microcontrollers - they also couldn't run Debian/x86 despite having an
    MMU because they were too old for the LOCK CMPXCHG instruction Debian
    used (P54 didn't need to worry about concurrency, but we do now).

    It is not generally known, but back in ~2006 when Opteron was in full
    swing, the HT fabric to the SouthBridge was actually coherent, we just
    did not publish the coherence spec and thus devices could not use it.
    But it was present, and if someone happened to know the protocol they
    could have used the coherent nature of it.

    My LBIO core would have made use of that.

    And unlike P54, it was being designed for low power operations, and
    it was going to use Opteron building blocks to simplify bug-for-bug
    compatibility.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sun Apr 27 22:44:06 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 27 Apr 2025 19:13:47 +0000, Theo wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:

    I think this is an 'it depends'. If you're doing RPC type operations,
    it takes more work to warm up the DMA than it does to just do PIO.

    Yes, it takes more cycles for a CPU to tell a device to move memory
    from here to there than it takes the CPU to just move the memory from
    here to there. I was, instead, referring to a CPU that has an
    MM (move memory to memory) instruction, where the instruction is
    allowed to be sent over the interconnect (say CRM controller) and
    have DRC perform the M2M movement locally.

    If the instruction is not asynchronous, then the differences
    between a load-store loop and the MM instruction aren't
    significant. If MM is asynchronous, that complicates the
    kernel-user interface.


    They were too power hungry and unwieldy to be
    microcontrollers - they also couldn't run Debian/x86 despite having an
    MMU because they were too old for the LOCK CMPXCHG instruction Debian
    used (P54 didn't need to worry about concurrency, but we do now).

    It is not generally known, but back in ~2006 when Opteron was in full
    swing, the HT fabric to the SouthBridge was actually coherent, we just
    did not publish the coherence spec and thus devices could not use it.
    But it was present, and if someone happened to know the protocol they
    could have used the coherent nature of it.

    We had the coherent HT spec in 2005 and used it to design our
    ASIC that extended the coherency domain over InfiniBand. Taped
    out in 2008. Connected two Istanbul CPUs (IIRC) to our ASIC via
    HT and connected that mainboard to an IB fabric. Supported 16 such hosts
    in a single-system image with fully coherent distributed memory
    using a Mellanox DDR IB switch.

    The AMD CTO at the time was an advisor to our startup.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sun Apr 27 22:37:48 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 27 Apr 2025 18:35:08 +0000, Theo wrote:

    Lars Poulsen <lars@cleo.beagle-ears.com> wrote:
    What is the difference between DMA and message-passing to another core
    doing a CMOV loop at the ISA level?

    DMA means doing that in the micro-engine instead of at the ISA level.
    Same difference.

    What am I missing?

    Width and specialisation.

    You can absolutely write a DMA engine in software. One thing that is
    troublesome is that the CPU datapath might be a lot narrower than the
    number of bits you can move in a single cycle. eg on FPGA we can't
    clock logic anywhere near the DRAM clock so we end up making a very
    wide memory bus that runs at a lower clock - 512/1024/2048/... bits
    wide. You can do that in a regular ISA using vector
    registers/instructions but it adds complexity you don't need.

    With anything at 7nm or smaller, the main core interconnect should be
    1 cache line wide (512 bits = 64 bytes :: although IBM's choice of 256
    byte cache lines might be troublesome for now.)

    The other is that there's often some degree of marshalling that needs
    to happen - reading scatter/gather lists, formatting packets the right
    way for PCIe, filling in the right header fields, etc. It's more
    efficient to do that in hardware than it is to spend multiple
    instructions per packet doing it. Meanwhile the DRAM bandwidth is
    being wasted.

    SW is nadda-verrryyy guud at twiddling bits like HW is.

    IME, the use of content-addressable memory[*] is a key component required
    for efficient packet processing, and that can only be done in hardware.

    [*] For RSS and per-flow serialization et alia.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Theo on Mon Apr 28 01:20:04 2025
    On 27 Apr 2025 19:35:08 +0100 (BST), Theo wrote:

    Of course you can customise your ISA with extra instructions for doing
    the heavy lifting, but then arguably it's not really a CPU any more,
    it's a 'programmable DMA engine'. The line between the two becomes very blurred.

    In the old mainframe world, they called that an “I/O channel”. See also
    the RP2040 chip from the Raspberry Pi Foundation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Thu May 1 13:07:07 2025
    In article <da5b3dea460370fc1fe8ad2323da9bc4@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:
    [snip]
    Reminds me of trying to sell a micro x86-64 to AMD as a project.
    The µ86 is a small x86-64 core made available as IP in Verilog
    where it has/runs the same ISA as main GBOoO x86, but is placed
    "out in the PCIe" interconnect--performing I/O services topo-
    logically adjacent to the device itself. This allows 1ns access
    latencies to DCRs and performing OS queueing of DPCs,... without
    bothering the GBOoO cores.

    AMD didn't buy the arguments.

    I can see it either way; I suppose the argument as to whether I
    buy it or not comes down to "it depends". How much control do
    I, as the OS implementer, have over this core?

    If it is yet another hidden core embedded somewhere deep in the
    SoC complex and I can't easily interact with it from the OS,
    then no thanks: we've got enough of those between MP0, MP1, MP5,
    etc, etc.

    On the other hand, if it's got a "normal" APIC ID, the OS has
    control over it like any other LP, and it's coherent with the big
    cores, then yeah, sign me up: I've been wanting something like
    that for a long time now.

    Consider a virtualization application. A problem with, say,
    SR-IOV is that very often the hypervisor wants to interpose some
    sort of administrative policy between the virtual function and
    whatever it actually corresponds to, but get out of the fast
    path for most IO. This implies a kind of offload architecture
    where there's some (presumably software) agent dedicated to
    handling IO that can be parameterized with such a policy. A
    core very close to the device could handle that swimmingly,
    though I'm not sure it would be enough to do it at (say) line
    rate for a 400Gbps NIC or Gen5 NVMe device.

    ...but why x86_64? It strikes me that as long as the _data_
    formats vis the software-visible ABI are the same, it doesn't
    need to use the same ISA. In fact, I can see advantages to not
    doing so.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Thu May 1 22:03:08 2025
    On Thu, 1 May 2025 13:07:07 +0000, Dan Cross wrote:

    In article <da5b3dea460370fc1fe8ad2323da9bc4@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:
    [snip]
    Reminds me of trying to sell a micro x86-64 to AMD as a project.
    The µ86 is a small x86-64 core made available as IP in Verilog
    where it has/runs the same ISA as main GBOoO x86, but is placed
    "out in the PCIe" interconnect--performing I/O services topo-
    logically adjacent to the device itself. This allows 1ns access
    latencies to DCRs and performing OS queueing of DPCs,... without
    bothering the GBOoO cores.

    AMD didn't buy the arguments.

    I can see it either way; I suppose the argument as to whether I
    buy it or not comes down to "it depends". How much control do
    I, as the OS implementer, have over this core?

    Other than it being placed "away" from the centralized cores,
    it runs the same ISA as the main cores, has longer latency to
    coherent memory, and shorter latency to device control registers
    --which is why it is placed close to the device itself:: latency.
    The big fast centralized core is going to get microsecond latency
    from an MMI/O device whereas the ASIC version will have a handful
    of nanosecond latencies. So the 5 GHz core sees ~1 microsecond
    while the little ASIC sees 10 nanoseconds. ...

    If it is yet another hidden core embedded somewhere deep in the
    SoC complex and I can't easily interact with it from the OS,
    then no thanks: we've got enough of those between MP0, MP1, MP5,
    etc, etc.

    On the other hand, if it's got a "normal" APIC ID, the OS has
    control over it like any other LP, and it's coherent with the big
    cores, then yeah, sign me up: I've been wanting something like
    that for a long time now.

    It is just a core that is cheap enough to put in ASICs, that
    can offload some I/O burden without you having to do anything
    other than setting some bits in some CRs so interrupts are
    routed to this core rather than some more centralized core.

    Consider a virtualization application. A problem with, say,
    SR-IOV is that very often the hypervisor wants to interpose some
    sort of administrative policy between the virtual function and
    whatever it actually corresponds to, but get out of the fast
    path for most IO. This implies a kind of offload architecture
    where there's some (presumably software) agent dedicated to
    handling IO that can be parameterized with such a policy. A

    Interesting:: Could you cite any literature, here !?!

    core very close to the device could handle that swimmingly,
    though I'm not sure it would be enough to do it at (say) line
    rate for a 400Gbps NIC or Gen5 NVMe device.

    I suspect the 400 Gb/s NIC needs a rather BIG core to handle the
    traffic loads.

    ....but why x86_64? It strikes me that as long as the _data_
    formats vis the software-visible ABI are the same, it doesn't
    need to use the same ISA. In fact, I can see advantages to not
    doing so.

    Having the remote core run the same OS code as every other core
    means the OS developers have fewer hoops to jump through. Bug-for
    bug compatibility means that clearing of those CRs just leaves
    the core out in the periphery idling and bothering no one.

    On the other hand, you buy a motherboard with said ASIC core,
    and you can boot the MB without putting a big chip in the
    socket--but you may have to deal with scant DRAM since the
    big centralized chip contains the memory controller.


    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Fri May 2 02:15:24 2025
    In article <5a77c46910dd2100886ce6fc44c4c460@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 1 May 2025 13:07:07 +0000, Dan Cross wrote:
    In article <da5b3dea460370fc1fe8ad2323da9bc4@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:
    [snip]
    Reminds me of trying to sell a micro x86-64 to AMD as a project.
    The µ86 is a small x86-64 core made available as IP in Verilog
    where it has/runs the same ISA as main GBOoO x86, but is placed
    "out in the PCIe" interconnect--performing I/O services topo-
    logically adjacent to the device itself. This allows 1ns access
    latencies to DCRs and performing OS queueing of DPCs,... without
    bothering the GBOoO cores.

    AMD didn't buy the arguments.

    I can see it either way; I suppose the argument as to whether I
    buy it or not comes down to "it depends". How much control do
    I, as the OS implementer, have over this core?

    Other than it being placed "away" from the centralized cores,
    it runs the same ISA as the main cores, has longer latency to
    coherent memory, and shorter latency to device control registers
    --which is why it is placed close to the device itself:: latency.
    The big fast centralized core is going to get microsecond latency
    from an MMI/O device whereas the ASIC version will have a handful
    of nanosecond latencies. So the 5 GHz core sees ~1 microsecond
    while the little ASIC sees 10 nanoseconds. ...

    Yes, I get the argument for WHY you'd do it, I just want to make
    sure that it's an ordinary core (albeit one that is far away
    from the sockets with the main SoC complexes) that I interact
    with in the usual manner. Compare to, say, MP1 or MP0 on AMD
    Zen, where it runs its own (proprietary) firmware that I
    interact with via an RPC protocol over an AXI bus, if I interact
    with it at all: most OEMs just punt and run AGESA (we don't).

    If it is yet another hidden core embedded somewhere deep in the
    SoC complex and I can't easily interact with it from the OS,
    then no thanks: we've got enough of those between MP0, MP1, MP5,
    etc, etc.

    On the other hand, if it's got a "normal" APIC ID, the OS has
    control over it like any other LP, and it's coherent with the big
    cores, then yeah, sign me up: I've been wanting something like
    that for a long time now.

    It is just a core that is cheap enough to put in ASICs, that
    can offload some I/O burden without you having to do anything
    other than setting some bits in some CRs so interrupts are
    routed to this core rather than some more centralized core.

    Sounds good.

    Consider a virtualization application. A problem with, say,
    SR-IOV is that very often the hypervisor wants to interpose some
    sort of administrative policy between the virtual function and
    whatever it actually corresponds to, but get out of the fast
    path for most IO. This implies a kind of offload architecture
    where there's some (presumably software) agent dedicated to
    handling IO that can be parameterized with such a policy. A

    Interesting:: Could you cite any literature, here !?!

    Sure. This paper is a bit older, but gets at the main points: https://www.usenix.org/system/files/conference/nsdi18/nsdi18-firestone.pdf

    I don't know if the details are public for similar technologies
    from Amazon or Google.

    core very close to the device could handle that swimmingly,
    though I'm not sure it would be enough to do it at (say) line
    rate for a 400Gbps NIC or Gen5 NVMe device.

    I suspect the 400 Gb/s NIC needs a rather BIG core to handle the
    traffic loads.

    Indeed. Part of the challenge for the hyperscalers is in
    meeting that demand while not burning too many host resources,
    which are the thing they're actually selling to their customers in
    the first place. A lot of folks are pushing this off to the NIC
    itself, and I've seen at least one team that implemented NVMe in
    firmware on a 100Gbps NIC, exposed via SR-IOV, as part of a
    disaggregated storage architecture.

    Another option is to push this to the switch; things like Intel
    Tofino2 were well-positioned for this, but of course Intel, in its
    infinite wisdom and vision, canc'ed Tofino.

    ....but why x86_64? It strikes me that as long as the _data_
    formats vis the software-visible ABI are the same, it doesn't
    need to use the same ISA. In fact, I can see advantages to not
    doing so.

    Having the remote core run the same OS code as every other core
    means the OS developers have fewer hoops to jump through. Bug-for
    bug compatibility means that clearing of those CRs just leaves
    the core out in the periphery idling and bothering no one.

    Eh...Having to jump through hoops here matters less to me for
    this kind of use case than if I'm trying to use those cores for
    general-purpose compute. Having a separate ISA means I cannot
    accidentally run a program meant only for the big cores on the
    IO service processors. As long as the OS has total control over
    the execution of the core, and it participates in whatever cache
    coherency scheme the rest of the system uses, then the ISA just
    isn't that important.

    On the other hand, you buy a motherboard with said ASIC core,
    and you can boot the MB without putting a big chip in the
    socket--but you may have to deal with scant DRAM since the
    big centralized chip contains the memory controller.

    A neat hack for bragging rights, but not terribly practical?

    Anyway, it's a neat idea. It's very reminiscent of IBM channel
    controllers, in a way.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Dan Cross on Fri May 2 05:34:50 2025
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <5a77c46910dd2100886ce6fc44c4c460@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    Other than it being placed "away" from the centralized cores,
    it runs the same ISA as the main cores, has longer latency to
    coherent memory, and shorter latency to device control registers
    --which is why it is placed close to the device itself:: latency.
    The big fast centralized core is going to get microsecond latency
    from an MMI/O device whereas the ASIC version will have a handful
    of nanosecond latencies. So the 5 GHz core sees ~1 microsecond
    while the little ASIC sees 10 nanoseconds. ...

    Yes, I get the argument for WHY you'd do it, I just want to make
    sure that it's an ordinary core (albeit one that is far away
    from the sockets with the main SoC complexes) that I interact
    with in the usual manner.

    Intel has put 2 Crestmont cores (their then-current E-core, not at all
    tiny) on the SoC tile (not the compute tile) of Meteor Lake. The main
    idea there seems to be to save power (Meteor Lake is a laptop CPU)
    when doing low-load things like playing videos by keeping the compute
    tile powered down.

    core very close to the device could handle that swimmingly,
    though I'm not sure it would be enough to do it at (say) line
    rate for a 400Gbps NIC or Gen5 NVMe device.

    I suspect the 400 Gb/s NIC needs a rather BIG core to handle the
    traffic loads.

    Looking at
    https://chipsandcheese.com/p/arms-cortex-a53-tiny-but-important, a
    Cortex-A53 would not be up to it (at 1896MHz it can read <12GB/s and
    write <18GB/s even to the L1 cache). However, Chester Lam notes: "A53
    offers very low cache bandwidth compared to pretty much any other core
    we’ve analyzed." I think, though, that a small in-order core like the
    A53, but with enough load and store buffering and enough bandwidth to
    I/O and the memory controller should not have a problem shoveling data
    from or to a 400Gb/s NIC. With 128 bits/cycle in each direction one
    would need one transfer per cycle in each direction at 3125MHz to
    achieve 400Gb/s, or maybe 4GHz for a dual-issue core to allow for loop overhead. Given that the A53 typically only has 2GHz, supporting 256 bits/cycle of transfer width (for load and store instructions, i.e.,
    along the lines of AVX-256) would be more appropriate.
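
    Spelled out (assuming raw line rate, no protocol overhead, and one
    128-bit load or store per transfer):

    #include <stdio.h>

    int main(void)
    {
        double bytes_per_s = 400e9 / 8.0;    /* 400 Gb/s = 50 GB/s      */
        double per_xfer    = 128.0 / 8.0;    /* 128 bits = 16 bytes     */
        /* 50e9 / 16 = 3.125e9 transfers/s, i.e. one per cycle at 3125MHz */
        printf("%.0f MHz\n", bytes_per_s / per_xfer / 1e6);
        return 0;
    }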

    Going for an OoO core (something like AMD's Bobcat or Intel's
    Silvermont) would help achieve the bandwidth goals without excessive fine-tuning of the software.

    Having the remote core run the same OS code as every other core
    means the OS developers have fewer hoops to jump through. Bug-for
    bug compatibility means that clearing of those CRs just leaves
    the core out in the periphery idling and bothering no one.

    Eh...Having to jump through hoops here matters less to me for
    this kind of use case than if I'm trying to use those cores for
    general-purpose compute.

    I think it's the same thing as Greenspun's tenth rule: First you find
    that a classical DMA engine is too limiting, then you find that an A53
    is too limiting, and eventually you find that it would be practical to
    run the ISA of the main cores. In particular, it allows you to use
    the toolchain of the main cores for developing them, and you can also
    use the facilities of the main cores (e.g., debugging features that
    may be absent from the I/O cores) during development.

    Having a separate ISA means I cannot
    accidentally run a program meant only for the big cores on the
    IO service processors.

    Marking the binaries that should be able to run on the IO service
    processors with some flag, and letting the component of the OS that
    assigns processes to cores heed this flag is not rocket science. You
    probably also don't want to run programs for the I/O processors on the
    main cores; whether you use a separate flag for indicating that, or
    whether one flag indicates both is an interesting question.
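
    As a sketch of what that could look like (purely illustrative: the
    flag names and structures below are invented for this example, not
    any existing OS's mechanism):

        #include <stdbool.h>
        #include <stdint.h>

        /* Hypothetical per-binary flags, e.g. recorded in an ELF note or a
           field of the loaded-image descriptor. Names are made up. */
        #define IMG_FLAG_IO_CORE_OK   (1u << 0)  /* may run on I/O service cores */
        #define IMG_FLAG_MAIN_CORE_OK (1u << 1)  /* may run on the big cores     */

        struct image { uint32_t flags; };
        struct cpu   { bool is_io_service_core; };

        /* Scheduler-side check: place a task on a core only if the image it
           was loaded from is flagged for that class of core. */
        static bool placement_allowed(const struct image *img, const struct cpu *c)
        {
            if (c->is_io_service_core)
                return img->flags & IMG_FLAG_IO_CORE_OK;
            return img->flags & IMG_FLAG_MAIN_CORE_OK;
        }

        int main(void)
        {
            struct image io_helper = { .flags = IMG_FLAG_IO_CORE_OK };
            struct cpu   io_core   = { .is_io_service_core = true  };
            struct cpu   big_core  = { .is_io_service_core = false };
            return placement_allowed(&io_helper, &io_core) &&
                  !placement_allowed(&io_helper, &big_core) ? 0 : 1;
        }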

    On the other hand, you buy a motherboard with said ASIC core,
    and you can boot the MB without putting a big chip in the
    socket--but you may have to deal with scant DRAM since the
    big centralized chip contains the memory controller.

    A neat hack for bragging rights, but not terribly practical?

    Very practical for updating the firmware of the board to support the
    big chip you want to put in the socket (called "BIOS FlashBack" in
    connection with AMD big chips). In a case where we did not have that
    feature, and the board did not support the CPU, we had to buy another
    CPU to update the firmware <https://www.complang.tuwien.ac.at/anton/asus-p10s-c4l.html>. That's especially relevant for AM4 boards, because the support chips make it
    hard to use more than 16MB Flash for firmware, but the firmware for
    all supported big chips does not fit into 16MB. However, as the case
    mentioned above shows, it's also relevant for Intel boards.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to Anton Ertl on Fri May 2 15:02:35 2025
    In article <2025May2.073450@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <5a77c46910dd2100886ce6fc44c4c460@www.novabbs.org>,
    [snip]
    I suspect the 400 Gb/s NIC needs a rather BIG core to handle the
    traffic loads.

    Looking at
    https://chipsandcheese.com/p/arms-cortex-a53-tiny-but-important, a
    Cortex-A53 would not be up to it (at 1896MHz it can read <12GB/s and
    write <18GB/s even to the L1 cache). However, Chester Lam notes: "A53
    offers very low cache bandwidth compared to pretty much any other core we’ve analyzed." I think, though, that a small in-order core like the
    A53, but with enough load and store buffering and enough bandwidth to
    I/O and the memory controller should not have a problem shoveling data
    from or to a 400Gb/s NIC. With 128 bits/cycle in each direction one
    would need one transfer per cycle in each direction at 3125MHz to
    achieve 400Gb/s, or maybe 4GHz for a dual-issue core to allow for loop overhead. Given that the A53 typically only has 2GHz, supporting 256 bits/cycle of transfer width (for load and store instructions, i.e.,
    along the lines of AVX-256) would be more appropriate.

    Going for an OoO core (something like AMD's Bobcat or Intel's
    Silvermont) would help achieve the bandwidth goals without excessive fine-tuning of the software.

    Having the remote core run the same OS code as every other core
    means the OS developers have fewer hoops to jump through. Bug-for
    bug compatibility means that clearing of those CRs just leaves
    the core out in the periphery idling and bothering no one.

    Eh...Having to jump through hoops here matters less to me for
    this kind of use case than if I'm trying to use those cores for general-purpose compute.

    I think it's the same thing as Greenspun's tenth rule: First you find
    that a classical DMA engine is too limiting, then you find that an A53
    is too limiting, and eventually you find that it would be practical to
    run the ISA of the main cores. In particular, it allows you to use
    the toolchain of the main cores for developing them,

    These are issues solvable with the software architecture and
    build system for the host OS. The important characteristic is
    that the software coupling makes architectural sense, and that
    simply does not require using the same ISA across IPs.

    Indeed, consider AMD's Zen CPUs; the PSP/ASP/whatever it's
    called these days is an ARM core while the big CPUs are x86.
    I'm pretty sure there's an Xtensa DSP in there to do DRAM and
    timing and PCIe link training. Similarly with the ME on Intel.
    A BMC might be running on whatever. We increasingly see ARM
    based SBCs that have small RISC-V microcontroller-class cores
    embedded in the SoC for exactly this sort of thing.

    At work, our service processor (granted, outside of the SoC but
    tightly coupled at the board level) is a Cortex-M7, but we wrote
    the OS for that, and we control the host OS that runs on x86,
    so the SP and big CPUs can be mutually aware. Our hardware RoT
    is a smaller Cortex-M. We don't have a BMC on our boards;
    everything that it does is either done by the SP or built into
    the host OS, both of which are measured by the RoT.

    The problem is when such service cores are hidden (as they are
    in the case of the PSP, SMU, MPIO, and similar components, to
    use AMD as the example) and treated like black boxes by
    software. It's really cool that I can configure the IO crossbar
    in useful ways tailored to specific configurations, but it's much
    less cool that I have to do what amounts to an RPC over the SMN
    to some totally undocumented entity somewhere in the SoC to do
    it. Bluntly, as an OS person, I do not want random bits of code
    running anywhere on my machine that I am not at least aware of
    (yes, this includes firmware blobs on devices).

    and you can also
    use the facilities of the main cores (e.g., debugging features that
    may be absent of the I/O cores) during development.

    This is interesting, but we've found it more useful going the
    other way around. We do most of our debugging via the SP.
    Since the SP is also responsible for system initialization and
    holding x86 in reset until we're ready for it to start
    running, it's the obvious nexus for debugging the system
    holistically.

    I must admit that, since we design our own boards, we have
    options here that those buying from the consumer space or
    traditional enterprise vendors don't, but that's one of the
    considerable value-adds for hardware/software co-design.

    Having a separate ISA means I cannot
    accidentally run a program meant only for the big cores on the
    IO service processors.

    Marking the binaries that should be able to run on the IO service
    processors with some flag, and letting the component of the OS that
    assigns processes to cores heed this flag is not rocket science.

    I agree, that's easy. And yet, mistakes will be made, and there
    will be tension between wanting to dedicate those CPUs to IO
    services and wanting to use them for GP programs: I can easily
    imagine a paper where someone modifies a scheduler to move IO
    bound programs to those cores. Using a different ISA obviates
    most of that, and provides an (admittedly modest) security benefit.

    And if I already have to modify or configure the OS to
    accommodate the existence of these things in the first place,
    then accommodating an ISA difference really isn't that much
    extra work. The critical observation is that a typical SMP view
    of the world no longer makes sense for the system architecture,
    and trying to shoehorn that model onto the hardware reality is
    just going to cause frustration. Better to acknowledge that the

    You
    probably also don't want to run programs for the I/O processors on the
    main cores; whether you use a separate flag for indicating that, or
    whether one flag indicates both is an interesting question.

    On the other hand, you buy a motherboard with said ASIC core,
    and you can boot the MB without putting a big chip in the
    socket--but you may have to deal with scant DRAM since the
    big centralized chip contains teh memory controller.

    A neat hack for bragging rights, but not terribly practical?

    Very practical for updating the firmware of the board to support the
    big chip you want to put in the socket (called "BIOS FlashBack" in
    connection with AMD big chips).

    "BIOS", as loaded from the EFS by the ABL on the PSP on EPYC
    class chips, is usually stored in a QSPI flash on the main
    board (though starting with Turin you _can_ boot via eSPI).
    Strictly speaking, you don't _need_ an x86 core to rewrite that.
    On our machines, we do that from the SP, but we don't use AGESA
    or UEFI: all of the platform enablement stuff done in PEI and
    DXE we do directly in the host OS.

    Also, on AMD machines, again considering EPYC, it's up to system
    software running on x86 to direct either the SMU or MPIO to
    configure DXIO and the rest of the fabric before PCIe link
    training even begins (releasing PCIe from PERST is done by
    either the SMU or MPIO, depending on the specific
    microarchitecture). Where are these cores, again? If they're
    close to the devices, are they in the root complex or on the far
    side of a bridge? Can they even talk to the rest of the board?

    Also, since this is x86, there's the issue of starting them and
    getting them to run useful software. Usually on x86 it's the
    responsibility of the BSC to start APs (AGESA usually does CCX
    initialization and starts all the threads and does APIC ID
    assignment and so on, but then directs them to park and wait for
    the OS to do the usual INIT/SIPI/SIPI dance); but if the BSC is
    absent because the socket is unpopulated, what starts them? And
    what software are they running? Again, it's not even clear that
    they have access to QSPI to boot into e.g. AGESA; if they've got
    some little local ROM or flash or something, then how does the
    OS get control of them?
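
    (For reference, the "usual INIT/SIPI/SIPI dance" is roughly the
    following, sketched for the classic xAPIC case. The LAPIC base
    address, the trampoline vector, and the crude delay loop are
    assumptions for illustration, not how any particular firmware does
    it.)

        #include <stdint.h>

        /* Classic xAPIC MMIO layout; assumes the local APIC at its
           conventional physical base, identity-mapped and uncached. */
        #define LAPIC_BASE  0xFEE00000u
        #define ICR_LOW     0x300
        #define ICR_HIGH    0x310

        static volatile uint32_t *const lapic =
            (volatile uint32_t *)(uintptr_t)LAPIC_BASE;

        static void lapic_write(uint32_t reg, uint32_t val)
        {
            lapic[reg / 4] = val;
        }

        static void delay_us(unsigned us)
        {
            /* Placeholder; real code would use a calibrated timer. */
            for (volatile unsigned i = 0; i < us * 100u; i++)
                ;
        }

        /* Wake one AP: INIT, then two STARTUP IPIs whose vector encodes the
           4KiB-aligned physical address of the AP trampoline
           (vector 0x08 -> trampoline at physical 0x8000). */
        static void start_ap(uint8_t apic_id, uint8_t sipi_vector)
        {
            lapic_write(ICR_HIGH, (uint32_t)apic_id << 24);
            lapic_write(ICR_LOW, 0x00004500u);        /* INIT IPI */
            delay_us(10000);                          /* ~10 ms   */

            for (int i = 0; i < 2; i++) {
                lapic_write(ICR_HIGH, (uint32_t)apic_id << 24);
                lapic_write(ICR_LOW, 0x00004600u | sipi_vector); /* STARTUP */
                delay_us(200);
            }
        }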

    Perhaps there's some kind of electrical interlock that brings
    them up if the socket is empty, but one must answer the question
    of what's responsible before assuming you can use them, and it
    seems like the _best_ course of action would be to leave them
    in reset (or even powered off) until explicitly enabled by
    software, probably via a write to some magic capability in
    config space on a bridge (like how one interacts with SMN now).

    In a case where we did not have that
    feature, and the board did not support the CPU, we had to buy another
    CPU to update the firmware <https://www.complang.tuwien.ac.at/anton/asus-p10s-c4l.html>. That's especially relevant for AM4 boards, because the support chips make it
    hard to use more than 16MB Flash for firmware, but the firmware for
    all supported big chips does not fit into 16MB. However, as the case mentioned above shows, it's also relevant for Intel boards.

    You shouldn't need to boot the host operating system to do that,
    though I get that on most consumer-grade machines you'll do it via
    something that interfaces with AGESA or UEFI. Most server-grade
    machines will have a BMC that can do this independently of the
    main CPU, and I should be clear that I'm discounting use cases
    for consumer grade boards, where I suspect something like this
    is less interesting than on server hardware.

    As I mentioned, we build our own boards, and this just isn't an
    issue for us, but it's not clear to me that these small cores
    would even have access to the QSPI to update flash with a new
    EFS image (again, to use the AMD example).

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Fri May 2 17:40:09 2025
    On Fri, 2 May 2025 2:15:24 +0000, Dan Cross wrote:

    In article <5a77c46910dd2100886ce6fc44c4c460@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 1 May 2025 13:07:07 +0000, Dan Cross wrote:
    In article <da5b3dea460370fc1fe8ad2323da9bc4@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:
    [snip]
    Reminds me of trying to sell a micro x86-64 to AMD as a project.
    The µ86 is a small x86-64 core made available as IP in Verilog
    where it has/runs the same ISA as main GBOoO x86, but is placed
    "out in the PCIe" interconnect--performing I/O services topo-
    logically adjacent to the device itself. This allows 1ns access latencies to DCRs and performing OS queueing of DPCs,... without bothering the GBOoO cores.

    AMD didn't buy the arguments.

    I can see it either way; I suppose the argument as to whether I
    buy it or not comes down to, "in depends". How much control do
    I, as the OS implementer, have over this core?

    Other than it being placed "away" from the centralized cores,
    it runs the same ISA as the main cores, has longer latency to
    coherent memory and shorter latency to device control registers
    --which is why it is placed close to the device itself:: latency.
    The big fast centralized core is going to get microsecond latency
    from an MMIO device whereas the ASIC version will have a handful of
    nanosecond latencies. So the 5 GHz core sees ~1 microsecond while the
    little ASIC sees 10 nanoseconds. ...

    Yes, I get the argument for WHY you'd do it, I just want to make
    sure that it's an ordinary core (albeit one that is far away
    from the sockets with the main SoC complexes) that I interact
    with in the usual manner. Compare to, say, MP1 or MP0 on AMD
    Zen, where it runs its own (proprietary) firmware that I
    interact with via an RPC protocol over an AXI bus, if I interact
    with it at all: most OEMs just punt and run AGESA (we don't).

    If it is yet another hidden core embedded somewhere deep in the
    SoC complex and I can't easily interact with it from the OS,
    then no thanks: we've got enough of those between MP0, MP1, MP5,
    etc, etc.

    On the other hand, if it's got a "normal" APIC ID, the OS has
    control over it like any other LP, and it's coherent with the big
    cores, then yeah, sign me up: I've been wanting something like
    that for a long time now.

    It is just a core that is cheap enough to put in ASICs, that
    can offload some I/O burden without you having to do anything
    other than setting some bits in some CRs so interrupts are
    routed to this core rather than some more centralized core.

    Sounds good.

    Consider a virtualization application. A problem with, say,
    SR-IOV is that very often the hypervisor wants to interpose some
    sort of administrative policy between the virtual function and
    whatever it actually corresponds to, but get out of the fast
    path for most IO. This implies a kind of offload architecture
    where there's some (presumably software) agent dedicated to
    handling IO that can be parameterized with such a policy. A

    Interesting:: Could you cite any literature, here !?!

    Sure. This paper is a bit older, but gets at the main points: https://www.usenix.org/system/files/conference/nsdi18/nsdi18-firestone.pdf

    I don't know if the details are public for similar technologies
    from Amazon or Google.

    core very close to the device could handle that swimmingly,
    though I'm not sure it would be enough to do it at (say) line
    rate for a 400Gbps NIC or Gen5 NVMe device.

    I suspect the 400 Gb/s NIC needs a rather BIG core to handle the
    traffic loads.

    Indeed. Part of the challenge for the hyperscalers is in
    meeting that demand while not burning too many host resources,
    which are the thing they're actually selling their customer in
    the first place. A lot of folks are pushing this off to the NIC
    itself, and I've seen at least one team that implemented NVMe in
    firmware on a 100Gbps NIC, exposed via SR-IOV, as part of a
    disaggregated storage architecture.

    Another option is to push this to the switch; things like Intel
    Tofino2 were well-positioned for this, but of course Intel, in its
    infinite wisdom and vision, canceled Tofino.

    ....but why x86_64? It strikes me that as long as the _data_
    formats vis the software-visible ABI are the same, it doesn't
    need to use the same ISA. In fact, I can see advantages to not
    doing so.

    Having the remote core run the same OS code as every other core
    means the OS developers have fewer hoops to jump through. Bug-for
    bug compatibility means that clearing of those CRs just leaves
    the core out in the periphery idling and bothering no one.

    Eh...Having to jump through hoops here matters less to me for
    this kind of use case than if I'm trying to use those cores for general-purpose compute. Having a separate ISA means I cannot
    accidentally run a program meant only for the big cores on the
    IO service processors. As long as the OS has total control over
    the execution of the core, and it participates in whatever cache
    coherency scheme the rest of the system uses, then the ISA just
    isn't that important.

    On the other hand, you buy a motherboard with said ASIC core,
    and you can boot the MB without putting a big chip in the
    socket--but you may have to deal with scant DRAM since the
    big centralized chip contains the memory controller.

    A neat hack for bragging rights, but not terribly practical?

    Anyway, it's a neat idea. It's very reminiscent of IBM channel
    controllers, in a way.

    It is more like the Peripheral Processors of the CDC 6600, which run
    the ISA of a CDC 6600 without as much fancy execution in the periphery.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Dan Cross on Sat May 3 06:11:00 2025
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025May2.073450@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    I think it's the same thing as Greenspun's tenth rule: First you find
    that a classical DMA engine is too limiting, then you find that an A53
    is too limiting, and eventually you find that it would be practical to
    run the ISA of the main cores. In particular, it allows you to use
    the toolchain of the main cores for developing them,

    These are issues solveable with the software architecture and
    build system for the host OS.

    Certainly, one can work around many bad decisions, and in reality one
    has to work around some bad decisions, but the issue here is not
    whether "the issues are solvable", but which decision leads to better
    or worse consequences.

    The important characteristic is
    that the software coupling makes architectural sense, and that
    simply does not require using the same ISA across IPs.

    IP? Internet Protocol? Software Coupling sounds to me like a concept
    from Constantine out of my Software engineering class. I guess you
    did not mean either, but it's unclear what you mean.

    In any case, I have made arguments why it would make sense to use the
    same ISA as for the OS for programming the cores that replace DMA
    engines. I will discuss your counterarguments below, but the most
    important one to me seems to be that these cores would cost more than
    with a different ISA. There is something to that, but when the
    application ISA is cheap to implement (e.g., RV64GC), that cost is
    small; it may be more an argument for also selecting the
    cheap-to-implement ISA for the OS/application cores.

    Indeed, consider AMD's Zen CPUs; the PSP/ASP/whatever it's
    called these days is an ARM core while the big CPUs are x86.
    I'm pretty sure there's an Xtensa DSP in there to do DRAM and
    timing and PCIe link training.

    The PSPs are not programmable by the OS or application programmers, so
    using the same ISA would not benefit the OS or application
    programmers. By contrast, the idea for the DMA replacement engines is
    that they are programmable by the OS and maybe the application
    programmers, and that changes whether the same ISA is beneficial.

    What is "ASP/whatever"?

    Similarly with the ME on Intel.

    Last I read about it, ME uses a core developed by Intel with IA-32 or
    AMD64; but in any case, the ME is not programmable by OS or
    application programmers, either.

    A BMC might be running on whatever.

    Again, a BMC is not programmable by OS or application programmers.

    We increasingly see ARM
    based SBCs that have small RISC-V microcontroller-class cores
    embedded in the SoC for exactly this sort of thing.

    That's interesting; it points to RISC-V being cheaper to implement
    than ARM. As for "that sort of thing", they are all not programmable
    by OS or application programmers, so see above.

    Our hardware RoT

    ?

    The problem is when such service cores are hidden (as they are
    in the case of the PSP, SMU, MPIO, and similar components, to
    use AMD as the example) and treated like black boxes by
    software. It's really cool that I can configure the IO crossbar
    in useful way tailored to specific configurations, but it's much
    less cool that I have to do what amounts to an RPC over the SMN
    to some totally undocumented entity somewhere in the SoC to do
    it. Bluntly, as an OS person, I do not want random bits of code
    running anywhere on my machine that I am not at least aware of
    (yes, this includes firmware blobs on devices).

    Well, one goes with the other. If you design the hardware for being
    programmed by the OS programmers, you use the same ISA for all the
    cores that the OS programmers program, whereas if you design the
    hardware as programmed by "firmware" programmers, you use a
    cheap-to-implement ISA and design the whole thing such that it is
    opaque to OS programmers and only offers certain capabilities to
    OS programmers.

    And that's not just limited to ISAs. A very successful example is the
    way that flash memory is usually exposed to OSs: as a block device
    like a plain old hard disk, and all the idiosyncrasies of flash are
    hidden in the device behind a flash translation layer that is
    implemented by a microcontroller on the device.

    What's "SMN"?

    and you can also
    use the facilities of the main cores (e.g., debugging features that
    may be absent of the I/O cores) during development.

    This is interesting, but we've found it more useful going the
    other way around. We do most of our debugging via the SP.
    Since The SP is also responsible for system initialization and
    holding x86 in reset until we're reading for it to start
    running, it's the obvious nexus for debugging the system
    holistically.

    Sure, for debugging on the core-dump level that's useful. I was
    thinking about watchpoint and breakpoint registers and performance
    counters that one may not want to implement on the DMA-replacement
    core, but that is implemented on the OS/application cores.

    Marking the binaries that should be able to run on the IO service processors with some flag, and letting the component of the OS that
    assigns processes to cores heed this flag is not rocket science.

    I agree, that's easy. And yet, mistakes will be made, and there
    will be tension between wanting to dedicate those CPUs to IO
    services and wanting to use them for GP programs: I can easily
    imagine a paper where someone modifies a scheduler to move IO
    bound programs to those cores. Using a different ISA obviates
    most of that, and provides an (admittedly modest) security benefit.

    If there really is such tension, that indicates that such cores would
    be useful for general-purpose use. That makes the case for using the
    same ISA even stronger.

    As for "mistakes will be made", that also goes the other way: With a
    separate toolchain for the DMA-replacement ISA, there is lots of
    opportunity for mistakes.

    As for "security benefit", where is that supposed to come from? What
    attack scenario do you have in mind where that "security benefit"
    could materialize?

    And if I already have to modify or configure the OS to
    accommodate the existence of these things in the first place,
    then accommodating an ISA difference really isn't that much
    extra work. The critical observation is that a typical SMP view
    of the world no longer makes sense for the system architecture,
    and trying to shoehorn that model onto the hardware reality is
    just going to cause frustration.

    The shared-memory multiprocessing view of the world is very
    successful, while distributed-memory computers are limited to
    supercomputing and other areas where hardware cost still dominates
    over software cost (i.e., where the software crisis has not happened
    yet); as an example of the lack of success of the distributed-memory
    paradigm, take the PlayStation 3; programmers found it too hard to
    work with, so they did not use the hardware well, and eventually Sony
    decided to go for an SMP machine for the PlayStation 4 and 5.

    OTOH, one can say that the way many peripherals work on
    general-purpose computers is more along the lines of
    distributed-memory; but that's probably due to the relative hardware
    and software costs for that peripheral. Sure, the performance
    characteristics are non-uniform (NUMA) in many cases, but 1) caches
    tend to smooth over that, and 2) most of the code is not
    performance-critical, so it just needs to run, which is easier to
    achieve with SMP and harder with distributed memory.

    Sure, people have argued for advantages of other models for decades,
    like you do now, but SMP has usually won.

    On the other hand, you buy a motherboard with said ASIC core,
    and you can boot the MB without putting a big chip in the
    socket--but you may have to deal with scant DRAM since the
    big centralized chip contains the memory controller.

    A neat hack for bragging rights, but not terribly practical?

    Very practical for updating the firmware of the board to support the
    big chip you want to put in the socket (called "BIOS FlashBack" in connection with AMD big chips).

    "BIOS", as loaded from the EFS by the ABL on the PSP on EPYC
    class chips, is usually stored in a QSPI flash on the main
    board (though starting with Turin you _can_ boot via eSPI).
    Strictly speaking, you don't _need_ an x86 core to rewrite that.
    On our machines, we do that from the SP, but we don't use AGESA
    or UEFI: all of the platform enablement stuff done in PEI and
    DXE we do directly in the host OS.

    EFS? ABL? QSPI? eSPI? PEI? DXE?

    Anyway, what you do in your special setup does not detract from the
    fact that being able to flash the firmware without having a working
    main core has turned out to be so useful that out of 218 AM5
    motherboards offered in Austria <https://geizhals.at/?cat=mbam5>, 203
    have that feature.

    Also, on AMD machines, again considering EPYC, it's up to system
    software running on x86 to direct either the SMU or MPIO to
    configure DXIO and the rest of the fabric before PCIe link
    training even begins (releasing PCIe from PERST is done by
    either the SMU or MPIO, depending on the specific
    microarchitecture). Where are these cores, again? If they're
    close to the devices, are they in the root complex or on the far
    side of a bridge? Can they even talk to the rest of the board?

    The core that does the flashing obviously is on the board, not on the
    CPU package (which may be absent). I do not know where on the board
    it is. Typically only one USB port can be used for that, so that may
    indicate that a special path may be used for that without initializing
    all the USB ports and the other hardware that's necessary for that; I
    think that some USB ports are directly connected to the CPU package,
    so those would not work anyway.

    In a case where we did not have that
    feature, and the board did not support the CPU, we had to buy another
    CPU to update the firmware <https://www.complang.tuwien.ac.at/anton/asus-p10s-c4l.html>. That's especially relevant for AM4 boards, because the support chips make it
    hard to use more than 16MB Flash for firmware, but the firmware for
    all supported big chips does not fit into 16MB. However, as the case mentioned above shows, it's also relevant for Intel boards.

    You shouldn't need to boot the host operating system to do that,
    though I get on most consumer-grade machines you'll do it via
    something that interfaces with AGESA or UEFI.

    In the bad old days you had to boot into DOS and run a DOS program for
    flashing the BIOS. Or worse, Windows; not very useful if you don't
    have Windows installed on the computer (DOS at least could be booted
    from a floppy disk). My last few experiences in that direction were
    firmware flashing as a "BIOS" feature, and the flashback feature
    (which has its own problems, because communication with the user is
    limited).

    Most server-grade
    machines will have a BMC that can do this independently of the
    main CPU,

    And just in another posting you wrote "but not terribly practical?".
    The board I mentioned above where we had to buy a separate CPU for
    flashing mentioned a BMC on the feature list, but when we looked in
    the manual, we found that the BMC is not delivered with the board, but
    has to be bought separately. There was also no mention that one can
    use the BMC for flashing the BIOS.

    and I should be clear that I'm discounting use cases
    for consumer grade boards, where I suspect something like this
    is less interesting than on server hardware.

    What makes you think so? And what do you mean with "something like
    this"?

    1) "BIOS flashback" is a mostly-standard feature in AM5 (i.e.,
    consumer-grade) boards.

    2) DMA has been a standard feature in various forms on consumer
    hardware since the first IBM PC in 1981, and replacing the DMA engines
    with cores running a general-purpose ISA accessible to OS designers
    will not be limited to servers; if hardware designers and OS
    developers put development time into that, there is no reason for
    limiting that effort to servers. The existence of the LPE-Cores on
    Meteor Lake (not a server chip) and of the in-order ARM cores on
    various smartphone SoCs, the existence of P-Cores and E-Cores on
    Intel consumer-grade CPUs (while the server versions of these CPUs
    have the E-Cores disabled), and the uniformity of cores on the
    dedicated server CPUs all indicate that non-uniform cores seem to be
    hard to sell in the server space.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Sat May 3 14:29:34 2025
    MitchAlsup1 wrote:
    On Fri, 2 May 2025 2:15:24 +0000, Dan Cross wrote:
    On the other hand, you buy a motherboard with said ASIC core,
    and you can boot the MB without putting a big chip in the
    socket--but you may have to deal with scant DRAM since the
    big centralized chip contains teh memory controller.

    A neat hack for bragging rights, but not terribly practical?

    Anyway, it's a neat idea.  It's very reminiscent of IBM channel
    controllers, in a way.

    It is more like the Peripheral Processors of CDC 6600 that run
    ISA of a CDC 6600 without as much fancy execution in periphery.

    Similar timeframe: The ND10 minis were popular in process control;
    CERN bought a brace of them.

    When they later came out with the larger ND100 and then ND500 machines,
    the latter had a 100 (or 10?) as a front-end IO processor, partially
    required because the original ND10 came with a very early version of
    SINTRAN OS which didn't have proper/complete IO support, so customers
    had written machine code to handle it.

    The 500 wasn't machine code compatible, so all such IO routines then had
    to run on the front-end processor.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to Anton Ertl on Sat May 3 13:33:47 2025
    In article <2025May3.081100@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025May2.073450@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    I think it's the same thing as Greenspun's tenth rule: First you find that a classical DMA engine is too limiting, then you find that an A53
    is too limiting, and eventually you find that it would be practical to run the ISA of the main cores. In particular, it allows you to use
    the toolchain of the main cores for developing them,

    These are issues solveable with the software architecture and
    build system for the host OS.

    Certainly, one can work around many bad decisions, and in reality one
    has to work around some bad decisions, but the issue here is not
    whether "the issues are solvable", but which decision leads to better
    or worse consequences.

    I don't know that either would be "better" or "worse" under any
    objective criteria. They would simply be different.

    The important characteristic is
    that the software coupling makes architectural sense, and that
    simply does not require using the same ISA across IPs.

    IP? Internet Protocol?

    When we discuss hardware designs at this level, reusable
    components that go into the system are often referred to as "IP
    cores" or just "IPs". For example, a UART might be an IP.

    Think of them as building blocks that go into, say, a SoC.

    Software Coupling sounds to me like a concept
    from Constantine out of my Software engineering class.

    I have no idea who or what that is, but it seems unrelated.

    I guess you
    did not mean either, but it's unclear what you mean.

    It's a very common term in this context. https://en.wikipedia.org/wiki/Semiconductor_intellectual_property_core

    In any case, I have made arguments why it would make sense to use the
    same ISA as for the OS for programming the cores that replace DMA
    engines. I will discuss your counterarguments below, but the most
    important one to me seems to be that these cores would cost more than
    with a different ISA. There is something to that, but when the
    application ISA is cheap to implement (e.g., RV64GC), that cost is
    small; it may be more an argument for also selecting the
    cheap-to-implement ISA for the OS/application cores.

    Ok.

    Indeed, consider AMD's Zen CPUs; the PSP/ASP/whatever it's
    called these days is an ARM core while the big CPUs are x86.
    I'm pretty sure there's an Xtensa DSP in there to do DRAM and
    timing and PCIe link training.

    The PSPs are not programmable by the OS or application programmers, so
    using the same ISA would not benefit the OS or application
    programmers.

    Its firmware ships in BIOS images. You can, in fact, interact
    with it from the OS. The only thing that keeps it from being
    programmable by the OS is signing keys.

    By contrast, the idea for the DMA replacement engines is
    that they are programmable by the OS and maybe the application
    programmers, and that changes whether the same ISA is beneficial.

    What is "ASP/whatever"?

    The PSP, or "AMD Platform Security Processor", has many names.
    AMD says that "PSP" is the "legacy name", and that the new name
    is ASP, for "AMD Secure Processor", and that it provides
    "runtime security services"; for example, the PSP implements a
    TPM in firmware, and exposes a random number generator that x86
    can access via the `RDRAND` instruction.
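
    For what it's worth, consuming that generator from the x86 side is
    just the usual RDRAND retry loop; a minimal sketch using the standard
    _rdrand64_step intrinsic (compile with -mrdrnd; nothing here is
    AMD-specific, and the retry count is an arbitrary choice):

        #include <immintrin.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Pull one 64-bit value from the hardware RNG behind RDRAND.
           The instruction can transiently fail, so retry a few times. */
        static int rdrand64(uint64_t *out)
        {
            for (int tries = 0; tries < 10; tries++) {
                unsigned long long v;
                if (_rdrand64_step(&v)) {   /* returns 1 on success */
                    *out = v;
                    return 0;
                }
            }
            return -1;
        }

        int main(void)
        {
            uint64_t r;
            if (rdrand64(&r) == 0)
                printf("rdrand: %016llx\n", (unsigned long long)r);
            return 0;
        }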

    Similarly with the ME on Intel.

    Last I read about it, ME uses a core developed by Intel with IA-32 or
    AMD64; but in any case, the ME is not programmable by OS or
    application programmers, either.

    I was under the impression that it started out as an ARM core,
    but I may be mistaken.

    In any case, where do you think its firmware comes from?

    A BMC might be running on whatever.

    Again, a BMC is not programmable by OS or application programmers.

    The people working on OpenBMC disagree.

    We increasingly see ARM
    based SBCs that have small RISC-V microcontroller-class cores
    embedded in the SoC for exactly this sort of thing.

    That's interesting; it points to RISC-V being cheaper to implement
    than ARM. As for "that sort of thing", they are all not programmable
    by OS or application programmers, so see above.

    No, the entire point is to provide an off-load for things that
    are real-time. They are absolutely meant to be "programmable by
    OS or application programmers", which is exactly the sort of
    scenario that Mitch's proposed cores would be used for.

    Is a GPU programmable? Yes. Does it use the same ISA as the
    general purpose compute core? No.

    Our hardware RoT

    ?

    Root of Trust.

    The problem is when such service cores are hidden (as they are
    in the case of the PSP, SMU, MPIO, and similar components, to
    use AMD as the example) and treated like black boxes by
    software. It's really cool that I can configure the IO crossbar
    in useful way tailored to specific configurations, but it's much
    less cool that I have to do what amounts to an RPC over the SMN
    to some totally undocumented entity somewhere in the SoC to do
    it. Bluntly, as an OS person, I do not want random bits of code
    running anywhere on my machine that I am not at least aware of
    (yes, this includes firmware blobs on devices).

    Well, one goes with the other. If you design the hardware for being programmed by the OS programmers, you use the same ISA for all the
    cores that the OS programmers program,

    That's a categorical statement that is not well supported. That
    may be what is _usually_ done. It is not what _has_ to be done,
    or even what _should_ be done.

    You may feel that this is the way things should be done, but the
    arguments you've presented so far are not persuasive.

    whereas if you design the
    hardware as programmed by "firmware" programmers, you use a cheap-to-implement ISA and design the whole thing such that it is
    opaque to OS programmers and only offers some certain capabilities to
    OS programmers.

    There is little fundamental difference between "firmware" and
    the "OS". I would further argue that this model of walling off
    bits of system programmed with "firmware" from the OS a dated
    way of thinking about systems that is actively harmful. See
    Roscoe's OSDI'21 keynote, here:

    https://www.usenix.org/conference/osdi21/presentation/fri-keynote

    Insisting that we use the congealed model we currently use
    because that's how it is done is circular reasoning.

    And that's not just limited to ISAs. A very successful example is the
    way that flash memory is usually exposed to OSs: as a block device
    like a plain old hard disk, and all the idiosyncracies of flash are
    hidden in the device behind a flash translation layer that is
    implemented by a microcontroller on the device.

    You're conflating a hardware interface with firmware.

    What's "SMN"?

    The "System Management Network." This is the thing that AMD
    uses inside the SoC to talk between the different components
    that make up the system (that is, between the different IPs in
    the SoC). SMN is really a network of AXI buses, but it's how
    one can, say, read and write registers on various components.

    If you look at, for example, https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/55803-ppr-family-17h-model-31h-b0-processors.pdf
    and you look at the entry for the SMU registers, you'll see
    that they have an "aliasSMN" entry in the instance table; those
    can be decoded to a 32-bit number. That is the SMN address of
    that register. For example, `SMU::THM::THM_TCON_CUR_TMP` is the
    thermal register maintained by the SMU that encodes the current
    temperature (in normalized units that are scaled from e.g.
    degrees C, to accommodate different operating temperature ranges
    between different physical parts). Anyway, if one were to
    decode the address in the instance table, one would see that
    that register is at SMN address 0x0005_9800. One accesses SMN
    via an address/data pair of registers on a special BDF (0/0/0)
    in PCI config space. If you write that address to offset 0x60
    for 0/0/0, and then read from offset 0x64 on 0/0/0, you'll get
    the contents of that register. You can use either port IO or
    ECAM for such accesses.
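
    A sketch of that index/data pattern, using the legacy 0xCF8/0xCFC
    port-I/O path for the config-space accesses (the 0x60/0x64 offsets
    and the THM_TCON_CUR_TMP SMN address are as given above; doing it
    from a root-privileged Linux process via iopl(), and the complete
    absence of locking against other SMN users, are simplifications for
    illustration):

        #include <stdint.h>
        #include <stdio.h>
        #include <sys/io.h>   /* iopl(), outl(), inl() -- Linux, x86 */

        #define PCI_CFG_ADDR 0xCF8
        #define PCI_CFG_DATA 0xCFC

        /* Legacy config-space address for bus 0, device 0, function 0. */
        static uint32_t cfg_addr(uint32_t off)
        {
            return 0x80000000u | (off & 0xFCu);
        }

        static uint32_t pci_cfg_read32(uint32_t off)
        {
            outl(cfg_addr(off), PCI_CFG_ADDR);
            return inl(PCI_CFG_DATA);
        }

        static void pci_cfg_write32(uint32_t off, uint32_t val)
        {
            outl(cfg_addr(off), PCI_CFG_ADDR);
            outl(val, PCI_CFG_DATA);
        }

        /* Index/data access to SMN: write the SMN address to config offset
           0x60 on 0/0/0, then read the data from offset 0x64. */
        static uint32_t smn_read32(uint32_t smn_addr)
        {
            pci_cfg_write32(0x60, smn_addr);
            return pci_cfg_read32(0x64);
        }

        int main(void)
        {
            if (iopl(3) != 0) { perror("iopl"); return 1; }
            /* SMU::THM::THM_TCON_CUR_TMP, per the PPR cited above. */
            uint32_t v = smn_read32(0x00059800u);
            printf("THM_TCON_CUR_TMP raw: 0x%08x\n", v);
            return 0;
        }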

    Similarly, consider `PCS::DXIO::PCS_GOPX16_PCS_STATUS1`, which
    is a register with multiple instances for each XGMI PCS (before
    you ask, "PCS" is "Physical Coding Sublayer" and xGMI is the
    socket-to-socket [external] Global Memory Interface). That is,
    these are the SerDes (Serializer/Deserializer) for communicating
    between sockets. Anyway, the SMN address that corresponds to
    PCS 21, serdes aggregator 1, is 0x12ff_0050.

    and you can also
    use the facilities of the main cores (e.g., debugging features that
    may be absent of the I/O cores) during development.

    This is interesting, but we've found it more useful going the
    other way around. We do most of our debugging via the SP.
    Since The SP is also responsible for system initialization and
    holding x86 in reset until we're reading for it to start
    running, it's the obvious nexus for debugging the system
    holistically.

    Sure, for debugging on the core-dump level that's useful. I was
    thinking about watchpoint and breakpoint registers and performance
    counters that one may not want to implement on the DMA-replacement
    core, but that is implemented on the OS/application cores.

    I assumed you were talking about remote hardware debugging
    interfaces. You seem to be talking about just running a
    debugger or profiler on the IO offload core. That's a much
    simpler use case.

    Marking the binaries that should be able to run on the IO service processors with some flag, and letting the component of the OS that assigns processes to cores heed this flag is not rocket science.

    I agree, that's easy. And yet, mistakes will be made, and there
    will be tension between wanting to dedicate those CPUs to IO
    services and wanting to use them for GP programs: I can easily
    imagine a paper where someone modifies a scheduler to move IO
    bound programs to those cores. Using a different ISA obviates
    most of that, and provides an (admittedly modest) security benefit.

    If there really is such tension, that indicates that such cores would
    be useful for general-purpose use. That makes the case for using the
    same ISA even stronger.

    Incorrect. It makes it weaker: the whole point is to have
    coprocessor cores that are dedicated to IO processing that are
    not used for GP compute. As Mitch said, they're already far
    away from DRAM; using them for compute is going to suck. They
    are there to offload IO processing from the big cores; don't
    make it easier to abuse their existence.

    As for "mistakes will be made", that also goes the other way: With a
    separate toolchain for the DMA-replacement ISA, there is lots of
    opportunity for mistakes.

    I meant runtime mistakes. You can't run x86 code on them if
    they're not an x86 core.

    As for "security benefit", where is that supposed to come from?d

    You can't run x86 code on them if they're not an x86 core.

    What
    attack scenario do you have in mind where that "security benefit"
    could materialize?

    Someone figures out how to exploit a flaw in the OS whereby some
    user thread can execute on an IO coprocessor core, and they
    figure out you can speculate on IO transactions, allowing them
    to exfiltrate data directly from the IO source.

    But, if the OS _cannot_ schedule a user process there, because
    it's running an entirely different ISA, then that cannot happen.

    And if I already have to modify or configure the OS to
    accommodate the existence of these things in the first place,
    then accommodating an ISA difference really isn't that much
    extra work. The critical observation is that a typical SMP view
    of the world no longer makes sense for the system architecture,
    and trying to shoehorn that model onto the hardware reality is
    just going to cause frustration.

    The shared-memory multiprocessing view of the world is very
    successful, while distributed-memory computers are limited to
    supercomputing and other areas where hardware cost still dominates
    over software cost (i.e., where the software crisis has not happened
    yet); as an example of the lack of success of the distributed-memory paradigm, take the PlayStation 3; programmers found it too hard to
    work with, so they did not use the hardware well, and eventually Sony
    decided to go for an SMP machine for the PlayStation 4 and 5.

    The SoCs you are talking about are already, literally,
    "distributed memory computers". See above about the SMN.

    OTOH, one can say that the way many peripherals work on
    general-purpose computers is more along the lines of
    distributed-memory; but that's probably due to the relative hardware
    and software costs for that peripheral. Sure, the performance
    characteristics are non-uniform (NUMA) in many cases, but 1) caches
    tend to smooth over that, and 2) most of the code is not performance-critical, so it just needs to run, which is easier to
    achieve with SMP and harder with distributed memory.

    Sure, people have argued for advantages of other models for decades,
    like you do now, but SMP has usually won.

    Bluntly, you're making a lot of assumptions and drawing
    conclusions from those assumptions.

    On the other hand, you buy a motherboard with said ASIC core,
    and you can boot the MB without putting a big chip in the
    socket--but you may have to deal with scant DRAM since the
    big centralized chip contains the memory controller.

    A neat hack for bragging rights, but not terribly practical?

    Very practical for updating the firmware of the board to support the
    big chip you want to put in the socket (called "BIOS FlashBack" in connection with AMD big chips).

    "BIOS", as loaded from the EFS by the ABL on the PSP on EPYC
    class chips, is usually stored in a QSPI flash on the main
    board (though starting with Turin you _can_ boot via eSPI).
    Strictly speaking, you don't _need_ an x86 core to rewrite that.
    On our machines, we do that from the SP, but we don't use AGESA
    or UEFI: all of the platform enablement stuff done in PEI and
    DXE we do directly in the host OS.

    EFS? ABL? QSPI? eSPI? PEI? DXE?

    Umm, those are the basic components of the "BIOS" and
    surrounding stack as implemented on AMD systems with AGESA and
    UEFI. If you are unaware of what these mean, perhaps you should
    spend a little bit of time reading up on how the things you are
    frankly making a lot of assumptions about actually work.

    In this case, I'm happy to explain a bit, but, frankly, your
    response makes it painfully obvious that you really need to
    do your own homework here.

    * EFS: Embedded File System. This is the filesystem-like format
    that AMD uses for the data stored in flash that is loaded by
    the PSP.
    * ABL: AGESA Boot Loader. This is a software component that
    runs on the PSP that reads and interprets the "BIOS" image
    in the EFS on flash and loads the x86 code that runs from the
    reset vector into DRAM.
    * QSPI: Quad SPI. This is the physical interface used to access
    the flash that holds the EFS. It is lined out from the socket
    and thus the CPU so that the PSP can access it. Other things
    can also access it via a series of muxes; for example, on OCP
    boards like Ruby it's accessible across the DC-SCM connector
    to the BMC so that the BMC can update flash.
    * eSPI: enhanced Serial Peripheral Interface. See the Intel
    spec. Supported in Genoa, and now in Turin, it's possible to
    boot an AMD EPYC CPU over eSPI. eSPI is lined out from the
    package.
    * PEI: The "Pre-EFI Initialization" phase of UEFI (Unified
    Extensible Firmware Interface -- the "modern" BIOS). This is
    the phase where most of the platform enablement stuff is done;
    for example, the PCIe buses are initialized and links are
    trained, as here:
    https://github.com/openSIL/openSIL/blob/main/xUSL/Mpio/Common/MpioInitFlow.c#L508
    * DXE: The "Driver Execution Environment" phase of UEFI, where
    individual _devices_ are found and initialized.
    https://uefi.org/specs/PI/1.9/V1_Overview.html

    Anyway, what you do in your special setup does not detract from the
    fact that being able to flash the firmware without having a working
    main core has turned out to be so useful that out of 218 AM5
    motherboards offered in Austria <https://geizhals.at/?cat=mbam5>, 203
    have that feature.

    Sure. It's useful. You just don't need to have an x86 core to
    do it.

    Also, on AMD machines, again considering EPYC, it's up to system
    software running on x86 to direct either the SMU or MPIO to
    configure DXIO and the rest of the fabric before PCIe link
    training even begins (releasing PCIe from PERST is done by
    either the SMU or MPIO, depending on the specific
    microarchitecture). Where are these cores, again? If they're
    close to the devices, are they in the root complex or on the far
    side of a bridge? Can they even talk to the rest of the board?

    The core that does the flashing obviously is on the board, not on the
    CPU package (which may be absent). I do not know where on the board
    it is.

    I was referring to Mitch's proposed co-processor cores. The
    point was, that if they're on the distant end of an IO bus that
    isn't even configured, and not somehow otherwise connected to
    the flash part that holds the BIOS, then they're not going to
    help you flash the BIOS without a socket being populated so
    that you've got something that can set up that IO bus so that
    those cores can connect to anything useful. You seem to be
    assuming that they're just going to start, in the absence of
    the main package, but again, that's a big assumption.

    Typically only one USB port can be used for that, so that may
    indicate that a special path may be used for that without initializing
    all the USB ports and the other hardware that's necessary for that; I
    think that some USB ports are directly connected to the CPU package,
    so those would not work anyway.

    Like I said, you could have an electromechanical interlock that
    lets the IO coprocessors boot independently and talk directly to
    the flash mux if the socket is not populated. The interface by
    which you get the flash image is immaterial at that point. But
    it's not at all clear to me that Mitch had anything like that in
    mind.

    In a case where we did not have that
    feature, and the board did not support the CPU, we had to buy another
    CPU to update the firmware <https://www.complang.tuwien.ac.at/anton/asus-p10s-c4l.html>. That's especially relevant for AM4 boards, because the support chips make it hard to use more than 16MB Flash for firmware, but the firmware for
    all supported big chips does not fit into 16MB. However, as the case mentioned above shows, it's also relevant for Intel boards.

    You shouldn't need to boot the host operating system to do that,
    though I get on most consumer-grade machines you'll do it via
    something that interfaces with AGESA or UEFI.

    In the bad old days you had to boot into DOS and run a DOS program for flashing the BIOS. Or worse, Windows; not very useful if you don't
    have Windows installed on the computer (DOS at least could be booted
    from a floppy disk). My last few experiences in that direction were
    firmware flashing as a "BIOS" feature, and the flashback feature
    (which has its own problems, because communication with the user is limited).

    Most server-grade
    machines will have a BMC that can do this independently of the
    main CPU,

    And just in another posting you wrote "but not terribly practical?".
    The board I mentioned above where we had to buy a separate CPU for
    flashing mentioned a BMC on the feature list, but when we looked in
    the manual, we found that the BMC is not delivered with the board, but
    has to be bought separately. There was also no mention that one can
    use the BMC for flashing the BIOS.

    Sounds like a problem with the vendor.

    and I should be clear that I'm discounting use cases
    for consumer grade boards, where I suspect something like this
    is less interesting than on server hardware.

    What makes you think so? And what do you mean with "something like
    this"?

    "Something like this" meaning a dedicated IO coprocessor on the
    far side of the root complex for offloading IO handling.

    If you can't see why that might have more applications in the
    data center than on the desktop, I don't know what to tell you.
    Maybe there are consumer use cases I'm not aware of.

    1) "BIOS flashback" is a mostly-standard feature in AM5 (i.e., >consumer-grade) boards.

    Of course.

    2) DMA has been a standard feature in various forms on consumer
    hardware since the first IBM PC in 1981, and replacing the DMA engines
    with cores running a general-purpose ISA accessible to OS designers
    will not be limited to servers;

    I don't think that was the suggestion.

    if hardware designers and OS
    developers put development time into that, there is no reason for
    limiting that effort to servers. The existence of the LPE-Cores on
    Meteor Lake (not a server chip) and the in-order ARM cores on various smartphone SOCs, the existence of P-Cores and E-Cores on Intel
    consumer-grade CPUs, while the server versions of these CPUs have the
    E-Cores disabled, and the uniformity of cores on the dedicated server
    CPUs indicates that non-uniform cores seem to be hard to sell in
    server space.

    The systems you just mentioned were designed for minimizing
    power consumption, something that's very useful in the consumer
    space (e.g., for battery operated applications, like phones and
    laptops) and less useful in the data center space. However,
    having dedicated coprocessors to offload things like IO has a
    long history in the mainframe world, but that hasn't filtered
    down to the server space in part because it's not well-supported
    by software.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Dan Cross on Sat May 3 21:53:37 2025
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025May2.073450@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <5a77c46910dd2100886ce6fc44c4c460@www.novabbs.org>,
    [snip]
    I suspect the 400 Gb/s NIC needs a rather BIG core to handle the
    traffic loads.

    Looking at
    https://chipsandcheese.com/p/arms-cortex-a53-tiny-but-important, a Cortex-A53 would not be up to it (at 1896MHz it can read <12GB/s and
    write <18GB/s even to the L1 cache). However, Chester Lam notes: "A53 offers very low cache bandwidth compared to pretty much any other core we’ve analyzed." I think, though, that a small in-order core like the A53, but with enough load and store buffering and enough bandwidth to
    I/O and the memory controller should not have a problem shoveling data
    from or to a 400Gb/s NIC. With 128 bits/cycle in each direction one
    would need one transfer per cycle in each direction at 3125MHz to
    achieve 400Gb/s, or maybe 4GHz for a dual-issue core to allow for loop >>overhead.
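
    (A quick check of that arithmetic: 400 Gb/s is 50 GB/s; at 16 bytes
    (128 bits) per transfer that is 50e9 / 16 = 3.125e9 transfers per
    second, i.e. one transfer per cycle in each direction at 3125 MHz,
    which is where the figure above comes from.)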

    Running any SoC at 3+ GHz requires significant effort in the
    back-end and to ensure timing closure on the front end (and
    affects floorplanning). All this adds to the cost to build
    and manufacture the chips.

    It may be more productive to consider widening the internal
    buses to be 256 or 512 bits wide.

    Given that the A53 typically runs at only about 2GHz, supporting
    256 bits/cycle of transfer width (for load and store instructions,
    i.e., along the lines of AVX-256) would be more appropriate.
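
    (As an illustration only: the A53 has no AVX, but on an ISA that does
    have 256-bit loads and stores, the data-shoveling loop looks roughly
    like the sketch below; it is not tied to any particular NIC or driver,
    and assumes 32-byte-aligned buffers whose length is a multiple of 32.
    If the core can sustain one such load and one such store per cycle,
    32 bytes/cycle at 2 GHz is 64 GB/s per direction, above the 50 GB/s
    needed for 400 Gb/s.)

        #include <immintrin.h>
        #include <stddef.h>

        /* Copy len bytes using 256-bit (32-byte) loads and stores.
           Assumes src and dst are 32-byte aligned and len is a
           multiple of 32. */
        static void copy256(const void *src, void *dst, size_t len)
        {
            const __m256i *s = (const __m256i *)src;
            __m256i *d = (__m256i *)dst;
            for (size_t i = 0; i < len / 32; i++) {
                __m256i v = _mm256_load_si256(s + i);  /* 256-bit load  */
                _mm256_store_si256(d + i, v);          /* 256-bit store */
            }
        }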

    Better to just use custom hardware for data movement and
    add accelerators for specific activities (such as crypto).

    Back in the late 70's the Burroughs B4900 used 8085 processor
    chips in the I/O controllers (and in the maintenance processor).
    The 8085 was primarily concerned with data movement and supported
    aggregate bandwidth of 8Mbytes/second between each I/O controller
    and memory (there could be up to two IOPs, each responsible for
    32 channels).

    Eh...Having to jump through hoops here matters less to me for
    this kind of use case than if I'm trying to use those cores for
    general-purpose compute.

    I think it's the same thing as Greenspun's tenth rule: First you find
    that a classical DMA engine is too limiting, then you find that an A53
    is too limiting, and eventually you find that it would be practical to
    run the ISA of the main cores. In particular, it allows you to use
    the toolchain of the main cores for developing them,

    These are issues solvable with the software architecture and
    build system for the host OS. The important characteristic is
    that the software coupling makes architectural sense, and that
    simply does not require using the same ISA across IPs.

    I think there are good reasons to have specialized (or low cost,
    e.g. riscv) ancillary cores in a processor package. Having
    been on both sides of the keep-them-proprietary vs.
    fully-document-them-for-the-OS-folks argument, I remain ambivalent.

    There are good reasons for both positions. The same reasons behind
    the MP1.5 spec and UEFI apply in many cases - widening the
    ecosystem and 'you-fix-it' capabilities. On the other hand,
    there may be trade secrets, or system security implications that
    might preclude full disclosure. Once a capability is documented
    in the PC world, it tends to live forever, good or bad (ISA, anyone?),
    which may limit future choices in the product line (or discommode
    customers).


    At work, our service processor (granted, outside of the SoC but
    tightly coupled at the board level) is a Cortex-M7, but we wrote
    the OS for that,

    What, not Zephyr?

    and we control the host OS that runs on x86,
    so the SP and big CPUs can be mutually aware. Our hardware RoT
    is a smaller Cortex-M. We don't have a BMC on our boards;
    everything that it does is either done by the SP or built into
    the host OS, both of which are measured by the RoT.

    The problem is when such service cores are hidden (as they are
    in the case of the PSP, SMU, MPIO, and similar components, to
    use AMD as the example) and treated like black boxes by
    software. It's really cool that I can configure the IO crossbar
    in useful ways tailored to specific configurations, but it's much
    less cool that I have to do what amounts to an RPC over the SMN
    to some totally undocumented entity somewhere in the SoC to do
    it. Bluntly, as an OS person, I do not want random bits of code
    running anywhere on my machine that I am not at least aware of
    (yes, this includes firmware blobs on devices).

    As a hardware (and long-time OS) person (not necessarily in that order),
    I sympathize, but see above.


    And if I already have to modify or configure the OS to
    accommodate the existence of these things in the first place,
    then accommodating an ISA difference really isn't that much
    extra work. The critical observation is that a typical SMP view
    of the world no longer makes sense for the system architecture,
    and trying to shoehorn that model onto the hardware reality is
    just going to cause frustration. Better to acknowledge that the


    Access to most of this hardware should be standardized through
    ACPI methods, allowing the underlying implementation to vary over
    time.


    <big snip>

    Also, on AMD machines, again considering EPYC, it's up to system
    software running on x86 to direct either the SMU or MPIO to
    configure DXIO and the rest of the fabric before PCIe link
    training even begins (releasing PCIe from PERST is done by
    either the SMU or MPIO, depending on the specific
    microarchitecture). Where are these cores, again? If they're
    close to the devices, are they in the root complex or on the far
    side of a bridge? Can they even talk to the rest of the board?

    It's not the core that's proprietary in this case; it's the
    intimate knowledge of the mainboard that is required for that
    operation - consider address routing - once a function BAR
    is programmed, the SoC fabric needs to be configured to route
    that range of addresses to the correct PCI controller to
    be converted to PCIe TLPs to the target function.

    That routing can be incredibly complicated, requiring
    very substantial and rather tricky configuration
    steps in several related IP blocks as well as the
    inter-cpu routing fabric/mesh. Getting it right
    is difficult, and there's really no reason for the
    OS level to be aware of it (and given that it is highly
    mainboard/SoC dependent, it just complicates the
    operating software when not hidden behind a standard
    configuration mechanism like ACPI).
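
    (For concreteness, the architecturally standard part of this is
    small: sizing and programming a BAR through config space, roughly
    as sketched below. The pci_cfg_read32/pci_cfg_write32 helpers are
    hypothetical stand-ins for whatever config-space accessors a given
    OS provides. Everything after that, routing the programmed window
    through the SoC fabric to the right PCIe controller, is the
    vendor-specific part described above.)

        #include <stdint.h>

        /* Hypothetical config-space accessors; stand-ins for whatever
           mechanism the OS uses (0xCF8/0xCFC ports, ECAM, ...). */
        uint32_t pci_cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn,
                                uint8_t off);
        void pci_cfg_write32(uint8_t bus, uint8_t dev, uint8_t fn,
                             uint8_t off, uint32_t val);

        /* Size and program a 32-bit memory BAR at config offset
           bar_off (0x10, 0x14, ...); returns the decoded size. */
        static uint32_t program_bar(uint8_t b, uint8_t d, uint8_t f,
                                    uint8_t bar_off, uint32_t base)
        {
            uint32_t old = pci_cfg_read32(b, d, f, bar_off);

            /* Write all-ones and read back: size bits read as zero. */
            pci_cfg_write32(b, d, f, bar_off, 0xFFFFFFFFu);
            uint32_t probe = pci_cfg_read32(b, d, f, bar_off);

            /* Low 4 bits of a memory BAR are flags, not address bits. */
            uint32_t size = ~(probe & ~0xFu) + 1u;

            /* Program the new base, keeping the original flag bits. */
            pci_cfg_write32(b, d, f, bar_off, base | (old & 0xFu));
            return size;
        }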

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat May 3 23:02:25 2025
    On Sat, 3 May 2025 21:53:37 +0000, Scott Lurndal wrote:

    cross@spitfire.i.gajendra.net (Dan Cross) writes:

    Looking at
    https://chipsandcheese.com/p/arms-cortex-a53-tiny-but-important, a
    Cortex-A53 would not be up to it (at 1896MHz it can read <12GB/s and
    write <18GB/s even to the L1 cache). However, Chester Lam notes: "A53
    offers very low cache bandwidth compared to pretty much any other core
    we’ve analyzed." I think, though, that a small in-order core like the
    A53, but with enough load and store buffering and enough bandwidth to
    I/O and the memory controller should not have a problem shoveling data
    from or to a 400Gb/s NIC. With 128 bits/cycle in each direction one
    would need one transfer per cycle in each direction at 3125MHz to
    achieve 400Gb/s, or maybe 4GHz for a dual-issue core to allow for loop
    overhead.

    Running any SoC at 3+ GHz requires significant effort in the
    back-end and to ensure timing closure on the front end (and
    affects floorplanning). All this adds to the cost to build
    and manufacture the chips.

    It may be more productive to consider widening the internal
    buses to be 256 or 512 bits wide.

    At smaller than 7nm there seems to be little reason the main
    interconnect is not cache-line-wide, or cache-line-wide in each
    direction. Your typical GPU will have 1024 wires into and out
    of each shader core and several other big blocks.

    Many cache lines are 512 bits wide (except for IBM's, at 4096
    bits).
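
    (For scale: a 512-bit, i.e. 64-byte, path moving one beat per cycle
    at a modest 2 GHz carries 64 B * 2e9/s = 128 GB/s per direction,
    while the 400 Gb/s NIC case discussed above needs only 50 GB/s, so
    a cache-line-wide interconnect has ample headroom at such clocks.)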

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun May 4 06:44:24 2025
    On Sat, 03 May 2025 06:11:00 GMT, Anton Ertl wrote:

    In any case, I have made arguments why it would make sense to use the
    same ISA as for the OS for programming the cores that replace DMA
    engines. I will discuss your counterarguments below, but the most
    important one to me seems to be that these cores would cost more than
    with a different ISA.

    I think efficiency of implementation is still important enough to
    outweigh that. Case in point: the RP2040 chip from the Raspberry Pi
    Foundation. That has two ARM Cortex-M0+ cores, combined with a pair
    of programmable I/O (PIO) blocks not a million miles removed from
    the old mainframe idea of “I/O channels”. Those auxiliary
    processors have sufficient oomph to perform feats such as emulating
    the analog video signal from a 1980s-vintage BBC micro, in real
    time.

    Newer versions of the chip (the RP2350) offer RISC-V cores as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)