Well, not entirely. This preprint argues that in environments with
lots of cores and where latency is an issue, programmed I/O can outperform DMA.
Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects
Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe
Conventional wisdom holds that an efficient interface between an OS
running on a CPU and a high-bandwidth I/O device should use Direct
Memory Access (DMA) to offload data transfer, descriptor rings for
buffering and queuing, and interrupts for asynchrony between cores and
device. In this paper we question this wisdom in the light of two
trends: modern and emerging cache-coherent interconnects like CXL 3.0,
and workloads, particularly microservices and serverless computing.
Like some others before us, we argue that the assumptions of the
DMA-based model are obsolete, and in many use-cases programmed I/O,
where the CPU explicitly transfers data and control information to and
from a device via loads and stores, delivers a more efficient system.
However, we push this idea much further. We show, in a real hardware
implementation, the gains in latency for fine-grained communication
achievable using an open cache-coherence protocol which exposes cache
transitions to a smart device, and that throughput is competitive with
DMA over modern interconnects. We also demonstrate three use-cases:
fine-grained RPC-style invocation of functions on an accelerator,
offloading of operators in a streaming dataflow engine, and a network
interface targeting serverless functions, comparing our use of
coherence with both traditional DMA-style interaction and a
highly-optimized implementation using memory-mapped programmed I/O
over PCIe.
https://arxiv.org/abs/2409.08141
On 2025-04-26, John Levine <johnl@taugh.com> wrote: [snip]
Well, not entirely. This preprint argues that in environments with
lots of cores and where latency is an issue, programmed I/O can outperform DMA.
Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects
Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe
https://arxiv.org/abs/2409.08141
What is the difference between DMA and message-passing to another core
doing CMOV loop at the ISA level?
DMA means doing that in the micro-engine instead of at the ISA level.
Same difference.
What am I missing?
John Levine <johnl@taugh.com> writes:
Well, not entirely. This preprint argues that in environments with
lots of cores and where latency is an issue, programmed I/O can
outperform DMA.
Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent
Interconnects
Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe
<snip abstract>
https://arxiv.org/abs/2409.08141
Interesting article, thanks for posting.
Their conclusion is not at all surprising for the operations they target
in the paper. PCI express throughput has increased with each
generation, and PCI express latencies have decreased with each
generation. There are certain workloads enabled by CXL that
benefit from reduced PCIe latency, but those are primarily
aimed at increasing directly accessible memory.
However, I expect there are still benefits in using DMA for bulk data transfer, particularly for network packet handling where
throughput is more interesting than PCI MMIO latency.
One concern that arises from the paper is the security
implications of device access to the cache coherency
protocol. Not an issue for a well-behaved device, but
potentially problematic in a secure environment with
third-party CXL-mem devices.
At 3Leaf systems, we extended the coherency domain over
IB or 10Gbe Ethernet to encompass multiple servers in a
single coherency domain, which both facilitated I/O
and provided a single shared physical address space across
multiple servers (up to 16). CXL-mem is basically the same
but using PCIe instead of IB.
Granted, that was close to 20 years ago, and switch latencies
were significant (100ns for IB, far more for Ethernet).
CXL-mem is a similar technology with a different transport (we
looked at Infiniband, 10Ge ethernet and "advanced switching"
(a flavor of PCIe)). Infiniband was the most mature of the
three technologies and switch latencies were significantly lower
for IB than the competing transports.
Today, my CPOE sells a couple of CXL 2.0-enabled PCIe devices for
memory expansion; one has 16 high-end ARM V cores.
Quoting from the article (p.2)
" As a second example: for throughput-oriented workloads
DMA has evolved to efficiently transfer data to and from
main memory without polluting the CPU cache. However, for
small, fine-grained interactions, it is important that almost all
the data gets into the right CPU cache as quickly as possible."
Most modern CPUs support "allocate" hints on inbound DMA
that will automatically place the data in the right CPU cache as
quickly as possible.
Decomposing that packet transfer into CPU loads and stores
in a coherent fabric doesn't gain much, and burns more power
on the "device" than a DMA engine.
It's interesting they use one of the processors (designed in 2012) that
we built over a decade ago (and they misspell our company name :-)
in their research computer. That processor does have
a mechanism to allocate data in cache on inbound DMA[*]; it's
worth noting that the 48 cores on that processor are in-order
cores. The text comparing it with a modern i7 at 3.6 GHz
doesn't note that.
[*] Although I don't recall if that mechanism was documented
in the public processor technical documentation.
Their description of the Thunder X-1 processor cache is not accurate;
it's not PIPT, it is PIVT (implemented in such a way as to
appear to software as if it were PIPT). The V drops a cycle
off the load-to-use latency.
It was also the only ARM64 processor chip we built with a cache-coherent interconnect until the recent CXL based products.
Overall, a very interesting paper.
On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:
John Levine <johnl@taugh.com> writes:
Well, not entirely. This preprint argues that in environments with
lots of cores and where latency is an issue, programmed I/O can
outperform DMA.
Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent
Interconnects
Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe
<snip abstract>
https://arxiv.org/abs/2409.08141
Interesting article, thanks for posting.
Their conclusion is not at all surprising for the operations they target
in the paper. PCI express throughput has increased with each
generation, and PCI express latencies have decreased with each
generation. There are certain workloads enabled by CXL that
benefit from reduced PCIe latency, but those are primarily
aimed at increasing directly accessible memory.
However, I expect there are still benefits in using DMA for bulk data
transfer, particularly for network packet handling where
throughput is more interesting than PCI MMIO latency.
I would like to add a thought to the concept under discussion::
Does the paper's conclusion hold better or worse if/when the
core ISA contains both LDM/STM and MM instructions? LDM/STM
allow for several sequential registers to move to/from MMI/O
memory in a single interconnect transaction, while MM allows
for up to page-sized transfers in a single instruction and
only 2 interconnect transactions.
One concern that arises from the paper is the security
implications of device access to the cache coherency
protocol. Not an issue for a well-behaved device, but
potentially problematic in a secure environment with
third-party CXL-mem devices.
Citation please !?!
Also note:: device DMA goes through the I/O MMU, which adds a
modicum of security-fencing around device DMA accesses
but also adds latency.
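To make that cost concrete, here is a minimal sketch in C against the
Linux VFIO type-1 IOMMU API (the container fd setup, group attach and
VFIO_SET_IOMMU steps are assumed to have been done already) of what a
driver does just to make one buffer reachable through the I/O MMU; the
per-transaction translation itself then happens in hardware:

  #include <linux/vfio.h>
  #include <sys/ioctl.h>
  #include <stdint.h>

  /* Map one buffer into a device's I/O address space through the IOMMU.
     'container' is an already-configured VFIO container fd. */
  static int map_dma_buffer(int container, void *buf, uint64_t iova,
                            uint64_t len)
  {
          struct vfio_iommu_type1_dma_map map = {
                  .argsz = sizeof(map),
                  .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                  .vaddr = (uintptr_t)buf,
                  .iova  = iova,   /* address the device will use */
                  .size  = len,
          };
          /* The kernel pins the pages and builds the IOMMU page-table
             entries here; this per-buffer setup is part of the overhead
             a pure PIO path would not pay. */
          return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
  }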
At 3Leaf systems, we extended the coherency domain over
IB or 10Gbe Ethernet to encompass multiple servers in a
single coherency domain, which both facilitated I/O
and provided a single shared physical address space across
multiple servers (up to 16). CXL-mem is basically the same
but using PCIe instead of IB.
IB == InfiniBand ?!?
Most modern CPUs support "allocate" hints on inbound DMA
that will automatically place the data in the right CPU cache as
quickly as possible.
Decomposing that packet transfer into CPU loads and stores
in a coherent fabric doesn't gain much, and burns more power
on the "device" than a DMA engine.
That was my initial thought--core performing lots of LD/ST to
MMI/O is bound to consume more power than device DMA.
Secondarily, using 1-few cores to perform PIO is not going to
have the data land in the cache of the core that will run when
the data has been transferred. The data lands in the cache doing
PIO and not in the one to receive control after I/O is done.
{{It may still be "closer than" memory--but several cache
coherence protocols take longer cache-cache than dram-cache.}}
It's interesting they use one of the processors (designed in 2012) that
we built over a decade ago (and they misspell our company name :-)
in their research computer. That processor does have
a mechanism to allocate data in cache on inbound DMA[*]; it's
worth noting that the 48 cores on that processor are in-order
cores. The text comparing it with a modern i7 at 3.6ghz
doesn't note that.
[*] Although I don't recall if that mechanism was documented
in the public processor technical documentation.
Their description of the Thunder X-1 processor cache is not accurate;
it's not PIPT, it is PIVT (implemented in such a way as to
appear to software as if it were PIPT). The V drops a cycle
off the load-to-use latency.
Generally it's VIPT (virtual index, physical tag) with a few bits
of virtual aliasing to disambiguate P&s.
It was also the only ARM64 processor chip we built with a cache-coherent
interconnect until the recent CXL based products.
Overall, a very interesting paper.
Reminds me of trying to sell a micro x86-64 to AMD as a project.
The µ86 is a small x86-64 core made available as IP in Verilog
where it has/runs the same ISA as main GBOoO x86, but is placed
"out in the PCIe" interconnect--performing I/O services topo-
logically adjacent to the device itself. This allows 1ns access
latencies to DCRs and performing OS queueing of DPCs,... without
bothering the GBOoO cores.
AMD didn't buy the arguments.
mitchalsup@aol.com (MitchAlsup1) writes:
On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:
John Levine <johnl@taugh.com> writes:
Well, not entirely. This preprint argues that in environments with
lots of cores and where latency is an issue, programmed I/O can
outperform DMA.
Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent
Interconnects
Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe
<snip abstract>
https://arxiv.org/abs/2409.08141
Interesting article, thanks for posting.
Their conclusion is not at all surprising for the operations they target
in the paper. PCI express throughput has increased with each
generation, and PCI express latencies have decreased with each
generation. There are certain workloads enabled by CXL that
benefit from reduced PCIe latency, but those are primarily
aimed at increasing directly accessible memory.
However, I expect there are still benefits in using DMA for bulk data
transfer, particularly for network packet handling where
throughput is more interesting than PCI MMIO latency.
I would like to add a thought to the concept under discussion::
Does the paper's conclusion hold better or worse if/when the
core ISA contains both LDM/STM and MM instructions. LDM/STM
allow for several sequential registers to move to/from MMI/O
memory in a single interconnect transaction, while MM allows
for up to page-sized transfers in a single instruction and
only 2 interconnect transactions.
The article discusses using the intel 64-byte store
instructions, if I recall correctly. ARM also has
a 64-byte store in the latest versions of the ISA.
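For reference, a minimal sketch of the Intel side of that, using the
MOVDIR64B intrinsic from immintrin.h (build with -mmovdir64b); the
device window here is just a placeholder, and the ARM analogue would
be ST64B from FEAT_LS64:

  #include <immintrin.h>

  /* Push one 64-byte command into a device's MMIO submission window as
     a single 64-byte posted write, rather than building a descriptor
     and letting the device DMA the command in. */
  static void post_command64(volatile void *mmio_slot, const void *cmd64)
  {
          /* MOVDIR64B: direct store of 64 bytes from the source buffer
             to a 64-byte-aligned destination, as one write transaction. */
          _movdir64b((void *)mmio_slot, cmd64);
  }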
Lars Poulsen <lars@cleo.beagle-ears.com> wrote:
What is the difference between DMA and message-passing to another core
doing CMOV loop at the ISA level?
DMA means doing that in the micro-engine instead of at the ISA level.
Same difference.
What am I missing?
Width and specialisation.
You can absolutely write a DMA engine in software. One thing that is
troublesome is that the CPU datapath might be a lot narrower than the
number of bits you can move in a single cycle. eg on FPGA we can't
clock logic anywhere near the DRAM clock so we end up making a very
wide memory bus that runs at a lower clock - 512/1024/2048/... bits
wide. You can do that in a regular ISA using vector
registers/instructions but it adds complexity you don't need.
The other is that there's often some degree of marshalling that needs
to happen - reading scatter/gather lists, formatting packets the right
way for PCIe, filling in the right header fields, etc. It's more
efficient to do that in hardware than it is to spend multiple
instructions per packet doing it. Meanwhile the DRAM bandwidth is
being wasted.
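To make the scatter/gather point concrete, a rough sketch of the inner
loop a "software DMA engine" ends up running (plain C; the descriptor
layout is made up for illustration - every real NIC has its own):

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Hypothetical scatter/gather element: one contiguous fragment. */
  struct sg_elem {
          uint64_t addr;   /* source address of this fragment */
          uint32_t len;    /* length of this fragment in bytes */
          uint32_t flags;  /* e.g. bit 0 = last fragment of the packet */
  };

  /* "Software DMA": walk the SG list and copy each fragment into a
     contiguous transmit buffer. Every byte crosses the core's load and
     store datapath, which is exactly the width/power argument above. */
  static size_t sw_dma_gather(const struct sg_elem *sg, size_t n,
                              uint8_t *dst, size_t dst_cap)
  {
          size_t off = 0;
          for (size_t i = 0; i < n; i++) {
                  if (off + sg[i].len > dst_cap)
                          break;                      /* no room left */
                  memcpy(dst + off, (const void *)(uintptr_t)sg[i].addr,
                         sg[i].len);
                  off += sg[i].len;
                  if (sg[i].flags & 1)                /* last fragment */
                          break;
          }
          return off;   /* bytes gathered */
  }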
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:
--------------
One concern that arises from the paper is the security
implications of device access to the cache coherency
protocol. Not an issue for a well-behaved device, but
potentially problematic in a secure environment with
third-party CXL-mem devices.
Citation please !?!
CXL's protection model isn't very good: https://dl.acm.org/doi/pdf/10.1145/3617580
(declaration: I'm a coauthor)
Secondarily, using 1-few cores to perform PIO is not going to
have the data land in the cache of the core that will run when
the data has been transferred. The data lands in the cache doing
PIO and not in the one to receive control after I/O is done.
{{It may still be "closer than" memory--but several cache
coherence protocols take longer cache-cache than dram-cache.}}
I think this is an 'it depends'. If you're doing RPC type operations,
it takes more work to warm up the DMA than it does to just do PIO. If
you're an SSD pulling a large file from flash, DMA is more efficient.
If you're moving network packets, which involve multiple
scatter-gathers per packet, then maybe some heavy lifting is useful
for the address handling.
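A rough sketch of that asymmetry for a small RPC-style request
(everything below is invented for illustration - descriptor layout,
ring size, doorbell - not any particular device's interface):

  #include <stdint.h>

  struct rpc_args { uint64_t a, b, c, d; };   /* a 32-byte request */

  /* PIO path: just store the arguments into the device window and go. */
  static void rpc_pio(volatile struct rpc_args *mmio, const struct rpc_args *r)
  {
          mmio->a = r->a;                 /* a handful of stores */
          mmio->b = r->b;
          mmio->c = r->c;
          mmio->d = r->d;
  }

  /* DMA path: the request first has to be described to the engine. */
  struct dma_desc { uint64_t src, dst; uint32_t len, ctrl; };

  static void rpc_dma(struct dma_desc *ring, uint32_t *tail,
                      volatile uint32_t *doorbell,
                      const struct rpc_args *r, uint64_t dev_dst)
  {
          struct dma_desc *d = &ring[*tail];
          d->src = (uintptr_t)r;          /* buffer must stay pinned/valid */
          d->dst = dev_dst;
          d->len = sizeof(*r);
          d->ctrl = 1;                    /* "valid" bit, say */
          *tail = (*tail + 1) & 255;      /* 256-entry ring, for example */
          *doorbell = *tail;              /* MMIO write to kick the engine */
          /* ...and the device still has to fetch the descriptor and the
             payload before anything happens. */
  }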
It was also the only ARM64 processor chip we built with a cache-coherent
interconnect until the recent CXL based products.
Overall, a very interesting paper.
Reminds me of trying to sell a micro x86-64 to AMD as a project.
The µ86 is a small x86-64 core made available as IP in Verilog
where it has/runs the same ISA as main GBOoO x86, but is placed
"out in the PCIe" interconnect--performing I/O services topo-
logically adjacent to the device itself. This allows 1ns access
latencies to DCRs and performing OS queueing of DPCs,... without
bothering the GBOoO cores.
AMD didn't buy the arguments.
Intel tried that with the Quark line of 'microcontrollers', which
appeared to be a warmed over P54 Pentium (whether it shared
microarchitecture or RTL I'm not sure).
They were too power hungry and unwieldy to be microcontrollers - they also couldn't run Debian/x86 despite having an
MMU because they were too old for the LOCK CMPXCHG instruction Debian
used (P54 didn't need to worry about concurrency, but we do now).
On Sun, 27 Apr 2025 19:13:47 +0000, Theo wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:
I think this is an 'it depends'. If you're doing RPC type operations,
it takes more work to warm up the DMA than it does to just do PIO.
Yes, it takes more cycles for a CPU to tell a device to move memory
from here to there than it takes the CPU to just move the memory from
here to there. I was, instead, referring to a CPU where it has an
MM (move memory to memory) instruction, where the instruction is
allowed to be sent over the interconnect (say CRM controller) and
have DRC perform the M2M movement locally.
They were too power hungry and unwieldy to be
microcontrollers - they also couldn't run Debian/x86 despite having an
MMU because they were too old for the LOCK CMPXCHG instruction Debian
used (P54 didn't need to worry about concurrency, but we do now).
It is not generally known, but back in ~2006 when Opteron was in full
swing, the HT fabric to the SouthBridge was actually coherent, we just
did not publish the coherence spec and thus devices could not use it.
But it was present, and if someone happened to know the protocol they
could have used the coherent nature of it.
On Sun, 27 Apr 2025 18:35:08 +0000, Theo wrote:
Lars Poulsen <lars@cleo.beagle-ears.com> wrote:
What is the difference between DMA and message-passing to another core
doing CMOV loop at the ISA level?
DMA means doing that in the micro-engine instead of at the ISA level.
Same difference.
What am I missing?
Width and specialisation.
You can absolutely write a DMA engine in software. One thing that is
troublesome is that the CPU datapath might be a lot narrower than the
number of bits you can move in a single cycle. eg on FPGA we can't
clock logic anywhere near the DRAM clock so we end up making a very
wide memory bus that runs at a lower clock - 512/1024/2048/... bits
wide. You can do that in a regular ISA using vector
registers/instructions but it adds complexity you don't need.
With anything at 7nm or smaller, the main core interconnect should be
1 cache line wide (512 bits = 64 bytes :: although IBM's choice of 256
byte cache lines might be troublesome for now.)
The other is that there's often some degree of marshalling that needs
to happen - reading scatter/gather lists, formatting packets the right
way for PCIe, filling in the right header fields, etc. It's more
efficient to do that in hardware than it is to spend multiple
instructions per packet doing it. Meanwhile the DRAM bandwidth is
being wasted.
SW is nadda-verrryyy guud at twiddling bits like HW is.
Of course you can customise your ISA with extra instructions for doing
the heavy lifting, but then arguably it's not really a CPU any more,
it's a 'programmable DMA engine'. The line between the two becomes very blurred.
In article <da5b3dea460370fc1fe8ad2323da9bc4@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:
[snip]
Reminds me of trying to sell a micro x86-64 to AMD as a project.
The µ86 is a small x86-64 core made available as IP in Verilog
where it has/runs the same ISA as main GBOoO x86, but is placed
"out in the PCIe" interconnect--performing I/O services topo-
logically adjacent to the device itself. This allows 1ns access
latencies to DCRs and performing OS queueing of DPCs,... without
bothering the GBOoO cores.
AMD didn't buy the arguments.
I can see it either way; I suppose the argument as to whether I
buy it or not comes down to, "in depends". How much control do
I, as the OS implementer, have over this core?
If it is yet another hidden core embedded somewhere deep in the
SoC complex and I can't easily interact with it from the OS,
then no thanks: we've got enough of those between MP0, MP1, MP5,
etc, etc.
On the other hand, if it's got a "normal" APIC ID, the OS has
control over it like any other LP, and it's coherent with the big
cores, then yeah, sign me up: I've been wanting something like
that for a long time now.
Consider a virtualization application. A problem with, say,
SR-IOV is that very often the hypervisor wants to interpose some
sort of administrative policy between the virtual function and
whatever it actually corresponds to, but get out of the fast
path for most IO. This implies a kind of offload architecture
where there's some (presumably software) agent dedicated to
handling IO that can be parameterized with such a policy. A
core very close to the device could handle that swimmingly,
though I'm not sure it would be enough to do it at (say) line
rate for a 400Gbps NIC or Gen5 NVMe device.
....but why x86_64? It strikes me that as long as the _data_
formats vis the software-visible ABI are the same, it doesn't
need to use the same ISA. In fact, I can see advantages to not
doing so.
- Dan C.
On Thu, 1 May 2025 13:07:07 +0000, Dan Cross wrote:
In article <da5b3dea460370fc1fe8ad2323da9bc4@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:
[snip]
Reminds me of trying to sell a micro x86-64 to AMD as a project.
The µ86 is a small x86-64 core made available as IP in Verilog
where it has/runs the same ISA as main GBOoO x86, but is placed
"out in the PCIe" interconnect--performing I/O services topo-
logically adjacent to the device itself. This allows 1ns access
latencies to DCRs and performing OS queueing of DPCs,... without
bothering the GBOoO cores.
AMD didn't buy the arguments.
I can see it either way; I suppose the argument as to whether I
buy it or not comes down to, "in depends". How much control do
I, as the OS implementer, have over this core?
Other than it being placed "away" from the centralized cores,
it runs the same ISA as the main cores has longer latency to
coherent memory and shorter latency to device control registers
--which is why it is placed close to the device itself:: latency.
The big fast centralized core is going to get microsecond latency
from MMI/O device whereas ASIC version will have handful of nano-
second latencies. So the 5 GHZ core sees ~1 microsecond while the
little ASIC sees 10 nanoseconds. ...
If it is yet another hidden core embedded somewhere deep in the
SoC complex and I can't easily interact with it from the OS,
then no thanks: we've got enough of those between MP0, MP1, MP5,
etc, etc.
On the other hand, if it's got a "normal" APIC ID, the OS has
control over it like any other LP, and its coherent with the big
cores, then yeah, sign me up: I've been wanting something like
that for a long time now.
It is just a core that is cheap enough to put in ASICs, that
can offload some I/O burden without you having to do anything
other than setting some bits in some CRs so interrupts are
routed to this core rather than some more centralized core.
Consider a virtualization application. A problem with, say,
SR-IOV is that very often the hypervisor wants to interpose some
sort of administrative policy between the virtual function and
whatever it actually corresponds to, but get out of the fast
path for most IO. This implies a kind of offload architecture
where there's some (presumably software) agent dedicated to
handling IO that can be parameterized with such a policy. A
Interesting:: Could you cite any literature, here !?!
core very close to the device could handle that swimmingly,
though I'm not sure it would be enough to do it at (say) line
rate for a 400Gbps NIC or Gen5 NVMe device.
I suspect the 400 Gb/s NIC needs a rather BIG core to handle the
traffic loads.
....but why x86_64? It strikes me that as long as the _data_
formats vis the software-visible ABI are the same, it doesn't
need to use the same ISA. In fact, I can see advantages to not
doing so.
Having the remote core run the same OS code as every other core
means the OS developers have fewer hoops to jump through. Bug-for
bug compatibility means that clearing of those CRs just leaves
the core out in the periphery idling and bothering no one.
On the other hand, you buy a motherboard with said ASIC core,
and you can boot the MB without putting a big chip in the
socket--but you may have to deal with scant DRAM since the
big centralized chip contains the memory controller.
In article <5a77c46910dd2100886ce6fc44c4c460@www.novabbs.org>,
[snip]
I suspect the 400 Gb/s NIC needs a rather BIG core to handle the
traffic loads.
Looking at
https://chipsandcheese.com/p/arms-cortex-a53-tiny-but-important, a
Cortex-A53 would not be up to it (at 1896MHz it can read <12GB/s and
write <18GB/s even to the L1 cache). However, Chester Lam notes: "A53
offers very low cache bandwidth compared to pretty much any other core
we’ve analyzed." I think, though, that a small in-order core like the
A53, but with enough load and store buffering and enough bandwidth to
I/O and the memory controller should not have a problem shoveling data
from or to a 400Gb/s NIC. With 128 bits/cycle in each direction one
would need one transfer per cycle in each direction at 3125MHz to
achieve 400Gb/s, or maybe 4GHz for a dual-issue core to allow for loop
overhead. Given that the A53 typically only has 2GHz, supporting 256
bits/cycle of transfer width (for load and store instructions, i.e.,
along the lines of AVX-256) would be more appropriate.
Going for an OoO core (something like AMD's Bobcat or Intel's
Silvermont) would help achieve the bandwidth goals without excessive
fine-tuning of the software.
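Spelling out that transfer-rate arithmetic as a quick sanity check:

  400 Gb/s = 50 GB/s per direction
  128 bits/cycle = 16 bytes/cycle
  50e9 bytes/s / 16 bytes/cycle = 3.125e9 cycles/s = 3125 MHz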
Having the remote core run the same OS code as every other core
means the OS developers have fewer hoops to jump through. Bug-for
bug compatibility means that clearing of those CRs just leaves
the core out in the periphery idling and bothering no one.
Eh...Having to jump through hoops here matters less to me for
this kind of use case than if I'm trying to use those cores for
general-purpose compute.
I think it's the same thing as Greenspun's tenth rule: First you find
that a classical DMA engine is too limiting, then you find that an A53
is too limiting, and eventually you find that it would be practical to
run the ISA of the main cores. In particular, it allows you to use
the toolchain of the main cores for developing them,
and you can also
use the facilities of the main cores (e.g., debugging features that
may be absent of the I/O cores) during development.
Having a separate ISA means I cannot
accidentally run a program meant only for the big cores on the
IO service processors.
Marking the binaries that should be able to run on the IO service
processors with some flag, and letting the component of the OS that
assigns processes to cores heed this flag is not rocket science. You
probably also don't want to run programs for the I/O processors on the
main cores; whether you use a separate flag for indicating that, or
whether one flag indicates both is an interesting question.
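A minimal sketch of that gating in C (the flag names and the scheduler
hook are made up for illustration, not any existing OS's interface):

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical per-binary flags, e.g. recorded in an ELF note or an
     e_flags bit at link time. */
  #define BIN_FLAG_IO_CORE_OK   (1u << 0)  /* may run on I/O service cores */
  #define BIN_FLAG_IO_CORE_ONLY (1u << 1)  /* must run on I/O service cores */

  enum core_class { CORE_MAIN, CORE_IO_SERVICE };

  /* Called by the (hypothetical) scheduler when it considers placing a
     runnable task on a particular core. */
  static bool placement_allowed(uint32_t bin_flags, enum core_class c)
  {
          if (c == CORE_IO_SERVICE)
                  return bin_flags & (BIN_FLAG_IO_CORE_OK | BIN_FLAG_IO_CORE_ONLY);
          /* main cores: refuse binaries built only for the I/O cores */
          return !(bin_flags & BIN_FLAG_IO_CORE_ONLY);
  }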
On the other hand, you buy a motherboard with said ASIC core,
and you can boot the MB without putting a big chip in the
socket--but you may have to deal with scant DRAM since the
big centralized chip contains the memory controller.
A neat hack for bragging rights, but not terribly practical?
Very practical for updating the firmware of the board to support the
big chip you want to put in the socket (called "BIOS FlashBack" in
connection with AMD big chips).
In a case where we did not have that
feature, and the board did not support the CPU, we had to buy another
CPU to update the firmware
<https://www.complang.tuwien.ac.at/anton/asus-p10s-c4l.html>. That's
especially relevant for AM4 boards, because the support chips make it
hard to use more than 16MB Flash for firmware, but the firmware for
all supported big chips does not fit into 16MB. However, as the case
mentioned above shows, it's also relevant for Intel boards.
In article <5a77c46910dd2100886ce6fc44c4c460@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Thu, 1 May 2025 13:07:07 +0000, Dan Cross wrote:
In article <da5b3dea460370fc1fe8ad2323da9bc4@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:
[snip]
Reminds me of trying to sell a micro x86-64 to AMD as a project.
The µ86 is a small x86-64 core made available as IP in Verilog
where it has/runs the same ISA as main GBOoO x86, but is placed
"out in the PCIe" interconnect--performing I/O services topo-
logically adjacent to the device itself. This allows 1ns access
latencies to DCRs and performing OS queueing of DPCs,... without
bothering the GBOoO cores.
AMD didn't buy the arguments.
I can see it either way; I suppose the argument as to whether I
buy it or not comes down to, "in depends". How much control do
I, as the OS implementer, have over this core?
Other than it being placed "away" from the centralized cores,
it runs the same ISA as the main cores has longer latency to
coherent memory and shorter latency to device control registers
--which is why it is placed close to the device itself:: latency.
The big fast centralized core is going to get microsecond latency
from MMI/O device whereas ASIC version will have handful of nano-
second latencies. So the 5 GHZ core sees ~1 microsecond while the
little ASIC sees 10 nanoseconds. ...
Yes, I get the argument for WHY you'd do it, I just want to make
sure that it's an ordinary core (albeit one that is far away
from the sockets with the main SoC complexes) that I interact
with in the usual manner. Compare to, say, MP1 or MP0 on AMD
Zen, where it runs its own (proprietary) firmware that I
interact with via an RPC protocol over an AXI bus, if I interact
with it at all: most OEMs just punt and run AGESA (we don't).
If it is yet another hidden core embedded somewhere deep in the
SoC complex and I can't easily interact with it from the OS,
then no thanks: we've got enough of those between MP0, MP1, MP5,
etc, etc.
On the other hand, if it's got a "normal" APIC ID, the OS has
control over it like any other LP, and its coherent with the big
cores, then yeah, sign me up: I've been wanting something like
that for a long time now.
It is just a core that is cheap enough to put in ASICs, that
can offload some I/O burden without you having to do anything
other than setting some bits in some CRs so interrupts are
routed to this core rather than some more centralized core.
Sounds good.
Consider a virtualization application. A problem with, say,
SR-IOV is that very often the hypervisor wants to interpose some
sort of administrative policy between the virtual function and
whatever it actually corresponds to, but get out of the fast
path for most IO. This implies a kind of offload architecture
where there's some (presumably software) agent dedicated to
handling IO that can be parameterized with such a policy. A
Interesting:: Could you cite any literature, here !?!
Sure. This paper is a bit older, but gets at the main points: https://www.usenix.org/system/files/conference/nsdi18/nsdi18-firestone.pdf
I don't know if the details are public for similar technologies
from Amazon or Google.
core very close to the device could handle that swimmingly,
though I'm not sure it would be enough to do it at (say) line
rate for a 400Gbps NIC or Gen5 NVMe device.
I suspect the 400 Gb/s NIC needs a rather BIG core to handle the
traffic loads.
Indeed. Part of the challenge for the hyperscalers is in
meeting that demand while not burning too many host resources,
which are the thing they're actually selling their customer in
the first place. A lot of folks are pushing this off to the NIC
itself, and I've seen at least one team that implemented NVMe in
firmware on a 100Gbps NIC, exposed via SR-IOV, as part of a
disaggregated storage architecture.
Another option is to push this to the switch; things like Intel
Tofino2 were well-positioned for this, but of course Intel, in its
infinite wisdom and vision, canc'ed Tofino.
....but why x86_64? It strikes me that as long as the _data_
formats vis the software-visible ABI are the same, it doesn't
need to use the same ISA. In fact, I can see advantages to not
doing so.
Having the remote core run the same OS code as every other core
means the OS developers have fewer hoops to jump through. Bug-for
bug compatibility means that clearing of those CRs just leaves
the core out in the periphery idling and bothering no one.
Eh...Having to jump through hoops here matters less to me for
this kind of use case than if I'm trying to use those cores for general-purpose compute. Having a separate ISA means I cannot
accidentally run a program meant only for the big cores on the
IO service processors. As long as the OS has total control over
the execution of the core, and it participates in whatever cache
coherency scheme the rest of the system uses, then the ISA just
isn't that important.
On the other hand, you buy a motherboard with said ASIC core,
and you can boot the MB without putting a big chip in the
socket--but you may have to deal with scant DRAM since the
big centralized chip contains the memory controller.
A neat hack for bragging rights, but not terribly practical?
Anyway, it's a neat idea. It's very reminiscent of IBM channel
controllers, in a way.
- Dan C.
In article <2025May2.073450@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
I think it's the same thing as Greenspun's tenth rule: First you find
that a classical DMA engine is too limiting, then you find that an A53
is too limiting, and eventually you find that it would be practical to
run the ISA of the main cores. In particular, it allows you to use
the toolchain of the main cores for developing them,
These are issues solvable with the software architecture and
build system for the host OS.
The important characteristic is
that the software coupling makes architectural sense, and that
simply does not require using the same ISA across IPs.
Indeed, consider AMD's Zen CPUs; the PSP/ASP/whatever it's
called these days is an ARM core while the big CPUs are x86.
I'm pretty sure there's an Xtensa DSP in there to do DRAM training
and PCIe link training.
Similarly with the ME on Intel.
A BMC might be running on whatever.
We increasingly see ARM
based SBCs that have small RISC-V microcontroller-class cores
embedded in the SoC for exactly this sort of thing.
Our hardware RoT
The problem is when such service cores are hidden (as they are
in the case of the PSP, SMU, MPIO, and similar components, to
use AMD as the example) and treated like black boxes by
software. It's really cool that I can configure the IO crossbar
in useful ways tailored to specific configurations, but it's much
less cool that I have to do what amounts to an RPC over the SMN
to some totally undocumented entity somewhere in the SoC to do
it. Bluntly, as an OS person, I do not want random bits of code
running anywhere on my machine that I am not at least aware of
(yes, this includes firmware blobs on devices).
and you can also
use the facilities of the main cores (e.g., debugging features that
may be absent of the I/O cores) during development.
This is interesting, but we've found it more useful going the
other way around. We do most of our debugging via the SP.
Since the SP is also responsible for system initialization and
holding x86 in reset until we're ready for it to start
running, it's the obvious nexus for debugging the system
holistically.
Marking the binaries that should be able to run on the IO service
processors with some flag, and letting the component of the OS that
assigns processes to cores heed this flag is not rocket science.
I agree, that's easy. And yet, mistakes will be made, and there
will be tension between wanting to dedicate those CPUs to IO
services and wanting to use them for GP programs: I can easily
imagine a paper where someone modifies a scheduler to move IO
bound programs to those cores. Using a different ISA obviates
most of that, and provides an (admittedly modest) security benefit.
And if I already have to modify or configure the OS to
accommodate the existence of these things in the first place,
then accommodating an ISA difference really isn't that much
extra work. The critical observation is that a typical SMP view
of the world no longer makes sense for the system architecture,
and trying to shoehorn that model onto the hardware reality is
just going to cause frustration.
On the other hand, you buy a motherboard with said ASIC core,
and you can boot the MB without putting a big chip in the
socket--but you may have to deal with scant DRAM since the
big centralized chip contains the memory controller.
A neat hack for bragging rights, but not terribly practical?
Very practical for updating the firmware of the board to support the
big chip you want to put in the socket (called "BIOS FlashBack" in
connection with AMD big chips).
"BIOS", as loaded from the EFS by the ABL on the PSP on EPYC
class chips, is usually stored in a QSPI flash on the main
board (though starting with Turin you _can_ boot via eSPI).
Strictly speaking, you don't _need_ an x86 core to rewrite that.
On our machines, we do that from the SP, but we don't use AGESA
or UEFI: all of the platform enablement stuff done in PEI and
DXE we do directly in the host OS.
Also, on AMD machines, again considering EPYC, it's up to system
software running on x86 to direct either the SMU or MPIO to
configure DXIO and the rest of the fabric before PCIe link
training even begins (releasing PCIe from PERST is done by
either the SMU or MPIO, depending on the specific
microarchitecture). Where are these cores, again? If they're
close to the devices, are they in the root complex or on the far
side of a bridge? Can they even talk to the rest of the board?
In a case where we did not have that
feature, and the board did not support the CPU, we had to buy another
CPU to update the firmware
<https://www.complang.tuwien.ac.at/anton/asus-p10s-c4l.html>. That's
especially relevant for AM4 boards, because the support chips make it
hard to use more than 16MB Flash for firmware, but the firmware for
all supported big chips does not fit into 16MB. However, as the case
mentioned above shows, it's also relevant for Intel boards.
You shouldn't need to boot the host operating system to do that,
though I get that on most consumer-grade machines you'll do it via
something that interfaces with AGESA or UEFI.
Most server-grade
machines will have a BMC that can do this independently of the
main CPU,
and I should be clear that I'm discounting use cases
for consumer grade boards, where I suspect something like this
is less interesting than on server hardware.
On Fri, 2 May 2025 2:15:24 +0000, Dan Cross wrote:
On the other hand, you buy a motherboard with said ASIC core,
and you can boot the MB without putting a big chip in the
socket--but you may have to deal with scant DRAM since the
big centralized chip contains teh memory controller.
A neat hack for bragging rights, but not terribly practical?
Anyway, it's a neat idea. It's very reminiscent of IBM channel
controllers, in a way.
It is more like the Peripheral Processors of the CDC 6600, which run
the ISA of a CDC 6600 without as much fancy execution in the periphery.
In article <2025May2.073450@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
I think it's the same thing as Greenspun's tenth rule: First you find
that a classical DMA engine is too limiting, then you find that an A53
is too limiting, and eventually you find that it would be practical to
run the ISA of the main cores. In particular, it allows you to use
the toolchain of the main cores for developing them,
These are issues solvable with the software architecture and
build system for the host OS.
Certainly, one can work around many bad decisions, and in reality one
has to work around some bad decisions, but the issue here is not
whether "the issues are solvable", but which decision leads to better
or worse consequences.
The important characteristic is
that the software coupling makes architectural sense, and that
simply does not require using the same ISA across IPs.
IP? Internet Protocol?
Software Coupling sounds to me like a concept
from Constantine out of my Software engineering class.
I guess you
did not mean either, but it's unclear what you mean.
In any case, I have made arguments why it would make sense to use the
same ISA as for the OS for programming the cores that replace DMA
engines. I will discuss your counterarguments below, but the most
important one to me seems to be that these cores would cost more than
with a different ISA. There is something to that, but when the
application ISA is cheap to implement (e.g., RV64GC), that cost is
small; it may be more an argument for also selecting the
cheap-to-implement ISA for the OS/application cores.
Indeed, consider AMD's Zen CPUs; the PSP/ASP/whatever it's
called these days is an ARM core while the big CPUs are x86.
I'm pretty sure there's an Xtensa DSP in there to do DRAM and
timing and PCIe link training.
The PSPs are not programmable by the OS or application programmers, so
using the same ISA would not benefit the OS or application
programmers.
By contrast, the idea for the DMA replacement engines is
that they are programmable by the OS and maybe the application
programmers, and that changes whether the same ISA is beneficial.
What is "ASP/whatever"?
Similarly with the ME on Intel.
Last I read about it, ME uses a core developed by Intel with IA-32 or
AMD64; but in any case, the ME is not programmable by OS or
application programmers, either.
A BMC might be running on whatever.
Again, a BMC is not programmable by OS or application programmers.
We increasingly see ARM
based SBCs that have small RISC-V microcontroller-class cores
embedded in the SoC for exactly this sort of thing.
That's interesting; it points to RISC-V being cheaper to implement
than ARM. As for "that sort of thing", they are all not programmable
by OS or application programmers, so see above.
Our hardware RoT
?
The problem is when such service cores are hidden (as they are
in the case of the PSP, SMU, MPIO, and similar components, to
use AMD as the example) and treated like black boxes by
software. It's really cool that I can configure the IO crossbar
in useful way tailored to specific configurations, but it's much
less cool that I have to do what amounts to an RPC over the SMN
to some totally undocumented entity somewhere in the SoC to do
it. Bluntly, as an OS person, I do not want random bits of code
running anywhere on my machine that I am not at least aware of
(yes, this includes firmware blobs on devices).
Well, one goes with the other. If you design the hardware for being
programmed by the OS programmers, you use the same ISA for all the
cores that the OS programmers program, whereas if you design the
hardware as programmed by "firmware" programmers, you use a
cheap-to-implement ISA and design the whole thing such that it is
opaque to OS programmers and only offers certain capabilities to
OS programmers.
And that's not just limited to ISAs. A very successful example is the
way that flash memory is usually exposed to OSs: as a block device
like a plain old hard disk, and all the idiosyncrasies of flash are
hidden in the device behind a flash translation layer that is
implemented by a microcontroller on the device.
What's "SMN"?
and you can also
use the facilities of the main cores (e.g., debugging features that
may be absent of the I/O cores) during development.
This is interesting, but we've found it more useful going the
other way around. We do most of our debugging via the SP.
Since the SP is also responsible for system initialization and
holding x86 in reset until we're ready for it to start
running, it's the obvious nexus for debugging the system
holistically.
Sure, for debugging on the core-dump level that's useful. I was
thinking about watchpoint and breakpoint registers and performance
counters that one may not want to implement on the DMA-replacement
core, but that is implemented on the OS/application cores.
Marking the binaries that should be able to run on the IO service
processors with some flag, and letting the component of the OS that
assigns processes to cores heed this flag is not rocket science.
I agree, that's easy. And yet, mistakes will be made, and there
will be tension between wanting to dedicate those CPUs to IO
services and wanting to use them for GP programs: I can easily
imagine a paper where someone modifies a scheduler to move IO
bound programs to those cores. Using a different ISA obviates
most of that, and provides an (admittedly modest) security benefit.
If there really is such tension, that indicates that such cores would
be useful for general-purpose use. That makes the case for using the
same ISA even stronger.
As for "mistakes will be made", that also goes the other way: With a
separate toolchain for the DMA-replacement ISA, there is lots of
opportunity for mistakes.
As for "security benefit", where is that supposed to come from?d
What
attack scenario do you have in mind where that "security benefit"
could materialize?
And if I already have to modify or configure the OS to
accommodate the existence of these things in the first place,
then accommodating an ISA difference really isn't that much
extra work. The critical observation is that a typical SMP view
of the world no longer makes sense for the system architecture,
and trying to shoehorn that model onto the hardware reality is
just going to cause frustration.
The shared-memory multiprocessing view of the world is very
successful, while distributed-memory computers are limited to
supercomputing and other areas where hardware cost still dominates
over software cost (i.e., where the software crisis has not happened
yet); as an example of the lack of success of the distributed-memory
paradigm, take the PlayStation 3; programmers found it too hard to
work with, so they did not use the hardware well, and eventually Sony
decided to go for an SMP machine for the PlayStation 4 and 5.
OTOH, one can say that the way many peripherals work on
general-purpose computers is more along the lines of
distributed-memory; but that's probably due to the relative hardware
and software costs for that peripheral. Sure, the performance
characteristics are non-uniform (NUMA) in many cases, but 1) caches
tend to smooth over that, and 2) most of the code is not
performance-critical, so it just needs to run, which is easier to
achieve with SMP and harder with distributed memory.
Sure, people have argued for advantages of other models for decades,
like you do now, but SMP has usually won.
On the other hand, you buy a motherboard with said ASIC core,
and you can boot the MB without putting a big chip in the
socket--but you may have to deal with scant DRAM since the
big centralized chip contains teh memory controller.
A neat hack for bragging rights, but not terribly practical?
Very practical for updating the firmware of the board to support the
big chip you want to put in the socket (called "BIOS FlashBack" in
connection with AMD big chips).
"BIOS", as loaded from the EFS by the ABL on the PSP on EPYC
class chips, is usually stored in a QSPI flash on the main
board (though starting with Turin you _can_ boot via eSPI).
Strictly speaking, you don't _need_ an x86 core to rewrite that.
On our machines, we do that from the SP, but we don't use AGESA
or UEFI: all of the platform enablement stuff done in PEI and
DXE we do directly in the host OS.
EFS? ABL? QSPI? eSPI? PEI? DXE?
Anyway, what you do in your special setup does not detract from the
fact that being able to flash the firmware without having a working
main core has turned out to be so useful that out of 218 AM5
motherboards offered in Austria <https://geizhals.at/?cat=mbam5>, 203
have that feature.
Also, on AMD machines, again considering EPYC, it's up to system
software running on x86 to direct either the SMU or MPIO to
configure DXIO and the rest of the fabric before PCIe link
training even begins (releasing PCIe from PERST is done by
either the SMU or MPIO, depending on the specific
microarchitecture). Where are these cores, again? If they're
close to the devices, are they in the root complex or on the far
side of a bridge? Can they even talk to the rest of the board?
The core that does the flashing obviously is on the board, not on the
CPU package (which may be absent). I do not know where on the board
it is.
Typically only one USB port can be used for that, so that may
indicate that a special path may be used for that without initializing
all the USB ports and the other hardware that's necessary for that; I
think that some USB ports are directly connected to the CPU package,
so those would not work anyway.
In a case where we did not have that
feature, and the board did not support the CPU, we had to buy another
CPU to update the firmware
<https://www.complang.tuwien.ac.at/anton/asus-p10s-c4l.html>. That's
especially relevant for AM4 boards, because the support chips make it
hard to use more than 16MB Flash for firmware, but the firmware for
all supported big chips does not fit into 16MB. However, as the case
mentioned above shows, it's also relevant for Intel boards.
You shouldn't need to boot the host operating system to do that,
though I get on most consumer-grade machines you'll do it via
something that interfaces with AGESA or UEFI.
In the bad old days you had to boot into DOS and run a DOS program for
flashing the BIOS. Or worse, Windows; not very useful if you don't
have Windows installed on the computer (DOS at least could be booted
from a floppy disk). My last few experiences in that direction were
firmware flashing as a "BIOS" feature, and the flashback feature
(which has its own problems, because communication with the user is
limited).
Most server-grade
machines will have a BMC that can do this independently of the
main CPU,
And just in another posting you wrote "but not terribly practical?".
The board I mentioned above where we had to buy a separate CPU for
flashing mentioned a BMC on the feature list, but when we looked in
the manual, we found that the BMC is not delivered with the board, but
has to be bought separately. There was also no mention that one can
use the BMC for flashing the BIOS.
and I should be clear that I'm discounting use cases
for consumer grade boards, where I suspect something like this
is less interesting than on server hardware.
What makes you think so? And what do you mean with "something like
this"?
1) "BIOS flashback" is a mostly-standard feature in AM5 (i.e., >consumer-grade) boards.
2) DMA has been a standard feature in various forms on consumer
hardware since the first IBM PC in 1981, and replacing the DMA engines
with cores running a general-purpose ISA accessible to OS designers
will not be limited to servers;
if hardware designers and OS
developers put development time into that, there is no reason for
limiting that effort to servers. The existence of the LPE-Cores on
Meteor Lake (not a server chip) and the in-order ARM cores on various
smartphone SOCs, the existence of P-Cores and E-Cores on Intel
consumer-grade CPUs, while the server versions of these CPUs have the
E-Cores disabled, and the uniformity of cores on the dedicated server
CPUs indicates that non-uniform cores seem to be hard to sell in
server space.
In article <2025May2.073450@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <5a77c46910dd2100886ce6fc44c4c460@www.novabbs.org>,
[snip]
I suspect the 400 Gb/s NIC needs a rather BIG core to handle the
traffic loads.
Looking at
https://chipsandcheese.com/p/arms-cortex-a53-tiny-but-important, a
Cortex-A53 would not be up to it (at 1896MHz it can read <12GB/s and
write <18GB/s even to the L1 cache). However, Chester Lam notes: "A53
offers very low cache bandwidth compared to pretty much any other core
we’ve analyzed." I think, though, that a small in-order core like the
A53, but with enough load and store buffering and enough bandwidth to
I/O and the memory controller should not have a problem shoveling data
from or to a 400Gb/s NIC. With 128 bits/cycle in each direction one
would need one transfer per cycle in each direction at 3125MHz to
achieve 400Gb/s, or maybe 4GHz for a dual-issue core to allow for loop
overhead. Given that the A53 typically only has 2GHz, supporting 256
bits/cycle of transfer width (for load and store instructions, i.e.,
along the lines of AVX-256) would be more appropriate.
Eh...Having to jump through hoops here matters less to me for
this kind of use case than if I'm trying to use those cores for
general-purpose compute.
I think it's the same thing as Greenspun's tenth rule: First you find
that a classical DMA engine is too limiting, then you find that an A53
is too limiting, and eventually you find that it would be practical to
run the ISA of the main cores. In particular, it allows you to use
the toolchain of the main cores for developing them,
These are issues solvable with the software architecture and
build system for the host OS. The important characteristic is
that the software coupling makes architectural sense, and that
simply does not require using the same ISA across IPs.
At work, our service processor (granted, outside of the SoC but
tightly coupled at the board level) is a Cortex-M7, but we wrote
the OS for that,
and we control the host OS that runs on x86,
so the SP and big CPUs can be mutually aware. Our hardware RoT
is a smaller Cortex-M. We don't have a BMC on our boards;
everything that it does is either done by the SP or built into
the host OS, both of which are measured by the RoT.
The problem is when such service cores are hidden (as they are
in the case of the PSP, SMU, MPIO, and similar components, to
use AMD as the example) and treated like black boxes by
software. It's really cool that I can configure the IO crossbar
in useful way tailored to specific configurations, but it's much
less cool that I have to do what amounts to an RPC over the SMN
to some totally undocumented entity somewhere in the SoC to do
it. Bluntly, as an OS person, I do not want random bits of code
running anywhere on my machine that I am not at least aware of
(yes, this includes firmware blobs on devices).
And if I already have to modify or configure the OS to
accommodate the existence of these things in the first place,
then accommodating an ISA difference really isn't that much
extra work. The critical observation is that a typical SMP view
of the world no longer makes sense for the system architecture,
and trying to shoehorn that model onto the hardware reality is
just going to cause frustration. Better to acknowledge that the
Also, on AMD machines, again considering EPYC, it's up to system
software running on x86 to direct either the SMU or MPIO to
configure DXIO and the rest of the fabric before PCIe link
training even begins (releasing PCIe from PERST is done by
either the SMU or MPIO, depending on the specific
microarchitecture). Where are these cores, again? If they're
close to the devices, are they in the root complex or on the far
side of a bridge? Can they even talk to the rest of the board?
cross@spitfire.i.gajendra.net (Dan Cross) writes:
Looking at
https://chipsandcheese.com/p/arms-cortex-a53-tiny-but-important, a
Cortex-A53 would not be up to it (at 1896MHz it can read <12GB/s and
write <18GB/s even to the L1 cache). However, Chester Lam notes: "A53
offers very low cache bandwidth compared to pretty much any other core
we’ve analyzed." I think, though, that a small in-order core like the
A53, but with enough load and store buffering and enough bandwidth to
I/O and the memory controller should not have a problem shoveling data
from or to a 400Gb/s NIC. With 128 bits/cycle in each direction one
would need one transfer per cycle in each direction at 3125MHz to
achieve 400Gb/s, or maybe 4GHz for a dual-issue core to allow for loop
overhead.
Running any SoC at 3+ GHz requires significant effort in the
back-end and to ensure timing closure on the front end (and
affects floorplanning). All this adds to the cost to build
and manufacture the chips.
It may be more productive to consider widening the internal
buses to be 256 or 512 bits wide.