Brett <ggtgp@yahoo.com> writes:
When a modern CPU takes an interrupt it does not suspend the current
processing; instead it just starts fetching code from the new process while
letting computations in the pipeline continue to completion. The OoO engine
can have a thousand instructions in flight. At some point the resources start
getting dedicated to the new process, and the old process is drained out or
maybe actually stopped.
Not necessarily the case. For various reasons, entry to the interrupt handler may actually have a barrier to ensure that outstanding stores
are committed (store buffer drained) before continuing. This is for
error containment purposes.
Scott Lurndal wrote:
Brett <ggtgp@yahoo.com> writes:
When a modern CPU takes an interrupt it does not suspend the current
processing; instead it just starts fetching code from the new process while
letting computations in the pipeline continue to completion. The OoO engine
can have a thousand instructions in flight. At some point the resources start
getting dedicated to the new process, and the old process is drained out or
maybe actually stopped.
Not necessarily the case. For various reasons, entry to the interrupt
handler may actually have a barrier to ensure that outstanding stores
are committed (store buffer drained) before continuing. This is for
error containment purposes.
Yes but pipelining interrupts is trickier than that.
First there is pipelining the super/user mode change. This requires fetch
to have a future copy of the mode, which is used for instruction address
translation, and a mode flag attached to each instruction or uOp;
each checkpoint saves a mode copy, and retire has the committed mode copy.
Privileged instructions are checked by decode to ensure their fetch mode
was correct.
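The mode-pipelining scheme above can be sketched in a few lines. This is a toy model with invented names (UOp, decode_check, the mode strings): each uOp carries the privilege mode that was current when it was fetched, so decode can check privileged instructions without waiting for the committed mode at retire.

```python
from dataclasses import dataclass

@dataclass
class UOp:
    opcode: str
    fetch_mode: str   # "user" or "super", captured at fetch time
    privileged: bool

def decode_check(uop: UOp) -> bool:
    """Return True if the uOp passes the privilege check at decode."""
    if uop.privileged and uop.fetch_mode != "super":
        return False          # would raise a privilege fault
    return True

# The fetch-side "future" mode flips as soon as the interrupt entry is
# fetched, long before it retires; younger uOps inherit the new mode.
future_mode = "user"
stream = [UOp("add", future_mode, False)]
future_mode = "super"                      # interrupt entry fetched
stream.append(UOp("rd_pcr", future_mode, True))

results = [decode_check(u) for u in stream]
print(results)   # both pass: the privileged op was fetched in super mode
```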
On interrupt, if the core starts fetching instructions from the handler and
stuffing them into the instruction queue (ROB) while there are still
instructions in flight, and if those older instructions get a branch
mispredict, then the purge of mispredicted older instructions will also
purge the interrupt handler. Also the older instructions might trigger
an exception, delivery of which would take precedence over the delivery
of the interrupt and again purge the handler. Also the older instructions
might raise the core's interrupt priority, masking the interrupt that
it just tried to accept.
The interrupt controller can't complete the hand-off of the interrupt
to a core until it knows that hand-off won't get purged by a mispredict,
exception or priority change. So the hand-off becomes like a two-phase
commit, where the controller offers an available core an interrupt,
the core accepts it tentatively and starts executing the handler,
and the core later either commits or rejects the hand-off.
While the interrupt is in limbo the controller marks it as tentative
but keeps its position in the interrupt queue.
This is where your point comes in.
Because the x86/x64 automatically pushes the saved context (RIP, RSP,
RFLAGS) onto the kernel stack, that context store can only happen when the
entry to the interrupt sequence reaches retire, which means all older
instructions must have retired. At that point the core sends a commit
signal to the interrupt controller and begins its stores, and the controller
removes the interrupt from its queue. If anything purges the hand-off, then
the core sends a reject signal to the controller, which returns the interrupt
to a pending state at its position at the front of its queue.
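The two-phase hand-off described above can be modeled as a tiny state machine. This is a sketch with invented names (InterruptController, offer/commit/reject), not any real controller's protocol: the offered interrupt stays at the front of the queue while tentative, so a rejected hand-off costs nothing but the replay.

```python
from collections import deque

class InterruptController:
    def __init__(self):
        self.queue = deque()       # pending interrupts, front = oldest
        self.tentative = None      # interrupt currently offered to a core

    def offer(self):
        """Phase 1: offer the oldest pending interrupt, keep its slot."""
        if self.queue and self.tentative is None:
            self.tentative = self.queue[0]   # stays queued, marked tentative
        return self.tentative

    def commit(self):
        """Phase 2a: the handler entry retired; dequeue for real."""
        self.queue.popleft()
        self.tentative = None

    def reject(self):
        """Phase 2b: hand-off purged by mispredict/exception/priority
        change; the interrupt simply reverts to pending at the front."""
        self.tentative = None

ctl = InterruptController()
ctl.queue.extend(["irq_nic", "irq_disk"])

assert ctl.offer() == "irq_nic"
ctl.reject()                       # an older mispredict purged the handler
assert ctl.offer() == "irq_nic"    # same interrupt re-offered, order kept
ctl.commit()                       # this time the entry reached retire
assert ctl.offer() == "irq_disk"
```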
On 10/3/24 10:00, Anton Ertl wrote:
Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
in the HotSpot Virtual Machine" at MPLR 2024. He reported that for
some programs the counters used for profiling the program result in
cache contention due to true or false sharing among threads.
The traditional software mitigation for that problem is to split the
counters into per-thread or per-core instances. But for heavily
multi-threaded programs running on machines with many cores the cost
of this mitigation is substantial.
For profiling, do we really need accurate counters? They just need to
be statistically accurate I would think.
Instead of incrementing a counter, just store a non-zero immediate into
a zero-initialized byte array at a per-"counter" index. There's no
RMW data dependency, just a store, so it should have little impact on the
pipeline.
A profiling thread loops through the byte array, incrementing an actual
counter when it sees a non-zero byte, and resets the byte to zero. You
could use vector ops to process the array.
If the stores were fast enough, you could do 2 or more stores at
hashed indices, different hash for each store. Sort of a counting
Bloom filter. The effective count would be the minimum of the
hashed counts.
No idea how feasible this would be though.
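The store-only scheme with hashed indices sketched above looks roughly like this in software terms. All the parameters here (array size, the two hash multipliers, function names) are invented for illustration; the point is that the hot path is a plain store, and the profiler thread folds flags into real counters, taking the minimum over the hashed slots as in a counting Bloom filter.

```python
NSLOTS = 1 << 10
flags = bytearray(NSLOTS)          # touched by the profiled threads
counts = [0] * NSLOTS              # owned by the profiling thread

def hashes(site_id):
    # two cheap, different hashes per profiling site (illustrative only)
    return [(site_id * m) % NSLOTS for m in (2654435761, 40503)]

def hit(site_id):
    for i in hashes(site_id):
        flags[i] = 1               # store only: no read-modify-write

def sweep():
    # profiling thread: fold set flags into counters, then clear them
    for i in range(NSLOTS):
        if flags[i]:
            counts[i] += 1
            flags[i] = 0

def estimate(site_id):
    # minimum over the hashed slots bounds the true activity from above
    return min(counts[i] for i in hashes(site_id))

for _ in range(3):                 # three sweep intervals with activity
    hit(42)
    sweep()
print(estimate(42))                # one increment per sweep site 42 fired in
```

Note the count is per sweep interval, not per execution: any number of hits between sweeps collapses to one increment, which is exactly the "statistically accurate" trade-off proposed.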
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 10/3/2024 7:00 AM, Anton Ertl wrote:....
Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
in the HotSpot Virtual Machine" at MPLR 2024. He reported that for
some programs the counters used for profiling the program result in
cache contention due to true or false sharing among threads.
The traditional software mitigation for that problem is to split the
counters into per-thread or per-core instances. But for heavily
multi-threaded programs running on machines with many cores the cost
of this mitigation is substantial.
....For the HotSpot application, the
eventual answer was that they live with the cost of cache contention
for the programs that have that problem. After some minutes the hot
parts of the program are optimized, and cache contention is no longer
a problem.
If the per-thread counters are properly padded to an L2 cache line and
properly aligned on cache-line boundaries, well, they should not cause
false sharing with other cache lines... Right?
Sure, that's what the first sentence of the second paragraph you cited
(and which I cited again) is about. Next, read the next sentence.
Maybe I should give an example (fully made up on the spot, read the
paper for real numbers): If HotSpot uses, on average, one counter per
conditional branch, and assuming a conditional branch every 10 static
instructions (each having, say, 4 bytes), with 1MB of generated code
and 8 bytes per counter, that's 200KB of counters. But these counters
are shared between all threads, so for code running on many cores you
get true and false sharing.
As mentioned, the usual mitigation is per-core counters. With a
256-core machine, we now have 51.2MB of counters for 1MB of executable
code. Now this is Java, so there might be quite a bit more executable
code and correspondingly more counters. They eventually decided that
the benefit of reduced cache coherence traffic is not worth that cost
(or the cost of a hardware mechanism), as described in the last
paragraph, from which I cited the important parts.
- anton
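Anton's back-of-envelope numbers above can be re-run directly (all inputs are his illustrative assumptions, not measurements from the paper):

```python
code_bytes    = 1 * 1024 * 1024   # 1 MB of generated code
instr_bytes   = 4                 # bytes per instruction
branch_every  = 10                # one conditional branch per 10 instructions
counter_bytes = 8

instructions = code_bytes // instr_bytes
counters     = instructions // branch_every
shared_kb    = counters * counter_bytes / 1024
print(shared_kb)                  # ~200 KB of shared counters

cores = 256
per_core_mb = counters * counter_bytes * cores / (1024 * 1024)
print(per_core_mb)                # ~51.2 MB once replicated per core
```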
On 12/25/24 1:30 PM, MitchAlsup1 wrote:
On Wed, 25 Dec 2024 17:50:12 +0000, Paul A. Clayton wrote:
On 10/5/24 11:11 AM, EricP wrote:--------------------------
MitchAlsup1 wrote:[snip]
But voiding doesn't look like it works for exceptions or
conflicting
interrupt priority adjustments. In those cases purging the
interrupt
handler and rejecting the hand-off looks like the only option.
Should exceptions always have priority? It seems to me that if a
thread is low enough priority to be interrupted, it is low enough
priority to have its exception processing interrupted/delayed.
It depends on what you mean::
a) if you mean that exceptions are prioritized and the highest
priority exception is the one taken, then OK you are working
in an ISA that has multiple exceptions per instruction. Most
RISC ISAs do not have this property.
The context was any exception taking priority over an interrupt
that was accepted, at least on a speculative path. I.e., the
statement would have been more complete as "Should exceptions
always (or ever) have priority over an accepted interrupt?"
Sooner or later an ISR has to actually deal with the MMI/O
control registers associated with the <ahem> interrupt.
Yes, but multithreading could hide some of those latencies in
terms of throughput.
On 10/5/24 11:11 AM, EricP wrote:--------------------------
MitchAlsup1 wrote:[snip]
But voiding doesn't look like it works for exceptions or conflicting
interrupt priority adjustments. In those cases purging the interrupt
handler and rejecting the hand-off looks like the only option.
Should exceptions always have priority? It seems to me that if a
thread is low enough priority to be interrupted, it is low enough
priority to have its exception processing interrupted/delayed.
(There might be cases where normal operation allows deadlines to
be met with lower priority and unusual extended operation requires
high priority/resource allocation. Boosting the priority/resource
budget of a thread/task to meet deadlines seems likely to make
system-level reasoning more difficult. It seems one could also
create an inflationary spiral.)
With substantial support for Switch-on-Event MultiThreading, it
is conceivable that a lower priority interrupt could be held
"resident" after being interrupted by a higher priority interrupt.
A chunked ROB could support such, but it is not clear that such
is desirable even ignoring complexity factors.
Being able to overlap latency of a memory-mapped I/O access (or
other slow access) with execution of another thread seems
attractive and even an interrupt handler with few instructions
might have significant run time. Since interrupt blocking is
used to avoid core-localized resource contention, software would
have to know about such SoEMT.
(Interrupts seem similar to certain server software threads in
having lower ILP from control dependencies and more frequent high
latency operations, which hints that multithreading may be
desirable.)
On Sat, 5 Oct 2024 15:11:29 +0000, EricP wrote:
MitchAlsup1 wrote:
On Fri, 4 Oct 2024 18:11:23 +0000, EricP wrote:
On interrupt, if the core starts fetching instructions from the handler and
stuffing them into the instruction queue (ROB) while there are still
instructions in flight, and if those older instructions get a branch
mispredict, then the purge of mispredicted older instructions will also
purge the interrupt handler.
Not necessary, you purge all of the younger instructions from the
thread at retirement, but none of the instructions associated with
the new <interrupt> thread at the front.
That's difficult with a circular buffer for the instruction queue/rob
as you can't edit the order. For a branch mispredict you might be able
to mark a circular range of entries as voided, and leave the entries
to be recovered serially at retire.
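The void-marking idea for a circular ROB can be illustrated with a toy model (class name, sizes, and instruction names all invented): entries cannot be reordered or deleted from the middle, but a circular range can be flagged void and then reclaimed serially as retire walks past it.

```python
class CircularROB:
    def __init__(self, size=8):
        self.entries = [None] * size   # each entry: [name, voided?]
        self.head = self.tail = 0      # head = oldest, tail = next free
        self.size = size

    def insert(self, name):
        idx = self.tail
        self.entries[idx] = [name, False]
        self.tail = (self.tail + 1) % self.size
        return idx

    def void_from(self, idx):
        """Mark idx..tail void (younger entries on the wrong path)."""
        while idx != self.tail:
            self.entries[idx][1] = True
            idx = (idx + 1) % self.size

    def retire_all(self):
        retired = []
        while self.head != self.tail:
            name, void = self.entries[self.head]
            if not void:
                retired.append(name)   # voided entries reclaimed silently
            self.head = (self.head + 1) % self.size
        return retired

rob = CircularROB()
rob.insert("i0")
bad = rob.insert("bc")                # branch that will mispredict
rob.insert("wrong1")
rob.insert("wrong2")
rob.void_from((bad + 1) % rob.size)   # void only the younger, wrong path
rob.insert("right1")                  # refetched correct-path instruction
retired = rob.retire_all()
print(retired)                        # ['i0', 'bc', 'right1']
```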
Every instruction needs a way to place itself before or after
any mispredictable branch. Once you know which branch mispredicted, you
know which instructions will not retire, transitively. All you really need
to know is whether the instruction will retire or not. The rest of the
mechanics play out naturally in the pipeline.
But voiding doesn't look like it works for exceptions or conflicting
interrupt priority adjustments. In those cases purging the interrupt
handler and rejecting the hand-off looks like the only option.
Can you make this statement again and use different words?
If one can live with the occasional replay of an interrupt hand-off and
handler execute due to mispredict/exception/interrupt_priority_adjust
then the interrupt pipelining looks much simplified.
You just have to cover the depth of the pipeline.
On Wed, 25 Dec 2024 17:50:12 +0000, Paul A. Clayton wrote:
On 10/5/24 11:11 AM, EricP wrote:--------------------------
MitchAlsup1 wrote:[snip]
But voiding doesn't look like it works for exceptions or conflicting
interrupt priority adjustments. In those cases purging the interrupt
handler and rejecting the hand-off looks like the only option.
Should exceptions always have priority? It seems to me that if a
thread is low enough priority to be interrupted, it is low enough
priority to have its exception processing interrupted/delayed.
It depends on what you mean::
a) if you mean that exceptions are prioritized and the highest
priority exception is the one taken, then OK you are working
in an ISA that has multiple exceptions per instruction. Most
RISC ISAs do not have this property.
b) if you mean that exceptions take priority over non-exception
instruction streaming, well that is what exceptions ARE. In these
cases, the exception handler inherits the priority of the instruction
stream that raised it--but that is NOT assigning a priority to the
exception.
c) and then there are the cases where a PageFault from GuestOS
page tables is serviced by GuestOS, while a PageFault from
HyperVisor page tables is serviced by HyperVisor. You could
assert that HV has higher priority than GuestOS, but it is
more like HV has privilege over GuestOS while running at the
same priority level.
MitchAlsup1 wrote:--------------------------
On Fri, 4 Oct 2024 18:11:23 +0000, EricP wrote:
Not necessary, you purge all of the younger instructions from the
thread at retirement, but none of the instructions associated with
the new <interrupt> thread at the front.
That's difficult with a circular buffer for the instruction queue/rob
as you can't edit the order. For a branch mispredict you might be able
to mark a circular range of entries as voided, and leave the entries
to be recovered serially at retire.
But voiding doesn't look like it works for exceptions or conflicting interrupt priority adjustments. In those cases purging the interrupt
handler and rejecting the hand-off looks like the only option.
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 25 Dec 2024 17:50:12 +0000, Paul A. Clayton wrote:
On 10/5/24 11:11 AM, EricP wrote:--------------------------
MitchAlsup1 wrote:[snip]
But voiding doesn't look like it works for exceptions or conflicting
interrupt priority adjustments. In those cases purging the interrupt
handler and rejecting the hand-off looks like the only option.
Should exceptions always have priority? It seems to me that if a
thread is low enough priority to be interrupted, it is low enough
priority to have its exception processing interrupted/delayed.
It depends on what you mean::
a) if you mean that exceptions are prioritized and the highest
priority exception is the one taken, then OK you are working
in an ISA that has multiple exceptions per instruction. Most
RISC ISAs do not have this property.
AArch64 has 44 different synchronous exception priorities, and within
each priority that describes more than one exception, there
is a sub-prioritization therein. (Section D 1.3.5.5 pp 6080 in
DDI0487K_a).
While it is not common for a particular instruction to generate
multiple exceptions, it is certainly possible (e.g. when
instructions are trapped to a more privileged execution mode).
b) if you mean that exceptions take priority over non-exception
instruction streaming, well that is what exceptions ARE. In these
cases, the exception handler inherits the priority of the instruction
stream that raised it--but that is NOT assigning a priority to the
exception.
c) and then there are the cases where a PageFault from GuestOS
page tables is serviced by GuestOS, while a PageFault from
HyperVisor page tables is serviced by HyperVisor. You could
assert that HV has higher priority than GuestOS, but it is
more like HV has privilege over GuestOS while running at the
same priority level.
It seems unlikely that a translation fault in user mode would need
handling in both the guest OS and the hypervisor during the
execution of an instruction; the exception to the hypervisor would
generally occur when the instruction trapped by the guest (who updated
the guest translation tables) is restarted.
Other exception causes (such as asynchronous exceptions like interrupts)
would remain pending and be taken (subject to priority and control
enables) when the instruction is restarted (or the next instruction is
dispatched, for asynchronous exceptions).
<snip>
Being able to overlap latency of a memory-mapped I/O access (or
other slow access) with execution of another thread seems
That depends on whether the access is posted or non-posted.
On Sat, 5 Oct 2024 15:11:29 +0000, EricP wrote:
MitchAlsup1 wrote:--------------------------
On Fri, 4 Oct 2024 18:11:23 +0000, EricP wrote:
Not necessary, you purge all of the younger instructions from the
thread at retirement, but none of the instructions associated with
the new <interrupt> thread at the front.
That's difficult with a circular buffer for the instruction
queue/rob as you can't edit the order. For a branch mispredict you
might be able to mark a circular range of entries as voided, and
leave the entries to be recovered serially at retire.
Sooner or later, the pipeline designer needs to recognize the
oft-occurring code sequence pictured as::
INST
INST
BC-------\
INST |
INST |
INST |
/----BR |
| INST<----/
| INST
| INST
\--->INST
INST
So that the branch predictor predicts as usual, but the DECODER recognizes
the join point of this prediction, so that if the prediction is wrong, one
only nullifies the mispredicted instructions and then inserts the
alternate instructions, while holding the join-point instructions until
the alternate instructions complete.
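The hammock handling described above can be sketched as list surgery on the instruction window. This is a purely illustrative model (region tags, instruction names, and the function are all invented): a mispredict nullifies only the wrong arm and inserts the refetched alternate arm, while instructions at or past the join point are held rather than purged.

```python
def resolve_hammock(window, mispredicted):
    """window: list of (name, region) where region is 'before',
    'taken', 'fallthrough', or 'join'; mispredicted: the wrong arm."""
    other = {"taken": "fallthrough", "fallthrough": "taken"}[mispredicted]
    # nullify only the mispredicted arm; keep everything else
    kept = [(n, r) for (n, r) in window if r != mispredicted]
    # alternate arm is fetched and inserted ahead of the held join ops
    refetched = [(f"{other}{i}", other) for i in range(2)]
    before_join = [e for e in kept if e[1] != "join"]
    join = [e for e in kept if e[1] == "join"]
    return before_join + refetched + join

window = [("i0", "before"), ("bc", "before"),
          ("t0", "taken"), ("t1", "taken"),
          ("j0", "join"), ("j1", "join")]
order = [n for n, _ in resolve_hammock(window, "taken")]
print(order)   # join-point ops survive; only the taken arm was replaced
```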
On 10/3/2024 7:00 AM, Anton Ertl wrote:...
Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
in the HotSpot Virtual Machine" at MPLR 2024. He reported that for
some programs the counters used for profiling the program result in
cache contention due to true or false sharing among threads.
The traditional software mitigation for that problem is to split the
counters into per-thread or per-core instances. But for heavily
multi-threaded programs running on machines with many cores the cost
of this mitigation is substantial.
...For the HotSpot application, the
eventual answer was that they live with the cost of cache contention
for the programs that have that problem. After some minutes the hot
parts of the program are optimized, and cache contention is no longer
a problem.
If the per-thread counters are properly padded to an L2 cache line and
properly aligned on cache-line boundaries, well, they should not cause
false sharing with other cache lines... Right?
On Sat, 5 Oct 2024 15:11:29 +0000, EricP wrote:
MitchAlsup1 wrote:--------------------------
On Fri, 4 Oct 2024 18:11:23 +0000, EricP wrote:
Not necessary, you purge all of the younger instructions from the
thread at retirement, but none of the instructions associated with
the new <interrupt> thread at the front.
That's difficult with a circular buffer for the instruction queue/rob
as you can't edit the order. For a branch mispredict you might be able
to mark a circular range of entries as voided, and leave the entries
to be recovered serially at retire.
Sooner or later, the pipeline designer needs to recognize the
oft-occurring code sequence pictured as::
INST
INST
BC-------\
INST |
INST |
INST |
/----BR |
| INST<----/
| INST
| INST
\--->INST
INST
So that the branch predictor predicts as usual, but the DECODER recognizes
the join point of this prediction, so that if the prediction is wrong, one
only nullifies the mispredicted instructions and then inserts the
alternate instructions, while holding the join-point instructions until
the alternate instructions complete.
But voiding doesn't look like it works for exceptions or conflicting
interrupt priority adjustments. In those cases purging the interrupt
handler and rejecting the hand-off looks like the only option.
Nullify instructions from the mispredicted paths. On hand-off to the ISR,
adjust the recovery IP to past the last instruction that executed properly,
nullifying everything between the exception and the ISR.
On Tue, 31 Dec 2024 2:02:05 +0000, Paul A. Clayton wrote:
On 12/25/24 1:30 PM, MitchAlsup1 wrote:
Sooner or later an ISR has to actually deal with the MMI/O
control registers associated with the <ahem> interrupt.
Yes, but multithreading could hide some of those latencies in
terms of throughput.
EricP is the master proponent of finishing the instructions in the
execution window that are finishable. I, merely, have no problem
in allowing the pipe to complete or take a flush based on the kind
of pipeline being engineered.
With 300-odd instructions in the window this thesis has merit,
with a 5-stage pipeline 1-wide, it does not have merit but is
not devoid of merit either.
MitchAlsup1 wrote:
On Tue, 31 Dec 2024 2:02:05 +0000, Paul A. Clayton wrote:
On 12/25/24 1:30 PM, MitchAlsup1 wrote:
Sooner or later an ISR has to actually deal with the MMI/O
control registers associated with the <ahem> interrupt.
Yes, but multithreading could hide some of those latencies in
terms of throughput.
EricP is the master proponent of finishing the instructions in the
execution window that are finishable. I, merely, have no problem
in allowing the pipe to complete or take a flush based on the kind
of pipeline being engineered.
With 300-odd instructions in the window this thesis has merit,
with a 5-stage pipeline 1-wide, it does not have merit but is
not devoid of merit either.
It is also possible that the speculation barriers I describe below
will limit the benefits that pipelining exceptions and interrupts
might be able to see.
The issue is that both exception handlers and interrupt handlers usually
read and write Privileged Control Registers (PCRs) and/or MMIO device
registers very early in the handler. Most MMIO device registers and CPU
PCRs cannot be speculatively read, as that may cause a state transition.
Of course stores are never speculated and can only be initiated
at commit/retire.
The normal memory coherence rules assume that loads are to memory-like
locations that do not change state on reads, and that therefore
memory loads can be harmlessly replayed if needed.
While memory stores are not performed speculatively, an implementation
might speculatively prefetch a cache line as soon as a store is queued
and cause cache lines to ping-pong.
But loads to many MMIO devices and PCRs effectively require a
speculation barrier in front of them to prevent replays.
An SPCB (Speculation Barrier) instruction could block speculation.
It stalls execution until all older conditional branches are resolved and
all older instructions that might throw an exception have determined
they won't do so.
The core could have an internal lookup table telling it which PCRs can be
read speculatively because there are no side effects to doing so.
Those PCRs would not require an SPCB to guard them.
For MMIO device registers I think having an explicit SPCB instruction
might be better than putting a "no-speculate" flag on the PTE for the
device-register address, as that flag would be difficult to propagate
backwards from address translate to all the parts of the core that
we might have to sync with.
This all means that there may be very little opportunity for speculative
execution of these handlers, no matter how much hardware one tosses at
them.
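The SPCB stall condition proposed above reduces to a simple predicate over older in-flight operations. SPCB is EricP's hypothetical instruction; this simulation of its issue condition is mine, with invented field names: the barrier (and so the MMIO/PCR load behind it) issues only once every older branch is resolved and every older potentially faulting instruction has proven it will not fault.

```python
def spcb_may_issue(older_ops):
    """older_ops: list of dicts with 'kind' and 'resolved' fields."""
    for op in older_ops:
        if op["kind"] == "branch" and not op["resolved"]:
            return False      # unresolved branch: the load could be replayed
        if op["kind"] == "may_fault" and not op["resolved"]:
            return False      # a pending fault could squash the load
    return True

older = [{"kind": "branch",    "resolved": False},
         {"kind": "may_fault", "resolved": True}]
stalled = spcb_may_issue(older)    # False: a branch is still unresolved
older[0]["resolved"] = True
ready = spcb_may_issue(older)      # True: safe to read the device register
print(stalled, ready)
```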
On Wed, 25 Dec 2024 19:10:09 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 25 Dec 2024 17:50:12 +0000, Paul A. Clayton wrote:
On 10/5/24 11:11 AM, EricP wrote:--------------------------
MitchAlsup1 wrote:[snip]
But voiding doesn't look like it works for exceptions or conflicting
interrupt priority adjustments. In those cases purging the interrupt
handler and rejecting the hand-off looks like the only option.
Should exceptions always have priority? It seems to me that if a
thread is low enough priority to be interrupted, it is low enough
priority to have its exception processing interrupted/delayed.
It depends on what you mean::
a) if you mean that exceptions are prioritized and the highest
priority exception is the one taken, then OK you are working
in an ISA that has multiple exceptions per instruction. Most
RISC ISAs do not have this property.
AArch64 has 44 different synchronous exception priorities, and within
each priority that describes more than one exception, there
is a sub-prioritization therein. (Section D 1.3.5.5 pp 6080 in
DDI0487K_a).
Thanks for the link::
However, I would claim that the vast majority of those 44 things
are interrupts and not exceptions (in colloquial nomenclature).
An exception is raised if an instruction cannot execute to completion,
and is raised synchronously with the instruction stream (and at a
precise point in the instruction stream).
An interrupt is raised asynchronously to the instruction stream.
Reset is an interrupt and not an exception.
Debug that hits an address range is closer to an interrupt than an
exception. <but I digress>
But it appears that ARM has many interrupts classified as exceptions.
Anything not generated from instructions within the architectural
instruction stream is an interrupt, and anything generated from
within an architectural instruction stream is an exception.
It also appears ARM uses priority to sort exceptions into an order,
while most architectures define priority as a mechanism to choose
when to take hard-control-flow events rather than what.
It seems unlikely that a translation fault in user mode would need
handling in both the guest OS and the hypervisor during the
execution of an instruction;
Neither stated nor inferred. A PageFault is handled singularly by
the level in the system that controls (writes) those PTEs.
There is a significant period of time in many architectures after
control arrives at the ISR where the ISR is not allowed to raise a
page fault {storing registers to a stack}, and since this ISR
might be the PageFault handler, it is not in a position to
handle its own faults. However, the HyperVisor can handle GuestOS
PageFaults--GuestOS thinks the pages are present with reasonable
access rights, and HyperVisor tables are used to swap them in/out.
Other than latency, the GuestOS ISR does not see the PageFault.
My 66000, on the other hand, when the ISR receives control, state
has been saved on a stack, the instruction stream is already
re-entrant, and the register file is as it was the last time
this ISR ran.
the exception to the hypervisor would generally occur when the
instruction trapped by the guest (who updated the guest translation
tables) is restarted.
Other exception causes (such as asynchronous exceptions like interrupts)
Asynchronous exceptions A R E interrupts, not like interrupts;
they ARE interrupts. If it is not synchronous with instruction
stream it is an interrupt. Only if it is synchronous with the
instruction stream is it an exception.
would remain pending and be taken (subject to priority and control
enables) when the instruction is restarted (or the next instruction is
dispatched, for asynchronous exceptions).
<snip>
Being able to overlap latency of a memory-mapped I/O access (or
other slow access) with execution of another thread seems
That depends on whether the access is posted or non-posted.
Writes can be posted; Reads cannot. Reads must complete for the
ISR to be able to set up the control block that softIRQ/DPC will
process shortly. Only after the data structure for softIRQ/DPC
is written can the ISR allow control flow to leave.
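The ordering constraint described above can be made concrete with a small trace model. The device register name, the DPC record layout, and the event log are all invented; the point is the required order: the non-posted status read completes first, the softIRQ/DPC control block is published second, and only then does control leave the ISR.

```python
events = []

def mmio_read(reg):
    events.append(("read", reg))          # non-posted: completes in order
    return 0x1                            # pretend status: one buffer ready

def isr(dpc_queue):
    status = mmio_read("STATUS")          # must complete; cannot be posted
    dpc_queue.append({"status": status})  # control block for the DPC
    events.append(("write", "DPC"))
    events.append(("ret", "isr"))         # only now may control leave

dpcs = []
isr(dpcs)
print(events)   # the read completes before the DPC block is published
```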