Sysop: | Amessyroom |
---|---|
Location: | Fayetteville, NC |
Users: | 23 |
Nodes: | 6 (0 / 6) |
Uptime: | 56:50:55 |
Calls: | 584 |
Calls today: | 1 |
Files: | 1,139 |
D/L today: | 179 files (27,921K bytes) |
Messages: | 112,134 |
Greetings everyone !
Since Google closed down comp.arch on Google Groups, I had been using
Real World Technologies as a portal. About 8 weeks ago it crashed for
the first time, then a couple weeks later it crashed a second time, apparently terminally, or Dave Kanter's interest has waned ...
For Usenet you were using the i2pn2.org server, probably via the
www.novabbs.com web portal created with Rocksolid Light software. The
server and portal were maintained by Retro Guy (Thom). On 2025-04-26,
Thom passed away from pancreatic cancer. His Usenet server and his
portal continued to work without maintenance until late July. But
eventually they stopped.
So it goes.
Greetings everyone !
Since Google closed down comp.arch on Google Groups, I had been using
Real World Technologies as a portal. About 8 weeks ago it crashed for
the first time, then a couple weeks later it crashed a second time, apparently terminally, or Dave Kanter's interest has waned ...
With help from Terje and SFuld, we have located this portal, and this
is my first attempt at posting here.
Anyone familiar with my comp.arch record over the years understands
that I participate "a lot"; probably more than is good for my interests,
but it is energy I seem to have on a continuous basis. My unanticipated
down time gave my energy time to work on stuff that I had been
neglecting for quite some time--that is, the non-ISA parts of my
architecture.
{{I should probably learn something from this down time about my productivity}}
My 66000 ISA is in "pretty good shape" having almost no changes over
the last 6 months, with only specification clarifications. So it was time
to work on the non-ISA parts.
----------------------------
First up was/is the System Binary Interface: which for the most part is
"just like" the Application Binary Interface, except that it uses
supervisor call SVC and supervisor return SVR instead of CALL, CALX,
and RET. This method gives uniformity across the 4 privilege levels
{HyperVisor, Host OS, Guest OS, and Application}.
I decided to integrate exception, check, and interrupts with SVC since
they all involve a privilege transfer of control, and a dispatcher.
Instead of having something like <Interrupt> vector table, I decided,
under EricP's tutelage, to use a software dispatcher. This allows the
vector table to be positioned anywhere in memory, on any boundary, and
be of any size; such that each Guest OS, Host OS, and HyperVisor can
have its own table organized any way it pleases. Control arrives at the
dispatcher with a re-entrant Thread State and register file, with R0
holding "why" and enough temporary registers for dispatch to perform
its duties immediately.
{It ends up that even Linux "signals" can use this means with very
slight modification to the software dispatcher--it merely has to
be cognizant that signal privilege == thread waiting privilege and
thus "save the non-preserved registers".}
Dispatch extracts R0<38:32>, compares this to the size of the table,
and if it is within the table, CALX's the entry point in the table.
This performs an ABI control transfer to the required "handler".
Upon return, Dispatcher performs SVR to return control whence it came.
The normal path through Dispatcher is 7 instructions.
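As a sketch, the dispatch path can be modeled in C. The table size, the names, and the demo handler here are my own assumptions for illustration; the real path is of course native code, not C:

```c
#include <assert.h>
#include <stdint.h>

typedef void (*handler_t)(uint64_t why);

/* Hypothetical dispatch table: its size and placement are chosen by the
   OS at that privilege level, not fixed by hardware. */
enum { TABLE_SIZE = 64 };
static handler_t dispatch_table[TABLE_SIZE];

static uint64_t last_why;                     /* demo handler records "why" */
static void demo_handler(uint64_t why) { last_why = why; }

/* Control arrives here with R0 holding "why"; bits <38:32> select the entry. */
static void dispatcher(uint64_t r0)
{
    uint64_t index = (r0 >> 32) & 0x7F;       /* extract R0<38:32> */
    if (index < TABLE_SIZE)                   /* compare against table size */
        dispatch_table[index](r0);            /* ABI control transfer (CALX) */
    /* on return, SVR delivers control whence it came */
}
```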
In My 66000 Architecture, SVR also checks pending interrupts of higher priority than where SVR is going; thus, softIRQ's are popped off the
deferred call list and processed before control is delivered to lower priority levels.
----------------------------
Next up was the System Programming model: I modeled Chip Resources after
PCIe Peripherals. {{I had to use the term Peripheral because, with
SR-IOV and MR-IOV; with physical Functions, virtual Functions, and base
Functions; and with Bus, Device, Function being turned into a routing
code--none of those terms made sense, and each required too many words
to describe. So, I use the term Peripheral for anything that performs
an I/O service on behalf of the system.}}
My 66000 uses nested paging, with Application and Guest OS using
Level-1 translation while Host OS and HyperVisor use Level-2 translation.
My 66000 translation projects a 64-bit virtual address space into a
66-bit universal address space with {DRAM, Configuration, MM I/O, and
ROM} spaces.
Since My 66000 comes out of reset with the MMU turned on, boot software
accesses virtual Configuration space, which is mapped to {Chip, DRAM,
and PCIe} configuration spaces. Resources are identified by Type 0
PCIe Configuration headers and programmed the "obvious" way (later),
assigning a page of MM I/O address space to/for each Resource.
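For reference, the Type 0 Configuration header mentioned here has an architected layout; a C rendering of its first 64 bytes (field names are the conventional PCIe ones, not anything from the My 66000 spec):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* PCIe Type 0 Configuration Space header: the architected first 64 bytes. */
typedef struct {
    uint16_t vendor_id;            /* 0x00 */
    uint16_t device_id;            /* 0x02 */
    uint16_t command;              /* 0x04 */
    uint16_t status;               /* 0x06 */
    uint8_t  revision_id;          /* 0x08 */
    uint8_t  prog_if;              /* 0x09 */
    uint16_t class_code;           /* 0x0A: subclass + base class */
    uint8_t  cacheline_size;       /* 0x0C */
    uint8_t  latency_timer;        /* 0x0D */
    uint8_t  header_type;          /* 0x0E: 0 => Type 0 */
    uint8_t  bist;                 /* 0x0F: Built-In Self-Test control */
    uint32_t bar[6];               /* 0x10: Base Address Registers */
    uint32_t cardbus_cis;          /* 0x28 */
    uint16_t subsys_vendor_id;     /* 0x2C */
    uint16_t subsys_id;            /* 0x2E */
    uint32_t expansion_rom;        /* 0x30 */
    uint8_t  cap_ptr;              /* 0x34 */
    uint8_t  reserved[7];          /* 0x35 */
    uint8_t  interrupt_line;       /* 0x3C */
    uint8_t  interrupt_pin;        /* 0x3D */
    uint8_t  min_gnt;              /* 0x3E */
    uint8_t  max_lat;              /* 0x3F */
} pcie_type0_hdr;
```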
Chip Configuration headers have the Built-In Self-Test BIST control
port. Chip-resources use BIST to clear and initialize the internal
stores for normal operation. Prior to writing to BIST these resources
can be read using the diagnostic port and dumped as desired. BIST is
assumed to "take some time" so BOOT SW might cause most Chip resources
to BIST while it goes about getting DRAM up and running.
In all cases:: Control Registers exist--it is only whether SW can access
them that is in question. A control register that does not exist reads
as 0 and discards any write, while a control register that does exist
absorbs the write and returns the last write or the last HW update.
Configuration control registers are accessible in <physical>
configuration space. The BAR registers in particular are used to assign
MM I/O addresses to the rest of the control registers not addressable
in configuration space.
Chip resources {Cores, on-Die Interconnect, {L3, DRAM}, {HostBridge,
I/O MMU, PCIe Segmenter}} have the first 32 DoubleWords of the
assigned MM I/O space defined as a "file" containing R0..R31. In all
cases:
R0 contains the Voltage and Frequency control terms of the resource,
R1..R27 contains any general purpose control registers of resource.
R28..R30 contains the debug port,
R31 contains the Performance Counter port.
The remaining 480 DoubleWords are defined by the resource itself
(or not).
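A C view of that layout, as I read it (the struct and field names are mine; only the R0..R31 split and the 480 remaining DoubleWords come from the text):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One resource's MM I/O page: 32 architected DoubleWords (R0..R31)
   followed by 480 resource-defined DoubleWords. */
typedef struct {
    volatile uint64_t vf_control;     /* R0:      voltage/frequency terms   */
    volatile uint64_t ctl[27];        /* R1..R27: general control registers */
    volatile uint64_t dbg_addr;       /* R28: diagnostic "address" register */
    volatile uint64_t dbg_data;       /* R29: diagnostic "data" register    */
    volatile uint64_t dbg_other;      /* R30: diagnostic "other" register   */
    volatile uint64_t perf_port;      /* R31: performance counter port      */
    volatile uint64_t resource[480];  /* defined by the resource (or not)   */
} resource_regs_t;
```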
Because My 66000 ISA has memory instructions that "touch" multiple
memory locations, these instructions take on special significance
when using the debug and performance counter ports. Single memory instructions access the control registers themselves, while multi-
memory instructions access "through" the port to the registers
the port controls.
For example: each resource has 8 performance counters and 1 control
register (R31) governing that port.
a STB Rd,[R31] writes a selection into the PC selectors
a STD Rd,[R31] writes 8 selections into the PC selectors
a LDB Rd,[R31] reads a selection from a PC selectors
a LDD Rd,[R31] reads 8 selections from the PC selectors
while:
a LDM Rd,Rd+7,[R31] reads 8 Performance Counters,
a STM Rd,Rd+7,[R31] writes 8 Performance Counters,
a MS #0,[R31],#64 clears 8 Performance Counters.
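The single- versus multi-memory distinction can be modeled in plain C, treating the port as a selector register sitting in front of the counters. This is an illustrative software model of the semantics described above, not the hardware:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Software model of the performance-counter port (R31). */
typedef struct {
    uint8_t  selector[8];   /* what each counter counts */
    uint64_t counter[8];    /* the counters themselves  */
} perf_port_t;

/* STD analogue: a single-memory store writes the 8 selectors. */
static void port_store_selectors(perf_port_t *p, const uint8_t sel[8])
{
    memcpy(p->selector, sel, 8);
}

/* LDM analogue: a multi-memory load reads *through* the port to the counters. */
static void port_load_counters(const perf_port_t *p, uint64_t out[8])
{
    memcpy(out, p->counter, 8 * sizeof out[0]);
}

/* MS #0,[R31],#64 analogue: a multi-memory store of zero clears the counters. */
static void port_clear_counters(perf_port_t *p)
{
    memset(p->counter, 0, sizeof p->counter);
}
```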
The Diagnostic port provides access to storage within the resource.
R28 is roughly the "address" control register
R29 is roughly the "data" control register
R30 is roughly the "other" control register
For a Core; one can access the following components from this port:
ICache Tag
ICache Data
ICache TLB
DCache Tag
DCache Data
DCache TLB
Level-1 Miss Buffer
L2Cache Tag
L2Cache Data
L2Cache TLB
L2Cache MMU
Level-2 Miss Buffer
Accesses through this port come in single-memory and multi-memory
flavors. Accessing these control registers as single memory actions
allows raw access to the data and associated ECC. Reads tell you
what HW has stored, writes allow SW to write "bad" ECC, should it
so choose. Multi-memory accesses allow SW to read or write cache
line sized chunks. The Core tags are configured so that every line
has a state where this line neither hits nor participates in set
allocation (when a line needs to be allocated on a miss or replacement).
So, a single bad line in a 16 KB 4-way set-associative cache loses
64 bytes, and one set becomes 3-way set associative.
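A sketch of victim selection that honors such a "never hits, never allocated" state; the rotating-choice policy and all names here are my own assumptions, used only to show how one disabled line degrades a set to 3-way:

```c
#include <assert.h>
#include <stdint.h>

enum { WAYS = 4 };

typedef struct {
    uint64_t tag;
    int      valid;
    int      disabled;   /* line neither hits nor participates in allocation */
} line_t;

/* Pick a way to allocate in this set, never choosing a disabled line.
   Returns -1 only if every way is disabled. */
static int pick_victim(const line_t set[WAYS], unsigned rr)
{
    for (int i = 0; i < WAYS; i++) {
        int way = (rr + i) % WAYS;           /* simple rotating choice */
        if (!set[way].disabled)
            return way;                      /* effectively 3-way if one is bad */
    }
    return -1;
}
```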
----------------------------
By using the fact that cores come out of reset with MMU turned on,
and BOOT ROM supplying the translation tables, I was able to achieve
that all resources come out of reset with all control register flip-
flops = 0, except for Core[0].Hypervisor_Context.v = 1.
Core[0] I$, D$, and L2$ come out of reset in the "allocated" state,
so Boot SW has a small amount of memory from which to find DRAM,
configure, initialize, tune the pin interface, and clear; so that
one can proceed to walk and configure the PCIe trees of peripherals.
----------------------------
Guest OS can configure its translation tables to emit {Configuration
and MM I/O} space accesses. Now that these are so easy to recognize:
Host OS and HyperVisor have the ability to translate Guest Physical {Configuration and MM I/O} accesses into Universal {Config or MM I/O} accesses. This requires that the PTE KNOW how SR-IOV was set up on
that virtual Peripheral. All we really want is a) the "routing" code
of the physical counterpart of the virtual Function, and b) whether
the access is to be allowed (valid & present). Here, the routing code contains the PCIe physical Segment, whether the access is physical
or virtual, and whether the routing code uses {Bus, Device, *},
{Bus, *, *} or {*, *, *}. The rest is PCIe transport engines.
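A sketch of the routing information such a PTE would carry; all the field widths, the packing, and the wildcard encoding below are my own guesses for illustration, not the actual My 66000 PTE format:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical routing entry attached to an MM I/O translation. */
typedef struct {
    uint16_t segment;     /* PCIe physical Segment                            */
    uint8_t  bus;
    uint8_t  devfn;       /* Device<<3 | Function                             */
    uint8_t  wildcard;    /* 0: {Bus,Dev,Fn}   1: {Bus,*,*}   2: {*,*,*}      */
    uint8_t  is_virtual;  /* routing code refers to a virtual Function        */
    uint8_t  valid;       /* access allowed (valid & present)                 */
} routing_t;

/* Translate a guest BDF through the routing entry to a host routing code. */
static uint32_t route(const routing_t *r, uint8_t guest_bus, uint8_t guest_devfn)
{
    uint8_t bus   = (r->wildcard >= 2) ? guest_bus   : r->bus;
    uint8_t devfn = (r->wildcard >= 1) ? guest_devfn : r->devfn;
    return ((uint32_t)r->segment << 16) | ((uint32_t)bus << 8) | devfn;
}
```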
Anyway: School is back in session !
Mitch
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Greetings everyone !
Chip resources {Cores, on-Die Interconnect, {L3, DRAM}, {HostBridge,
I/O MMU, PCIe Segmenter}} have the first 32 DoubleWords of the
assigned MM I/O space defined as a "file" containing R0..R31. In all
cases:
R0 contains the Voltage and Frequency control terms of the resource,
R1..R27 contains any general purpose control registers of resource.
R28..R30 contains the debug port,
R31 contains the Performance Counter port.
The remaining 480 DoubleWords are defined by the resource itself
(or not).
I'd allow for regions larger than 4096 bytes. It's not uncommon
for specialized on-board DMA engines to require 20 bits of
address space to define the complete set of device resources,
even for on-chip devices (A DMA engine may support a large number
of "ring" structures, for example, and one might group the
ring configuration registers into 4k regions (so they can be assigned
to a guest in a SRIOV-type device)).
I've seen devices with dozens of performance registers (both
direct-access and indirect-access).
Because My 66000 ISA has memory instructions that "touch" multiple
memory locations, these instructions take on special significance
when using the debug and performance counter ports. Single memory
instructions access the control registers themselves, while multi-
memory instructions access "through" the port to the registers
the port controls.
That level of indirection may cause difficulties when virtualizing
a device.
For example: each resource has 8 performance counters and 1 control
register (R31) governing that port.
a STB Rd,[R31] writes a selection into the PC selectors
a STD Rd,[R31] writes 8 selections into the PC selectors
a LDB Rd,[R31] reads a selection from a PC selectors
a LDD Rd,[R31] reads 8 selections from the PC selectors
while:
a LDM Rd,Rd+7,[R31] reads 8 Performance Counters,
a STM Rd,Rd+7,[R31] writes 8 Performance Counters,
a MS #0,[R31],#64 clears 8 Performance Counters.
The Diagnostic port provides access to storage within the resource.
R28 is roughly the "address" control register
R29 is roughly the "data" control register
R30 is roughly the "other" control register
For a Core; one can access the following components from this port:
ICache Tag
ICache Data
ICache TLB
DCache Tag
DCache Data
DCache TLB
Level-1 Miss Buffer
L2Cache Tag
L2Cache Data
L2Cache TLB
L2Cache MMU
Level-2 Miss Buffer
Accesses through this port come in single-memory and multi-memory
flavors. Accessing these control registers as single memory actions
allows raw access to the data and associated ECC. Reads tell you
what HW has stored, writes allow SW to write "bad" ECC, should it
so choose. Multi-memory accesses allow SW to read or write cache
line sized chunks. The Core tags are configured so that every line
has a state where this line neither hits nor participates in set
allocation (when a line needs to be allocated on a miss or replacement).
So, a single bad line in a 16 KB 4-way set-associative cache loses
64 bytes, and one set becomes 3-way set associative.
----------------------------
The KISS principle applies.
By using the fact that cores come out of reset with MMU turned on,
and BOOT ROM supplying the translation tables, I was able to achieve
that all resources come out of reset with all control register flip-
flops = 0, except for Core[0].Hypervisor_Context.v = 1.
Where is the ROM? Modern SoCs have an on-board ROM, which
cannot be changed without a re-spin and new tapeout. That
ROM needs to be rock-solid and provide just enough capability
to securely load a trusted blob from a programmable device
(e.g. SPI flash device).
I'm really leery about the idea of starting with the MMU enabled;
I don't see any advantage to doing that.
Core[0] I$, D$, and L2$ come out of reset in the "allocated" state,
so Boot SW has a small amount of memory from which to find DRAM,
configure, initialize, tune the pin interface, and clear; so that
one can proceed to walk and configure the PCIe trees of peripherals.
You don't need to configure peripherals before DRAM is initialized
(other than the DRAM controller itself). All other peripheral
initialization should be done in loadable firmware or a secure
monitor, hypervisor or bare-metal kernel.
----------------------------
Guest OS can configure its translation tables to emit {Configuration
and MM I/O} space accesses. Now that these are so easy to recognize:
Security. Guest OS should only be able to access resources
granted to it by the HV.
Host OS and HyperVisor have the ability to translate Guest Physical
{Configuration and MM I/O} accesses into Universal {Config or MM I/O}
accesses. This requires that the PTE KNOW how SR-IOV was set up on
that virtual Peripheral.
This seems unnecessarily complicated.
Every SR-IOV capable device is different, and aside from the standard
PCIe-defined configuration space registers, everything else is
device-specific.
scott@slp53.sl.home (Scott Lurndal) posted:
Host OS and HyperVisor have the ability to translate Guest Physical
{Configuration and MM I/O} accesses into Universal {Config or MM I/O}
accesses. This requires that the PTE KNOW how SR-IOV was set up on
that virtual Peripheral.
This seems unnecessarily complicated.
So did IEEE 754 in 1982...
What I have done is to virtualize Config and MM I/O spaces, so Guest OS
does not even see that it is not Real OS running on bare metal--and doing
so without HV intervention on any of the Config or MM I/O accesses.
Every SR-IOV capable device
is different and aside the standard PCIe defined configuration space
registers, everything else is device-specific.
Only requires 3 bits in the MM I/O PTE.
Only requires 1 bit in Config PTE, a bit that already had to be there.
On 8/22/2025 11:17 AM, MitchAlsup wrote:
scott@slp53.sl.home (Scott Lurndal) posted:
<snip>
Host OS and HyperVisor have the ability to translate Guest Physical
{Configuration and MM I/O} accesses into Universal {Config or MM I/O}
accesses. This requires that the PTE KNOW how SR-IOV was set up on
that virtual Peripheral.
This seems unnecessarily complicated.
So did IEEE 754 in 1982...
Still is...
Denormals, Inf/NaN, ... tend to accomplish relatively little in
practice; apart from making FPUs more expensive, often slower, and
requiring programmers to go through extra hoops to specify DAZ/FTZ in
cases where they need more performance.
Likewise, +/- 0.5 ULP, accomplishes little beyond adding cost; whereas
+/- 0.63 ULP would be a lot cheaper, and accomplishes nearly the same effect.
Well, apart from the seeming failure of being unable to fully converge
the last few bits of N-R, which seems to depend primarily on sub-ULP bits.
But, there is a tradeoff:
Doing a faster FPU which uses trap-and-emulate.
Still isn't free, as detecting cases that will require trap-and-emulate
still has a higher cost than merely not bothering in the first place
(and now requires trickery of routing FPSR bits into the instruction
decoder depending on whether they need to be routed in a way that will
allow the FPU to detect violations of IEEE semantics).
And finding some other issues in the process, ...
...
What I have done is to virtualize Config and MM I/O spaces, so Guest OS does not even see that it is not Real OS running on bare metal--and doing so without HV intervention on any of the Config or MM I/O accesses.
Still seems unnecessarily complicated.
Could be like:
Machine/ISR Mode: Bare metal, no MMU.
Supervisor Mode: Full Access, MMU.
User: Limited Access, MMU
VM Guest OS then runs in User Mode, and generates a fault whenever a
privileged operation is encountered. The VM can then fake the rest of
the system in software...
And/Or: Ye Olde Interpreter or JIT compiler (sorta like DOSBox and similar).
Nested Translation? Fake it in software.
Unlike real VMs, SW address translation can more easily scale to N
levels of VM, even if this also means N levels of slow...
Every SR-IOV capable device
is different and aside the standard PCIe defined configuration space
registers, everything else is device-specific.
Only requires 3 bits in the MM I/O PTE.
Only requires 1 bit in Config PTE, a bit that already had to be there.
scott@slp53.sl.home (Scott Lurndal) posted:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Greetings everyone !
Because My 66000 ISA has memory instructions that "touch" multiple
memory locations, these instructions take on special significance
when using the debug and performance counter ports. Single memory
instructions access the control registers themselves, while multi-
memory instructions access "through" the port to the registers
the port controls.
That level of indirection may cause difficulties when virtualizing
a device.
These are on-Die resources not PCIe peripherals.
By using the fact that cores come out of reset with MMU turned on,
and BOOT ROM supplying the translation tables, I was able to achieve
that all resources come out of reset with all control register flip-
flops = 0, except for Core[0].Hypervisor_Context.v = 1.
Where is the ROM? Modern SoCs have an on-board ROM, which
cannot be changed without a re-spin and new tapeout. That
ROM needs to be rock-solid and provide just enough capability
to securely load a trusted blob from a programmable device
(e.g. SPI flash device).
ROM is external FLASH in the envisioned implementations.
----------------------------
Guest OS can configure its translation tables to emit {Configuration
and MM I/O} space accesses. Now that these are so easy to recognize:
Security. Guest OS should only be able to access resources
granted to it by the HV.
Yes, Guest physical MM I/O Space is translated by Host MM I/O
Translation tables. Real OS sets up its translation tables to emit
MM I/O Accesses, so Guest OS should too; then that Guest Physical is
considered Host virtual and translated and protected again.
As far as I am concerned, Guest OS thinks it has 32 Devices, each of
which has 8 Functions, all on Bus 0... So, a Guest OS with fewer than
256 Fctns sees only 1 BUS and can short circuit the virtual Config
discovery.
These virtual Guest OS accesses, then, get redistributed to the
Segments and Busses on which the VFs actually reside by Level-2 SW.
BGB wrote:
On 8/22/2025 11:17 AM, MitchAlsup wrote:
scott@slp53.sl.home (Scott Lurndal) posted:
<snip>
Host OS and HyperVisor have the ability to translate Guest Physical
{Configuration and MM I/O} accesses into Universal {Config or MM I/O}
accesses. This requires that the PTE KNOW how SR-IOV was set up on
that virtual Peripheral.
This seems unnecessarily complicated.
So did IEEE 754 in 1982...
Still is...
Denormals, Inf/NaN, ... tend to accomplish relatively little in
practice; apart from making FPUs more expensive, often slower, and
requiring programmers to go through extra hoops to specify DAZ/FTZ in
cases where they need more performance.
Likewise, +/- 0.5 ULP, accomplishes little beyond adding cost; whereas
+/- 0.63 ULP would be a lot cheaper, and accomplishes nearly the same effect.
Well, apart from the seeming failure of being unable to fully converge
the last few bits of N-R, which seems to depend primarily on sub-ULP bits.
Having spent ~10 years (very much part time!) working on 754 standards,
I strongly believe you are wrong:
Yes, there are a few small issues, some related to grandfather clauses
that might go away at some point, but the zero/subnorm/normal/inf/nan
setup is not one of them.
Personally I think it would have been a huge win if the original
standard had defined inf/nan a different way:
What we have is Inf == Maximal exponent, all-zero mantissa, while all
other mantissa values indicates a NaN.
For binary FP it is totally up to the CPU vendor how to define Quiet NaN
vs Signalling NaN, most common seems to be to set the top bit in the mantissa.
What we have been missing for 40 years now is a fourth category:
None (or Null/Missing)
This would have simplified all sorts of array/matrix sw where both
errors (NaN) and missing (None) items are possible.
The easiest way to implement it would also make the FPU hardware simpler:
The top two bits of the mantissa define
11 : SNaN
10 : QNaN
01 : None
00 : Inf
The rest of the mantissa bits could then carry any payload you want, including optional debug info for Infinities.
Terje
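Terje's proposed encoding above is easy to decode in software; here is a C classifier for it. Note this implements the *proposal* (top two mantissa bits under a maximal exponent selecting SNaN/QNaN/None/Inf), not IEEE 754 as actually deployed:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum fp_class { FINITE, INF, NONE, QNAN, SNAN };

/* Classify a Binary64 value under the proposed encoding. */
static enum fp_class classify(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);           /* safe type-pun */
    if (((bits >> 52) & 0x7FF) != 0x7FF)
        return FINITE;                        /* exponent not maximal */
    switch ((bits >> 50) & 3) {               /* top two mantissa bits */
    case 3:  return SNAN;
    case 2:  return QNAN;
    case 1:  return NONE;
    default: return INF;                      /* remaining 50 bits: payload */
    }
}
```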
BGB <cr88192@gmail.com> posted:
On 8/23/2025 10:11 AM, Terje Mathisen wrote:
<snip>
BGB wrote:
Mitch and I have repeated this too many times already:
If you are implementing a current-standards FPU, including FMAC support,
then you already have the very wide normalizer which is the only
expensive item needed to allow zero-cycle denorm cost.
Errm, no single-rounded FMA in my case, as single rounded FMA (for
Binary64) would also require Trap-and-Emulate...
But, yeah, Free if you have FMA, is not the same as FMA being free.
Partial issue is that single rounded FMA would effectively itself have
too high of cost (and an FMA unit would require higher latency than
separate FMUL and FADD units).
FMA latency < (FMUL + FADD) latency
FMA latency >= FMUL latency
FMA latency >= FADD latency
Ironically, what FMA operations exist tend to be slower for Binary32 ops
than using separate MUL and ADD ops in the default (non-IEEE) mode.
Though for Binary64, it would be slightly faster, though still
double-rounded-ish. They can mimic Single-Rounded behavior with Binary32
and Binary16 though mostly for sake of internally operating on Binary64.
You must accept that::
FMA Rd,Rs1,Rs2,Rs3
FSUB Re,Rd,Rs3
leaves all the proper bits in Re; whereas you cannot even argue::
FMUL Rd,Rs1,Rs2
FADD Re,Rd,Rs3
RSUB Re,Re,Rs3
leaves all the proper bits in Re !! in all cases !!
BGB <cr88192@gmail.com> posted:
On 8/22/2025 2:51 PM, Terje Mathisen wrote:
<snip>
BGB wrote:
On 8/22/2025 11:17 AM, MitchAlsup wrote:
Often it seemingly works well enough to either treat a 0 exponent as
meaning 0, or to treat only the all-0's encoding as 0.
Decided to leave out going into a detour about aggressive corner-cutting
(or, more aggressive than I tend to use).
But, the issue is partly asking what exactly Denormals and Inf/NaN tend
to meaningfully offer to calculations...
A slow and meaningful death instead of a quick and unsightly death.
As is, it seems like a case of:
Denormals:
Usually too small to matter;
Often their only visible effect is making performance worse.
Inf/NaN:
Often only appear if something has gone wrong.
Used in initialization.
If the FPU were to behave like, say:
Exponent 0 is 0 (DAZ/FTZ);
Max exponent is treated like an extra level of huge values.
Inf and NaN are then just huge values understood as Inf/NaN.
Likely relatively little software would notice in practice.
So, you do not follow standards, or even agree that they bring value
to the computing community !?!
Say, for example, a video that came up not too long ago:
https://www.youtube.com/watch?v=y-NOz94ZEOA
Or, in effect, even with all the powers of shiny/expensive Desktop PC
CPUs, denormal numbers still aren't free.
Neither are seat belts or air-bags or 5MPH bumpers.
And, people need to manually opt out far more often than programs are
likely to notice, and if it were a case of "opt-in for slightly more
accurate math at the cost of worse performance"; how many people would
make *that* choice?...
Contrast, at least rounding has a useful/visible effect:
If calculations were to always truncate, values tend to drift towards 0.
In a lot of scenarios, this effect is, at least, visible.
As for strict 0.5 ULP?
Mostly it seems to make hardware more expensive.
And, determinism could have been achieved in cheaper ways.
Say, by specifying use of truncation instead.
One could achieve 0.63 ULP by having 4 bits below the ULP.
And, say, 54*54->58 FMUL is cheaper than 54*54->108 bit (or, more, if
one wants to be able to support single-rounded FMAC).
But then it takes 11 instructions to get double-double product accuracy
instead of 2 instructions::
{double, double} TwoProduct( double a, double b )
{ // Knuth
x = FPMUL( a * b );
{ ahi, alo } = Split( a );
{ bhi, blo } = Split( b );
q = FPMUL( ahi * bhi );
r = FPMUL( alo * bhi );
s = FPMUL( ahi * blo );
t = FPMUL( alo * blo );
u = FPADD( x - q );
v = FPADD( u - r );
w = FPADD( v - s );
y = FPADD( t - w );
return { x, y };
}
versus::
{double, double} TwoProduct( double a, double b )
{
double x = FPMUL( a * b );
double y = FPMAC( a * b - x );
return { x, y };
}
This would have simplified all sorts of array/matrix sw where both
errors (NaN) and missing (None) items are possible.
In what ways would None behave differently from SNaN?
It would be transparently ignored in reductions, with zero overhead.
Terje Mathisen wrote:
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
This would have simplified all sorts of array/matrix sw where both
errors (NaN) and missing (None) items are possible.
In what ways would None behave differently from SNaN?
It would be transparently ignored in reductions, with zero overhead.
I'm used to the Mill None, where a store becomes a NOP, a mul behaves
There is also the behavior with operators - how is it different from xNaN?
xNaN behaves like an error and poisons any calculation it is in,
which is also how SQL behaves wrt NULL values:
a value + xNaN => xNaN
a value * xNaN => xNaN
whereas Null is typically thought of as a missing value:
a value + Null => value?
a value * Null => 0?
It could also have different operator instruction options that select different behaviors similar to rounding mode or exception handling bits.
All those option bits would take up a lot of instruction space.
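The two semantics being contrasted can be sketched in software terms
(Python's `None` standing in for the hypothetical hardware missing-value
marker; an illustration, not anyone's proposed hardware behavior):

```python
import math

def sum_nan(xs):
    # NaN-as-error: a single NaN poisons the whole reduction
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_none(xs):
    # None-as-missing: transparently skipped in the reduction
    total = 0.0
    for x in xs:
        if x is not None:
            total += x
    return total

assert math.isnan(sum_nan([1.0, float("nan"), 2.0]))  # error semantics
assert sum_none([1.0, None, 2.0]) == 3.0              # missing semantics
```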
EricP <ThatWouldBeTelling@thevillage.com> posted:
Terje Mathisen wrote:
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
This would have simplified all sorts of array/matrix sw where both
errors (NaN) and missing (None) items are possible.
In what ways would None behave differently from SNaN?
It would be transparently ignored in reductions, with zero overhead.
There is also the behavior with operators - how is it different from
xNan? xNan behaves like an error and poisons any calculation it is in,
which is also how SQL behaves wrt NULL values:
value + xNan => xNan
value * xNan => xNan
whereas Null is typically thought of as a missing value:
value + Null => value?
value * Null => 0?
I think they would want::
value + xNaN => <same> xNaN
value - xNaN => <same> xNaN
value / xNaN => <same> xNaN
xNaN / value => <same> xNaN
Where the non-existent operand is treated as turning the calculation
into a copy of the xNaN. Some architectures put a payload into the
xNaN, such as a 3-bit code for why the xNaN was created; others also
add IP<low-bits> to help identify the instruction where the xNaN first
occurred.
It could also have different operator instruction options that select
different behaviors similar to rounding mode or exception handling bits.
All those option bits would take up a lot of instruction space.
If your math goes NaN, it means math was wrong.
But, if the math is not wrong, there are no NaN's.
EricP wrote:
Terje Mathisen wrote:
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
snip
It could also have different operator instruction options that select
different behaviors similar to rounding mode or exception handling bits.
All those option bits would take up a lot of instruction space.
I'm used to the Mill None, where a store becomes a NOP, a mul behaves
like x * 1 (or a NOP), same for other operations.
Terje
On 8/23/2025 5:59 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
But, the issue is partly asking what exactly Denormals and Inf/NaN tend
to meaningfully offer to calculations...
A slow and meaningful death instead of a quick and unsightly death.
If your math goes NaN, it means math was wrong.
But, if the math is not wrong, there are no NaN's.
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
This would have simplified all sorts of array/matrix sw where both
errors (NaN) and missing (None) items are possible.
In what ways would None behave differently from SNaN?
It would be transparently ignored in reductions, with zero overhead.
In matrix calculations I simply padded matrices with zeros.
MitchAlsup wrote:
Exceptions themselves are inexpensive. Misuses of exceptions for 'fixup
snip
I think they would want::
value + xNaN => <same> xNaN
value - xNaN => <same> xNaN
value / xNaN => <same> xNaN
xNaN / value => <same> xNaN
Where the non-existent operand is treated as turning the calculation
into a copy of the xNaN. Some architectures put a payload into the
xNaN, such as a 3-bit code for why the xNaN was created; others also
add IP<low-bits> to help identify the instruction where the xNaN first
occurred.
That's how Nan propagation works now, to poison the calculation.
The Nan propagation rules were designed back when people thought
that using traps for fixing individual calculations was a good idea.
That way Nan could serve as either an error or missing value
and your exception handler could customize the behavior you want.
"6.2.3 NaN propagation
An operation that propagates a NaN operand to its result and has a
single NaN as an input should produce a NaN with the payload of the
input NaN if representable in the destination format.
If two or more inputs are NaN, then the payload of the resulting NaN
should be identical to the payload of one of the input NaNs if
representable in the destination format. This standard does not
specify which of the input NaNs will provide the payload."
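The payload machinery in 6.2.3 can be poked at from software. A hedged
sketch of the binary64 bit layout (the sketch only asserts on bits it
constructs itself, since whether an arithmetic result keeps the payload
is the standard's "should", not a "shall"):

```python
import struct

def make_qnan(payload):
    # Quiet binary64 NaN: exponent all ones, quiet bit set,
    # `payload` carried in the remaining 51 fraction bits
    assert 0 <= payload < (1 << 51)
    bits = 0x7FF8000000000000 | payload
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def payload_of(x):
    # Recover the low 51 fraction bits of a double
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & ((1 << 51) - 1)

n = make_qnan(42)
assert n != n               # a NaN compares unequal to itself
assert payload_of(n) == 42  # the payload is there in the bits
# Whether n + 1.0 still carries payload 42 is the "should" above:
# common x86-64 and AArch64 hardware propagates it, but the standard
# does not require it.
```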
Traps are expensive for pipelines, vectors, gpu's, so I'd want
None to behave differently - I'm just not sure what.
And I recognize (below) that there may be different ways that users
want None to behave, so suggest there might be control bits to select
among multiple None behaviors on each instruction.
It could also have different operator instruction options that
select different behaviors similar to rounding mode or exception
handling bits. All those option bits would take up a lot of
instruction space.
On Sun, 24 Aug 2025 01:34:41 -0500, BGB <cr88192@gmail.com> wrote:
If your math goes NaN, it means math was wrong.
But, if the math is not wrong, there are no NaN's.
Not exactly. A NaN result means that the computation has failed. It
may be due to limited precision or range rather than incorrect math.
For most use cases, a result of INF or IND[*] similarly means the
computation has failed and there is no point trying to continue.
[*] IEEE 754-2008.
Terje Mathisen wrote:
It does not!
snip
I'm used to the Mill None, where a store becomes a NOP, a mul behaves
like x * 1 (or a NOP), same for other operations.
Terje
How does Mill store a None value if they change to NOP?
Terje Mathisen wrote:
snip
I'm used to the Mill None, where a store becomes a NOP, a mul behaves
like x * 1 (or a NOP), same for other operations.
Terje
I was thinking of spreadsheet style rules for missing cells.
Something that's compatible with dsp's, simd, vector, and gpu's,
but I don't know enough about all their calculations to know the
different ways calculations handle missing values.
And there can be different None rules just like different roundings.
EricP wrote:
snip
Musing about errors...
As exceptions can be masked, one might also want to make a
distinction between generated and propagated Nan,
as well as the reason for Nan.
Something like this for Nan high order fraction bits:
- 1 bit to indicate 0=Quiet or 1=Signalled
- 1 bit to indicate 0=Generated or 1=Propagated
- 4 bits to indicate error code with
0 = Missing or None
1 = Invalid operand format
2 = Invalid operation
3 = Divide by zero
etc
If any source operand is a Nan marked Generated then the result is a Nan
with the same error code but Propagated. If multiple source operands
are Nan then some rules on how to propagate the Nan error value:
- if any is Signalled then the result is Signalled,
- if all Nan source operands are error code None then the result is None,
otherwise the error code is one of the >0 codes.
And (assuming instruction bits are free and infinitely available)
instruction bits to control how each deals with Nan source values
and how to handle each (fault, trap, propagate, substitute),
how it generates Nan (quiet, signalled) for execution errors,
and various exception condition enable flags (denorm, overflow,
underflow, inexact, etc).
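As an illustration only, the proposed tag fields could be packed into a
binary64 NaN like this (the field positions are my own choice, placed
below the hardware quiet bit so the value stays a quiet NaN on real
hardware; nothing here is a real architecture's encoding):

```python
import struct

# Sketch layout inside a binary64 NaN:
#   bit 51     : hardware quiet bit, always set -> value stays a qNaN
#   bit 50     : 0=Quiet or 1=Signalled   (the proposal's first bit)
#   bit 49     : 0=Generated, 1=Propagated
#   bits 45..48: 4-bit error code (0=None, 1=Invalid format,
#                2=Invalid operation, 3=Divide by zero, ...)
ERR_NONE, ERR_FORMAT, ERR_OP, ERR_DIVZERO = 0, 1, 2, 3

def encode_nan(signalled, propagated, code):
    bits = (0x7FF8 << 48) | (signalled << 50) | (propagated << 49) | (code << 45)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def decode_nan(x):
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return (bits >> 50) & 1, (bits >> 49) & 1, (bits >> 45) & 0xF

n = encode_nan(signalled=1, propagated=0, code=ERR_DIVZERO)
assert n != n                                  # it really is a NaN
assert decode_nan(n) == (1, 0, ERR_DIVZERO)    # fields round-trip
```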
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Greetings everyone !
I'm really leery about the idea of starting with MMU enabled,
I don't see any advantage to doing that.
scott@slp53.sl.home (Scott Lurndal) posted:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Greetings everyone !
I'm really leery about the idea of starting with MMU enabled,
I don't see any advantage to doing that.
Boot ROM is encrypted, and the MMU tables provide access to the keys.
On 8/21/2025 1:49 PM, MitchAlsup wrote:
Greetings everyone !
snip
My 66000 ISA is in "pretty good shape" having almost no changes over
the last 6 months, with only specification clarifications. So it was time
to work on the non-ISA parts.
You mention two non-ISA parts that you have been working on. I thought
I would ask you for your thoughts on another non-ISA part: timers and
clocks. Doing a "clean slate" ISA frees you from being compatible with
lots of old features that might have been the right thing to do back
then, but aren't now.
So, how many clocks/timers should a system have? What precision? How
fast does the software need to be able to access them? I presume you
need some comparators (unless you use count down to zero). Should the
comparisons be one time or recurring? What about syncing with an
external timer? There are many such decisions to make, and I am curious
as to your thinking on the subject.
If you haven't gotten around to working on this part of the system, just
say so.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 8/21/2025 1:49 PM, MitchAlsup wrote:
snip
So, how many clocks/timers should a system have?
Lots. Every major resource should have its own clock as part of its
performance counter set.
Every interruptible resource should have its own timer which is programmed
to throw interrupts to one thing or another.
What precision?
Clocks need to be as fast as the fastest event, right now that means
16 GHz since PCIe 5.0 and 6.0 use 16 GHz clock bases. But, realistically,
if you can count 0..16 events per ns, it's fine.
How fast does the software need to be able to access them?
1 instruction - then the latency of actual access.
2 instructions back-to-back to perform an ATOMIC-like read-update.
LDD Rd,[timer]
STD Rs,[timer]
I presume you need some comparators (unless you use count down to zero).
You can count down to zero, count up to zero, or use a comparator.
Zeroes cost less HW than comparators. Comparators also require
an additional register and an additional instruction at swap time.
Should the comparisons be one time or recurring?
I have no opinion at this time.
What about syncing with an external timer?
A necessity--that is what the ATOMIC-like comment above is for.
On 8/29/2025 8:26 AM, MitchAlsup wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
snip
So, how many clocks/timers should a system have?
Lots. Every major resource should have its own clock as part of its
performance counter set.
Every interruptible resource should have its own timer which is
programmed to throw interrupts to one thing or another.
So if I want to keep track of how much CPU time a task has accumulated,
someone has to save the value of the CPU clock when the task gets
interrupted or switched out. Is this done by the HW or by code in the
task switcher?
Later, when the task gets control of the CPU again,
there needs to be a mechanism to resume adding time to its saved value.
How is this done?
What precision?
Clocks need to be as fast as the fastest event, right now that means
16 GHz since PCIe 5.0 and 6.0 use 16 GHz clock bases. But, realistically,
if you can count 0..16 events per ns, it's fine.
How fast does the software need to be able to access them?
1 instruction-then the latency of actual access.
2 instructions back-to-back to perform an ATOMIC-like read-update.
LDD Rd,[timer]
STD Rs,[timer]
Good.
I presume you need some comparators (unless you use count down to zero).
You can count down to zero, count up to zero, or use a comparator.
Zeroes cost less HW than comparators. Comparators also require
an additional register and an additional instruction at swap time.
Should the comparisons be one time or recurring?
I have no opinion at this time.
What about syncing with an external timer?
A necessity--that is what the ATOMIC-like comment above is for.
Got it.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 8/29/2025 8:26 AM, MitchAlsup wrote:
snip
So if I want to keep track of how much CPU time a task has accumulated,
someone has to save the value of the CPU clock when the task gets
interrupted or switched out. Is this done by the HW or by code in the
task switcher?
Task switcher.
Later, when the task gets control of the CPU again,
there needs to be a mechanism to resume adding time to its saved value.
How is this done?
One reads the old performance counters:
LDM Rd,Rd+7,[MMIO+R29]
and saves with Thread:
STM Rd,Rd+7,[Thread+128]
Side Note: LDM and STM are ATOMIC wrt the 8 doublewords loaded/stored.
To put them back:
LDM Rd,Rd+7,[Thread+128]
STM Rd,Rd+7,[MMIO+R29]
On 8/29/2025 12:31 PM, MitchAlsup wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
snip
So if I want to keep track of how much CPU time a task has accumulated,
someone has to save the value of the CPU clock when the task gets
interrupted or switched out. Is this done by the HW or by code in the
task switcher?
Task switcher.
Later, when the task gets control of the CPU again,
there needs to be a mechanism to resume adding time to its saved value.
How is this done?
One reads the old performance counters:
LDM   Rd,Rd+7,[MMIO+R29]
and saves with Thread:
STM   Rd,Rd+7,[Thread+128]
Side Note: LDM and STM are ATOMIC wrt the 8 doublewords loaded/stored.
To put them back:
LDM   Rd,Rd+7,[Thread+128]
STM   Rd,Rd+7,[MMIO+R29]
I think there is a problem with that. Let's say there are two OSs
running under a hypervisor, and I want to collect CPU time for an
application running under one of those OSs. Now consider a timer
interrupt to the hypervisor that causes it to switch out the OS that our
program is running under and switch in the other OS. The mechanism you
described takes care of getting the correct CPU time for the OS that is
switched out, but I don't think it "switches out" the application
program, so the application's CPU time is too high (it includes the time
spent in the other OS). I don't know how other systems with hypervisors
handle this situation.