According to MitchAlsup1 <mitchalsup@aol.com>:
Stacks are small because OS people make them small, not because of
a valid technical reason that has ever been explained to me.
"To avoid infinite recursion" is not a valid reason, IMHO.
Algol 60 only had stack allocation for dynamically sized arrays,
so stacks had to be as big as the data are.
Huh? Algol 60 routines could be mutually recursive so unless it was
a leaf procedure or the outer block, everything not declared "own"
went on the stack.
John Levine <johnl@taugh.com> writes:
According to MitchAlsup1 <mitchalsup@aol.com>:
Stacks are small because OS people make them small, not because of
a valid technical reason that has ever been explained to me.
"To avoid infinite recursion" is not a valid reason, IMHO.
Algol 60 only had stack allocation for dynamically sized arrays,
so stacks had to be as big as the data are.
Huh? Algol 60 routines could be mutually recursive so unless it was
a leaf procedure or the outer block, everything not declared "own"
went on the stack.
For some flavors of Algol _everything_ was on the stack.
(e.g. B5500 and successors).
Algol 60 only had stack allocation for dynamically sized arrays,
so stacks had to be as big as the data are.
Huh? Algol 60 routines could be mutually recursive so unless it was
a leaf procedure or the outer block, everything not declared "own"
went on the stack.
Mitch's point AIUI was that Algol 60 had no heap allocation (and no
explicit pointer types), so indeed all data were either on the stack or statically allocated.
On 2025-01-18 5:08, John Levine wrote:
According to MitchAlsup1 <mitchalsup@aol.com>:
Stacks are small because OS people make them small, not because of
a valid technical reason that has ever been explained to me.
"To avoid infinite recursion" is not a valid reason, IMHO.
Algol 60 only had stack allocation for dynamically sized arrays,
so stacks had to be as big as the data are.
Huh? Algol 60 routines could be mutually recursive so unless it was
a leaf procedure or the outer block, everything not declared "own"
went on the stack.
Mitch's point AIUI was that Algol 60 had no heap allocation (and no
explicit pointer types), so indeed all data were either on the stack or statically allocated.
I'm not an English native speaker, but it seems to me that Mitch should
have written "Algol 60 had only stack allocation" instead of "Algol 60
only had stack allocation".
The most-used Ada compiler, GNAT, uses a "secondary stack" to reduce the
need for heap. Dynamically sized local data are placed on the secondary stack, and dynamically sized return values of functions are returned on
the secondary stack. So a function can return "by value" an array sized
1..N, with N a function parameter, without needing the heap.
Of course the programmer then has the problem of setting sufficient
sizes for /two/ stacks, the primary and the secondary. For
embedded-systems programs one usually avoids constructs that would need
a secondary stack.
A two-stack setup can be used in C too. (The C standards don't require
a stack at all.) On the AVR microcontroller, it is not uncommon for C implementations to work with a dual stack, since it does not have any
kind of "[SP + n]" or "[SP + r]" addressing modes, but it /does/ have an
"[Y + n]" addressing mode using an index register.
Two stacks are also pretty much required for FORTH.
The use of a dual stack could also significantly improve the security of systems by separating call/return addresses from data.
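A minimal C sketch of that separation (assumed names and sizes, not any particular compiler's or GNAT's actual scheme): return addresses stay on the hardware call stack, while variable-sized data, including a dynamically sized "returned" array, lives on a software-managed data stack.

#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software "data stack", kept apart from the hardware
   call/return stack. */
static alignas(max_align_t) uint8_t data_stack[4096];
static size_t data_sp = 0;                 /* data stack grows upward */

static void *ds_alloc(size_t n)
{
    n = (n + alignof(max_align_t) - 1) & ~(alignof(max_align_t) - 1);
    if (data_sp + n > sizeof data_stack)
        return NULL;                       /* overflow handling left to the caller */
    void *p = &data_stack[data_sp];
    data_sp += n;
    return p;
}

static void ds_release(size_t mark) { data_sp = mark; }

/* "Return" a dynamically sized array 1..n by value, in the spirit of the
   GNAT secondary stack: the result lives on the data stack and the caller
   pops it when done, so no heap is involved. */
static int *iota(size_t n)
{
    int *a = ds_alloc(n * sizeof *a);
    for (size_t i = 0; a != NULL && i < n; i++)
        a[i] = (int)i + 1;
    return a;
}

void example(size_t n)
{
    size_t mark = data_sp;                 /* remember the caller's mark */
    int *v = iota(n);                      /* call/return info stays on the HW stack */
    (void)v;                               /* ... use v ... */
    ds_release(mark);                      /* one assignment frees it all */
}

On an AVR-style target the data-stack pointer would typically be kept in the Y register so that the "[Y + n]" addressing mode reaches the data, as noted above.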
On 18/01/2025 09:59, Niklas Holsti wrote:
The most-used Ada compiler, GNAT, uses a "secondary stack" to reduce
the need for heap. Dynamically sized local data are placed on the
secondary stack, and dynamically sized return values of functions are
returned on the secondary stack. So a function can return "by value"
an array sized 1..N, with N a function parameter, without needing the
heap.
Of course the programmer then has the problem of setting sufficient
sizes for /two/ stacks, the primary and the secondary. For
embedded-systems programs one usually avoids constructs that would
need a secondary stack.
A two-stack setup can be used in C too. (The C standards don't require
a stack at all.) On the AVR microcontroller, it is not uncommon for C implementations to work with a dual stack, since it does not have any
kind of "[SP + n]" or "[SP + r]" addressing modes, but it /does/ have an
"[Y + n]" addressing mode using an index register.
On 2025-01-19 18:33, David Brown wrote:
On 18/01/2025 09:59, Niklas Holsti wrote:
[...]
The most-used Ada compiler, GNAT, uses a "secondary stack" to reduce
the need for heap. Dynamically sized local data are placed on the
secondary stack, and dynamically sized return values of functions are
returned on the secondary stack. So a function can return "by value"
an array sized 1..N, with N a function parameter, without needing the
heap.
Of course the programmer then has the problem of setting sufficient
sizes for /two/ stacks, the primary and the secondary. For
embedded-systems programs one usually avoids constructs that would
need a secondary stack.
A two-stack setup can be used in C too. (The C standards don't
require a stack at all.) On the AVR microcontroller, it is not
uncommon for C implementations to work with a dual stack, since it
does not have any kind of "[SP + n]" or "[SP + r]" addressing modes,
but it /does/ have an "[Y + n]" addressing mode using an index register.
Yes. Other C compilers use a single stack but use Y as a frame pointer
so they can use "[Y + n]" to access stack-frame locations.
The issue is more acute for 8051/MCS-51 systems where the call/return
stack is in the very small "internal" RAM, so C compilers often allocate
a larger "SW stack" for stack data in the larger "external" RAM. But
they do so only for potentially recursive or reentrant functions, and
instead use statically allocated space for the call-frames of other
functions (with smart whole-program analysis to share such space for functions that can never be active at the same time).
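A small C illustration of the static-overlay idea (hypothetical functions, not any specific 8051 toolchain's output): two routines that whole-program analysis shows can never be active at the same time get their "frames" placed in the same statically allocated block.

#include <stdint.h>
#include <string.h>

/* foo() and bar() never call each other, are not called from interrupts,
   and are not reentrant, so their locals can overlay one static block
   instead of living on a runtime stack. */
static union {
    struct { uint8_t buf[16]; uint8_t len; } foo_frame;
    struct { int16_t acc; int16_t count;   } bar_frame;
} overlay;

void foo(const uint8_t *src, uint8_t n)
{
    overlay.foo_frame.len = (n < 16) ? n : 16;
    memcpy(overlay.foo_frame.buf, src, overlay.foo_frame.len);
}

int16_t bar(int16_t step)
{
    overlay.bar_frame.acc = 0;
    for (overlay.bar_frame.count = 0; overlay.bar_frame.count < 10;
         overlay.bar_frame.count++)
        overlay.bar_frame.acc += step;
    return overlay.bar_frame.acc;
}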
On Sun, 19 Jan 2025 18:28:40 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
In My 66000 the code cannot read/write that other stack with LD and ST
instructions. It can only be accessed by ENTER (stores) and EXIT
(LDs). The mapping PTE is marked RWE = 000.
So, while you can still overrun buffers, you cannot damage the call/
return stack or the preserved registers !!
Not that I am a specialist in GC, but according to my understanding
the most common and the best performing variants of GC can not work
without read access to preserved registers. Compacting collector seems
to need write access as well.
As to return addresses, I would think that read access to stack of
return addresses is necessary for exception handling.
Michael S <already5chosen@yahoo.com> wrote:
On Sun, 19 Jan 2025 18:28:40 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
In My 66000 the code cannot read/write that other stack with LD and ST
instructions. It can only be accessed by ENTER (stores) and EXIT
(LDs). The mapping PTE is marked RWE = 000.
So, while you can still overrun buffers, you cannot damage the call/
return stack or the preserved registers !!
Not that I am a specialist in GC, but according to my understanding
the most common and the best performing variants of GC can not work
without read access to preserved registers. Compacting collector seems
to need write access as well.
As to return addresses, I would think that read access to stack of
return addresses is necessary for exception handling.
_Correctness_ of GC depends on ability to see preserved registers
and return address: return address may be the only live reference
to some function and similarly for preserved registers. One could
try to work around lack of access using separate software-managed
stack duplicating data from "hardware" stack, but that is ugly
and is likely to kill any performance advantage from hardware
features.
BTW, the same holds for debuggers and exception handling. Those
clearly need some way to go around hardware limitations.
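For concreteness, a hedged C sketch of a conservative root scan, which is roughly what such a collector has to do; the heap-bounds test and mark routine are stand-ins, not any real GC's API. If the region holding return addresses and preserved registers is mapped RWE = 000, the scan below simply cannot reach references held only there.

#include <setjmp.h>
#include <stdint.h>

static uintptr_t heap_lo, heap_hi;          /* assumed heap bounds */

static int  is_heap_pointer(uintptr_t v) { return v >= heap_lo && v < heap_hi; }
static void mark_object(uintptr_t v)     { (void)v; /* real GC marks the object */ }

/* Walk a memory range and treat anything that looks like a heap pointer
   as a live reference. */
static void scan_range(const uintptr_t *lo, const uintptr_t *hi)
{
    for (const uintptr_t *p = lo; p < hi; p++)
        if (is_heap_pointer(*p))
            mark_object(*p);
}

/* Roots of one thread: spilled callee-saved (preserved) registers plus the
   whole stack region [stack_lo, stack_hi), frames and return addresses included. */
static void scan_thread_roots(const uintptr_t *stack_lo, const uintptr_t *stack_hi)
{
    jmp_buf regs;
    setjmp(regs);                           /* spill callee-saved registers here */
    scan_range((const uintptr_t *)&regs, (const uintptr_t *)(&regs + 1));
    scan_range(stack_lo, stack_hi);
}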
On Mon, 20 Jan 2025 11:12:54 +0000, Waldek Hebisch wrote:
Michael S <already5chosen@yahoo.com> wrote:
On Sun, 19 Jan 2025 18:28:40 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
In My 66000 the code cannot read/write that other stack with LD
and ST instructions. It can only be accessed by ENTER (stores)
and EXIT (LDs). The mapping PTE is marked RWE = 000.
So, while you can still overrun buffers, you cannot damage the
call/ return stack or the preserved registers !!
Not that I am a specialist in GC, but according to my understanding
the most common and the best performing variants of GC can not work
without read access to preserved registers. Compacting collector
seems to need write access as well.
As to return addresses, I would think that read access to stack of
return addresses is necessary for exception handling.
_Correctness_ of GC depends on ability to see preserved registers
and return address: return address may be the only live reference
to some function and similarly for preserved registers. One could
try to work around lack of access using separate software-managed
stack duplicating data from "hardware" stack, but that is ugly
and is likely to kill any performance advantage from hardware
features.
BTW, the same holds for debuggers and exception handling. Those
clearly need some way to go around hardware limitations.
Yes, there is a way to do all those things, but I am not in a position
to discuss due to USPTO rules.
It is like there is a privilege level between application and GuestOS.
{{I spent all afternoon trying to think of a name for this privilege
above application "non-privileged" and below "privileged". Maybe meso-privileged ?!?
I/O works similarly--in that to the application a page may be marked
RWE=001 (execute only) but the swap disk is allowed to read or write
those pages.
On Mon, 20 Jan 2025 22:05:10 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
On Mon, 20 Jan 2025 11:12:54 +0000, Waldek Hebisch wrote:
Michael S <already5chosen@yahoo.com> wrote:
On Sun, 19 Jan 2025 18:28:40 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
In My 66000 the code cannot read/write that other stack with LD
and ST instructions. It can only be accessed by ENTER (stores)
and EXIT (LDs). The mapping PTE is marked RWE = 000.
So, while you can still overrun buffers, you cannot damage the
call/ return stack or the preserved registers !!
Not that I am a specialist in GC, but according to my understanding
the most common and the best performing variants of GC can not work
without read access to preserved registers. Compacting collector
seems to need write access as well.
As to return addresses, I would think that read access to stack of
return addresses is necessary for exception handling.
_Correctness_ of GC depends on ability to see preserved registers
and return address: return address may be the only live reference
to some function and similarly for preserved registers. One could
try to work around lack of access using separate software-managed
stack duplicating data from "hardware" stack, but that is ugly
and is likely to kill any performance advantage from hardware
features.
BTW, the same holds for debuggers and exception handling. Those
clearly need some way to go around hardware limitations.
Yes, there is a way to do all those things, but I am not in a position
to discuss due to USPTO rules.
It is like there is a privilege level between application and GuestOS.
{{I spent all afternoon trying to think of a name for this privilege
above application "non-privileged" and below "privileged". Maybe
meso-privileged ?!?
Call it 'user'. Then rename the level that you now call 'application'
to 'sandbox'.
I/O works similarly--in that to the application a page may be marked
RWE=001 (execute only) but the swap disk is allowed to read or write
those pages.
On 20 Jan 2025, MitchAlsup1 wrote
(in article<43e21bd0bddea1733cd672c07a6319d4@www.novabbs.org>):
It is like there is a privilege level between application and GuestOS.
{{I spent all afternoon trying to think of a name for this privilege
above application "non-privileged" and below "privileged". Maybe
meso-privileged ?!?
Entitled? 8-)
On Mon, 3 Feb 2025 22:47:24 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 3 Feb 2025 21:13:24 +0000, Scott Lurndal wrote:
Stefan Monnier <monnier@iro.umontreal.ca> writes:
It is like there is a privilege level between application and
GuestOS.
{{I spent all afternoon trying to think of a name for this privilege
above application "non-privileged" and below "privileged". Maybe
meso-privileged ?!?
handyman?
Application -> Library -> OS -> Hypervisor -> Secure Monitor
{Sandbox -> user -> application -> Library} ->{sual}×{GuestOS, HV, SM}
??
You need to precisely define your terms. What are sandbox
and user in this context?
It is all about manipulating access rights without modifying
what is stored in the TLB (so you don't have to reload any
entries to change access rights.) It is sort of like what
the G-bit does (global) {except in my architecture globality
is controlled by ASID.}
Sandbox is a privilege level where one cannot be granted both
write and execute access at the same time. There may be other
restrictions, too, such as access to control registers that user may
be allowed to write.
Library would include all the trusted stuff, but also ld.so
and any JITs. JITs can only create code for sandboxes. So,
JIT can write to JITcache but sandbox cannot using the same
PTE entry. ld.so can write GOT while user and application
cannot write GOT (or execute GOT).
User is the privilege level where sandbox does not apply but
also there is no ability to over-access things protected by
PTE.RWE.
Application is a privilege level where PTE.RWE can sometimes
be usurped--such as DMA from a device needing to write into
an execute-only page.
Where does memmove() come from if not the library ??
Libraries have a SW-kind of trust even if they are
devoid of HW kinds of trust (PTE.RWE overrides).
But these levels are just talking points at this point.
=====================================
For the present day we would want REW access control.
Naively this would require 4*3 = 12 bits in each PTE.
If we apply the rules:
- we only need a meaningful subset of combinations: na, R, RE, RW, REW.
- no higher priv level can have less access than a lower priv level.
- we can save 1 combo because all 4 priv levels = na is redundant
with the PTE Present bit being clear.
we can get this all down to a 4-bit PTE field:
Usr Sup Exc Krn
--- --- --- ---
na na na R
na na na RE
na na na RW
na na na REW
na na R R
na na RE RE
....
R R R R
....
REW REW REW REW
The core's (thread's) privilege mode would enable access to the pages.
The PTE's access control field, which is derived from the kind of
mapped memory section, would not have to change between different threads.
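A hedged C sketch of what a hardwired decode of that 4-bit field could look like; only the rows spelled out above are filled in, and the index positions, like all the names, are illustrative assumptions rather than a proposed encoding.

#include <stdint.h>

enum { NA = 0, W = 1, E = 2, R = 4, RE = R | E, RW = R | W, REW = R | E | W };

/* Allowed access per privilege level (Usr, Sup, Exc, Krn), indexed by the
   4-bit PTE access-control field.  Unlisted codes would be filled the same
   way, never giving a lower level more access than a higher one. */
static const uint8_t decode[16][4] = {
    /*        Usr  Sup  Exc  Krn */
    [0]  = {  NA,  NA,  NA,  R   },
    [1]  = {  NA,  NA,  NA,  RE  },
    [2]  = {  NA,  NA,  NA,  RW  },
    [3]  = {  NA,  NA,  NA,  REW },
    [4]  = {  NA,  NA,  R,   R   },
    [5]  = {  NA,  NA,  RE,  RE  },
    /* ... */
    [14] = {  R,   R,   R,   R   },
    [15] = {  REW, REW, REW, REW },
};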
On Mon, 3 Feb 2025 22:47:24 +0000, Scott Lurndal wrote:
User is the privilege level where sandbox does not apply but
also there is no ability to over-access things protected by
PTE.RWE.
Application is a privilege level where PTE.RWE can sometimes
be usurped--such as DMA from a device needing to write into
an execute-only page.
Where does memmove() come from if not the library ??
Libraries have a SW-kind of trust even if they are
devoid of HW kinds of trust (PTE.RWE overrides).
But these levels are just talking points at this point.
The hypervisor is optional, as would be a library.
It cannot be a library of process !!
It is not a library of GuestOS !
It is certainly not a library of Secure Monitor !!
The Burroughs Large systems and HP-3000 segmented libraries
were distinct entities with attributes.
And could change (update/upgrade) the library while the process
was running !!
Code in a library could be more privileged than the application
when acting on behalf of the application, for example; but the
application could not take advantage of the permissions assigned
to the library it was linked with without using interfaces
provided by the library.
No disagreement.
MitchAlsup1 wrote:
But these levels are just talking points at this point.
It sounds like you want something like the VAX privilege/protection
mechanism.
It had 4 privilege levels: User, Supervisor, Executive, Kernel.
Each PTE grants R, RW or na (no-access) rights for each priv level.
(Read access implied Execute)
Naively that would take 4*2 = 8 bits in each 32-bit PTE.
However they reduce the combinations with a simple set of rules:
- if any priv level has read access then higher levels have read also.
- if any priv level has write access then higher levels have write also.
That brings the PTE access control field down to 4 bits for all
four priv levels.
For comparison, x64 PTE has 3 bits for 2 priv levels.
EricP wrote:
=====================================
For the present day we would want REW access control.
Naively this would require 4*3 = 12 bits in each PTE.
If we apply the rules:
- we only need a meaningful subset of combinations: na, R, RE, RW, REW.
- no higher priv level can have less access than a lower priv level.
- we can save 1 combo because all 4 priv levels = na is redundant
with the PTE Present bit being clear.
we can get this all down to a 4-bit PTE field:
Usr Sup Exc Krn
--- --- --- ---
na na na R
na na na RE
na na na RW
na na na REW
na na R R
na na RE RE
....
R R R R
....
REW REW REW REW
The core's (thread's) privilege mode would enable access to the pages.
The PTE's access control field, which is derived from the kind of
mapped memory section, would not have to change between different
threads.
Or if you want the flexibility to choose your own REW combinations,
the 4-bit PTE access control field is an index to a 16 entry array
of 12-bit values for the four privilege levels.
That's better because then the OS can decide how it wants
the different memory sections and thread to behave and
removes the strict hardwired hierarchy of the prior rules.
The next problem though might be finding 4 bits in the PTE.
On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:
EricP wrote:
=====================================
For the present day we would want REW access control.
Naively this would require 4*3 = 12 bits in each PTE.
If we apply the rules:
- we only need a meaningful subset of combinations: na, R, RE, RW, REW.
- no higher priv level can have less access than a lower priv level.
- we can save 1 combo because all 4 priv levels = na is redundant
with the PTE Present bit being clear.
we can get this all down to a 4-bit PTE field:
Usr Sup Exc Krn
--- --- --- ---
na na na R
na na na RE
na na na RW
na na na REW
na na R R
na na RE RE
....
R R R R
....
REW REW REW REW
The core's (thread's) privilege mode would enable access to the pages.
The PTE's access control field, which is derived from the kind of
mapped memory section, would not have to change between different
threads.
Or if you want the flexibility to choose your own REW combinations,
the 4-bit PTE access control field is an index to a 16 entry array
of 12-bit values for the four privilege levels.
That's better because then the OS can decide how it wants
the different memory sections and thread to behave and
removes the strict hardwired hierarchy of the prior rules.
The next problem though might be finding 4 bits in the PTE.
Another PTE bit I can find. Placing the 16×12 vector is more difficult,
even when I position it as 4 places of 16×3.
MitchAlsup1 wrote:
On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:
EricP wrote:
=====================================
For the present day we would want REW access control.
Naively this would require 4*3 = 12 bits in each PTE.
If we apply the rules:
- we only need a meaningful subset of combinations: na, R, RE, RW, REW.
- no higher priv level can have less access than a lower priv level.
- we can save 1 combo because all 4 priv levels = na is redundant
with the PTE Present bit being clear.
we can get this all down to a 4-bit PTE field:
Usr Sup Exc Krn
--- --- --- ---
na na na R
na na na RE
na na na RW
na na na REW
na na R R
na na RE RE
....
R R R R
....
REW REW REW REW
The core's (thread's) privilege mode would enable access to the pages.
The PTE's access control field, which is derived from the kind of
mapped memory section, would not have to change between different
threads.
Or if you want the flexibility to choose your own REW combinations,
the 4-bit PTE access control field is an index to a 16 entry array
of 12-bit values for the four privilege levels.
That's better because then the OS can decide how it wants
the different memory sections and thread to behave and
removes the strict hardwired hierarchy of the prior rules.
The next problem though might be finding 4 bits in the PTE.
Another PTE bit I can find. Placing the 16×12 vector is more difficult,
even when I position it as 4 places of 16×3.
I don't understand what you said.
The 4-bit Access Control (AC) field is in the PTE.
The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
is in the MMU.
The core's 2-bit mode selects-muxes one of the 3-bit allowed access
fields from the indexed 12-bits to extract the 3 R-E-W bits.
The 2-bit mode comes from the LD/ST uOp, which was set to the mode
active when the instruction was decoded (so it can pipeline mode
changes).
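A minimal C model of that path, under the stated assumptions (a 16-row table of four packed 3-bit R-E-W fields, indexed by the PTE's 4-bit AC field and muxed by the 2-bit mode carried on the uOp); the names, bit packing and mode numbering are illustrative, not a real MMU interface.

#include <stdbool.h>
#include <stdint.h>

#define R_BIT 0x4u   /* read    */
#define E_BIT 0x2u   /* execute */
#define W_BIT 0x1u   /* write   */

/* 16 rows x 12 bits: one 3-bit R-E-W field per privilege mode,
   mode 0 in bits [2:0], mode 1 in [5:3], and so on (an assumption). */
static uint16_t ac_table[16];

/* mode: 2-bit privilege level from the LD/ST uOp.
   ac:   4-bit access-control field from the PTE.
   req:  requested access, an OR of R_BIT/E_BIT/W_BIT. */
static bool access_allowed(unsigned mode, unsigned ac, unsigned req)
{
    unsigned allowed = (ac_table[ac & 0xF] >> (3u * (mode & 0x3))) & 0x7u;
    return (req & ~allowed) == 0;   /* every requested bit must be granted */
}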
On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
MitchAlsup1 wrote:
On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:
EricP wrote:
=====================================
For the present day we would want REW access control.
Naively this would require 4*3 = 12 bits in each PTE.
If we apply the rules:
- we only need a meaningful subset of combinations: na, R, RE, RW, REW.
- no higher priv level can have less access than a lower priv level.
- we can save 1 combo because all 4 priv levels = na is redundant
with the PTE Present bit being clear.
we can get this all down to a 4-bit PTE field:
Usr Sup Exc Krn
--- --- --- ---
na na na R
na na na RE
na na na RW
na na na REW
na na R R
na na RE RE
....
R R R R
....
REW REW REW REW
The core's (thread's) privilege mode would enable access to the pages.
The PTE's access control field, which is derived from the kind of
mapped memory section, would not have to change between different
threads.
Or if you want the flexibility to choose your own REW combinations,
the 4-bit PTE access control field is an index to a 16 entry array
of 12-bit values for the four privilege levels.
That's better because then the OS can decide how it wants
the different memory sections and thread to behave and
removes the strict hardwired hierarchy of the prior rules.
The next problem though might be finding 4 bits in the PTE.
Another PTE bit I can find. Placing the 16×12 vector is more difficult,
even when I position it as 4 places of 16×3.
I don't understand what you said.
The 4-bit Access Control (AC) field is in the PTE.
Currently, PTE uses a 3-bit access control field, and PTE has
2-bits spare. So making access control larger is easy.
The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
is in the MMU.
How does 16×12 get to the MMU (or I/O MMU) ?? in such a way that super cannot see things Hyper can see and the same with secure. So, somewhere
in the various control blocks I need to find space without changing
the overall use pattern of the control blocks and tables. Which is
why I alluded to 4×16×3: each interpretation of the 4-bit access control
is stored in its own natural place. It also means each layer can apply
its own interpretation (mapping).
The core's 2-bit mode selects-muxes one of the 3-bit allowed access
fields from the indexed 12-bits to extract the 3 R-E-W bits.
That much is straightforward.
The 2-bit mode comes from the LD/ST uOp, which was set to the mode
active when the instruction was decoded (so it can pipeline mode
changes).
Yes, core-state index follows the memref down the pipe.
Core-state index is written into MMI/O/device control block for the
DMA portion of a command; other CD indexes are associated with I/O
page faults and device errors.
MitchAlsup1 wrote:
On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
MitchAlsup1 wrote:
On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:
EricP wrote:
=====================================
For the present day we would want REW access control.
Naively this would require 4*3 = 12 bits in each PTE.
If we apply the rules:
- we only need a meaningful subset of combinations: na, R, RE, RW, REW.
- no higher priv level can have less access than a lower priv level.
- we can save 1 combo because all 4 priv levels = na is redundant
with the PTE Present bit being clear.
we can get this all down to a 4-bit PTE field:
Usr Sup Exc Krn
--- --- --- ---
na na na R
na na na RE
na na na RW
na na na REW
na na R R
na na RE RE
....
R R R R
....
REW REW REW REW
The core's (thread's) privilege mode would enable access to the
pages.
The PTE's access control field, which is derived from the kind of
mapped memory section, would not have to change between different
threads.
Or if you want the flexibility to choose your own REW combinations,
the 4-bit PTE access control field is an index to a 16 entry array
of 12-bit values for the four privilege levels.
That's better because then the OS can decide how it wants
the different memory sections and thread to behave and
removes the strict hardwired hierarchy of the prior rules.
The next problem though might be finding 4 bits in the PTE.
Another PTE bit I can find. Placing the 16×12 vector is more difficult,
even when I position it as 4 places of 16×3.
I don't understand what you said.
The 4-bit Access Control (AC) field is in the PTE.
Currently, PTE uses a 3-bit access control field, and PTE has
2-bits spare. So making access control larger is easy.
The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
is in the MMU.
How does 16×12 get to the MMU (or I/O MMU) ?? in such a way that super
cannot see things Hyper can see and the same with secure. So, somewhere
in the various control blocks I need to find space without changing
the overall use pattern of the control blocks and tables. Which is
why I alluded to 4×16×3: each interpretation of the 4-bit access control
is stored in its own natural place. It also means each layer can apply
its own interpretation (mapping).
It's just an SRAM loaded by the boot ROM before the Hypervisor boots.
The super-secure version of boot ROM loads a table with values
(Sandbox, User, Kernel, Hypervisor):
Snd Usr Krn Hyp
na na na R
na na na RE
na na na RW
na na na REW
na na R na
na na RE na
na na RW na
na na REW na
na R R na
na RE RE na
...
REW REW REW na
which grants mode 0 (Hyp) no direct RW access to any memory outside itself.
Boot ROM sets an optional table lock so even hypervisor cannot later
grant itself access permission to less priv memory by changing the table.
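Continuing the same hypothetical sketch on the boot-ROM side: fill the 16 rows, then set a one-way lock so that later writes, even from hypervisor mode, have no effect. The names and the lock flag are assumptions for illustration.

#include <stdbool.h>
#include <stdint.h>

static uint16_t ac_table[16];        /* the 16 x 12-bit table from the earlier sketch */
static bool     ac_table_locked;

bool ac_table_write(unsigned index, uint16_t row)
{
    if (ac_table_locked)
        return false;                /* table is sealed; the write has no effect */
    ac_table[index & 0xF] = row & 0x0FFF;
    return true;
}

void ac_table_seal(void)
{
    ac_table_locked = true;          /* one-way: only a reset clears the lock */
}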
The core's 2-bit mode selects-muxes one of the 3-bit allowed access
fields from the indexed 12-bits to extract the 3 R-E-W bits.
That much is straightforward.
The 2-bit mode comes from the LD/ST uOp, which was set to the mode
active when the instruction was decoded (so it can pipeline mode
changes).
Yes, core-state index follows the memref down the pipe.
Core-state index is written into MMI/O/device control block for the
DMA portion of a command, other CD indexes are associated with I/O
page faults and device errors.
Not sure how this would work with device IO and DMA.
Say a secure kernel that owns a disk drive with secrets that even the HV
is not authorized to see (so HV operators don't need Top Secret clearance). The Hypervisor has to pass to a hardware device DMA access to a memory
frame that it has no access to itself. How does one block the HV from
setting the IOMMU to DMA the device's secrets into its own memory?
Hmmm... something like: once a secure HV passes a physical frame address
to a secure kernel then it cannot take it back, it can only ask that
kernel for it back. Which means that the HV loses control of any
core or IOMMU PTE's that map that frame until it is handed back.
That would seem to imply that once an HV gives memory to a secure
guest kernel that it can only page that guest with its permission.
Hmmm...
MitchAlsup1 wrote:
On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
-----------------------------------
Currently, PTE uses a 3-bit access control field, and PTE has
2-bits spare. So making access control larger is easy.
The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
is in the MMU.
How does 16×12 get to the MMU (or I/O MMU) ?? in such a way that super
cannot see things Hyper can see and the same with secure. So, somewhere
in the various control blocks I need to find space without changing
the overall use pattern of the control blocks and tables. Which is
why I alluded to 4×16×3: each interpretation of the 4-bit access control
is stored in its own natural place. It also means each layer can apply
its own interpretation (mapping).
It's just an SRAM loaded by the boot ROM before the Hypervisor boots.
The super-secure version of boot ROM loads a table with values
(Sandbox, User, Kernel, Hypervisor):
Snd Usr Krn Hyp
na na na R
na na na RE
na na na RW
na na na REW
na na R na
na na RE na
na na RW na
na na REW na
na R R na
na RE RE na
....
REW REW REW na
which grants mode 0 (Hyp) no direct RW access to any memory outside
itself.
Boot ROM sets an optional table lock so even hypervisor cannot later
grant itself access permission to less priv memory by changing the
table.
The core's 2-bit mode selects-muxes one of the 3-bit allowed access
fields from the indexed 12-bits to extract the 3 R-E-W bits.
That much is straightforward.
The 2-bit mode comes from the LD/ST uOp, which was set to the mode
active when the instruction was decoded (so it can pipeline mode
changes).
Yes, core-state index follows the memref down the pipe.
Core-state index is written into MMI/O/device control block for the
DMA portion of a command, other CD indexes are associated with I/O
page faults and device errors.
Not sure how this would work with device IO and DMA.
Say a secure kernel that owns a disk drive with secrets that even the HV
is not authorized to see (so HV operators don't need Top Secret
clearance).
The Hypervisor has to pass to a hardware device DMA access to a memory
frame that it has no access to itself. How does one block the HV from
setting the IOMMU to DMA the device's secrets into its own memory?
Hmmm... something like: once a secure HV passes a physical frame address
to a secure kernel then it cannot take it back, it can only ask that
kernel for it back. Which means that the HV loses control of any
core or IOMMU PTE's that map that frame until it is handed back.
That would seem to imply that once an HV gives memory to a secure
guest kernel that it can only page that guest with its permission.
Hmmm...
On 2/6/2025 10:51 AM, EricP wrote:
MitchAlsup1 wrote:
On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
MitchAlsup1 wrote:
On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:
EricP wrote:
=====================================
For the present day we would want REW access control.
Naively this would require 4*3 = 12 bits in each PTE.
If we apply the rules:
- we only need a meaningful subset of combinations: na, R, RE, RW, REW.
- no higher priv level can have less access than a lower priv level.
- we can save 1 combo because all 4 priv levels = na is redundant
with the PTE Present bit being clear.
we can get this all down to a 4-bit PTE field:
Usr Sup Exc Krn
--- --- --- ---
na na na R
na na na RE
na na na RW
na na na REW
na na R R
na na RE RE
....
R R R R
....
REW REW REW REW
The core's (thread's) privilege mode would enable access to the
pages.
The PTE's access control field, which is derived from the kind of
mapped memory section, would not have to change between different
threads.
Or if you want the flexibility to choose your own REW combinations,
the 4-bit PTE access control field is an index to a 16 entry array
of 12-bit values for the four privilege levels.
That's better because then the OS can decide how it wants
the different memory sections and thread to behave and
removes the strict hardwired hierarchy of the prior rules.
The next problem though might be finding 4 bits in the PTE.
Another PTE bit I can find. Placing the 16×12 vector is more
difficult,
even when I position it as 4 places of 16×3.
I don't understand what you said.
The 4-bit Access Control (AC) field is in the PTE.
Currently, PTE uses a 3-bit access control field, and PTE has
2-bits spare. So making access control larger is easy.
The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
is in the MMU.
How does 16×12 get to the MMU (or I/O MMU) ?? in such a way that super
cannot see things Hyper can see and the same with secure. So, somewhere
in the various control blocks I need to find space without changing
the overall use pattern of the control blocks and tables. Which is
why I alluded to 4×16×3: each interpretation of the 4-bit access control
is stored in its own natural place. It also means each layer can apply
its own interpretation (mapping).
It's just an SRAM loaded by the boot ROM before the Hypervisor boots.
The super-secure version of boot ROM loads a table with values
(Sandbox, User, Kernel, Hypervisor):
Snd Usr Krn Hyp
na na na R
na na na RE
na na na RW
na na na REW
na na R na
na na RE na
na na RW na
na na REW na
na R R na
na RE RE na
...
REW REW REW na
which grants mode 0 (Hyp) no direct RW access to any memory outside
itself.
Boot ROM sets an optional table lock so even hypervisor cannot later
grant itself access permission to less priv memory by changing the table.
The core's 2-bit mode selects-muxes one of the 3-bit allowed access
fields from the indexed 12-bits to extract the 3 R-E-W bits.
That much is straightforward.
The 2-bit mode comes from the LD/ST uOp, which was set to the mode
active when the instruction was decoded (so it can pipeline mode
changes).
Yes, core-state index follows the memref down the pipe.
Core-state index is written into MMI/O/device control block for the
DMA portion of a command, other CD indexes are associated with I/O
page faults and device errors.
Not sure how this would work with device IO and DMA.
Say a secure kernel that owns a disk drive with secrets that even the HV
is not authorized to see (so HV operators don't need Top Secret
clearance).
The Hypervisor has to pass to a hardware device DMA access to a memory
frame that it has no access to itself. How does one block the HV from
setting the IOMMU to DMA the device's secrets into its own memory?
Hmmm... something like: once a secure HV passes a physical frame address
to a secure kernel then it cannot take it back, it can only ask that
kernel for it back. Which means that the HV loses control of any
core or IOMMU PTE's that map that frame until it is handed back.
That would seem to imply that once an HV gives memory to a secure
guest kernel that it can only page that guest with its permission.
Hmmm...
I am a little confused here. When you talk about IOMMU addresses, are
you talking about memory addresses or disk addresses? ISTM that
protecting memory of lower privileged programs is useless if a higher privileged program can force a page out to disk, then can read the data
from the disk drive itself. Of course, the same is true for data
written to disk by a lesser privileged program. If the higher
privileged program can read the file, then it can compromise security.
Stephen Fuld wrote:
-----------------
I am a little confused here. When you talk about IOMMU addresses, are
you talking about memory addresses or disk addresses? ISTM that
protecting memory of lower privileged programs is useless if a higher
privileged program can force a page out to disk, then can read the data
from the disk drive itself. Of course, the same is true for data
written to disk by a lesser privileged program. If the higher
privileged program can read the file, then it can compromise security.
I'm just kinda free associating what the consequences of restricting
the HV's access to lesser privilege levels might be.
As I understand it, the AMD secure HV approach is that memory owned by a guest kernel and its applications is encrypted and only the guest kernel
has the key. Memory content is only decrypted while inside the core.
As the key is only stored inside that guest kernel memory there is
no way for HV to get at it.
So it doesn't matter that the HV has access to guest memory because it
can only see encrypted memory values. Presumably such data is encrypted
on disk so intercepting a DMA gives you nothing.
But it doesn't look like blocking HV access to the guest kernel, user,
or sandbox memory accomplishes the same security because the HV can
always diddle its own page tables to grant itself access.
On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:
On 2/6/2025 10:51 AM, EricP wrote:
MitchAlsup1 wrote:
On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
Not sure how this would work with device IO and DMA.
-------------------
Say a secure kernel that owns a disk drive with secrets that even the HV
is not authorized to see (so HV operators don't need Top Secret
clearance).
The Hypervisor has to pass to a hardware device DMA access to a memory
frame that it has no access to itself. How does one block the HV from
setting the IOMMU to DMA the device's secrets into its own memory?
Hmmm... something like: once a secure HV passes a physical frame address
to a secure kernel then it cannot take it back, it can only ask that
kernel for it back. Which means that the HV loses control of any
core or IOMMU PTE's that map that frame until it is handed back.
That would seem to imply that once an HV gives memory to a secure
guest kernel that it can only page that guest with its permission.
Hmmm...
I am a little confused here. When you talk about IOMMU addresses, are
you talking about memory addresses or disk addresses?
I/O MMU does not see the device commands containing the sector on
the disk to be accessed. Mostly, CPUs write directly to the CRs
of the device to start a command, bypassing I/O MMU as raw data.
In my block diagrams of HostBridge, I show I/O MMU only on the
receiving side of PCIe transport links. all the outbound traffic
has already been translated by a core MMU, unless one allows
a device to send commands to another device.
I/O MMU sees the virtual address of where DMA is accessing, translating
accordingly.
I/O MMU sees the virtual address of MSI-X interrupts, page faults and
errors.
ISTM that
protecting memory of lower privileged programs is useless if a higher
privileged program can force a page out to disk, then can read the data
from the disk drive itself.
Protecting a process without privilege from a process WITH privilege
requires more than a little trust in the privileged process(es).
This is why there is a Secure Monitor over HyperVisor to take HV out
of the control loop for "secure stuff". By assuming the duties of
HV wrt accessing unprivileged memory or storage, SM minimizes the
footprint where trust is required.
Of course, the same is true for data
written to disk by a lesser privileged program. If the higher
privileged program can read the file, then it can compromise security.
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:
On 2/6/2025 10:51 AM, EricP wrote:
MitchAlsup1 wrote:
On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
Not sure how this would work with device IO and DMA.
-------------------
Say a secure kernel that owns a disk drive with secrets that even the HV
is not authorized to see (so HV operators don't need Top Secret
clearance).
The Hypervisor has to pass to a hardware device DMA access to a memory
frame that it has no access to itself. How does one block the HV from
setting the IOMMU to DMA the device's secrets into its own memory?
Hmmm... something like: once a secure HV passes a physical frame address
to a secure kernel then it cannot take it back, it can only ask that
kernel for it back. Which means that the HV loses control of any
core or IOMMU PTE's that map that frame until it is handed back.
That would seem to imply that once an HV gives memory to a secure
guest kernel that it can only page that guest with its permission.
Hmmm...
I am a little confused here. When you talk about IOMMU addresses, are
you talking about memory addresses or disk addresses?
I/O MMU does not see the device commands containing the sector on
the disk to be accessed. Mostly, CPUs write directly to the CRs
of the device to start a command, bypassing I/O MMU as raw data.
That is indeed the case. The IOMMU is on the inbound path
from the PCIe controller to the internal bus/mesh structure.
Note that there is a translation on the outbound path from
the host address space to the PCIe memory space - this is
often 1:1, but need not be so. This translation happens
in the PCIe controller when creating a TLP that contains
an address before sending the TLP to the endpoint. Take
an AHCI controller, for example, where the only device
BAR is 32-bits; if a host wants to map the AHCI controller
at a 64-bit address, the controller needs to map that 64-bit
address window into a 32-bit 3DW TLP to be sent to the endpoint
function.
The ARM SMMU is split into two - one that translates inbound
addresses that are not marked secure by the endpoint, and
one that translates addresses that are marked secure by the
endpoint (or by some host bridge between the endpoint and
the host internal bus structures which is configured by
the secure software). The secure side is managed by the
secure monitor; the non-secure side by the HV or bare-metal
OS.
In my block diagrams of HostBridge, I show I/O MMU only on the
receiving side of PCIe transport links. all the outbound traffic
has already been translated by a core MMU, unless one allows
a device to send commands to another device.
I/O MMU sees the virtual address of where DMA is accessing, translating
accordingly.
I/O MMU sees the virtual address of MSI-X interrupts, page faults and >>errors.
By page faults, I assume you're referring to the PCIe PRI (Page Request Interface) and ATS capabilities.
ISTM that
protecting memory of lower privileged programs is useless if a higher
privileged program can force a page out to disk, then can read the data
from the disk drive itself.
Protecting a process without privilege from a process WITH privilege
requires more than a little trust in the privileged process(es).
This is why there is a Secure Monitor over HyperVisor to take HV out
of the control loop for "secure stuff". By assuming the duties of
HV wrt accessing unprivileged memory or storage, SM minimizes the
footprint where trust is required.
ARM has a "RM" (Realm Monitor) that sits between the HV and the SM
to manage memory visibility and security.
https://developer.arm.com/documentation/den0127/0200/Software-components/Realm-Management-Monitor
Of course, the same is true for data
written to disk by a lesser privileged program. If the higher
privileged program can read the file, then it can compromise security.
Assuming the file is not secured via other means such as cryptography.
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Feb 2025 13:57:51 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:
On 2/6/2025 10:51 AM, EricP wrote:
MitchAlsup1 wrote:
On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
-------------------
Not sure how this would work with device IO and DMA.
Say a secure kernel that owns a disk drive with secrets that even the HV
is not authorized to see (so HV operators don't need Top Secret
clearance).
The Hypervisor has to pass to a hardware device DMA access to a memory
frame that it has no access to itself. How does one block the HV from
setting the IOMMU to DMA the device's secrets into its own memory?
Hmmm... something like: once a secure HV passes a physical frame address
to a secure kernel then it cannot take it back, it can only ask that
kernel for it back. Which means that the HV loses control of any
core or IOMMU PTE's that map that frame until it is handed back.
That would seem to imply that once an HV gives memory to a secure
guest kernel that it can only page that guest with its permission.
Hmmm...
I am a little confused here. When you talk about IOMMU addresses, are
you talking about memory addresses or disk addresses?
I/O MMU does not see the device commands containing the sector on
the disk to be accessed. Mostly, CPUs write directly to the CRs
of the device to start a command, bypassing I/O MMU as raw data.
That is indeed the case. The IOMMU is on the inbound path
from the PCIe controller to the internal bus/mesh structure.
Note that there is a translation on the outbound path from
the host address space to the PCIe memory space - this is
often 1:1, but need not be so. This translation happens
in the PCIe controller when creating a TLP that contains
an address before sending the TLP to the endpoint. Take
Is there any reason this cannot happen in the core MMU ??
How do you map the translation table to the device?
How do you map the translation table to the device?
Why
would you wish to have the CPU translating I/O virtual
addresses?
The IOMMU tables are per device, and they
can be configured to map the minimum amount of the address
space (even updated per-I/O if desired) required to support
the completion of an inbound DMA from the device.
Guest OS uses a virtual device address given to it from HV.
HV sets up the 2nd nesting of translation to translate this
to "what HostBridge needs" to route commands to device control
registers. The handoff can be done by spoofing config space
or having HV simply hand Guest OS a list of devices it can
discover/configure/use.
The IOMMU only is involved in DMA transactions _initiated_ by
the device, not by the CPUs. They're two completely different
concepts.
an AHCI controller, for example, where the only device
BAR is 32-bits; if a host wants to map the AHCI controller
at a 64-bit address, the controller needs to map that 64-bit
address window into a 32-bit 3DW TLP to be sent to the endpoint
function.
This is one of the reasons My 66000 architecture has a unique
MMI/O address space--you can setup a 32-bit BAR to put a
page of control registers in 32-bit address space without
conflict. {{If I understand correctly}} Core MMU, then,
translates normal device virtual control register addresses
such that the request is routed to where the device is looking
{{which has 32 high order bits zero.}}
Most systems have DRAM located at physical address zero, and
a 4GB DRAM is pretty small these days.
So you either need
to make a hole in the DRAM or provide a mapping mechanism to
map a 64-bit address into a 32-bit bar when sending TLPs
to the AHCI controller.
Systems that aren't intel compatible will designate a range
of the 64-bit physical address space (near the top) and will
map regions in that range to the 32-bit bar via translation
registers in the PCIe controller.
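A hedged C sketch of such an outbound translation window; the register layout and names are an illustration, not any particular controller's programming model.

#include <stdbool.h>
#include <stdint.h>

/* CPU physical addresses inside [base, base + size) are redirected to a
   32-bit PCIe memory address before the TLP is built. */
struct ob_window {
    uint64_t base;      /* start of the window in host physical space */
    uint64_t size;      /* window length                              */
    uint32_t target;    /* 32-bit PCIe address the window maps onto   */
};

static bool ob_translate(const struct ob_window *w, uint64_t host_pa,
                         uint32_t *pcie_addr)
{
    if (host_pa < w->base || host_pa - w->base >= w->size)
        return false;                       /* not claimed by this window */
    *pcie_addr = w->target + (uint32_t)(host_pa - w->base);
    return true;                            /* fits in a 32-bit 3DW TLP */
}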
On the other hand--it would take a very big system indeed to
overflow the 32-bit MMI/O space, although ECAM can access
42-bit device CR MMI/O space.
Leaving aside the small size of the legacy Intel I/O space
(16-bit addresses), history seems to have favored single
address space systems, so I suspect such a MMI/O space will
not be favored by many.
On Fri, 14 Feb 2025 19:51:44 +0000, Scott Lurndal wrote:
The traditional PCI bus supports a 5-bit device# and a 3-bit function #.
A PCIe bus supports either a 3-bit function number and the high-order
five bits MBZ (i.e. only device 0 is legal on PCIe) or, if the function
advertises
the Alternate Routing Identifier (ARI) capability, an 8-bit function
number
(SRIOV leverages ARI to support dense routing IDs, but any bus that
supports
ARI can handle 256 physical functions).
PCI supports two forms of configuration space addresses in TLPs:
Type 0 contains only a function number and the register address.
Type 1 contains the bus number, function number and register address.
PCIe segments go where ? Are they "picked off" prior to being routed
down the tree ??
mitchalsup@aol.com (MitchAlsup1) writes:
-----------
PCI supports two forms of configuration space addresses in TLPs:
Type 0 contains only a function number and the register address.
Type 1 contains the bus number, function number and register address.
PCIe segments go where ? Are they "picked off" prior to being routed
down the tree ??
Logically, they can be considered a prefix to the RID for routing
purposes (inbound to the IOMMU, for example, the PCIE controller
will prepend its segment number to the RID and use that as an
ID to the IOMMU). ARM calls it a streamid.
For PCI configuration transactions initiated by the CPU,
"PCIe segments go where?" is an interesting question.
PCI Express has been designed as a point-to-point protocol
using serial connections rather than the wide PCI local
bus, which changes the topology of the system. With PCIe
the device number in the RID -must be zero- (unless the
BUS is an ARI bus,
in which case bits <7:0> of the RID
are a function number provided by the PCIe device (up
to 256 functions per each - more with SRIOV as it can
consume additional space in the bus <15:8> field of the
RID to support up to 65535 virtual functions on a single
device). A non-SRIOV and non-ARI device can only provide
from one to eight functions.
The specification allows the implementation to provide a single ECAM
segment per PCIe controller, or an implementation may provide a single
ECAM region and use bits <63:28> as a segment number.
This is how
most non-intel systems handle this today; a processor that supports
six PCIe controllers would have perhaps 7 segments (one or more for the
root bus 0 for the on-chip devices such as memory controllers, and
one for each PCIe controller).
Software simply constructs the target ECAM address and issues normal
loads and stores to access it - no need to use the clumsy, slow and non-standard PCI peek-and-poke configuration space accesses.
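A short C sketch of that ECAM address construction (the per-segment base addresses would normally come from firmware, e.g. the ACPI MCFG table; the values, names, and the assumption that the ECAM window is already mapped are placeholders).

#include <stdint.h>

static const uint64_t ecam_base[2] = { 0xE0000000ull, 0x4000000000ull };

/* Standard ECAM layout: bus in bits [27:20], device in [19:15], function
   in [14:12], and the 4 KiB of per-function config space in [11:0]. */
static inline volatile uint32_t *
ecam_reg(unsigned seg, unsigned bus, unsigned dev, unsigned fn, unsigned off)
{
    uint64_t pa = ecam_base[seg]
                | ((uint64_t)(bus & 0xFFu) << 20)
                | ((uint64_t)(dev & 0x1Fu) << 15)
                | ((uint64_t)(fn  & 0x07u) << 12)
                | (off & 0xFFCu);           /* dword-aligned register offset */
    return (volatile uint32_t *)(uintptr_t)pa;
}

/* Example: vendor/device ID of segment 0, bus 0, device 0, function 0,
   read with an ordinary load. */
static uint32_t read_id(void) { return *ecam_reg(0, 0, 0, 0, 0x00); }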
For outbound memory space (non-config space) transactions from the CPU
to
the device, the RID is not used - once the CPU I/O fabric has routed
the request by PA to the proper PCIe controller (address-based routing through a hierarchy of host bridges), it is simply sent to the device
which CAMs the address against the programmed bars and reacts
appropriately
(e.g. by responding with a UR (Unsupported Request) on a non-posted
request or dropping a posted request if it doesn't match a BAR).
On Thu, 6 Feb 2025 21:53:27 +0000, EricP wrote:
Stephen Fuld wrote:
-----------------
I am a little confused here. When you talk about IOMMU addresses, are
you talking about memory addresses or disk addresses? ISTM that
protecting memory of lower privileged programs is useless if a higher
privileged program can force a page out to disk, then can read the data
from the disk drive itself. Of course, the same is true for data
written to disk by a lesser privileged program. If the higher
privileged program can read the file, then it can compromise security.
I'm just kinda free associating what the consequences of restricting
the HV's access to lesser privilege levels might be.
As I understand it, the AMD secure HV approach is that memory owned by a
guest kernel and its applications is encrypted and only the guest kernel
has the key. Memory content is only decrypted while inside the core.
As the key is only stored inside that guest kernel memory there is
no way for HV to get at it.
Interesting, so you can see the data; you just can't interpret the bit-patterns. This still leaves the door open for malicious action.
So it doesn't matter that the HV has access to guest memory because it
can only see encrypted memory values. Presumably such data is encrypted
on disk so intercepting a DMA gives you nothing.
It matters if HV can access (especially modify) what is in storage
(not memory), making it impossible for the secure process to use
its own data !
But it doesn't look like blocking HV access to the guest kernel, user,
or sandbox memory accomplishes the same security because the HV can
always diddle its own page tables to grant itself access.
My 66000 does it differently. HV can create a PTE that translates
"anywhere", but cannot use the VA of Guest OS or SM in order to
use that PTE. There are 4 VAS privilege levels::
                HoB = 0           nest   HoB = 1         nest
application:    application VAS   yes    no access       X
Guest OS        application VAS   yes    Guest OS VAS    yes
HyperVisor      HyperVisor VAS    no     Guest OS VAS    yes
Secure          HyperVisor VAS    no     SM VAS          no
So, HV has no 'direct' path to Application VAS using standard
memory access protocols.
MitchAlsup1 wrote:
On Thu, 6 Feb 2025 21:53:27 +0000, EricP wrote:
Stephen Fuld wrote:
-----------------
I am a little confused here. When you talk about IOMMU addresses, are
you talking about memory addresses or disk addresses? ISTM that
protecting memory of lower privileged programs is useless if a higher
privileged program can force a page out to disk, then can read the data
from the disk drive itself. Of course, the same is true for data
written to disk by a lesser privileged program. If the higher
privileged program can read the file, then it can compromise security.
I'm just kinda free associating what the consequences of restricting
the HV's access to lesser privilege levels might be.
As I understand it, the AMD secure HV approach is that memory owned by a
guest kernel and its applications is encrypted and only the guest kernel
has the key. Memory content is only decrypted while inside the core.
As the key is only stored inside that guest kernel memory there is
no way for HV to get at it.
Interesting, so you can see the data, you just can't interpret the
bit-patterns. This still leaves the door open for malicious action.
Yes. Their Secure VM sounds exactly like what a timeshare vendor might
want if they desire to sell services to governments and businesses that
have secrets that rogue operators must provably not access.
Also blocks accidental leaks between guests so you could have multiple
such secure guest OS on the same host.
So it doesn't matter that the HV has access to guest memory because it
can only see encrypted memory values. Presumably such data is encrypted
on disk so intercepting a DMA gives you nothing.
It matters if HV can access (especially modify) what is in storage
(not memory), making it impossible for the secure process to use
its own data!
Yes a rogue or buggy HV could DoS a guest OS by scrambling its data.
Or an ECC memory error on critical memory location, like the key.
But it doesn't look like blocking HV access to the guest kernel, user,
or sandbox memory accomplishes the same security because the HV can
always diddle its own page tables to grant itself access.
My 66000 does it differently. HV can create a PTE that translates
"anywhere", but cannot use the VA of Guest OS or SM in order to
use that PTE. There are 4 VAS privilege levels::
              HoB = 0          nest    HoB = 1        nest
application:  application VAS  yes     no access      X
Guest OS      application VAS  yes     Guest OS VAS   yes
HyperVisor    HyperVisor VAS   no      Guest OS VAS   yes
Secure        HyperVisor VAS   no      SM VAS         no
So, HV has no 'direct' path to Application VAS using standard
memory access protocols.
But the physical memory in use by that guest can be remapped
by rogue HV to be in its own virtual space and then accessed.
On Sat, 15 Feb 2025 15:31:44 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
-----------
PCI supports two forms of configuration space addresses in TLPs:
Type 0 contains only a function number and the register address.
Type 1 contains the bus number, function number and register address.
PCIe segments go where ? Are they "picked off" prior to being routed
down the tree ??
Logically, they can be considered a prefix to the RID for routing
purposes (inbound to the IOMMU, for example, the PCIE controller
will prepend its segment number to the RID and use that as an
ID to the IOMMU). ARM calls it a streamid.
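A minimal sketch of that prefixing, assuming the usual PCIe
bus:device:function field widths (the helper names are mine):

/* The 16-bit routing ID (bus:dev:func) plus a controller/segment
 * number give the ID the IOMMU keys its tables on (ARM's streamid). */
#include <stdint.h>

static inline uint16_t rid(uint8_t bus, uint8_t dev, uint8_t func)
{
    return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (func & 0x7));
}

static inline uint32_t streamid(uint16_t segment, uint16_t routing_id)
{
    return ((uint32_t)segment << 16) | routing_id; /* segment:bus:dev:func */
}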
For PCI configuration transactions initiated by the CPU,
"PCIe segments go where?" is an interesting question.
Indeed. Snipping Intel brain damage
-----------------------
PCI Express has been designed as a point-to-point protocol
using serial connections rather than the wide PCI local
bus, which changes the topology of the system. With PCIe
the device number in the RID -must be zero- (unless the
BUS is an ARI bus,
What is an ARI bus, and do non x86 systems have them ??
in which case bits <7:0> of the RID
are a function number provided by the PCIe device (up
to 256 functions per each - more with SRIOV as it can
consume additional space in the bus <15:8> field of the
RID to support up to 65535 virtual functions on a single
device). A non-SRIOV and non-ARI device can only provide
from one to eight functions.
snipping
The specification allows the implementation to provide a single ECAM
segment per PCIe controller, or an implementation may provide a single
ECAM region and use bits <63:28> as a segment number.
Wikipedia states ECAM contains 42 bits::
16-bit segment, 8-bit bus, 5-bit device, 3-bit function, 4-bit
xReg, and 6-bit reg. Is this wrong, misleading, or out of date ?
This is how most non-Intel systems handle this today; a processor that
supports six PCIe controllers would have perhaps 7 segments (one or more
for the root bus 0 for the on-chip devices such as memory controllers,
and one for each PCIe controller).
Software simply constructs the target ECAM address and issues normal
loads and stores to access it - no need to use the clumsy, slow and
non-standard PCI peek-and-poke configuration space accesses.
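As a hedged sketch of that construction: the per-segment ECAM base
would come from firmware tables (ACPI MCFG or a device tree) on a real
system, and the offset layout below is the standard ECAM
bus<<20 | dev<<15 | func<<12 | reg one.

/* Read a 32-bit config register through ECAM with an ordinary load. */
#include <stdint.h>

uint32_t ecam_read32(volatile uint8_t *ecam_segment_base,
                     uint8_t bus, uint8_t dev, uint8_t func,
                     uint16_t reg /* 12-bit register offset */)
{
    uintptr_t off = ((uintptr_t)bus          << 20) |
                    ((uintptr_t)(dev & 0x1f) << 15) |
                    ((uintptr_t)(func & 0x7) << 12) |
                    (reg & 0xfff);
    return *(volatile uint32_t *)(ecam_segment_base + off);
}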
We still have to translate 'target ECAM' into a bit pattern matching
that device's BAR. And thanks to your help I have a means to do so
that has overhead only when booting a new Guest OS.
It is like there is a privilege level between application and GuestOS.
{{I spent all afternoon trying to think of a name for this privilege
above application "non-privileged" and below "privileged". Maybe meso-privileged ?!?
It is like there is a privilege level between application and GuestOS.
{{I spent all afternoon trying to think of a name for this privilege
above application "non-privileged" and below "privileged". Maybe
meso-privileged ?!?
handyman?
Stefan Monnier <monnier@iro.umontreal.ca> writes:
It is like there is a privilege level between application and GuestOS.
{{I spent all afternoon trying to think of a name for this privilege
above application "non-privileged" and below "privileged". Maybe
meso-privileged ?!?
handyman?
Application -> Library -> OS -> Hypervisor -> Secure Monitor
On Mon, 3 Feb 2025 21:13:24 +0000, Scott Lurndal wrote:
Stefan Monnier <monnier@iro.umontreal.ca> writes:
It is like there is a privilege level between application and GuestOS.
{{I spent all afternoon trying to think of a name for this privilege
above application "non-privileged" and below "privileged". Maybe
meso-privileged ?!?
handyman?
Application -> Library -> OS -> Hypervisor -> Secure Monitor
{Sandbox -> user -> application -> Library} ->{sual}×{GuestOS, HV, SM}
??
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 3 Feb 2025 21:13:24 +0000, Scott Lurndal wrote:
Stefan Monnier <monnier@iro.umontreal.ca> writes:
It is like there is a privilege level between application and GuestOS.
{{I spent all afternoon trying to think of a name for this privilege
above application "non-privileged" and below "privileged". Maybe
meso-privileged ?!?
handyman?
Application -> Library -> OS -> Hypervisor -> Secure Monitor
{Sandbox -> user -> application -> Library} ->{sual}×{GuestOS, HV, SM}
??
You need to precisely define your terms. What are sandbox
and user in this context?
The hypervisor is optional, as would be a library.
The Burroughs Large systems and HP-3000 segmented libraries
were distinct entities with attributes.
Code in a library could be more privileged than the application
when acting on behalf of the application, for example; but the
application could not take advantage of the permissions assigned
to the library it was linked with without using interfaces
provided by the library.
On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Feb 2025 13:57:51 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:
On 2/6/2025 10:51 AM, EricP wrote:
MitchAlsup1 wrote:
On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
-------------------
Not sure how this would work with device IO and DMA.
Say a secure kernel that owns a disk drive with secrets that even the HV
is not authorized to see (so HV operators don't need Top Secret
clearance).
The Hypervisor has to pass to a hardware device DMA access to a memory
frame that it has no access to itself. How does one block the HV from
setting the IOMMU to DMA the device's secrets into its own memory?
Hmmm... something like: once a secure HV passes a physical frame address
to a secure kernel then it cannot take it back, it can only ask that
kernel for it back. Which means that the HV loses control of any
core or IOMMU PTE's that map that frame until it is handed back.
That would seem to imply that once an HV gives memory to a secure
guest kernel that it can only page that guest with its permission.
Hmmm...
I am a little confused here. When you talk about IOMMU addresses, are
you talking about memory addresses or disk addresses?
I/O MMU does not see the device commands containing the sector on
the disk to be accessed. Mostly, CPUs write directly to the CRs
of the device to start a command, bypassing I/O MMU as raw data.
That is indeed the case. The IOMMU is on the inbound path
from the PCIe controller to the internal bus/mesh structure.
Note that there is a translation on the outbound path from
the host address space to the PCIe memory space - this is
often 1:1, but need not be so. This translation happens
in the PCIe controller when creating a TLP that contains
an address before sending the TLP to the endpoint. Take
Is there any reason this cannot happen in the core MMU ??
How do you map the translation table to the device?
device is configured by setting BAR[s] to an addressable
page. Accesses to this page are performed by the device
consisting of Rd and Wt to control registers. Physical
addresses matching BAR aperture are routed to device.
HyperVisor maintains a PTE to map guest physical addresses
within an aperture to the page matching the device's BAR.
Thus, HV MMU maps guest OS physical address into universal
MMI/O address.
A long time before accessing the device, HyperVisor sets up
a device control block and places it in a table indexed
by segment:bus;device and stores table address in a control
register of the I/O MMU {HostBridge}. This control block
contains several context pointers, an interrupt table
pointer, and four event coordinators--one each for DMA, page
faults, errors, and interrupts. The EC provides an index
into the root pointers.
Guest OS uses the virtual device address in code, Guest OS
MMU maps it to the aperture maintained by HyperVisor. HV
then maps GPA to MMI/O:device_address. Using said trans-
lations, Guest OS writes commands to the function:register
of the addressed device.
The path from core virtual address to device control register
address does not pass through the I/O MMU.
When device responds with DMA request it uses a device virtual
address (not a virtual device address),
said request is routed
to the top of PCIe tree, where I/O MMU uses ECAM to identify
the MMU tables for this device, once identified, translates*
the device virtual address into a universal address (almost
invariably targeting DRAM). Once translated and checked, the
command is allowed to proceed. (*) assuming ATS was not used.
When device responds with Interrupt request, I/O MMU uses
ECAM (again) to find the associated interrupt table,
and then translates the device interrupt address into a
universal MMI/O write to the attached interrupt table.
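A very rough sketch of that remapping step; all structure and field
names here are hypothetical, and real IOMMUs (AMD, Intel, ARM SMMU)
differ in detail.

/* Look up the function's interrupt table, index it by the MSI data,
 * and produce the final memory write that actually gets delivered. */
#include <stdint.h>
#include <stdbool.h>

struct irte { bool valid; uint64_t target_addr; uint32_t target_data; };

struct int_table { struct irte *entries; uint32_t nentries; };

bool remap_msi(const struct int_table *tbl, uint32_t msi_data,
               uint64_t *out_addr, uint32_t *out_data)
{
    if (msi_data >= tbl->nentries || !tbl->entries[msi_data].valid)
        return false;                   /* blocked: report an error   */
    *out_addr = tbl->entries[msi_data].target_addr;
    *out_data = tbl->entries[msi_data].target_data;
    return true;                        /* forwarded as a memory write */
}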
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
-----------------isolating---------------------------------
HyperVisor maintains a PTE to map guest physical addresses
within an aperture to the page matching the device's BAR.
Standard HV stuff. Although you may want to consider that
the value in the function's BAR, must be a guest PA, not a
machine PA.
On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
-----------------isolating---------------------------------
HyperVisor maintains a PTE to map guest physical addresses
within an aperture to the page matching the device's BAR.
Standard HV stuff. Although you may want to consider that
the value in the function's BAR, must be a guest PA, not a
machine PA.
I need a "WHY" on this sentence before responding to the rest.
On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:
On 2/6/2025 10:51 AM, EricP wrote:
MitchAlsup1 wrote:
On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
-------------------
Not sure how this would work with device IO and DMA.
Say a secure kernel that owns a disk drive with secrets that even the HV
is not authorized to see (so HV operators don't need Top Secret
clearance).
The Hypervisor has to pass to a hardware device DMA access to a memory
frame that it has no access to itself. How does one block the HV from
setting the IOMMU to DMA the device's secrets into its own memory?
Hmmm... something like: once a secure HV passes a physical frame address
to a secure kernel then it cannot take it back, it can only ask that
kernel for it back. Which means that the HV loses control of any
core or IOMMU PTE's that map that frame until it is handed back.
That would seem to imply that once an HV gives memory to a secure
guest kernel that it can only page that guest with its permission.
Hmmm...
I am a little confused here. When you talk about IOMMU addresses, are
you talking about memory addresses or disk addresses?
I/O MMU does not see the device commands containing the sector on
the disk to be accessed. Mostly, CPUs write directly to the CRs
of the device to start a command, bypassing I/O MMU as raw data.
In my block diagrams of HostBridge, I show I/O MMU only on the
receiving side of PCIe transport links. All the outbound traffic
has already been translated by a core MMU, unless one allows
a device to send commands to another device.
I/O MMU sees the virtual address of where DMA is accessing,
translating accordingly.
I/O MMU sees the virtual address of MSI-X interrupts, page faults and
errors.
ISTM that
protecting memory of lower privileged programs is useless if a higher
privileged program can force a page out to disk, then can read the data
from the disk drive itself.
Protecting a process without privilege from a process WITH privilege
requires more than a little trust in the privileged process(es).
This is why there is a Secure Monitor over HyperVisor to take HV out
of the control loop for "secure stuff". By assuming the duties of
HV wrt accessing unprivileged memory or storage, SM minimizes the
footprint where trust is required.
Of course, the same is true for data
written to disk by a lesser privileged program. If the higher
privileged program can read the file, then it can compromise security.
On 2/6/2025 6:39 PM, MitchAlsup1 wrote:
On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:
I/O MMU does not see the device commands containing the sector on
the disk to be accessed. Mostly, CPUs write directly to the CRs
of the device to start a command, bypassing I/O MMU as raw data.
I am terribly out of date with all of this, but what if the device is a
SATA disk? It at least used to be that you sent a command packet to the
disk and said packet contained the disk relative block number. I know
of no way to initiate an I/O by writing to the disk's "control registers".
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
-----------------isolating---------------------------------
HyperVisor maintains a PTE to map guest physical addresses
within an aperture to the page matching the device's BAR.
Standard HV stuff. Although you may want to consider that
the value in the function's BAR, must be a guest PA, not a
machine PA.
I need a "WHY" on this sentence before responding to the rest.
The guest OS (or user application) is directly programming the addresses
in the PCIe function DMA engine[*]. The point is to completely avoid
involving the hypervisor in the I/O, so the guest programs the function
DMA engine using guest PAs and the IOMMU translates the guest PA into
machine addresses for all transactions initiated by the function.
(Each PCIe function is treated as an individual device
by the OS/HV. The entire purpose of SR-IOV is to
present the device directly to the guest to avoid any
hypervisor involvement in the I/O path).
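A sketch of what that inbound translation amounts to, with a flat
one-level table standing in for the real multi-level IOMMU page tables;
the domain would be selected by the function's routing/stream ID, and
all names are illustrative.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12

struct iommu_domain {            /* one per guest-assigned function   */
    uint64_t *gpa_to_mpa;        /* guest page frame -> machine frame */
    uint64_t  npages;
};

/* Translate a guest PA carried in a device-initiated DMA to a
 * machine PA; frame 0 marks "not mapped" in this sketch. */
bool iommu_translate(const struct iommu_domain *dom,
                     uint64_t guest_pa, uint64_t *machine_pa)
{
    uint64_t gfn = guest_pa >> PAGE_SHIFT;
    if (gfn >= dom->npages || dom->gpa_to_mpa[gfn] == 0)
        return false;            /* fault: this DMA is not allowed    */
    *machine_pa = (dom->gpa_to_mpa[gfn] << PAGE_SHIFT)
                | (guest_pa & ((1u << PAGE_SHIFT) - 1));
    return true;
}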
On Tue, 11 Feb 2025 14:04:59 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
-----------------isolating---------------------------------
HyperVisor maintains a PTE to map guest physical addresses
within an aperture to the page matching the device's BAR.
Standard HV stuff. Although you may want to consider that
the value in the function's BAR, must be a guest PA, not a
machine PA.
I need a "WHY" on this sentence before responding to the rest.
The guest OS (or user application) is directly programming the addresses
in the
PCIe function DMA engine[*]. The point is to completely
avoid the hypervisor from being involved in the I/O,
so the guest programs the function DMA engine using guest PA and the
IOMMU translates the guest PA into machine
addresses for all transactions initiated by the
function.
(Each PCIe function is treated as an individual device
by the OS/HV. The entire purpose of SR-IOV is to
present the device directly to the guest to avoid any
hypervisor involvement in the I/O path).
This is the missing part:: When a user performs a LD or ST
the guest Virtual address is translated to Guest Physical
by Guest OS translation tables. Guest Physical is then
interpreted as Host virtual and translated a second time
by {SM,HV} mapping tables. This nested MMU does both
translations in a single access; so core TLB is organized
to associate guest virtual directly with machine physical !
as if there were a level crossing PTE providing all the
"right bits".
Now that the device has been configured, Guest OS decides
to write some control registers of the device. Guest OS
has its own translation tables for Guest Virtual to Guest
Physical--but the core MMU then translates guest Physical
to Machine Physical before it gets transported over the
interconnect. So, by the time said address gets to Host-
Bridge it is already in Machine Physical, not the Guest
physical you mention.*
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 11 Feb 2025 14:04:59 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
-----------------isolating---------------------------------
HyperVisor maintains a PTE to map guest physical addresses
within an aperture to the page matching the device's BAR.
Standard HV stuff. Although you may want to consider that
the value in the function's BAR, must be a guest PA, not a
machine PA.
I need a "WHY" on this sentence before responding to the rest.
The guest OS (or user application) is directly programming the addresses
in the
PCIe function DMA engine[*]. The point is to completely
avoid the hypervisor from being involved in the I/O,
so the guest programs the function DMA engine using guest PA and the
IOMMU translates the guest PA into machine
addresses for all transactions initiated by the
function.
(Each PCIe function is treated as an individual device
by the OS/HV. The entire purpose of SR-IOV is to
present the device directly to the guest to avoid any
hypervisor involvement in the I/O path).
This is the missing part:: When a user performs a LD or ST
You're missing the point completely. There is no user
performing a LD or ST. The DMA controller on the device
is initiating the transaction, not the host CPU.
the guest Virtual address is translated to Guest Physical
by Guest OS translation tables. Guest Physical is then
interpreted as Host virtual and translated a second time
by {SM,HV} mapping tables. This nested MMU does both
translations in a single access; so core TLB is organized
to associate guest virtual directly with machine physical !
as if there were a level crossing PTE providing all the
"right bits".
This is basically how all modern CPUs handle it, yes.
But it is not relevant to the inbound traffic initiated
-by the device- which can't be translated by the CPU,
rather must be translated at some point between the
PCIe controller and the internal processor interconnect
(e.g. mesh).
Now that the device has been configured, Guest OS decides
to write some control registers of the device. Guest OS
has its own translation tables for Guest Virtual to Guest
Physical--but the core MMU then translates guest Physical
to Machine Physical before it gets transported over the
interconnect. So, by the time said address gets to Host-
Bridge it is already in Machine Physical, not the Guest
physical you mention.*
My 66000 has no insight into the device; you can't know
a priori which 64-bit write to the device contains
a physical address. Particularly in all modern
devices where there may be only one "control register"[*]
the guest driver writes commands and s/g lists to one of
several hundred queues in local dram, then signals
the device to initiate a DMA operation to read the
entry from the queue in DRAM (using a guest physical
address). The CPU never sees that read,
nor can you know a priori that a particular write
from a device driver is a guest address that might need
to be translated - the driver is writing the command
to main memory and just poking the device to read the
command directly; there is no way to associate that
write with any particular device in the CPU.
[*] The doorbell. There will generally be a few more
to set global characteristics, configurations, etc,
but they'll only be actively used during driver initialization
and will likely not contain addresses of any form.
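A sketch of that queue-plus-doorbell pattern, NVMe-flavored but generic;
the layout and names are illustrative, not any particular device's
programming interface.

#include <stdint.h>

struct sq_entry {                 /* one submission-queue slot         */
    uint32_t opcode;
    uint64_t data_gpa;            /* guest PA of the data buffer       */
    uint32_t length;
};

struct queue {
    struct sq_entry  *ring;       /* lives in (guest) DRAM             */
    uint32_t          depth;
    uint32_t          tail;       /* driver-owned producer index       */
    volatile uint32_t *doorbell;  /* the one MMIO register we touch    */
};

void submit(struct queue *q, const struct sq_entry *cmd)
{
    q->ring[q->tail] = *cmd;                  /* plain store to DRAM   */
    q->tail = (q->tail + 1) % q->depth;
    /* A real driver issues a store barrier here so the entry is
     * globally visible before the doorbell write.  The CPU never
     * hands the device a machine address: the device DMAs the entry
     * (and later the buffer) using guest PAs, and the IOMMU
     * translates those accesses. */
    *q->doorbell = q->tail;                   /* the only MMIO write   */
}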
On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 11 Feb 2025 14:04:59 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
-----------------isolating---------------------------------
HyperVisor maintains a PTE to map guest physical addresses
within an aperture to the page matching the device's BAR.
Standard HV stuff. Although you may want to consider that
the value in the function's BAR, must be a guest PA, not a
machine PA.
I need a "WHY" on this sentence before responding to the rest.
The guest OS (or user application) is directly programming the addresses
in the
PCIe function DMA engine[*]. The point is to completely
avoid the hypervisor from being involved in the I/O,
so the guest programs the function DMA engine using guest PA and the
IOMMU translates the guest PA into machine
addresses for all transactions initiated by the
function.
(Each PCIe function is treated as an individual device
by the OS/HV. The entire purpose of SR-IOV is to
present the device directly to the guest to avoid any
hypervisor involvement in the I/O path).
This is the missing part:: When a user performs a LD or ST
You're missing the point completely. There is no user
performing a LD or ST. The DMA controller on the device
is initiating the transaction, not the host CPU.
the guest Virtual address is translated to Guest Physical
by Guest OS translation tables. Guest Physical is then
interpreted as Host virtual and translated a second time
by {SM,HV} mapping tables. This nested MMU does both
translations in a single access; so core TLB is organized
to associate guest virtual directly with machine physical !
as if there were a level crossing PTE providing all the
"right bits".
This is basically how all modern CPUs handle it, yes.
But it is not relevant to the inbound traffic initiated
-by the device- which can't be translated by the CPU,
rather must be translated at some point between the
PCIe controller and the internal processor interconnect
(e.g. mesh).
I am tracing the path from user of device (core Guest OS).
If I don't understand this "should be simple" path, I am
too lost to continue from the device side looking towards
memory.
I can see how core can write Guest Physical Address into
device.BAR using config space access (with appropriate
MMU permissions).
But, right now: I can't see how the appropriate bit pattern
from core gets to HostBridge in MMI/O space and is recognized
by matching device.BAR down the PCIe tree.
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 11 Feb 2025 14:04:59 +0000, Scott Lurndal wrote:
---------------------
This is the missing part:: When a user performs a LD or ST
You're missing the point completely. There is no user
performing a LD or ST. The DMA controller on the device
is initiating the transaction, not the host CPU.
the guest Virtual address is translated to Guest Physical
by Guest OS translation tables. Guest Physical is then
interpreted as Host virtual and translated a second time
by {SM,HV} mapping tables. This nested MMU does both
translations in a single access; so core TLB is organized
to associate guest virtual directly with machine physical !
as if there were a level crossing PTE providing all the
"right bits".
This is basically how all modern CPUs handle it, yes.
But it is not relevant to the inbound traffic initiated
-by the device- which can't be translated by the CPU,
rather must be translated at some point between the
PCIe controller and the internal processor interconnect
(e.g. mesh).
I am tracing the path from user of device (core Guest OS).
If I don't understand this "should be simple" path, I am
too lost to continue from the device side looking towards
memory.
I can see how core can write Guest Physical Address into
device.BAR using config space access (with appropriate
MMU permissions).
But, right now: I can't see how the appropriate bit pattern
from core gets to HostBridge in MMI/O space and is recognized
by matching device.BAR down the PCIe tree.
The hardware is generally described to the operating system
through tables provided by the firmware (supplied by the system
builder). There are two different mechanisms in widespread use:
ACPI and DeviceTree. The former, sponsored primarily by Microsoft
and Intel, provides a standard mechanism for the hardware to
describe itself to the operating system/hypervisor.
An entry in those tables for a particular device will describe
in standard terms the path from the memory/processor complex
to the device. Say, for example, your CPU uses a high-speed
mesh interface between cores. The outer ring on the mesh
will connect to memory controllers and I/O bridges.
Each point in the path is described by a table entry
in the ACPI tables.
Assume a simple config
+=========+     +========+     +=======+
| CPU     |     | HOST   |     | PCIe  |
| Memory  |-----| Bridge |-----| Ctlr  |------+-----------+
| complex |     |        |     |       |      |           |
+=========+     +========+     +=======+    Func 0      Func 1
                                            BAR[0]=X    BAR[2]=Y
On Wed, 12 Feb 2025 0:34:18 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:
But, right now: I can't see how the appropriate bit pattern
from core gets to HostBridge in MMI/O space and is recognized
by matching device.BAR down the PCIe tree.
We will ignore this for a while.
The hardware is generally described to the operating system
through tables provided by the firmware (supplied by the system
builder). There are two different mechanisms in widespread use:
ACPI and DeviceTree. The former, sponsored primarily by Microsoft
and Intel, provides a standard mechanism for the hardware to
describe itself to the operating system/hypervisor.
Side topic::
Why not just spoof PCIe config space, and setup config space
headers that cause Guest OS to load the paravirtualized driver ??
instead of the direct device driver ?!?
{{It has to be a speed/latency issue}}
An entry in those tables for a particular device will describe
in standard terms the path from the memory/processor complex
to the device. Say, for example, your CPU uses a high-speed
mesh interface between cores. The outer ring on the mesh
will connect to memory controllers and I/O bridges.
Each point in the path is described by a table entry
in the ACPI tables.
Assume a simple config
+=========+     +========+     +=======+
| CPU     |     | HOST   |     | PCIe  |
| Memory  |-----| Bridge |-----| Ctlr  |------+-----------+
| complex |     |        |     |       |      |           |
+=========+     +========+     +=======+    Func 0      Func 1
                                            BAR[0]=X    BAR[2]=Y
Now, consider a device with 16 virtual functions, and 16 Guest
OSs. All 16 Guest OSs use the same Guest Physical Address in
their virtual function BARs.
How does the PCIe controller figure out that an MMI/O space
sized write is to GuestOS[7] or GuestOS[14] ?? {{Obviously,
once the command is placed in the routing tree, it will be
matched by all the GuestOS BARs.}}
Thus, there is still information missing for my understanding.
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 12 Feb 2025 0:34:18 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:
But, right now: I can't see how the appropriate bit pattern
from core gets to HostBridge in MMI/O space and is recognized
by matching device.BAR down the PCIe tree.
We will ignore this for a while.
The hardware is generally described to the operating system
through tables provided by the firmware (supplied by the system
builder). There are two different mechanisms in widespread use:
ACPI and DeviceTree. The former, sponsored primarily by Microsoft
and Intel, provides a standard mechanism for the hardware to
describe itself to the operating system/hypervisor.
Side topic::
Why not just spoof PCIe config space, and setup config space
headers that cause Guest OS to load the paravirtualized driver ??
instead of the direct device driver ?!?
That was the approach prior to the PCI-SIG introducing
SRIOV in the mid 2000s. Performance sucked.
{{It has to be a speed/latency issue}}
Throughput reduced and extra HV overhead.
An entry in those tables for a particular device will describe
in standard terms the path from the memory/processor complex
to the device. Say, for example, your CPU uses a high-speed
mesh interface between cores. The outer ring on the mesh
will connect to memory controllers and I/O bridges.
Each point in the path is described by a table entry
in the ACPI tables.
Assume a simple config
+=========+     +========+     +=======+
| CPU     |     | HOST   |     | PCIe  |
| Memory  |-----| Bridge |-----| Ctlr  |------+-----------+
| complex |     |        |     |       |      |           |
+=========+     +========+     +=======+    Func 0      Func 1
                                            BAR[0]=X    BAR[2]=Y
Now, consider a device with 16 virtual functions, and 16 Guest
OSs. All 16 Guest OSs use the same Guest Physical Address in
their virtual function BARs.
Technically, there would be one physical function (owned
by the hypervisor - with control CSRs apportioning resources
between virtual functions) and up to 65535 virtual functions.
The 'bus' and 'function' together make up a 16-bit routing
id (RID) which is the target id field in the Config Read/Write TLP. The
PF will generally have a function number between 0 and 7,
(although with ARI, it can be any function number) and the
VF routing IDs will start at some offset from the PF with
a programmable stride (e.g. for a PF with 3 VFs):
PF0 RID = 0
VF1 RID = 8
VF2 RID = 16
VF3 RID = 24
When the VF number exceeds 255, its RID will be function 0
on the next higher bus number.
As an endpoint function, the bus field in the RID will be the secondary
bus of the Root Complex bridge (generally 1).
So the PCI config space RID (BDF) for VF2 would be 0x0110.
This RID (in combination with a PCIe controller index (segment
in Intel terminology)) is used by the IOMMU to select the
translation table to use for inbound DMA from this function.
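A small sketch of that routing-ID arithmetic, reproducing the example
above (offset 8, stride 8, PF at bus 1 function 0); on real hardware
the offset and stride come from the device's SR-IOV capability.

/* VF RIDs start at (PF RID + FirstVFOffset) and advance by VFStride;
 * a function-number overflow carries naturally into the bus field. */
#include <stdint.h>
#include <stdio.h>

static uint16_t vf_rid(uint16_t pf_rid, uint16_t first_vf_offset,
                       uint16_t vf_stride, uint16_t vf_index /* 1-based */)
{
    return (uint16_t)(pf_rid + first_vf_offset + (vf_index - 1) * vf_stride);
}

int main(void)
{
    uint16_t pf_rid = (1 << 8) | 0;   /* bus 1, device 0, function 0  */
    /* Prints 0x0110, matching the VF2 example in the text above.     */
    printf("VF2 RID = 0x%04x\n", (unsigned)vf_rid(pf_rid, 8, 8, 2));
    return 0;
}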
How does the PCIe controller figure out that an MMI/O space
sized write is to GuestOS[7] or GuestOS[14] ?? {{Obviously,
once the command is placed in the routing tree, it will be
matched by all the GuestOS BARs.}}
The PCIe controller sends all MRW/MRD TLPs to the endpoint
device, which matches them against ALL BAR registers (all PFs
and all VFs) on the device to determine which function
the memory read or memory write TLP is targeting.
(Note that the VF BARs are actually fixed, the VF
memory spaces are equal sized and contiguous, so
the endpoint only needs to CAM on six BARs at most)
It gets a bit more complicated when there is a PCIe
switch on the device, likewise if there is a PCIe to PCI
bridge on the endpoint (very unlikely nowadays).
If the TLP address doesn't match any _enabled_ function
BAR, a memory write (posted) will be dropped[*] and a memory read
will return a UR (Unsupported Request) completion TLP.
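A sketch of why that fixed, equal-sized, contiguous VF layout keeps the
CAM small: one range check covers every VF, and simple arithmetic
recovers which VF the access belongs to (names are mine).

#include <stdint.h>
#include <stdbool.h>

struct vf_window {
    uint64_t base;        /* address of VF0's slice of the VF BAR     */
    uint64_t per_vf_size; /* identical size for every VF              */
    uint32_t num_vfs;
};

/* Returns true, the VF index, and the offset within that VF's BAR
 * if addr falls anywhere in the contiguous VF window. */
bool match_vf_bar(const struct vf_window *w, uint64_t addr,
                  uint32_t *vf_index, uint64_t *offset_in_bar)
{
    if (addr < w->base ||
        addr - w->base >= (uint64_t)w->num_vfs * w->per_vf_size)
        return false;
    *vf_index      = (uint32_t)((addr - w->base) / w->per_vf_size);
    *offset_in_bar = (addr - w->base) % w->per_vf_size;
    return true;
}

Presumably this, together with the per-guest stage-2 mapping that sends
each guest's identical guest PA to a different slice of the window, is
what keeps the 16-guest case unambiguous.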
The real question is how does the CPU route the
load or store request to the proper PCIe controller
port, and that's the hard part, particularly to
follow the proper PCIe transaction ordering rules.
Thus, there is still information missing for my understanding.
[*] it may post a RAS error in some implementation defined manner
or via the Advanced Error Reporting (AER) PCI Express capability.
On Thu, 13 Feb 2025 18:12:52 +0000, Scott Lurndal wrote:
How does the PCIe controller figure out that an MMI/O space
sized write is to GuestOS[7] or GuestOS[14] ?? {{Obviously,
once the command is placed in the routing tree, it will be
matched by all the GuestOS BARs.}}
The PCIe controller sends all MRW/MRD TLPs to the endpoint
device, which matches them against ALL BAR registers (all PFs
and all VFs) on the device to determine which function
the memory read or memory write TLP is targeting.
What is supposed to happen when more than 1 BAR matches
the TLP address on any given bus??
(Note that the VF BARs are actually fixed, the VF
memory spaces are equal sized and contiguous, so
the endpoint only needs to CAM on six BARs at most)
It gets a bit more complicated when there is a PCIe
switch on the device, likewise if there is a PCIe to PCI
bridge on the endpoint (very unlikely nowadays).
If the TLP address doesn't match any _enabled_ function
BAR, a memory write (posted) will be dropped[*] and a memory read
will return a UR (Unsupported Request) completion TLP.
But what happens when more than 1 BAR matches the supplied address ??
HW would typically have each matching BAR capture the data
being written, or upon a read, read all the control registers
and either AND them or OR them together (wired OR read-out
bus). Neither of which is what SW will be expecting.
On Fri, 7 Feb 2025 13:57:51 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:
On 2/6/2025 10:51 AM, EricP wrote:
MitchAlsup1 wrote:
On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
-------------------
Not sure how this would work with device IO and DMA.
Say a secure kernel that owns a disk drive with secrets that even the HV
is not authorized to see (so HV operators don't need Top Secret
clearance).
The Hypervisor has to pass to a hardware device DMA access to a memory
frame that it has no access to itself. How does one block the HV from
setting the IOMMU to DMA the device's secrets into its own memory?
Hmmm... something like: once a secure HV passes a physical frame address
to a secure kernel then it cannot take it back, it can only ask that
kernel for it back. Which means that the HV loses control of any
core or IOMMU PTE's that map that frame until it is handed back.
That would seem to imply that once an HV gives memory to a secure
guest kernel that it can only page that guest with its permission.
Hmmm...
I am a little confused here. When you talk about IOMMU addresses, are
you talking about memory addresses or disk addresses?
I/O MMU does not see the device commands containing the sector on
the disk to be accessed. Mostly, CPUs write directly to the CRs
of the device to start a command, bypassing I/O MMU as raw data.
That is indeed the case. The IOMMU is on the inbound path
from the PCIe controller to the internal bus/mesh structure.
Note that there is a translation on the outbound path from
the host address space to the PCIe memory space - this is
often 1:1, but need not be so. This translation happens
in the PCIe controller when creating a TLP that contains
an address before sending the TLP to the endpoint. Take
Is there any reason this cannot happen in the core MMU ??
Guest OS uses a virtual device address given to it from HV.
HV sets up the 2nd nesting of translation to translate this
to "what HostBridge needs" to route commands to device control
registers. The handoff can be done by spoofing config space
or having HV simply hand Guest OS a list of devices it can
discover/configure/use.
an AHCI controller, for example, where the only device
BAR is 32-bits; if a host wants to map the AHCI controller
at a 64-bit address, the controller needs to map that 64-bit
address window into a 32-bit 3DW TLP to be sent to the endpoint
function.
This is one of the reasons My 66000 architecture has a unique
MMI/O address space--you can setup a 32-bit BAR to put a
page of control registers in 32-bit address space without
conflict. {{If I understand correctly}} Core MMU, then,
translates normal device virtual control register addresses
such that the request is routed to where the device is looking
{{which has 32 high order bits zero.}}
On the other hand--it would take a very big system indeed to
overflow the 32-bit MMI/O space, although ECAM can access
42-bit device CR MMI/O space.
On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
This is basically how all modern CPUs handle it, yes.
But it is not relevant to the inbound traffic initiated
-by the device- which can't be translated by the CPU,
rather must be translated at some point between the
PCIe controller and the internal processor interconnect
(e.g. mesh).
I am tracing the path from user of device (core Guest OS).
If I don't understand this "should be simple" path, I am
too lost to continue from the device side looking towards
memory.
I can see how core can write Guest Physical Address into
device.BAR using config space access (with appropriate
MMU permissions).
But, right now: I can't see how the appropriate bit pattern
from core gets to HostBridge in MMI/O space and is recognized
by matching device.BAR down the PCIe tree.
core executes the following instruction::
STH R7,[Rdevice,#controlreg]
Rdevice has a Virtual Address bit pattern which after 1 level of
translation matches the bit pattern put into device.BAR at config.
#controlreg is the offset to the control reg.
On Tue, 11 Feb 2025 23:29:04 +0000, MitchAlsup1 wrote:
On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
This is basically how all modern CPUs handle it, yes.
But it is not relevent to the inbound traffic initiated
-by the device- which can't be translated by the CPU,
rather must be translated at some point between the
PCIe controller and the internal processor interconnect
(e.g. mesh).
I am tracing the path from user of device (core Guest OS).
If I don't understand this "should be simple" path, I am
too lost to continue from the device side looking towards
memory.
I can see how core can write Guest Physical Address into
device.BAR using config space access (with appropriate
MMU permissions).
But, right now: I can't see how the appropriate bit pattern
from core gets to HostBridge in MMI/O space and is recognized
by matching device.BAR down the PCIe tree.
core executes the following instruction::
STH R7,[Rdevice,#controlreg]
Rdevice has a Virtual Address bit pattern which after 1 level of
translation matches the bit pattern put into device.BAR at config.
#controlreg is the offset to the control reg.
Update:
I have figured out how to re-attach Guest Physical BAR back
as MMI/O commands enter the top of a PCIe tree.
Thanks to Scott Lurndal for being gentle with me.
mitchalsup@aol.com (MitchAlsup1) writes:
-------------
Update:
I have figured out how to re-attach Guest Physical BAR back
as MMI/O commands enter the top of a PCIe tree.
Thanks to Scott Lurndal for being gentle with me.
Here is an example topology from a Raptor Lake system:
bus:dev.function (bus 0 is a traditional PCI bus)
Region X is BAR X.
These devices are all built into either the core or
the PCH/southbridge.
The first plug-in PCI card would reside on bus 4.
Only Intel systems provide or use the I/O port (legacy 8086) BARs.
$ lspci -vvv | egrep "^[0-9]|Region "
00:00.0 Host bridge: Intel Corporation Raptor Lake-S 8+12 - Host
Bridge/DRAM Controller (rev 01)
00:02.0 VGA compatible controller: Intel Corporation Raptor Lake-S GT1
[UHD Graphics 770] (rev 04) (prog-if 00 [VGA controller])
Region 0: Memory at 6000000000 (64-bit, non-prefetchable)
[size=16M]
Region 2: Memory at 4000000000 (64-bit, prefetchable)
[size=256M]
Region 4: I/O ports at 5000 [size=64]
00:04.0 Signal processing controller: Intel Corporation Raptor Lake
Dynamic Platform and Thermal Framework Processor Participant (rev 01)
Region 0: Memory at 6001100000 (64-bit, non-prefetchable)
[size=128K]
00:08.0 System peripheral: Intel Corporation GNA Scoring Accelerator
module (rev 01)
Region 0: Memory at 600113b000 (64-bit, non-prefetchable)
[disabled] [size=4K]
00:14.0 USB controller: Intel Corporation Alder Lake-S PCH USB 3.2 Gen
2x2 XHCI Controller (rev 11) (prog-if 30 [XHCI])
Region 0: Memory at 6001120000 (64-bit, non-prefetchable)
[size=64K]
00:14.2 RAM memory: Intel Corporation Alder Lake-S PCH Shared SRAM (rev
11)
Region 0: Memory at 6001134000 (64-bit, non-prefetchable)
[disabled] [size=16K]
Region 2: Memory at 600113a000 (64-bit, non-prefetchable)
[disabled] [size=4K]
00:16.0 Communication controller: Intel Corporation Alder Lake-S PCH
HECI Controller #1 (rev 11)
Region 0: Memory at 6001139000 (64-bit, non-prefetchable)
[size=4K]
00:17.0 SATA controller: Intel Corporation Alder Lake-S PCH SATA
Controller [AHCI Mode] (rev 11) (prog-if 01 [AHCI 1.0])
Region 0: Memory at 70700000 (32-bit, non-prefetchable)
[size=8K]
Region 1: Memory at 70704000 (32-bit, non-prefetchable)
[size=256]
Region 2: I/O ports at 5080 [size=8]
Region 3: I/O ports at 5088 [size=4]
Region 4: I/O ports at 5060 [size=32]
Region 5: Memory at 70703000 (32-bit, non-prefetchable)
[size=2K]
00:1a.0 PCI bridge: Intel Corporation Alder Lake-S PCH PCI Express Root
Port #25 (rev 11) (prog-if 00 [Normal decode])
00:1c.0 PCI bridge: Intel Corporation Alder Lake-S PCH PCI Express Root
Port #3 (rev 11) (prog-if 00 [Normal decode])
00:1c.3 PCI bridge: Intel Corporation Device 7abb (rev 11) (prog-if 00 [Normal decode])
00:1f.0 ISA bridge: Intel Corporation Device 7a86 (rev 11)
00:1f.3 Audio device: Intel Corporation Alder Lake-S HD Audio Controller
(rev 11)
Region 0: Memory at 6001130000 (64-bit, non-prefetchable)
[size=16K]
Region 4: Memory at 6001000000 (64-bit, non-prefetchable)
[size=1M]
00:1f.4 SMBus: Intel Corporation Alder Lake-S PCH SMBus Controller (rev
11)
Region 0: Memory at 6001138000 (64-bit, non-prefetchable)
[size=256]
Region 4: I/O ports at efa0 [size=32]
00:1f.5 Serial bus controller: Intel Corporation Alder Lake-S PCH SPI Controller (rev 11)
Region 0: Memory at 70702000 (32-bit, non-prefetchable)
[size=4K]
01:00.0 Non-Volatile memory controller: Sandisk Corp WD PC SN5000S M.2
2230 NVMe SSD (DRAM-less) (prog-if 02 [NVM Express])
Region 0: Memory at 70600000 (64-bit, non-prefetchable)
[size=16K]
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller (rev 1b)
Region 0: I/O ports at 4000 [size=256]
Region 2: Memory at 70504000 (64-bit, non-prefetchable) [size=4
Region 4: Memory at 70500000 (64-bit, non-prefetchable)
[size=16K]
03:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8852BE
PCIe 802.11ax Wireless Network Controller
Region 0: I/O ports at 3000 [size=256]
Region 2: Memory at 70400000 (64-bit, non-prefetchable)
[size=1M]
The traditional PCI bus supports a 5-bit device # and a 3-bit function #.
A PCIe bus supports either a 3-bit function number with the high-order
five bits MBZ (i.e. only device 0 is legal on PCIe) or, if the function
advertises the Alternative Routing-ID Interpretation (ARI) capability,
an 8-bit function number (SRIOV leverages ARI to support dense routing
IDs, but any bus that supports ARI can handle 256 physical functions).
PCI supports two forms of configuration space addresses in TLPs:
Type 0 contains only a function number and the register address.
Type 1 contains the bus number, function number and register address.
A PCI-PCI bridge (such as the root complex port bridge) will translate
a type 1 transaction to type 0 when the target RID is on the
configured secondary bus, or forward the type 1 transaction to
a subordinate bus bridge. With ARI, the upstream bridge
from the endpoint needs to be configured as ARI enabled so that
it forwards type 1 transactions to the secondary bus, as the
SRIOV Routing IDs can extend into the 8-bit bus space (allowing
up to 65535 virtual functions associated with a single PF).
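A sketch of that bridge routing decision (the enum and field names are
mine; with ARI enabled the device-number restriction on the secondary
bus is relaxed as described above):

/* Decide what a PCI-PCI bridge does with a type 1 config TLP:
 * convert it to type 0 for its own secondary bus, forward it toward a
 * deeper bridge when the bus falls in the secondary..subordinate
 * range, or leave it unclaimed otherwise. */
#include <stdint.h>

enum cfg_action { CFG_CONVERT_TO_TYPE0, CFG_FORWARD_TYPE1, CFG_NOT_CLAIMED };

struct bridge { uint8_t secondary_bus; uint8_t subordinate_bus; };

enum cfg_action route_cfg_type1(const struct bridge *br, uint16_t target_rid)
{
    uint8_t bus = (uint8_t)(target_rid >> 8);
    if (bus == br->secondary_bus)
        return CFG_CONVERT_TO_TYPE0;   /* endpoint is directly below us  */
    if (bus > br->secondary_bus && bus <= br->subordinate_bus)
        return CFG_FORWARD_TYPE1;      /* a deeper bridge will convert   */
    return CFG_NOT_CLAIMED;
}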