• Re: Stacks, was Segments

    From Scott Lurndal@21:1/5 to John Levine on Sat Jan 18 16:30:20 2025
    John Levine <johnl@taugh.com> writes:
    According to MitchAlsup1 <mitchalsup@aol.com>:
    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Huh? Algol 60 routines could be mutually recursive so unless it was
    a leaf procedure or the outer block, everything not declared "own"
    went on the stack.

    For some flavors of Algol _everything_ was on the stack.
    (e.g. B5500 and successors).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Jan 18 17:40:00 2025
    On Sat, 18 Jan 2025 16:30:20 +0000, Scott Lurndal wrote:

    John Levine <johnl@taugh.com> writes:
    According to MitchAlsup1 <mitchalsup@aol.com>:
    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Huh? Algol 60 routines could be mutually recursive so unless it was
    a leaf procedure or the outer block, everything not declared "own"
    went on the stack.

    For some flavors of Algol _everything_ was on the stack.
    (e.g. B5500 and successors).

    1108 Algol had everything on the stack.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sat Jan 18 19:41:56 2025
    According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Huh? Algol 60 routines could be mutually recursive so unless it was
    a leaf procedure or the outer block, everything not declared "own"
    went on the stack.

    Mitch's point AIUI was that Algol 60 had no heap allocation (and no
    explicit pointer types), so indeed all data were either on the stack or statically allocated.

    It sounded to me like he said that dynamically sized arrays were on the
    stack, nothing else was. I think we agree that everything but "own"
    is on the stack.

    Algol 60 did need a heap because own arrays could have variable size.
    That wasn't an accident since sec 5.2.2 shows an example of a variable
    size own array. I suspect they didn't realize the implications both
    of resizing non-stack data, and what happens in an upper level call
    if a lower level call resizes the array underneath it.

    It wasn't the only mistake like that. Alan Perlis told me that they
    intended call by name to be an elegantly phrased definition of call
    by reference, and it wasn't until Jensen's device that they realized what
    they had actually done.
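
    Call by name means the actual parameter is re-evaluated, in the caller's
    environment, at every use of the formal; Jensen's device exploits that to
    pass a whole formula. A rough C sketch of the effect (the thunk/env
    machinery is purely illustrative, and the by-name index is approximated
    by a reference):

    #include <stdio.h>

    /* A "by name" parameter behaves like a thunk: every use of the formal
       re-evaluates the actual argument in the caller's environment. */
    struct env   { int i; double *a; };
    struct thunk { double (*eval)(struct env *); };

    static double eval_a_i(struct env *e) { return e->a[e->i]; }   /* a[i] */

    /* Jensen's device: both i and expr are "by name", so assigning to i
       inside sum() changes what expr evaluates to on the next use. */
    static double sum(struct env *e, int lo, int hi, struct thunk expr)
    {
        double s = 0.0;
        for (e->i = lo; e->i <= hi; e->i++)   /* "assign to the name i" */
            s += expr.eval(e);                /* re-evaluate expr each use */
        return s;
    }

    int main(void)
    {
        double a[5] = {1, 2, 3, 4, 5};
        struct env e = {0, a};
        struct thunk a_i = { eval_a_i };
        printf("%g\n", sum(&e, 0, 4, a_i));   /* sums a[0..4], prints 15 */
        return 0;
    }

    With plain call by reference, expr would have been evaluated once at the
    call site and the loop would just add the same value five times.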

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Niklas Holsti on Sun Jan 19 17:33:53 2025
    On 18/01/2025 09:59, Niklas Holsti wrote:
    On 2025-01-18 5:08, John Levine wrote:
    According to MitchAlsup1 <mitchalsup@aol.com>:
    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Huh?  Algol 60 routines could be mutually recursive so unless it was
    a leaf procedure or the outer block, everything not declared "own"
    went on the stack.


    Mitch's point AIUI was that Algol 60 had no heap allocation (and no
    explicit pointer types), so indeed all data were either on the stack or statically allocated.

    I'm not an English native speaker, but it seems to me that Mitch should
    have written "Algol 60 had only stack allocation" instead of "Algol 60
    only had stack allocation".

    The most-used Ada compiler, GNAT, uses a "secondary stack" to reduce the
    need for heap. Dynamically sized local data are placed on the secondary stack, and dynamically sized return values of functions are returned on
    the secondary stack. So a function can return "by value" an array sized
    1..N, with N a function parameter, without needing the heap.

    Of course the programmer then has the problem of setting sufficient
    sizes for /two/ stacks, the primary and the secondary. For
    embedded-systems programs one usually avoids constructs that would need
    a secondary stack.


    A two-stack setup can be used in C too. (The C standards don't require
    a stack at all.) On the AVR microcontroller, it is not uncommon for C implementations to work with a dual stack, since it does not have any
    kind of "[SP + n]" or "[SP + r]" addressing modes, but it /does/ have an
    "[Y + n]" addressing mode using an index register.

    Two stacks are also pretty much required for FORTH.

    The use of a dual stack could also significantly improve the security of systems by separating call/return addresses from data.
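
    A minimal C sketch of the two-stack idea, with invented names: a
    software-managed secondary (data) stack beside the hardware call stack,
    used here to return a dynamically sized array "by value" much as GNAT's
    secondary stack does. Sizes, alignment and the lack of overflow checking
    are all simplifications.

    #include <stddef.h>

    /* Software-managed secondary (data) stack; the hardware stack keeps
       only return addresses and small scalars. */
    static _Alignas(8) unsigned char data_stack[4096];
    static size_t data_sp = 0;                      /* grows upward */

    static void *ds_alloc(size_t n)                 /* "push" n bytes */
    {
        void *p = &data_stack[data_sp];
        data_sp += (n + 7) & ~(size_t)7;            /* keep 8-byte alignment */
        return p;
    }

    static void ds_release(size_t mark) { data_sp = mark; }   /* pop to mark */

    /* Return an array sized 1..n "by value": the result lives on the
       secondary stack and the caller cuts the stack back when done. */
    static int *iota(size_t n)
    {
        int *a = ds_alloc(n * sizeof *a);
        for (size_t i = 0; i < n; i++)
            a[i] = (int)(i + 1);
        return a;
    }

    int main(void)
    {
        size_t mark = data_sp;       /* caller remembers the mark */
        int *v = iota(10);           /* dynamically sized "return value" */
        int last = v[9];
        ds_release(mark);            /* release everything above the mark */
        return last == 10 ? 0 : 1;
    }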

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Sun Jan 19 18:28:40 2025
    On Sun, 19 Jan 2025 16:33:53 +0000, David Brown wrote:

    A two-stack setup can be used in C too. (The C standards don't require
    a stack at all.) On the AVR microcontroller, it is not uncommon for C implementations to work with a dual stack, since it does not have any
    kind of "[SP + n]" or "[SP + r]" addressing modes, but it /does/ have an
    "[Y + n]" addressing mode using an index register.

    Two stacks are also pretty much required for FORTH.

    The use of a dual stack could also significantly improve the security of systems by separating call/return addresses from data.

    In My 66000 the code cannot read/write that other stack with LD and ST instructions. It can only be accessed by ENTER (stores) and EXIT (LDs).
    The mapping PTE is marked RWE = 000.

    So, while you can still overrun buffers, you cannot damage the call/
    return stack or the preserved registers !!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to David Brown on Sun Jan 19 23:37:17 2025
    On 2025-01-19 18:33, David Brown wrote:
    On 18/01/2025 09:59, Niklas Holsti wrote:

    [...]


    The most-used Ada compiler, GNAT, uses a "secondary stack" to reduce
    the need for heap. Dynamically sized local data are placed on the
    secondary stack, and dynamically sized return values of functions are
    returned on the secondary stack. So a function can return "by value"
    an array sized 1..N, with N a function parameter, without needing the
    heap.

    Of course the programmer then has the problem of setting sufficient
    sizes for /two/ stacks, the primary and the secondary. For
    embedded-systems programs one usually avoids constructs that would
    need a secondary stack.


    A two-stack setup can be used in C too.  (The C standards don't require
    a stack at all.)  On the AVR microcontroller, it is not uncommon for C implementations to work with a dual stack, since it does not have any
    kind of "[SP + n]" or "[SP + r]" addressing modes, but it /does/ have an
    "[Y + n]" addressing mode using an index register.


    Yes. Other C compilers use a single stack but use Y as a frame pointer
    so they can use "[Y + n]" to access stack-frame locations.

    The issue is more acute for 8051/MCS-51 systems where the call/return
    stack is in the very small "internal" RAM, so C compilers often allocate
    a larger "SW stack" for stack data in the larger "external" RAM. But
    they do so only for potentially recursive or reentrant functions, and
    instead use statically allocated space for the call-frames of other
    functions (with smart whole-program analysis to share such space for
    functions that can never be active at the same time).
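
    A rough illustration of that "compiled stack" technique in C (the overlay
    is written out by hand here; a real 8051 toolchain derives the placement
    from the call graph):

    #include <stdint.h>

    /* Locals of non-reentrant functions live in static RAM, and functions
       that can never be active at the same time overlay the same bytes --
       shown as a hand-written union; the compiler/linker would generate
       this placement itself. */
    static union {
        struct { uint8_t buf[8]; uint8_t i; } f1_frame;   /* "frame" of f1() */
        struct { uint16_t acc;   uint8_t n; } f2_frame;   /* "frame" of f2() */
    } overlay;                  /* valid only because f1 and f2 never call
                                   each other, directly or indirectly */

    uint8_t f1(uint8_t x)
    {
        for (overlay.f1_frame.i = 0; overlay.f1_frame.i < 8; overlay.f1_frame.i++)
            overlay.f1_frame.buf[overlay.f1_frame.i] = (uint8_t)(x + overlay.f1_frame.i);
        return overlay.f1_frame.buf[7];
    }

    uint16_t f2(uint8_t n)
    {
        overlay.f2_frame.n   = n;
        overlay.f2_frame.acc = (uint16_t)(overlay.f2_frame.n * 3);
        return overlay.f2_frame.acc;
    }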

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Niklas Holsti on Mon Jan 20 09:00:43 2025
    On 19/01/2025 22:37, Niklas Holsti wrote:
    On 2025-01-19 18:33, David Brown wrote:
    On 18/01/2025 09:59, Niklas Holsti wrote:

       [...]


    The most-used Ada compiler, GNAT, uses a "secondary stack" to reduce
    the need for heap. Dynamically sized local data are placed on the
    secondary stack, and dynamically sized return values of functions are
    returned on the secondary stack. So a function can return "by value"
    an array sized 1..N, with N a function parameter, without needing the
    heap.

    Of course the programmer then has the problem of setting sufficient
    sizes for /two/ stacks, the primary and the secondary. For
    embedded-systems programs one usually avoids constructs that would
    need a secondary stack.


    A two-stack setup can be used in C too.  (The C standards don't
    require a stack at all.)  On the AVR microcontroller, it is not
    uncommon for C implementations to work with a dual stack, since it
    does not have any kind of "[SP + n]" or "[SP + r]" addressing modes,
    but it /does/ have an "[Y + n]" addressing mode using an index register.


    Yes. Other C compilers use a single stack but use Y as a frame pointer
    so they can use "[Y + n]" to access stack-frame locations.


    gcc for the AVR does that. I assume that it would be a massive effort
    to introduce a secondary data stack to gcc, whereas the original AVR
    port of gcc was much simpler at the cost of inefficiencies (basically
    the 32 8-bit registers were paired up and viewed as 16 16-bit registers,
    making the AVR appear like the 16-bit RISC processors that were already well
    supported, with peephole optimisations to reduce redundant operations
    after code generation).

    Other AVR compilers that were made from scratch, or from compilers that
    already had complicated stack setups (such as ones for the 8051 you
    mention below), were more likely to use a separate data stack.

    The efficiency advantages and disadvantages of these two arrangements
    are not clear-cut for the AVR - it depends a lot on the way the code is written.

    The issue is more acute for 8051/MCS-51 systems where the call/return
    stack is in the very small "internal" RAM, so C compilers often allocate
    a larger "SW stack" for stack data in the larger "external" RAM. But
    they do so only for potentially recursive or reentrant functions, and
    instead use statically allocated space for the call-frames of other
    functions (with smart whole-program analysis to share such space for functions that can never be active at the same time).


    Yes. This also applies to several other "brain-dead" 8-bit CISC
    architectures.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Michael S on Mon Jan 20 11:12:54 2025
    Michael S <already5chosen@yahoo.com> wrote:
    On Sun, 19 Jan 2025 18:28:40 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    In My 66000 the code cannot read/write that other stack with LD and ST
    instructions. It can only be accessed by ENTER (stores) and EXIT
    (LDs). The mapping PTE is marked RWE = 000.

    So, while you can still overrun buffers, you cannot damage the call/
    return stack or the preserved registers !!

    Not that I am a specialist in GC, but according to my understanding
    the most common and the best performing variants of GC can not work
    without read access to preserved registers. Compacting collector seems
    to need write access as well.
    As to return addresses, I would think that read access to stack of
    return addresses is necessary for exception handling.

    _Correctness_ of GC depends on the ability to see preserved registers
    and the return address: a return address may be the only live reference
    to some function, and similarly for preserved registers. One could
    try to work around the lack of access using a separate software-managed
    stack duplicating data from the "hardware" stack, but that is ugly
    and is likely to kill any performance advantage from the hardware
    features.

    BTW, the same holds for debuggers and exception handling. Those
    clearly need some way to get around the hardware limitations.
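
    The point can be made concrete with a conservative root scan: the
    collector treats the saved-register / return-address area as part of the
    root set, so it must at least be able to read those words (and a
    compacting collector must be able to rewrite them). A hedged C sketch;
    heap_contains() and mark_object() are hypothetical hooks assumed to be
    supplied by the runtime:

    #include <stdint.h>

    extern int  heap_contains(uintptr_t word);   /* hypothetical allocator hook */
    extern void mark_object(uintptr_t word);     /* hypothetical collector hook */

    /* Conservative root scan: every word in [lo, hi) that looks like a
       pointer into the heap keeps its object alive.  If this region is
       the saved-register / return-address stack and the hardware forbids
       loads from it (RWE = 000), the collector simply cannot do this. */
    void scan_roots(const uintptr_t *lo, const uintptr_t *hi)
    {
        for (const uintptr_t *p = lo; p < hi; p++) {
            uintptr_t word = *p;                 /* needs read access */
            if (heap_contains(word))
                mark_object(word);               /* a compactor would also
                                                    need to store back the
                                                    relocated address here */
        }
    }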

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Mon Jan 20 12:55:37 2025
    On Sun, 19 Jan 2025 18:28:40 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    In My 66000 the code cannot read/write that other stack with LD and ST instructions. It can only be accessed by ENTER (stores) and EXIT
    (LDs). The mapping PTE is marked RWE = 000.

    So, while you can still overrun buffers, you cannot damage the call/
    return stack or the preserved registers !!

    Not that I am a specialist in GC, but according to my understanding
    the most common and best-performing variants of GC cannot work
    without read access to preserved registers. A compacting collector seems
    to need write access as well.
    As to return addresses, I would think that read access to the stack of
    return addresses is necessary for exception handling.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Waldek Hebisch on Mon Jan 20 22:05:10 2025
    On Mon, 20 Jan 2025 11:12:54 +0000, Waldek Hebisch wrote:

    Michael S <already5chosen@yahoo.com> wrote:
    On Sun, 19 Jan 2025 18:28:40 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    In My 66000 the code cannot read/write that other stack with LD and ST
    instructions. It can only be accessed by ENTER (stores) and EXIT
    (LDs). The mapping PTE is marked RWE = 000.

    So, while you can still overrun buffers, you cannot damage the call/
    return stack or the preserved registers !!

    Not that I am a specialist in GC, but according to my understanding
    the most common and the best performing variants of GC can not work
    without read access to preserved registers. Compacting collector seems
    to need write access as well.
    As to return addresses, I would think that read access to stack of
    return addresses is necessary for exception handling.

    _Correctness_ of GC depends on ability to see preserved registers
    and return address: return address may be the only live reference
    to some function and similarly for preserved registers. One could
    try to work around lack of access using separate software-managed
    stack duplicating data from "hardware" stack, but that is ugly
    and is likely to kill any performance advantage from hardware
    features.

    BTW, the same holds for debuggers and exception handling. Those
    clearly need some way to go around hardware limitations.

    Yes, there is a way to do all those things, but I am not in a position
    to discuss it due to USPTO rules.

    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe meso-privileged ?!?}}

    I/O works similarly--in that to the application a page may be marked
    RWE=001 (execute only) but the swap disk is allowed to read or write
    those pages.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Tue Jan 21 01:25:19 2025
    On Mon, 20 Jan 2025 22:05:10 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 20 Jan 2025 11:12:54 +0000, Waldek Hebisch wrote:

    Michael S <already5chosen@yahoo.com> wrote:
    On Sun, 19 Jan 2025 18:28:40 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    In My 66000 the code cannot read/write that other stack with LD
    and ST instructions. It can only be accessed by ENTER (stores)
    and EXIT (LDs). The mapping PTE is marked RWE = 000.

    So, while you can still overrun buffers, you cannot damage the
    call/ return stack or the preserved registers !!

    Not that I am a specialist in GC, but according to my understanding
    the most common and the best performing variants of GC can not work
    without read access to preserved registers. Compacting collector
    seems to need write access as well.
    As to return addresses, I would think that read access to stack of
    return addresses is necessary for exception handling.

    _Correctness_ of GC depends on ability to see preserved registers
    and return address: return address may be the only live reference
    to some function and similarly for preserved registers. One could
    try to work around lack of access using separate software-managed
    stack duplicating data from "hardware" stack, but that is ugly
    and is likely to kill any performance advantage from hardware
    features.

    BTW, the same holds for debuggers and exception handling. Those
    clearly need some way to go around hardware limitations.

    Yes, there is a way to do all those things, but I am not in a position
    to discuss due to USPTO rules.

    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe meso-privileged ?!?


    Call it 'user'. Then rename the level that you now call 'application'
    to 'sandbox'.

    I/O works similarly--in that to the application a page may be marked
    RWE=001 (execute only) but the swap disk is allowed to read or write
    those pages.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Tue Jan 21 00:17:56 2025
    On Mon, 20 Jan 2025 23:25:19 +0000, Michael S wrote:

    On Mon, 20 Jan 2025 22:05:10 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 20 Jan 2025 11:12:54 +0000, Waldek Hebisch wrote:

    Michael S <already5chosen@yahoo.com> wrote:
    On Sun, 19 Jan 2025 18:28:40 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    In My 66000 the code cannot read/write that other stack with LD
    and ST instructions. It can only be accessed by ENTER (stores)
    and EXIT (LDs). The mapping PTE is marked RWE = 000.

    So, while you can still overrun buffers, you cannot damage the
    call/ return stack or the preserved registers !!

    Not that I am a specialist in GC, but according to my understanding
    the most common and the best performing variants of GC can not work
    without read access to preserved registers. Compacting collector
    seems to need write access as well.
    As to return addresses, I would think that read access to stack of
    return addresses is necessary for exception handling.

    _Correctness_ of GC depends on ability to see preserved registers
    and return address: return address may be the only live reference
    to some function and similarly for preserved registers. One could
    try to work around lack of access using separate software-managed
    stack duplicating data from "hardware" stack, but that is ugly
    and is likely to kill any performance advantage from hardware
    features.

    BTW, the same holds for debuggers and exception handling. Those
    clearly need some way to go around hardware limitations.

    Yes, there is a way to do all those things, but I am not in a position
    to discuss due to USPTO rules.

    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe
    meso-privileged ?!?


    Call it 'user'. Then rename the level that you now call 'application'
    to 'sandbox'.

    Realistically--there are 3 levels in each privilege layer::

    least) sandbox--for Jitted code
    medium) {user, JIT, Dynamic library, ...}
    higher) {debug, GC, Exception, interrupt, Dynamic loader, device DMA, ...}
    {{none of which need access to other-than-user VAS, or other-than-user privileges}}

    All sharing a single address space, and a software stack of supervision, interrupt table, file-ids, socket-ids,...

    The higher level of privilege allows this level to disobey the permissions
    in the PTE (possibly under a flag from ROOT).

    So, while sandbox is a fine name for the least privileged running
    environment, we still need a name for the medium level. It is almost
    like the higher level is a good portion of the GuestOS kernel--those parts requiring no privilege in any normal sense.

    I/O works similarly--in that to the application a page may be marked
    RWE=001 (execute only) but the swap disk is allowed to read or write
    those pages.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Tue Jan 21 06:21:36 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe meso-privileged ?!?

    Authorized? Or a numbering system for different privilege levels,
    like it was used for rings?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bill Findlay@21:1/5 to All on Tue Jan 21 10:36:04 2025
    On 20 Jan 2025, MitchAlsup1 wrote
    (in article<43e21bd0bddea1733cd672c07a6319d4@www.novabbs.org>):

    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe meso-privileged ?!?

    Entitled? 8-)
    --
    Bill Findlay

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Bill Findlay on Tue Jan 21 17:49:13 2025
    On Tue, 21 Jan 2025 10:36:04 +0000, Bill Findlay wrote:

    On 20 Jan 2025, MitchAlsup1 wrote
    (in article<43e21bd0bddea1733cd672c07a6319d4@www.novabbs.org>):

    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe
    meso-privileged ?!?

    Entitled? 8-)

    Not bad, not bad at all ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Wed Feb 5 12:11:57 2025
    MitchAlsup1 wrote:
    On Mon, 3 Feb 2025 22:47:24 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 3 Feb 2025 21:13:24 +0000, Scott Lurndal wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    It is like there is a privilege level between application and
    GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe
    meso-privileged ?!?

    handyman?

    Application -> Library -> OS -> Hypervisor -> Secure Monitor


    {Sandbox -> user -> application -> Library} ->{sual}×{GuestOS, HV, SM}

    ??

    You need to precisely define your terms. What are sandbox
    and user in this context?

    It is all about manipulating access rights without modifying
    what is stored in the TLB (so you don't have to reload any
    entries to change access rights.) It is sort of like what
    the G-bit does (global) {except in my architecture globality
    is controlled by ASID.}

    Sandbox is a privilege level where one cannot be granted both
    write and execute access at the same time. There may be other
    restrictions, too; such as access to control registers that user may
    be allowed to write.

    Library would include all the trusted stuff, but also ld.so
    and any JITs. JITs can only create code for sandboxes. So,
    JIT can write to JITcache but sandbox cannot, using the same
    PTE entry. ld.so can write GOT while user and application
    cannot write GOT (or execute GOT).

    User is the privilege level where sandbox does not apply but
    also there is no ability to over-access things protected by
    PTE.RWE.

    Application is a privilege level where PTE.RWE can sometimes
    be usurped--such as DMA from a device needing to write into
    an execute-only page.

    Where does memmove() come from if not the library ??

    Libraries have a SW-kind of trust even if they are
    devoid of HW kinds of trust (PTE.RWE overrides).

    But these levels are just talking points at this point.

    It sounds like you want something like the VAX privilege/protection mechanism.
    It had 4 privilege levels: User, Supervisor, Executive, Kernel.
    Each PTE grants R, RW or na (no-access) rights for each priv level.
    (Read access implied Execute)
    Naively that would take 4*2 = 8 bits in each 32-bit PTE.

    However they reduce the combinations with a simple set of rules:
    - if any priv level has read access then higher levels have read also.
    - if any priv level has write access then higher levels have write also.

    That brings the PTE access control field down to 4 bits for all
    four priv levels.

    For comparison, x64 PTE has 3 bits for 2 priv levels.


    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If we apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE, RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
    with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R R
    na na RE RE
    ...
    R R R R
    ...
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different threads.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Wed Feb 5 14:55:14 2025
    EricP wrote:

    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE, RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
    with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R R
    na na RE RE
    ....
    R R R R
    ....
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different threads.

    Or if you want the flexibility to choose your own REW combinations,
    the 4-bit PTE access control field is an index to a 16 entry array
    of 12-bit values for the four privilege levels.

    That's better because then the OS can decide how it wants
    the different memory sections and threads to behave and
    removes the strict hardwired hierarchy of the prior rules.

    The next problem though might be finding 4 bits in the PTE.

    (That is how I see the PTE's 2 or 3 Cache Control bits to work.
    Also there are separate CC lookup tables for interior table PTE's
    and leaf table PTE's entries.)
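
    In C terms the lookup might work as sketched below: the 4-bit access
    control field from the PTE indexes a 16-entry, 12-bit-wide table, and the
    2-bit privilege mode carried with the access selects one 3-bit R-E-W
    group. Table contents, bit layout and mode numbering are illustrative
    only, not anyone's actual encoding.

    #include <stdint.h>

    #define P_R 4u
    #define P_E 2u
    #define P_W 1u            /* "na" is simply 0 */

    /* 16-entry x 12-bit programmable table (an SRAM in the MMU): entry
       bits [3*mode+2 .. 3*mode] hold the R-E-W group for privilege mode
       'mode', here with mode 0 = Usr (least) and mode 3 = Krn (most).
       Only a few rows are filled in; the rest default to all-na. */
    static const uint16_t ac_table[16] = {
        [0]  =  P_R            << 9,                   /* na na na R    */
        [1]  = (P_R|P_E)       << 9,                   /* na na na RE   */
        [2]  = (P_R|P_W)       << 9,                   /* na na na RW   */
        [3]  = (P_R|P_E|P_W)   << 9,                   /* na na na REW  */
        [15] = ((P_R|P_E|P_W) << 9) | ((P_R|P_E|P_W) << 6) |
               ((P_R|P_E|P_W) << 3) |  (P_R|P_E|P_W),  /* REW REW REW REW */
    };

    /* What the MMU does per access: pte_ac is the 4-bit field from the
       PTE, mode is the 2-bit privilege level carried with the memref. */
    static inline unsigned allowed_rew(unsigned pte_ac, unsigned mode)
    {
        return (ac_table[pte_ac & 0xFu] >> (3u * (mode & 3u))) & 7u;
    }

    The PTE then carries only the 4-bit index, and changing a thread's
    privilege mode needs no PTE or TLB update.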

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Feb 5 21:31:05 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 3 Feb 2025 22:47:24 +0000, Scott Lurndal wrote:



    User is the privilege level where sandbox does not apply but
    also there is no ability to over-access things protected by
    PTE.RWE.

    Application is a privilege level where PTE.RWE can sometimes
    be usurped--such as DMA from a device needing to write into
    a execute only page.


    For most modern server CPUs (Intel/AMD/ARM) that is the
    responsibility of the IOMMU, not the processor/core/thread.

    Where does memmove() come from if not the library ??

    Some applications roll their own. In higher level languages,
    such as C++, explicit calls to memmove are rare to non-existent
    (the standard C++ library and compiler handle data movement).


    Libraries have a SW-kind of trust even if they are
    devoid of HW kinds of trust (PTE.RWE overrides).

    Libraries are easy to usurp in many systems (e.g. with LD_PRELOAD);
    precautions are in place to prevent such interpositions
    for applications with security constraints (e.g. installed with
    enhanced capabilities or with UID==0).


    But these levels are just talking point at this point.

    The hypervisor is optional, as would be a library.

    It cannot be a library of process !!

    Why not? See either Burroughs or HP-3000 for examples
    of libraries as first-class objects with independent
    security contexts.

    It is not a library of GuestOS !
    it is certainly not a library of Secure Monitor !!

    Why should such code not be able to leverage all the
    advantages of libraries, given suitable security controls?



    The Burroughs Large systems and HP-3000 segmented libraries
    were distinct entities with attributes.

    And could change (update/upgrade) the library while the process
    was running !!

    Under certain well-defined conditions, yes.


    Code in a library could be more privileged than the application
    when acting on behalf of the application, for example; but the
    application could not take advantage of the permissions assigned
    to the library it was linked with without using interfaces
    provided by the library.

    No disagreement.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Wed Feb 5 23:36:58 2025
    On Wed, 5 Feb 2025 17:11:57 +0000, EricP wrote:

    MitchAlsup1 wrote:

    But these levels are just talking point at this point.

    It sounds you want something like the VAX privilege/protection
    mechanism.
    It had 4 privilege levels: User, Supervisor, Executive, Kernel.
    Each PTE grants R, RW or na (no-access) rights for each priv level.
    (Read access implied Execute)
    Naively that would take 4*2 = 8 bits in each 32-bit PTE.

    However they reduce the combinations with a simple set of rules:
    - if any priv level has read access then higher levels have read also.
    - if any priv level has write access then higher levels have write also.

    That brings the PTE access control field down to 4-bits for all
    for priv levels.

    Yes, but in VAX's time we did not have applications that did not want
    the OS to look at their data (banking, video streaming, ...) or a
    massive number of attackers causing an increased demand for protection
    {even to the point of resurrecting capability machines (CHERI)}.

    For comparison, x64 PTE has 3 bits for 2 priv levels.


    EricP: thank you for the following thoughts.

    On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:

    EricP wrote:

    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE, RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
    with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R R
    na na RE RE
    ....
    R R R R
    ....
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different
    threads.

    Or if you want the flexibility to choose your own REW combinations,
    the 4-bit PTE access control field is an index to a 16 entry array
    of 12-bit values for the four privilege levels.

    That's better because then the OS can decide how it wants
    the different memory sections and thread to behave and
    removes the strict hardwired hierarchy of the prior rules.

    The next problem though might be finding 4 bits in the PTE.

    Another PTE bit I can find. Placing the 16×12 vector is more difficult,
    even when I position it as 4 places of 16×3.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Thu Feb 6 11:41:45 2025
    MitchAlsup1 wrote:
    On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:

    EricP wrote:

    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE, RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
    with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R R
    na na RE RE
    ....
    R R R R
    ....
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different
    threads.

    Or if you want the flexibility to choose your own REW combinations,
    the 4-bit PTE access control field is an index to a 16 entry array
    of 12-bit values for the four privilege levels.

    That's better because then the OS can decide how it wants
    the different memory sections and thread to behave and
    removes the strict hardwired hierarchy of the prior rules.

    The next problem though might be finding 4 bits in the PTE.

    Another PTE bit I can find. Placing the 16×12 vector is more difficult,
    even when I position it as 4 places of 16×3.

    I don't understand what you said.
    The 4-bit Access Control (AC) field is in the PTE.

    The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
    is in the MMU.

    The core's 2-bit mode selects-muxes one of the 3-bit allowed access fields
    from the indexed 12-bits to extract the 3 R-E-W bits.

    The 2-bit mode comes from the LD/ST uOp, which was set to the mode active
    when the instruction was decoded (so it can pipeline mode changes).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Feb 6 17:13:15 2025
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:

    EricP wrote:

    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE, RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
    with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R R
    na na RE RE
    ....
    R R R R
    ....
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different
    threads.

    Or if you want the flexibility to choose your own REW combinations,
    the 4-bit PTE access control field is an index to a 16 entry array
    of 12-bit values for the four privilege levels.

    That's better because then the OS can decide how it wants
    the different memory sections and thread to behave and
    removes the strict hardwired hierarchy of the prior rules.

    The next problem though might be finding 4 bits in the PTE.

    Another PTE bit I can find. Placing the 16×12 vector is more difficult,
    even when I position it as 4 places of 16×3.

    I don't understand what you said.
    The 4-bit Access Control (AC) field is in the PTE.

    Currently, PTE uses a 3-bit access control field, and PTE has
    2-bits spare. So making access control larger is easy.

    The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
    is in the MMU.

    How does the 16×12 table get to the MMU (or I/O MMU) in such a way
    that super cannot see things Hyper can see, and the same with secure ??
    So, somewhere in the various control blocks I need to find space without changing
    the overall use pattern of the control blocks and tables. Which is
    why I alluded to 4×16×3: each interpretation of the 4-bit access control
    is stored in its own natural place. It also means each layer can apply
    its own interpretation (mapping).

    The core's 2-bit mode selects-muxes one of the 3-bit allowed access
    fields from the indexed 12-bits to extract the 3 R-E-W bits.

    That much is straightforward.

    The 2-bit mode comes from the LD/ST uOp, which was set to the mode
    active when the instruction was decoded (so it can pipeline mode
    changes).

    Yes, core-state index follows the memref down the pipe.
    Core-state index is written into MMI/O/device control block for the
    DMA portion of a command, other CD indexes are associated with I/O
    page faults and device errors.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Thu Feb 6 13:51:12 2025
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:

    EricP wrote:

    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE, RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
    with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R R
    na na RE RE
    ....
    R R R R
    ....
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different
    threads.

    Or if you want the flexibility to choose your own REW combinations,
    the 4-bit PTE access control field is an index to a 16 entry array
    of 12-bit values for the four privilege levels.

    That's better because then the OS can decide how it wants
    the different memory sections and thread to behave and
    removes the strict hardwired hierarchy of the prior rules.

    The next problem though might be finding 4 bits in the PTE.

    Another PTE bit I can find. Placing the 16×12 vector is more difficult,
    even when I position it as 4 places of 16×3.

    I don't understand what you said.
    The 4-bit Access Control (AC) field is in the PTE.

    Currently, PTE uses a 3-bit access control field, and PTE has
    2-bits spare. So making access control larger is easy.

    The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
    is in the MMU.

    How does 16×12 get to the MMU (or I/O MMU) ?? in such a way that super cannot see things Hyper can see and the same with secure. So, somewhere
    in the various control blocks I need to find space without changing
    the overall use pattern of the control blocks and tables. Which is
    why I alluded to 4×16×3 each interpretation of the 4-bit access control
    is stored in its own natural place. It also means each layer can apply
    its own interpretation (mapping).

    It's just an SRAM loaded by the boot ROM before the Hypervisor boots.
    The super-secure version of boot ROM loads a table with values
    (Sandbox, User, Kernel, Hypervisor):

    Snd Usr Krn Hyp
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R na
    na na RE na
    na na RW na
    na na REW na
    na R R na
    na RE RE na
    ...
    REW REW REW na

    which grants mode 0 (Hyp) no direct RW access to any memory outside itself.
    Boot ROM sets an optional table lock so even hypervisor cannot later
    grant itself access permission to less priv memory by changing the table.
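
    A hedged sketch of that boot-time sequence, with invented names; the
    essential part is the write-once lock, which is what stops a later
    hypervisor from re-granting itself access:

    #include <stdint.h>
    #include <stdbool.h>

    /* Invented model of the MMU-side table and its write-once lock. */
    static uint16_t ac_table[16];
    static bool     ac_table_locked = false;

    /* Run from boot ROM, before the hypervisor is entered.  The policy
       passed in would have every row's Hyp group set to "na". */
    void bootrom_load_ac_table(const uint16_t policy[16])
    {
        for (int i = 0; i < 16; i++)
            ac_table[i] = policy[i];
        ac_table_locked = true;          /* optional lock set by the ROM */
    }

    /* Any later attempt to rewrite the table -- even from the hypervisor --
       is refused, so HV cannot grant itself access to less privileged
       memory by editing the table. */
    bool write_ac_table_entry(unsigned idx, uint16_t value)
    {
        if (ac_table_locked)
            return false;                /* locked: write ignored */
        ac_table[idx & 0xFu] = value;
        return true;
    }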

    The core's 2-bit mode selects-muxes one of the 3-bit allowed access
    fields from the indexed 12-bits to extract the 3 R-E-W bits.

    That much is straightforward.

    The 2-bit mode comes from the LD/ST uOp, which was set to the mode
    active when the instruction was decoded (so it can pipeline mode
    changes).

    Yes, core-state index follows the memref down the pipe.
    Core-state index is written into MMI/O/device control block for the
    DMA portion of a command, other CD indexes are associated with I/O
    page faults and device errors.

    Not sure how this would work with device IO and DMA.
    Say a secure kernel owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV loses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel, it can only page that guest with its permission.
    Hmmm...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to EricP on Thu Feb 6 12:06:31 2025
    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:

    EricP wrote:

    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE, RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
      with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na  na  na  R
    na  na  na  RE
    na  na  na  RW
    na  na  na  REW
    na  na  R   R
    na  na  RE  RE
    ....
    R   R   R   R
    ....
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the
    pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different
    threads.

    Or if you want the flexibility to choose your own REW combinations,
    the 4-bit PTE access control field is an index to a 16 entry array
    of 12-bit values for the four privilege levels.

    That's better because then the OS can decide how it wants
    the different memory sections and thread to behave and
    removes the strict hardwired hierarchy of the prior rules.

    The next problem though might be finding 4 bits in the PTE.

    Another PTE bit I can find. Placing the 16×12 vector is more difficult,
    even when I position it as 4 places of 16×3.

    I don't understand what you said.
    The 4-bit Access Control (AC) field is in the PTE.

    Currently, PTE uses a 3-bit access control field, and PTE has
    2-bits spare. So making access control larger is easy.

    The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
    is in the MMU.

    How does 16×12 get to the MMU (or I/O MMU) ?? in such a way that super
    cannot see things Hyper can see and the same with secure. So, somewhere
    in the various control blocks I need to find space without changing
    the overall use pattern of the control blocks and tables. Which is
    why I alluded to 4×16×3 each interpretation of the 4-bit access control
    is stored in its own natural place. It also means each layer can apply
    its own interpretation (mapping).

    Its just an SRAM loaded by the boot ROM before the Hypervisor boots.
    The super-secure version of boot ROM loads a table with values
    (Sandbox, User, Kernel, Hypervisor):

    Snd Usr Krn Hyp
    na  na  na  R
    na  na  na  RE
    na  na  na  RW
    na  na  na  REW
    na  na  R   na
    na  na  RE  na
    na  na  RW  na
    na  na  REW na
    na  R   R   na
    na  RE  RE  na
    ...
    REW REW REW na

    which grants mode 0 (Hyp) no direct RW access to any memory outside itself. Boot ROM sets an optional table lock so even hypervisor cannot later
    grant itself access permission to less priv memory by changing the table.

    The core's 2-bit mode selects-muxes one of the 3-bit allowed access
    fields from the indexed 12-bits to extract the 3 R-E-W bits.

    That much is straightforward.

    The 2-bit mode comes from the LD/ST uOp, which was set to the mode
    active when the instruction was decoded (so it can pipeline mode
    changes).

    Yes, core-state index follows the memref down the pipe.
    Core-state index is written into MMI/O/device control block for the
    DMA portion of a command, other CD indexes are associated with I/O
    page faults and device errors.

    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret clearance). The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV looses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about IOMMU addresses, are
    you talking about memory addresses or disk addresses? ISTM that
    protecting memory of lower privileged programs is useless if a higher privileged program can force a page out to disk, then can read the data
    from the disk drive itself. Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Feb 6 20:49:39 2025
    On Thu, 6 Feb 2025 18:51:12 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -----------------------------------

    Currently, PTE uses a 3-bit access control field, and PTE has
    2-bits spare. So making access control larger is easy.

    The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
    is in the MMU.

    How does 16×12 get to the MMU (or I/O MMU) ?? in such a way that super
    cannot see things Hyper can see and the same with secure. So, somewhere
    in the various control blocks I need to find space without changing
    the overall use pattern of the control blocks and tables. Which is
    why I alluded to 4×16×3 each interpretation of the 4-bit access control
    is stored it its own natural place. It also means each layer can apply
    its own interpretation (mapping).

    Its just an SRAM loaded by the boot ROM before the Hypervisor boots.
    The super-secure version of boot ROM loads a table with values
    (Sandbox, User, Kernel, Hypervisor):

    Snd Usr Krn Hyp
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R na
    na na RE na
    na na RW na
    na na REW na
    na R R na
    na RE RE na
    ....
    REW REW REW na

    which grants mode 0 (Hyp) no direct RW access to any memory outside
    itself.
    Boot ROM sets an optional table lock so even hypervisor cannot later
    grant itself access permission to less priv memory by changing the
    table.

    The core's 2-bit mode selects-muxes one of the 3-bit allowed access
    fields from the indexed 12-bits to extract the 3 R-E-W bits.

    That much is straightforward.

    The 2-bit mode comes from the LD/ST uOp, which was set to the mode
    active when the instruction was decoded (so it can pipeline mode
    changes).

    Yes, core-state index follows the memref down the pipe.
    Core-state index is written into MMI/O/device control block for the
    DMA portion of a command, other CD indexes are associated with I/O
    page faults and device errors.

    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).

    In this case, HV needs to know its limitations and not access the
    device nor the secure memory. Probably by taking the memory out
    of the pool it "does normal stuff with" and taking the device out
    of its list of accessible devices (at least for a while).

    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    I am thinking more like SR-IOV where HV loans a virtual device to a
    Guest OS, and Guest OS performs the I/O request, HV and SM are only
    there to deal with HV page faults and device errors. If an HV page
    fault occurs (which it will) a pretty secure corner of HV will
    construct a PTE mapping that/those page[s] only to page the missing
    page into memory so I/O can proceed. HV will then have to dismantle
    said mapping after the page arrives to restart device DMA.

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV looses control of any
    core or IOMMU PTE's that map that frame until it is handed back.
    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    Yes, exactly. No normal access to the page, only swap access is allowed--although this can be alleviated by not paging secure
    memory:: then HV just knows nothing about that/those pages until
    the process terminates, at which point it can put the pages back in the
    normal pool after cleaning them out.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Stephen Fuld on Thu Feb 6 16:53:27 2025
    Stephen Fuld wrote:
    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:

    EricP wrote:

    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE,
    RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
    with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R R
    na na RE RE
    ....
    R R R R
    ....
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the
    pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different
    threads.

    Or if you want the flexibility to choose your own REW combinations,
    the 4-bit PTE access control field is an index to a 16 entry array
    of 12-bit values for the four privilege levels.

    That's better because then the OS can decide how it wants
    the different memory sections and thread to behave and
    removes the strict hardwired hierarchy of the prior rules.

    The next problem though might be finding 4 bits in the PTE.

    Another PTE bit I can find. Placing the 16×12 vector is more
    difficult,
    even when I position it as 4 places of 16×3.

    I don't understand what you said.
    The 4-bit Access Control (AC) field is in the PTE.

    Currently, PTE uses a 3-bit access control field, and PTE has
    2-bits spare. So making access control larger is easy.

    The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
    is in the MMU.

    How does 16×12 get to the MMU (or I/O MMU) ?? in such a way that super
    cannot see things Hyper can see and the same with secure. So, somewhere
    in the various control blocks I need to find space without changing
    the overall use pattern of the control blocks and tables. Which is
    why I alluded to 4×16×3 each interpretation of the 4-bit access control
    is stored in its own natural place. It also means each layer can apply
    its own interpretation (mapping).

    Its just an SRAM loaded by the boot ROM before the Hypervisor boots.
    The super-secure version of boot ROM loads a table with values
    (Sandbox, User, Kernel, Hypervisor):

    Snd Usr Krn Hyp
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R na
    na na RE na
    na na RW na
    na na REW na
    na R R na
    na RE RE na
    ...
    REW REW REW na

    which grants mode 0 (Hyp) no direct RW access to any memory outside
    itself.
    Boot ROM sets an optional table lock so even hypervisor cannot later
    grant itself access permission to less priv memory by changing the table.

    The core's 2-bit mode selects-muxes one of the 3-bit allowed access
    fields from the indexed 12-bits to extract the 3 R-E-W bits.

    That much is straightforward.

    The 2-bit mode comes from the LD/ST uOp, which was set to the mode
    active when the instruction was decoded (so it can pipeline mode
    changes).

    Yes, core-state index follows the memref down the pipe.
    Core-state index is written into MMI/O/device control block for the
    DMA portion of a command, other CD indexes are associated with I/O
    page faults and device errors.

    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV loses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about IOMMU addresses, are
    you talking about memory addresses or disk addresses? ISTM that
    protecting memory of lower privileged programs is useless if a higher privileged program can force a page out to disk, then can read the data
    from the disk drive itself. Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.

    I'm just kinda free associating what the consequences of restricting
    the HV's access to lesser privilege levels might be.

    As I understand it, the AMD secure HV approach is that memory owned by a
    guest kernel and its applications is encrypted and only the guest kernel
    has the key. Memory content is only decrypted while inside the core.
    As the key is only stored inside that guest kernel memory there is
    no way for HV to get at it.

    So it doesn't matter that the HV has access to guest memory because it
    can only see encrypted memory values. Presumably such data is encrypted
    on disk so intercepting a DMA gives you nothing.

    But it doesn't look like blocking HV access to the guest kernel, user,
    or sandbox memory accomplishes the same security because the HV can
    always diddle its own page tables to grant itself access.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Fri Feb 7 02:39:06 2025
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:

    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -------------------
    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV looses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about I0MMU addresses, are
    you talking about memory addresses or disk addresses?

    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed. Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    In my block diagrams of HostBridge, I show I/O MMU only on the
    receiving side of PCIe transport links. all the outbound traffic
    has already been translated by a core MMU, unless one allows
    a device to send commands to another device.

    I/O MMU sees the virtual address of where DMA is accessing, translating accordingly.

    I/O MMU sees the virtual address of MSI-X interrupts, page faults and
    errors.

    ISTM that
    protecting memory of lower privileged programs is useless if a higher privileged program can force a page out to disk, then can read the data
    from the disk drive itself.

    Protecting a process without privilege from a process WITH privilege
    requires more than a little trust in the privileged process(s).
    This is why there is a Secure Monitor over HyperVisor to take HV out
    of the control loop for "secure stuff". By assuming the duties of
    HV wrt accessing unprivileged memory or storage, SM minimizes the
    footprint where trust is required.

    Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Feb 7 02:53:29 2025
    On Thu, 6 Feb 2025 21:53:27 +0000, EricP wrote:

    Stephen Fuld wrote:
    -----------------

    I am a little confused here. When you talk about I0MMU addresses, are
    you talking about memory addresses or disk addresses? ISTM that
    protecting memory of lower privileged programs is useless if a higher
    privileged program can force a page out to disk, then can read the data
    from the disk drive itself. Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.

    I'm just kinda free associating what the consequences of restricting
    the HV's access to lesser privilege levels might be.

    As I understand it, the AMD secure HV approach is that memory owned by a guest kernel and its applications is encrypted and only the guest kernel
    has the key. Memory content is only decrypted while inside the core.
    As the key is only stored inside that guest kernel memory there is
    no way for HV to get at it.

    Interesting, so you can see the data, you just can't interpret the
    bit-patterns. This still leaves the door open for malicious action.

    So it doesn't matter that the HV has access to guest memory because it
    can only see encrypted memory values. Presumably such data is encrypted
    on disk so intercepting a DMA gives you nothing.

    It matters if HV can access (especially modify) what is in storage
    (not memory), making it impossible for the secure process to use
    its own data !

    But it doesn't look like blocking HV access to the guest kernel, user,
    or sandbox memory accomplishes the same security because the HV can
    always diddle its own page tables to grant itself access.

    My 66000 does it differently. HV can create a PTE that translates
    "anywhere", but cannot use the VA of Guest OS or SM in order to
    use that PTE. There are 4 VAS privilege levels::

                   HoB = 0            nest    HoB = 1         nest
    application:   application VAS    yes     no access       X
    Guest OS       application VAS    yes     Guest OS VAS    yes
    HyperVisor     HyperVisor VAS     no      Guest OS VAS    yes
    Secure         HyperVisor VAS     no      SM VAS          no

    So, HV has no 'direct' path to Application VAS using standard
    memory access protocols.
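
    Expressed as code, that table reads roughly as the sketch below; the
    enum names and struct are purely illustrative, only the selections
    themselves come from the table (HoB presumably being the high-order
    bit of the virtual address):

        #include <stdbool.h>

        enum priv { APPLICATION, GUEST_OS, HYPERVISOR, SECURE };
        enum vas  { NO_ACCESS, APP_VAS, GUEST_VAS, HV_VAS, SM_VAS };

        struct xlate { enum vas space; bool nested; };

        /* Which VAS an access translates through, and whether that
         * translation is nested, as a function of privilege and HoB. */
        struct xlate select_vas(enum priv level, bool hob)
        {
            switch (level) {
            case APPLICATION: return hob ? (struct xlate){ NO_ACCESS, false }
                                         : (struct xlate){ APP_VAS,   true  };
            case GUEST_OS:    return hob ? (struct xlate){ GUEST_VAS, true  }
                                         : (struct xlate){ APP_VAS,   true  };
            case HYPERVISOR:  return hob ? (struct xlate){ GUEST_VAS, true  }
                                         : (struct xlate){ HV_VAS,    false };
            case SECURE:      return hob ? (struct xlate){ SM_VAS,    false }
                                         : (struct xlate){ HV_VAS,    false };
            }
            return (struct xlate){ NO_ACCESS, false };
        }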

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sat Jan 18 03:08:47 2025
    According to MitchAlsup1 <mitchalsup@aol.com>:
    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Huh? Algol 60 routines could be mutually recursive so unless it was
    a leaf procedure or the outer block, everything not declared "own"
    went on the stack.

    On my FreeBSD server the default stack limit is half a gigabyte. I
    don't ever recall running into it.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to John Levine on Sat Jan 18 10:59:05 2025
    On 2025-01-18 5:08, John Levine wrote:
    According to MitchAlsup1 <mitchalsup@aol.com>:
    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Huh? Algol 60 routines could be mutually recursive so unless it was
    a leaf procedure or the outer block, everything not declared "own"
    went on the stack.


    Mitch's point AIUI was that Algol 60 had no heap allocation (and no
    explicit pointer types), so indeed all data were either on the stack or statically allocated.

    I'm not an English native speaker, but it seems to me that Mitch should
    have written "Algol 60 had only stack allocation" instead of "Algol 60
    only had stack allocation".

    The most-used Ada compiler, GNAT, uses a "secondary stack" to reduce the
    need for heap. Dynamically sized local data are placed on the secondary
    stack, and dynamically sized return values of functions are returned on
    the secondary stack. So a function can return "by value" an array sized
    1..N, with N a function parameter, without needing the heap.

    Of course the programmer then has the problem of setting sufficient
    sizes for /two/ stacks, the primary and the secondary. For
    embedded-systems programs one usually avoids constructs that would need
    a secondary stack.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Feb 7 13:57:51 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:

    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -------------------
    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV looses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about I0MMU addresses, are
    you talking about memory addresses or disk addresses?

    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed, Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    That is indeed the case. The IOMMU is on the inbound path
    from the PCIe controller to the internal bus/mesh structure.

    Note that there is a translation on the outbound path from
    the host address space to the PCIe memory space - this is
    often 1:1, but need not be so. This translation happens
    in the PCIe controller when creating a TLP that contains
    an address before sending the TLP to the endpoint. Take
    an AHCI controller, for example, where the only device
    BAR is 32-bits; if a host wants to map the AHCI controller
    at a 64-bit address, the controller needs to map that 64-bit
    address window into a 32-bit 3DW TLP to be sent to the endpoint
    function.

    The ARM SMMU is split into two - one that translates inbound
    addresses that are not marked secure by the endpoint, and
    one that translates addresses that are marked secure by the
    endpoint (or by some host bridge between the endpoint and
    the host internal bus structures which is configured by
    the secure software). The secure side is managed by the
    secure monitor; the non-secure side by the HV or bare-metal
    OS.


    In my block diagrams of HostBridge, I show I/O MMU only on the
    receiving side of PCIe transport links. all the outbound traffic
    has already been translated by a core MMMU, unless one allows
    a device to send commands to another device.

    I/O MMU sees the virtual address of where DMA is accessing, translating
    accordingly.

    I/O MMU sees the virtual address of MSI-X interrupts, page faults and
    errors.

    By page faults, I assume you're referring to the PCIe PRI (Page Request Interface) and ATS capabilities.


    ISTM that
    protecting memory of lower privileged programs is useless if a higher
    privileged program can force a page out to disk, then can read the data
    from the disk drive itself.

    Protecting a process without privilege from a process WITH privilege
    requires more than a little trust in the privileged process(s).
    This is why there is a Secure Monitor over HyperVisor to take HV out
    of the control loop for "secure stuff". By assuming the duties of
    HV wrt accessing unprivileged memory of storage, SM minimizes the
    footprint where trust is required.

    ARM has a "RM" (Realm Monitor) that sits between the HV and the SM
    to manage memory visibility and security.

    https://developer.arm.com/documentation/den0127/0200/Software-components/Realm-Management-Monitor


    Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.

    Assuming the file is not secured via other means such as cryptography.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Fri Feb 7 18:25:34 2025
    On Fri, 7 Feb 2025 13:57:51 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:

    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -------------------
    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV looses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about I0MMU addresses, are
    you talking about memory addresses or disk addresses?

    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed, Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    That is indeed the case. The IOMMU is on the inbound path
    from the PCIe controller to the internal bus/mesh structure.

    Note that there is a translation on the outbound path from
    the host address space to the PCIe memory space - this is
    often 1:1, but need not be so. This translation happens
    in the PCIe controller when creating the a TLP that contains
    an address before sending the TLP to the endpoint. Take

    Is there any reason this cannot happen in the core MMU ??

    Guest OS uses a virtual device address given to it from HV.
    HV sets up the 2nd nesting of translation to translate this
    to "what HostBridge needs" to route commands to device control
    registers. The handoff can be done by spoofing config space
    or having HV simply hand Guest OS a list of devices it can
    discover/configure/use.

    an AHCI controller, for example, where the only device
    BAR is 32-bits; if a host wants to map the AHCI controller
    at a 64-bit address, the controller needs to map that 64-bit
    address window into a 32-bit 3DW TLP to be sent to the endpoint
    function.

    This is one of the reasons My 66000 architecture has a unique
    MMI/O address space--you can setup a 32-bit BAR to put a
    page of control registers in 32-bit address space without
    conflict. {{If I understand correctly}} Core MMU, then,
    translates normal device virtual control register addresses
    such that the request is routed to where the device is looking
    {{which has 32 high order bits zero.}}

    On the other hand--it would take a very big system indeed to
    overflow the 32-bit MMI/O space, although ECAM can access
    42-bit device CR MMI/O space.

    The ARM SMMU is split into two - one that translates inbound
    addresses that are not marked secure by the endpoint, and
    one that translates addresses that are marked secure by the
    endpoint (or by some host bridge between the endpoint and
    the host internal bus structures which is configured by
    the secure software). The secure side is managed by the
    secure monitor; the non-secure side by the HV or bare-metal
    OS.


    In my block diagrams of HostBridge, I show I/O MMU only on the
    receiving side of PCIe transport links. all the outbound traffic
    has already been translated by a core MMMU, unless one allows
    a device to send commands to another device.

    I/O MMU sees the virtual address of where DMA is accessing, translating
    accordingly.

    I/O MMU sees the virtual address of MSI-X interrupts, page faults and >>errors.

    By page faults, I assume you're referring to the PCIe PRI (Page Request Interface) and ATS capabilities.


    ISTM that
    protecting memory of lower privileged programs is useless if a higher
    privileged program can force a page out to disk, then can read the data
    from the disk drive itself.

    Protecting a process without privilege from a process WITH privilege
    requires more than a little trust in the privileged process(s).
    This is why there is a Secure Monitor over HyperVisor to take HV out
    of the control loop for "secure stuff". By assuming the duties of
    HV wrt accessing unprivileged memory of storage, SM minimizes the
    footprint where trust is required.

    ARM has a "RM" (Realm Monitor) that sits between the HV and the SM
    to manage memory visiblity and security.

    https://developer.arm.com/documentation/den0127/0200/Software-components/Realm-Management-Monitor


    Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.

    Assuming the file is not secured via other means such as cryptography.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Feb 8 22:19:47 2025
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 13:57:51 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:

    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -------------------
    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV loses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about IOMMU addresses, are
    you talking about memory addresses or disk addresses?

    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed, Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    That is indeed the case. The IOMMU is on the inbound path
    from the PCIe controller to the internal bus/mesh structure.

    Note that there is a translation on the outbound path from
    the host address space to the PCIe memory space - this is
    often 1:1, but need not be so. This translation happens
    in the PCIe controller when creating the a TLP that contains
    an address before sending the TLP to the endpoint. Take

    Is there any reason this cannot happen in the core MMU ??

    How do you map the translation table to the device?

    device is configured by setting BAR[s] to an addressable
    page. Accesses to this page are performed by the device
    consisting of Rd and Wt to control registers. Physical
    addresses matching BAR aperture are routed to device.

    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.
    Thus, HV MMU maps guest OS physical address into universal
    MMI/O address.

    A long time before accessing the device, HyperVisor sets up
    a device control block and places it in a table indexed
    by segment:bus;device and stores table address in a control
    register of the I/O MMU {HostBridge}. This control block
    contains several context pointers, an interrupt table
    pointer, and four event coordinators--one each for DMA, page
    faults, errors, and interrupts. The EC provides an index
    into the root pointers.

    Guest OS uses the virtual device address in code, Guest OS
    MMU maps it to the aperture maintained by HyperVisor. HV
    then maps GPA to MMI/O:device_address. Using said trans-
    lations, Guest OS writes commands to the function:register
    of the addressed device.

    The path from core virtual address to device control register
    address does not pass through the I/O MMU.

    When device responds with DMA request it uses a device virtual
    address (not a virtual device address), said request is routed
    to the top of PCIe tree, where I/O MMU uses ECAM to identify
    the MMU tables for this device, once identified, translates*
    the device virtual address into a universal address (almost
    invariably targeting DRAM) Once translated and checked, the
    command is allowed to proceed. (*) assuming ATS was not used.

    When device responds with Interrupt request, I/O MMU uses
    ECAM (again) to find the associated interrupt table,
    and then translates the device interrupt address into a
    universal MMI/O write to the attached interrupt table.

    Said universal MMI/O write knocks on the door of interrupt
    table service port, where the interrupt message is logged
    into the table. And when the priority of the table increases
    the service port broadcasts the new priority vector of this
    table to all cores.

    Should a core monitoring this table see a higher priority
    interrupt pending than it is currently running, the core
    begins interrupt negotiation.

    When a device responds with a page fault, the device control
    block identifies the level of the software stack to handle
    this exception, and the I/O MMU sends a suitable interrupt
    to that level of the interrupt table.

    When a device responds with a device error, the device
    control block identifies the level and ISR to deal with
    this device problem, and the I/O MMU sends a suitable
    interrupt to that level of the interrupt table.

    So, the I/O MMU responds and guides all requests coming
    up the PCIe tree--not just DMA.
    --------------------------------------------------------
    How do you map the translation table to the device?

    HostBridge has a configuration register that points at
    the I/O MMU ROOT table, which is used to map segment:
    bus;device to Originating context. Originating Context
    contains a snapshot of the software stack managing the
    application. This is where the ROOT pointers, ASIDs,
    priorities, and levels are stored. And, in addition,
    there is an interrupt table pointer virtual address, ...

    A tree is used to map ECAM to device control block, and
    other than not starting at a page boundary, and not ending
    on a page boundary, it is essentially identical to the std
    page mapping tree. The final level of said tree points at
    the device control block--a cache line of data where the
    I/O MMU gets the data it needs for that particular device.
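
    A rough sketch of such a device control block and its lookup is below;
    every field name is illustrative, and the flat array merely stands in
    for the ECAM-indexed tree walk described above:

        #include <stdint.h>

        struct device_control_block {          /* one cache line per device     */
            uint64_t root_ptr[2];              /* translation root pointers     */
            uint16_t asid[2];                  /* ASIDs of the managing stack   */
            uint8_t  priority;
            uint8_t  level;                    /* stack level for faults/errors */
            uint64_t interrupt_table_va;       /* interrupt table pointer (VA)  */
            uint8_t  event_coordinator[4];     /* DMA, page fault, error, intr  */
        };

        /* HostBridge control register: base of the mapping structure.
         * A real implementation walks a page-table-like tree keyed by
         * segment:bus:device; a flat array keeps the sketch short. */
        static struct device_control_block *iommu_root;

        struct device_control_block *
        lookup_dcb(unsigned segment, unsigned bus, unsigned devfn)
        {
            unsigned rid = (segment << 16) | (bus << 8) | (devfn & 0xFF);
            return &iommu_root[rid];           /* stand-in for the tree walk */
        }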

    Why
    would you wish to have the CPU translating I/O virtual
    addresses?

    This is pure mischaracterization on your part. You always
    want the MMU closest to the access to perform the trans-
    lation. I suspect you read virtual device address and
    device virtual address interchangeably--they are entirely
    different things used in different places.

    The IOMMU tables are per device, and they
    can be configured to map the minimum amount of the address
    space (even updated per-I/O if desired) required to support
    the completion of an inbound DMA from the device.

    This still leaves the door open for a parity error to
    allow one application DMA to damage another application
    process memory, since commands to a single device share
    a translation table and both translations are valid at
    the same instant. One can essentially eliminate this
    with dead pages between different application mappings--
    preventing DMA from walking into a wrong VAS.


    Guest OS uses a virtual device address given to it from HV.
    HV sets up the 2nd nesting of translation to translate this
    to "what HostBridge needs" to route commands to device control
    registers. The handoff can be done by spoofing config space
    or having HV simply hand Guest OS a list of devices it can
    discover/configure/use.

    The IOMMU only is involved in DMA transactions _initiated_ by
    the device, not by the CPUs. They're two completely different
    concepts.

    If the I/O MMU does not participate in interrupts, page faults,
    and errors, who does ?? The requests coming up from the device
    are still virtual and need mapping and routing.


    an AHCI controller, for example, where the only device
    BAR is 32-bits; if a host wants to map the AHCI controller
    at a 64-bit address, the controller needs to map that 64-bit
    address window into a 32-bit 3DW TLP to be sent to the endpoint
    function.

    This is one of the reasons My 66000 architecture has a unique
    MMI/O address space--you can setup a 32-bit BAR to put a
    page of control registers in 32-bit address space without
    conflict. {{If I understand correctly}} Core MMU, then,
    translates normal device virtual control register addresses
    such that the request is routed to where the device is looking
    {{which has 32 high order bits zero.}}

    Most systems have DRAM located at physical address zero, and
    a 4GB DRAM is pretty small these days.

    DRAM: 0x0000000000000000 is not the same address as
    IOMM: 0x0000000000000000.
    The former is routed to the DRAM controller, the latter is routed
    to HostBridge. Both (all 4) spaces have 18446744073709551616
    bytes.
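
    Put another way, the routing decision keys off a 2-bit space selector
    carried alongside a full 64-bit offset. A minimal sketch (the selector
    encodings are assumptions; only DRAM and MMI/O are named above):

        #include <stdint.h>

        enum addr_space { SPACE_DRAM = 0, SPACE_MMIO = 1 /* 2 more spaces */ };

        struct universal_addr {
            uint64_t offset;    /* 64-bit offset within the selected space  */
            uint8_t  space;     /* 2-bit selector: which space, hence route */
        };

        /* DRAM:0x0 and MMI/O:0x0 are distinct: the selector, not the
         * offset, decides whether the request goes to the DRAM
         * controller or to the HostBridge. */
        static inline int routes_to_hostbridge(struct universal_addr a)
        {
            return a.space == SPACE_MMIO;
        }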

    So you either need
    to make a hole in the DRAM or provide a mapping mechanism to
    map a 64-bit address into a 32-bit bar when sending TLPs
    to the AHCI controller.

    The 32-bit BAR simply maps into IOMM: 0x00000000-0xFFFFFFFF;
    it does not overlay any of DRAM: 0x00000000-0xFFFFFFFF

    Now:: By using the SM and HV 2nd level of translation, SW
    C A N setup an aperture in either (or both) guest translations
    and host translations where it appears portions of DRAM are
    overlaid by MMI/O, it is simply not necessary from a HW point
    of view.

    Systems that aren't intel compatible will designate a range
    of the 64-bit physical address space (near the top) and will
    map regions in that range to the 32-bit bar via translation
    registers in the PCIe controller.

    You are using an aperture to place said MMI/O region.

    I am using PTEs such that the MMI/O region(s) can be
    pages scattered around without any common locality.
    Now, by suitable use, you can use the tools provided
    and end up with a MMI/O region easily denoted by an
    aperture.



    On the other hand--it would take a very big system indeed to
    overflow the 32-bit MMI/O space, although ECAM can access
    42-bit device CR MMI/O space.

    Leaving aside the small size of the legacy Intel I/O space
    (16-bit addresses), history seems to have favored single
    address space systems, so I suspect such a MMI/O space will
    not be favored by many.

    It is a SINGLE address system, it happens to have 66-bits of
    addressable space, and I use the MMU to translate virtual
    64-bit addresses into a universal 66-bit physical address
    that can be routed anywhere in the system. So any LD or ST
    can touch any of the 4×18446744073709551616 bytes addressable.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Feb 15 15:31:44 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 14 Feb 2025 19:51:44 +0000, Scott Lurndal wrote:


    The traditional PCI bus supports a 5-bit device# and a 3-bit function #.
    A PCIe bus supports either a 3-bit function number and the high-order
    five bits MBZ (i.e. only device 0 is legal on PCIe) or, if the function
    advertises the Alternate Routing Identifier (ARI) capability, an 8-bit
    function number (SRIOV leverages ARI to support dense routing IDs, but
    any bus that supports ARI can handle 256 physical functions).

    PCI supports two forms of configuration space addresses in TLPs:
    Type 0 contains only a function number and the register address.
    Type 1 contains the bus number, function number and register address.

    PCIe segments go where ? Are they "picked off" prior to being routed
    down the tree ??

    Logically, they can be considered a prefix to the RID for routing
    purposes (inbound to the IOMMU, for example, the PCIE controller
    will prepend its segment number to the RID and use that as an
    ID to the IOMMU). ARM calls it a streamid.

    For PCI configuration transactions initiated by the CPU,
    "PCIe segments go where?" is an interesting question.

    For traditional Intel-based PCI Local Bus implementations, there
    was a 'peek and poke' mechanism that uses intel IN and OUT
    instructions to access a pair of registers:

    0xCF8: Address register
    0xCFC: Data register

    The CPU would store the RID (16-bit BDF) in the address register
    then read or write the data register to access an 8/16/32 bit
    PCI configuration space register (such as the BAR registers, for
    example). The PCI controller that owned those registers
    would convert that to a PCI configuration transaction and put
    it on the PCI bus. The target device would capture the transaction
    and respond (writes were always non-posted); the remaining devices
    on the bus would ignore the transaction. If the bus field
    in the config TLP was the same as the downstream bus of the
    host bridge, a type 0 transaction would be sent; if the bus
    field was not, a type 1 transaction would be sent, captured
    by a bridge on the source bus and forwarded to a downstream bus.
    Ad infinitum until the bus space (8-bits) is exhausted.

    In this model an individual device on the bus had a fixed
    (e.g. via DIP switches, EEPROM, etc) 'device' number and
    that device could offer up to 8 'functions'. The device number
    was encoded in bits <7:3> of the RID and the function number
    was encoded in bits <2:0> of the RID.
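
    For reference, a minimal C sketch of that peek-and-poke sequence, using
    the conventional CONFIG_ADDRESS layout (enable bit 31, bus <23:16>,
    device <15:11>, function <10:8>, dword register <7:2>) and Linux x86
    port I/O (needs ioperm/iopl privileges):

        #include <stdint.h>
        #include <sys/io.h>

        #define PCI_CONFIG_ADDRESS 0xCF8
        #define PCI_CONFIG_DATA    0xCFC

        uint32_t pci_cam_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg)
        {
            uint32_t addr = 0x80000000u                      /* enable bit     */
                          | ((uint32_t)bus << 16)            /* bus number     */
                          | ((uint32_t)(dev & 0x1F) << 11)   /* RID bits <7:3> */
                          | ((uint32_t)(fn  & 0x07) << 8)    /* RID bits <2:0> */
                          | (reg & 0xFC);                    /* dword register */

            outl(addr, PCI_CONFIG_ADDRESS);   /* poke the address register */
            return inl(PCI_CONFIG_DATA);      /* read the data register    */
        }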


    PCI Express has been designed as a point-to-point protocol
    using serial connections rather than the wide PCI local
    bus, which changes the topology of the system. With PCIe
    the device number in the RID -must be zero- (unless the
    BUS is an ARI bus, in which case bits <7:0> of the RID
    are a function number provided by the PCIe device (up
    to 256 functions per each - more with SRIOV as it can
    consume additional space in the bus <15:8> field of the
    RID to support up to 65535 virtual functions on a single
    device). A non-SRIOV and non-ARI device can only provide
    from one to eight functions.

    Since SRIOV can consume the entire 16-bit RID on a single
    PCIe device, each PCIe device (of which there can be one
    per PCIe controller 'root port') is assigned a unique
    segment number (as if it were prepended to the 16-bit RID).

    To support non-Intel systems and higher performance accesses
    to configuration space for PCIe devices, PCIe specified an
    Extended Configuration Access Method (ECAM) which maps the
    configuration space of each PCIe segment into a region of
    the physical address space (chosen by the implementation).

    This allows regular memory loads and stores to access the
    PCI configuration space rather than the intel-specific
    (and other architecture specific) peek-and-poke access methods.

    At the base address of the ECAM, the decoder would look
    at the remaining bits with a layout like:

    <11:00> Byte-granularity address of configuration space (4KB) register
    <19:12> Function number (<19:15> == 0 for non-ARI bus)
    <27:20> Bus Number (0 for the controller root complex bridge; downstream
            bus numbers assigned by software, can be sparse assignment)

    The specification allows the implementation to provide a single ECAM
    segment per PCIe controller, or an implementation may provide a single
    ECAM region and use bits <63:28> as a segment number. This is how
    most non-intel systems handle this today; a processor that supports
    six PCIe controllers would have perhaps 7 segments (one or more for the
    root bus 0 for the on-chip devices such as memory controllers, and
    one for each PCIe controller).

    Software simply constructs the target ECAM address and issues normal
    loads and stores to access it - no need to use the clumsy, slow and non-standard PCI peek-and-poke configuration space accesses.

    For outbound memory space (non-config space) transactions from the CPU to
    the device, the RID is not used - once the CPU I/O fabric has routed
    the request by PA to the proper PCIe controller (address based routing
    through a heirarchy of host bridges), it is simply sent to the device
    which CAMs the address against the programmed bars and reacts appropriately (e.g. by responding with a UR (Unsupported Request) on a non-posted
    request or dropping a posted request if it doesn't match a BAR).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Feb 15 23:28:28 2025
    On Sat, 15 Feb 2025 15:31:44 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    -----------
    PCI supports two forms of configuration space addresses in TLPs:
    Type 0 contains only a function number and the register address.
    Type 1 contains the bus number, function number and register address.

    PCIe segments go where ? Are they "picked off" prior to being routed
    down the tree ??

    Logically, they can be considered a prefix to the RID for routing
    purposes (inbound to the IOMMU, for example, the PCIE controller
    will prepend its segment number to the RID and use that as an
    ID to the IOMMU). ARM calls it a streamid.

    For PCI configuration transactions initiated by the CPU,
    "PCIe segments go where?" is an interesting question.

    Indeed. Snipping Intel brain damage
    -----------------------
    PCI Express has been designed as a point-to-point protocol
    using serial connections rather than the wide PCI local
    bus, which changes the topology of the system. With PCIe
    the device number in the RID -must be zero- (unless the
    BUS is an ARI bus,

    What is an ARI bus, and do non x86 systems have them ??

    in which case bits <7:0> of the RID
    are a function number provided by the PCIe device (up
    to 256 functions per each - more with SRIOV as it can
    consume additional space in the bus <15:8> field of the
    RID to support up to 65535 virtual functions on a single
    device). A non-SRIOV and non-ARI device can only provide
    from one to eight functions.

    snipping

    The specification allows the implementation to provide a single ECAM
    segment per PCIe controller, or and implementation may provide a single
    ECAM region and used bits <63:28> as a segment number.

    Wikipedia states ECAM contains 42 bits::
    16-bit segment, 8-bit bus, 5-bit device, 3-bit function, 4-bit
    xReg, and 6-bit reg. Is this wrong, misleading, or out of date ?

    This is how
    most non-intel systems handle this today; an processors that supports
    six PCIe controllers would have perhaps 7 segments (one or more for the
    root bus 0 for the on-chip devices such as memory controllers, and
    one for each PCIe controller.

    Software simply constructs the target ECAM address and issues normal
    loads and stores to access it - no need to use the clumsy, slow and non-standard PCI peek-and-poke configuration space accesses.

    We still have to translate 'target ECAM' into a bit pattern matching
    that device's BAR. And thanks to your help I have a means to do so
    that has overhead only when booting a new Guest OS.

    For outbound memory space (non-config space) transactions from the CPU
    to
    the device, the RID is not used - once the CPU I/O fabric has routed
    the request by PA to the proper PCIe controller (address based routing through a heirarchy of host bridges), it is simply sent to the device
    which CAMs the address against the programmed bars and reacts
    appropriately
    (e.g. by responding with a UR (Unsupported Request) on a non-posted
    request or dropping a posted request if it doesn't match a BAR).

    Agreed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Sun Feb 9 15:45:13 2025
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 21:53:27 +0000, EricP wrote:

    Stephen Fuld wrote:
    -----------------

    I am a little confused here. When you talk about I0MMU addresses, are
    you talking about memory addresses or disk addresses? ISTM that
    protecting memory of lower privileged programs is useless if a higher
    privileged program can force a page out to disk, then can read the data
    from the disk drive itself. Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.

    I'm just kinda free associating what the consequences of restricting
    the HV's access to lesser privilege levels might be.

    As I understand it, the AMD secure HV approach is that memory owned by a
    guest kernel and its applications is encrypted and only the guest kernel
    has the key. Memory content is only decrypted while inside the core.
    As the key is only stored inside that guest kernel memory there is
    no way for HV to get at it.

    Interesting, so you can see the data you just can't interpret the bit-patterns. This still leaves the door open for malicious action.

    Yes. Their Secure VM sounds exactly like what a timeshare vendor might want
    if they desire to sell services to governments and businesses that have
    secrets that rogue operators must provably not access.
    Also blocks accidental leaks between guests so you could have multiple
    such secure guest OS on the same host.

    So it doesn't matter that the HV has access to guest memory because it
    can only see encrypted memory values. Presumably such data is encrypted
    on disk so intercepting a DMA gives you nothing.

    It maters if HV can access (especially modify) what is in storage
    (not memory), making it impossible for secure process from using
    his own data !

    Yes a rogue or buggy HV could DoS a guest OS by scrambling its data.
    Or an ECC memory error on critical memory location, like the key.

    But it doesn't look like blocking HV access to the guest kernel, user,
    or sandbox memory accomplishes the same security because the HV can
    always diddle its own page tables to grant itself access.

    My 66000 does it differently. HV can create a PTE that translates
    "anywhere", but cannot use the VA of Guest OS or SM in order to
    use that PTE. There are 4 VAS privilege levels::

                   HoB = 0            nest    HoB = 1         nest
    application:   application VAS    yes     no access       X
    Guest OS       application VAS    yes     Guest OS VAS    yes
    HyperVisor     HyperVisor VAS     no      Guest OS VAS    yes
    Secure         HyperVisor VAS     no      SM VAS          no

    So, HV has no 'direct' path to Application VAS using standard
    memory access protocols.

    But the physical memory in use by that guest can be remapped
    by rogue HV to be in its own virtual space and then accessed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Sun Feb 9 21:03:15 2025
    On Sun, 9 Feb 2025 20:45:13 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 21:53:27 +0000, EricP wrote:

    Stephen Fuld wrote:
    -----------------

    I am a little confused here. When you talk about IOMMU addresses, are
    you talking about memory addresses or disk addresses? ISTM that
    protecting memory of lower privileged programs is useless if a higher
    privileged program can force a page out to disk, then can read the data
    from the disk drive itself. Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.

    I'm just kinda free associating what the consequences of restricting
    the HV's access to lesser privilege levels might be.

    As I understand it, the AMD secure HV approach is that memory owned by a
    guest kernel and its applications is encrypted and only the guest kernel
    has the key. Memory content is only decrypted while inside the core.
    As the key is only stored inside that guest kernel memory there is
    no way for HV to get at it.

    Interesting, so you can see the data you just can't interpret the
    bit-patterns. This still leaves the door open for malicious action.

    Yes. Their Secure VM sounds exactly like what a timeshare vendor might
    want
    if they desire to sell services to governments and businesses that have secrets that rogue operators must provably not access.

    Just enough to gain the confidence of *.gov buyers, without enough
    to prevent NSA from using the data.

    Also blocks accidental leaks between guests so you could have multiple
    such secure guest OS on the same host.

    So it doesn't matter that the HV has access to guest memory because it
    can only see encrypted memory values. Presumably such data is encrypted
    on disk so intercepting a DMA gives you nothing.

    It maters if HV can access (especially modify) what is in storage
    (not memory), making it impossible for secure process from using
    his own data !

    Yes a rogue or buggy HV could DoS a guest OS by scrambling its data.
    Or an ECC memory error on critical memory location, like the key.

    But it doesn't look like blocking HV access to the guest kernel, user,
    or sandbox memory accomplishes the same security because the HV can
    always diddle its own page tables to grant itself access.

    My 66000 does it differently. HV can create a PTE that translates
    "anywhere", but cannot use the VA of Guest OS or SM in order to
    use that PTE. There are 4 VAS privilege levels::

                   HoB = 0            nest    HoB = 1         nest
    application:   application VAS    yes     no access       X
    Guest OS       application VAS    yes     Guest OS VAS    yes
    HyperVisor     HyperVisor VAS     no      Guest OS VAS    yes
    Secure         HyperVisor VAS     no      SM VAS          no

    So, HV has no 'direct' path to Application VAS using standard
    memory access protocols.

    But the physical memory in use by that guest can be remapped
    by rogue HV to be in its own virtual space and then accessed.

    You can never get away from the notion that the code creating
    PTP and PTE bit patterns, or manipulating Root pointers requires
    a certain amount of trust.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sun Feb 16 19:56:06 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 15 Feb 2025 15:31:44 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    -----------
    PCI supports two forms of configuration space addresses in TLPs:
    Type 0 contains only a function number and the register address.
    Type 1 contains the bus number, function number and register address.
    PCIe segments go where ? Are they "picked off" prior to being routed
    down the tree ??

    Logically, they can be considered a prefix to the RID for routing
    purposes (inbound to the IOMMU, for example, the PCIE controller
    will prepend its segment number to the RID and use that as an
    ID to the IOMMU). ARM calls it a streamid.

    For PCI configuration transactions initiated by the CPU,
    "PCIe segments go where?" is an interesting question.

    Indeed. Snipping Intel brain damage
    -----------------------
    PCI Express has been designed as a point-to-point protocol
    using serial connections rather than the wide PCI local
    bus, which changes the topology of the system. With PCIe
    the device number in the RID -must be zero- (unless the
    BUS is an ARI bus,

    What is an ARI bus, and do non x86 systems have them ??

    Alternate Routing ID. It is a PCIe standard "capability",
    albeit optional. It is required if the function
    implements SR-IOV[*] or if a device needs to support more
    than 8 physical functions.

    [*] There is a legacy mapping that can be used if the OS
    doesn't understand the scanning rules when physical
    function zero has the ARI capability. It's deprecated
    in modern systems.


    in which case bits <7:0> of the RID
    are a function number provided by the PCIe device (up
    to 256 functions per each - more with SRIOV as it can
    consume additional space in the bus <15:8> field of the
    RID to support up to 65535 virtual functions on a single
    device). A non-SRIOV and non-ARI device can only provide
    from one to eight functions.

    snipping

    The specification allows the implementation to provide a single ECAM
    segment per PCIe controller, or and implementation may provide a single
    ECAM region and used bits <63:28> as a segment number.

    Wikipedia states ECAM contains 42 bits::
    16-bit segment, 8-bit bus, 5-bit device, 3-bit function, 4-bit
    xReg, and 6-bit reg. Is this wrong, misleading, or out of date ?

    Misleading - perhaps specific to Intel's implementation.
    The PCIe specification defines only bits <27:0>. The
    remaining bits are defined by the implementation; the
    higher bits usually select the root complex implementation
    to which the transaction should be directed.

    While the specification uses the "register number" nomenclature,
    in the real world bits <11:0> are the offset from the start of
    the device configuration space to the desired register. Usually
    4-byte aligned, but the legacy space includes some registers that
    support byte and 2-byte accesses.


    This is how
    most non-intel systems handle this today; an processors that supports
    six PCIe controllers would have perhaps 7 segments (one or more for the
    root bus 0 for the on-chip devices such as memory controllers, and
    one for each PCIe controller.

    Software simply constructs the target ECAM address and issues normal
    loads and stores to access it - no need to use the clumsy, slow and
    non-standard PCI peek-and-poke configuration space accesses.

    We still have to translate 'target ECAM' into a bit pattern matching
    that device's BAR. And thanks to your help I have a means to do so
    that has overhead only when booting a new Guest OS.

    When software accesses the ECAM, the PCIe host bridge or
    PCIe Root complex implementation (generally
    transparent to software) is required to translate the load or
    store into a PCIe CFGRD or CFGWR transaction (type 0 or type 1)
    to send to the device.

    This includes configuration transactions that read, size and modify
    the PCI configuration space Base Address Registers (specifically at
    addresses 0x10, 0x14, 0x18, 0x1c, 0x20, 0x24).

    I'll stress that these configuration transactions are rare, and usually
    occur during operating system device discovery and initialization.
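
    For example, the usual BAR-sizing sequence is just three such config
    accesses; cfg_read32/cfg_write32 below are assumed helpers standing in
    for whatever access method the platform provides (ECAM loads/stores or
    the legacy CF8/CFC mechanism):

        #include <stdint.h>

        uint32_t cfg_read32(uint16_t rid, uint16_t reg);             /* assumed */
        void     cfg_write32(uint16_t rid, uint16_t reg, uint32_t);  /* helpers */

        /* Size of a 32-bit memory BAR at config offset 0x10..0x24. */
        uint64_t bar_size32(uint16_t rid, uint16_t bar_off)
        {
            uint32_t saved = cfg_read32(rid, bar_off);

            cfg_write32(rid, bar_off, 0xFFFFFFFFu);   /* probe: write all ones   */
            uint32_t mask = cfg_read32(rid, bar_off); /* read back writable bits */
            cfg_write32(rid, bar_off, saved);         /* restore original value  */

            mask &= ~0xFu;                     /* strip type/prefetchable bits  */
            if (mask == 0)
                return 0;                      /* BAR not implemented           */
            return (uint64_t)~mask + 1;        /* size = 2^(count of low zeros) */
        }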

    The ECAM is not involved in any transactions that target any
    physical address region mapped in the function BARs.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Niklas Holsti on Mon Jan 27 17:26:51 2025
    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:

    On 2025-01-18 5:08, John Levine wrote:

    According to MitchAlsup1 <mitchalsup@aol.com>:

    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Huh? Algol 60 routines could be mutually recursive so unless it was
    a leaf procedure or the outer block, everything not declared "own"
    went on the stack.

    Mitch's point AIUI was that Algol 60 had no heap allocation (and no
    explicit pointer types), so indeed all data were either on the stack
    or statically allocated.

    I'm not an English native speaker, but it seems to me that Mitch
    should have written "Algol 60 had only stack allocation" instead of
    "Algol 60 only had stack allocation".

    Yes. I have seen this situation described as a rule for "only" to
    be put as late in the sentence as still makes sense.

    Putting on my editor hat, I would recommend revising the sentence
    more thoroughly, as for example, "Algol 60 had no way of allocating
    memory except by means of local variables on the stack" (assuming that
    is the case; my memories of the rules of Algol may have undetected
    ECC errors).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Feb 3 14:09:49 2025
    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe meso-privileged ?!?

    handyman?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stefan Monnier on Mon Feb 3 21:13:24 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe
    meso-privileged ?!?

    handyman?

    Application -> Library -> OS -> Hypervisor -> Secure Monitor

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Feb 3 21:23:47 2025
    On Mon, 3 Feb 2025 21:13:24 +0000, Scott Lurndal wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe
    meso-privileged ?!?

    handyman?

    Application -> Library -> OS -> Hypervisor -> Secure Monitor


    {Sandbox -> user -> application -> Library} ->{sual}×{GuestOS, HV, SM}

    ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Feb 3 22:47:24 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 3 Feb 2025 21:13:24 +0000, Scott Lurndal wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe
    meso-privileged ?!?

    handyman?

    Application -> Library -> OS -> Hypervisor -> Secure Monitor


    {Sandbox -> user -> application -> Library} ->{sual}×{GuestOS, HV, SM}

    ??

    You need to precisely define your terms. What are sandbox
    and user in this context?

    The hypervisor is optional, as would be a library.

    The Burroughs Large systems and HP-3000 segmented libraries
    were distinct entities with attributes.

    Code in a library could be more privileged than the application
    when acting on behalf of the application, for example; but the
    application could not take advantage of the permissions assigned
    to the library it was linked with without using interfaces
    provided by the library.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Feb 3 23:11:03 2025
    On Mon, 3 Feb 2025 22:47:24 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 3 Feb 2025 21:13:24 +0000, Scott Lurndal wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe
    meso-privileged ?!?

    handyman?

    Application -> Library -> OS -> Hypervisor -> Secure Monitor


    {Sandbox -> user -> application -> Library} ->{sual}×{GuestOS, HV, SM}

    ??

    You need to precisely define your terms. What are sandbox
    and user in this context?

    It is all about manipulating access rights without modifying
    what is stored in the TLB (so you don't have to reload any
    entries to change access rights.) It is sort of like what
    the G-bit does (global) {except in my architecture globality
    is controlled by ASID.}

    Sandbox is a privilege level where one cannot be granted both
    write and execute access at the same time. There may be other
    restrictions, too, such as which control registers user code may
    be allowed to write.

    Library would include all the trusted stuff, but also ld.so
    and any JITs. JITs can only create code for sandboxes. So,
    JIT can write to JITcache but sandbox cannot using the same
    PTE entry. ld.so can write GOT while user and application
    cannot write GOT (or execute GOT).

    User is the privilege level where sandbox does not apply but
    also there is no ability to over-access things protected by
    PTE.RWE.

    Application is a privilege level where PTE.RWE can sometimes
    be usurped--such as DMA from a device needing to write into
    an execute-only page.

    Where does memmove() come from if not the library ??

    Libraries have a SW-kind of trust even if they are
    devoid of HW kinds of trust (PTE.RWE overrides).

    But these levels are just talking points at this point.

    The hypervisor is optional, as would be a library.

    It cannot be a library of a process !!
    It is not a library of GuestOS !
    It is certainly not a library of Secure Monitor !!


    The Burroughs Large systems and HP-3000 segmented libraries
    were distinct entities with attributes.

    And could change (update/upgrade) the library while the process
    was running !!

    Code in a library could be more privileged than the application
    when acting on behalf of the application, for example; but the
    application could not take advantage of the permissions assigned
    to the library it was linked with without using interfaces
    provided by the library.

    No disagreement.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Feb 10 20:18:04 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 13:57:51 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:

    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -------------------
    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?
    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV loses control of any
    core or IOMMU PTE's that map that frame until it is handed back.
    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about IOMMU addresses, are
    you talking about memory addresses or disk addresses?

    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed. Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    That is indeed the case. The IOMMU is on the inbound path
    from the PCIe controller to the internal bus/mesh structure.

    Note that there is a translation on the outbound path from
    the host address space to the PCIe memory space - this is
    often 1:1, but need not be so. This translation happens
    in the PCIe controller when creating a TLP that contains
    an address before sending the TLP to the endpoint. Take

    Is there any reason this cannot happen in the core MMU ??

    How do you map the translation table to the device?

    device is configured by setting BAR[s] to an addressable
    page. Accesses to this page are performed by the device
    consisting of Rd and Wt to control registers. Physical
    addresses matching BAR aperture are routed to device.

    I was referring to transactions initiated by the endpoint,
    not one of the processing elements. E.g. DMA.

    Outbound addressing is a solved problem. The complexities
    are in the design of the bus system. Programmable
    BARs need logical bridges (for OS configuration) and the
    hardware needs to route the addresses to the appropriate
    destination (onchip devices and external PCIe devices). It's
    not as simple as one might think when you have several
    hundred functions (each with 3 64-bit or 6 32-bit BARs).

    The routing is simple on an old-fashioned bus where each
    function sees all transactions and can respond when the
    address matches one of the function BARs. That doesn't scale
    (which is why we have PCIe rather than PCI Local Bus).

    Modern mesh/ring routing systems need to know which stop
    or mesh point that the I/O bridge is on, the I/O bridge needs
    to know how to route those addresses to the proper
    controller - long before the endpoint(device) actually can match
    the address to one of its BAR registers in its configuration space.


    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.

    Standard HV stuff. Although you may want to consider that
    the value in the function's BAR, must be a guest PA, not a
    machine PA.

    Likewise any addresses programmed into the
    function DMA engine(s) will be guest PA (many IOMMU also
    allow guest application VA directly, if the guest OS
    allows the guest application to directly access an SR-IOV virtual
    function) and need to be translated inbound from
    the function (and routed appropriately to either the
    memory/LLC subsystem, or perhaps another PCIe controller
    when the system supports PCIe Peer-to-Peer routing).

    Thus, HV MMU maps guest OS physical address into universal
    MMI/O address.

    I think the software folks may be quite unhappy to support an
    unusual 32-bit MMIO address space, not to mention the lack
    of support for 64-bit device BARs. There are a lot of PCI
    devices that require apertures larger than 4GB.


    A long time before accessing the device, HyperVisor sets up
    a device control block and places it in a table indexed
    by segment:bus:device and stores the table address in a control
    register of the I/O MMU {HostBridge}. This control block
    contains several context pointers, an interrupt table
    pointer, and four event coordinators--one for DMA, page
    faults, errors, and interrupts. The EC provides an index
    into the root pointers.
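
    An illustrative C sketch of the control block just described; the
    field names and widths are invented for illustration, only the field
    set follows the description above:

        #include <stdint.h>

        struct event_coordinator {
            uint64_t root_index;        /* index into the root pointers     */
        };

        struct device_control_block {
            uint64_t context_ptr[4];    /* "several context pointers"       */
            uint64_t interrupt_table;   /* interrupt table pointer          */
            struct event_coordinator dma, page_fault, error, interrupt;
        };

        /* The hypervisor keeps a table of these, indexed by
           segment:bus:device, and programs the table's base address into
           an I/O MMU (HostBridge) control register long before the device
           is first used. */
        struct device_control_block *dcb_table;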

    I need to spend time thinking about this, which I don't
    currently have. It adds a lot of complexity to the
    software that shouldn't be necessary. And the linux
    folks will _refuse_ to support anything that requires
    any quirks or non-standard access to PCIe devices.


    Guest OS uses the virtual device address in code, Guest OS
    MMU maps it to the aperture maintained by HyperVisor. HV
    then maps GPA to MMI/O:device_address. Using said trans-
    lations, Guest OS writes commands to the function:register
    of the addressed device.

    The path from core virtual address to device control register
    address does not pass through the I/O MMU.

    That's true for intel, amd and arm I/O MMU - they're only
    concerned with DMA addresses from the device, not outbound
    transactions from the host CPUs.


    When device responds with DMA request it uses a device virtual
    address (not a virtual device address),

    To be compatible with the current operating systems, a DMA
    address must be a guest physical address (for a device owned
    by a guest OS), or the host physical address (for a bare-metal
    OS).

    You'll find a lot of pushback from the OS vendors if that
    is not the case.


    said request is routed
    to the top of PCIe tree, where I/O MMU uses ECAM to identify
    the MMU tables for this device, once identified, translates*
    the device virtual address into a universal address (almost
    invariably targeting DRAM). Once translated and checked, the
    command is allowed to proceed. (*) assuming ATS was not used.

    It is not uncommon to have several I/OMMU to support throughput
    of high-bandwidth devices. They may be managed as a
    unit, but the translation engines are distributed - ARM
    supports this model.


    When device responds with Interrupt request, I/O MMU uses
    ECAM (again) to find the associated interrupt table,
    and then translates the device interrupt address in to a
    universal MMI/O write to the attached interrupt table.

    That's the intel/amd model. ARM64 separates interrupt
    routing from address routing, with the former handled
    by the interrupt controller (GIC) and the latter by the
    SMMU.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Feb 10 23:40:24 2025
    On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
    -----------------isolating---------------------------------

    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.

    Standard HV stuff. Although you may want to consider that
    the value in the function's BAR, must be a guest PA, not a
    machine PA.

    I need a "WHY" on this sentence before responding to the rest.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Feb 11 14:04:59 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
    -----------------isolating---------------------------------

    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.

    Standard HV stuff. Although you may want to consider that
    the value in the function's BAR, must be a guest PA, not a
    machine PA.

    I need a "WHY" on this sentence before responding to the rest.

    The guest OS (or user application) is directly programming the addresses
    in the PCIe function DMA engine[*]. The point is to completely
    avoid involving the hypervisor in the I/O,
    so the guest programs the function DMA engine using guest PA and the
    IOMMU translates the guest PA into machine
    addresses for all transactions initiated by the
    function.

    (Each PCIe function is treated as an individual device
    by the OS/HV. The entire purpose of SR-IOV is to
    present the device directly to the guest to avoid any
    hypervisor involvement in the I/O path).

    [*] Take an SR-IOV capable NIC; each SR-IOV virtual
    function (VF) can be assigned to a different guest -
    from the guest point of view, it owns the
    entire function and programs the DMA engines for the
    function directly, with no hypervisor intervention.

    This allows the same driver to be used in the operating system
    for either bare-metal or guest OS. No paravirt required.

    So when the NIC needs to DMA an inbound ethernet packet
    to the guest OS buffers, it sends memory write TLPs to
    the root complex using the guest physical address programmed
    into the DMA engine. The IOMMU translates that into
    the machine address and passes it to the proper entity
    (i.e. DRAM/LLC or another device for Peer-to-peer).
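
    A minimal C sketch of that inbound translation step; the lookup and
    walk helpers are hypothetical stand-ins, not any real IOMMU API:

        #include <stdint.h>

        struct iommu_ctx;                   /* per-device translation tables */
        struct iommu_ctx *iommu_lookup(uint16_t segment, uint16_t rid);
        uint64_t iommu_walk(struct iommu_ctx *ctx, uint64_t guest_pa);

        /* Conceptually applied to every MWr/MRd TLP arriving from an
           endpoint: the requester's RID selects the tables, which turn the
           guest PA carried in the TLP into a machine address. */
        uint64_t translate_inbound(uint16_t segment, uint16_t rid,
                                   uint64_t guest_pa)
        {
            struct iommu_ctx *ctx = iommu_lookup(segment, rid);
            return iommu_walk(ctx, guest_pa);
        }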

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Tue Feb 11 09:30:47 2025
    On 2/6/2025 6:39 PM, MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:

    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -------------------
    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV loses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here.  When you talk about IOMMU addresses, are
    you talking about memory addresses or disk addresses?

    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed, Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    I am terribly out of date with all of this, but what if the device is a
    SATA disk? It at least used to be that you sent a command packet to the
    disk and said packet contained the disk relative block number. I know
    of no way to initiate an I/O by writing to the disk's "control registers".

    In my block diagrams of HostBridge, I show I/O MMU only on the
    receiving side of PCIe transport links. All the outbound traffic
    has already been translated by a core MMU, unless one allows
    a device to send commands to another device.

    I/O MMU sees the virtual address of where DMA is accessing, translating accordingly.

    I/O MMU sees the virtual address of MSI-X interrupts, page faults and
    errors.

    ISTM that
    protecting memory of lower privileged programs is useless if a higher
    privileged program can force a page out to disk, then can read the data
    from the disk drive itself.

    Protecting a process without privilege from a process WITH privilege
    requires more than a little trust in the privileged process(es).

    Yes.


    This is why there is a Secure Monitor over HyperVisor to take HV out
    of the control loop for "secure stuff".

    Which, of course, means you have to trust the "Secure Monitor". :-)

    By assuming the duties of
    HV wrt accessing unprivileged memory or storage, SM minimizes the
    footprint where trust is required.

    Fair enough. But minimizes is not the same as eliminating.



    Of course, the same is true for data
    written to disk by a lesser privileged program.  If the higher
    privileged program can read the file, then it can compromise security.

    See my comments above about SATA disks.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stephen Fuld on Tue Feb 11 18:19:14 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/6/2025 6:39 PM, MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:


    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed, Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    I am terribly out of date with all of this, but what if the device is a
    SATA disk? It at least used to be that you sent a command packet to the
    disk and said packet contained the disk relative block number. I know
    of no way to initiate an I/O by writing to the disk's "control registers".

    The SATA device implements the Advanced Host Controller Interface (AHCI).

    https://en.wikipedia.org/wiki/Advanced_Host_Controller_Interface

    The controller supports multiple modes - a legacy IDE mode (which hasn't
    been actively used for a couple of decades), the Native AHCI mode, and optionally a RAID mode.

    The native AHCI and RAID modes have DMA engines in the controller that
    perform bulk data transfer to satisfy a controller command. For example,
    a command from the driver to transfer 100 sectors to a buffer starting at address 0x23510000 will cause the controller to start reading the disk
    at the starting sector and streaming the data to the host root complex.
    An AHCI device has one command queue with up to 32 outstanding commands
    at any one time.

    A TLP (PCIe Transaction Layer Packet) can be up to 1024 bytes, so the
    controller can push that much to the host in a single transaction.
    The root complex will pass the data to a host bridge with an IOMMU,
    and the IOMMU will translate the DMA addresses into host addresses
    before the bridge passes the data to the memory subsystem (mesh, ring,
    bus).

    NVMe (PCIe attached SSD) has a more modern interface. The OS
    (or HV) driver allocates a region of physical memory that holds
    input and output queues. The OS driver inserts one or more requests
    into a queue (there can be one or more queues per guest, for example) and
    pokes a doorbell register in the NVMe hardware. The buffer
    addresses in the request will be OS physical addresses. NVMe
    supports up to 65536 command rings.

    The doorbell write causes the
    NVMe hardware to read the new data structure(s) from the queue (describing
    the I/O) and execute those requests by either issuing MRD
    (Memory Read TLPs) or MWR (Memory Write TLPs) to the host
    root complex/PCIe controller until the request has been satisfied.
    A request can be arbitrarily large and supports a scatter gather
    list so that an inbound read can be stored in discontiguous
    regions in the applicable physical address regime (bare metal, guest
    or user-application).

    Once the request has been satisfied, the controller posts a
    completion entry on a completion queue (using DMA) and sends an MSI-X
    message (using DMA) to the root complex which passes it to the host
    interrupt controller which generates a 'complete' interrupt
    to the driver. These interrupts can be optionally coalesced
    (something more useful in a network interface controller than
    in a disk/ssd controller, to be sure). The driver reads the
    completion status from the completion queue (and wakes any
    threads waiting for the data).

    https://en.wikipedia.org/wiki/NVM_Express#Comparison_with_AHCI
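
    A minimal C sketch of that submission-queue/doorbell handshake; the
    structure layout and the doorbell mapping are simplified placeholders,
    not the literal NVMe register map:

        #include <stdint.h>

        struct sq_entry {              /* simplified 64-byte submission entry */
            uint8_t  opcode, flags;
            uint16_t cid;
            uint32_t nsid;
            uint64_t rsvd, mptr;
            uint64_t prp1, prp2;       /* data buffer addresses (OS/guest PA) */
            uint32_t cdw10_15[6];
        };

        struct sq {
            struct sq_entry   *ring;      /* queue memory in host DRAM        */
            volatile uint32_t *doorbell;  /* MMIO tail doorbell in a BAR      */
            uint16_t           tail, depth;
        };

        /* Driver side: place the command in the ring, advance the tail, and
           ring the doorbell; the controller then fetches the entry by DMA. */
        static void nvme_submit(struct sq *q, const struct sq_entry *cmd)
        {
            q->ring[q->tail] = *cmd;
            q->tail = (uint16_t)((q->tail + 1) % q->depth);
            /* a write barrier belongs here on a real machine */
            *q->doorbell = q->tail;
        }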

    Most modern PCIe network interface controllers use similar ring
    structures to pass work between the driver and the hardware
    with complicated traffic shaping, RSS and interrupt coalescing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Feb 11 20:19:32 2025
    On Tue, 11 Feb 2025 14:04:59 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
    -----------------isolating---------------------------------

    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.

    Standard HV stuff. Although you may want to consider that
    the value in the function's BAR, must be a guest PA, not a
    machine PA.

    I need a "WHY" on this sentence before responding to the rest.

    The guest OS (or user application) is directly programming the addresses
    in the
    PCIe function DMA engine[*]. The point is to completely
    avoid the hypervisor from being involved in the I/O,
    so the guest programs the function DMA engine using guest PA and the
    IOMMU translates the guest PA into machine
    addresses for all transactions initiated by the
    function.

    (Each PCIe function is treated as an individual device
    by the OS/HV. The entire purpose of SR-IOV is to
    present the device directly to the guest to avoid any
    hypervisor involvment in the I/O path).

    This is the missing part:: When a user performs a LD or ST
    the guest Virtual address is translated to Guest Physical
    by Guest OS translation tables. Guest Physical is then
    interpreted as Host virtual and translated a second time
    by {SM,HV} mapping tables. This nested MMU does both
    translations in a single access; so core TLB is organized
    to associate guest virtual directly with machine physical !
    as if there were a level crossing PTE providing all the
    "right bits".

    Now that the device has been configured, Guest OS decides
    to write some control registers of the device. Guest OS
    has its own translation tables for Guest Virtual to Guest
    Physical--but the core MMU then translates guest Physical
    to Machine Physical before it gets transported over the
    interconnect. So, by the time said address gets to Host-
    Bridge it is already in Machine Physical, not the Guest
    physical you mention.*

    ------------------------------------------------------

    UNLESS HV and Guest OS have an agreement that Guest Physical
    has a range by which Guest Physical == Machine Physical !!
    And if there is such an agreement, calling the address
    Machine Physical is just as valid as calling it Guest
    Physical.

    (*) it sounds like you are assuming there is a way to trans-
    late the first nesting level to guest physical and then
    sidestep the translation of guest physical to machine
    physical until the request gets to the I/O MMU, allowing
    the I/O MMU to perform the second level of nesting trans-
    lation.

    {{Back to the other stuff later}}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Feb 11 20:49:24 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 14:04:59 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
    -----------------isolating---------------------------------

    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.

    Standard HV stuff. Although you may want to consider that
    the value in the function's BAR, must be a guest PA, not a
    machine PA.

    I need a "WHY" on this sentence before responding to the rest.

    The guest OS (or user application) is directly programming the addresses
    in the
    PCIe function DMA engine[*]. The point is to completely
    avoid the hypervisor from being involved in the I/O,
    so the guest programs the function DMA engine using guest PA and the
    IOMMU translates the guest PA into machine
    addresses for all transactions initiated by the
    function.

    (Each PCIe function is treated as an individual device
    by the OS/HV. The entire purpose of SR-IOV is to
    present the device directly to the guest to avoid any
    hypervisor involvment in the I/O path).

    This is the missing part:: When a user performs a LD or ST

    You're missing the point completely. There is no user
    performing a LD or ST. The DMA controller on the device
    is initiating the transaction, not the host CPU.

    the guest Virtual address is translated to Guest Physical
    by Guest OS translation tables. Guest Physical is then
    interpreted as Host virtual and translated a second time
    by {SM,HV} mapping tables. This nested MMU does both
    translations in a single access; so core TLB is organized
    to associate guest virtual directly with machine physical !
    as if there were a level crossing PTE providing all the
    "right bits".

    This is basically how all modern CPUs handle it, yes.

    But it is not relevant to the inbound traffic initiated
    -by the device- which can't be translated by the CPU,
    rather must be translated at some point between the
    PCIe controller and the internal processor interconnect
    (e.g. mesh).


    Now that the device has been configured, Guest OS decides
    to write some control registers of the device. Guest OS
    has its own translation tables for Guest Virtual to Guest
    Physical--but the core MMU then translates guest Physical
    to Machine Physical before it gets transported over the
    interconnect. So, by the time said address gets to Host-
    Bridge it is already in Machine Physical, not the Guest
    physical you mention.*

    My 66000 has no insight into the device; you can't know
    a priori which 64-bit write to the device contains
    a physical address. Particularly in all modern
    devices where there may be only one "control register"[*],
    the guest driver writes commands and s/g lists to one of
    several hundred queues in local DRAM, then signals
    the device to initiate a DMA operation to read the
    entry from the queue in DRAM (using a guest physical
    address). The CPU never sees that read,
    nor can you know a priori that a particular write
    from a device driver is a guest address that might need
    to be translated - the driver is writing the command
    to main memory and just poking the device to read the
    command directly; there is no way to associate that
    write with any particular device in the CPU.

    [*] The doorbell. There will generally be a few more
    to set global characteristics, configurations, etc,
    but they'll only be actively used during driver initialization
    and will likely not contain addresses of any form.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Feb 11 23:29:04 2025
    On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 14:04:59 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
    -----------------isolating---------------------------------

    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.

    Standard HV stuff. Although you may want to consider that
    the value in the function's BAR, must be a guest PA, not a
    machine PA.

    I need a "WHY" on this sentence before responding to the rest.

    The guest OS (or user application) is directly programming the addresses
    in the
    PCIe function DMA engine[*]. The point is to completely
    avoid the hypervisor from being involved in the I/O,
    so the guest programs the function DMA engine using guest PA and the
    IOMMU translates the guest PA into machine
    addresses for all transactions initiated by the
    function.

    (Each PCIe function is treated as an individual device
    by the OS/HV. The entire purpose of SR-IOV is to
    present the device directly to the guest to avoid any
    hypervisor involvment in the I/O path).

    This is the missing part:: When a user performs a LD or ST

    You're missing the point completely. There no user
    performing a LD or ST. The DMA controller on the device
    is initiating the transaction, not the host CPU.

    the guest Virtual address is translated to Guest Physical
    by Guest OS translation tables. Guest Physical is then
    interpreted as Host virtual and translated a second time
    by {SM,HV} mapping tables. This nested MMU does both
    translations in a single access; so core TLB is organized
    to associate guest virtual directly with machine physical !
    as if there were a level crossing PTE providing all the
    "right bits".

    This is basically how all modern CPUs handle it, yes.

    But it is not relevent to the inbound traffic initiated
    -by the device- which can't be translated by the CPU,
    rather must be translated at some point between the
    PCIe controller and the internal processor interconnect
    (e.g. mesh).

    I am tracing the path from user of device (core Guest OS).
    If I don't understand this "should be simple" path, I am
    too lost to continue from the device side looking towards
    memory.

    I can see how core can write Guest Physical Address into
    device.BAR using config space access (with appropriate
    MMU permissions).

    But, right now: I can't see how the appropriate bit pattern
    from core gets to HostBridge in MMI/O space and is recognized
    by matching device.BAR down the PCIe tree.

    core executes the following instruction::

    STH R7,[Rdevice,#controlreg]

    Rdevice has a Virtual Address bit pattern which after 1 level of
    translation matches the bit pattern put into device.BAR at config.

    #controlreg is the offset to the control reg.

    core TLB translates GVA to GPA; GPA contains the same bit pattern
    that was written into device.BAR.

    Since we use nested page tables, GPA is interpreted as Host
    virtual address, HVA is translated to machine PA before access
    leaves the confines of the core.

    Unless a range of bits in MPA == GPA, I can't see how the address
    can be routed (over the interconnect, into HostBridge,) and then on
    to the device and match the configured BAR bit pattern ?!?

    That is I can't see how device.BAR == MPA can match when it
    has HPA != GPA !!


    Now that the device has been configured, Guest OS decides
    to write some control registers of the device. Guest OS
    has its own translation tables for Guest Virtual to Guest
    Physical--but the core MMU then translates guest Physical
    to Machine Physical before it gets transported over the
    interconnect. So, by the time said address gets to Host-
    Bridge it is already in Machine Physical, not the Guest
    physical you mention.*

    My 66000 has no insight into the device; you can't know
    a priori which 64-bit write to the device contains
    a physical address. Particularly in all modern
    devices where there may be only one "control register"[*],
    the guest driver writes commands and s/g lists to one of
    several hundred queues in local DRAM, then signals
    the device to initiate a DMA operation to read the
    entry from the queue in DRAM (using a guest physical
    address). The CPU never sees that read,
    nor can you know a priori that a particular write
    from a device driver is a guest address that might need
    to be translated - the driver is writing the command
    to main memory and just poking the device to read the
    command directly; there is no way to associate that
    write with any particular device in the CPU.

    I have a touchy-feely knowledge of the above paragraph.
    And I am trying to process without anyone seeing any
    interconnect transaction it should not.

    But the core does see its own writes to control registers
    which are recognized by/at the Device.BAR set bit-pattern
    from config, as meaningful to this device.

    [*] The doorbell. There will generally be a few more
    to set global characteristics, configurations, etc,
    but they'll only be actively used during driver initialization
    and will likely not contain addresses of any form.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Feb 12 00:34:18 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 14:04:59 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
    -----------------isolating---------------------------------

    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.

    Standard HV stuff. Although you may want to consider that
    the value in the function's BAR, must be a guest PA, not a
    machine PA.

    I need a "WHY" on this sentence before responding to the rest.

    The guest OS (or user application) is directly programming the addresses
    in the
    PCIe function DMA engine[*]. The point is to completely
    avoid the hypervisor from being involved in the I/O,
    so the guest programs the function DMA engine using guest PA and the
    IOMMU translates the guest PA into machine
    addresses for all transactions initiated by the
    function.

    (Each PCIe function is treated as an individual device
    by the OS/HV. The entire purpose of SR-IOV is to
    present the device directly to the guest to avoid any
    hypervisor involvment in the I/O path).

    This is the missing part:: When a user performs a LD or ST

    You're missing the point completely. There no user
    performing a LD or ST. The DMA controller on the device
    is initiating the transaction, not the host CPU.

    the guest Virtual address is translated to Guest Physical
    by Guest OS translation tables. Guest Physical is then
    interpreted as Host virtual and translated a second time
    by {SM,HV} mapping tables. This nested MMU does both
    translations in a single access; so core TLB is organized
    to associate guest virtual directly with machine physical !
    as if there were a level crossing PTE providing all the
    "right bits".

    This is basically how all modern CPUs handle it, yes.

    But it is not relevent to the inbound traffic initiated
    -by the device- which can't be translated by the CPU,
    rather must be translated at some point between the
    PCIe controller and the internal processor interconnect
    (e.g. mesh).

    I am tracing the path from user of device (core Guest OS).
    If I don't understand this "should be simple" path, I am
    too lost to continue from the device side looking towards
    memory.

    I can see how core can write Guest Physical Address into
    device.BAR using config space access (with appropriate
    MMU permissions).

    But, right now: I can't see how the appropriate bit pattern
    from core gets to HostBridge in MMI/O space and is recognized
    by matching device.BAR down the PCIe tree.

    The hardware is generally described to the operating system
    through tables provided by the firmware (supplied by the system
    builder). There are two different mechanisms in widespread use:
    ACPI and DeviceTree. The former, sponsored primarily by Microsoft
    and Intel, provides a standard mechanism for the hardware to
    describe itself to the operating system/hypervisor.

    An entry in those tables for a particular device will describe
    in standard terms the path from the memory/processor complex
    to the device. Say, for example, your CPU uses a high-speed
    mesh interface between cores. The outer ring on the mesh
    will connect to memory controllers and I/O bridges.

    Each point in the path is described by a table entry
    in the ACPI tables.

    Assume a simple config

    +=========+     +========+    +=======+
    | CPU     |     | HOST   |    | PCIe  |
    | Memory  |-----| Bridge |----| Ctlr  |------+---------------+-----------
    | complex |     |        |    |       |      |               |
    +=========+     +========+    +=======+    Func 0          Func 1
                                               BAR[0]=X        BAR[2]=Y

    The tables will have objects describing the bridge
    and the hierarchy below the bridge (in this case, there is one
    PCIe controller and a PCIe device with two functions).

    Each function is a distinct device with its own config space, including
    the standard 6 (32-bit) or 3 (64-bit) BARs.

    The firmware will "size" the BARs by writing all-ones to the BAR[*],
    then reading the value back, inverting it and adding 1. This results in
    the size, in bytes, required by the function for each BAR.

    [*] through the PCI configuration space, before the BARs are initialized.
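
    A minimal C sketch of that sizing step for a 32-bit memory BAR; the
    config-space accessors are hypothetical, and a 64-bit BAR repeats the
    probe across the upper half as well:

        #include <stdint.h>

        uint32_t pci_cfg_read32(uint16_t rid, uint16_t offset);
        void     pci_cfg_write32(uint16_t rid, uint16_t offset, uint32_t value);

        /* Returns the size in bytes this function requests for the BAR. */
        uint64_t bar_size(uint16_t rid, uint16_t bar_offset)
        {
            uint32_t saved = pci_cfg_read32(rid, bar_offset);

            pci_cfg_write32(rid, bar_offset, 0xFFFFFFFFu); /* write all-ones  */
            uint32_t probe = pci_cfg_read32(rid, bar_offset);
            pci_cfg_write32(rid, bar_offset, saved);       /* restore the BAR */

            probe &= ~0xFu;               /* mask off the memory-BAR type bits */
            return (uint64_t)~probe + 1;  /* invert and add 1                  */
        }

    Summing these per-function results over every BAR is what produces the
    bridge aperture described next.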

    The firmware will repeat that for each function discovered below the
    root complex bridge in the PCIe controller and sum the total.

    That sum will be associated with the bridge aperture, so that software
    knows to allocate all the BARs downstream of the bridge from the bridge
    range of physical addresses (assigned by the firmware).

    If there are multiple bridges in the path, rinse and repeat.

    The bridge entry in the ACPI tables will have been programmed with
    a base physical address and a size. The OS (Windows/Linux) will
    subsequently allocate from that range when storing addresses into
    the BARs.

    In the case of a guest, the Hypervisor provides the tables with
    guest physical addresses for the devices (but the bridge is
    transparent and virtual).

    The reason for all of this is that there is no standard mechanism
    for Linux or Windows or other commercial operating systems to
    discover and program the topology of the chip, the I/O bridges,
    the mesh, etc. The chip provider and platform abstract all
    that in the ACPI tables (or device tree) provided to the OS/HV.

    Intel, AMD and ARM all support ACPI, and ARM also supports device tree
    which is a simpler mechanism used more in embedded systems than
    general purpose computer systems.

    There are a lot of moving parts, most of which are hidden by
    the firmware, in routing non-coherent accesses outside the core mesh.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Feb 13 16:42:59 2025
    On Wed, 12 Feb 2025 0:34:18 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 14:04:59 +0000, Scott Lurndal wrote:
    ---------------------
    This is the missing part:: When a user performs a LD or ST

    You're missing the point completely. There no user
    performing a LD or ST. The DMA controller on the device
    is initiating the transaction, not the host CPU.

    the guest Virtual address is translated to Guest Physical
    by Guest OS translation tables. Guest Physical is then
    interpreted as Host virtual and translated a second time
    by {SM,HV} mapping tables. This nested MMU does both
    translations in a single access; so core TLB is organized
    to associate guest virtual directly with machine physical !
    as if there were a level crossing PTE providing all the
    "right bits".

    This is basically how all modern CPUs handle it, yes.

    But it is not relevent to the inbound traffic initiated
    -by the device- which can't be translated by the CPU,
    rather must be translated at some point between the
    PCIe controller and the internal processor interconnect
    (e.g. mesh).

    I am tracing the path from user of device (core Guest OS).
    If I don't understand this "should be simple" path, I am
    too lost to continue from the device side looking towards
    memory.

    I can see how core can write Guest Physical Address into
    device.BAR using config space access (with appropriate
    MMU permissions).

    But, right now: I can't see how the appropriate bit pattern
    from core gets to HostBridge in MMI/O space and is recognized
    by matching device.BAR down the PCIe tree.

    We will ignore this for a while.

    The hardware is generally described to the operating system
    through tables provided by the firmware (supplied by the system
    builder). There are two different mechanisms in widespread use;
    ACPI and DeviceTree. The former, sponsered primarily by Microsoft
    and Intel, provides a standard mechanism for the hardware to
    describe itself to the operating system/hypervisor.

    Side topic::
    Why not just spoof PCIe config space, and set up config space
    headers that cause Guest OS to load the paravirtualized driver ??
    instead of the direct device driver ?!?
    {{It has to be a speed/latency issue}}

    An entry in those tables for a particular device will describe
    in standard terms the path from the memory/processor complex
    to the device. Say, for example, your CPU uses a high-speed
    mesh interface between cores. The outer ring on the mesh
    will connect to memory controllers and I/O bridges.

    Each point in the path is described by a table entry
    in the ACPI tables.

    Assume a simple config

    +=========+     +========+    +=======+
    | CPU     |     | HOST   |    | PCIe  |
    | Memory  |-----| Bridge |----| Ctlr  |------+---------------+-----------
    | complex |     |        |    |       |      |               |
    +=========+     +========+    +=======+    Func 0          Func 1
                                               BAR[0]=X        BAR[2]=Y

    Now, consider a device with 16 virtual functions, and 16 Guest
    OSs. All 16 Guest OSs use the same Guest Physical Address in
    their virtual function BARs.

    How does the PCIe controller figure out that an MMI/O space
    sized write is to GuestOS[7] or GuestOS[14] ?? {{Obviously,
    once the command is placed in the routing tree, it will be
    matched by all the GuestOS BARs.}}

    Thus, there is still information missing for my understanding.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Feb 13 18:12:52 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 12 Feb 2025 0:34:18 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:

    But, right now: I can't see how the appropriate bit pattern
    from core gets to HostBridge in MMI/O space and is recognized
    by matching device.BAR down the PCIe tree.

    We will ignore this for a while.

    The hardware is generally described to the operating system
    through tables provided by the firmware (supplied by the system
    builder). There are two different mechanisms in widespread use;
    ACPI and DeviceTree. The former, sponsered primarily by Microsoft
    and Intel, provides a standard mechanism for the hardware to
    describe itself to the operating system/hypervisor.

    Side topic::
    Why not just spoof PCIe config space, and setup config space
    headers that cause Guest OS to load the paravirtualized driver ??
    instead of the direct device driver ?!?

    That was the approach prior to the PCI-SIG introducing
    SRIOV in the mid 2000s. Performance sucked.

    {{It has to be a speed/latency issue}}

    Throughput reduced and extra HV overhead.


    An entry in those tables for a particular device will describe
    in standard terms the path from the memory/processor complex
    to the device. Say, for example, your CPU uses a high-speed
    mesh interface between cores. The outer ring on the mesh
    will connect to memory controllers and I/O bridges.

    Each point in the path is described by a table entry
    in the ACPI tables.

    Assume a simple config

    +=========+     +========+    +=======+
    | CPU     |     | HOST   |    | PCIe  |
    | Memory  |-----| Bridge |----| Ctlr  |------+---------------+-----------
    | complex |     |        |    |       |      |               |
    +=========+     +========+    +=======+    Func 0          Func 1
                                               BAR[0]=X        BAR[2]=Y

    Now, consider a device with 16 virtual functions, and 16 Guest
    OSs. All 16 Guest OSs use the same Guest Physical Address in
    their virtual function BARs.

    Technically, there would be one physical function (owned
    by the hypervisor - with control CSRs apportioning resources
    between virtual functions) and up to 65535 virtual functions.

    The 'bus' and 'function' together make up a 16-bit routing
    id (RID) which is the target id field in the Config Read/Write TLP. The
    PF will generally have a function number between 0 and 7,
    (although with ARI, it can be any function number) and the
    VF routing IDs will start at some offset from the PF with
    a programmable stride (e.g. for a PF with 3 VFs):

    PF0 RID = 0
    VF1 RID = 8
    VF2 RID = 16
    VF3 RID = 24

    When the VF number exceeds 255, its RID will be function 0
    on the next higher bus number.

    As an endpoint function, the bus field in the RID will be the secondary
    bus of the Root Complex bridge (generally 1).

    So the PCI config space RID (BDF) for VF2 would be 0x0110.

    This RID (in combination with a PCIe controller index (segment
    in Intel terminology)) is used by the IOMMU to select the
    translation table to use for inbound DMA from this function.
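
    A small C sketch of the SR-IOV routing-ID arithmetic described above;
    First VF Offset and VF Stride come from the PF's SR-IOV capability,
    and vf_index is 1-based as in the example:

        #include <stdint.h>

        /* RID layout: bus in bits [15:8], device/function in bits [7:0]. */
        uint16_t vf_rid(uint16_t pf_rid, uint16_t first_vf_offset,
                        uint16_t vf_stride, uint16_t vf_index)
        {
            /* The sum may carry into the bus field, which is how a VF ends
               up as function 0 on the next higher bus number.             */
            return (uint16_t)(pf_rid + first_vf_offset
                              + (uint16_t)(vf_index - 1) * vf_stride);
        }

        /* e.g. vf_rid(0x0100, 8, 8, 2) == 0x0110, matching the VF2 example */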


    How does the PCIe controller figure out that an MMI/O space
    sized write is to GuestOS[7] or GuestOS[14] ?? {{Obviously,
    once the command is place in the routing tree, it will be
    matched by all the GuestOS BARs.}}

    The PCIe controller sends all MRW/MRD TLPs to the endpoint
    device, which matches them against ALL BAR registers (all PFs
    and all VFs) on the device to determine which function
    the memory read or memory write TLP is targeting.
    (Note that the VF BARs are actually fixed, the VF
    memory spaces are equal sized and contiguous, so
    the endpoint only needs to CAM on six BARs at most)
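
    A sketch of that address match for the fixed, equal-sized, contiguous
    VF BARs (illustrative only; a real endpoint does this in CAM/compare
    logic, not code):

        #include <stdint.h>

        struct vf_bar { uint64_t base, size; int enabled; uint32_t num_vfs; };

        /* Returns the VF index an inbound MRd/MWr TLP address hits,
           or -1 when nothing matches (write dropped / read returns UR). */
        int match_vf_bar(const struct vf_bar *b, uint64_t tlp_addr)
        {
            if (!b->enabled || tlp_addr < b->base)
                return -1;
            uint64_t vf = (tlp_addr - b->base) / b->size; /* equal, contiguous */
            return vf < b->num_vfs ? (int)vf : -1;
        }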

    It gets a bit more complicated when there is a PCIe
    switch on the device, likewise if there is a PCIe to PCI
    bridge on the endpoint (very unlikely nowadays).

    If the TLP address doesn't match any _enabled_ function
    BAR, a memory write (posted) will be dropped[*] and a memory read
    will return a UR (Unsupported Request) completion TLP.

    The real question is how does the CPU route the
    load or store request to the proper PCIe controller
    port, and that's the hard part, particularly to
    follow the proper PCIe transaction ordering rules.


    Thus, there is still information missing for my understanding.

    [*] it may post a RAS error in some implementation defined manner
    or via the Advanced Error Reporting (AER) PCI Express capability.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Feb 13 21:48:10 2025
    On Thu, 13 Feb 2025 18:12:52 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 12 Feb 2025 0:34:18 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:

    But, right now: I can't see how the appropriate bit pattern
    from core gets to HostBridge in MMI/O space and is recognized
    by matching device.BAR down the PCIe tree.

    We will ignore this for a while.

    The hardware is generally described to the operating system
    through tables provided by the firmware (supplied by the system
    builder). There are two different mechanisms in widespread use;
    ACPI and DeviceTree. The former, sponsered primarily by Microsoft
    and Intel, provides a standard mechanism for the hardware to
    describe itself to the operating system/hypervisor.

    Side topic::
    Why not just spoof PCIe config space, and setup config space
    headers that cause Guest OS to load the paravirtualized driver ??
    instead of the direct device driver ?!?

    That was the approach prior to the PCI-SIG introducing
    SRIOV in the mid 2000s. Performance sucked.

    {{It has to be a speed/latency issue}}

    Throughput reduced and extra HV overhead.


    An entry in those tables for a particular device will describe
    in standard terms the path from the memory/processor complex
    to the device. Say, for example, your CPU uses a high-speed
    mesh interface between cores. The outer ring on the mesh
    will connect to memory controllers and I/O bridges.

    Each point in the path is described by a table entry
    in the ACPI tables.

    Assume a simple config

    +=========+     +========+    +=======+
    | CPU     |     | HOST   |    | PCIe  |
    | Memory  |-----| Bridge |----| Ctlr  |------+---------------+-----------
    | complex |     |        |    |       |      |               |
    +=========+     +========+    +=======+    Func 0          Func 1
                                               BAR[0]=X        BAR[2]=Y

    Now, consider a device with 16 virtual functions, and 16 Guest
    OSs. All 16 Guest OSs use the same Guest Physical Address in
    their virtual function BARs.

    Technically, there would be one physical function (owned
    by the hypervisor - with control CSRs apportioning resources
    between virtual functions) and up to 65535 virtual functions.

    The 'bus' and 'function' together make up a 16-bit routing
    id (RID) which is the target id field in the Config Read/Write TLP. The
    PF will generally have a function number between 0 and 7,
    (although with ARI, it can be any function number) and the
    VF routing IDs will start at some offset from the PF with
    a programmable stride (e.g. for a PF with 3 VFs):

    PF0 RID = 0
    VF1 RID = 8
    VF2 RID = 16
    VF3 RID = 24

    When the VF number exceeds 255, it's RID will be function 0
    on the next higher bus number.

    As an endpoint function, the bus field in the RID will be the secondary
    bus of the Root Complex bridge (generally 1).

    So the PCI config space RID (BDF) for VF2 would be 0x0110.

    This RID (in combination with a PCIe controller index (segment
    in Intel terminology)) is used by the IOMMU to select the
    translation table to use for inbound DMA from this function.


    How does the PCIe controller figure out that an MMI/O space
    sized write is to GuestOS[7] or GuestOS[14] ?? {{Obviously,
    once the command is place in the routing tree, it will be
    matched by all the GuestOS BARs.}}

    The PCIe controller sends all MRW/MRD TLPs to the endpoint
    device, which matches them against ALL BAR registers (all PFs
    and all VFs) on the device to determine which function
    the memory read or memory write TLP is targeting.

    What is supposed to happen when more than 1 BAR matches
    the TLP address on any given bus??

    (Note that the VF BARs are actually fixed, the VF
    memory spaces are equal sized and contiguous, so
    the endpoint only needs to CAM on six BARs at most)

    It gets a bit more complicated when there is a PCIe
    switch on the device, likewise if there is a PCIe to PCI
    bridge on the endpoint (very unlikely nowadays).

    If the TLP address doesn't match any _enabled_ function
    BAR, a memory write (posted) will be dropped[*] and a memory read
    will return a UR (Unsupported Request) completion TLP.

    But what happens when more than 1 BAR matches the supplied address ??

    HW would typically have each matching BAR capture the data
    being written, or upon a read, read all the control registers
    and either AND them or OR them together (wired OR read-out
    bus). Neither of which is what SW will be expecting.

    The real question is how does the CPU route the
    load or store request to the proper PCIe controller
    port, and that's the hard part, particularly to
    follow the proper PCIe transaction ordering rules.


    Thus, there is still information missing for my understanding.

    [*] it may post a RAS error in some implementation defined manner
    or via the Advanced Error Reporting (AER) PCI Express capability.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Feb 13 22:23:36 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Feb 2025 18:12:52 +0000, Scott Lurndal wrote:


    How does the PCIe controller figure out that an MMI/O space
    sized write is to GuestOS[7] or GuestOS[14] ?? {{Obviously,
    once the command is place in the routing tree, it will be
    matched by all the GuestOS BARs.}}

    The PCIe controller sends all MRW/MRD TLPs to the endpoint
    device, which matches them against ALL BAR registers (all PFs
    and all VFs) on the device to determine which function
    the memory read or memory write TLP is targeting.

    What is supposed to happen when more than 1 BAR matches
    the TLP address on any given bus??

    It lets out the magic smoke.

    Actually, the results are indeterminate; such programming
    is a violation of the specification. It could match any
    of the matching bars, or none of them. Software certainly
    cannot rely on any particular behavior in that case and
    must not do that.


    (Note that the VF BARs are actually fixed, the VF
    memory spaces are equal sized and contiguous, so
    the endpoint only needs to CAM on six BARs at most)

    It gets a bit more complicated when there is a PCIe
    switch on the device, likewise if there is a PCIe to PCI
    bridge on the endpoint (very unlikely nowadays).

    If the TLP address doesn't match any _enabled_ function
    BAR, a memory write (posted) will be dropped[*] and a memory read
    will return a UR (Unsupported Request) completion TLP.

    But what happens when more than 1 BAR matches the supplied address ??

    HW would typically have each matching BAR capture the data
    being written, or upon a read, read all the control registers
    and either AND them or OR them together (wired OR read-out
    bus). Neither of which is what SW will be expecting.

    Indeed. Overlapping BARs downstream of the root
    complex are a programming bug.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Feb 7 20:32:39 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 13:57:51 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:

    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -------------------
    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV loses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about IOMMU addresses, are
    you talking about memory addresses or disk addresses?

    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed, Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    That is indeed the case. The IOMMU is on the inbound path
    from the PCIe controller to the internal bus/mesh structure.

    Note that there is a translation on the outbound path from
    the host address space to the PCIe memory space - this is
    often 1:1, but need not be so. This translation happens
    in the PCIe controller when creating a TLP that contains
    an address before sending the TLP to the endpoint. Take

    Is there any reason this cannot happen in the core MMU ??

    How do you map the translation table to the device? Why
    would you wish to have the CPU translating I/O virtual
    addresses? The IOMMU tables are per device, and they
    can be configured to map the minimum amount of the address
    space (even updated per-I/O if desired) required to support
    the completion of an inbound DMA from the device.


    Guest OS uses a virtual device address given to it from HV.
    HV sets up the 2nd nesting of translation to translate this
    to "what HostBridge needs" to route commands to device control
    registers. The handoff can be done by spoofing config space
    or having HV simply hand Guest OS a list of devices it can
    discover/configure/use.

    The IOMMU only is involved in DMA transactions _initiated_ by
    the device, not by the CPUs. They're two completely different
    concepts.


    an AHCI controller, for example, where the only device
    BAR is 32-bits; if a host wants to map the AHCI controller
    at a 64-bit address, the controller needs to map that 64-bit
    address window into a 32-bit 3DW TLP to be sent to the endpoint
    function.

    This is one of the reasons My 66000 architecture has a unique
    MMI/O address space--you can setup a 32-bit BAR to put a
    page of control registers in 32-bit address space without
    conflict. {{If I understand correctly}} Core MMU, then,
    translates normal device virtual control register addresses
    such that the request is routed to where the device is looking
    {{which has 32 high order bits zero.}}

    Most systems have DRAM located at physical address zero, and
    a 4GB DRAM is pretty small these days. So you either need
    to make a hole in the DRAM or provide a mapping mechanism to
    map a 64-bit address into a 32-bit bar when sending TLPs
    to the AHCI controller.

    Systems that aren't intel compatible will designate a range
    of the 64-bit physical address space (near the top) and will
    map regions in that range to the 32-bit bar via translation
    registers in the PCIe controller.



    On the other hand--it would take a very big system indeed to
    overflow the 32-bit MMI/O space, although ECAM can access
    42-bit device CR MMI/O space.

    Leaving aside the small size of the legacy Intel I/O space
    (16-bit addresses), history seems to have favored single
    address space systems, so I suspect such a MMI/O space will
    not be favored by many.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Fri Feb 14 19:13:29 2025
    On Tue, 11 Feb 2025 23:29:04 +0000, MitchAlsup1 wrote:

    On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:


    This is basically how all modern CPUs handle it, yes.

    But it is not relevant to the inbound traffic initiated
    -by the device- which can't be translated by the CPU,
    rather must be translated at some point between the
    PCIe controller and the internal processor interconnect
    (e.g. mesh).

    I am tracing the path from user of device (core Guest OS).
    If I don't understand this "should be simple" path, I am
    too lost to continue from the device side looking towards
    memory.

    I can see how core can write Guest Physical Address into
    device.BAR using config space access (with appropriate
    MMU permissions).

    But, right now: I can't see how the appropriate bit pattern
    from core gets to HostBridge in MMI/O space and is recognized
    by matching device.BAR down the PCIe tree.

    core executes the following instruction::

    STH R7,[Rdevice,#controlreg]

    Rdevice has a Virtual Address bit pattern which after 1 level of
    translation matches the bit pattern put into device.BAR at config.

    #controlreg is the offset to the control reg.

    Update:

    I have figured out how to re-attach Guest Physical BAR back
    as MMI/O commands enter the top of a PCIe tree.

    Thanks to Scott Lurndal for being gentle with me.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Feb 14 19:51:44 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 23:29:04 +0000, MitchAlsup1 wrote:

    On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:


    This is basically how all modern CPUs handle it, yes.

    But it is not relevant to the inbound traffic initiated
    -by the device- which can't be translated by the CPU,
    rather must be translated at some point between the
    PCIe controller and the internal processor interconnect
    (e.g. mesh).

    I am tracing the path from user of device (core Guest OS).
    If I don't understand this "should be simple" path, I am
    too lost to continue from the device side looking towards
    memory.

    I can see how core can write Guest Physical Address into
    device.BAR using config space access (with appropriate
    MMU permissions).

    But, right now: I can't see how the appropriate bit pattern
    from core gets to HostBridge in MMI/O space and is recognized
    by matching device.BAR down the PCIe tree.

    core executes the following instruction::

    STH R7,[Rdevice,#controlreg]

    Rdevice has a Virtual Address bit pattern which after 1 level of
    translation matches the bit pattern put into device.BAR at config.

    #controlreg is the offset to the control reg.

    Update:

    I have figured out how to re-attach Guest Physical BAR back
    as MMI/O commands enter the top of a PCIe tree.

    Thanks to Scott Lurndal for being gentle with me.

    Here is an example topology from a Raptor Lake system:

    bus:dev.function (bus 0 is a traditional PCI bus)
    Region X is BAR X.

    These devices are all built into either the core or
    the PCH/southbridge.

    The first plug-in PCI card would reside on bus 4.

    Only intel systems provide or use the I/O port (legacy 8086) BARs.


    $ lspci -vvv | egrep "^[0-9]|Region "
    00:00.0 Host bridge: Intel Corporation Raptor Lake-S 8+12 - Host Bridge/DRAM Controller (rev 01)
    00:02.0 VGA compatible controller: Intel Corporation Raptor Lake-S GT1 [UHD Graphics 770] (rev 04) (prog-if 00 [VGA controller])
            Region 0: Memory at 6000000000 (64-bit, non-prefetchable) [size=16M]
            Region 2: Memory at 4000000000 (64-bit, prefetchable) [size=256M]
            Region 4: I/O ports at 5000 [size=64]
    00:04.0 Signal processing controller: Intel Corporation Raptor Lake Dynamic Platform and Thermal Framework Processor Participant (rev 01)
            Region 0: Memory at 6001100000 (64-bit, non-prefetchable) [size=128K]
    00:08.0 System peripheral: Intel Corporation GNA Scoring Accelerator module (rev 01)
            Region 0: Memory at 600113b000 (64-bit, non-prefetchable) [disabled] [size=4K]
    00:14.0 USB controller: Intel Corporation Alder Lake-S PCH USB 3.2 Gen 2x2 XHCI Controller (rev 11) (prog-if 30 [XHCI])
            Region 0: Memory at 6001120000 (64-bit, non-prefetchable) [size=64K]
    00:14.2 RAM memory: Intel Corporation Alder Lake-S PCH Shared SRAM (rev 11)
            Region 0: Memory at 6001134000 (64-bit, non-prefetchable) [disabled] [size=16K]
            Region 2: Memory at 600113a000 (64-bit, non-prefetchable) [disabled] [size=4K]
    00:16.0 Communication controller: Intel Corporation Alder Lake-S PCH HECI Controller #1 (rev 11)
            Region 0: Memory at 6001139000 (64-bit, non-prefetchable) [size=4K]
    00:17.0 SATA controller: Intel Corporation Alder Lake-S PCH SATA Controller [AHCI Mode] (rev 11) (prog-if 01 [AHCI 1.0])
            Region 0: Memory at 70700000 (32-bit, non-prefetchable) [size=8K]
            Region 1: Memory at 70704000 (32-bit, non-prefetchable) [size=256]
            Region 2: I/O ports at 5080 [size=8]
            Region 3: I/O ports at 5088 [size=4]
            Region 4: I/O ports at 5060 [size=32]
            Region 5: Memory at 70703000 (32-bit, non-prefetchable) [size=2K]
    00:1a.0 PCI bridge: Intel Corporation Alder Lake-S PCH PCI Express Root Port #25 (rev 11) (prog-if 00 [Normal decode])
    00:1c.0 PCI bridge: Intel Corporation Alder Lake-S PCH PCI Express Root Port #3 (rev 11) (prog-if 00 [Normal decode])
    00:1c.3 PCI bridge: Intel Corporation Device 7abb (rev 11) (prog-if 00 [Normal decode])
    00:1f.0 ISA bridge: Intel Corporation Device 7a86 (rev 11)
    00:1f.3 Audio device: Intel Corporation Alder Lake-S HD Audio Controller (rev 11)
            Region 0: Memory at 6001130000 (64-bit, non-prefetchable) [size=16K]
            Region 4: Memory at 6001000000 (64-bit, non-prefetchable) [size=1M]
    00:1f.4 SMBus: Intel Corporation Alder Lake-S PCH SMBus Controller (rev 11)
            Region 0: Memory at 6001138000 (64-bit, non-prefetchable) [size=256]
            Region 4: I/O ports at efa0 [size=32]
    00:1f.5 Serial bus controller: Intel Corporation Alder Lake-S PCH SPI Controller (rev 11)
            Region 0: Memory at 70702000 (32-bit, non-prefetchable) [size=4K]
    01:00.0 Non-Volatile memory controller: Sandisk Corp WD PC SN5000S M.2 2230 NVMe SSD (DRAM-less) (prog-if 02 [NVM Express])
            Region 0: Memory at 70600000 (64-bit, non-prefetchable) [size=16K]
    02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller (rev 1b)
            Region 0: I/O ports at 4000 [size=256]
            Region 2: Memory at 70504000 (64-bit, non-prefetchable) [size=4
            Region 4: Memory at 70500000 (64-bit, non-prefetchable) [size=16K]
    03:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8852BE PCIe 802.11ax Wireless Network Controller
            Region 0: I/O ports at 3000 [size=256]
            Region 2: Memory at 70400000 (64-bit, non-prefetchable) [size=1M]

    The traditional PCI bus supports a 5-bit device # and a 3-bit function #.
    A PCIe bus supports either a 3-bit function number with the high-order
    five bits MBZ (i.e. only device 0 is legal on PCIe) or, if the function
    advertises the Alternative Routing-ID Interpretation (ARI) capability,
    an 8-bit function number.  (SRIOV leverages ARI to support dense routing
    IDs, but any bus that supports ARI can handle 256 physical functions.)

    PCI supports two forms of configuration space addresses in TLPs:
    Type 0 contains only a function number and the register address.
    Type 1 contains the bus number, function number and register address.

    A PCI-PCI bridge (such as the root complex port bridge) will translate
    a type 1 transaction to type 0 when the target RID is on the
    configured secondary bus, or forward the type 1 transaction to
    a subordinate bus bridge. With ARI, the upstream bridge
    from the endpoint needs to be configured as ARI enabled so that
    it forwards type 1 transactions to the secondary bus, as the
    SRIOV Routing IDs can extend into the 8-bit bus space (allowing
    up to 65535 virtual functions associated with a single PF).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Fri Feb 14 21:50:13 2025
    On Fri, 14 Feb 2025 19:51:44 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    -------------
    Update:

    I have figured out how to re-attach Guest Physical BAR back
    as MMI/O commands enter the top of a PCIe tree.

    Thanks to Scott Lurndal for being gentle with me.

    Here is an example topology from a Raptor Lake system:

    bus:dev.function (bus 0 is a traditional PCI bus)
    Region X is BAR X.

    These devices are all built into either the core or
    the PCH/southbridge.

    The first plug-in PCI card would reside on bus 4.

    Only intel systems provide or use the I/O port (legacy 8086) BARs.

    It is going to take me some time to dig through this.

    $ lspci -vvv | egrep "^[0-9]|Region "
    00:00.0 Host bridge: Intel Corporation Raptor Lake-S 8+12 - Host
    Bridge/DRAM Controller (rev 01)
    00:02.0 VGA compatible controller: Intel Corporation Raptor Lake-S GT1
    [UHD Graphics 770] (rev 04) (prog-if 00 [VGA controller])
    Region 0: Memory at 6000000000 (64-bit, non-prefetchable)
    [size=16M]
    Region 2: Memory at 4000000000 (64-bit, prefetchable)
    [size=256M]
    Region 4: I/O ports at 5000 [size=64]
    00:04.0 Signal processing controller: Intel Corporation Raptor Lake
    Dynamic Platform and Thermal Framework Processor Participant (rev 01)
    Region 0: Memory at 6001100000 (64-bit, non-prefetchable)
    [size=128K]
    00:08.0 System peripheral: Intel Corporation GNA Scoring Accelerator
    module (rev 01)
    Region 0: Memory at 600113b000 (64-bit, non-prefetchable)
    [disabled] [size=4K]
    00:14.0 USB controller: Intel Corporation Alder Lake-S PCH USB 3.2 Gen
    2x2 XHCI Controller (rev 11) (prog-if 30 [XHCI])
    Region 0: Memory at 6001120000 (64-bit, non-prefetchable)
    [size=64K]
    00:14.2 RAM memory: Intel Corporation Alder Lake-S PCH Shared SRAM (rev
    11)
    Region 0: Memory at 6001134000 (64-bit, non-prefetchable)
    [disabled] [size=16K]
    Region 2: Memory at 600113a000 (64-bit, non-prefetchable)
    [disabled] [size=4K]
    00:16.0 Communication controller: Intel Corporation Alder Lake-S PCH
    HECI Controller #1 (rev 11)
    Region 0: Memory at 6001139000 (64-bit, non-prefetchable)
    [size=4K]
    00:17.0 SATA controller: Intel Corporation Alder Lake-S PCH SATA
    Controller [AHCI Mode] (rev 11) (prog-if 01 [AHCI 1.0])
    Region 0: Memory at 70700000 (32-bit, non-prefetchable)
    [size=8K]
    Region 1: Memory at 70704000 (32-bit, non-prefetchable)
    [size=256]
    Region 2: I/O ports at 5080 [size=8]
    Region 3: I/O ports at 5088 [size=4]
    Region 4: I/O ports at 5060 [size=32]
    Region 5: Memory at 70703000 (32-bit, non-prefetchable)
    [size=2K]
    00:1a.0 PCI bridge: Intel Corporation Alder Lake-S PCH PCI Express Root
    Port #25 (rev 11) (prog-if 00 [Normal decode])
    00:1c.0 PCI bridge: Intel Corporation Alder Lake-S PCH PCI Express Root
    Port #3 (rev 11) (prog-if 00 [Normal decode])
    00:1c.3 PCI bridge: Intel Corporation Device 7abb (rev 11) (prog-if 00 [Normal decode])
    00:1f.0 ISA bridge: Intel Corporation Device 7a86 (rev 11)
    00:1f.3 Audio device: Intel Corporation Alder Lake-S HD Audio Controller
    (rev 11)
    Region 0: Memory at 6001130000 (64-bit, non-prefetchable)
    [size=16K]
    Region 4: Memory at 6001000000 (64-bit, non-prefetchable)
    [size=1M]
    00:1f.4 SMBus: Intel Corporation Alder Lake-S PCH SMBus Controller (rev
    11)
    Region 0: Memory at 6001138000 (64-bit, non-prefetchable)
    [size=256]
    Region 4: I/O ports at efa0 [size=32]
    00:1f.5 Serial bus controller: Intel Corporation Alder Lake-S PCH SPI Controller (rev 11)
    Region 0: Memory at 70702000 (32-bit, non-prefetchable)
    [size=4K]
    01:00.0 Non-Volatile memory controller: Sandisk Corp WD PC SN5000S M.2
    2230 NVMe SSD (DRAM-less) (prog-if 02 [NVM Express])
    Region 0: Memory at 70600000 (64-bit, non-prefetchable)
    [size=16K]
    02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller (rev 1b)
    Region 0: I/O ports at 4000 [size=256]
    Region 2: Memory at 70504000 (64-bit, non-prefetchable) [size=4
    Region 4: Memory at 70500000 (64-bit, non-prefetchable)
    [size=16K]
    03:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8852BE
    PCIe 802.11ax Wireless Network Controller
    Region 0: I/O ports at 3000 [size=256]
    Region 2: Memory at 70400000 (64-bit, non-prefetchable)
    [size=1M]

    The traditional PCI bus supports a 5-bit device # and a 3-bit function #.
    A PCIe bus supports either a 3-bit function number with the high-order
    five bits MBZ (i.e. only device 0 is legal on PCIe) or, if the function
    advertises the Alternative Routing-ID Interpretation (ARI) capability,
    an 8-bit function number.  (SRIOV leverages ARI to support dense routing
    IDs, but any bus that supports ARI can handle 256 physical functions.)

    PCI supports two forms of configuration space addresses in TLPs:
    Type 0 contains only a function number and the register address.
    Type 1 contains the bus number, function number and register address.

    PCIe segments go where ? Are they "picked off" prior to being routed
    down the tree ??

    A PCI-PCI bridge (such as the root complex port bridge) will translate
    a type 1 transaction to type 0 when the target RID is on the
    configured secondary bus, or forward the type 1 transaction to
    a subordinate bus bridge. With ARI, the upstream bridge
    from the endpoint needs to be configured as ARI enabled so that
    it forwards type 1 transactions to the secondary bus, as the
    SRIOV Routing IDs can extend into the 8-bit bus space (allowing
    up to 65535 virtual functions associated with a single PF).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)