• Re: Stacks, was Segments

    From Scott Lurndal@21:1/5 to John Levine on Sat Jan 18 16:30:20 2025
    John Levine <johnl@taugh.com> writes:
    According to MitchAlsup1 <mitchalsup@aol.com>:
    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Huh? Algol 60 routines could be mutually recursive so unless it was
    a leaf procedure or the outer block, everything not declared "own"
    went on the stack.

    For some flavors of Algol _everything_ was on the stack.
    (e.g. B5500 and successors).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Jan 18 17:40:00 2025
    On Sat, 18 Jan 2025 16:30:20 +0000, Scott Lurndal wrote:

    John Levine <johnl@taugh.com> writes:
    According to MitchAlsup1 <mitchalsup@aol.com>:
    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Huh? Algol 60 routines could be mutually recursive so unless it was
    a leaf procedure or the outer block, everything not declared "own"
    went on the stack.

    For some flavors of Algol _everything_ was on the stack.
    (e.g. B5500 and successors).

    1108 Algol had everything on the stack.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sat Jan 18 19:41:56 2025
    According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Huh? Algol 60 routines could be mutually recursive so unless it was
    a leaf procedure or the outer block, everything not declared "own"
    went on the stack.

    Mitch's point AIUI was that Algol 60 had no heap allocation (and no
    explicit pointer types), so indeed all data were either on the stack or statically allocated.

    It sounded to me like he said that dynamically sized arrays were on the
    stack, nothing else was. I think we agree that everything but "own"
    is on the stack.

    Algol 60 did need a heap because own arrays could have variable size.
    That wasn't an accident since sec 5.2.2 shows an example of a variable
    size own array. I suspect they didn't realize the implications both
    of resizing non-stack data, and what happens in an upper level call
    if a lower level call resizes the array underneath it.

    It wasn't the only mistake like that. Alan Perlis told me that they
    intended call by name to be an elegantly phrased definition of call
    by reference, and it wasn't until Jensen's device that they realized what
    they had actually done.
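
    Call by name means the actual parameter is re-evaluated, in the caller's
    environment, at every use of the formal; Jensen's device exploits that to
    pass a whole formula. A rough C sketch of the effect (the thunk/env
    machinery is purely illustrative, and the by-name index is approximated
    by a reference):

    #include <stdio.h>

    /* A "by name" parameter behaves like a thunk: every use of the formal
       re-evaluates the actual argument in the caller's environment. */
    struct env   { int i; double *a; };
    struct thunk { double (*eval)(struct env *); };

    static double eval_a_i(struct env *e) { return e->a[e->i]; }   /* a[i] */

    /* Jensen's device: both i and expr are "by name", so assigning to i
       inside sum() changes what expr evaluates to on the next use. */
    static double sum(struct env *e, int lo, int hi, struct thunk expr)
    {
        double s = 0.0;
        for (e->i = lo; e->i <= hi; e->i++)   /* "assign to the name i" */
            s += expr.eval(e);                /* re-evaluate expr each use */
        return s;
    }

    int main(void)
    {
        double a[5] = {1, 2, 3, 4, 5};
        struct env e = {0, a};
        struct thunk a_i = { eval_a_i };
        printf("%g\n", sum(&e, 0, 4, a_i));   /* sums a[0..4], prints 15 */
        return 0;
    }

    With plain call by reference, expr would have been evaluated once at the
    call site and the loop would just add the same value five times.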

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Niklas Holsti on Sun Jan 19 17:33:53 2025
    On 18/01/2025 09:59, Niklas Holsti wrote:
    On 2025-01-18 5:08, John Levine wrote:
    According to MitchAlsup1 <mitchalsup@aol.com>:
    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Huh?  Algol 60 routines could be mutually recursive so unless it was
    a leaf procedure or the outer block, everything not declared "own"
    went on the stack.


    Mitch's point AIUI was that Algol 60 had no heap allocation (and no
    explicit pointer types), so indeed all data were either on the stack or statically allocated.

    I'm not an English native speaker, but it seems to me that Mitch should
    have written "Algol 60 had only stack allocation" instead of "Algol 60
    only had stack allocation".

    The most-used Ada compiler, GNAT, uses a "secondary stack" to reduce the
    need for heap. Dynamically sized local data are placed on the secondary stack, and dynamically sized return values of functions are returned on
    the secondary stack. So a function can return "by value" an array sized
    1..N, with N a function parameter, without needing the heap.

    Of course the programmer then has the problem of setting sufficient
    sizes for /two/ stacks, the primary and the secondary. For
    embedded-systems programs one usually avoids constructs that would need
    a secondary stack.


    A two-stack setup can be used in C too. (The C standards don't require
    a stack at all.) On the AVR microcontroller, it is not uncommon for C implementations to work with a dual stack, since it does not have any
    kind of "[SP + n]" or "[SP + r]" addressing modes, but it /does/ have an
    "[Y + n]" addressing mode using an index register.

    Two stacks are also pretty much required for FORTH.

    The use of a dual stack could also significantly improve the security of systems by separating call/return addresses from data.
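
    A minimal C sketch of the two-stack idea, with invented names: a
    software-managed secondary (data) stack beside the hardware call stack,
    used here to return a dynamically sized array "by value" much as GNAT's
    secondary stack does. Sizes, alignment and the lack of overflow checking
    are all simplifications.

    #include <stddef.h>

    /* Software-managed secondary (data) stack; the hardware stack keeps
       only return addresses and small scalars. */
    static _Alignas(8) unsigned char data_stack[4096];
    static size_t data_sp = 0;                      /* grows upward */

    static void *ds_alloc(size_t n)                 /* "push" n bytes */
    {
        void *p = &data_stack[data_sp];
        data_sp += (n + 7) & ~(size_t)7;            /* keep 8-byte alignment */
        return p;
    }

    static void ds_release(size_t mark) { data_sp = mark; }   /* pop to mark */

    /* Return an array sized 1..n "by value": the result lives on the
       secondary stack and the caller cuts the stack back when done. */
    static int *iota(size_t n)
    {
        int *a = ds_alloc(n * sizeof *a);
        for (size_t i = 0; i < n; i++)
            a[i] = (int)(i + 1);
        return a;
    }

    int main(void)
    {
        size_t mark = data_sp;       /* caller remembers the mark */
        int *v = iota(10);           /* dynamically sized "return value" */
        int last = v[9];
        ds_release(mark);            /* release everything above the mark */
        return last == 10 ? 0 : 1;
    }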

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Sun Jan 19 18:28:40 2025
    On Sun, 19 Jan 2025 16:33:53 +0000, David Brown wrote:

    A two-stack setup can be used in C too. (The C standards don't require
    a stack at all.) On the AVR microcontroller, it is not uncommon for C implementations to work with a dual stack, since it does not have any
    kind of "[SP + n]" or "[SP + r]" addressing modes, but it /does/ have an
    "[Y + n]" addressing mode using an index register.

    Two stacks are also pretty much required for FORTH.

    The use of a dual stack could also significantly improve the security of systems by separating call/return addresses from data.

    In My 66000 the code cannot read/write that other stack with LD and ST instructions. It can only be accessed by ENTER (stores) and EXIT (LDs).
    The mapping PTE is marked RWE = 000.

    So, while you can still overrun buffers, you cannot damage the call/
    return stack or the preserved registers !!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to David Brown on Sun Jan 19 23:37:17 2025
    On 2025-01-19 18:33, David Brown wrote:
    On 18/01/2025 09:59, Niklas Holsti wrote:

    [...]


    The most-used Ada compiler, GNAT, uses a "secondary stack" to reduce
    the need for heap. Dynamically sized local data are placed on the
    secondary stack, and dynamically sized return values of functions are
    returned on the secondary stack. So a function can return "by value"
    an array sized 1..N, with N a function parameter, without needing the
    heap.

    Of course the programmer then has the problem of setting sufficient
    sizes for /two/ stacks, the primary and the secondary. For
    embedded-systems programs one usually avoids constructs that would
    need a secondary stack.


    A two-stack setup can be used in C too.  (The C standards don't require
    a stack at all.)  On the AVR microcontroller, it is not uncommon for C implementations to work with a dual stack, since it does not have any
    kind of "[SP + n]" or "[SP + r]" addressing modes, but it /does/ have an
    "[Y + n]" addressing mode using an index register.


    Yes. Other C compilers use a single stack but use Y as a frame pointer
    so they can use "[Y + n]" to access stack-frame locations.

    The issue is more acute for 8051/MCS-51 systems where the call/return
    stack is in the very small "internal" RAM, so C compilers often allocate
    a larger "SW stack" for stack data in the larger "external" RAM. But
    they do so only for potentially recursive or reentrant functions, and
    instead use statically allocated space for the call-frames of other
    functions (with smart whole-program analysis to share such space for
    functions that can never be active at the same time).
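
    A rough illustration of that "compiled stack" technique in C (the overlay
    is written out by hand here; a real 8051 toolchain derives the placement
    from the call graph):

    #include <stdint.h>

    /* Locals of non-reentrant functions live in static RAM, and functions
       that can never be active at the same time overlay the same bytes --
       shown as a hand-written union; the compiler/linker would generate
       this placement itself. */
    static union {
        struct { uint8_t buf[8]; uint8_t i; } f1_frame;   /* "frame" of f1() */
        struct { uint16_t acc;   uint8_t n; } f2_frame;   /* "frame" of f2() */
    } overlay;                  /* valid only because f1 and f2 never call
                                   each other, directly or indirectly */

    uint8_t f1(uint8_t x)
    {
        for (overlay.f1_frame.i = 0; overlay.f1_frame.i < 8; overlay.f1_frame.i++)
            overlay.f1_frame.buf[overlay.f1_frame.i] = (uint8_t)(x + overlay.f1_frame.i);
        return overlay.f1_frame.buf[7];
    }

    uint16_t f2(uint8_t n)
    {
        overlay.f2_frame.n   = n;
        overlay.f2_frame.acc = (uint16_t)(overlay.f2_frame.n * 3);
        return overlay.f2_frame.acc;
    }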

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Niklas Holsti on Mon Jan 20 09:00:43 2025
    On 19/01/2025 22:37, Niklas Holsti wrote:
    On 2025-01-19 18:33, David Brown wrote:
    On 18/01/2025 09:59, Niklas Holsti wrote:

       [...]


    The most-used Ada compiler, GNAT, uses a "secondary stack" to reduce
    the need for heap. Dynamically sized local data are placed on the
    secondary stack, and dynamically sized return values of functions are
    returned on the secondary stack. So a function can return "by value"
    an array sized 1..N, with N a function parameter, without needing the
    heap.

    Of course the programmer then has the problem of setting sufficient
    sizes for /two/ stacks, the primary and the secondary. For
    embedded-systems programs one usually avoids constructs that would
    need a secondary stack.


    A two-stack setup can be used in C too.  (The C standards don't
    require a stack at all.)  On the AVR microcontroller, it is not
    uncommon for C implementations to work with a dual stack, since it
    does not have any kind of "[SP + n]" or "[SP + r]" addressing modes,
    but it /does/ have an "[Y + n]" addressing mode using an index register.


    Yes. Other C compilers use a single stack but use Y as a frame pointer
    so they can use "[Y + n]" to access stack-frame locations.


    gcc for the AVR does that. I assume that it would be a massive effort
    to introduce a secondary data stack to gcc, whereas the original AVR
    port of gcc was much simpler at the cost of inefficiencies (basically
    the 32 8-bit registers were paired up and viewed as 16 16-bit registers,
    making the AVR appear like the 16-bit RISC processors that were already well
    supported, with peephole optimisations to reduce redundant operations
    after code generation).

    Other AVR compilers that were made from scratch, or from compilers that
    already had complicated stack setups (such as ones for the 8051 you
    mention below), were more likely to use a separate data stack.

    The efficiency advantages and disadvantages of these two arrangements
    are not clear-cut for the AVR - it depends a lot on the way the code is written.

    The issue is more acute for 8051/MCS-51 systems where the call/return
    stack is in the very small "internal" RAM, so C compilers often allocate
    a larger "SW stack" for stack data in the larger "external" RAM. But
    they do so only for potentially recursive or reentrant functions, and
    instead use statically allocated space for the call-frames of other
    functions (with smart whole-program analysis to share such space for functions that can never be active at the same time).


    Yes. This also applies to several other "brain-dead" 8-bit CISC
    architectures.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Michael S on Mon Jan 20 11:12:54 2025
    Michael S <already5chosen@yahoo.com> wrote:
    On Sun, 19 Jan 2025 18:28:40 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    In My 66000 the code cannot read/write that other stack with LD and ST
    instructions. It can only be accessed by ENTER (stores) and EXIT
    (LDs). The mapping PTE is marked RWE = 000.

    So, while you can still overrun buffers, you cannot damage the call/
    return stack or the preserved registers !!

    Not that I am a specialist in GC, but according to my understanding
    the most common and the best performing variants of GC can not work
    without read access to preserved registers. Compacting collector seems
    to need write access as well.
    As to return addresses, I would think that read access to stack of
    return addresses is necessary for exception handling.

    _Correctness_ of GC depends on the ability to see preserved registers
    and the return address: a return address may be the only live reference
    to some function, and similarly for preserved registers. One could
    try to work around the lack of access using a separate software-managed
    stack duplicating data from the "hardware" stack, but that is ugly
    and is likely to kill any performance advantage from the hardware
    features.

    BTW, the same holds for debuggers and exception handling. Those
    clearly need some way to get around the hardware limitations.
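
    The point can be made concrete with a conservative root scan: the
    collector treats the saved-register / return-address area as part of the
    root set, so it must at least be able to read those words (and a
    compacting collector must be able to rewrite them). A hedged C sketch;
    heap_contains() and mark_object() are hypothetical hooks assumed to be
    supplied by the runtime:

    #include <stdint.h>

    extern int  heap_contains(uintptr_t word);   /* hypothetical allocator hook */
    extern void mark_object(uintptr_t word);     /* hypothetical collector hook */

    /* Conservative root scan: every word in [lo, hi) that looks like a
       pointer into the heap keeps its object alive.  If this region is
       the saved-register / return-address stack and the hardware forbids
       loads from it (RWE = 000), the collector simply cannot do this. */
    void scan_roots(const uintptr_t *lo, const uintptr_t *hi)
    {
        for (const uintptr_t *p = lo; p < hi; p++) {
            uintptr_t word = *p;                 /* needs read access */
            if (heap_contains(word))
                mark_object(word);               /* a compactor would also
                                                    need to store back the
                                                    relocated address here */
        }
    }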

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Mon Jan 20 12:55:37 2025
    On Sun, 19 Jan 2025 18:28:40 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    In My 66000 the code cannot read/write that other stack with LD and ST instructions. It can only be accessed by ENTER (stores) and EXIT
    (LDs). The mapping PTE is marked RWE = 000.

    So, while you can still overrun buffers, you cannot damage the call/
    return stack or the preserved registers !!

    Not that I am a specialist in GC, but according to my understanding
    the most common and best-performing variants of GC cannot work
    without read access to preserved registers. A compacting collector seems
    to need write access as well.
    As to return addresses, I would think that read access to the stack of
    return addresses is necessary for exception handling.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Waldek Hebisch on Mon Jan 20 22:05:10 2025
    On Mon, 20 Jan 2025 11:12:54 +0000, Waldek Hebisch wrote:

    Michael S <already5chosen@yahoo.com> wrote:
    On Sun, 19 Jan 2025 18:28:40 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    In My 66000 the code cannot read/write that other stack with LD and ST
    instructions. It can only be accessed by ENTER (stores) and EXIT
    (LDs). The mapping PTE is marked RWE = 000.

    So, while you can still overrun buffers, you cannot damage the call/
    return stack or the preserved registers !!

    Not that I am a specialist in GC, but according to my understanding
    the most common and the best performing variants of GC can not work
    without read access to preserved registers. Compacting collector seems
    to need write access as well.
    As to return addresses, I would think that read access to stack of
    return addresses is necessary for exception handling.

    _Correctness_ of GC depends on ability to see preserved registers
    and return address: return address may be the only live reference
    to some function and similarly for preserved registers. One could
    try to work around lack of access using separate software-managed
    stack duplicating data from "hardware" stack, but that is ugly
    and is likely to kill any performance advantage from hardware
    features.

    BTW, the same holds for debuggers and exception handling. Those
    clearly need some way to go around hardware limitations.

    Yes, there is a way to do all those things, but I am not in a position
    to discuss it due to USPTO rules.

    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe meso-privileged ?!?}}

    I/O works similarly--in that to the application a page may be marked
    RWE=001 (execute only) but the swap disk is allowed to read or write
    those pages.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Tue Jan 21 01:25:19 2025
    On Mon, 20 Jan 2025 22:05:10 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 20 Jan 2025 11:12:54 +0000, Waldek Hebisch wrote:

    Michael S <already5chosen@yahoo.com> wrote:
    On Sun, 19 Jan 2025 18:28:40 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    In My 66000 the code cannot read/write that other stack with LD
    and ST instructions. It can only be accessed by ENTER (stores)
    and EXIT (LDs). The mapping PTE is marked RWE = 000.

    So, while you can still overrun buffers, you cannot damage the
    call/ return stack or the preserved registers !!

    Not that I am a specialist in GC, but according to my understanding
    the most common and the best performing variants of GC can not work
    without read access to preserved registers. Compacting collector
    seems to need write access as well.
    As to return addresses, I would think that read access to stack of
    return addresses is necessary for exception handling.

    _Correctness_ of GC depends on ability to see preserved registers
    and return address: return address may be the only live reference
    to some function and similarly for preserved registers. One could
    try to work around lack of access using separate software-managed
    stack duplicating data from "hardware" stack, but that is ugly
    and is likely to kill any performance advantage from hardware
    features.

    BTW, the same holds for debuggers and exception handling. Those
    clearly need some way to go around hardware limitations.

    Yes, there is a way to do all those things, but I am not in a position
    to discuss due to USPTO rules.

    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe meso-privileged ?!?


    Call it 'user'. Then rename the level that you now call 'application'
    to 'sandbox'.

    I/O works similarly--in that to the application a page may be marked
    RWE=001 (execute only) but the swap disk is allowed to read or write
    those pages.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Tue Jan 21 00:17:56 2025
    On Mon, 20 Jan 2025 23:25:19 +0000, Michael S wrote:

    On Mon, 20 Jan 2025 22:05:10 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 20 Jan 2025 11:12:54 +0000, Waldek Hebisch wrote:

    Michael S <already5chosen@yahoo.com> wrote:
    On Sun, 19 Jan 2025 18:28:40 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    In My 66000 the code cannot read/write that other stack with LD
    and ST instructions. It can only be accessed by ENTER (stores)
    and EXIT (LDs). The mapping PTE is marked RWE = 000.

    So, while you can still overrun buffers, you cannot damage the
    call/ return stack or the preserved registers !!

    Not that I am a specialist in GC, but according to my understanding
    the most common and the best performing variants of GC can not work
    without read access to preserved registers. Compacting collector
    seems to need write access as well.
    As to return addresses, I would think that read access to stack of
    return addresses is necessary for exception handling.

    _Correctness_ of GC depends on ability to see preserved registers
    and return address: return address may be the only live reference
    to some function and similarly for preserved registers. One could
    try to work around lack of access using separate software-managed
    stack duplicating data from "hardware" stack, but that is ugly
    and is likely to kill any performance advantage from hardware
    features.

    BTW, the same holds for debuggers and exception handling. Those
    clearly need some way to go around hardware limitations.

    Yes, there is a way to do all those things, but I am not in a position
    to discuss due to USPTO rules.

    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe
    meso-privileged ?!?


    Call it 'user'. Then rename the level that you now call 'application'
    to 'sandbox'.

    Realistically--there are 3 levels in each privilege layer::

    least) sandbox--for Jitted code
    medium) {user, JIT, Dynamic library, ...}
    higher) {debug, GC, Exception, interrupt, Dynamic loader, device DMA, ...}
    {{none of which need access to other-than-user VAS, or other-than-user privileges}}

    All sharing a single address space, and a software stack of supervision, interrupt table, file-ids, socket-ids,...

    The higher level of privilege allows this level to disobey the permissions
    in the PTE (possibly under a flag from ROOT).

    So, while sandbox is a fine name for the least privileged running
    environment, we still need a name for the medium level. It is almost
    like the higher level is a good portion of the GuestOS kernel--those parts requiring no privilege in any normal sense.

    I/O works similarly--in that to the application a page may be marked
    RWE=001 (execute only) but the swap disk is allowed to read or write
    those pages.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Tue Jan 21 06:21:36 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe meso-privileged ?!?

    Authorized? Or a numbering system for different privilege levels,
    like it was used for rings?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bill Findlay@21:1/5 to All on Tue Jan 21 10:36:04 2025
    On 20 Jan 2025, MitchAlsup1 wrote
    (in article<43e21bd0bddea1733cd672c07a6319d4@www.novabbs.org>):

    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe meso-privileged ?!?

    Entitled? 8-)
    --
    Bill Findlay

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Bill Findlay on Tue Jan 21 17:49:13 2025
    On Tue, 21 Jan 2025 10:36:04 +0000, Bill Findlay wrote:

    On 20 Jan 2025, MitchAlsup1 wrote
    (in article<43e21bd0bddea1733cd672c07a6319d4@www.novabbs.org>):

    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe
    meso-privileged ?!?

    Entitled? 8-)

    Not bad, not bad at all ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Wed Feb 5 12:11:57 2025
    MitchAlsup1 wrote:
    On Mon, 3 Feb 2025 22:47:24 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 3 Feb 2025 21:13:24 +0000, Scott Lurndal wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    It is like there is a privilege level between application and
    GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe
    meso-privileged ?!?

    handyman?

    Application -> Library -> OS -> Hypervisor -> Secure Monitor


    {Sandbox -> user -> application -> Library} ->{sual}×{GuestOS, HV, SM}

    ??

    You need to precisely define your terms. What are sandbox
    and user in this context?

    It is all about manipulating access rights without modifying
    what is stored in the TLB (so you don't have to reload any
    entries to change access rights.) It is sort of like what
    the G-bit does (global) {except in my architecture globality
    is controlled by ASID.}

    Sandbox is a privilege level where one cannot be granted both
    write and execute access at the same time. There may be other
    restrictions, too; such as access to control registers that user may
    be allowed to write.

    Library would include all the trusted stuff, but also ld.so
    and any JITs. JITs can only create code for sandboxes. So,
    JIT can write to JITcache but sandbox cannot, using the same
    PTE entry. ld.so can write GOT while user and application
    cannot write GOT (or execute GOT).

    User is the privilege level where sandbox does not apply but
    also there is no ability to over-access things protected by
    PTE.RWE.

    Application is a privilege level where PTE.RWE can sometimes
    be usurped--such as DMA from a device needing to write into
    an execute-only page.

    Where does memmove() come from if not the library ??

    Libraries have a SW-kind of trust even if they are
    devoid of HW kinds of trust (PTE.RWE overrides).

    But these levels are just talking points at this point.

    It sounds like you want something like the VAX privilege/protection mechanism.
    It had 4 privilege levels: User, Supervisor, Executive, Kernel.
    Each PTE grants R, RW or na (no-access) rights for each priv level.
    (Read access implied Execute)
    Naively that would take 4*2 = 8 bits in each 32-bit PTE.

    However they reduce the combinations with a simple set of rules:
    - if any priv level has read access then higher levels have read also.
    - if any priv level has write access then higher levels have write also.

    That brings the PTE access control field down to 4 bits for all
    four priv levels.

    For comparison, x64 PTE has 3 bits for 2 priv levels.


    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If we apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE, RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
    with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R R
    na na RE RE
    ...
    R R R R
    ...
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different threads.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Wed Feb 5 14:55:14 2025
    EricP wrote:

    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE, RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
    with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R R
    na na RE RE
    ....
    R R R R
    ....
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different threads.

    Or if you want the flexibility to choose your own REW combinations,
    the 4-bit PTE access control field is an index to a 16 entry array
    of 12-bit values for the four privilege levels.

    That's better because then the OS can decide how it wants
    the different memory sections and threads to behave and
    removes the strict hardwired hierarchy of the prior rules.

    The next problem though might be finding 4 bits in the PTE.

    (That is how I see the PTE's 2 or 3 Cache Control bits to work.
    Also there are separate CC lookup tables for interior table PTE's
    and leaf table PTE's entries.)
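
    In C terms the lookup might work as sketched below: the 4-bit access
    control field from the PTE indexes a 16-entry, 12-bit-wide table, and the
    2-bit privilege mode carried with the access selects one 3-bit R-E-W
    group. Table contents, bit layout and mode numbering are illustrative
    only, not anyone's actual encoding.

    #include <stdint.h>

    #define P_R 4u
    #define P_E 2u
    #define P_W 1u            /* "na" is simply 0 */

    /* 16-entry x 12-bit programmable table (an SRAM in the MMU): entry
       bits [3*mode+2 .. 3*mode] hold the R-E-W group for privilege mode
       'mode', here with mode 0 = Usr (least) and mode 3 = Krn (most).
       Only a few rows are filled in; the rest default to all-na. */
    static const uint16_t ac_table[16] = {
        [0]  =  P_R            << 9,                   /* na na na R    */
        [1]  = (P_R|P_E)       << 9,                   /* na na na RE   */
        [2]  = (P_R|P_W)       << 9,                   /* na na na RW   */
        [3]  = (P_R|P_E|P_W)   << 9,                   /* na na na REW  */
        [15] = ((P_R|P_E|P_W) << 9) | ((P_R|P_E|P_W) << 6) |
               ((P_R|P_E|P_W) << 3) |  (P_R|P_E|P_W),  /* REW REW REW REW */
    };

    /* What the MMU does per access: pte_ac is the 4-bit field from the
       PTE, mode is the 2-bit privilege level carried with the memref. */
    static inline unsigned allowed_rew(unsigned pte_ac, unsigned mode)
    {
        return (ac_table[pte_ac & 0xFu] >> (3u * (mode & 3u))) & 7u;
    }

    The PTE then carries only the 4-bit index, and changing a thread's
    privilege mode needs no PTE or TLB update.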

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Feb 5 21:31:05 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 3 Feb 2025 22:47:24 +0000, Scott Lurndal wrote:



    User is the privilege level where sandbox does not apply but
    also there is no ability to over-access things protected by
    PTE.RWE.

    Application is a privilege level where PTE.RWE can sometimes
    be usurped--such as DMA from a device needing to write into
    a execute only page.


    For most modern server CPUs (Intel/AMD/ARM) that is the
    responsibility of the IOMMU, not the processor/core/thread.

    Where does memmove() come from if not the library ??

    Some applications roll their own. In higher level languages,
    such as C++, explicit calls to memmove are rare to non-existent
    (the standard C++ library and compiler handle data movement).


    Libraries have a SW-kind of trust even if they are
    devoid of HW kinds of trust (PTE.RWE overrides).

    Libraries are easy to usurp in many systems (e.g. with LD_PRELOAD);
    precautions are in place to prevent such interpositions
    for applications with security constraints (e.g. installed with
    enhanced capabilities or with UID==0).


    But these levels are just talking point at this point.

    The hypervisor is optional, as would be a library.

    It cannot be a library of process !!

    Why not? See either Burroughs or HP-3000 for examples
    of libraries as first-class objects with independent
    security contexts.

    It is not a library of GuestOS !
    it is certainly not a library of Secure Monitor !!

    Why should such code not be able to leverage all the
    advantages of libraries, given suitable security controls?



    The Burroughs Large systems and HP-3000 segmented libraries
    were distinct entities with attributes.

    And could change (update/upgrade) the library while the process
    was running !!

    Under certain well-defined conditions, yes.


    Code in a library could be more privileged than the application
    when acting on behalf of the application, for example; but the
    application could not take advantage of the permissions assigned
    to the library it was linked with without using interfaces
    provided by the library.

    No disagreement.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Wed Feb 5 23:36:58 2025
    On Wed, 5 Feb 2025 17:11:57 +0000, EricP wrote:

    MitchAlsup1 wrote:

    But these levels are just talking point at this point.

    It sounds you want something like the VAX privilege/protection
    mechanism.
    It had 4 privilege levels: User, Supervisor, Executive, Kernel.
    Each PTE grants R, RW or na (no-access) rights for each priv level.
    (Read access implied Execute)
    Naively that would take 4*2 = 8 bits in each 32-bit PTE.

    However they reduce the combinations with a simple set of rules:
    - if any priv level has read access then higher levels have read also.
    - if any priv level has write access then higher levels have write also.

    That brings the PTE access control field down to 4-bits for all
    for priv levels.

    Yes, but in VAX's time we did not have applications that did not want
    the OS to look at their data (banking, video streaming, ...) or a
    massive number of attackers causing an increased demand for protection
    {even to the point of resurrecting capability machines (CHERI)}.

    For comparison, x64 PTE has 3 bits for 2 priv levels.


    EricP: thank you for the following thoughts.

    On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:

    EricP wrote:

    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE, RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
    with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R R
    na na RE RE
    ....
    R R R R
    ....
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different
    threads.

    Or if you want the flexibility to choose your own REW combinations,
    the 4-bit PTE access control field is an index to a 16 entry array
    of 12-bit values for the four privilege levels.

    That's better because then the OS can decide how it wants
    the different memory sections and thread to behave and
    removes the strict hardwired hierarchy of the prior rules.

    The next problem though might be finding 4 bits in the PTE.

    Another PTE bit I can find. Placing the 16×12 vector is more difficult,
    even when I position it as 4 places of 16×3.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Thu Feb 6 11:41:45 2025
    MitchAlsup1 wrote:
    On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:

    EricP wrote:

    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE, RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
    with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R R
    na na RE RE
    ....
    R R R R
    ....
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different
    threads.

    Or if you want the flexibility to choose your own REW combinations,
    the 4-bit PTE access control field is an index to a 16 entry array
    of 12-bit values for the four privilege levels.

    That's better because then the OS can decide how it wants
    the different memory sections and thread to behave and
    removes the strict hardwired hierarchy of the prior rules.

    The next problem though might be finding 4 bits in the PTE.

    Another PTE bit I can find. Placing the 16×12 vector is more difficult,
    even when I position it as 4 places of 16×3.

    I don't understand what you said.
    The 4-bit Access Control (AC) field is in the PTE.

    The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
    is in the MMU.

    The core's 2-bit mode selects-muxes one of the 3-bit allowed access fields
    from the indexed 12-bits to extract the 3 R-E-W bits.

    The 2-bit mode comes from the LD/ST uOp, which was set to the mode active
    when the instruction was decoded (so it can pipeline mode changes).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Feb 6 17:13:15 2025
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:

    EricP wrote:

    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE, RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
    with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R R
    na na RE RE
    ....
    R R R R
    ....
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different
    threads.

    Or if you want the flexibility to choose your own REW combinations,
    the 4-bit PTE access control field is an index to a 16 entry array
    of 12-bit values for the four privilege levels.

    That's better because then the OS can decide how it wants
    the different memory sections and thread to behave and
    removes the strict hardwired hierarchy of the prior rules.

    The next problem though might be finding 4 bits in the PTE.

    Another PTE bit I can find. Placing the 16×12 vector is more difficult,
    even when I position it as 4 places of 16×3.

    I don't understand what you said.
    The 4-bit Access Control (AC) field is in the PTE.

    Currently, PTE uses a 3-bit access control field, and PTE has
    2-bits spare. So making access control larger is easy.

    The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
    is in the MMU.

    How does the 16×12 table get to the MMU (or I/O MMU) in such a way
    that super cannot see things Hyper can see, and the same with secure ??
    So, somewhere in the various control blocks I need to find space without changing
    the overall use pattern of the control blocks and tables. Which is
    why I alluded to 4×16×3: each interpretation of the 4-bit access control
    is stored in its own natural place. It also means each layer can apply
    its own interpretation (mapping).

    The core's 2-bit mode selects-muxes one of the 3-bit allowed access
    fields from the indexed 12-bits to extract the 3 R-E-W bits.

    That much is straightforward.

    The 2-bit mode comes from the LD/ST uOp, which was set to the mode
    active when the instruction was decoded (so it can pipeline mode
    changes).

    Yes, core-state index follows the memref down the pipe.
    Core-state index is written into MMI/O/device control block for the
    DMA portion of a command, other CD indexes are associated with I/O
    page faults and device errors.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Thu Feb 6 13:51:12 2025
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:

    EricP wrote:

    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE, RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
    with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R R
    na na RE RE
    ....
    R R R R
    ....
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different
    threads.

    Or if you want the flexibility to choose your own REW combinations,
    the 4-bit PTE access control field is an index to a 16 entry array
    of 12-bit values for the four privilege levels.

    That's better because then the OS can decide how it wants
    the different memory sections and thread to behave and
    removes the strict hardwired hierarchy of the prior rules.

    The next problem though might be finding 4 bits in the PTE.

    Another PTE bit I can find. Placing the 16×12 vector is more difficult,
    even when I position it as 4 places of 16×3.

    I don't understand what you said.
    The 4-bit Access Control (AC) field is in the PTE.

    Currently, PTE uses a 3-bit access control field, and PTE has
    2-bits spare. So making access control larger is easy.

    The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
    is in the MMU.

    How does 16×12 get to the MMU (or I/O MMU) ?? in such a way that super cannot see things Hyper can see and the same with secure. So, somewhere
    in the various control blocks I need to find space without changing
    the overall use pattern of the control blocks and tables. Which is
    why I alluded to 4×16×3 each interpretation of the 4-bit access control
    is stored in its own natural place. It also means each layer can apply
    its own interpretation (mapping).

    It's just an SRAM loaded by the boot ROM before the Hypervisor boots.
    The super-secure version of boot ROM loads a table with values
    (Sandbox, User, Kernel, Hypervisor):

    Snd Usr Krn Hyp
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R na
    na na RE na
    na na RW na
    na na REW na
    na R R na
    na RE RE na
    ...
    REW REW REW na

    which grants mode 0 (Hyp) no direct RW access to any memory outside itself.
    Boot ROM sets an optional table lock so even hypervisor cannot later
    grant itself access permission to less priv memory by changing the table.
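
    A hedged sketch of that boot-time sequence, with invented names; the
    essential part is the write-once lock, which is what stops a later
    hypervisor from re-granting itself access:

    #include <stdint.h>
    #include <stdbool.h>

    /* Invented model of the MMU-side table and its write-once lock. */
    static uint16_t ac_table[16];
    static bool     ac_table_locked = false;

    /* Run from boot ROM, before the hypervisor is entered.  The policy
       passed in would have every row's Hyp group set to "na". */
    void bootrom_load_ac_table(const uint16_t policy[16])
    {
        for (int i = 0; i < 16; i++)
            ac_table[i] = policy[i];
        ac_table_locked = true;          /* optional lock set by the ROM */
    }

    /* Any later attempt to rewrite the table -- even from the hypervisor --
       is refused, so HV cannot grant itself access to less privileged
       memory by editing the table. */
    bool write_ac_table_entry(unsigned idx, uint16_t value)
    {
        if (ac_table_locked)
            return false;                /* locked: write ignored */
        ac_table[idx & 0xFu] = value;
        return true;
    }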

    The core's 2-bit mode selects-muxes one of the 3-bit allowed access
    fields from the indexed 12-bits to extract the 3 R-E-W bits.

    That much is straightforward.

    The 2-bit mode comes from the LD/ST uOp, which was set to the mode
    active when the instruction was decoded (so it can pipeline mode
    changes).

    Yes, core-state index follows the memref down the pipe.
    Core-state index is written into MMI/O/device control block for the
    DMA portion of a command, other CD indexes are associated with I/O
    page faults and device errors.

    Not sure how this would work with device IO and DMA.
    Say a secure kernel owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV loses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel, it can only page that guest with its permission.
    Hmmm...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to EricP on Thu Feb 6 12:06:31 2025
    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:

    EricP wrote:

    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE, RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
      with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na  na  na  R
    na  na  na  RE
    na  na  na  RW
    na  na  na  REW
    na  na  R   R
    na  na  RE  RE
    ....
    R   R   R   R
    ....
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the
    pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different
    threads.

    Or if you want the flexibility to choose your own REW combinations,
    the 4-bit PTE access control field is an index to a 16 entry array
    of 12-bit values for the four privilege levels.

    That's better because then the OS can decide how it wants
    the different memory sections and thread to behave and
    removes the strict hardwired hierarchy of the prior rules.

    The next problem though might be finding 4 bits in the PTE.

    Another PTE bit I can find. Placing the 16×12 vector is more difficult,
    even when I position it as 4 places of 16×3.

    I don't understand what you said.
    The 4-bit Access Control (AC) field is in the PTE.

    Currently, PTE uses a 3-bit access control field, and PTE has
    2-bits spare. So making access control larger is easy.

    The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
    is in the MMU.

    How does 16×12 get to the MMU (or I/O MMU) ?? in such a way that super
    cannot see things Hyper can see and the same with secure. So, somewhere
    in the various control blocks I need to find space without changing
    the overall use pattern of the control blocks and tables. Which is
    why I alluded to 4×16×3 each interpretation of the 4-bit access control
    is stored in its own natural place. It also means each layer can apply
    its own interpretation (mapping).

    Its just an SRAM loaded by the boot ROM before the Hypervisor boots.
    The super-secure version of boot ROM loads a table with values
    (Sandbox, User, Kernel, Hypervisor):

    Snd Usr Krn Hyp
    na  na  na  R
    na  na  na  RE
    na  na  na  RW
    na  na  na  REW
    na  na  R   na
    na  na  RE  na
    na  na  RW  na
    na  na  REW na
    na  R   R   na
    na  RE  RE  na
    ...
    REW REW REW na

    which grants mode 0 (Hyp) no direct RW access to any memory outside itself. Boot ROM sets an optional table lock so even hypervisor cannot later
    grant itself access permission to less priv memory by changing the table.

    The core's 2-bit mode selects-muxes one of the 3-bit allowed access
    fields from the indexed 12-bits to extract the 3 R-E-W bits.

    That much is straightforward.

    The 2-bit mode comes from the LD/ST uOp, which was set to the mode
    active when the instruction was decoded (so it can pipeline mode
    changes).

    Yes, core-state index follows the memref down the pipe.
    Core-state index is written into MMI/O/device control block for the
    DMA portion of a command, other CD indexes are associated with I/O
    page faults and device errors.

    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret clearance). The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV looses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about IOMMU addresses, are
    you talking about memory addresses or disk addresses? ISTM that
    protecting memory of lower privileged programs is useless if a higher privileged program can force a page out to disk, then can read the data
    from the disk drive itself. Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Feb 6 20:49:39 2025
    On Thu, 6 Feb 2025 18:51:12 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -----------------------------------

    Currently, PTE uses a 3-bit access control field, and PTE has
    2-bits spare. So making access control larger is easy.

    The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
    is in the MMU.

    How does 16×12 get to the MMU (or I/O MMU) ?? in such a way that super
    cannot see things Hyper can see and the same with secure. So, somewhere
    in the various control blocks I need to find space without changing
    the overall use pattern of the control blocks and tables. Which is
    why I alluded to 4×16×3 each interpretation of the 4-bit access control
    is stored it its own natural place. It also means each layer can apply
    its own interpretation (mapping).

    Its just an SRAM loaded by the boot ROM before the Hypervisor boots.
    The super-secure version of boot ROM loads a table with values
    (Sandbox, User, Kernel, Hypervisor):

    Snd Usr Krn Hyp
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R na
    na na RE na
    na na RW na
    na na REW na
    na R R na
    na RE RE na
    ....
    REW REW REW na

    which grants mode 0 (Hyp) no direct RW access to any memory outside
    itself.
    Boot ROM sets an optional table lock so even hypervisor cannot later
    grant itself access permission to less priv memory by changing the
    table.

    The core's 2-bit mode selects-muxes one of the 3-bit allowed access
    fields from the indexed 12-bits to extract the 3 R-E-W bits.

    That much is straightforward.

    The 2-bit mode comes from the LD/ST uOp, which was set to the mode
    active when the instruction was decoded (so it can pipeline mode
    changes).

    Yes, core-state index follows the memref down the pipe.
    Core-state index is written into MMI/O/device control block for the
    DMA portion of a command, other CD indexes are associated with I/O
    page faults and device errors.

    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).

    In this case, HV needs to know its limitations and not access the
    device nor the secure memory. Probably by taking the memory out
    of the pool it "does normal stuff with" and taking the device out
    of its list of accessible devices (at least for a while).

    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    I am thinking more like SR-IOV where HV loans a virtual device to a
    Guest OS, and Guest OS performs the I/O request, HV and SM are only
    there to deal with HV page faults and device errors. If an HV page
    fault occurs (which it will) a pretty secure corner of HV will
    construct a PTE mapping that/those page[s] only to page the missing
    page into memory so I/O can proceed. HV will then have to dismantle
    said mapping after the page arrives to restart device DMA.

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV looses control of any
    core or IOMMU PTE's that map that frame until it is handed back.
    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    Yes, exactly. No normal access to the page, only swap access is allowed--although this can be alleviated by not paging secure
    memory:: then HV just knows nothing about that/those pages until
    the process terminates, at which point it can put the pages back in the
    normal pool after cleaning them out.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Stephen Fuld on Thu Feb 6 16:53:27 2025
    Stephen Fuld wrote:
    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 5 Feb 2025 19:55:14 +0000, EricP wrote:

    EricP wrote:

    =====================================
    For the present day we would want REW access control.
    Naively this would require 4*3 = 12 bits in each PTE.

    If apply the rules:
    - we only need a meaningful subset of combinations: na, R, RE,
    RW, REW.
    - no higher priv level can have less access than a lower priv level.
    - we can save 1 combo because all 4 priv levels = na is redundant
    with the PTE Present bit being clear.

    we can get this all down to a 4-bit PTE field:

    Usr Sup Exc Krn
    --- --- --- ---
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R R
    na na RE RE
    ....
    R R R R
    ....
    REW REW REW REW

    The core's (thread's) privilege mode would enable access to the
    pages.
    The PTE's access control field, which is derived from the kind of
    mapped memory section, would not have to change between different
    threads.

    Or if you want the flexibility to choose your own REW combinations,
    the 4-bit PTE access control field is an index to a 16 entry array
    of 12-bit values for the four privilege levels.

    That's better because then the OS can decide how it wants
    the different memory sections and thread to behave and
    removes the strict hardwired hierarchy of the prior rules.

    The next problem though might be finding 4 bits in the PTE.

    Another PTE bit I can find. Placing the 16×12 vector is more
    difficult,
    even when I position it as 4 places of 16×3.

    I don't understand what you said.
    The 4-bit Access Control (AC) field is in the PTE.

    Currently, PTE uses a 3-bit access control field, and PTE has
    2-bits spare. So making access control larger is easy.

    The 16 row x 12-bit AC-to-allowed-access programmable HW lookup table
    is in the MMU.

    How does 16×12 get to the MMU (or I/O MMU) ?? in such a way that super
    cannot see things Hyper can see and the same with secure. So, somewhere
    in the various control blocks I need to find space without changing
    the overall use pattern of the control blocks and tables. Which is
    why I alluded to 4×16×3 each interpretation of the 4-bit access control
    is stored in its own natural place. It also means each layer can apply
    its own interpretation (mapping).

    Its just an SRAM loaded by the boot ROM before the Hypervisor boots.
    The super-secure version of boot ROM loads a table with values
    (Sandbox, User, Kernel, Hypervisor):

    Snd Usr Krn Hyp
    na na na R
    na na na RE
    na na na RW
    na na na REW
    na na R na
    na na RE na
    na na RW na
    na na REW na
    na R R na
    na RE RE na
    ...
    REW REW REW na

    which grants mode 0 (Hyp) no direct RW access to any memory outside
    itself.
    Boot ROM sets an optional table lock so even hypervisor cannot later
    grant itself access permission to less priv memory by changing the table.

    The core's 2-bit mode selects-muxes one of the 3-bit allowed access
    fields from the indexed 12-bits to extract the 3 R-E-W bits.

    That much is straightforward.

    The 2-bit mode comes from the LD/ST uOp, which was set to the mode
    active when the instruction was decoded (so it can pipeline mode
    changes).

    Yes, core-state index follows the memref down the pipe.
    Core-state index is written into MMI/O/device control block for the
    DMA portion of a command, other CD indexes are associated with I/O
    page faults and device errors.

    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV loses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about IOMMU addresses, are
    you talking about memory addresses or disk addresses? ISTM that
    protecting memory of lower privileged programs is useless if a higher privileged program can force a page out to disk, then can read the data
    from the disk drive itself. Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.

    I'm just kinda free associating what the consequences of restricting
    the HV's access to lesser privilege levels might be.

    As I understand it, the AMD secure HV approach is that memory owned by a
    guest kernel and its applications is encrypted and only the guest kernel
    has the key. Memory content is only decrypted while inside the core.
    As the key is only stored inside that guest kernel memory there is
    no way for HV to get at it.

    So it doesn't matter that the HV has access to guest memory because it
    can only see encrypted memory values. Presumably such data is encrypted
    on disk so intercepting a DMA gives you nothing.

    But it doesn't look like blocking HV access to the guest kernel, user,
    or sandbox memory accomplishes the same security because the HV can
    always diddle its own page tables to grant itself access.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Fri Feb 7 02:39:06 2025
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:

    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -------------------
    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV looses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about I0MMU addresses, are
    you talking about memory addresses or disk addresses?

    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed. Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    In my block diagrams of HostBridge, I show I/O MMU only on the
    receiving side of PCIe transport links. all the outbound traffic
    has already been translated by a core MMU, unless one allows
    a device to send commands to another device.

    I/O MMU sees the virtual address of where DMA is accessing, translating accordingly.

    I/O MMU sees the virtual address of MSI-X interrupts, page faults and
    errors.

    ISTM that
    protecting memory of lower privileged programs is useless if a higher privileged program can force a page out to disk, then can read the data
    from the disk drive itself.

    Protecting a process without privilege from a process WITH privilege
    requires more than a little trust in the privileged process(s).
    This is why there is a Secure Monitor over HyperVisor to take HV out
    of the control loop for "secure stuff". By assuming the duties of
    HV wrt accessing unprivileged memory or storage, SM minimizes the
    footprint where trust is required.

    Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Feb 7 02:53:29 2025
    On Thu, 6 Feb 2025 21:53:27 +0000, EricP wrote:

    Stephen Fuld wrote:
    -----------------

    I am a little confused here. When you talk about I0MMU addresses, are
    you talking about memory addresses or disk addresses? ISTM that
    protecting memory of lower privileged programs is useless if a higher
    privileged program can force a page out to disk, then can read the data
    from the disk drive itself. Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.

    I'm just kinda free associating what the consequences of restricting
    the HV's access to lesser privilege levels might be.

    As I understand it, the AMD secure HV approach is that memory owned by a guest kernel and its applications is encrypted and only the guest kernel
    has the key. Memory content is only decrypted while inside the core.
    As the key is only stored inside that guest kernel memory there is
    no way for HV to get at it.

    Interesting, so you can see the data, you just can't interpret the
    bit-patterns. This still leaves the door open for malicious action.

    So it doesn't matter that the HV has access to guest memory because it
    can only see encrypted memory values. Presumably such data is encrypted
    on disk so intercepting a DMA gives you nothing.

    It matters if HV can access (especially modify) what is in storage
    (not memory), making it impossible for the secure process to use
    its own data !

    But it doesn't look like blocking HV access to the guest kernel, user,
    or sandbox memory accomplishes the same security because the HV can
    always diddle its own page tables to grant itself access.

    My 66000 does it differently. HV can create a PTE that translates
    "anywhere", but cannot use the VA of Guest OS or SM in order to
    use that PTE. There are 4 VAS privilege levels::

                   HoB = 0            nest    HoB = 1         nest
    application:   application VAS    yes     no access       X
    Guest OS       application VAS    yes     Guest OS VAS    yes
    HyperVisor     HyperVisor VAS     no      Guest OS VAS    yes
    Secure         HyperVisor VAS     no      SM VAS          no

    So, HV has no 'direct' path to Application VAS using standard
    memory access protocols.
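
    Expressed as code, that table reads roughly as the sketch below; the
    enum names and struct are purely illustrative, only the selections
    themselves come from the table (HoB presumably being the high-order
    bit of the virtual address):

        #include <stdbool.h>

        enum priv { APPLICATION, GUEST_OS, HYPERVISOR, SECURE };
        enum vas  { NO_ACCESS, APP_VAS, GUEST_VAS, HV_VAS, SM_VAS };

        struct xlate { enum vas space; bool nested; };

        /* Which VAS an access translates through, and whether that
         * translation is nested, as a function of privilege and HoB. */
        struct xlate select_vas(enum priv level, bool hob)
        {
            switch (level) {
            case APPLICATION: return hob ? (struct xlate){ NO_ACCESS, false }
                                         : (struct xlate){ APP_VAS,   true  };
            case GUEST_OS:    return hob ? (struct xlate){ GUEST_VAS, true  }
                                         : (struct xlate){ APP_VAS,   true  };
            case HYPERVISOR:  return hob ? (struct xlate){ GUEST_VAS, true  }
                                         : (struct xlate){ HV_VAS,    false };
            case SECURE:      return hob ? (struct xlate){ SM_VAS,    false }
                                         : (struct xlate){ HV_VAS,    false };
            }
            return (struct xlate){ NO_ACCESS, false };
        }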

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sat Jan 18 03:08:47 2025
    According to MitchAlsup1 <mitchalsup@aol.com>:
    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Huh? Algol 60 routines could be mutually recursive so unless it was
    a leaf procedure or the outer block, everything not declared "own"
    went on the stack.

    On my FreeBSD server the default stack limit is half a gigabyte. I
    don't ever recall running into it.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to John Levine on Sat Jan 18 10:59:05 2025
    On 2025-01-18 5:08, John Levine wrote:
    According to MitchAlsup1 <mitchalsup@aol.com>:
    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Huh? Algol 60 routines could be mutually recursive so unless it was
    a leaf procedure or the outer block, everything not declared "own"
    went on the stack.


    Mitch's point AIUI was that Algol 60 had no heap allocation (and no
    explicit pointer types), so indeed all data were either on the stack or statically allocated.

    I'm not an English native speaker, but it seems to me that Mitch should
    have written "Algol 60 had only stack allocation" instead of "Algol 60
    only had stack allocation".

    The most-used Ada compiler, GNAT, uses a "secondary stack" to reduce the
    need for heap. Dynamically sized local data are placed on the secondary
    stack, and dynamically sized return values of functions are returned on
    the secondary stack. So a function can return "by value" an array sized
    1..N, with N a function parameter, without needing the heap.

    Of course the programmer then has the problem of setting sufficient
    sizes for /two/ stacks, the primary and the secondary. For
    embedded-systems programs one usually avoids constructs that would need
    a secondary stack.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Feb 7 13:57:51 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:

    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -------------------
    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV looses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about I0MMU addresses, are
    you talking about memory addresses or disk addresses?

    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed, Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    That is indeed the case. The IOMMU is on the inbound path
    from the PCIe controller to the internal bus/mesh structure.

    Note that there is a translation on the outbound path from
    the host address space to the PCIe memory space - this is
    often 1:1, but need not be so. This translation happens
    in the PCIe controller when creating a TLP that contains
    an address before sending the TLP to the endpoint. Take
    an AHCI controller, for example, where the only device
    BAR is 32-bits; if a host wants to map the AHCI controller
    at a 64-bit address, the controller needs to map that 64-bit
    address window into a 32-bit 3DW TLP to be sent to the endpoint
    function.

    The ARM SMMU is split into two - one that translates inbound
    addresses that are not marked secure by the endpoint, and
    one that translates addresses that are marked secure by the
    endpoint (or by some host bridge between the endpoint and
    the host internal bus structures which is configured by
    the secure software). The secure side is managed by the
    secure monitor; the non-secure side by the HV or bare-metal
    OS.


    In my block diagrams of HostBridge, I show I/O MMU only on the
    receiving side of PCIe transport links. all the outbound traffic
    has already been translated by a core MMMU, unless one allows
    a device to send commands to another device.

    I/O MMU sees the virtual address of where DMA is accessing, translating
    accordingly.

    I/O MMU sees the virtual address of MSI-X interrupts, page faults and
    errors.

    By page faults, I assume you're referring to the PCIe PRI (Page Request Interface) and ATS capabilities.


    ISTM that
    protecting memory of lower privileged programs is useless if a higher
    privileged program can force a page out to disk, then can read the data
    from the disk drive itself.

    Protecting a process without privilege from a process WITH privilege
    requires more than a little trust in the privileged process(s).
    This is why there is a Secure Monitor over HyperVisor to take HV out
    of the control loop for "secure stuff". By assuming the duties of
    HV wrt accessing unprivileged memory of storage, SM minimizes the
    footprint where trust is required.

    ARM has a "RM" (Realm Monitor) that sits between the HV and the SM
    to manage memory visibility and security.

    https://developer.arm.com/documentation/den0127/0200/Software-components/Realm-Management-Monitor


    Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.

    Assuming the file is not secured via other means such as cryptography.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Fri Feb 7 18:25:34 2025
    On Fri, 7 Feb 2025 13:57:51 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:

    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -------------------
    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV looses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about I0MMU addresses, are
    you talking about memory addresses or disk addresses?

    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed, Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    That is indeed the case. The IOMMU is on the inbound path
    from the PCIe controller to the internal bus/mesh structure.

    Note that there is a translation on the outbound path from
    the host address space to the PCIe memory space - this is
    often 1:1, but need not be so. This translation happens
    in the PCIe controller when creating the a TLP that contains
    an address before sending the TLP to the endpoint. Take

    Is there any reason this cannot happen in the core MMU ??

    Guest OS uses a virtual device address given to it from HV.
    HV sets up the 2nd nesting of translation to translate this
    to "what HostBridge needs" to route commands to device control
    registers. The handoff can be done by spoofing config space
    or having HV simply hand Guest OS a list of devices it can
    discover/configure/use.

    an AHCI controller, for example, where the only device
    BAR is 32-bits; if a host wants to map the AHCI controller
    at a 64-bit address, the controller needs to map that 64-bit
    address window into a 32-bit 3DW TLP to be sent to the endpoint
    function.

    This is one of the reasons My 66000 architecture has a unique
    MMI/O address space--you can setup a 32-bit BAR to put a
    page of control registers in 32-bit address space without
    conflict. {{If I understand correctly}} Core MMU, then,
    translates normal device virtual control register addresses
    such that the request is routed to where the device is looking
    {{which has 32 high order bits zero.}}

    On the other hand--it would take a very big system indeed to
    overflow the 32-bit MMI/O space, although ECAM can access
    42-bit device CR MMI/O space.

    The ARM SMMU is split into two - one that translates inbound
    addresses that are not marked secure by the endpoint, and
    one that translates addresses that are marked secure by the
    endpoint (or by some host bridge between the endpoint and
    the host internal bus structures which is configured by
    the secure software). The secure side is managed by the
    secure monitor; the non-secure side by the HV or bare-metal
    OS.


    In my block diagrams of HostBridge, I show I/O MMU only on the
    receiving side of PCIe transport links. all the outbound traffic
    has already been translated by a core MMMU, unless one allows
    a device to send commands to another device.

    I/O MMU sees the virtual address of where DMA is accessing, translating
    accordingly.

    I/O MMU sees the virtual address of MSI-X interrupts, page faults and >>errors.

    By page faults, I assume you're referring to the PCIe PRI (Page Request Interface) and ATS capabilities.


    ISTM that
    protecting memory of lower privileged programs is useless if a higher
    privileged program can force a page out to disk, then can read the data
    from the disk drive itself.

    Protecting a process without privilege from a process WITH privilege
    requires more than a little trust in the privileged process(s).
    This is why there is a Secure Monitor over HyperVisor to take HV out
    of the control loop for "secure stuff". By assuming the duties of
    HV wrt accessing unprivileged memory of storage, SM minimizes the
    footprint where trust is required.

    ARM has a "RM" (Realm Monitor) that sits between the HV and the SM
    to manage memory visiblity and security.

    https://developer.arm.com/documentation/den0127/0200/Software-components/Realm-Management-Monitor


    Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.

    Assuming the file is not secured via other means such as cryptography.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Feb 8 22:19:47 2025
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 13:57:51 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:

    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -------------------
    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV loses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about IOMMU addresses, are
    you talking about memory addresses or disk addresses?

    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed, Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    That is indeed the case. The IOMMU is on the inbound path
    from the PCIe controller to the internal bus/mesh structure.

    Note that there is a translation on the outbound path from
    the host address space to the PCIe memory space - this is
    often 1:1, but need not be so. This translation happens
    in the PCIe controller when creating the a TLP that contains
    an address before sending the TLP to the endpoint. Take

    Is there any reason this cannot happen in the core MMU ??

    How do you map the translation table to the device?

    device is configured by setting BAR[s] to an addressable
    page. Accesses to this page are performed by the device
    consisting of Rd and Wt to control registers. Physical
    addresses matching BAR aperture are routed to device.

    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.
    Thus, HV MMU maps guest OS physical address into universal
    MMI/O address.

    A long time before accessing the device, HyperVisor sets up
    a device control block and places it in a table indexed
    by segment:bus;device and stores table address in a control
    register of the I/O MMU {HostBridge}. This control block
    contains several context pointers, an interrupt table
    pointer, and four event coordinators--one each for DMA, page
    faults, errors, and interrupts. The EC provides an index
    into the root pointers.

    Guest OS uses the virtual device address in code, Guest OS
    MMU maps it to the aperture maintained by HyperVisor. HV
    then maps GPA to MMI/O:device_address. Using said trans-
    lations, Guest OS writes commands to the function:register
    of the addressed device.

    The path from core virtual address to device control register
    address does not pass through the I/O MMU.

    When device responds with DMA request it uses a device virtual
    address (not a virtual device address), said request is routed
    to the top of PCIe tree, where I/O MMU uses ECAM to identify
    the MMU tables for this device, once identified, translates*
    the device virtual address into a universal address (almost
    invariably targeting DRAM) Once translated and checked, the
    command is allowed to proceed. (*) assuming ATS was not used.

    When device responds with Interrupt request, I/O MMU uses
    ECAM (again) to find the associated interrupt table,
    and then translates the device interrupt address into a
    universal MMI/O write to the attached interrupt table.

    Said universal MMI/O write knocks on the door of interrupt
    table service port, where the interrupt message is logged
    into the table. And when the priority of the table increases
    the service port broadcasts the new priority vector of this
    table to all cores.

    Should a core monitoring this table see a higher priority
    interrupt pending than it is currently running, the core
    begins interrupt negotiation.

    When a device responds with a page fault, the device control
    block identifies the level of the software stack to handle
    this exception, and the I/O MMU sends a suitable interrupt
    to that level of the interrupt table.

    When a device responds with a device error, the device
    control block identifies the level and ISR to deal with
    this device problem, and the I/O MMU sends a suitable
    interrupt to that level of the interrupt table.

    So, the I/O MMU responds and guides all requests coming
    up the PCIe tree--not just DMA.
    --------------------------------------------------------
    How do you map the translation table to the device?

    HostBridge has a configuration register that points at
    the I/O MMU ROOT table, which is used to map segment:
    bus;device to Originating context. Originating Context
    contains a snapshot of the software stack managing the
    application. This is where the ROOT pointers, ASIDs,
    priorities, and levels are stored. And, in addition,
    there is an interrupt table pointer virtual address, ...

    A tree is used to map ECAM to device control block, and
    other than not starting at a page boundary, and not ending
    on a page boundary, it is essentially identical to the std
    page mapping tree. The final level of said tree points at
    the device control block--a cache line of data where the
    I/O MMU gets the data it needs for that particular device.
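
    A rough sketch of such a device control block and its lookup is below;
    every field name is illustrative, and the flat array merely stands in
    for the ECAM-indexed tree walk described above:

        #include <stdint.h>

        struct device_control_block {          /* one cache line per device     */
            uint64_t root_ptr[2];              /* translation root pointers     */
            uint16_t asid[2];                  /* ASIDs of the managing stack   */
            uint8_t  priority;
            uint8_t  level;                    /* stack level for faults/errors */
            uint64_t interrupt_table_va;       /* interrupt table pointer (VA)  */
            uint8_t  event_coordinator[4];     /* DMA, page fault, error, intr  */
        };

        /* HostBridge control register: base of the mapping structure.
         * A real implementation walks a page-table-like tree keyed by
         * segment:bus:device; a flat array keeps the sketch short. */
        static struct device_control_block *iommu_root;

        struct device_control_block *
        lookup_dcb(unsigned segment, unsigned bus, unsigned devfn)
        {
            unsigned rid = (segment << 16) | (bus << 8) | (devfn & 0xFF);
            return &iommu_root[rid];           /* stand-in for the tree walk */
        }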

    Why
    would you wish to have the CPU translating I/O virtual
    addresses?

    This is pure mischaracterization on your part. You always
    want the MMU closest to the access to perform the trans-
    lation. I suspect you read virtual device address and
    device virtual address interchangeably--they are entirely
    different things used in different places.

    The IOMMU tables are per device, and they
    can be configured to map the minimum amount of the address
    space (even updated per-I/O if desired) required to support
    the completion of an inbound DMA from the device.

    This still leaves the door open for a parity error to
    allow one application DMA to damage another application
    process memory, since commands to a single device share
    a translation table and both translations are valid at
    the same instant. One can essentially eliminate this
    with dead pages between different application mappings--
    preventing DMA from walking into a wrong VAS.


    Guest OS uses a virtual device address given to it from HV.
    HV sets up the 2nd nesting of translation to translate this
    to "what HostBridge needs" to route commands to device control
    registers. The handoff can be done by spoofing config space
    or having HV simply hand Guest OS a list of devices it can
    discover/configure/use.

    The IOMMU only is involved in DMA transactions _initiated_ by
    the device, not by the CPUs. They're two completely different
    concepts.

    If the I/O MMU does not participate in interrupts, page faults,
    and errors, who does ?? The requests coming up from the device
    are still virtual and need mapping and routing.


    an AHCI controller, for example, where the only device
    BAR is 32-bits; if a host wants to map the AHCI controller
    at a 64-bit address, the controller needs to map that 64-bit
    address window into a 32-bit 3DW TLP to be sent to the endpoint
    function.

    This is one of the reasons My 66000 architecture has a unique
    MMI/O address space--you can setup a 32-bit BAR to put a
    page of control registers in 32-bit address space without
    conflict. {{If I understand correctly}} Core MMU, then,
    translates normal device virtual control register addresses
    such that the request is routed to where the device is looking
    {{which has 32 high order bits zero.}}

    Most systems have DRAM located at physical address zero, and
    a 4GB DRAM is pretty small these days.

    DRAM: 0x0000000000000000 is not the same address as
    IOMM: 0x0000000000000000.
    The former is routed to the DRAM controller, the latter is routed
    to HostBridge. Both (all 4) spaces have 18446744073709551616
    bytes.
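
    Put another way, the routing decision keys off a 2-bit space selector
    carried alongside a full 64-bit offset. A minimal sketch (the selector
    encodings are assumptions; only DRAM and MMI/O are named above):

        #include <stdint.h>

        enum addr_space { SPACE_DRAM = 0, SPACE_MMIO = 1 /* 2 more spaces */ };

        struct universal_addr {
            uint64_t offset;    /* 64-bit offset within the selected space  */
            uint8_t  space;     /* 2-bit selector: which space, hence route */
        };

        /* DRAM:0x0 and MMI/O:0x0 are distinct: the selector, not the
         * offset, decides whether the request goes to the DRAM
         * controller or to the HostBridge. */
        static inline int routes_to_hostbridge(struct universal_addr a)
        {
            return a.space == SPACE_MMIO;
        }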

    So you either need
    to make a hole in the DRAM or provide a mapping mechanism to
    map a 64-bit address into a 32-bit bar when sending TLPs
    to the AHCI controller.

    The 32-bit BAR simply maps into IOMM: 0x00000000-0xFFFFFFFF;
    it does not overlay any of DRAM: 0x00000000-0xFFFFFFFF

    Now:: By using the SM and HV 2nd level of translation, SW
    C A N setup an aperture in either (or both) guest translations
    and host translations where it appears portions of DRAM are
    overlaid by MMI/O, it is simply not necessary from a HW point
    of view.

    Systems that aren't intel compatible will designate a range
    of the 64-bit physical address space (near the top) and will
    map regions in that range to the 32-bit bar via translation
    registers in the PCIe controller.

    You are using an aperture to place said MMI/O region.

    I am using PTEs such that the MMI/O region(s) can be
    pages scattered around without any common locality.
    Now, by suitable use, you can use the tools provided
    and end up with a MMI/O region easily denoted by an
    aperture.



    On the other hand--it would take a very big system indeed to
    overflow the 32-bit MMI/O space, although ECAM can access
    42-bit device CR MMI/O space.

    Leaving aside the small size of the legacy Intel I/O space
    (16-bit addresses), history seems to have favored single
    address space systems, so I suspect such a MMI/O space will
    not be favored by many.

    It is a SINGLE address system, it happens to have 66-bits of
    addressable space, and I use the MMU to translate virtual
    64-bit addresses into a universal 66-bit physical address
    that can be routed anywhere in the system. So any LD or ST
    can touch any of the 4×18446744073709551616 bytes addressable.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Feb 15 15:31:44 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 14 Feb 2025 19:51:44 +0000, Scott Lurndal wrote:


    The traditional PCI bus supports a 5-bit device# and a 3-bit function #.
    A PCIe bus supports either a 3-bit function number and the high-order
    five bits MBZ (i.e. only device 0 is legal on PCIe) or, if the function
    advertises the Alternate Routing Identifier (ARI) capability, an 8-bit
    function number (SRIOV leverages ARI to support dense routing IDs, but
    any bus that supports ARI can handle 256 physical functions).

    PCI supports two forms of configuration space addresses in TLPs:
    Type 0 contains only a function number and the register address.
    Type 1 contains the bus number, function number and register address.

    PCIe segments go where ? Are they "picked off" prior to being routed
    down the tree ??

    Logically, they can be considered a prefix to the RID for routing
    purposes (inbound to the IOMMU, for example, the PCIE controller
    will prepend its segment number to the RID and use that as an
    ID to the IOMMU). ARM calls it a streamid.

    For PCI configuration transactions initiated by the CPU,
    "PCIe segments go where?" is an interesting question.

    For traditional Intel-based PCI Local Bus implementations, there
    was a 'peek and poke' mechanism that uses intel IN and OUT
    instructions to access a pair of registers:

    0xCF8: Address register
    0xCFC: Data register

    The CPU would store the RID (16-bit BDF) in the address register
    then read or write the data register to access an 8/16/32 bit
    PCI configuration space register (such as the BAR registers, for
    example). The PCI controller that owned those registers
    would convert that to a PCI configuration transaction and put
    it on the PCI bus. The target device would capture the transaction
    and respond (writes were always non-posted); the remaining devices
    on the bus would ignore the transaction. If the bus field
    in the config TLP was the same as the downstream bus of the
    host bridge, a type 0 transaction would be sent; if the bus
    field was not, a type 1 transaction would be sent, captured
    by a bridge on the source bus and forwarded to a downstream bus.
    Ad infinitum until the bus space (8-bits) is exhausted.

    In this model an individual device on the bus had a fixed
    (e.g. via DIP switches, EEPROM, etc) 'device' number and
    that device could offer up to 8 'functions'. The device number
    was encoded in bits <7:3> of the RID and the function number
    was encoded in bits <2:0> of the RID.
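
    For reference, a minimal C sketch of that peek-and-poke sequence, using
    the conventional CONFIG_ADDRESS layout (enable bit 31, bus <23:16>,
    device <15:11>, function <10:8>, dword register <7:2>) and Linux x86
    port I/O (needs ioperm/iopl privileges):

        #include <stdint.h>
        #include <sys/io.h>

        #define PCI_CONFIG_ADDRESS 0xCF8
        #define PCI_CONFIG_DATA    0xCFC

        uint32_t pci_cam_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg)
        {
            uint32_t addr = 0x80000000u                      /* enable bit     */
                          | ((uint32_t)bus << 16)            /* bus number     */
                          | ((uint32_t)(dev & 0x1F) << 11)   /* RID bits <7:3> */
                          | ((uint32_t)(fn  & 0x07) << 8)    /* RID bits <2:0> */
                          | (reg & 0xFC);                    /* dword register */

            outl(addr, PCI_CONFIG_ADDRESS);   /* poke the address register */
            return inl(PCI_CONFIG_DATA);      /* read the data register    */
        }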


    PCI Express has been designed as a point-to-point protocol
    using serial connections rather than the wide PCI local
    bus, which changes the topology of the system. With PCIe
    the device number in the RID -must be zero- (unless the
    BUS is an ARI bus, in which case bits <7:0> of the RID
    are a function number provided by the PCIe device (up
    to 256 functions per each - more with SRIOV as it can
    consume additional space in the bus <15:8> field of the
    RID to support up to 65535 virtual functions on a single
    device). A non-SRIOV and non-ARI device can only provide
    from one to eight functions.

    Since SRIOV can consume the entire 16-bit RID on a single
    PCIe device, each PCIe device (of which there can be one
    per PCIe controller 'root port') is assigned a unique
    segment number (as if it were prepended to the 16-bit RID).

    To support non-Intel systems and higher performance accesses
    to configuration space for PCIe devices, PCIe specified an
    Extended Configuration Access Method (ECAM) which maps the
    configuration space of each PCIe segment into a region of
    the physical address space (chosen by the implementation).

    This allows regular memory loads and stores to access the
    PCI configuration space rather than the intel-specific
    (and other architecture specific) peek-and-poke access methods.

    At the base address of the ECAM, the decoder would look
    at the remaining bits with a layout like:

    <11:00> Byte-granularity address of configuration space (4KB) register
    <19:12> Function number (<19:15> == 0 for non-ARI bus)
    <27:20> Bus Number (0 for the controller root complex bridge; downstream
            bus numbers assigned by software, can be sparse assignment)

    The specification allows the implementation to provide a single ECAM
    segment per PCIe controller, or an implementation may provide a single
    ECAM region and use bits <63:28> as a segment number. This is how
    most non-intel systems handle this today; a processor that supports
    six PCIe controllers would have perhaps 7 segments (one or more for the
    root bus 0 for the on-chip devices such as memory controllers, and
    one for each PCIe controller).

    Software simply constructs the target ECAM address and issues normal
    loads and stores to access it - no need to use the clumsy, slow and non-standard PCI peek-and-poke configuration space accesses.

    For outbound memory space (non-config space) transactions from the CPU to
    the device, the RID is not used - once the CPU I/O fabric has routed
    the request by PA to the proper PCIe controller (address based routing
    through a heirarchy of host bridges), it is simply sent to the device
    which CAMs the address against the programmed bars and reacts appropriately (e.g. by responding with a UR (Unsupported Request) on a non-posted
    request or dropping a posted request if it doesn't match a BAR).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Feb 15 23:28:28 2025
    On Sat, 15 Feb 2025 15:31:44 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    -----------
    PCI supports two forms of configuration space addresses in TLPs:
    Type 0 contains only a function number and the register address.
    Type 1 contains the bus number, function number and register address.

    PCIe segments go where ? Are they "picked off" prior to being routed
    down the tree ??

    Logically, they can be considered a prefix to the RID for routing
    purposes (inbound to the IOMMU, for example, the PCIE controller
    will prepend its segment number to the RID and use that as an
    ID to the IOMMU). ARM calls it a streamid.

    For PCI configuration transactions initiated by the CPU,
    "PCIe segments go where?" is an interesting question.

    Indeed. Snipping Intel brain damage
    -----------------------
    PCI Express has been designed as a point-to-point protocol
    using serial connections rather than the wide PCI local
    bus, which changes the topology of the system. With PCIe
    the device number in the RID -must be zero- (unless the
    BUS is an ARI bus,

    What is an ARI bus, and do non x86 systems have them ??

    in which case bits <7:0> of the RID
    are a function number provided by the PCIe device (up
    to 256 functions per each - more with SRIOV as it can
    consume additional space in the bus <15:8> field of the
    RID to support up to 65535 virtual functions on a single
    device). A non-SRIOV and non-ARI device can only provide
    from one to eight functions.

    snipping

    The specification allows the implementation to provide a single ECAM
    segment per PCIe controller, or and implementation may provide a single
    ECAM region and used bits <63:28> as a segment number.

    Wikipedia states ECAM contains 42 bits::
    16-bit segment, 8-bit bus, 5-bit device, 3-bit function, 4-bit
    xReg, and 6-bit reg. Is this wrong, misleading, or out of date ?

    This is how
    most non-intel systems handle this today; an processors that supports
    six PCIe controllers would have perhaps 7 segments (one or more for the
    root bus 0 for the on-chip devices such as memory controllers, and
    one for each PCIe controller.

    Software simply constructs the target ECAM address and issues normal
    loads and stores to access it - no need to use the clumsy, slow and non-standard PCI peek-and-poke configuration space accesses.

    We still have to translate 'target ECAM' into a bit pattern matching
    that device's BAR. And thanks to your help I have a means to do so
    that has overhead only when booting a new Guest OS.

    For outbound memory space (non-config space) transactions from the CPU
    to
    the device, the RID is not used - once the CPU I/O fabric has routed
    the request by PA to the proper PCIe controller (address based routing through a heirarchy of host bridges), it is simply sent to the device
    which CAMs the address against the programmed bars and reacts
    appropriately
    (e.g. by responding with a UR (Unsupported Request) on a non-posted
    request or dropping a posted request if it doesn't match a BAR).

    Agreed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Sun Feb 9 15:45:13 2025
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 21:53:27 +0000, EricP wrote:

    Stephen Fuld wrote:
    -----------------

    I am a little confused here. When you talk about I0MMU addresses, are
    you talking about memory addresses or disk addresses? ISTM that
    protecting memory of lower privileged programs is useless if a higher
    privileged program can force a page out to disk, then can read the data
    from the disk drive itself. Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.

    I'm just kinda free associating what the consequences of restricting
    the HV's access to lesser privilege levels might be.

    As I understand it, the AMD secure HV approach is that memory owned by a
    guest kernel and its applications is encrypted and only the guest kernel
    has the key. Memory content is only decrypted while inside the core.
    As the key is only stored inside that guest kernel memory there is
    no way for HV to get at it.

    Interesting, so you can see the data you just can't interpret the bit-patterns. This still leaves the door open for malicious action.

    Yes. Their Secure VM sounds exactly like what a timeshare vendor might want
    if they desire to sell services to governments and businesses that have
    secrets that rogue operators must provably not access.
    Also blocks accidental leaks between guests so you could have multiple
    such secure guest OS on the same host.

    So it doesn't matter that the HV has access to guest memory because it
    can only see encrypted memory values. Presumably such data is encrypted
    on disk so intercepting a DMA gives you nothing.

    It maters if HV can access (especially modify) what is in storage
    (not memory), making it impossible for secure process from using
    his own data !

    Yes a rogue or buggy HV could DoS a guest OS by scrambling its data.
    Or an ECC memory error on critical memory location, like the key.

    But it doesn't look like blocking HV access to the guest kernel, user,
    or sandbox memory accomplishes the same security because the HV can
    always diddle its own page tables to grant itself access.

    My 66000 does it differently. HV can create a PTE that translates
    "anywhere", but cannot use the VA of Guest OS or SM in order to
    use that PTE. There are 4 VAS privilege levels::

                   HoB = 0            nest    HoB = 1         nest
    application:   application VAS    yes     no access       X
    Guest OS       application VAS    yes     Guest OS VAS    yes
    HyperVisor     HyperVisor VAS     no      Guest OS VAS    yes
    Secure         HyperVisor VAS     no      SM VAS          no

    So, HV has no 'direct' path to Application VAS using standard
    memory access protocols.

    But the physical memory in use by that guest can be remapped
    by rogue HV to be in its own virtual space and then accessed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Sun Feb 9 21:03:15 2025
    On Sun, 9 Feb 2025 20:45:13 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 21:53:27 +0000, EricP wrote:

    Stephen Fuld wrote:
    -----------------

    I am a little confused here. When you talk about IOMMU addresses, are
    you talking about memory addresses or disk addresses? ISTM that
    protecting memory of lower privileged programs is useless if a higher
    privileged program can force a page out to disk, then can read the data
    from the disk drive itself. Of course, the same is true for data
    written to disk by a lesser privileged program. If the higher
    privileged program can read the file, then it can compromise security.

    I'm just kinda free associating what the consequences of restricting
    the HV's access to lesser privilege levels might be.

    As I understand it, the AMD secure HV approach is that memory owned by a
    guest kernel and its applications is encrypted and only the guest kernel
    has the key. Memory content is only decrypted while inside the core.
    As the key is only stored inside that guest kernel memory there is
    no way for HV to get at it.

    Interesting, so you can see the data you just can't interpret the
    bit-patterns. This still leaves the door open for malicious action.

    Yes. Their Secure VM sounds exactly like what a timeshare vendor might
    want
    if they desire to sell services to governments and businesses that have secrets that rogue operators must provably not access.

    Just enough to gain the confidence of *.gov buyers, without enough
    to prevent NSA from using the data.

    Also blocks accidental leaks between guests so you could have multiple
    such secure guest OS on the same host.

    So it doesn't matter that the HV has access to guest memory because it
    can only see encrypted memory values. Presumably such data is encrypted
    on disk so intercepting a DMA gives you nothing.

    It maters if HV can access (especially modify) what is in storage
    (not memory), making it impossible for secure process from using
    his own data !

    Yes a rogue or buggy HV could DoS a guest OS by scrambling its data.
    Or an ECC memory error on critical memory location, like the key.

    But it doesn't look like blocking HV access to the guest kernel, user,
    or sandbox memory accomplishes the same security because the HV can
    always diddle its own page tables to grant itself access.

    My 66000 does it differently. HV can create a PTE that translates
    "anywhere", but cannot use the VA of Guest OS or SM in order to
    use that PTE. There are 4 VAS privilege levels::

                   HoB = 0            nest    HoB = 1         nest
    application:   application VAS    yes     no access       X
    Guest OS       application VAS    yes     Guest OS VAS    yes
    HyperVisor     HyperVisor VAS     no      Guest OS VAS    yes
    Secure         HyperVisor VAS     no      SM VAS          no

    So, HV has no 'direct' path to Application VAS using standard
    memory access protocols.

    But the physical memory in use by that guest can be remapped
    by rogue HV to be in its own virtual space and then accessed.

    You can never get away from the notion that the code creating
    PTP and PTE bit patterns, or manipulating Root pointers requires
    a certain amount of trust.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sun Feb 16 19:56:06 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 15 Feb 2025 15:31:44 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    -----------
    PCI supports two forms of configuration space addresses in TLPs:
    Type 0 contains only a function number and the register address.
    Type 1 contains the bus number, function number and register address.
    PCIe segments go where ? Are they "picked off" prior to being routed
    down the tree ??

    Logically, they can be considered a prefix to the RID for routing
    purposes (inbound to the IOMMU, for example, the PCIE controller
    will prepend its segment number to the RID and use that as an
    ID to the IOMMU). ARM calls it a streamid.

    For PCI configuration transactions initiated by the CPU,
    "PCIe segments go where?" is an interesting question.

    Indeed. Snipping Intel brain damage
    -----------------------
    PCI Express has been designed as a point-to-point protocol
    using serial connections rather than the wide PCI local
    bus, which changes the topology of the system. With PCIe
    the device number in the RID -must be zero- (unless the
    BUS is an ARI bus,

    What is an ARI bus, and do non x86 systems have them ??

    Alternate Routing ID. It is a PCIe standard "capability",
    albeit optional. It is required if the function
    implements SR-IOV[*] or if a device needs to support more
    than 8 physical functions.

    [*] There is a legacy mapping that can be used if the OS
    doesn't understand the scanning rules when physical
    function zero has the ARI capability. It's deprecated
    in modern systems.


    in which case bits <7:0> of the RID
    are a function number provided by the PCIe device (up
    to 256 functions per each - more with SRIOV as it can
    consume additional space in the bus <15:8> field of the
    RID to support up to 65535 virtual functions on a single
    device). A non-SRIOV and non-ARI device can only provide
    from one to eight functions.

    snipping

    The specification allows the implementation to provide a single ECAM
    segment per PCIe controller, or and implementation may provide a single
    ECAM region and used bits <63:28> as a segment number.

    Wikipedia states ECAM contains 42 bits::
    16-bit segment, 8-bit bus, 5-bit device, 3-bit function, 4-bit
    xReg, and 6-bit reg. Is this wrong, misleading, or out of date ?

    Misleading - perhaps specific to Intel's implementation.
    The PCIe specification defines only bits <27:0>. The
    remaining bits are defined by the implementation; the
    higher bits usually select the root complex implementation
    to which the transaction should be directed.

    While the specification uses the "register number" nomenclature,
    in the real world bits <11:0> are the offset from the start of
    the device configuration space to the desired register. Usually
    4-byte aligned, but the legacy space includes some registers that
    support byte and 2-byte accesses.


    This is how
    most non-intel systems handle this today; an processors that supports
    six PCIe controllers would have perhaps 7 segments (one or more for the
    root bus 0 for the on-chip devices such as memory controllers, and
    one for each PCIe controller.

    Software simply constructs the target ECAM address and issues normal
    loads and stores to access it - no need to use the clumsy, slow and
    non-standard PCI peek-and-poke configuration space accesses.

    We still have to translate 'target ECAM' into a bit pattern matching
    that device's BAR. And thanks to your help I have a means to do so
    that has overhead only when booting a new Guest OS.

    When software accesses the ECAM, the PCIe host bridge or
    PCIe Root complex implementation (generally
    transparent to software) is required to translate the load or
    store into a PCIe CFGRD or CFGWR transaction (type 0 or type 1)
    to send to the device.

    This includes configuration transactions that read, size and modify
    the PCI configuration space Base Address Registers (specifically at
    addresses 0x10, 0x14, 0x18, 0x1c, 0x20, 0x24).

    I'll stress that these configuration transactions are rare, and usually
    occur during operating system device discovery and initialization.
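
    For example, the usual BAR-sizing sequence is just three such config
    accesses; cfg_read32/cfg_write32 below are assumed helpers standing in
    for whatever access method the platform provides (ECAM loads/stores or
    the legacy CF8/CFC mechanism):

        #include <stdint.h>

        uint32_t cfg_read32(uint16_t rid, uint16_t reg);             /* assumed */
        void     cfg_write32(uint16_t rid, uint16_t reg, uint32_t);  /* helpers */

        /* Size of a 32-bit memory BAR at config offset 0x10..0x24. */
        uint64_t bar_size32(uint16_t rid, uint16_t bar_off)
        {
            uint32_t saved = cfg_read32(rid, bar_off);

            cfg_write32(rid, bar_off, 0xFFFFFFFFu);   /* probe: write all ones   */
            uint32_t mask = cfg_read32(rid, bar_off); /* read back writable bits */
            cfg_write32(rid, bar_off, saved);         /* restore original value  */

            mask &= ~0xFu;                     /* strip type/prefetchable bits  */
            if (mask == 0)
                return 0;                      /* BAR not implemented           */
            return (uint64_t)~mask + 1;        /* size = 2^(count of low zeros) */
        }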

    The ECAM is not involved in any transactions that target any
    physical address region mapped in the function BARs.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Niklas Holsti on Mon Jan 27 17:26:51 2025
    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:

    On 2025-01-18 5:08, John Levine wrote:

    According to MitchAlsup1 <mitchalsup@aol.com>:

    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Huh? Algol 60 routines could be mutually recursive so unless it was
    a leaf procedure or the outer block, everything not declared "own"
    went on the stack.

    Mitch's point AIUI was that Algol 60 had no heap allocation (and no
    explicit pointer types), so indeed all data were either on the stack
    or statically allocated.

    I'm not an English native speaker, but it seems to me that Mitch
    should have written "Algol 60 had only stack allocation" instead of
    "Algol 60 only had stack allocation".

    Yes. I have seen this situation described as a rule for "only" to
    be put as late in the sentence as still makes sense.

    Putting on my editor hat, I would recommend revising the sentence
    more thoroughly, as for example, "Algol 60 had no way of allocating
    memory except by means of local variables on the stack" (assuming that
    is the case; my memories of the rules of Algol may have undetected
    ECC errors).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Feb 3 14:09:49 2025
    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe meso-privileged ?!?

    handyman?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stefan Monnier on Mon Feb 3 21:13:24 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe
    meso-privileged ?!?

    handyman?

    Application -> Library -> OS -> Hypervisor -> Secure Monitor

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Feb 3 21:23:47 2025
    On Mon, 3 Feb 2025 21:13:24 +0000, Scott Lurndal wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe
    meso-privileged ?!?

    handyman?

    Application -> Library -> OS -> Hypervisor -> Secure Monitor


    {Sandbox -> user -> application -> Library} ->{sual}×{GuestOS, HV, SM}

    ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Feb 3 22:47:24 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 3 Feb 2025 21:13:24 +0000, Scott Lurndal wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe
    meso-privileged ?!?

    handyman?

    Application -> Library -> OS -> Hypervisor -> Secure Monitor


    {Sandbox -> user -> application -> Library} ->{sual}×{GuestOS, HV, SM}

    ??

    You need to precisely define your terms. What are sandbox
    and user in this context?

    The hypervisor is optional, as would be a library.

    The Burroughs Large systems and HP-3000 segmented libraries
    were distinct entities with attributes.

    Code in a library could be more privileged than the application
    when acting on behalf of the application, for example; but the
    application could not take advantage of the permissions assigned
    to the library it was linked with without using interfaces
    provided by the library.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Feb 3 23:11:03 2025
    On Mon, 3 Feb 2025 22:47:24 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 3 Feb 2025 21:13:24 +0000, Scott Lurndal wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    It is like there is a privilege level between application and GuestOS.
    {{I spent all afternoon trying to think of a name for this privilege
    above application "non-privileged" and below "privileged". Maybe
    meso-privileged ?!?

    handyman?

    Application -> Library -> OS -> Hypervisor -> Secure Monitor


    {Sandbox -> user -> application -> Library} ->{sual}×{GuestOS, HV, SM}

    ??

    You need to precisely define your terms. What are sandbox
    and user in this context?

    It is all about manipulating access rights without modifying
    what is stored in the TLB (so you don't have to reload any
    entries to change access rights.) It is sort of like what
    the G-bit does (global) {except in my architecture globality
    is controlled by ASID.}

    Sandbox is a privilege level where one cannot be granted both
    write and execute access at the same time. There may be other
    restrictions, too, such as which control registers user code may
    be allowed to write.

    Library would include all the trusted stuff, but also ld.so
    and any JITs. JITs can only create code for sandboxes. So,
    JIT can write to JITcache but sandbox cannot using the same
    PTE entry. ld.so can write GOT while user and application
    cannot write GOT (or execute GOT).

    User is the privilege level where sandbox does not apply but
    also there is no ability to over-access things protected by
    PTE.RWE.

    Application is a privilege level where PTE.RWE can sometimes
    be usurped--such as DMA from a device needing to write into
    an execute-only page.

    Where does memmove() come from if not the library ??

    Libraries have a SW-kind of trust even if they are
    devoid of HW kinds of trust (PTE.RWE overrides).

    But these levels are just talking points at this point.

    The hypervisor is optional, as would be a library.

    It cannot be a library of a process !!
    It is not a library of GuestOS !
    It is certainly not a library of Secure Monitor !!


    The Burroughs Large systems and HP-3000 segmented libraries
    were distinct entities with attributes.

    And could change (update/upgrade) the library while the process
    was running !!

    Code in a library could be more privileged than the application
    when acting on behalf of the application, for example; but the
    application could not take advantage of the permissions assigned
    to the library it was linked with without using interfaces
    provided by the library.

    No disagreement.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Feb 10 20:18:04 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 13:57:51 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:

    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -------------------
    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?
    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV loses control of any
    core or IOMMU PTE's that map that frame until it is handed back.
    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about IOMMU addresses, are
    you talking about memory addresses or disk addresses?

    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed. Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    That is indeed the case. The IOMMU is on the inbound path
    from the PCIe controller to the internal bus/mesh structure.

    Note that there is a translation on the outbound path from
    the host address space to the PCIe memory space - this is
    often 1:1, but need not be so. This translation happens
    in the PCIe controller when creating a TLP that contains
    an address before sending the TLP to the endpoint. Take

    Is there any reason this cannot happen in the core MMU ??

    How do you map the translation table to the device?

    device is configured by setting BAR[s] to an addressable
    page. Accesses to this page are performed by the device
    consisting of Rd and Wt to control registers. Physical
    addresses matching BAR aperture are routed to device.

    I was referring to transactions initiated by the endpoint,
    not one of the processing elements. E.g. DMA.

    Outbound addressing is a solved problem. The complexities
    are in the design of the bus system. Programmable
    BARs need logical bridges (for OS configuration) and the
    hardware needs to route the addresses to the appropriate
    destination (onchip devices and external PCIe devices). It's
    not as simple as one might think when you have several
    hundred functions (each with 3 64-bit or 6 32-bit BARs).

    The routing is simple on an old-fashioned bus where each
    function sees all transactions and can respond when the
    address matches one of the function BARs. That doesn't scale
    (which is why we have PCIe rather than PCI Local Bus).

    Modern mesh/ring routing systems need to know which stop
    or mesh point that the I/O bridge is on, the I/O bridge needs
    to know how to route those addresses to the proper
    controller - long before the endpoint(device) actually can match
    the address to one of its BAR registers in its configuration space.


    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.

    Standard HV stuff. Although you may want to consider that
    the value in the function's BAR, must be a guest PA, not a
    machine PA.

    Likewise any addresses programmed into the
    function DMA engine(s) will be guest PA (many IOMMU also
    allow guest application VA directly, if the guest OS
    allows the guest application to directly access an SR-IOV virtual
    function) and need to be translated inbound from
    the function (and routed appropriately to either the
    memory/LLC subsystem, or perhaps another PCIe controller
    when the system supports PCIe Peer-to-Peer routing).

    Thus, HV MMU maps guest OS physical address into universal
    MMI/O address.

    I think the software folks may be quite unhappy to support an
    unusual 32-bit MMIO address space, not to mention the lack
    of support for 64-bit device BARs. There are a lot of PCI
    devices that require apertures larger than 4GB.


    A long time before accessing the device, HyperVisor sets up
    a device control block and places it in a table indexed
    by segment:bus:device and stores the table address in a control
    register of the I/O MMU {HostBridge}. This control block
    contains several context pointers, an interrupt table
    pointer, and four event coordinators--one for DMA, page
    faults, errors, and interrupts. The EC provides an index
    into the root pointers.
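
    An illustrative C sketch of the control block just described; the
    field names and widths are invented for illustration, only the field
    set follows the description above:

        #include <stdint.h>

        struct event_coordinator {
            uint64_t root_index;        /* index into the root pointers     */
        };

        struct device_control_block {
            uint64_t context_ptr[4];    /* "several context pointers"       */
            uint64_t interrupt_table;   /* interrupt table pointer          */
            struct event_coordinator dma, page_fault, error, interrupt;
        };

        /* The hypervisor keeps a table of these, indexed by
           segment:bus:device, and programs the table's base address into
           an I/O MMU (HostBridge) control register long before the device
           is first used. */
        struct device_control_block *dcb_table;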

    I need to spend time thinking about this, which I don't
    currently have. It adds a lot of complexity to the
    software that shouldn't be necessary. And the linux
    folks will _refuse_ to support anything that requires
    any quirks or non-standard access to PCIe devices.


    Guest OS uses the virtual device address in code, Guest OS
    MMU maps it to the aperture maintained by HyperVisor. HV
    then maps GPA to MMI/O:device_address. Using said trans-
    lations, Guest OS writes commands to the function:register
    of the addressed device.

    The path from core virtual address to device control register
    address does not pass through the I/O MMU.

    That's true for intel, amd and arm I/O MMU - they're only
    concerned with DMA addresses from the device, not outbound
    transactions from the host CPUs.


    When device responds with DMA request it uses a device virtual
    address (not a virtual device address),

    To be compatible with the current operating systems, a DMA
    address must be a guest physical address (for a device owned
    by a guest OS), or the host physical address (for a bare-metal
    OS).

    You'll find a lot of pushback from the OS vendors if that
    is not the case.


    said request is routed
    to the top of PCIe tree, where I/O MMU uses ECAM to identify
    the MMU tables for this device, once identified, translates*
    the device virtual address into a universal address (almost
    invariably targeting DRAM). Once translated and checked, the
    command is allowed to proceed. (*) assuming ATS was not used.

    It is not uncommon to have several I/OMMU to support throughput
    of high-bandwidth devices. They may be managed as a
    unit, but the translation engines are distributed - ARM
    supports this model.


    When device responds with Interrupt request, I/O MMU uses
    ECAM (again) to find the associated interrupt table,
    and then translates the device interrupt address in to a
    universal MMI/O write to the attached interrupt table.

    That's the intel/amd model. ARM64 separates interrupt
    routing from address routing, with the former handled
    by the interrupt controller (GIC) and the latter by the
    SMMU.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Feb 10 23:40:24 2025
    On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
    -----------------isolating---------------------------------

    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.

    Standard HV stuff. Although you may want to consider that
    the value in the function's BAR, must be a guest PA, not a
    machine PA.

    I need a "WHY" on this sentence before responding to the rest.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Feb 11 14:04:59 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
    -----------------isolating---------------------------------

    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.

    Standard HV stuff. Although you may want to consider that
    the value in the function's BAR, must be a guest PA, not a
    machine PA.

    I need a "WHY" on this sentence before responding to the rest.

    The guest OS (or user application) is directly programming the addresses
    in the PCIe function DMA engine[*]. The point is to completely
    avoid involving the hypervisor in the I/O,
    so the guest programs the function DMA engine using guest PA and the
    IOMMU translates the guest PA into machine
    addresses for all transactions initiated by the
    function.

    (Each PCIe function is treated as an individual device
    by the OS/HV. The entire purpose of SR-IOV is to
    present the device directly to the guest to avoid any
    hypervisor involvement in the I/O path).

    [*] Take an SR-IOV capable NIC; each SR-IOV virtual
    function (VF) can be assigned to a different guest -
    from the guest point of view, it owns the
    entire function and programs the DMA engines for the
    function directly, with no hypervisor intervention.

    This allows the same driver to be used in the operating system
    for either bare-metal or guest OS. No paravirt required.

    So when the NIC needs to DMA an inbound ethernet packet
    to the guest OS buffers, it sends memory write TLPs to
    the root complex using the guest physical address programmed
    into the DMA engine. The IOMMU translates that into
    the machine address and passes it to the proper entity
    (i.e. DRAM/LLC or another device for Peer-to-peer).
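
    A minimal C sketch of that inbound translation step; the lookup and
    walk helpers are hypothetical stand-ins, not any real IOMMU API:

        #include <stdint.h>

        struct iommu_ctx;                   /* per-device translation tables */
        struct iommu_ctx *iommu_lookup(uint16_t segment, uint16_t rid);
        uint64_t iommu_walk(struct iommu_ctx *ctx, uint64_t guest_pa);

        /* Conceptually applied to every MWr/MRd TLP arriving from an
           endpoint: the requester's RID selects the tables, which turn the
           guest PA carried in the TLP into a machine address. */
        uint64_t translate_inbound(uint16_t segment, uint16_t rid,
                                   uint64_t guest_pa)
        {
            struct iommu_ctx *ctx = iommu_lookup(segment, rid);
            return iommu_walk(ctx, guest_pa);
        }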

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Tue Feb 11 09:30:47 2025
    On 2/6/2025 6:39 PM, MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:

    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -------------------
    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV loses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here.  When you talk about IOMMU addresses, are
    you talking about memory addresses or disk addresses?

    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed, Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    I am terribly out of date with all of this, but what if the device is a
    SATA disk? It at least used to be that you sent a command packet to the
    disk and said packet contained the disk relative block number. I know
    of no way to initiate an I/O by writing to the disk's "control registers".

    In my block diagrams of HostBridge, I show I/O MMU only on the
    receiving side of PCIe transport links. All the outbound traffic
    has already been translated by a core MMU, unless one allows
    a device to send commands to another device.

    I/O MMU sees the virtual address of where DMA is accessing, translating accordingly.

    I/O MMU sees the virtual address of MSI-X interrupts, page faults and
    errors.

    ISTM that
    protecting memory of lower privileged programs is useless if a higher
    privileged program can force a page out to disk, then can read the data
    from the disk drive itself.

    Protecting a process without privilege from a process WITH privilege
    requires more than a little trust in the privileged process(es).

    Yes.


    This is why there is a Secure Monitor over HyperVisor to take HV out
    of the control loop for "secure stuff".

    Which, of course, means you have to trust the "Secure Monitor". :-)

    By assuming the duties of
    HV wrt accessing unprivileged memory or storage, SM minimizes the
    footprint where trust is required.

    Fair enough. But minimizes is not the same as eliminating.



    Of course, the same is true for data
    written to disk by a lesser privileged program.  If the higher
    privileged program can read the file, then it can compromise security.

    See my comments above about SATA disks.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stephen Fuld on Tue Feb 11 18:19:14 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/6/2025 6:39 PM, MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:


    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed, Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    I am terribly out of date with all of this, but what if the device is a
    SATA disk? It at least used to be that you sent a command packet to the
    disk and said packet contained the disk relative block number. I know
    of no way to initiate an I/O by writing to the disk's "control registers".

    The SATA device implements the Advanced Host Controller Interface (AHCI).

    https://en.wikipedia.org/wiki/Advanced_Host_Controller_Interface

    The controller supports multiple modes - a legacy IDE mode (which hasn't
    been actively used for a couple of decades), the Native AHCI mode, and optionally a RAID mode.

    The native AHCI and RAID modes have DMA engines in the controller that
    perform bulk data transfer to satisfy a controller command. For example,
    a command from the driver to transfer 100 sectors to a buffer starting at address 0x23510000 will cause the controller to start reading the disk
    at the starting sector and streaming the data to the host root complex.
    An AHCI device has one command queue with up to 32 outstanding commands
    at any one time.

    A TLP (PCIe Transaction Layer Packet) can be up to 1024 bytes, so the
    controller can push that much to the host in a single transaction.
    The root complex will pass the data to a host bridge with an IOMMU,
    and the IOMMU will translate the DMA addresses into host addresses
    before the bridge passes the data to the memory subsystem (mesh, ring,
    bus).

    NVMe (PCIe attached SSD) has a more modern interface. The OS
    (or HV) driver allocates a region of physical memory that holds
    input and output queues. The OS driver inserts one or more requests
    into a queue (there can be one or more queues per guest, for example) and
    pokes a doorbell register in the NVMe hardware. The buffer
    addresses in the request will be OS physical addresses. NVMe
    supports up to 65536 command rings.

    The doorbell write causes the
    NVMe hardware to read the new data structure(s) from the queue (describing
    the I/O) and execute those requests by either issuing MRD
    (Memory Read TLPs) or MWR (Memory Write TLPs) to the host
    root complex/PCIe controller until the request has been satisfied.
    A request can be arbitrarily large and supports a scatter gather
    list so that an inbound read can be stored in discontiguous
    regions in the applicable physical address regime (bare metal, guest
    or user-application).

    Once the request has been satisfied, the controller posts a
    completion entry on a completion queue (using DMA) and sends an MSI-X
    message (using DMA) to the root complex which passes it to the host
    interrupt controller which generates a 'complete' interrupt
    to the driver. These interrupts can be optionally coalesced
    (something more useful in a network interface controller than
    in a disk/ssd controller, to be sure). The driver reads the
    completion status from the completion queue (and wakes any
    threads waiting for the data).

    https://en.wikipedia.org/wiki/NVM_Express#Comparison_with_AHCI
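
    A minimal C sketch of that submission-queue/doorbell handshake; the
    structure layout and the doorbell mapping are simplified placeholders,
    not the literal NVMe register map:

        #include <stdint.h>

        struct sq_entry {              /* simplified 64-byte submission entry */
            uint8_t  opcode, flags;
            uint16_t cid;
            uint32_t nsid;
            uint64_t rsvd, mptr;
            uint64_t prp1, prp2;       /* data buffer addresses (OS/guest PA) */
            uint32_t cdw10_15[6];
        };

        struct sq {
            struct sq_entry   *ring;      /* queue memory in host DRAM        */
            volatile uint32_t *doorbell;  /* MMIO tail doorbell in a BAR      */
            uint16_t           tail, depth;
        };

        /* Driver side: place the command in the ring, advance the tail, and
           ring the doorbell; the controller then fetches the entry by DMA. */
        static void nvme_submit(struct sq *q, const struct sq_entry *cmd)
        {
            q->ring[q->tail] = *cmd;
            q->tail = (uint16_t)((q->tail + 1) % q->depth);
            /* a write barrier belongs here on a real machine */
            *q->doorbell = q->tail;
        }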

    Most modern PCIe network interface controllers use similar ring
    structures to pass work between the driver and the hardware
    with complicated traffic shaping, RSS and interrupt coalescing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Feb 11 20:19:32 2025
    On Tue, 11 Feb 2025 14:04:59 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
    -----------------isolating---------------------------------

    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.

    Standard HV stuff. Although you may want to consider that
    the value in the function's BAR, must be a guest PA, not a
    machine PA.

    I need a "WHY" on this sentence before responding to the rest.

    The guest OS (or user application) is directly programming the addresses
    in the
    PCIe function DMA engine[*]. The point is to completely
    avoid the hypervisor from being involved in the I/O,
    so the guest programs the function DMA engine using guest PA and the
    IOMMU translates the guest PA into machine
    addresses for all transactions initiated by the
    function.

    (Each PCIe function is treated as an individual device
    by the OS/HV. The entire purpose of SR-IOV is to
    present the device directly to the guest to avoid any
    hypervisor involvment in the I/O path).

    This is the missing part:: When a user performs a LD or ST
    the guest Virtual address is translated to Guest Physical
    by Guest OS translation tables. Guest Physical is then
    interpreted as Host virtual and translated a second time
    by {SM,HV} mapping tables. This nested MMU does both
    translations in a single access; so core TLB is organized
    to associate guest virtual directly with machine physical !
    as if there were a level crossing PTE providing all the
    "right bits".

    Now that the device has been configured, Guest OS decides
    to write some control registers of the device. Guest OS
    has its own translation tables for Guest Virtual to Guest
    Physical--but the core MMU then translates guest Physical
    to Machine Physical before it gets transported over the
    interconnect. So, by the time said address gets to Host-
    Bridge it is already in Machine Physical, not the Guest
    physical you mention.*

    ------------------------------------------------------

    UNLESS HV and Guest OS have an agreement that Guest Physical
    has a range by which Guest Physical == Machine Physical !!
    And if there is such an agreement, calling the address
    Machine Physical is just as valid as calling it Guest
    Physical.

    (*) it sounds like you are assuming there is a way to trans-
    late the first nesting level to guest physical and then
    sidestep the translation of guest physical to machine
    physical until the request gets to the I/O MMU, allowing
    the I/O MMU to perform the second level of nesting trans-
    lation.

    {{Back to the other stuff later}}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Feb 11 20:49:24 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 14:04:59 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
    -----------------isolating---------------------------------

    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.

    Standard HV stuff. Although you may want to consider that
    the value in the function's BAR, must be a guest PA, not a
    machine PA.

    I need a "WHY" on this sentence before responding to the rest.

    The guest OS (or user application) is directly programming the addresses
    in the
    PCIe function DMA engine[*]. The point is to completely
    avoid the hypervisor from being involved in the I/O,
    so the guest programs the function DMA engine using guest PA and the
    IOMMU translates the guest PA into machine
    addresses for all transactions initiated by the
    function.

    (Each PCIe function is treated as an individual device
    by the OS/HV. The entire purpose of SR-IOV is to
    present the device directly to the guest to avoid any
    hypervisor involvment in the I/O path).

    This is the missing part:: When a user performs a LD or ST

    You're missing the point completely. There is no user
    performing a LD or ST. The DMA controller on the device
    is initiating the transaction, not the host CPU.

    the guest Virtual address is translated to Guest Physical
    by Guest OS translation tables. Guest Physical is then
    interpreted as Host virtual and translated a second time
    by {SM,HV} mapping tables. This nested MMU does both
    translations in a single access; so core TLB is organized
    to associate guest virtual directly with machine physical !
    as if there were a level crossing PTE providing all the
    "right bits".

    This is basically how all modern CPUs handle it, yes.

    But it is not relevant to the inbound traffic initiated
    -by the device- which can't be translated by the CPU,
    rather must be translated at some point between the
    PCIe controller and the internal processor interconnect
    (e.g. mesh).


    Now that the device has been configured, Guest OS decides
    to write some control registers of the device. Guest OS
    has its own translation tables for Guest Virtual to Guest
    Physical--but the core MMU then translates guest Physical
    to Machine Physical before it gets transported over the
    interconnect. So, by the time said address gets to Host-
    Bridge it is already in Machine Physical, not the Guest
    physical you mention.*

    My 66000 has no insight into the device; you can't know
    a priori which 64-bit write to the device contains
    a physical address. Particularly in all modern
    devices where there may be only one "control register"[*],
    the guest driver writes commands and s/g lists to one of
    several hundred queues in local DRAM, then signals
    the device to initiate a DMA operation to read the
    entry from the queue in DRAM (using a guest physical
    address). The CPU never sees that read,
    nor can you know a priori that a particular write
    from a device driver is a guest address that might need
    to be translated - the driver is writing the command
    to main memory and just poking the device to read the
    command directly; there is no way to associate that
    write with any particular device in the CPU.

    [*] The doorbell. There will generally be a few more
    to set global characteristics, configurations, etc,
    but they'll only be actively used during driver initialization
    and will likely not contain addresses of any form.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Feb 11 23:29:04 2025
    On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 14:04:59 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
    -----------------isolating---------------------------------

    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.

    Standard HV stuff. Although you may want to consider that
    the value in the function's BAR, must be a guest PA, not a
    machine PA.

    I need a "WHY" on this sentence before responding to the rest.

    The guest OS (or user application) is directly programming the addresses
    in the
    PCIe function DMA engine[*]. The point is to completely
    avoid the hypervisor from being involved in the I/O,
    so the guest programs the function DMA engine using guest PA and the
    IOMMU translates the guest PA into machine
    addresses for all transactions initiated by the
    function.

    (Each PCIe function is treated as an individual device
    by the OS/HV. The entire purpose of SR-IOV is to
    present the device directly to the guest to avoid any
    hypervisor involvment in the I/O path).

    This is the missing part:: When a user performs a LD or ST

    You're missing the point completely. There no user
    performing a LD or ST. The DMA controller on the device
    is initiating the transaction, not the host CPU.

    the guest Virtual address is translated to Guest Physical
    by Guest OS translation tables. Guest Physical is then
    interpreted as Host virtual and translated a second time
    by {SM,HV} mapping tables. This nested MMU does both
    translations in a single access; so core TLB is organized
    to associate guest virtual directly with machine physical !
    as if there were a level crossing PTE providing all the
    "right bits".

    This is basically how all modern CPUs handle it, yes.

    But it is not relevent to the inbound traffic initiated
    -by the device- which can't be translated by the CPU,
    rather must be translated at some point between the
    PCIe controller and the internal processor interconnect
    (e.g. mesh).

    I am tracing the path from user of device (core Guest OS).
    If I don't understand this "should be simple" path, I am
    too lost to continue from the device side looking towards
    memory.

    I can see how core can write Guest Physical Address into
    device.BAR using config space access (with appropriate
    MMU permissions).

    But, right now: I can't see how the appropriate bit pattern
    from core gets to HostBridge in MMI/O space and is recognized
    by matching device.BAR down the PCIe tree.

    core executes the following instruction::

    STH R7,[Rdevice,#controlreg]

    Rdevice has a Virtual Address bit pattern which after 1 level of
    translation matches the bit pattern put into device.BAR at config.

    #controlreg is the offset to the control reg.

    core TLB translates GVA to GPA; GPA contains the same bit pattern
    that was written into device.BAR.

    Since we use nested page tables, GPA is interpreted as Host
    virtual address, HVA is translated to machine PA before access
    leaves the confines of the core.

    Unless a range of bits in MPA == GPA, I can't see how the address
    can be routed (over the interconnect, into HostBridge,) and then on
    to the device and match the configured BAR bit pattern ?!?

    That is I can't see how device.BAR == MPA can match when it
    has HPA != GPA !!


    Now that the device has been configured, Guest OS decides
    to write some control registers of the device. Guest OS
    has its own translation tables for Guest Virtual to Guest
    Physical--but the core MMU then translates guest Physical
    to Machine Physical before it gets transported over the
    interconnect. So, by the time said address gets to Host-
    Bridge it is already in Machine Physical, not the Guest
    physical you mention.*

    My 66000 has no insight into the device; you can't know
    a priori which 64-bit write to the device contains
    a physical address. Particularly in all modern
    devices where there may be only one "control register"[*],
    the guest driver writes commands and s/g lists to one of
    several hundred queues in local DRAM, then signals
    the device to initiate a DMA operation to read the
    entry from the queue in DRAM (using a guest physical
    address). The CPU never sees that read,
    nor can you know a priori that a particular write
    from a device driver is a guest address that might need
    to be translated - the driver is writing the command
    to main memory and just poking the device to read the
    command directly; there is no way to associate that
    write with any particular device in the CPU.

    I have a touchy-feely knowledge of the above paragraph.
    And I am trying to process without anyone seeing any
    interconnect transaction it should not.

    But the core does see its own writes to control registers
    which are recognized by/at the Device.BAR set bit-pattern
    from config, as meaningful to this device.

    [*] The doorbell. There will generally be a few more
    to set global characteristics, configurations, etc,
    but they'll only be actively used during driver initialization
    and will likely not contain addresses of any form.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Feb 12 00:34:18 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 14:04:59 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 10 Feb 2025 20:18:04 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 20:32:39 +0000, Scott Lurndal wrote:
    -----------------isolating---------------------------------

    HyperVisor maintains a PTE to map guest physical addresses
    within an aperture to the page matching the device's BAR.

    Standard HV stuff. Although you may want to consider that
    the value in the function's BAR, must be a guest PA, not a
    machine PA.

    I need a "WHY" on this sentence before responding to the rest.

    The guest OS (or user application) is directly programming the addresses
    in the
    PCIe function DMA engine[*]. The point is to completely
    avoid the hypervisor from being involved in the I/O,
    so the guest programs the function DMA engine using guest PA and the
    IOMMU translates the guest PA into machine
    addresses for all transactions initiated by the
    function.

    (Each PCIe function is treated as an individual device
    by the OS/HV. The entire purpose of SR-IOV is to
    present the device directly to the guest to avoid any
    hypervisor involvment in the I/O path).

    This is the missing part:: When a user performs a LD or ST

    You're missing the point completely. There no user
    performing a LD or ST. The DMA controller on the device
    is initiating the transaction, not the host CPU.

    the guest Virtual address is translated to Guest Physical
    by Guest OS translation tables. Guest Physical is then
    interpreted as Host virtual and translated a second time
    by {SM,HV} mapping tables. This nested MMU does both
    translations in a single access; so core TLB is organized
    to associate guest virtual directly with machine physical !
    as if there were a level crossing PTE providing all the
    "right bits".

    This is basically how all modern CPUs handle it, yes.

    But it is not relevent to the inbound traffic initiated
    -by the device- which can't be translated by the CPU,
    rather must be translated at some point between the
    PCIe controller and the internal processor interconnect
    (e.g. mesh).

    I am tracing the path from user of device (core Guest OS).
    If I don't understand this "should be simple" path, I am
    too lost to continue from the device side looking towards
    memory.

    I can see how core can write Guest Physical Address into
    device.BAR using config space access (with appropriate
    MMU permissions).

    But, right now: I can't see how the appropriate bit pattern
    from core gets to HostBridge in MMI/O space and is recognized
    by matching device.BAR down the PCIe tree.

    The hardware is generally described to the operating system
    through tables provided by the firmware (supplied by the system
    builder). There are two different mechanisms in widespread use:
    ACPI and DeviceTree. The former, sponsored primarily by Microsoft
    and Intel, provides a standard mechanism for the hardware to
    describe itself to the operating system/hypervisor.

    An entry in those tables for a particular device will describe
    in standard terms the path from the memory/processor complex
    to the device. Say, for example, your CPU uses a high-speed
    mesh interface between cores. The outer ring on the mesh
    will connect to memory controllers and I/O bridges.

    Each point in the path is described by a table entry
    in the ACPI tables.

    Assume a simple config

    +=========+     +========+    +=======+
    | CPU     |     | HOST   |    | PCIe  |
    | Memory  |-----| Bridge |----| Ctlr  |------+---------------+-----------
    | complex |     |        |    |       |      |               |
    +=========+     +========+    +=======+    Func 0          Func 1
                                               BAR[0]=X        BAR[2]=Y

    The tables will have objects describing the bridge
    and the hierarchy below the bridge (in this case, there is one
    PCIe controller and a PCIe device with two functions).

    Each function is a distinct device with its own config space, including
    the standard 6 (32-bit) or 3 (64-bit) BARs.

    The firmware will "size" the BARs by writing all-ones to the BAR[*],
    then reading the value back, inverting it and adding 1. This results in
    the size, in bytes, required by the function for each BAR.

    [*] through the PCI configuration space, before the BARs are initialized.
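
    A minimal C sketch of that sizing step for a 32-bit memory BAR; the
    config-space accessors are hypothetical, and a 64-bit BAR repeats the
    probe across the upper half as well:

        #include <stdint.h>

        uint32_t pci_cfg_read32(uint16_t rid, uint16_t offset);
        void     pci_cfg_write32(uint16_t rid, uint16_t offset, uint32_t value);

        /* Returns the size in bytes this function requests for the BAR. */
        uint64_t bar_size(uint16_t rid, uint16_t bar_offset)
        {
            uint32_t saved = pci_cfg_read32(rid, bar_offset);

            pci_cfg_write32(rid, bar_offset, 0xFFFFFFFFu); /* write all-ones  */
            uint32_t probe = pci_cfg_read32(rid, bar_offset);
            pci_cfg_write32(rid, bar_offset, saved);       /* restore the BAR */

            probe &= ~0xFu;               /* mask off the memory-BAR type bits */
            return (uint64_t)~probe + 1;  /* invert and add 1                  */
        }

    Summing these per-function results over every BAR is what produces the
    bridge aperture described next.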

    The firmware will repeat that for each function discovered below the
    root complex bridge in the PCIe controller and sum the total.

    That sum will be associated with the bridge aperture, so that software
    knows to allocate all the BARs downstream of the bridge from the bridge
    range of physical addresses (assigned by the firmware).

    If there are multiple bridges in the path, rinse and repeat.

    The bridge entry in the ACPI tables will have been programmed with
    a base physical address and a size. The OS (Windows/Linux) will
    subsequently allocate from that range when storing addresses into
    the BARs.

    In the case of a guest, the Hypervisor provides the tables with
    guest physical addresses for the devices (but the bridge is
    transparent and virtual).

    The reason for all of this is that there is no standard mechanism
    for Linux or Windows or other commercial operating systems to
    discover and program the topology of the chip, the I/O bridges,
    the mesh, etc. The chip provider and platform abstract all
    that in the ACPI tables (or device tree) provided to the OS/HV.

    Intel, AMD and ARM all support ACPI, and ARM also supports device tree
    which is a simpler mechanism used more in embedded systems than
    general purpose computer systems.

    There are a lot of moving parts, most of which are hidden by
    the firmware, in routing non-coherent accesses outside the core mesh.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Feb 13 16:42:59 2025
    On Wed, 12 Feb 2025 0:34:18 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 14:04:59 +0000, Scott Lurndal wrote:
    ---------------------
    This is the missing part:: When a user performs a LD or ST

    You're missing the point completely. There no user
    performing a LD or ST. The DMA controller on the device
    is initiating the transaction, not the host CPU.

    the guest Virtual address is translated to Guest Physical
    by Guest OS translation tables. Guest Physical is then
    interpreted as Host virtual and translated a second time
    by {SM,HV} mapping tables. This nested MMU does both
    translations in a single access; so core TLB is organized
    to associate guest virtual directly with machine physical !
    as if there were a level crossing PTE providing all the
    "right bits".

    This is basically how all modern CPUs handle it, yes.

    But it is not relevent to the inbound traffic initiated
    -by the device- which can't be translated by the CPU,
    rather must be translated at some point between the
    PCIe controller and the internal processor interconnect
    (e.g. mesh).

    I am tracing the path from user of device (core Guest OS).
    If I don't understand this "should be simple" path, I am
    too lost to continue from the device side looking towards
    memory.

    I can see how core can write Guest Physical Address into
    device.BAR using config space access (with appropriate
    MMU permissions).

    But, right now: I can't see how the appropriate bit pattern
    from core gets to HostBridge in MMI/O space and is recognized
    by matching device.BAR down the PCIe tree.

    We will ignore this for a while.

    The hardware is generally described to the operating system
    through tables provided by the firmware (supplied by the system
    builder). There are two different mechanisms in widespread use;
    ACPI and DeviceTree. The former, sponsered primarily by Microsoft
    and Intel, provides a standard mechanism for the hardware to
    describe itself to the operating system/hypervisor.

    Side topic::
    Why not just spoof PCIe config space, and set up config space
    headers that cause Guest OS to load the paravirtualized driver ??
    instead of the direct device driver ?!?
    {{It has to be a speed/latency issue}}

    An entry in those tables for a particular device will describe
    in standard terms the path from the memory/processor complex
    to the device. Say, for example, your CPU uses a high-speed
    mesh interface between cores. The outer ring on the mesh
    will connect to memory controllers and I/O bridges.

    Each point in the path is described by a table entry
    in the ACPI tables.

    Assume a simple config

    +=========+     +========+    +=======+
    | CPU     |     | HOST   |    | PCIe  |
    | Memory  |-----| Bridge |----| Ctlr  |------+---------------+-----------
    | complex |     |        |    |       |      |               |
    +=========+     +========+    +=======+    Func 0          Func 1
                                               BAR[0]=X        BAR[2]=Y

    Now, consider a device with 16 virtual functions, and 16 Guest
    OSs. All 16 Guest OSs use the same Guest Physical Address in
    their virtual function BARs.

    How does the PCIe controller figure out that an MMI/O space
    sized write is to GuestOS[7] or GuestOS[14] ?? {{Obviously,
    once the command is placed in the routing tree, it will be
    matched by all the GuestOS BARs.}}

    Thus, there is still information missing for my understanding.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Feb 13 18:12:52 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 12 Feb 2025 0:34:18 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:

    But, right now: I can't see how the appropriate bit pattern
    from core gets to HostBridge in MMI/O space and is recognized
    by matching device.BAR down the PCIe tree.

    We will ignore this for a while.

    The hardware is generally described to the operating system
    through tables provided by the firmware (supplied by the system
    builder). There are two different mechanisms in widespread use;
    ACPI and DeviceTree. The former, sponsered primarily by Microsoft
    and Intel, provides a standard mechanism for the hardware to
    describe itself to the operating system/hypervisor.

    Side topic::
    Why not just spoof PCIe config space, and setup config space
    headers that cause Guest OS to load the paravirtualized driver ??
    instead of the direct device driver ?!?

    That was the approach prior to the PCI-SIG introducing
    SRIOV in the mid 2000s. Performance sucked.

    {{It has to be a speed/latency issue}}

    Throughput reduced and extra HV overhead.


    An entry in those tables for a particular device will describe
    in standard terms the path from the memory/processor complex
    to the device. Say, for example, your CPU uses a high-speed
    mesh interface between cores. The outer ring on the mesh
    will connect to memory controllers and I/O bridges.

    Each point in the path is described by a table entry
    in the ACPI tables.

    Assume a simple config

    +=========+     +========+    +=======+
    | CPU     |     | HOST   |    | PCIe  |
    | Memory  |-----| Bridge |----| Ctlr  |------+---------------+-----------
    | complex |     |        |    |       |      |               |
    +=========+     +========+    +=======+    Func 0          Func 1
                                               BAR[0]=X        BAR[2]=Y

    Now, consider a device with 16 virtual functions, and 16 Guest
    OSs. All 16 Guest OSs use the same Guest Physical Address in
    their virtual function BARs.

    Technically, there would be one physical function (owned
    by the hypervisor - with control CSRs apportioning resources
    between virtual functions) and up to 65535 virtual functions.

    The 'bus' and 'function' together make up a 16-bit routing
    id (RID) which is the target id field in the Config Read/Write TLP. The
    PF will generally have a function number between 0 and 7,
    (although with ARI, it can be any function number) and the
    VF routing IDs will start at some offset from the PF with
    a programmable stride (e.g. for a PF with 3 VFs):

    PF0 RID = 0
    VF1 RID = 8
    VF2 RID = 16
    VF3 RID = 24

    When the VF number exceeds 255, its RID will be function 0
    on the next higher bus number.

    As an endpoint function, the bus field in the RID will be the secondary
    bus of the Root Complex bridge (generally 1).

    So the PCI config space RID (BDF) for VF2 would be 0x0110.

    This RID (in combination with a PCIe controller index (segment
    in Intel terminology)) is used by the IOMMU to select the
    translation table to use for inbound DMA from this function.
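
    A small C sketch of the SR-IOV routing-ID arithmetic described above;
    First VF Offset and VF Stride come from the PF's SR-IOV capability,
    and vf_index is 1-based as in the example:

        #include <stdint.h>

        /* RID layout: bus in bits [15:8], device/function in bits [7:0]. */
        uint16_t vf_rid(uint16_t pf_rid, uint16_t first_vf_offset,
                        uint16_t vf_stride, uint16_t vf_index)
        {
            /* The sum may carry into the bus field, which is how a VF ends
               up as function 0 on the next higher bus number.             */
            return (uint16_t)(pf_rid + first_vf_offset
                              + (uint16_t)(vf_index - 1) * vf_stride);
        }

        /* e.g. vf_rid(0x0100, 8, 8, 2) == 0x0110, matching the VF2 example */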


    How does the PCIe controller figure out that an MMI/O space
    sized write is to GuestOS[7] or GuestOS[14] ?? {{Obviously,
    once the command is place in the routing tree, it will be
    matched by all the GuestOS BARs.}}

    The PCIe controller sends all MRW/MRD TLPs to the endpoint
    device, which matches them against ALL BAR registers (all PFs
    and all VFs) on the device to determine which function
    the memory read or memory write TLP is targeting.
    (Note that the VF BARs are actually fixed, the VF
    memory spaces are equal sized and contiguous, so
    the endpoint only needs to CAM on six BARs at most)
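
    A sketch of that address match for the fixed, equal-sized, contiguous
    VF BARs (illustrative only; a real endpoint does this in CAM/compare
    logic, not code):

        #include <stdint.h>

        struct vf_bar { uint64_t base, size; int enabled; uint32_t num_vfs; };

        /* Returns the VF index an inbound MRd/MWr TLP address hits,
           or -1 when nothing matches (write dropped / read returns UR). */
        int match_vf_bar(const struct vf_bar *b, uint64_t tlp_addr)
        {
            if (!b->enabled || tlp_addr < b->base)
                return -1;
            uint64_t vf = (tlp_addr - b->base) / b->size; /* equal, contiguous */
            return vf < b->num_vfs ? (int)vf : -1;
        }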

    It gets a bit more complicated when there is a PCIe
    switch on the device, likewise if there is a PCIe to PCI
    bridge on the endpoint (very unlikely nowadays).

    If the TLP address doesn't match any _enabled_ function
    BAR, a memory write (posted) will be dropped[*] and a memory read
    will return a UR (Unsupported Request) completion TLP.

    The real question is how does the CPU route the
    load or store request to the proper PCIe controller
    port, and that's the hard part, particularly to
    follow the proper PCIe transaction ordering rules.


    Thus, there is still information missing for my understanding.

    [*] it may post a RAS error in some implementation defined manner
    or via the Advanced Error Reporting (AER) PCI Express capability.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Feb 13 21:48:10 2025
    On Thu, 13 Feb 2025 18:12:52 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 12 Feb 2025 0:34:18 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:

    But, right now: I can't see how the appropriate bit pattern
    from core gets to HostBridge in MMI/O space and is recognized
    by matching device.BAR down the PCIe tree.

    We will ignore this for a while.

    The hardware is generally described to the operating system
    through tables provided by the firmware (supplied by the system
    builder). There are two different mechanisms in widespread use;
    ACPI and DeviceTree. The former, sponsered primarily by Microsoft
    and Intel, provides a standard mechanism for the hardware to
    describe itself to the operating system/hypervisor.

    Side topic::
    Why not just spoof PCIe config space, and setup config space
    headers that cause Guest OS to load the paravirtualized driver ??
    instead of the direct device driver ?!?

    That was the approach prior to the PCI-SIG introducing
    SRIOV in the mid 2000s. Performance sucked.

    {{It has to be a speed/latency issue}}

    Throughput reduced and extra HV overhead.


    An entry in those tables for a particular device will describe
    in standard terms the path from the memory/processor complex
    to the device. Say, for example, your CPU uses a high-speed
    mesh interface between cores. The outer ring on the mesh
    will connect to memory controllers and I/O bridges.

    Each point in the path is described by a table entry
    in the ACPI tables.

    Assume a simple config

    +=========+     +========+    +=======+
    | CPU     |     | HOST   |    | PCIe  |
    | Memory  |-----| Bridge |----| Ctlr  |------+---------------+-----------
    | complex |     |        |    |       |      |               |
    +=========+     +========+    +=======+    Func 0          Func 1
                                               BAR[0]=X        BAR[2]=Y

    Now, consider a device with 16 virtual functions, and 16 Guest
    OSs. All 16 Guest OSs use the same Guest Physical Address in
    their virtual function BARs.

    Technically, there would be one physical function (owned
    by the hypervisor - with control CSRs apportioning resources
    between virtual functions) and up to 65535 virtual functions.

    The 'bus' and 'function' together make up a 16-bit routing
    id (RID) which is the target id field in the Config Read/Write TLP. The
    PF will generally have a function number between 0 and 7,
    (although with ARI, it can be any function number) and the
    VF routing IDs will start at some offset from the PF with
    a programmable stride (e.g. for a PF with 3 VFs):

    PF0 RID = 0
    VF1 RID = 8
    VF2 RID = 16
    VF3 RID = 24

    When the VF number exceeds 255, it's RID will be function 0
    on the next higher bus number.

    As an endpoint function, the bus field in the RID will be the secondary
    bus of the Root Complex bridge (generally 1).

    So the PCI config space RID (BDF) for VF2 would be 0x0110.

    This RID (in combination with a PCIe controller index (segment
    in Intel terminology)) is used by the IOMMU to select the
    translation table to use for inbound DMA from this function.


    How does the PCIe controller figure out that an MMI/O space
    sized write is to GuestOS[7] or GuestOS[14] ?? {{Obviously,
    once the command is place in the routing tree, it will be
    matched by all the GuestOS BARs.}}

    The PCIe controller sends all MRW/MRD TLPs to the endpoint
    device, which matches them against ALL BAR registers (all PFs
    and all VFs) on the device to determine which function
    the memory read or memory write TLP is targeting.

    What is supposed to happen when more than 1 BAR matches
    the TLP address on any given bus??

    (Note that the VF BARs are actually fixed, the VF
    memory spaces are equal sized and contiguous, so
    the endpoint only needs to CAM on six BARs at most)

    It gets a bit more complicated when there is a PCIe
    switch on the device, likewise if there is a PCIe to PCI
    bridge on the endpoint (very unlikely nowadays).

    If the TLP address doesn't match any _enabled_ function
    BAR, a memory write (posted) will be dropped[*] and a memory read
    will return a UR (Unsupported Request) completion TLP.

    But what happens when more than 1 BAR matches the supplied address ??

    HW would typically have each matching BAR capture the data
    being written, or upon a read, read all the control registers
    and either AND them or OR them together (wired OR read-out
    bus). Neither of which is what SW will be expecting.

    The real question is how does the CPU route the
    load or store request to the proper PCIe controller
    port, and that's the hard part, particularly to
    follow the proper PCIe transaction ordering rules.


    Thus, there is still information missing for my understanding.

    [*] it may post a RAS error in some implementation defined manner
    or via the Advanced Error Reporting (AER) PCI Express capability.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Feb 13 22:23:36 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Feb 2025 18:12:52 +0000, Scott Lurndal wrote:


    How does the PCIe controller figure out that an MMI/O space
    sized write is to GuestOS[7] or GuestOS[14] ?? {{Obviously,
    once the command is place in the routing tree, it will be
    matched by all the GuestOS BARs.}}

    The PCIe controller sends all MRW/MRD TLPs to the endpoint
    device, which matches them against ALL BAR registers (all PFs
    and all VFs) on the device to determine which function
    the memory read or memory write TLP is targeting.

    What is supposed to happen when more than 1 BAR matches
    the TLP address on any given bus??

    It lets out the magic smoke.

    Actually, the results are indeterminate; such programming
    is a violation of the specification. It could match any
    of the matching bars, or none of them. Software certainly
    cannot rely on any particular behavior in that case and
    must not do that.


    (Note that the VF BARs are actually fixed, the VF
    memory spaces are equal sized and contiguous, so
    the endpoint only needs to CAM on six BARs at most)

    It gets a bit more complicated when there is a PCIe
    switch on the device, likewise if there is a PCIe to PCI
    bridge on the endpoint (very unlikely nowadays).

    If the TLP address doesn't match any _enabled_ function
    BAR, a memory write (posted) will be dropped[*] and a memory read
    will return a UR (Unsupported Request) completion TLP.

    But what happens when more than 1 BAR matches the supplied address ??

    HW would typically have each matching BAR capture the data
    being written, or upon a read, read all the control registers
    and either AND them or OR them together (wired OR read-out
    bus). Neither of which is what SW will be expecting.

    Indeed. Overlapping BARs downstream of the root
    complex are a programming bug.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Feb 7 20:32:39 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Feb 2025 13:57:51 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 6 Feb 2025 20:06:31 +0000, Stephen Fuld wrote:

    On 2/6/2025 10:51 AM, EricP wrote:
    MitchAlsup1 wrote:
    On Thu, 6 Feb 2025 16:41:45 +0000, EricP wrote:
    -------------------
    Not sure how this would work with device IO and DMA.
    Say a secure kernel that owns a disk drive with secrets that even the HV
    is not authorized to see (so HV operators don't need Top Secret
    clearance).
    The Hypervisor has to pass to a hardware device DMA access to a memory
    frame that it has no access to itself. How does one block the HV from
    setting the IOMMU to DMA the device's secrets into its own memory?

    Hmmm... something like: once a secure HV passes a physical frame address
    to a secure kernel then it cannot take it back, it can only ask that
    kernel for it back. Which means that the HV loses control of any
    core or IOMMU PTE's that map that frame until it is handed back.

    That would seem to imply that once an HV gives memory to a secure
    guest kernel that it can only page that guest with its permission.
    Hmmm...

    I am a little confused here. When you talk about IOMMU addresses, are
    you talking about memory addresses or disk addresses?

    I/O MMU does not see the device commands containing the sector on
    the disk to be accessed, Mostly, CPUs write directly to the CRs
    of the device to start a command, bypassing I/O MMU as raw data.

    That is indeed the case. The IOMMU is on the inbound path
    from the PCIe controller to the internal bus/mesh structure.

    Note that there is a translation on the outbound path from
    the host address space to the PCIe memory space - this is
    often 1:1, but need not be so. This translation happens
    in the PCIe controller when creating a TLP that contains
    an address before sending the TLP to the endpoint. Take

    Is there any reason this cannot happen in the core MMU ??

    How do you map the translation table to the device? Why
    would you wish to have the CPU translating I/O virtual
    addresses? The IOMMU tables are per device, and they
    can be configured to map the minimum amount of the address
    space (even updated per-I/O if desired) required to support
    the completion of an inbound DMA from the device.


    Guest OS uses a virtual device address given to it from HV.
    HV sets up the 2nd nesting of translation to translate this
    to "what HostBridge needs" to route commands to device control
    registers. The handoff can be done by spoofing config space
    or having HV simply hand Guest OS a list of devices it can
    discover/configure/use.

    The IOMMU only is involved in DMA transactions _initiated_ by
    the device, not by the CPUs. They're two completely different
    concepts.


    an AHCI controller, for example, where the only device
    BAR is 32-bits; if a host wants to map the AHCI controller
    at a 64-bit address, the controller needs to map that 64-bit
    address window into a 32-bit 3DW TLP to be sent to the endpoint
    function.

    This is one of the reasons My 66000 architecture has a unique
    MMI/O address space--you can setup a 32-bit BAR to put a
    page of control registers in 32-bit address space without
    conflict. {{If I understand correctly}} Core MMU, then,
    translates normal device virtual control register addresses
    such that the request is routed to where the device is looking
    {{which has 32 high order bits zero.}}

    Most systems have DRAM located at physical address zero, and
    a 4GB DRAM is pretty small these days. So you either need
    to make a hole in the DRAM or provide a mapping mechanism to
    map a 64-bit address into a 32-bit bar when sending TLPs
    to the AHCI controller.

    Systems that aren't intel compatible will designate a range
    of the 64-bit physical address space (near the top) and will
    map regions in that range to the 32-bit bar via translation
    registers in the PCIe controller.



    On the other hand--it would take a very big system indeed to
    overflow the 32-bit MMI/O space, although ECAM can access
    42-bit device CR MMI/O space.

    Leaving aside the small size of the legacy Intel I/O space
    (16-bit addresses), history seems to have favored single
    address space systems, so I suspect such a MMI/O space will
    not be favored by many.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Fri Feb 14 19:13:29 2025
    On Tue, 11 Feb 2025 23:29:04 +0000, MitchAlsup1 wrote:

    On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:


    This is basically how all modern CPUs handle it, yes.

    But it is not relevant to the inbound traffic initiated
    -by the device- which can't be translated by the CPU,
    rather must be translated at some point between the
    PCIe controller and the internal processor interconnect
    (e.g. mesh).

    I am tracing the path from user of device (core Guest OS).
    If I don't understand this "should be simple" path, I am
    too lost to continue from the device side looking towards
    memory.

    I can see how core can write Guest Physical Address into
    device.BAR using config space access (with appropriate
    MMU permissions).

    But, right now: I can't see how the appropriate bit pattern
    from core gets to HostBridge in MMI/O space and is recognized
    by matching device.BAR down the PCIe tree.

    core executes the following instruction::

    STH R7,[Rdevice,#controlreg]

    Rdevice has a Virtual Address bit pattern which after 1 level of
    translation matches the bit pattern put into device.BAR at config.

    #controlreg is the offset to the control reg.

    Update:

    I have figured out how to re-attach Guest Physical BAR back
    as MMI/O commands enter the top of a PCIe tree.

    Thanks to Scott Lurndal for being gentle with me.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Feb 14 19:51:44 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Feb 2025 23:29:04 +0000, MitchAlsup1 wrote:

    On Tue, 11 Feb 2025 20:49:24 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:


    This is basically how all modern CPUs handle it, yes.

    But it is not relevant to the inbound traffic initiated
    -by the device- which can't be translated by the CPU,
    rather must be translated at some point between the
    PCIe controller and the internal processor interconnect
    (e.g. mesh).

    I am tracing the path from user of device (core Guest OS).
    If I don't understand this "should be simple" path, I am
    too lost to continue from the device side looking towards
    memory.

    I can see how core can write Guest Physical Address into
    device.BAR using config space access (with appropriate
    MMU permissions).

    But, right now: I can't see how the appropriate bit pattern
    from core gets to HostBridge in MMI/O space and is recognized
    by matching device.BAR down the PCIe tree.

    core executes the following instruction::

    STH R7,[Rdevice,#controlreg]

    Rdevice has a Virtual Address bit pattern which after 1 level of
    translation matches the bit pattern put into device.BAR at config.

    #controlreg is the offset to the control reg.

    Update:

    I have figured out how to re-attach Guest Physical BAR back
    as MMI/O commands enter the top of a PCIe tree.

    Thanks to Scott Lurndal for being gentle with me.

    Here is an example topology from a Raptor Lake system:

    bus:dev.function (bus 0 is a traditional PCI bus)
    Region X is BAR X.

    These devices are all built into either the core or
    the PCH/southbridge.

    The first plug-in PCI card would reside on bus 4.

    Only intel systems provide or use the I/O port (legacy 8086) BARs.


    $ lspci -vvv | egrep "^[0-9]|Region "
    00:00.0 Host bridge: Intel Corporation Raptor Lake-S 8+12 - Host Bridge/DRAM Controller (rev 01)
    00:02.0 VGA compatible controller: Intel Corporation Raptor Lake-S GT1 [UHD Graphics 770] (rev 04) (prog-if 00 [VGA controller])
            Region 0: Memory at 6000000000 (64-bit, non-prefetchable) [size=16M]
            Region 2: Memory at 4000000000 (64-bit, prefetchable) [size=256M]
            Region 4: I/O ports at 5000 [size=64]
    00:04.0 Signal processing controller: Intel Corporation Raptor Lake Dynamic Platform and Thermal Framework Processor Participant (rev 01)
            Region 0: Memory at 6001100000 (64-bit, non-prefetchable) [size=128K]
    00:08.0 System peripheral: Intel Corporation GNA Scoring Accelerator module (rev 01)
            Region 0: Memory at 600113b000 (64-bit, non-prefetchable) [disabled] [size=4K]
    00:14.0 USB controller: Intel Corporation Alder Lake-S PCH USB 3.2 Gen 2x2 XHCI Controller (rev 11) (prog-if 30 [XHCI])
            Region 0: Memory at 6001120000 (64-bit, non-prefetchable) [size=64K]
    00:14.2 RAM memory: Intel Corporation Alder Lake-S PCH Shared SRAM (rev 11)
            Region 0: Memory at 6001134000 (64-bit, non-prefetchable) [disabled] [size=16K]
            Region 2: Memory at 600113a000 (64-bit, non-prefetchable) [disabled] [size=4K]
    00:16.0 Communication controller: Intel Corporation Alder Lake-S PCH HECI Controller #1 (rev 11)
            Region 0: Memory at 6001139000 (64-bit, non-prefetchable) [size=4K]
    00:17.0 SATA controller: Intel Corporation Alder Lake-S PCH SATA Controller [AHCI Mode] (rev 11) (prog-if 01 [AHCI 1.0])
            Region 0: Memory at 70700000 (32-bit, non-prefetchable) [size=8K]
            Region 1: Memory at 70704000 (32-bit, non-prefetchable) [size=256]
            Region 2: I/O ports at 5080 [size=8]
            Region 3: I/O ports at 5088 [size=4]
            Region 4: I/O ports at 5060 [size=32]
            Region 5: Memory at 70703000 (32-bit, non-prefetchable) [size=2K]
    00:1a.0 PCI bridge: Intel Corporation Alder Lake-S PCH PCI Express Root Port #25 (rev 11) (prog-if 00 [Normal decode])
    00:1c.0 PCI bridge: Intel Corporation Alder Lake-S PCH PCI Express Root Port #3 (rev 11) (prog-if 00 [Normal decode])
    00:1c.3 PCI bridge: Intel Corporation Device 7abb (rev 11) (prog-if 00 [Normal decode])
    00:1f.0 ISA bridge: Intel Corporation Device 7a86 (rev 11)
    00:1f.3 Audio device: Intel Corporation Alder Lake-S HD Audio Controller (rev 11)
            Region 0: Memory at 6001130000 (64-bit, non-prefetchable) [size=16K]
            Region 4: Memory at 6001000000 (64-bit, non-prefetchable) [size=1M]
    00:1f.4 SMBus: Intel Corporation Alder Lake-S PCH SMBus Controller (rev 11)
            Region 0: Memory at 6001138000 (64-bit, non-prefetchable) [size=256]
            Region 4: I/O ports at efa0 [size=32]
    00:1f.5 Serial bus controller: Intel Corporation Alder Lake-S PCH SPI Controller (rev 11)
            Region 0: Memory at 70702000 (32-bit, non-prefetchable) [size=4K]
    01:00.0 Non-Volatile memory controller: Sandisk Corp WD PC SN5000S M.2 2230 NVMe SSD (DRAM-less) (prog-if 02 [NVM Express])
            Region 0: Memory at 70600000 (64-bit, non-prefetchable) [size=16K]
    02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller (rev 1b)
            Region 0: I/O ports at 4000 [size=256]
            Region 2: Memory at 70504000 (64-bit, non-prefetchable) [size=4
            Region 4: Memory at 70500000 (64-bit, non-prefetchable) [size=16K]
    03:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8852BE PCIe 802.11ax Wireless Network Controller
            Region 0: I/O ports at 3000 [size=256]
            Region 2: Memory at 70400000 (64-bit, non-prefetchable) [size=1M]

    The traditional PCI bus supports a 5-bit device # and a 3-bit function #.
    A PCIe bus supports either a 3-bit function number with the high-order
    five bits MBZ (i.e. only device 0 is legal on PCIe) or, if the function
    advertises the Alternative Routing-ID Interpretation (ARI) capability,
    an 8-bit function number.  (SRIOV leverages ARI to support dense routing
    IDs, but any bus that supports ARI can handle 256 physical functions.)

    PCI supports two forms of configuration space addresses in TLPs:
    Type 0 contains only a function number and the register address.
    Type 1 contains the bus number, function number and register address.

    A PCI-PCI bridge (such as the root complex port bridge) will translate
    a type 1 transaction to type 0 when the target RID is on the
    configured secondary bus, or forward the type 1 transaction to
    a subordinate bus bridge. With ARI, the upstream bridge
    from the endpoint needs to be configured as ARI enabled so that
    it forwards type 1 transactions to the secondary bus, as the
    SRIOV Routing IDs can extend into the 8-bit bus space (allowing
    up to 65535 virtual functions associated with a single PF).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Fri Feb 14 21:50:13 2025
    On Fri, 14 Feb 2025 19:51:44 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    -------------
    Update:

    I have figured out how to re-attach Guest Physical BAR back
    as MMI/O commands enter the top of a PCIe tree.

    Thanks to Scott Lurndal for being gentle with me.

    Here is an example topology from a Raptor Lake system:

    bus:dev.function (bus 0 is a traditional PCI bus)
    Region X is BAR X.

    These devices are all built into either the core or
    the PCH/southbridge.

    The first plug-in PCI card would reside on bus 4.

    Only intel systems provide or use the I/O port (legacy 8086) BARs.

    It is going to take me some time to dig through this.

    $ lspci -vvv | egrep "^[0-9]|Region "
    00:00.0 Host bridge: Intel Corporation Raptor Lake-S 8+12 - Host
    Bridge/DRAM Controller (rev 01)
    00:02.0 VGA compatible controller: Intel Corporation Raptor Lake-S GT1
    [UHD Graphics 770] (rev 04) (prog-if 00 [VGA controller])
    Region 0: Memory at 6000000000 (64-bit, non-prefetchable)
    [size=16M]
    Region 2: Memory at 4000000000 (64-bit, prefetchable)
    [size=256M]
    Region 4: I/O ports at 5000 [size=64]
    00:04.0 Signal processing controller: Intel Corporation Raptor Lake
    Dynamic Platform and Thermal Framework Processor Participant (rev 01)
    Region 0: Memory at 6001100000 (64-bit, non-prefetchable)
    [size=128K]
    00:08.0 System peripheral: Intel Corporation GNA Scoring Accelerator
    module (rev 01)
    Region 0: Memory at 600113b000 (64-bit, non-prefetchable)
    [disabled] [size=4K]
    00:14.0 USB controller: Intel Corporation Alder Lake-S PCH USB 3.2 Gen
    2x2 XHCI Controller (rev 11) (prog-if 30 [XHCI])
    Region 0: Memory at 6001120000 (64-bit, non-prefetchable)
    [size=64K]
    00:14.2 RAM memory: Intel Corporation Alder Lake-S PCH Shared SRAM (rev
    11)
    Region 0: Memory at 6001134000 (64-bit, non-prefetchable)
    [disabled] [size=16K]
    Region 2: Memory at 600113a000 (64-bit, non-prefetchable)
    [disabled] [size=4K]
    00:16.0 Communication controller: Intel Corporation Alder Lake-S PCH
    HECI Controller #1 (rev 11)
    Region 0: Memory at 6001139000 (64-bit, non-prefetchable)
    [size=4K]
    00:17.0 SATA controller: Intel Corporation Alder Lake-S PCH SATA
    Controller [AHCI Mode] (rev 11) (prog-if 01 [AHCI 1.0])
    Region 0: Memory at 70700000 (32-bit, non-prefetchable)
    [size=8K]
    Region 1: Memory at 70704000 (32-bit, non-prefetchable)
    [size=256]
    Region 2: I/O ports at 5080 [size=8]
    Region 3: I/O ports at 5088 [size=4]
    Region 4: I/O ports at 5060 [size=32]
    Region 5: Memory at 70703000 (32-bit, non-prefetchable)
    [size=2K]
    00:1a.0 PCI bridge: Intel Corporation Alder Lake-S PCH PCI Express Root
    Port #25 (rev 11) (prog-if 00 [Normal decode])
    00:1c.0 PCI bridge: Intel Corporation Alder Lake-S PCH PCI Express Root
    Port #3 (rev 11) (prog-if 00 [Normal decode])
    00:1c.3 PCI bridge: Intel Corporation Device 7abb (rev 11) (prog-if 00 [Normal decode])
    00:1f.0 ISA bridge: Intel Corporation Device 7a86 (rev 11)
    00:1f.3 Audio device: Intel Corporation Alder Lake-S HD Audio Controller
    (rev 11)
    Region 0: Memory at 6001130000 (64-bit, non-prefetchable)
    [size=16K]
    Region 4: Memory at 6001000000 (64-bit, non-prefetchable)
    [size=1M]
    00:1f.4 SMBus: Intel Corporation Alder Lake-S PCH SMBus Controller (rev
    11)
    Region 0: Memory at 6001138000 (64-bit, non-prefetchable)
    [size=256]
    Region 4: I/O ports at efa0 [size=32]
    00:1f.5 Serial bus controller: Intel Corporation Alder Lake-S PCH SPI Controller (rev 11)
    Region 0: Memory at 70702000 (32-bit, non-prefetchable)
    [size=4K]
    01:00.0 Non-Volatile memory controller: Sandisk Corp WD PC SN5000S M.2
    2230 NVMe SSD (DRAM-less) (prog-if 02 [NVM Express])
    Region 0: Memory at 70600000 (64-bit, non-prefetchable)
    [size=16K]
    02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller (rev 1b)
    Region 0: I/O ports at 4000 [size=256]
    Region 2: Memory at 70504000 (64-bit, non-prefetchable) [size=4
    Region 4: Memory at 70500000 (64-bit, non-prefetchable)
    [size=16K]
    03:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8852BE
    PCIe 802.11ax Wireless Network Controller
    Region 0: I/O ports at 3000 [size=256]
    Region 2: Memory at 70400000 (64-bit, non-prefetchable)
    [size=1M]

    The traditional PCI bus supports a 5-bit device # and a 3-bit function #.
    A PCIe bus supports either a 3-bit function number with the high-order
    five bits MBZ (i.e. only device 0 is legal on PCIe) or, if the function
    advertises the Alternative Routing-ID Interpretation (ARI) capability,
    an 8-bit function number.  (SRIOV leverages ARI to support dense routing
    IDs, but any bus that supports ARI can handle 256 physical functions.)

    PCI supports two forms of configuration space addresses in TLPs:
    Type 0 contains only a function number and the register address.
    Type 1 contains the bus number, function number and register address.

    PCIe segments go where ? Are they "picked off" prior to being routed
    down the tree ??

    A PCI-PCI bridge (such as the root complex port bridge) will translate
    a type 1 transaction to type 0 when the target RID is on the
    configured secondary bus, or forward the type 1 transaction to
    a subordinate bus bridge. With ARI, the upstream bridge
    from the endpoint needs to be configured as ARI enabled so that
    it forwards type 1 transactions to the secondary bus, as the
    SRIOV Routing IDs can extend into the 8-bit bus space (allowing
    up to 65535 virtual functions associated with a single PF).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)